PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html

An initial implementation was merged in https://github.com/pandas-dev/pandas/pull/46958/ (with the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in https://github.com/pandas-dev/pandas/issues/36195).

In https://github.com/pandas-dev/pandas/issues/36195#issuecomment-1137706149 I mentioned some next steps that are still needed; moving this to a new issue.

Implementation

Complete the API surface:

  • [x] Use the lazy copy (with CoW) mechanism in more methods where it can be used. Overview issue at https://github.com/pandas-dev/pandas/issues/49473
  • [ ] What to do with the existing copy keyword?
  • [x] https://github.com/pandas-dev/pandas/issues/50535 -> https://github.com/pandas-dev/pandas/pull/51464
  • [ ] https://github.com/pandas-dev/pandas/issues/56022
  • [x] Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW). Although, what about creating from a numpy array?
  • [x] https://github.com/pandas-dev/pandas/pull/49524
  • [x] https://github.com/pandas-dev/pandas/pull/51239
  • [x] https://github.com/pandas-dev/pandas/pull/50777
  • [x] https://github.com/pandas-dev/pandas/issues/50776
    • [x] https://github.com/pandas-dev/pandas/pull/51731
    • [x] https://github.com/pandas-dev/pandas/pull/52022
  • [x] Constructing DataFrame/Series from Index object (also keep track of references): https://github.com/pandas-dev/pandas/pull/52276 and https://github.com/pandas-dev/pandas/pull/52008 and https://github.com/pandas-dev/pandas/pull/52947
  • [ ] Explore / update the APIs that return numpy arrays (.values, to_numpy()). Potential idea is to make the returned array read-only by default.
  • [x] https://github.com/pandas-dev/pandas/pull/51082
  • [x] https://github.com/pandas-dev/pandas/pull/53704
  • [ ] We need to do the same for EAs? (now only for numpy arrays) -> https://github.com/pandas-dev/pandas/pull/61925
  • [x] Warning / error for chained setitem that no longer works -> https://github.com/pandas-dev/pandas/pull/49467
  • [ ] https://github.com/pandas-dev/pandas/issues/51315
  • [ ] https://github.com/pandas-dev/pandas/issues/56456
  • [x] Add the same warning/error for inplace methods that are called in a chained context (eg df['a'].fillna(.., inplace=True))
    • [x] https://github.com/pandas-dev/pandas/pull/53779
    • [x] https://github.com/pandas-dev/pandas/pull/54024
    • [x] https://github.com/pandas-dev/pandas/pull/54023
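
As a rough illustration of the chained setitem and chained inplace items above, the sketch below shows both patterns leaving the parent DataFrame untouched once CoW is enabled. It assumes pandas 2.x or later (CoW is opt-in through the mode.copy_on_write option in 2.x). The frame and its values are made up for the example, and warnings are silenced only to keep the demo output clean.

```python
import warnings

import numpy as np
import pandas as pd

with warnings.catch_warnings():
    # Silence the chained-assignment warnings (and any option deprecation
    # warning on newer pandas) for this demo only.
    warnings.simplefilter("ignore")

    # Assumption: pandas 2.x or later, where this option exists.
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

        # Chained setitem: df["a"] is a temporary Series, so under CoW the
        # write goes to a copy and never reaches df.
        df["a"][0] = 100.0
        first_after_chained_setitem = df["a"].iloc[0]

        # Chained inplace method: same problem, the fill happens on the
        # temporary Series only.
        df["a"].fillna(0.0, inplace=True)
        nans_after_chained_fillna = int(df["a"].isna().sum())

        # Single-step alternatives that do update df.
        df.loc[0, "a"] = 100.0
        df["a"] = df["a"].fillna(0.0)
        final_a = df["a"].tolist()
```

With CoW enabled, the two chained statements are effectively no-ops on df, which is why a warning or error for them is useful.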

Improve the performance

  • [ ] Optimize setitem operations to prevent copies of whole blocks (eg splitting the block could help keep a view for all other columns, so that we only take a copy for the columns that are modified)
  • [x] https://github.com/pandas-dev/pandas/pull/51031
  • [ ] Check overall performance impact (eg run asv with / without CoW enabled by default and see the difference)

Provide upgrade path:

  • [x] Add a warning mode that gives deprecation warnings for all cases where the current behaviour would change (initially also behind an option): https://github.com/pandas-dev/pandas/issues/56019
  • [ ] We can also update the message of the existing SettingWithCopyWarnings to point users towards enabling CoW as a way to get rid of the warnings
  • [ ] Add a general FutureWarning "on first use that would change" that is only raised a single time

Documentation / feedback

Aside from finalizing the implementation, we also need to start documenting this, and it will be super useful to have people give this a try, run their code or test suites with it, etc, so we can iron out bugs / missing warnings / or discover unexpected consequences that need to be addressed/discussed.

  • [ ] Document this new feature (how it works, how you can test it)
  • [x] We can still add a note to the 1.5 whatsnew linking to those docs
  • [ ] Write a set of blogposts on the topic
  • [ ] Gather feedback from users / downstream packages
  • [ ] Update existing documentation:
  • [ ] https://github.com/pandas-dev/pandas/pull/56162
  • [ ] Write an upgrade guide

Some remaining aspects of the API to figure out:

  • [x] What to do with the Series.view() method -> is deprecated
  • [x] Let head()/tail() return eager copies? (to avoid that using those methods for exploration triggers CoW) -> https://github.com/pandas-dev/pandas/pull/54011

Comment From: bashtage

  • [ ] Add the same warning/error for inplace methods that are called in a chained context (eg df['a'].fillna(.., inplace=True))

This one just caught me out in statsmodels. It seems hard to get high-performance in-place filling on a column-by-column basis with CoW. Is this correct?

The code that caused the issue was

self.data[col].fillna(imp, inplace=True)

which is pretty standard IME.

I replaced it with

self.data[col] = self.data[col].fillna(imp)

Is there a better way when using CoW?

Comment From: bashtage

Ok, so a better way is self.data.fillna(imp_values, inplace=True) where imp_values is dict[col, fill_val]
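
A minimal sketch of that dict-based call, for reference. The column names and fill values here are invented for illustration; only DataFrame.fillna with a dict and inplace=True comes from the comment above.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for self.data.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})

# One fill value per column, keyed by column name.
imp_values = {"x": 0.0, "y": -1.0}

# A single call on the DataFrame itself, so there is no intermediate
# Series and nothing is chained.
data.fillna(imp_values, inplace=True)
```

Because the method is called on the DataFrame directly, this works the same with or without CoW.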

Comment From: jorisvandenbossche

Yes, that's a good question. In general, the answer is always to do it through a method on the DataFrame directly, and indeed in this case DataFrame.fillna allows this with a dict, as you mention. When you want to do this for a single column, going through the DataFrame method is a bit less obvious I think, so maybe we could hint at doing that if we add a warning for this.
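
For the single-column case, one possible way to go through the DataFrame method is a one-entry dict. The sketch below is only an illustration: fill_column is a hypothetical helper name, not a pandas API, and the data is made up.

```python
import numpy as np
import pandas as pd


def fill_column(df, col, value):
    """Fill missing values in a single column via the DataFrame method.

    Hypothetical helper for illustration; it wraps DataFrame.fillna with
    a one-entry dict so that no intermediate Series is modified.
    """
    df.fillna({col: value}, inplace=True)


data = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})
fill_column(data, "x", 0.0)  # only "x" is filled, "y" keeps its NaN
```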