PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html
An initial implementation was merged in https://github.com/pandas-dev/pandas/pull/46958/ (with the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in https://github.com/pandas-dev/pandas/issues/36195).
In https://github.com/pandas-dev/pandas/issues/36195#issuecomment-1137706149 I mentioned some next steps that are still needed; moving this to a new issue.
Implementation
Complete the API surface:
- [x] Use the lazy copy (with CoW) mechanism in more methods where it can be used. Overview issue at https://github.com/pandas-dev/pandas/issues/49473
- [ ] What to do with the existing `copy` keyword?
- [x] https://github.com/pandas-dev/pandas/issues/50535 -> https://github.com/pandas-dev/pandas/pull/51464
- [ ] https://github.com/pandas-dev/pandas/issues/56022
- [x] Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW). Although, what about creating from a numpy array?
- [x] https://github.com/pandas-dev/pandas/pull/49524
- [x] https://github.com/pandas-dev/pandas/pull/51239
- [x] https://github.com/pandas-dev/pandas/pull/50777
- [x] https://github.com/pandas-dev/pandas/issues/50776
- [x] https://github.com/pandas-dev/pandas/pull/51731
- [x] https://github.com/pandas-dev/pandas/pull/52022
- [x] Constructing DataFrame/Series from Index object (also keep track of references): https://github.com/pandas-dev/pandas/pull/52276 and https://github.com/pandas-dev/pandas/pull/52008 and https://github.com/pandas-dev/pandas/pull/52947
- [ ] Explore / update the APIs that return numpy arrays (`.values`, `to_numpy()`). Potential idea is to make the returned array read-only by default.
- [x] https://github.com/pandas-dev/pandas/pull/51082
- [x] https://github.com/pandas-dev/pandas/pull/53704
- [ ] We need to do the same for EAs? (now only for numpy arrays) -> https://github.com/pandas-dev/pandas/pull/61925
- [x] Warning / error for chained setitem that no longer works -> https://github.com/pandas-dev/pandas/pull/49467
- [ ] https://github.com/pandas-dev/pandas/issues/51315
- [ ] https://github.com/pandas-dev/pandas/issues/56456
- [x] Add the same warning/error for inplace methods that are called in a chained context (eg `df['a'].fillna(.., inplace=True)`)
- [x] https://github.com/pandas-dev/pandas/pull/53779
- [x] https://github.com/pandas-dev/pandas/pull/54024
- [x] https://github.com/pandas-dev/pandas/pull/54023
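
To make three of the items above concrete (constructors, read-only numpy arrays, chained setitem), here is a minimal sketch of the behaviour they describe. It assumes pandas >= 2.0 with the `mode.copy_on_write` option available; the toy DataFrame and the variable names are only for illustration.

```python
import warnings

import pandas as pd

# Assumption: pandas >= 2.0. On newer versions CoW may already be the
# default and the option may be deprecated, so a failure here is ignored.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Constructors: a Series built from an existing Series behaves as a copy,
# so writing to it does not propagate back to the parent DataFrame.
s2 = pd.Series(df["a"])
s2.iloc[0] = 100
parent_unchanged = df["a"].iloc[0] == 1

# .values / to_numpy(): the returned array is a read-only view.
arr = df["a"].to_numpy()
array_is_read_only = not arr.flags.writeable

# Chained setitem: the write goes to a temporary object, so df is not
# updated (pandas warns about this; the warning is silenced here).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df["a"][0] = -1
chained_had_no_effect = df["a"].iloc[0] == 1
```

If a writable array is really needed, taking an explicit copy (`arr.copy()`) is the safe route, since it cannot alias the DataFrame's data.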
Improve the performance
- [ ] Optimize setitem operations to prevent copies of whole blocks (eg splitting the block could help keep a view for all other columns, so that we only take a copy of the columns that are modified)
- [x] https://github.com/pandas-dev/pandas/pull/51031
- [ ] Check overall performance impact (eg run asv with / without CoW enabled by default and see the difference)
Provide upgrade path:
- [x] Add a warning mode that gives deprecation warnings for all cases where the current behaviour would change (initially also behind an option): https://github.com/pandas-dev/pandas/issues/56019
- [ ] We can also update the message of the existing SettingWithCopyWarnings to point users towards enabling CoW as a way to get rid of the warnings
- [ ] Add a general FutureWarning "on first use that would change" that is only raised a single time
Documentation / feedback
Aside from finalizing the implementation, we also need to start documenting this, and it will be super useful to have people give this a try, run their code or test suites with it, etc, so we can iron out bugs / missing warnings / or discover unexpected consequences that need to be addressed/discussed.
- [ ] Document this new feature (how it works, how you can test it)
- [x] We can still add a note to the 1.5 whatsnew linking to those docs
- [ ] Write a set of blogposts on the topic
- [ ] Gather feedback from users / downstream packages
- [ ] Update existing documentation:
- [ ] https://github.com/pandas-dev/pandas/pull/56162
- [ ] Write an upgrade guide
Some remaining aspects of the API to figure out:
- [x] What to do with the `Series.view()` method -> is deprecated
- [x] Let `head()`/`tail()` return eager copies? (to avoid that using those methods for exploration triggers CoW) -> https://github.com/pandas-dev/pandas/pull/54011
Comment From: bashtage
> - [ ] Add the same warning/error for inplace methods that are called in a chained context (eg `df['a'].fillna(.., inplace=True)`)
This one just caught me out in statsmodels. Seems like it is hard to get high-performance in place filling on a column-by-column basis with CoW. Is this correct?
The code that caused the issue was `self.data[col].fillna(imp, inplace=True)`, which is pretty standard IME.
I replaced it with `self.data[col] = self.data[col].fillna(imp)`.
Is there a better way when using CoW?
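
For reference, a minimal sketch of the difference between the two patterns above, assuming pandas >= 2.0 with CoW enabled. The names `data`, `col` and `imp` are stand-ins for the statsmodels attributes and are hypothetical here.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: pandas >= 2.0; ignore failures where the option is
# deprecated or no longer needed because CoW is already the default.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

data = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
col, imp = "x", 0.0  # hypothetical column name and fill value

# Chained inplace call: data[col] is a temporary Series, so under CoW the
# fill happens on that temporary and the parent keeps its missing value.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    data[col].fillna(imp, inplace=True)
still_missing = int(data[col].isna().sum())

# Reassignment: the filled column is written back explicitly.
data[col] = data[col].fillna(imp)
missing_after_reassign = int(data[col].isna().sum())
```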
Comment From: bashtage
Ok, so a better way is `self.data.fillna(imp_values, inplace=True)` where `imp_values` is `dict[col, fill_val]`.
Comment From: jorisvandenbossche
Yes, that's a good question. In general, the answer is always to do it through a method on the DataFrame directly, and indeed in this case `DataFrame.fillna` allows this with a dict, as you mention.
When you want to do this for a single column, going through the DataFrame method is a bit less obvious I think, so maybe we could hint at doing that if we add a warning for this.
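
A minimal sketch of both variants, with a made-up DataFrame and fill values: the dict form for several columns, and the same form with a one-entry dict for a single column.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"x": [1.0, np.nan], "y": [np.nan, 2.0], "z": [np.nan, 5.0]}
)

# Several columns at once: the dict maps column name -> fill value.
imp_values = {"x": 0.0, "y": -1.0}
data.fillna(imp_values, inplace=True)
z_missing_after_first_call = int(data["z"].isna().sum())

# A single column: a one-entry dict, still called on the DataFrame itself,
# so there is no chained temporary involved.
data.fillna({"z": 9.0}, inplace=True)
```

Columns that are not keys of the dict are left untouched, which is why `z` still has its missing value after the first call.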