Pandas Copy-on-Write (PDEP-7) follow-up overview issue

PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html

An initial implementation was merged in https://github.com/pandas-dev/pandas/pull/46958/ (with the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in https://github.com/pandas-dev/pandas/issues/36195).

In https://github.com/pandas-dev/pandas/issues/36195#issuecomment-1137706149 I mentioned some next steps that are still needed; moving this to a new issue.

Implementation

Complete the API surface:

[x] Use the lazy copy (with CoW) mechanism in more methods where it can be used. Overview issue at https://github.com/pandas-dev/pandas/issues/49473
[ ] What to do with the existing copy keyword?
[x] https://github.com/pandas-dev/pandas/issues/50535 -> https://github.com/pandas-dev/pandas/pull/51464
[ ] https://github.com/pandas-dev/pandas/issues/56022
[x] Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW). Although, what about creating from a numpy array?
[x] https://github.com/pandas-dev/pandas/pull/49524
[x] https://github.com/pandas-dev/pandas/pull/51239
[x] https://github.com/pandas-dev/pandas/pull/50777
[x] https://github.com/pandas-dev/pandas/issues/50776
- [x] https://github.com/pandas-dev/pandas/pull/51731
- [x] https://github.com/pandas-dev/pandas/pull/52022
[x] Constructing DataFrame/Series from Index object (also keep track of references): https://github.com/pandas-dev/pandas/pull/52276 and https://github.com/pandas-dev/pandas/pull/52008 and https://github.com/pandas-dev/pandas/pull/52947
[ ] Explore / update the APIs that return numpy arrays (.values, to_numpy()). Potential idea is to make the returned array read-only by default.
[x] https://github.com/pandas-dev/pandas/pull/51082
[x] https://github.com/pandas-dev/pandas/pull/53704
[ ] We need to do the same for EAs? (now only for numpy arrays) -> https://github.com/pandas-dev/pandas/pull/61925
[x] Warning / error for chained setitem that no longer works -> https://github.com/pandas-dev/pandas/pull/49467
[ ] https://github.com/pandas-dev/pandas/issues/51315
[ ] https://github.com/pandas-dev/pandas/issues/56456
[x] Add the same warning/error for inplace methods that are called in a chained context (eg df['a'].fillna(.., inplace=True))
- [x] https://github.com/pandas-dev/pandas/pull/53779
- [x] https://github.com/pandas-dev/pandas/pull/54024
- [x] https://github.com/pandas-dev/pandas/pull/54023
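
To make the items above concrete, here is a minimal sketch of the behaviour they aim for. It assumes a recent pandas 2.x release with the `mode.copy_on_write` option enabled, or pandas 3.x where CoW is the default; the example frame and its column names are made up for illustration.

```python
import warnings

import pandas as pd

# Assumption: on pandas 2.x this opts in to CoW. On newer versions CoW is
# already the default, and setting the option may only warn (or not exist).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except pd.errors.OptionError:
        pass

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Constructing from an existing DataFrame behaves as a (lazy) copy:
# modifying the new object does not propagate to the original.
df2 = pd.DataFrame(df)
df2.iloc[0, 0] = 100
print(df.loc[0, "a"], df2.loc[0, "a"])

# Arrays handed out as views are read-only, so writing into them
# cannot silently bypass the CoW bookkeeping.
arr = df["a"].to_numpy()
print(arr.flags.writeable)

# Chained setitem only updates a temporary Series and never reaches `df`.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer versions emit a warning here
    df["a"].iloc[0] = -1
print(df.loc[0, "a"])
```

If a writeable array is needed, taking an explicit copy (for example `df["a"].to_numpy().copy()`) sidesteps the read-only flag without touching the parent object.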

Improve the performance

[ ] Optimize setitem operations to prevent copies of whole blocks (eg splitting the block could help keep a view for all other columns, so that we only take a copy of the columns that are modified)
[x] https://github.com/pandas-dev/pandas/pull/51031
[ ] Check overall performance impact (eg run asv with / without CoW enabled by default and see the difference)
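
A rough way to observe when a copy is actually triggered is to check memory sharing with `np.shares_memory`. The sketch below assumes pandas 2.x with CoW enabled (or pandas 3.x); the helper `shares` and the example frame are hypothetical names used only for illustration.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: opt in to CoW on pandas 2.x; harmless where it is the default.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except pd.errors.OptionError:
        pass

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def shares(left, right, col):
    """Return True if column `col` of both frames is backed by the same memory."""
    return np.shares_memory(left[col].to_numpy(), right[col].to_numpy())


# A method that uses the lazy copy mechanism returns a new object
# that still shares its data with the parent.
df2 = df.reset_index(drop=True)
shared_before = shares(df, df2, "a")

# The first write to df2 triggers the copy of the modified data.
df2.iloc[0, 0] = 100
shared_after = shares(df, df2, "a")

# Whether the untouched column still shares memory depends on how the
# setitem path handles blocks in the installed version, so it is only printed.
untouched_still_shared = shares(df, df2, "b")

print(shared_before, shared_after, untouched_still_shared)
```

The last value is the one the block-splitting item is concerned with: ideally only the modified column is copied and the others remain views.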

Provide upgrade path:

[x] Add a warning mode that gives deprecation warnings for all cases where the current behaviour would change (initially also behind an option): https://github.com/pandas-dev/pandas/issues/56019
[ ] We can also update the message of the existing SettingWithCopyWarnings to point users towards enabling CoW as a way to get rid of the warnings
[ ] Add a general FutureWarning "on first use that would change" that is only raised a single time

Documentation / feedback

Aside from finalizing the implementation, we also need to start documenting this. It will also be very useful to have people give this a try and run their code or test suites with it, so we can iron out bugs and missing warnings, and discover unexpected consequences that need to be addressed or discussed.

[ ] Document this new feature (how it works, how you can test it)
[x] We can still add a note to the 1.5 whatsnew linking to those docs
[ ] Write a set of blogposts on the topic
[ ] Gather feedback from users / downstream packages
[ ] Update existing documentation:
[ ] https://github.com/pandas-dev/pandas/pull/56162
[ ] Write an upgrade guide

Some remaining aspects of the API to figure out:

[x] What to do with the Series.view() method -> is deprecated
[x] Let head()/tail() return eager copies? (to avoid that using those methods for exploration triggers CoW) -> https://github.com/pandas-dev/pandas/pull/54011

Comment From: bashtage

[ ] Add the same warning/error for inplace methods that are called in a chained context (eg df['a'].fillna(.., inplace=True))

This one just caught me out in statsmodels. It seems like it is hard to get high-performance in-place filling on a column-by-column basis with CoW. Is this correct?

The code that caused the issue was

self.data[col].fillna(imp, inplace=True)

which is pretty standard IME.

I replaced it with

self.data[col] = self.data[col].fillna(imp)

Is there a better way when using CoW?

Comment From: bashtage

Ok, so a better way is self.data.fillna(imp_values, inplace=True) where imp_values is dict[col, fill_val]

Comment From: jorisvandenbossche

Yes, that's a good question. In general, the answer is to do it through a method on the DataFrame directly, and indeed in this case DataFrame.fillna allows this with a dict, as you mention. When you want to do this for a single column, going through the DataFrame method is a bit less obvious, I think, so maybe we could hint at doing that if we add a warning for this.
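
For reference, the three variants discussed in this thread can be compared side by side. This is only a sketch, assuming pandas 2.x with CoW enabled (or pandas 3.x); the example frame and the fill values in `imp_values` are made up.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: opt in to CoW on pandas 2.x; harmless where it is the default.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except pd.errors.OptionError:
        pass

data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})
imp_values = {"x": 0.0, "y": -1.0}  # hypothetical per-column fill values

# 1. Chained inplace fillna: under CoW this fills a temporary Series,
#    so the parent DataFrame keeps its missing values.
chained = data.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer versions warn about this pattern
    chained["x"].fillna(imp_values["x"], inplace=True)

# 2. Column-by-column reassignment, as in the first workaround.
by_column = data.copy()
for col, imp in imp_values.items():
    by_column[col] = by_column[col].fillna(imp)

# 3. A single call on the DataFrame with a dict of fill values.
in_one_call = data.copy()
in_one_call.fillna(imp_values, inplace=True)

print(chained["x"].isna().sum())
print(by_column.equals(in_one_call))
```

For a single column, passing a one-entry dict (for example `data.fillna({"x": 0.0}, inplace=True)`) keeps the operation on the DataFrame itself, which is the pattern suggested above.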