This was inspired by #33888 and #34584
Problem description
The behavior of copy
argument in df.reindex
is confusing. Current docstring does it explain it sufficiently clear. It also seems to me copy
is unnecessary.
Currently the docstring says
...
A new object is produced unless the new index is equivalent to the current one and ``copy=False``.
...
copy : bool, default True
Return a new object, even if the passed indexes are the same.
It is hard to clarify what is considered an "equivalent" index. See below for more details.
Further, I believe users rarely purposefully tries to reindex
with an "equivalent" index. It happens only if the user does not yet know the current index or the index to conform to, in which case a consistent behavior (e.g. always return new object) is probably preferred.
# On current master
>>> pd.__version__
'1.1.0.dev0+1802.g942beba1e'
>>> df = pd.DataFrame(range(3))
>>> df
0
0 0
1 1
2 2
>>> df.index
RangeIndex(start=0, stop=3, step=1)
# not equivalent
>>> df is df.reindex(range(3), copy=False)
False
# not equivalent
>>> df is df.reindex(list(range(3)), copy=False)
False
# equivalent
>>> df is df.reindex(pd.RangeIndex(start=0, stop=3, step=1), copy=False)
True
>>> df = pd.DataFrame(range(3), index=list(range(3)))
>>> df
0
0 0
1 1
2 2
>>> df.index
Int64Index([0, 1, 2], dtype='int64')
# not equivalent
>>> df is df.reindex(range(3), copy=False)
False
# even this is considered not equivalent
>>> df is df.reindex(list(range(3)), copy=False)
False
>>> df is df.reindex(pd.Int64Index([0, 1, 2]), copy=False)
True
You can see it is actually pretty strict to be "equivalent". I feel it does really make sense to have this copy
parameter because reindex
will return a new object in most cases anyway even when copy=False
.
So the question is, can we deprecate copy
?
Comment From: jorisvandenbossche
Even though the "equivalent" criterion is maybe too strict or not very clear (two things that might be worth solving for sure), there is still value in having this keyword for the case you really have an identical index.
For example, the copy=False
argument is used quite a lot internally to avoid copies. In some operations, input needs be reindexed (although this is often done through align
, but that is under the hood based on the same code as reindex
), and in those cases you want to avoid actually needing to reindex when the other object has the same index (eg df + df
)
Comment From: gauravchandok
The copy
parameter directly contradicts that a new object is created when the copy=false
argument meets a rare "equivalent" index. Although perhaps stating it as a special note might bring some attention to the how unique this is might help. It seems to be the only exception to the reindex
summary which is why I tried to keep it in.
Comment From: shirleylijie
Or maybe just leave it as a parameter that user define to return an object or not?
copy=False
means do reindex on the original dataframe.
Comment From: jorisvandenbossche
The copy parameter directly contradicts that a new object is created when the copy=false argument meets a rare "equivalent" index.
Again this is something that could be clarified in the docstring, but this behaviour of copy
is very similar to for example numpy's handling of the copy
argument on array coercion: eg np.array(some_obj, copy=False)
doesn't mean that no copy is made, but only means that "a copy is avoided when possible", while copy=True
means to simply always copy the input.
Comment From: jbrockmendel
Does CoW make this irrelevant?