Improve coverage of NDFrame.__finalize__
Pandas uses NDFrame.__finalize__
to propagate metadata from one NDFrame to
another. This ensures that things like self.attrs
and self.flags
are not
lost. In general we would like that any operation that accepts one or more
NDFrames and returns an NDFrame should propagate metadata by calling
__finalize__
.
The test file at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/generic/test_finalize.py attempts to be an exhaustive suite of tests for all these cases. However there are many tests currently xfailing, and there are likely many APIs not covered.
This is a meta-issue to improve the use of __finalize__
. Here's a hopefully
accurate list of methods that don't currently call finalize.
Some general comments around finalize
- We don't have a good sense for what should happen to
attrs
when there are multiple NDFrames involved with differing attrs (e.g. in concat). The safest approach is to probably drop the attrs when they don't match, but this will need some thought. - We need to be mindful of performance.
__finalize__
can be somewhat expensive so we'd like to call it exactly once per user-facing method. This can be tricky for things likeDataFrame.apply
which is sometimes used internally. We may need to refactor some methods to have a user-facingDataFrame.apply
that calls an internalDataFrame._apply
. The internal method would not call__finalize__
, just the user-facingDataFrame.apply
would.
If you're interested in working on this please post a comment indicating which method you're working on. Un-xfail the test, then update the method to pass the test. Some of these will be much more difficult to work on than others (e.g. groupby is going to be difficult). If you're unsure whether a particular method is likely to be difficult, ask first.
- [x]
DataFrame.__getitem__
with a scalar - [ ]
DataFrame.eval
withengine="numexpr"
- [x]
DataFrame.duplicated
- [x]
DataFrame.add
,mul
, etc. (at least for most things; some work to do on conflicts / overlapping attrs in binops) - [ ]
DataFrame.combine
,DataFrame.combine_first
- [x]
DataFrame.update
- [x]
DataFrame.pivot
,pivot_table
- [x]
DataFrame.stack
- [x]
DataFrame.unstack
- [x]
DataFrame.explode
https://github.com/pandas-dev/pandas/pull/46629 - [x]
DataFrame.melt
https://github.com/pandas-dev/pandas/pull/46648 - [x]
DataFrame.diff
- [x]
DataFrame.applymap
- [x]
DataFrame.append
- [ ]
DataFrame.merge
- [x]
DataFrame.cov
- [x]
DataFrame.corrwith
- [x]
DataFrame.count
- [x]
DataFrame.nunique
- [x]
DataFrame.idxmax
,idxmin
- [x]
DataFrame.mode
- [x]
DataFrame.quantile
(scalar and list of quantiles) - [x]
DataFrame.isin
- [x]
DataFrame.pop
- [x]
DataFrame.squeeze
- [x]
Series.abs
- [x]
DataFrame.get
- [x]
DataFrame.round
- [x]
DataFrame.convert_dtypes
- [x]
DataFrame.pct_change
- [x]
DataFrame.transform
- [x]
DataFrame.apply
- [ ]
DataFrame.any
,sum
,std
,mean
, etdc. - [x]
Series.str.
operations returning a Series / DataFrame - [x]
Series.dt.
operations returning a Series / DataFrame - [x]
Series.cat.
operations returning a Series / DataFrame - [x] All groupby operations (at least some work)
- [x]
.iloc
/.loc
https://github.com/pandas-dev/pandas/pull/46101
Comment From: TomAugspurger
How __finalize__
works with methods like concat
requires a bit of discussion. How do we reconcile metadata from multiple sources?
In #27108, I'm using _metadata
to propagate whether the NDFrame allows duplicate labels. In this situation, my ideal reconciliation function would be
def reconcile_concat(others: List[DataFrame]) -> bool:
"""
Allow duplicates only if all the inputs allowed them.
If any disallow them, we disallow them.
"""
return all(x.allows_duplicates for x in others)
However, that reconciliation strategy isn't valid / proper for arbitrary metadata. Which I think argues for some kind of dispatch system for reconciling metadata, where the attribute gets to determine how things are handled.
allows_duplicate_meta = PandasMetadata("allows_duplicates") # the attribute name
@allows_duplicate_meta.register(pd.concat) # the method
def reconcile_concat():
...
Then we always pass method
to NDFrame.__finalize__
, which we'll use to look up the function for how to reconcile things. A default reconciliation can be provided.
cc @jbrockmendel, since I think you looked into metadata propagation in another issue.
Comment From: jbrockmendel
IIRC I was looking at _metadata to try to implement units (this predated EAs). One of the biggest problems I had was that metadata on a Series didn't behave well when that Series is inserted into a DataFrame.
Do we have an idea of how often _metadata is used in the wild? i.e. could we deprecate it and make an EA-based implementation?
Comment From: TomAugspurger
It’s essentially undocumented, so I’m OK with being aggressive here.
What would an EA-based implementation look like? For something like units, metadata may not be appropriate. I think an EA dtype makes more sense.
I’ll add this to the agenda for next weeks call.
Comment From: jbrockmendel
What would an EA-based implementation look like?
It's a bit ambiguous what this question is referring to, but big picture something like _metadata
but with dispatch logic could be done with something like (for this example I'm thinking of units in physics where (4 meters) / (2 seconds) == 2 m/s
):
class MyDtype(ExtensionDtype):
def __mul__(self, other):
other = getattr(other, "dtype", other)
return some_new_dtype
class MyEA(ExtensionArray):
def __mul__(self, other):
result = self._data * other
dtype = self.dtype * other
return type(self)(result, dtype)
Comment From: TomAugspurger
Right. Does the current EA interface suffice for that use case, or are there additional hooks needed?
Comment From: TomAugspurger
Not a blocker for 1.0.
Comment From: TomAugspurger
Do people think that accessor methods like .str
and .dt
should call finalize? IMO, yes they should.
Comment From: jorisvandenbossche
But would you then handle name
propagation there?
Comment From: TomAugspurger
Name propagation isn't (currently) handled in __finalize__
. I don't think it should be handled there currently, since the current __finalize__
isn't well suited to resolving the name when there are multiple inputs. In the future it might make sense.
My two motivating use-cases here are
- My
allow_duplicate_labels
PR, for disallowing duplicates - A workload that preserves something like
.attrs["source"] = "file.csv"
through as many operations as makes sense.
Comment From: jorisvandenbossche
Name propagation isn't (currently) handled in finalize.
It actually is, not? At least in some cases? (eg for new Series originating from other Series, where other
is that Series).
Comment From: TomAugspurger
Apologies, I forgot that name
was in _metadata
. So yes, name handling could be handled there.
Comment From: TomAugspurger
When should we call finalize? A high-level list:
Yes
- Reshape operations (stack, unstack, reset_index, set_index, to_frame)
- Indexing (take, loc, iloc,
__getitem__
, reindex, drop, assign, select_dtypes, nlargest?) - "transforms" (repeat, explode, shift, diff, round, isin, fillna, isna, dropna, copy, rename, applymap,
.transform
, sort_values, sort_index) - Accessor methods returning Series
.str
,.dt
,.cat
- Binary ops (arithmetic, comparison, combine,
- ufuncs
- concat / merge / joins, append
- cumulative aggregations (cumsum)?
Unsure
- Reductions (DataFrame.sum(), etc. count, quantlie, idxmin)
- groupby, pivot, pivot_table?
- DataFrame.apply?
- corr, cov, etc.
These are somewhat arbitrary. I can't really come up with a rule why a reduction like DataFrame.sum()
shouldn't call __finalize__
while DataFrame.cumsum
does. So perhaps the rule should be "any NDFrame method returning an NDFrame (or Tuple[NDFrame]) will call __finalize__
". I like the simplicity of that.
Comment From: TomAugspurger
I've updated the original post. Hopefully we can find some contributors interested in working on getting finalize called in more places.
Comment From: Sadin
@TomAugspurger I would be interested in contributing to pandas and start by helping to tackle some of these methods. Which methods might be good places to start?
Comment From: TomAugspurger
Thanks @Sadin. I think DataFrame.duplicated would be a good one.
Comment From: mzeitlin11
@TomAugspurger I would be interested in contributing, is it all right if I work on DataFrame.unstack?
Comment From: TomAugspurger
@mzeitlin11 that'd be great. Looks like @arw2019 is handling stack
in #37186
Comment From: Illviljan
Series attrs
are lost when using df.iloc
. Dask relies on iloc when creating meta dataframes so the attrs will always be lost for dask series.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df.attrs = {'date': '2020-10-16'}
df.A.attrs['unit'] = 'kg'
df.iloc[:0].attrs
Out[20]: {'date': '2020-10-16'}
df.iloc[:0].A.attrs
Out[21]: {}
Further reading: https://github.com/dask/dask/issues/6730
Comment From: Japanuspus
@TomAugspurger: I have a potential fix for part of the groupby
issues, see c882db554888b141fa95d1891f9cb84ab06b95f4. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.
Should I just reference this issue in a pull request, or open a more specific issue first?
Comment From: abbywh
@TomAugspurger I'm interested in taking my first issue. Can I take DataFrame.Append?
Comment From: TomAugspurger
That'd be great.
On Thu, Oct 22, 2020 at 8:57 PM Joel Whittier notifications@github.com wrote:
@TomAugspurger https://github.com/TomAugspurger I'm interested in taking my first issue. Can I take DataFrame.Append?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28283#issuecomment-714858662, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOITDM2KY2P2YOO2IIVLSMDPIHANCNFSM4ITRD3TQ .
Comment From: abbywh
Finished Append. Also did merge, eval, diff, applymap. Should have a PR coming soon.
Comment From: jorisvandenbossche
I have a potential fix for part of the groupby issues, see c882db5. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.
@Japanuspus do you want to open a PR for it?
Comment From: Japanuspus
I have a potential fix for part of the groupby issues, see c882db5. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.
@Japanuspus do you want to open a PR for it?
Sure -- I'll go ahead and open with a reference to this issue.
Comment From: liaoaoyuan97
@TomAugspurger I think pivot_table
hasn't been done yet, as it is still marked with xfail. Is it okay if I take this one?
Comment From: TomAugspurger
Yep, thanks.
On Jan 22, 2021, at 9:22 PM, Eve notifications@github.com wrote:
@TomAugspurger https://github.com/TomAugspurger I think pivot_table hasn't been done yet, as it is still marked with xfail. Is it okay if I take this one?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28283#issuecomment-765851582, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIW4UKSD4UH3RCM4VK3S3I6IPANCNFSM4ITRD3TQ.
Comment From: liaoaoyuan97
@TomAugspurger Do you still remember why ?
is necessary in this regular expression?
https://github.com/pandas-dev/pandas/blob/0c18cc6b8a76031eeed9380c365b4149994b72df/pandas/tests/generic/test_finalize.py#L497
It seems to duplicate *
, if it means the capture group is optional.
Comment From: TomAugspurger
I’m not sure.
On Feb 10, 2021, at 23:06, Eve notifications@github.com wrote:
@TomAugspurger Do you still remember why ? is necessary in this regular expression? https://github.com/pandas-dev/pandas/blob/0c18cc6b8a76031eeed9380c365b4149994b72df/pandas/tests/generic/test_finalize.py#L497 It seems to duplicate *, if it means the capture group is optional.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Comment From: palash247
I would like to work on few methods which are not so complicated to fix. Can someone please point me in the right direction? This will be my first contribution.
Comment From: rlkenny98
@TomAugspurger I'd like to work on eval for my first contribution if that is alright
Comment From: hans2520
We don't have a good sense for what should happen to attrs when there are multiple NDFrames involved with differing attrs (e.g. in concat).
I would argue that .update()
should be used vs other alternatives like dropping, with the order of ops determining which is being updated (e.g. A + B + C
means A.update(B); A.update(C)
)
Comment From: moha-rk
Hello. I'd like to contribute to this issue.
First, i chose the Series.abs
function, but after running some tests, I saw that this method propagates the Series attributes without using __finalize__
. I think that the numpy function np.abs
creates a copy of the object before calculating the absolute value of the Series entries. Here's an execution example:
>>> import pandas as pd
>>> s = pd.Series([-5, 4])
>>> s.attrs = {"a": 1, "z" : "b"}
>>> result = s.abs()
>>> s.attrs
{'a': 1, 'z': 'b'}
>>> result.attrs
{'a': 1, 'z': 'b'}
So I guess __finalize__
isn't necessary here, unless I'm wrong. Now I'm working on DataFrame.pop
, and will soon submit a PR.
Comment From: casperlundberg
Hi, I'm very new to open source contributing but I'd like to contribute to this issue. Would be nice if someone could recommend a method to start out with.
Edit
Working with:
- DataFrame.combine, DataFrame.combine_first
- DataFrame.explode
- DataFrame.isin
- DataFrame.merge
- DataFrame.melt
- DataFrame.mode
- DataFrame.quantile
Along with some friends
Comment From: EmilianoJordan
Currently I'm working on:
iloc
loc
to_frame
These were chosen as they're all tied to _LocIndexer
and to_frame
because it came up in a test case.
Also, there's been some discussion on which version of attrs to use when there are N number available (merge, concat etc). I believe this should follow the same logic as the return type when using subclassed objects. I've updated some of my thoughts on subclassed return types here.
Comment From: skwirskj
I am currently working on DataFrame.add
I also have a PR up that calls finalize in DataFrame.astype which will solve this issue for some of the functions, such as DataFrame.cov which calls astype at the end of the function. #44680
Comment From: theOehrly
I currently have an open PR (#46101) that intends to fix .loc
and .iloc
.
This might additionally fix .quantile
, but that needs some further checking.
Comment From: songsol1
Hi, I'm new to open source contributing but I am interested in contributing to this issue.
It looks like DataFrame.melt
and DataFrame.explode
are still unfinished, so I'll be working on these for now.
From my brief investigation, it looks like the following methods call __finalize__
or use DataFrame.apply
:
Dataframe.count
Dataframe.squeeze
Dataframe.get
Dataframe.round
Dataframe.convert_dtypes
Dataframe.pct_change
Dataframe.isin
Dataframe.apply
has been fixed and consequently the following have been fixed as well:
Dataframe.transform
Dataframe.nunique
Dataframe.mode
@TomAugspurger if you could update the original listing or double check this it would be appreciated.
Comment From: theOehrly
.quantile
can be ticked off as well. Done as of #47183
Comment From: covertg
Hi, I'd be happy to help tick some of the boxes off here. Would love to see attrs
move out of its experimental status. However, I'm new to contributing to pandas and would appreciate suggestions on which methods to start with.
Comment From: hamedgibago
Hi, I want to take my first issue and I'm new contributor to pandas. I'd like to get DataFrame.count
as my first. Is it good for start? If not please offer another one. Thanks.
Comment From: SomtochiUmeh
take
Comment From: SomtochiUmeh
Starting with Dataframe.idxmax() and Dataframe.idxmin()
Comment From: seljaks
Hi, I'm new to open source and I'm interested in contributing to this issue. I'd like to start with DataFrame.add
and if that goes well move on to sub
, mul
, and div
Comment From: yuanx749
Worked on corr
and cov
.
Skip corrwith
for the time being since it involves another NDFrame.
Comment From: bobzhang-stack
Hello, I'm new to contributing to open source and am interesting in doing one of the tasks for this issue. @TomAugspurger Is there any particular method (of the remaining ones still with finalize issue) that would be recommended for me to look into? Is pop alright to start with (it seemed like someone was working on it but I didn't see a pull request for resolving it)?
Comment From: TomAugspurger
@bobzhang-stack it looks like pop
might be done.
In [74]: df = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
In [75]: df.attrs['a'] = 1
In [76]: df.pop("B").attrs
Out[76]: {'a': 1}
In [77]: df.attrs
Out[77]: {'a': 1}
It seems the majority of the remaining ones are related to operations between multiple objects with attrs (https://github.com/pandas-dev/pandas/issues/49916). Aside from that, there's .eval
with engine="numexpr"
. I'm not sure how hard that would be.
Comment From: KartikeyBartwal
Hi I'm starting to work on DataFrame.merge
Comment From: Frostbyte72
Would it be possible for me to be assigned to df.merge() to attempt a fix?
Comment From: aijams
Hello, I'm new to the Pandas project and I think I might want to work on this issue. @TomAugspurger, do you think this issue is important for Pandas?
Comment From: aijams
The DataFrame.eval
method is currently untested (xfailed) with the numexpr
engine, which fails when forced to run.
I saw that a couple people tried to get this issue resolved, but neither has been able to resolve it.
@TomAugspurger, have you heard from jiawei-zhang-a about his work on this issue since his last attempt?
I'm trying to gauge whether I should work on this issue.
Also, I can see a few items in the list that are still unchecked even though I found PRs that supposedly resolve them. I plan to check the tests to see if these items are tested and pass their tests.
Comment From: aijams
It looks like DataFrame.merge
is the only other method on the list without a passing test. I did see that DataFrame.dot
is xfailed as well.
Comment From: TomAugspurger
Thanks for checking. I've not heard, from others. I looks like https://github.com/niruta25/pandas/pull/1 was an initial attempt at merge
.
Comment From: aijams
I'll take this issue then. I plan to base my contribution on niruta25's initial attempt.
Comment From: aijams
take
Comment From: aijams
I saw that pd.merge
is only supposed to propagate the metadata .attrs
of its inputs if they are equal as dictionaries.
This requirement is shared with pd.concat
.
The failing test for merge
violated this requirement by passing metadata on the left, but no metadata on the right.
@TomAugspurger, is there agreement on this requirement for merge
and if so, is this the agreed upon behavior?