groupby-apply workflows are important pandas idioms. Here's a brief example grouping on a named DataFrame column:
>>> df = pd.DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'value': range(9)})
>>> result = df.groupby('key').apply(lambda x: x['key'])
>>> result
key
1    0    1
     1    1
     2    1
2    3    2
     4    2
     5    2
3    6    3
     7    3
     8    3
Name: key, dtype: int64
An important highlight of this example is the ability to reference the grouped value -- e.g., x['key'] -- inside the applied function.
pandas also supports grouping on arbitrary mapping functions, iterables, and lots of other objects. In these cases, the grouped value is not represented as a named column in the DataFrame. Thus, when using apply(...), there is no apparent way to access the group key value. The only alternative is to use a (slow) for-loop solution as in:
foo = lambda _k, _g: ...
grouped = df.groupby(grouper)
result_iter = (foo(key, group) for key, group in grouped)
key_iter = (key for key, group in grouped)
pd.DataFrame.from_records(result_iter, index=key_iter)
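For concreteness, here is a minimal runnable sketch of the loop workaround above. The grouper (integer-dividing the index by 3) and the body of foo are hypothetical stand-ins chosen only for illustration; they are not part of the original report.

```python
import pandas as pd

df = pd.DataFrame({'value': range(9)})

# Hypothetical grouper: maps each index label to a group key (0, 1 or 2).
# There is no column in df that holds this key.
grouper = lambda idx: idx // 3

# Hypothetical per-group function that needs the key as well as the group.
def foo(key, group):
    return {'key': key, 'total': group['value'].sum()}

grouped = df.groupby(grouper)

# Iterate over (key, group) pairs explicitly so the key can be passed to foo.
records = [foo(key, group) for key, group in grouped]
result = pd.DataFrame.from_records(records, index='key')
print(result)
```

Building the list of records in a single pass avoids iterating over the groupby object twice, once for the results and once for the keys.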
IMHO, the ability to access the grouped value in an idiomatic way from within the applied function is ergonomically important; the groupby-apply idiom is at best partially realized without it.
Comment From: jreback
see #9239 which implements the copy-assign idiom
Comment From: brianthelion
@jreback Sorry for being a complete n00b here, but at a quick glance I can't imagine how to solve the present issue with copy-assign. My imagination is pretty limited, though. Could you provide a basic intuition? Thanks!
Comment From: TomAugspurger
Do you have a small example where you need to reference the grouping key? I'm trying to see where this would be necessary (or just convenient).
Also, does the .name attribute on the values of the groupby help you out? I think it has the value of the evaluated (grouping) function for that group.
Comment From: jreback
In [9]: df.groupby('key').apply(lambda x: Series(x.name,x.index))
Out[9]:
key
1    0    1
     1    1
     2    1
2    3    2
     4    2
     5    2
3    6    3
     7    3
     8    3
dtype: int64
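The same attribute appears to cover the case raised in the original report, where the grouper is a function rather than a named column. A minimal sketch, assuming a hypothetical parity grouper on the index and selecting a single column so that each group is a Series:

```python
import pandas as pd

df = pd.DataFrame({'value': range(9)})

# Hypothetical grouper: group rows by the parity of their index label.
by_parity = df.groupby(lambda idx: idx % 2)['value']

# Inside apply, each group carries its key in .name, so the applied
# function can use the key even though no column holds it.
labelled = by_parity.apply(lambda s: f"{s.name}:{s.sum()}")
print(labelled)
```

Note that for a Series group, .name replaces the column name ('value') for the duration of the call.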
Comment From: brianthelion
It appears that .name is the attribute that I've been looking for! Is this the intended use-case for the attribute? If so, my feeling is that "name" is somewhat nondescript.
Comment From: TomAugspurger
It is the intended use. I think it's documented in the transform section.
On Feb 24, 2015, at 6:07 PM, brianthelion notifications@github.com wrote:
It appears that .name is the attribute that I've been looking for! Is this the intended use-case for the attribute? If so, my feeling is that "name" is somewhat nondescript.
Comment From: TomAugspurger
I just looked through the docs and didn't see anything about .name being set. It is in the docstring for gr.transform:
Each subframe is endowed with the attribute 'name' in case you need to know which group you are working on
@brianthelion are you interested in submitting a pull request to clarify the prose docs (and maybe the docstring on apply)? I don't think the .name attribute is set for .agg operations. Not sure if this is intentional.
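A small sketch of what that docstring describes, assuming the behaviour it documents (each group passed to a callable in transform carries the group key in .name) holds in the installed pandas version. It reuses the frame from the top of the thread:

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'value': range(9)})

# Within transform, s is the 'value' column for one group and s.name is
# that group's key, so the key can be used in the computation directly.
shifted = df.groupby('key')['value'].transform(lambda s: s - s.name)
print(shifted)
```

Nothing here says anything about .agg; whether .name is set there is left open, as above.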
Comment From: brianthelion
Happy to! However, I think there should first be some discussion about the semantics of the attribute. IMHO, "name" is way too generic and semantically kinda nonsensical in lots of use-cases. I'm not sure what's appropriate for the others, but ".grouped_value" or thereabouts seems appropriate for the groupby-apply workflow.
Comment From: TomAugspurger
I'd vote for .key. .name was probably chosen because it already has a meaning on Series.
Is there any risk of clobbering an attribute on someone's Series? I think we're ok, unless they're explicitly setting something to .key in the apply function...
Comment From: alkasm
Is there any risk of clobbering an attribute on someone's Series?
@TomAugspurger yes :( Just happened to me. Not sure how I feel about an attribute with such a common name added to the dataframe which wasn't there in the ungrouped dataframe. Was a really hard bug to pinpoint. EDIT: the problem for me was that I had a column named "name" and tried accessing it with the dot-access syntactic sugar. The same would happen with "key".
https://github.com/pandas-dev/pandas/issues/25457
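A minimal sketch of that collision, using a made-up frame with a column called "name". Inside apply, dot access picks up the group key that pandas attaches to each group, while bracket access still reaches the column:

```python
import pandas as pd

# Made-up data: 'name' is an ordinary column here.
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'name': ['x', 'y', 'z']})
groups = df.groupby('key')[['name']]

# Dot access returns the group key, not the 'name' column.
via_dot = groups.apply(lambda g: g.name)

# Bracket access is unambiguous and returns the column's values.
via_brackets = groups.apply(lambda g: g['name'].iloc[0])

print(via_dot)
print(via_brackets)
```

Using bracket access inside applied functions sidesteps the ambiguity regardless of what the attribute ends up being called.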
Comment From: ericabrauer
Is there any risk of clobbering an attribute on someone's Series?
@TomAugspurger yes :( Just happened to me. Not sure how I feel about an attribute with such a common name added to the dataframe which wasn't there in the ungrouped dataframe. Was a really hard bug to pinpoint. EDIT: the problem for me was that I had a column named "name" and tried accessing it with the dot-access syntactic sugar. The same would happen with "key". #25457
Could you change it to something like .name_ or .key_ for the official (non-column) attribute referencing?
Comment From: alkasm
@ericabrauer FYI, probably better to add that to the issue I linked, which is more aligned with identifying the .name attribute as an issue.
Comment From: ericabrauer
@ericabrauer FYI, probably better to add that to the issue I linked, which is more aligned with identifying the .name attribute as an issue.
Oh ok- thanks @alkasm, I guess maybe I don't have a full enough grasp on the issue. I'll try to take a look at the one you referenced and see if I can better understand.
Comment From: Vishnu-sai-teja
I am new to open source contribution, but I have a good knowledge of how pandas works. I understand this issue better than some of the others, and I would like to try it.
Comment From: samyarpotlapalli
take