Feature Type
- [ ] Adding new functionality to pandas
- [x] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
When a DataFrame is grouped by a single column passed as a one-element list, the group `name` is sometimes a scalar and sometimes a one-element tuple. This should be more consistent.
The examples below use the following DataFrame:

    >>> df
       a  b  c
    0  1  2  3
    1  1  5  6
    2  7  8  9
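For reproducibility, the frame above could be built roughly as follows. The values are read off the printed output; the default integer index is an assumption.

```python
import pandas as pd

# Values copied from the printed frame; a default RangeIndex is assumed.
df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df)
```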
Cases where `name` is a scalar
- `DataFrameGroupBy.groups`
```python
>>> print(df.groupby(['a']).groups)
{1: [0, 1], 7: [2]}
```
- `DataFrameGroupBy.apply`
```python
>>> df.groupby(['a']).apply(lambda x: print(x.name), include_groups=False)
1
7
Empty DataFrame
Columns: []
Index: []
```
Cases where `name` is a one-element tuple
- `DataFrameGroupBy.__iter__`
```python
>>> for name, _ in df.groupby(['a']):
...     print(name)
...
(1,)
(7,)
```
Documentation
It should perhaps be said that `DataFrameGroupBy.name` is poorly documented, but it is not a private property. It seems like the most natural thing to query if you need the values of the columns that you have grouped by.
This is especially important because pandas forces `include_groups=False` in `apply` with a `FutureWarning`/`DeprecationWarning`. So `DataFrameGroupBy.name` seems like the most natural way to recover this information now.
Feature Description
Consistency in either direction:
- PRO SCALAR: In the majority of the cases I checked, the `name` is a scalar, although to be fair I have not checked many cases. A scalar is probably also more intuitive to people who are not thinking about multi-column groupings.
- PRO TUPLE: A one-element tuple makes this consistent with the case where multiple columns are selected.
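Until the behavior is made consistent, calling code could normalize the group name itself. The following is a minimal sketch in the "tuple" direction; `as_tuple` is a hypothetical helper, not part of pandas, and the frame is the example from the problem description.

```python
import pandas as pd


def as_tuple(name):
    """Hypothetical helper: wrap a scalar group name in a one-element tuple."""
    return name if isinstance(name, tuple) else (name,)


df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
grouped = df.groupby(["a"])

# Normalize the keys from both access paths so they can be compared
# regardless of whether pandas returns scalars or one-element tuples.
keys_from_groups = {as_tuple(key) for key in grouped.groups}
keys_from_iter = {as_tuple(name) for name, _ in grouped}
print(keys_from_groups == keys_from_iter)
```

Note that this sketch cannot tell a one-element tuple key apart from a grouping value that is itself a tuple, which is one argument for settling the question inside pandas rather than in user code.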
Additional Context
No response
Comment From: rhshadrach
Thanks for the report. For `apply`, I believe we want to move away from pinning `name` on the passed DataFrame. However, for `GroupBy.groups`, I'm positive on changing the keys to be tuples in the case where the user is grouping by a 1-element iterable. PRs to fix are welcome!
Comment From: FelixBenning
I am a bit confused about why you want to remove this entirely from `apply`. `apply` feels like the functional alternative to `__iter__`, which provides the name. It is entirely reasonable to want to use this (constant) information in the applied function. Why would you want to remove it? How would you access this information if both the columns and the name are removed?
Could `include_groups=True` be a flag such that the group values are passed to the applied function as a second argument?
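To make the suggestion concrete, here is a rough sketch of what "group values as a second argument" could look like when emulated in user code on top of `__iter__`. `apply_with_key` is a hypothetical helper, not an existing or proposed pandas API, and it returns a plain dict rather than a combined DataFrame.

```python
import pandas as pd


def apply_with_key(grouped, func):
    """Hypothetical helper: call func(group, key) once per group."""
    # Iterating a GroupBy yields (key, sub-frame) pairs.
    return {key: func(group, key) for key, group in grouped}


df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# The callback receives the group key explicitly instead of reading x.name.
result = apply_with_key(
    df.groupby(["a"]),
    lambda group, key: int(group["b"].sum()),
)
print(result)
```

Unlike `apply` with `include_groups=False`, the sub-frames produced by iteration still contain the grouping column `a`, so this is only an approximation of the proposed behavior.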