Feature Type
- [ ] Adding new functionality to pandas
- [x] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
If a DataFrame is grouped by a single column passed as a one-element list, the group name is sometimes a scalar and sometimes a single-element tuple, depending on how it is accessed. This should be more consistent.
In the following I will use this example DataFrame:
```python
>>> df
   a  b  c
0  1  2  3
1  1  5  6
2  7  8  9
```
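For anyone who wants to reproduce the cases below, the example frame can be built as follows. This is a minimal sketch; the constructor call is inferred from the printed output above and is not part of the original report.

```python
import pandas as pd

# Rebuild the example frame shown above: three rows, columns a, b, c.
df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df)
```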
Cases where name is a scalar
- DataFrameGroupBy.groups
```python
>>> print(df.groupby(['a']).groups)
{1: [0, 1], 7: [2]}
```
- DataFrameGroupBy.apply
```python
>>> df.groupby(['a']).apply(lambda x: print(x.name), include_groups=False)
1
7
Empty DataFrame
Columns: []
Index: []
```
Cases where name is a one element tuple
- DataFrameGroupBy.__iter__
```python
>>> for name, _ in df.groupby(['a']):
...     print(name)
...
(1,)
(7,)
```
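To compare the two access paths side by side, a small script like the following may help. It is only a sketch: the `as_tuple` helper is hypothetical (not a pandas function), and the raw key types printed depend on the installed pandas version, so no output is shown here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
gb = df.groupby(["a"])

# Keys as reported by .groups versus names yielded by iteration.
groups_keys = list(gb.groups)
iter_keys = [name for name, _ in gb]
print("groups:", groups_keys)
print("iter:  ", iter_keys)


def as_tuple(key):
    # Hypothetical helper: wrap scalar keys so both paths can be compared.
    return key if isinstance(key, tuple) else (key,)


# After normalization both paths describe the same groups.
normalized_groups = [as_tuple(k) for k in groups_keys]
normalized_iter = [as_tuple(k) for k in iter_keys]
print(normalized_groups == normalized_iter)
```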
Documentation
It should perhaps be said that DataFrameGroupBy.name is poorly documented. But it is not a private property, and it seems like the most natural thing to query if you need the values of the columns that you grouped by.
This is especially important because pandas forces include_groups=False in apply with a FutureWarning/DeprecationWarning. So DataFrameGroupBy.name seems like the most natural way to recover this information now.
Feature Description
Consistency in either direction:
- PRO SCALAR: In the majority of cases I checked, the name is a scalar, although to be fair I have not checked many cases. This is probably also more intuitive to people who do not think about multi-column groupings.
- PRO TUPLE: The single-element tuple makes this consistent with the case where multiple columns are selected.
Additional Context
No response
Comment From: rhshadrach
Thanks for the report. For apply, I believe we want to move away from pinning name on the passed DataFrame. However for GroupBy.groups, I'm positive on changing the keys to be tuples in the case where the user is grouping by a 1-element iterable. PRs to fix are welcome!
Comment From: FelixBenning
I am a bit confused why you want to remove this entirely from apply. apply feels like the functional alternative to __iter__, where you are given the name. It is entirely reasonable to want to use this (constant) information in the apply function. Why would you want to remove it? How would you access this information if both the columns and the name are removed?
Could include_groups=True be a flag such that the group values are passed to the apply function as a second argument?
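To make the idea concrete, here is a rough sketch of what I mean, written as a user-side workaround on top of iteration. `apply_with_keys` is a hypothetical helper and not part of pandas; it assumes `by` is a list of column names and always passes the key as a tuple.

```python
import pandas as pd


def apply_with_keys(df, by, func):
    """Hypothetical helper (not a pandas API): call func(group, key) per group."""
    results = {}
    for key, group in df.groupby(by):
        # Normalize so the key is a tuple regardless of what iteration yields.
        key = key if isinstance(key, tuple) else (key,)
        # Drop the grouping columns, mirroring include_groups=False.
        results[key] = func(group.drop(columns=by), key)
    return results


df = pd.DataFrame({"a": [1, 1, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# Use the group value together with the remaining columns.
totals = apply_with_keys(df, ["a"], lambda g, key: key[0] * g["b"].sum())
print(totals)
```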
Comment From: rhshadrach
> I am a bit confused why you want to remove this entirely from apply.
This is https://github.com/pandas-dev/pandas/issues/41090. In order to keep the discussion consolidated, I suggest sharing any thoughts there and leaving this issue for the change to .groups.