- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
df1 = pd.DataFrame(columns=["a", "b", "c"])
print(df1.groupby("a").sum().columns)
# => Index([], dtype='object')
df2 = pd.DataFrame({"a": [], "b": [], "c": []})
print(df2.groupby("a").sum().columns)
# => Index(['b', 'c'], dtype='object')
Problem description
groupby-ing and summing an empty dataframe leads to dropped columns (df1 above); this doesn't occur with non-empty dataframes. Changing the columns of a dataframe based on its content is counter-intuitive and leads to key errors. The expected behaviour is shown above with df2, and the fact that two empty dataframes behave differently when grouped and summed suggests that this isn't intended behaviour.
Expected Output
Output of the above snippet:
Comment From: phofl
Hi, thanks for your report.
This is actually quite straightforward and has nothing to do with groupby itself. df1 has dtype object while df2 has dtype float. If you set numeric_only=False, you will get your columns as expected.
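A minimal sketch of the point above, assuming a pandas version whose groupby sum accepts the numeric_only keyword. The names obj_df and flt_df are made up for illustration, and the dtypes are set explicitly so the result does not depend on what the constructor infers:

```python
import pandas as pd

# Hypothetical names; dtypes are set explicitly rather than inferred.
obj_df = pd.DataFrame(columns=["a", "b", "c"], dtype="object")
flt_df = pd.DataFrame(columns=["a", "b", "c"], dtype="float64")

# With numeric_only=True, non-numeric (object) columns are excluded
# from the aggregation, so only the float columns survive.
obj_numeric = list(obj_df.groupby("a").sum(numeric_only=True).columns)
flt_numeric = list(flt_df.groupby("a").sum(numeric_only=True).columns)

# With numeric_only=False, object columns are kept as well.
obj_all = list(obj_df.groupby("a").sum(numeric_only=False).columns)

print(obj_numeric, flt_numeric, obj_all)
```

Passing numeric_only explicitly, instead of relying on the default, keeps the resulting columns independent of the default for that keyword in the installed pandas version.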
Comment From: sergiykhan
The inconsistency appears to be related to the 'float' data type that you have in df2. Here is a different example:
import pandas as pd
df1 = pd.DataFrame(columns=["a", "b", "c"], dtype='float')
print(df1.groupby("a").sum().columns)
# Index(['b', 'c'], dtype='object')
df2 = pd.DataFrame(columns=["a", "b", "c"], dtype='object')
print(df2.groupby("a").sum().columns)
# Index([], dtype='object')
Comment From: phofl
The dtype inference might be wrong here, but I don't know the history and whether this is intended.
Comment From: RileyLazarou
@phofl thanks for the quick reply! I had no idea that these two methods of instantiating empty dataframes led to different dtypes:
>>> pd.DataFrame(columns=["a"]).dtypes
a    object
dtype: object
>>> pd.DataFrame({"a": []}).dtypes
a    float64
dtype: object
Comment From: phofl
Reopening since the different dtypes look strange.
Comment From: debnathshoham
take
Comment From: debnathshoham
On further investigation, there is a mismatch in the .index as well:
>>> pd.DataFrame(columns=["a"]).index
Index([], dtype='object')
>>> pd.DataFrame({"a": []}).index
RangeIndex(start=0, stop=0, step=1)
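One possible workaround, sketched under the assumption that the caller knows the intended dtype up front, is to normalise every empty frame to the same dtype and a fresh default index before grouping. The helper name normalize_empty is hypothetical and not part of pandas:

```python
import pandas as pd


def normalize_empty(df: pd.DataFrame, dtype: str = "float64") -> pd.DataFrame:
    """Cast all columns to one dtype and reset to a default RangeIndex.

    Hypothetical helper for illustration, not a pandas API.
    """
    return df.astype(dtype).reset_index(drop=True)


from_columns = normalize_empty(pd.DataFrame(columns=["a"]))
from_dict = normalize_empty(pd.DataFrame({"a": []}))

# After normalisation both frames report the same dtypes and index type.
print(from_columns.dtypes.equals(from_dict.dtypes))
print(type(from_columns.index).__name__, type(from_dict.index).__name__)
```

This only hides the difference on the caller's side; it does not change how the two constructors infer dtype or index.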
Comment From: debnathshoham
This has too many moving parts, and a lot of tests seem to depend on the current behaviour. I will unassign myself.