Code Sample (copy-pastable)
import pandas as pd
df1 = pd.DataFrame(
    [["a", 1, 2, "05/29/2019"], ["a", 4, 5, "05/28/2019"], ["b", 2, 3, "05/27/2019"]],
    columns=["type", "num1", "num2", "date"],
).assign(date=lambda df: pd.to_datetime(df["date"]))
df2 = pd.DataFrame(columns=["type", "num1", "num2", "date"]).assign(
    date=lambda df: pd.to_datetime(df["date"])
)
groupbys = ["type", pd.Grouper(key="date", freq="1D")]
df1.groupby(groupbys).head()
df2.groupby(groupbys).head()
Problem description
The above code sample throws the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-ae6b527f8edf> in <module>
7
8 df1.groupby(groupbys).head()
----> 9 df2.groupby(groupbys).head()
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in head(self, n)
2062 """
2063 self._reset_group_selection()
-> 2064 mask = self._cumcount_array() < n
2065 return self._selected_obj[mask]
2066
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in _cumcount_array(self, ascending)
730 (though the default is sort=True) for groupby in general
731 """
--> 732 ids, _, ngroups = self.grouper.group_info
733 sorter = get_group_index_sorter(ids, ngroups)
734 ids, count = ids[sorter], len(ids)
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/ops.py in group_info(self)
250 @cache_readonly
251 def group_info(self):
--> 252 comp_ids, obs_group_ids = self._get_compressed_labels()
253
254 ngroups = len(obs_group_ids)
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/ops.py in _get_compressed_labels(self)
266
267 def _get_compressed_labels(self):
--> 268 all_labels = [ping.labels for ping in self.groupings]
269 if len(all_labels) > 1:
270 group_index = get_group_index(all_labels, self.shape,
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/ops.py in <listcomp>(.0)
266
267 def _get_compressed_labels(self):
--> 268 all_labels = [ping.labels for ping in self.groupings]
269 if len(all_labels) > 1:
270 group_index = get_group_index(all_labels, self.shape,
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in labels(self)
365 def labels(self):
366 if self._labels is None:
--> 367 self._make_labels()
368 return self._labels
369
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in _make_labels(self)
386 # we have a list of groupers
387 if isinstance(self.grouper, BaseGrouper):
--> 388 labels = self.grouper.label_info
389 uniques = self.grouper.result_index
390 else:
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
~/miniconda3/envs/mbc/lib/python3.6/site-packages/pandas/core/groupby/ops.py in label_info(self)
261 labels, _, _ = self.group_info
262 if self.indexer is not None:
--> 263 sorter = np.lexsort((labels, self.indexer))
264 labels = labels[sorter]
265 return labels
ValueError: all keys need to be the same shape
The pd.Grouper object seems to be modified in place inside the list, which is not the behavior I expect. I can work around this by making explicit (deep) copies of the list, so that fresh pd.Grouper instances are passed into both groupby calls. Like so:
In [11]: import copy
In [12]: groupbys = ["type", pd.Grouper(key="date", freq="1D")]
In [13]: df1.groupby(copy.deepcopy(groupbys)).head()
Out[13]:
type num1 num2 date
0 a 1 2 2019-05-29
1 a 4 5 2019-05-28
2 b 2 3 2019-05-27
In [14]: df2.groupby(copy.deepcopy(groupbys)).head()
Out[14]:
Empty DataFrame
Columns: [type, num1, num2, date]
Index: []
Expected Output
As shown above, the expected output is the empty DataFrame, not the ValueError. Interestingly, if I reverse the order and run the df2.groupby first, then run df1.groupby, it works fine. However, running df2.groupby again afterwards throws the ValueError. Something in the df1.groupby call is definitely modifying the pd.Grouper.
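A simpler workaround than deep-copying, sketched below under the assumption that the mutation lives in the pd.Grouper instance itself: construct a fresh pd.Grouper for each groupby call, so no cached state from the first call can leak into the second. The helper name make_keys is illustrative, not part of any pandas API.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["a", 1, 2, "05/29/2019"], ["a", 4, 5, "05/28/2019"], ["b", 2, 3, "05/27/2019"]],
    columns=["type", "num1", "num2", "date"],
).assign(date=lambda df: pd.to_datetime(df["date"]))
df2 = pd.DataFrame(columns=["type", "num1", "num2", "date"]).assign(
    date=lambda df: pd.to_datetime(df["date"])
)

def make_keys():
    # Build a fresh pd.Grouper each time so no state is shared between calls
    return ["type", pd.Grouper(key="date", freq="1D")]

out1 = df1.groupby(make_keys()).head()  # three single-row groups
out2 = df2.groupby(make_keys()).head()  # empty frame, no ValueError
```

On pandas versions where the bug is fixed, even sharing one key list works; building keys per call just sidesteps the issue on affected versions.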
Output of pd.show_versions()
If this is expected behavior please let me know (we can close the issue). If this is not expected behavior, I'd love to take a crack at resolving this (any insight into the issue would be appreciated).
Comment From: jreback
this is probably inadvertent, so a PR to fix would be great.
note that more typically you would do
g = df.groupby(....)
g...
g...
IOW you only really would use this once
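The suggested pattern can be sketched concretely (the columns and frequency here are illustrative, not from the original report): build the groupby object once and reuse it for multiple operations, so the pd.Grouper is only consumed a single time.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", 1, "05/29/2019"], ["b", 2, "05/27/2019"]],
    columns=["type", "num", "date"],
).assign(date=lambda d: pd.to_datetime(d["date"]))

# Create the groupby object once...
g = df.groupby(["type", pd.Grouper(key="date", freq="1D")])

# ...then reuse it for as many operations as needed
heads = g.head()
sums = g["num"].sum()
```

This avoids passing the same key list into repeated .groupby() calls entirely.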
Comment From: alichaudry
- I just opened a PR to attempt to resolve this. Does #29800 seem too naive a fix for this bug?
- Does this fix need a test?
- I'm not really sure which whatsnew to add to.
Comment From: alichaudry
I see now that the updates needed for this are a bit more complex than I thought. To ensure nothing else breaks, more thought will have to go into this.
Comment From: jbrockmendel
#51134 should have fixed this. Checking that would be a good task for a new contributor.
Comment From: jarent-nvidia
take