Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import datetime

import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp(datetime.date(2000, 1, 1)),
            pd.Timestamp(datetime.date(2000, 1, 2)),
            pd.Timestamp(datetime.date(2000, 1, 3)),
            pd.Timestamp(datetime.date(2001, 1, 1)),
        ],
    }
)

# ----------------------------------------------------------------------
# Examples 1-3: default groupby arguments

# Example 1: select columns after .rolling
print(
    X.groupby(["groupby_col"])
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
# Example 2: select columns before .rolling
print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
# Example 3: select columns both before and after .rolling
print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples 4-6: same selections with as_index=False

print(
    X.groupby(["groupby_col"], as_index=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples 7-9: same selections with as_index=False and sort=False

print(
    X.groupby(["groupby_col"], as_index=False, sort=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
Issue Description
Behaviour is inconsistent depending on whether we select columns on the DataFrameGroupBy or on the RollingGroupby.
In the first example, behaviour is as expected and we get
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
In the second, we get
               agg_col       date
groupby_col
0           0      1.0 2000-01-01
            1      3.0 2000-01-02
            3      4.0 2001-01-01
1           2      3.0 2000-01-03
i.e. the original index is kept as an inner index level and date stays a column. I think I have seen a comment about this in another issue before, or in the docs, but I can't seem to find it 🙃. Maybe related to https://github.com/pandas-dev/pandas/issues/56705?
The third example gives us the same as the first
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
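To compare the first three examples without eyeballing the printed frames, the layout of each result can be summarised as its index level names and column names. This is only a sketch: describe_layout is a hypothetical helper introduced here for illustration, not a pandas function, and what it prints depends on the pandas version in use.

```python
import datetime

import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp(datetime.date(2000, 1, 1)),
            pd.Timestamp(datetime.date(2000, 1, 2)),
            pd.Timestamp(datetime.date(2000, 1, 3)),
            pd.Timestamp(datetime.date(2001, 1, 1)),
        ],
    }
)


def describe_layout(result):
    """Hypothetical helper: return (index level names, column names)."""
    return list(result.index.names), list(result.columns)


grouped = X.groupby(["groupby_col"])

# The three column-selection variants from examples one to three.
variants = {
    "post-rolling selection": grouped.rolling(window="5D", on="date")[["agg_col"]],
    "pre-rolling selection": grouped[["agg_col", "date"]].rolling(
        window="5D", on="date"
    ),
    "pre- and post-rolling selection": grouped[["agg_col", "date"]].rolling(
        window="5D", on="date"
    )[["agg_col"]],
}

layouts = {
    name: describe_layout(roller.agg("sum")) for name, roller in variants.items()
}

for name, (index_names, columns) in layouts.items():
    print(f"{name}: index levels={index_names}, columns={columns}")
```

If the three variants were consistent, all three printed lines would show the same index levels.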
How about if we try as_index=False?
The fourth example shows that this doesn't work as expected (it has no effect)
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
But if we select columns before calling .rolling, as in example five, we see that this does seem to work
   groupby_col  agg_col       date
0            0      1.0 2000-01-01
1            0      3.0 2000-01-02
3            0      4.0 2001-01-01
2            1      3.0 2000-01-03
but I'm not sure if this is just a happy coincidence related to the weirdness from example two.
Example six shows that selecting both pre and post .rolling is effectively the same as only selecting post .rolling
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
Adding sort=False doesn't seem to do anything (as noted in other issues, e.g. https://github.com/pandas-dev/pandas/issues/50296), and examples seven through nine exhibit the same behaviour as the earlier examples with respect to pre-rolling, post-rolling, and combined pre- and post-rolling column selection.
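One caveat when checking sort: in X the group keys first appear in the order 0, 1, which is already the sorted order, so sorted and unsorted output cannot be told apart with that frame. A minimal sketch of a check that can distinguish them is below. The frame Y (group 1 appearing before group 0) and the helper group_order are hypothetical, made up for illustration, and no particular result for sort=False is claimed here.

```python
import pandas as pd

# Hypothetical variant of X in which group 1 is seen before group 0,
# so first-seen order (1, 0) differs from sorted order (0, 1).
Y = pd.DataFrame(
    {
        "groupby_col": [1, 1, 0, 1],
        "agg_col": [1, 2, 3, 4],
        "date": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2001-01-01"]
        ),
    }
)


def group_order(frame, sort):
    """Hypothetical helper: order in which group keys appear in the result."""
    result = (
        frame.groupby(["groupby_col"], sort=sort)
        .rolling(window="5D", on="date")[["agg_col"]]
        .agg("sum")
    )
    keys = result.index.get_level_values("groupby_col").tolist()
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(keys))


order_sorted = group_order(Y, sort=True)
order_unsorted = group_order(Y, sort=False)

print("sort=True :", order_sorted)
print("sort=False:", order_unsorted)
```

If sort=False were honoured, the second line would be expected to list group 1 before group 0.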
Expected Behavior
- We would see consistent behaviour between pre-rolling col selection and post-rolling col selection
- as_index would always work and, if False, return a DataFrame where the by from the groupby are in the columns, and not the index, presumably leaving the resulting index as a (potentially) unsorted version of the original index
- sort would work at all
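In the meantime, a possible workaround (a sketch, not a fix) for getting the group keys into the columns is to call reset_index on the post-rolling result from example one. Note the assumption baked in: this yields a fresh RangeIndex, not the original index described in the expected behaviour above.

```python
import datetime

import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp(datetime.date(2000, 1, 1)),
            pd.Timestamp(datetime.date(2000, 1, 2)),
            pd.Timestamp(datetime.date(2000, 1, 3)),
            pd.Timestamp(datetime.date(2001, 1, 1)),
        ],
    }
)

# Start from the layout of example one (groupby_col and date in the index),
# then move both index levels into ordinary columns.
flat = (
    X.groupby(["groupby_col"])
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
    .reset_index()
)

print(flat)
```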