Pandas QST: FutureWarning: Resampling with a PeriodIndex is deprecated, how to resample now?

Research

[X] I have searched the [pandas] tag on StackOverflow for similar questions.
[X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/77862775/pandas-2-2-futurewarning-resampling-with-a-periodindex-is-deprecated

Question about pandas

Pandas version 2.2 raises a warning when using this code:

import pandas as pd

df = pd.DataFrame.from_dict({"something": {pd.Period("2022", "Y-DEC"): 2.5}})
# FutureWarning: Resampling with a PeriodIndex is deprecated.
# Cast index to DatetimeIndex before resampling instead.
print(df.resample("M").ffill())

#          something
# 2022-01        2.5
# 2022-02        2.5
# 2022-03        2.5
# 2022-04        2.5
# 2022-05        2.5
# 2022-06        2.5
# 2022-07        2.5
# 2022-08        2.5
# 2022-09        2.5
# 2022-10        2.5
# 2022-11        2.5
# 2022-12        2.5

This does not work:

df.index = df.index.to_timestamp()
print(df.resample("M").ffill())

#             something
# 2022-01-31        2.5

I have PeriodIndex all over the place and I need to resample them a lot, filling gaps with ffill. How to do this with Pandas 2.2?
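One possible workaround that avoids `resample` altogether is to build the target PeriodIndex by hand and reindex onto it. This is only a minimal sketch under the assumption that the goal is forward-filled upsampling; the names `monthly_index` and `monthly` are illustrative and not part of pandas.

```python
import pandas as pd

# Same shape as the example above: one yearly period with one value.
df = pd.DataFrame(
    {"something": [2.5]},
    index=pd.period_range("2022", periods=1, freq="Y"),
)

# Monthly periods spanning the first through the last yearly period.
monthly_index = pd.period_range(
    df.index[0].asfreq("M", how="start"),
    df.index[-1].asfreq("M", how="end"),
    freq="M",
)

# Relabel each yearly row as its first month, then fill the new months forward.
monthly = df.copy()
monthly.index = monthly.index.asfreq("M", how="start")
monthly = monthly.reindex(monthly_index).ffill()
print(monthly)
```

Because `resample` is never called, this path does not go through the deprecated code and should not emit the FutureWarning.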

Comment From: MarcoGorelli

This comes from #55968 , and here's the relevant issue https://github.com/pandas-dev/pandas/issues/53481

I'd suggest creating a DatetimeIndex and using ffill

Comment From: matteo-zanoni

I just hit this too. For upsampling the PeriodIndex had a different result than the DatetimeIndex, in particular PeriodIndex would create a row for each new period in the old period:

import pandas as pd

s = pd.Series(1, index=pd.period_range(pd.Timestamp(2024, 1, 1), pd.Timestamp(2024, 1, 2), freq="d"))
s.resample("1h").ffill()

This would create a series including ALL hours of 2024-01-02.

If, instead, we first convert to DatetimeIndex:

import pandas as pd

s = pd.Series(1, index=pd.period_range(pd.Timestamp(2024, 1, 1), pd.Timestamp(2024, 1, 2), freq="d"))
s.index = s.index.to_timestamp()
s.resample("1h").ffill()

The output will only contain 1 hour of 2024-01-02. Even if the series is constructed directly with a DatetimeIndex (even one containing the frequency information) the result is the same:

import pandas as pd

s = pd.Series(1, index=pd.date_range(pd.Timestamp(2024, 1, 1), pd.Timestamp(2024, 1, 2), freq="d"))
s.resample("1h").ffill()

Will there be no way to obtain the old behaviour of PeriodIndex in future versions? IMO upsampling is quite common and the way PeriodIndex implemented it is more useful. It would be a shame to lose it.
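For reference, the difference can be reproduced with a DatetimeIndex only, and the old PeriodIndex result can be approximated by extending the target range to the end of the last period. This is a sketch, not an official replacement; `hourly_index`, `short` and `full` are names made up for the illustration.

```python
import pandas as pd

s = pd.Series(1, index=pd.period_range("2024-01-01", "2024-01-02", freq="D"))
ts = s.to_timestamp()

# Plain DatetimeIndex resampling stops at the last timestamp (2024-01-02 00:00).
short = ts.resample("1h").ffill()

# Extend the target range to the end of the last period so that
# every hour of 2024-01-02 is covered, as the PeriodIndex version did.
hourly_index = pd.date_range(s.index[0].start_time, s.index[-1].end_time, freq="h")
full = ts.reindex(hourly_index, method="ffill")

print(len(short), len(full))
```

Here `short` has 25 rows while `full` has 48, which is the gap between the two behaviours described above.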

Comment From: andreas-wolf

IMO upsampling is quite common and the way PeriodIndex implemented it is more useful. It would be a shame to lose it.

I agree. Now you'll need to reindex manually, whereas with PeriodIndex this was a one-liner. Furthermore, resampling with a DatetimeIndex seems to change the data type (a bug?). Here is some sample code:

import pandas as pd

# some sample data
data = {2023: 1, 2024: 2}
df = pd.DataFrame(list(data.values()), index=pd.PeriodIndex(data.keys(), freq="Y"))

# Old style resampling, just a one-liner
old_style_resampling = df.resample("M").ffill()
print(old_style_resampling)
print(type(old_style_resampling.iloc[0][0]))

# Convert index to DatetimeIndex
df.index = pd.to_datetime(df.index.start_time)
last_date = df.index[-1] + pd.offsets.YearEnd()
df_extended = df.reindex(
    df.index.union(pd.date_range(start=df.index[-1], end=last_date, freq="D"))
).ffill()
new_style_resampling = df_extended.resample("ME").ffill()
print(new_style_resampling)
print(type(new_style_resampling.iloc[0][0]))

I also vote for keeping PeriodIndex resampling.
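On the data type change: as far as I can tell it comes from the reindex step rather than from resample itself, since reindexing introduces NaN and that forces float64. A small sketch of the effect, with made-up names (`target`, `via_nan`, `direct`), and of how passing `method="ffill"` to `reindex` avoids the intermediate NaN:

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(["2023-01-01", "2024-01-01"]))
target = pd.date_range("2023-01-01", "2024-12-01", freq="MS")

# Reindex first, fill afterwards: the temporary NaN values upcast to float.
via_nan = s.reindex(target).ffill()

# Fill while reindexing: no NaN is ever created, so the integer dtype survives.
direct = s.reindex(target, method="ffill")

print(via_nan.dtype, direct.dtype)
```

Both results hold the same values; only the dtype differs.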

Comment From: ChadFulton

Related to my comments in #56588, I think that this is another example where Period is being deprecated too fast without a clear replacement in mind.

Comment From: jbrockmendel

Are all the relevant cases about upsampling and never downsampling? A big part of the motivation for deprecating was that PeriodIndexResampler._downsample is deeply broken in a way that didn't seem worth fixing. Potentially we could just deprecate downsampling and not upsampling?

Comment From: andreas-wolf

The upsampling example is just the one where it's very obvious what will be missing when periodindex resampling won't work any more.

If downsampling no longer worked, I would have to convert the index, downsample, and convert the index back again. That does not sound very compelling.

The period index resampling (up and down) is very convenient when one has to combine different data sources in days, months, quarters and years. I can't remember a project where I did not use period resampling. The convenience was always an argument to use pandas instead of other libraries like polars, where you have to handle all the conversions yourself.

From my point of view the PeriodIndex was always one of the great things about Pandas.

I have very limited experience with Pandas internals, so I don't understand how downsampling can be deeply broken so that it's not worth fixing when "just" converting to a datetime index would fix it? Can't the datetime indexing be used internally to fix it?

Comment From: MarcosYanase

I agree with @andreas-wolf. My projects have a lot of dataframes using PeriodIndex with different frequencies, and resampling is a very useful tool for calculation. Keeping it at least for upsampling would be excellent, but for downsampling a workaround (like the one Andreas posted) is not trivial. What are the bugs related to PeriodIndexResampler._downsample?

Comment From: MarcosYanase

Any news about this issue?

The deprecation was due to #53481, correct @jbrockmendel?

Some of this is because in PeriodIndexResampler._downsample (which many methods go through) we just return self.asfreq() in many cases which is very much wrong.

https://github.com/pandas-dev/pandas/blob/c375533d670a7114c36ebb114c01ec7d57b92753/pandas/core/resample.py#L1800C1-L1813C33

        if is_subperiod(ax.freq, self.freq):
            # Downsampling
            return self._groupby_and_aggregate(how, **kwargs)
        elif is_superperiod(ax.freq, self.freq):
            if how == "ohlc":
                # GH #13083
                # upsampling to subperiods is handled as an asfreq, which works
                # for pure aggregating/reducing methods
                # OHLC reduces along the time dimension, but creates multiple
                # values for each period -> handle by _groupby_and_aggregate()
                return self._groupby_and_aggregate(how)
            return self.asfreq()
        elif ax.freq == self.freq:
            return self.asfreq()

I'm naively changing the lines "return self.asfreq()" to "return super()._downsample(how, **kwargs)", delegating the responsibility to DatetimeIndexResampler._downsample. It works for the example in #53481 and for other tests that I'm doing (using other downsampling methods), giving outputs similar to DatetimeIndexResampler._downsample. But I'm sure it's not so simple. Could you please give more examples where PeriodIndexResampler._downsample continues to be broken?

Comment From: jbrockmendel

I don't have the bandwidth to give you a thorough answer. What I can tell you is that there are no plans to enforce this deprecation in 3.0.

Comment From: MarcosYanase

Is there an option to not deprecate resample with PeriodIndex? If yes, what is the process? Do we fix the issues with it first and then delete the FutureWarning, or do both simultaneously?

I think it's possible to fix the example in #53481 doing what I wrote above:

if is_subperiod(ax.freq, self.freq):
    # Downsampling
    return self._groupby_and_aggregate(how, **kwargs)
elif is_superperiod(ax.freq, self.freq):
    if how == "ohlc":
        # GH #13083
        # upsampling to subperiods is handled as an asfreq, which works
        # for pure aggregating/reducing methods
        # OHLC reduces along the time dimension, but creates multiple
        # values for each period -> handle by _groupby_and_aggregate()
        return self._groupby_and_aggregate(how)
    return super()._downsample(how, **kwargs) #fixed here, it was return self.asfreq()
elif ax.freq == self.freq:
    return super()._downsample(how, **kwargs) #fixed here, it was return self.asfreq()
raise IncompatibleFrequency(
    f"Frequency {ax.freq} cannot be resampled to {self.freq}, "
    "as they are not sub or super periods"
)

About https://github.com/pandas-dev/pandas/pull/58021#issuecomment-2027782313:

test_resample_nat_index_series (reason="Don't know why this fails"): https://github.com/pandas-dev/pandas/blob/e4956ab403846387a435cd7b3a8f36828c23c0c7/pandas/tests/resample/test_base.py#L229 -- First error: ValueError("for Period, please use 'M' instead of 'ME'") -- After switching to 'M' it still fails, but we could fix that by modifying https://github.com/pandas-dev/pandas/blob/e4956ab403846387a435cd7b3a8f36828c23c0c7/pandas/core/resample.py#L1643 to

            if isinstance(obj.index, PeriodIndex):
                obj.index = PeriodIndex(obj.index, freq=self.freq)
            else:
                obj.index = obj.index._with_freq(self.freq)

test_monthly_convention_span (reason="Commented out for more than 3 years. Should this work?") -- I think it (https://github.com/pandas-dev/pandas/blob/e4956ab403846387a435cd7b3a8f36828c23c0c7/pandas/tests/resample/test_period_index.py#L684C1-L696C49) should not work. Even after correcting it to something like:

def test_monthly_convention_span(self):
    rng = period_range("2000-01", periods=3, freq="M")
    ts = Series(np.arange(3), index=rng)

    # hacky way to get same thing
    exp_index = period_range("2000-01-01", "2000-03-31", freq="D")
    expected = ts.asfreq("D", how="start").reindex(exp_index)
    expected = expected.ffill()

    result = ts.resample("D").ffill()

    tm.assert_series_equal(result, expected)

we have

AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  int32
[right]: float64

which I think is expected (after reindex we have many NaN values, which changes the dtype to float64, and it stays float64 after ffill)

And, using the fix that I'm suggesting, we will have:

FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[mean] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[sem] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[median] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[var] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[std] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[ohlc] - AttributeError: 'DataFrame' object has no attribute 'dtype'. Did you mean: 'dtypes'?
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[quantile] - AssertionError: Attributes of Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[count] - AssertionError: Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[size] - AssertionError: Series are different
FAILED test_period_index.py::TestPeriodIndex::test_resample_same_freq[nunique] - AssertionError: Series are different

because https://github.com/pandas-dev/pandas/blob/e4956ab403846387a435cd7b3a8f36828c23c0c7/pandas/tests/resample/test_period_index.py#L287C1-L294C1:

    def test_resample_same_freq(self, resample_method):
        # GH12770
        series = Series(range(3), index=period_range(start="2000", periods=3, freq="M"))
        expected = series

        result = getattr(series.resample("M"), resample_method)()
        tm.assert_series_equal(result, expected)

is expecting the wrong behavior pointed out by #53481. I think it's ok to expect the same values using sum, mean, etc., but how could we expect the same values/attributes using methods like std, var, ohlc, nunique, size, count, etc.? I didn't find a similar test for datetime_index.
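To illustrate why the input cannot be the expected result for every method, here is a sketch of the same same-frequency setup with a DatetimeIndex. It is only meant to show what count and std give when each bin holds exactly one observation; `counts` and `spread` are illustrative names.

```python
import pandas as pd

# Same data as test_resample_same_freq, but indexed by month-start timestamps.
series = pd.Series(range(3), index=pd.date_range("2000-01-01", periods=3, freq="MS"))

# One observation per monthly bin, so count is 1 everywhere, not 0, 1, 2.
counts = series.resample("MS").count()

# The sample standard deviation of a single value is undefined, hence NaN.
spread = series.resample("MS").std()

print(counts)
print(spread)
```

Neither result equals the input series, which is what the PeriodIndex test currently asserts.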

Comment From: jorisvandenbossche

(we just discussed this at the dev meeting, but I already mostly wrote this before the meeting, so posting anyway to write down the reasoning)

I think we should remove this deprecation:

First, if we have no plans to enforce the deprecation (actually removing the feature), why bother people with the fear that it will go away, and the unclear situation it creates (can/should I still use it?)
If we want to make people aware of the issues (the bugs) with Period resampling, then the current warning also does not help, as it gives no details (and even this thread has not given much more details)
I am fairly convinced that for the majority of use cases that people use Period resampling for, it works perfectly fine. IIUC the broken behaviour is specifically in the case of downsampling in combination with a method where the result would not be equivalent to asfreq() (e.g. nunique() and size(), some aggregations like std(), but also many other methods work just fine I think?) But for example a typical operation for downsampling like forward/backward filling also works fine

Comment From: jorisvandenbossche

At the meeting we thought to remove the warning and instead explicitly raise a (not implemented) error for the cases that are broken right now (in the idea that it is better for the user to raise an error than silently return nonsense results, so this "breaking change" of starting to raise would be for the better)

Comment From: jorisvandenbossche

cc @rhshadrach as it relates to https://github.com/pandas-dev/pandas/pull/62398

Comment From: jbrockmendel

Looking at implementing what we discussed in the meeting, namely raising NotImplementedError only in the subset of cases that are actively broken:

    def _downsample(self, how, **kwargs):
        """
        Downsample the cython defined function.

        Parameters
        ----------
        how : string / cython mapped function
        **kwargs : kw args passed to how function
        """
        # we may need to actually resample as if we are timestamps
        if isinstance(self.ax, DatetimeIndex):
            return super()._downsample(how, **kwargs)

        ax = self.ax

        if is_subperiod(ax.freq, self.freq):
            # Downsampling
            return self._groupby_and_aggregate(how, **kwargs)
        elif is_superperiod(ax.freq, self.freq):
            if how == "ohlc":
                # GH #13083
                # upsampling to subperiods is handled as an asfreq, which works
                # for pure aggregating/reducing methods
                # OHLC reduces along the time dimension, but creates multiple
                # values for each period -> handle by _groupby_and_aggregate()
                return self._groupby_and_aggregate(how)
+            raise NotImplementedError(
+                f"Downsampling with method '{how}' onto a higher freq "
+                "is not supported with PeriodIndex"
+            )
-            return self.asfreq()
        elif ax.freq == self.freq:
+            raise NotImplementedError(
+                f"Downsampling with method '{how}' onto a matching freq "
+                "is not supported with PeriodIndex"
+            )
-            return self.asfreq()

This is mostly right, except for cases where self.asfreq() is correct: how in ["first", "last"] should be fine. min and max should work if the dtype is ordered. sum and mean should work for float dtypes. But I don't want to put a lot of effort into checking dtypes here. Is everyone OK with killing off all of the cases that raise in the diff, regardless of dtype or how?

Comment From: jbrockmendel

Looking at this more, I suspect that there is a #47227 bug lurking because freq means something different for PeriodIndex than for DatetimeIndex. Nearly all of our tests use period_range to make sample data, which obscures the distinction.

I'll make the change we discussed in the meeting, but I still think that basically every case which can do obj.to_timestamp().resample(...).method().to_period() should do it that way.
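A minimal sketch of that round trip for a downsampling case, assuming monthly periods aggregated to quarters (the data and the name `quarterly` are made up for illustration):

```python
import pandas as pd

s = pd.Series(range(6), index=pd.period_range("2000-01", periods=6, freq="M"))

quarterly = (
    s.to_timestamp()   # PeriodIndex -> DatetimeIndex at the start of each month
    .resample("QS")    # bin by quarter start
    .sum()             # any aggregation method goes here
    .to_period("Q")    # DatetimeIndex -> quarterly PeriodIndex
)
print(quarterly)
```

The result is indexed by quarterly periods again, so downstream code that expects a PeriodIndex keeps working.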

Comment From: jorisvandenbossche

Opened a PR to remove the deprecation warning in https://github.com/pandas-dev/pandas/pull/62480 (ended up essentially reverting the original PR adding it)