Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
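# One -1 followed by nineteen 1s: exactly one full window of length 20.
series_1 = pd.Series([-1] + ([1] * 19))
print(series_1.kurt())  # reported: 20.00000000000001
print(series_1.rolling(20).kurt().max())  # reported: 20.0
# Seven -1s followed by nineteen 1s.
series_2 = pd.Series(([-1] * 7) + ([1] * 19))
print(series_2.rolling(20).kurt().max())  # reported: 20.00000000000001
# Six -1s followed by nineteen 1s.
series_3 = pd.Series(([-1] * 6) + ([1] * 19))
print(series_3.rolling(20).kurt().max())  # reported: 20.0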
Issue Description
I ran into a problem when calculating rolling kurtosis for a specific kind of data.
For series_1 = pd.Series([-1] + ([1] * 19)), I checked the source code and expected the kurtosis to be 20.00000000000001 because of binary rounding error. This holds for series_1.kurt(), but the rolling version behaves oddly and returns exactly 20.0.
The numerical inconsistency also shows up with another series, series_2 = pd.Series(([-1] * 7) + ([1] * 19)). This time the max rolling kurtosis is 20.00000000000001, which is not equal to the max rolling kurtosis of series_1. However, series_3 gives 20.0.
You can create similar series to see the different behaviors. What is the rationale for this? Why does pandas sometimes give 20.0?
Expected Behavior
Expected all results to be 20.00000000000001.
Installed Versions
Comment From: auderson
The rolling algorithms in pandas generally use an online (incrementally updated) formulation for performance, so the result can be expected to differ slightly due to floating-point artifacts.
If you really want consistent results, you can try rolling.apply(lambda x: x.kurt()), but this is much slower.
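As a minimal sketch of that workaround, the snippet below compares the online rolling kurtosis with a per-window recomputation on series_1 from the report. The variable names `fast` and `slow` are illustrative only, and the series is created as float64 up front so that both paths see the same values.

```python
import pandas as pd

series_1 = pd.Series([-1] + ([1] * 19), dtype="float64")

# Online rolling kurtosis: fast, but its rounding may differ from Series.kurt().
fast = series_1.rolling(20).kurt()

# Recompute the kurtosis from scratch for every window. Each window is passed
# to the lambda as a Series (raw=False is the default), so it goes through the
# same code path as Series.kurt(). This is much slower on long series.
slow = series_1.rolling(20).apply(lambda window: window.kurt())

# repr() shows the last bits that a plain print might hide.
print(repr(fast.iloc[-1]), repr(slow.iloc[-1]), repr(series_1.kurt()))
```

The first 19 entries of both results are NaN because the default min_periods equals the window size.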
Comment From: HaloCollider
> The rolling algorithms in pandas generally use an online (incrementally updated) formulation for performance, so the result can be expected to differ slightly due to floating-point artifacts. If you really want consistent results, you can try rolling.apply(lambda x: x.kurt()), but this is much slower.
I understand that online updating may cause numerical instability along a time series. But whether the result is 20.0 or slightly larger than 20.0 is a property of the series as a whole. In other words, you never get both 20.0 and a value larger than 20.0 within a single series.
For example:
pd.Series([1] * 19 + [-1] * 7 + [1] * 1).rolling(20).kurt().max() gives 20.00000000000001,
while
pd.Series([1] * 19 + [-1] * 7 + [1] * 2).rolling(20).kurt().max() gives 20.0.
The only difference is one additional 1 at the tail, which should not affect the max kurtosis from the point of view of online updating.
(That's also why I'm calling it inconsistency rather than instability.)
Comment From: auderson
Looks like it's due to a demean operation prior to calculation:
https://github.com/pandas-dev/pandas/blob/283a2dcb2f91db3452a9d2ee299632a109b224f4/pandas/_libs/window/aggregations.pyx#L828-L838
Comment From: HaloCollider
> Looks like it's due to a demean operation prior to calculation:
> https://github.com/pandas-dev/pandas/blob/283a2dcb2f91db3452a9d2ee299632a109b224f4/pandas/_libs/window/aggregations.pyx#L828-L838
Thanks a lot. This solves my issue. Previously I checked the source but missed the demean operation, so my own version produced consistent results, which caused the confusion.
I found that the exact thresholds for the proportion of 1s in a series are 0.25 and 0.75, i.e., a mean of -0.5 and 0.5. Distributions outside the range (0.25 to 0.75) lead to 20.0.
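The effect of the demean step can be sketched in plain Python. This is an illustration under stated assumptions, not the pandas source: it assumes the step subtracts the series-wide mean rounded to the nearest integer (which is what thresholds at a mean of -0.5 and 0.5 would suggest), it uses the standard sample excess kurtosis written in terms of raw power sums, and it recomputes those sums per window instead of updating them online. `c_round`, `kurt_from_power_sums`, and `max_rolling_kurt` are hypothetical helper names.

```python
import math

def c_round(value):
    """Round half away from zero, like C's round().

    Python's built-in round() rounds halves to even, so round(0.5) == 0.
    """
    return math.copysign(math.floor(abs(value) + 0.5), value)

def kurt_from_power_sums(window):
    """Sample excess kurtosis of one window, from raw power sums."""
    n = float(len(window))
    s1 = sum(window)
    s2 = sum(v ** 2 for v in window)
    s3 = sum(v ** 3 for v in window)
    s4 = sum(v ** 4 for v in window)
    a = s1 / n                                   # mean
    b = s2 / n - a * a                           # 2nd central moment
    if b <= 0.0:
        return math.nan                          # constant window: undefined
    c = s3 / n - a ** 3 - 3 * a * b              # 3rd central moment
    d = s4 / n - a ** 4 - 6 * b * a * a - 4 * c * a  # 4th central moment
    k = (n * n - 1.0) * d / (b * b) - 3.0 * (n - 1.0) ** 2
    return k / ((n - 2.0) * (n - 3.0))

def max_rolling_kurt(values, window=20):
    """Return (shift, max kurtosis) after the assumed demean step."""
    # Assumption: one shift for the whole series, not one per window.
    shift = c_round(sum(values) / len(values))
    shifted = [float(v) - shift for v in values]
    kurts = [
        kurt_from_power_sums(shifted[end - window:end])
        for end in range(window, len(shifted) + 1)
    ]
    return shift, max(k for k in kurts if not math.isnan(k))

examples = {
    "series_1": [-1] + [1] * 19,
    "series_2": [-1] * 7 + [1] * 19,
    "series_3": [-1] * 6 + [1] * 19,
    "tail_1": [1] * 19 + [-1] * 7 + [1] * 1,
    "tail_2": [1] * 19 + [-1] * 7 + [1] * 2,
}
for name, values in examples.items():
    shift, max_kurt = max_rolling_kurt(values)
    print(f"{name}: shift={shift:+.0f}, max kurt={max_kurt!r}")
```

Under these assumptions the shift is 0 for series_2 and tail_1 and 1 for the other three, which matches the grouping of the 20.00000000000001 and 20.0 results reported above. The window that attains the maximum is the same in every case (one -1 and nineteen 1s), but after a shift of 1 it is stored as one -2 and nineteen 0s, so the intermediate sums round differently even though the kurtosis is mathematically identical. The last printed digits may not match pandas exactly.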