Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
series_1 = pd.Series([-1] + ([1] * 19))
print(series_1.kurt())
print(series_1.rolling(20).kurt().max())
series_2 = pd.Series(([-1] * 7) + ([1] * 19))
print(series_2.rolling(20).kurt().max())
series_3 = pd.Series(([-1] * 6) + ([1] * 19))
print(series_3.rolling(20).kurt().max())
Issue Description
I ran into a problem when calculating rolling kurtosis for a specific kind of data.
For series_1 = pd.Series([-1] + ([1] * 19)), I checked the source code and expected its kurtosis to be 20.00000000000001 because of binary rounding error. While this holds for series_1.kurt(), the rolling version behaves oddly and returns exactly 20.0.
The numerical inconsistency also appears when I create another series, series_2 = pd.Series(([-1] * 7) + ([1] * 19)). This time the max rolling kurtosis is 20.00000000000001, which is not equal to the max rolling kurtosis of series_1. However, series_3 gives 20.0.
You can create similar series to see the different behaviors. What is the rationale behind this? Why does pandas sometimes give 20.0?
Expected Behavior
Expected all results to be 20.00000000000001.
Installed Versions
Comment From: auderson
The rolling algos in pandas generally use an online-updating version for performance. It is expected that the result can differ slightly due to floating-point artifacts.
If you really want consistent results, you can try rolling.apply(lambda x: x.kurt()), but this is much slower.
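To illustrate what "online updating" means here, the following is a minimal pure-Python sketch, not pandas' actual implementation. It keeps running power sums (the quantities a moment-based kurtosis is built from), adds the value entering the window, removes the value leaving it, and compares the result with recomputing every window from scratch. The function names and the sample data are made up for illustration.

```python
def window_power_sums(window):
    """Recompute sum(x), sum(x^2), sum(x^3), sum(x^4) from scratch."""
    return [sum(v ** k for v in window) for k in (1, 2, 3, 4)]


def rolling_power_sums(values, size):
    """Online version: update the four sums as the window slides."""
    sums = [0.0, 0.0, 0.0, 0.0]
    out = []
    for i, v in enumerate(values):
        for k in range(4):
            sums[k] += v ** (k + 1)        # value entering the window
        if i >= size:
            old = values[i - size]
            for k in range(4):
                sums[k] -= old ** (k + 1)  # value leaving the window
        if i >= size - 1:
            out.append(list(sums))         # snapshot for a full window
    return out


# Arbitrary non-integer sample data (an assumption, not from the report).
values = [0.1 * (i % 7) - 0.3 for i in range(50)]
online = rolling_power_sums(values, 20)
exact = [window_power_sums(values[i:i + 20]) for i in range(len(values) - 19)]

# The two approaches agree closely, but not necessarily bit for bit.
max_gap = max(abs(a - b) for o, e in zip(online, exact) for a, b in zip(o, e))
print(max_gap)
```

The online version does constant work per step, while recomputation costs time proportional to the window size, which is the performance trade-off mentioned above.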
Comment From: HaloCollider
I understand that the online updating method may cause numerical instability over the course of a time series. But whether the result is 20.0 or larger than 20.0 is an overall characteristic of a series. In other words, you cannot get both a 20.0 and a value larger than 20.0 within a single series.
For example:
pd.Series([1] * 19 + [-1] * 7 + [1] * 1).rolling(20).kurt().max()
gives 20.00000000000001, while
pd.Series([1] * 19 + [-1] * 7 + [1] * 2).rolling(20).kurt().max()
gives 20.0.
The only difference is an additional 1 at the tail, which doesn't affect the max kurtosis from the viewpoint of online updating.
(That's also why I'm calling it an inconsistency rather than an instability.)
Comment From: auderson
Looks like it's due to a demean operation prior to calculation:
https://github.com/pandas-dev/pandas/blob/283a2dcb2f91db3452a9d2ee299632a109b224f4/pandas/_libs/window/aggregations.pyx#L828-L838
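As a rough illustration of why subtracting a constant before the calculation can matter: kurtosis is shift-invariant in exact arithmetic, but the intermediate power sums are different numbers, so the floating-point rounding can differ in the last bits. The sketch below uses a generic moment-based formula for sample excess kurtosis. It is an assumption for illustration and is not copied from the linked pandas source; the helper name is hypothetical.

```python
def kurt_from_power_sums(window):
    """Sample excess kurtosis computed from raw power sums (illustrative)."""
    n = float(len(window))
    x, xx, xxx, xxxx = (sum(v ** k for v in window) for k in (1, 2, 3, 4))
    a = x / n                                   # mean
    b = xx / n - a * a                          # second central moment
    c = xxx / n - a ** 3 - 3 * a * b            # third central moment
    d = xxxx / n - a ** 4 - 6 * b * a * a - 4 * c * a  # fourth central moment
    k = (n * n - 1.0) * d / (b * b) - 3.0 * (n - 1.0) ** 2
    return k / ((n - 2.0) * (n - 3.0))


window = [-1.0] + [1.0] * 19           # same values as series_1
shifted = [v - 1.0 for v in window]    # the same window after subtracting 1

raw_kurt = kurt_from_power_sums(window)
shifted_kurt = kurt_from_power_sums(shifted)

# Both are 20 up to rounding; repr shows whether the last bits differ.
print(repr(raw_kurt), repr(shifted_kurt))
```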
Comment From: HaloCollider
Thanks a lot. This solves my issue. I had checked the source previously but missed the demean operation, so my own version produced consistent results, which caused the confusion.
I found that the exact thresholds for the proportion of 1s in a series are 0.25 and 0.75, i.e., a mean of -0.5 and 0.5. Distributions outside the range 0.25 to 0.75 lead to 20.0.
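The thresholds can be checked with a small sketch. It assumes, consistent with the thresholds reported here, that the demean step subtracts the whole-series mean rounded to the nearest integer, with halves rounded away from zero as C's round() does. That assumption and the helper names are illustrative and not taken verbatim from the pandas source. For a series of -1 and 1 with a proportion p of 1s, the mean is 2p - 1, so p = 0.25 and p = 0.75 correspond to means of -0.5 and 0.5.

```python
import math


def c_style_round(x):
    """Round half away from zero, like C's round().

    Python's built-in round() rounds halves to even, so it differs at .5.
    """
    return math.copysign(math.floor(abs(x) + 0.5), x)


def demean_shift(values):
    """Hypothetical helper: the shift a rounded whole-series mean would give."""
    return c_style_round(sum(values) / len(values))


series = {
    "series_1": [-1] + [1] * 19,      # mean 0.9
    "series_2": [-1] * 7 + [1] * 19,  # mean 12/26, about 0.46
    "series_3": [-1] * 6 + [1] * 19,  # mean 13/25 = 0.52
}
shifts = {name: demean_shift(vals) for name, vals in series.items()}
print(shifts)
```

Under this assumption, series_1 and series_3 are shifted by 1 and series_2 is not shifted, which lines up with the 20.0 and 20.00000000000001 results reported for them above.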