Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import polars as pl
import pandas as pd
import numpy as np
import scipy.stats as st
data = np.array([-2.05191341e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -4.10391103e-05, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00])
print(pl.Series(data).kurtosis())
print(pd.Series(data).kurt())
print(st.kurtosis(data))
Issue Description
The output of the pandas kurtosis function is incorrect.
After some simple debugging I found a comment in core/nanops.py, line 1360, in the function nankurt, saying that, to fix #18044, it manually zeros out values smaller than 1e-14, which is improper in any case.
This affects any data that has very little variance, even when there are many data points.
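To make the effect of the cutoff concrete, here is a minimal sketch that evaluates the two terms of the standard unbiased excess kurtosis formula for the sample above, using only the standard library. The variable names and the layout of the computation are assumptions for illustration and are not copied from nankurt; only the 1e-14 cutoff is taken from the report.

```python
# The 29-value sample from the report: two tiny negatives, the rest zeros.
data = [0.0] * 29
data[0] = -2.05191341e-05
data[5] = -4.10391103e-05

count = len(data)
mean = sum(data) / count
m2 = sum((x - mean) ** 2 for x in data)  # sum of squared deviations
m4 = sum((x - mean) ** 4 for x in data)  # sum of fourth-power deviations

# Terms of the unbiased (sample) excess kurtosis formula.
numerator = count * (count + 1) * (count - 1) * m4
denominator = (count - 2) * (count - 3) * m2 ** 2
adj = 3 * (count - 1) ** 2 / ((count - 2) * (count - 3))

THRESHOLD = 1e-14  # cutoff quoted in the report

print(f"numerator   = {numerator:.3e}")
print(f"denominator = {denominator:.3e}")
print("numerator below cutoff:  ", abs(numerator) < THRESHOLD)
print("denominator below cutoff:", abs(denominator) < THRESHOLD)

# Without any cutoff, both estimators are well defined for this sample.
unbiased = numerator / denominator - adj
biased = count * m4 / m2 ** 2 - 3
print("unbiased excess kurtosis:", unbiased)
print("biased excess kurtosis:  ", biased)
```

In this sketch the denominator falls below the cutoff while the numerator does not, which would be consistent with the 0.0 reported above if a zeroed denominator is treated as a degenerate case.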
Expected Behavior
Output of provided example:
14.916104870028523
0.0
14.916104870028551
Expected output: roughly 14.9161 for unbiased (pandas's default behaviour) is correct.
Installed Versions
Comment From: dontgoto
Good point. I can reproduce this with your example. Scaling it up to larger input distributions alleviates the issue, though.
Your example is a sweet spot for this error. When the distribution is rescaled to be larger, the zeroing out stops happening very quickly, because the O(count^2) and O(count^3) terms in the numerator and denominator equations lift the very small m4 and m2^2 above the 1e-14 threshold.
Doing a check of the form (pseudocode, where frexp(x)[1] is the binary exponent of x)
count < 100 and abs(frexp(denominator)[1] - frexp(numerator)[1]) < 24
before the zeroing out should alleviate this issue, but I would like to hear someone else's opinion before putting in a PR.
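A rough sketch of how such a check could wrap the zeroing step, assuming the check is meant to skip the cutoff when the sample is small and the two terms have similar binary exponents. The helper names skip_zeroing and zero_out_unless_guarded are hypothetical, and the sample magnitudes are illustrative; this is not the pandas implementation.

```python
import math

THRESHOLD = 1e-14  # cutoff quoted in the issue


def skip_zeroing(count, numerator, denominator, max_count=100, max_exp_gap=24):
    """Hypothetical guard: True when the cutoff should NOT be applied."""
    # math.frexp returns (mantissa, exponent); only the exponent is compared.
    _, num_exp = math.frexp(numerator)
    _, den_exp = math.frexp(denominator)
    return count < max_count and abs(den_exp - num_exp) < max_exp_gap


def zero_out_unless_guarded(count, numerator, denominator):
    """Apply the 1e-14 cutoff only when the guard does not fire."""
    if skip_zeroing(count, numerator, denominator):
        return numerator, denominator
    numerator = 0.0 if abs(numerator) < THRESHOLD else numerator
    denominator = 0.0 if abs(denominator) < THRESHOLD else denominator
    return numerator, denominator


# Illustrative magnitudes: a numerator just above the cutoff and a
# denominator just below it.
num, den = 5.87e-14, 2.74e-15
guarded = zero_out_unless_guarded(29, num, den)      # small sample, guard fires
unguarded = zero_out_unless_guarded(1000, num, den)  # large sample, cutoff applies
print(guarded, unguarded)
```

The limits 100 and 24 are taken from the pseudocode above; whether they are the right limits is exactly the open question in this comment.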
Comment From: dontgoto
Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.
I was not able to iron out that instability, though.
Comment From: j7168908jx
> Another note: the kurtosis formulation then still deviates from the scipy implementation by 3, up to a distribution size of about 10x your example, using the same shape as your example.
> I was not able to iron out that instability, though.
Do you mean that the difference of their output is roughly 3? If you have not set bias=False in scipy or polars, the difference here will be roughly 3.
Comment From: dontgoto
> Do you mean that the difference of their output is roughly 3?
Exactly.
> If you have not set bias=False in scipy or polars, the difference here will be roughly 3.
I did not, so that's also explained. I see no remaining issues with my solution then.
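For reference, the gap of roughly 3 can be worked out from the standard relation between the biased and unbiased excess kurtosis estimators. This is a sketch that assumes both libraries use these textbook definitions; the symbols g_2, G_2 and m_k are notation introduced here, with m_k the sum of k-th powers of deviations from the mean.

```latex
g_2 = \frac{n\, m_4}{m_2^{2}} - 3,
\qquad
G_2 = \frac{n-1}{(n-2)(n-3)} \bigl[ (n+1)\, g_2 + 6 \bigr]

% For the 29-value example, with g_2 \approx 14.9161:
G_2 \approx \frac{28}{27 \cdot 26} \bigl( 30 \cdot 14.9161 + 6 \bigr) \approx 18.09,
\qquad
G_2 - g_2 \approx 3.17
```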
Comment From: kaixiongg
Why not apply Welford's method for skew and kurt?
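For context, a minimal sketch of what a Welford-style one-pass update of the central moments could look like, using the widely published single-value update formulas for the second to fourth moments. The function name streaming_excess_kurtosis is hypothetical, it returns the biased estimator, and nothing in this thread establishes that this approach would remove the need for the 1e-14 cutoff.

```python
import math


def streaming_excess_kurtosis(values):
    """One-pass biased excess kurtosis via incremental central moments."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in values:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update the higher moments first: they use the old m2 and m3.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    if m2 == 0.0:
        return math.nan  # constant or empty input: kurtosis is undefined
    return n * m4 / (m2 * m2) - 3.0


# The 29-value sample from the report.
data = [0.0] * 29
data[0] = -2.05191341e-05
data[5] = -4.10391103e-05

one_pass = streaming_excess_kurtosis(data)

# Two-pass reference for comparison.
mean = sum(data) / len(data)
s2 = sum((x - mean) ** 2 for x in data)
s4 = sum((x - mean) ** 4 for x in data)
two_pass = len(data) * s4 / s2 ** 2 - 3.0

print(one_pass, two_pass)
```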