Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

timestamps = [pd.Timestamp("2020-01-01") + i * pd.Timedelta(minutes=1) for i in range(200)]
dfs = []
for i in range(10):
    df_i = pd.DataFrame.from_dict(
        {
            "Timestamp": timestamps,
            "groupby_value": [i * 0.01] * len(timestamps),
            "value": [i + 2] * len(timestamps),
        }
    )
    dfs += [df_i]
df = pd.concat(dfs)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df_grouped = df.set_index("Timestamp").groupby("groupby_value").resample("1H").mean()
print(sorted(list(set(df["groupby_value"]))), "\n", sorted(list(set(df_grouped["groupby_value"]))))
""" Output:
[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]
[0.0, 0.01, 0.02, 0.029999999999999995, 0.03, 0.04, 0.05, 0.05999999999999999, 0.06, 0.07, 0.08, 0.09] """
print(df_grouped.reset_index(drop=True).groupby("groupby_value").count())
""" Output:
value
groupby_value
0.00 4
0.01 4
0.02 4
0.03 3
0.03 1 # <- another 0.03?
0.04 4
0.05 4
0.06 3
0.06 1 # <- another 0.06?
0.07 4
0.08 4
0.09 4
"""
Issue Description
After a groupby followed by resample, when the groupby column holds float values, new values appear in that column that are close to, but not equal to, some of the existing values. For instance, in the example above we get both 0.029999999999999995 and 0.03 instead of just 0.03. This also leads to incorrect aggregations (see the last output in the example).
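A quick way to see the spurious keys, as a minimal sketch reusing the `df` and `df_grouped` variables built in the reproducible example above:

# Keys present after groupby/resample but absent from the original column.
spurious = set(df_grouped["groupby_value"]) - set(df["groupby_value"])
print(spurious)  # e.g. {0.029999999999999995, 0.05999999999999999}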
Expected Behavior
Expected: the original float values used for grouping should be preserved.
Installed Versions
INSTALLED VERSIONS
commit : 3fa869ef9090281b7c6b595355a6795cc366876f
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0.dev0+1265.g3fa869ef90
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
Comment From: dicristina
The result is a consequence of floating point arithmetic. In the resample step, new values appear in the groupby_value column because you take the mean of a floating point value repeated a different number of times: repeated 60 times for each of the first three hours and repeated 20 times for the remaining 20 minutes. These results will not necessarily be the same.
Consider the following pure numpy example:
import numpy as np
val = 0.03
np.full(20, val).mean() == np.full(60, val).mean()
In my system the arrays are of type float64 and the result is False.
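Printing the two means with more digits makes the difference visible. This is just a sketch assuming IEEE-754 float64; the exact trailing digits (and whether the comparison is False) can vary by system:

import numpy as np

val = 0.03
m20 = np.full(20, val).mean()
m60 = np.full(60, val).mean()
# Printing 20 decimal digits exposes any discrepancy in the last bits.
print(f"{m20:.20f}")
print(f"{m60:.20f}")
print(m20 == m60)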
Going back to the original example, here we can see that the new values arise from resample groups of different sizes:
>>> pd.set_option("display.precision", 20)
>>> df.set_index("Timestamp").groupby("groupby_value").resample("H").agg(mean=("groupby_value", "mean"), count=("groupby_value", "count"))
mean count
groupby_value Timestamp
0.00000000000000000000 2020-01-01 00:00:00 0.00000000000000000000 60
2020-01-01 01:00:00 0.00000000000000000000 60
2020-01-01 02:00:00 0.00000000000000000000 60
2020-01-01 03:00:00 0.00000000000000000000 20
0.01000000000000000021 2020-01-01 00:00:00 0.01000000000000000021 60
2020-01-01 01:00:00 0.01000000000000000021 60
2020-01-01 02:00:00 0.01000000000000000021 60
2020-01-01 03:00:00 0.01000000000000000021 20
0.02000000000000000042 2020-01-01 00:00:00 0.02000000000000000042 60
2020-01-01 01:00:00 0.02000000000000000042 60
2020-01-01 02:00:00 0.02000000000000000042 60
2020-01-01 03:00:00 0.02000000000000000042 20
0.02999999999999999889 2020-01-01 00:00:00 0.02999999999999999542 60
2020-01-01 01:00:00 0.02999999999999999542 60
2020-01-01 02:00:00 0.02999999999999999542 60
2020-01-01 03:00:00 0.02999999999999999889 20
0.04000000000000000083 2020-01-01 00:00:00 0.04000000000000000083 60
2020-01-01 01:00:00 0.04000000000000000083 60
2020-01-01 02:00:00 0.04000000000000000083 60
2020-01-01 03:00:00 0.04000000000000000083 20
0.05000000000000000278 2020-01-01 00:00:00 0.05000000000000000278 60
2020-01-01 01:00:00 0.05000000000000000278 60
2020-01-01 02:00:00 0.05000000000000000278 60
2020-01-01 03:00:00 0.05000000000000000278 20
0.05999999999999999778 2020-01-01 00:00:00 0.05999999999999999084 60
2020-01-01 01:00:00 0.05999999999999999084 60
2020-01-01 02:00:00 0.05999999999999999084 60
2020-01-01 03:00:00 0.05999999999999999778 20
0.07000000000000000666 2020-01-01 00:00:00 0.07000000000000000666 60
2020-01-01 01:00:00 0.07000000000000000666 60
2020-01-01 02:00:00 0.07000000000000000666 60
2020-01-01 03:00:00 0.07000000000000000666 20
0.08000000000000000167 2020-01-01 00:00:00 0.08000000000000000167 60
2020-01-01 01:00:00 0.08000000000000000167 60
2020-01-01 02:00:00 0.08000000000000000167 60
2020-01-01 03:00:00 0.08000000000000000167 20
0.08999999999999999667 2020-01-01 00:00:00 0.08999999999999999667 60
2020-01-01 01:00:00 0.08999999999999999667 60
2020-01-01 02:00:00 0.08999999999999999667 60
2020-01-01 03:00:00 0.08999999999999999667 20
Comment From: takarzyna
Thank you for pointing out the root cause; that makes sense. What would be the suggested way to deal with this, then? And will it become a part of Pandas or at least be mentioned in documentation? I am asking because Pandas is used by data scientists who are not necessarily computer scientists expecting to deal with floating point arithmetic, though I do understand that the decision about what to do with it belongs to the Pandas contributors.
Comment From: rhshadrach
What would be the suggested way to deal with this, then?
That is very much dependent on the purpose of the computation. For most computations, in my experience, it is okay to ignore. Other times, rounding prior to output is sufficient. If attempting to count distinct values, then binning may be appropriate (e.g. https://pandas.pydata.org/docs/reference/api/pandas.cut.html).
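For illustration, a minimal sketch of both suggestions applied to the frames from the reproducible example; `df` and `df_grouped` are assumed to be the objects built above, and the bin count is an arbitrary choice:

import pandas as pd

# 1) Rounding prior to output: collapse near-equal keys back to two decimals.
counts = (
    df_grouped.reset_index(drop=True)
    .assign(groupby_value=lambda d: d["groupby_value"].round(2))
    .groupby("groupby_value")["value"]
    .count()
)
print(counts)  # each original key now appears exactly once

# 2) Binning with pd.cut: group by intervals instead of exact float equality.
binned = df.groupby(pd.cut(df["groupby_value"], bins=10), observed=True)["value"].count()
print(binned)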
And will it become a part of Pandas or at least be mentioned in documentation?
I wouldn't be opposed to a section in the Basic Functionality page of the User Guide. It would be great to link to an authoritative source if possible.