Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

timestamps = [pd.Timestamp("2020-01-01") + i * pd.Timedelta(minutes=1) for i in range(200)]
dfs = []
for i in range(10):
    df_i = pd.DataFrame.from_dict(
        {
            "Timestamp": timestamps, 
            "groupby_value": [i * 0.01] * len(timestamps),
            "value": [i + 2] * len(timestamps)
        }
    )
    dfs += [df_i]
df = pd.concat(dfs)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

df_grouped = df.set_index("Timestamp").groupby("groupby_value").resample("1H").mean()

print(sorted(list(set(df["groupby_value"]))), "\n", sorted(list(set(df_grouped["groupby_value"]))))
""" Output:
[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09] 
 [0.0, 0.01, 0.02, 0.029999999999999995, 0.03, 0.04, 0.05, 0.05999999999999999, 0.06, 0.07, 0.08, 0.09] """
print(df_grouped.reset_index(drop=True).groupby("groupby_value").count())
""" Output:
               value
groupby_value       
0.00               4
0.01               4
0.02               4
0.03               3
0.03               1  # <- another 0.03?
0.04               4
0.05               4
0.06               3
0.06               1  # <- another 0.06?
0.07               4
0.08               4
0.09               4
"""

Issue Description

After groupby and resample, when the groupby column holds float values, new values appear in that column that are close to, but not equal to, some of the existing values. For instance, in the example above we get both 0.029999999999999995 and 0.03 instead of just 0.03. This leads to incorrect aggregations as well (see the last output in the example).
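
Note that the count table above appears to contain two 0.03 rows because the two distinct floats round to the same string at pandas' default display precision. A minimal illustration:

import pandas as pd

x, y = 0.03, 0.029999999999999995
print(x == y)  # False: two distinct float64 values
# With the default display precision (6 digits), both index labels
# render as "0.03", producing the apparent duplicate rows:
print(pd.Series([3, 1], index=[x, y]))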

Expected Behavior

Expected: the original float values used for grouping should be preserved unchanged.

Installed Versions

INSTALLED VERSIONS

commit           : 3fa869ef9090281b7c6b595355a6795cc366876f
python           : 3.10.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.2.0
Version          : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.0.dev0+1265.g3fa869ef90
numpy            : 1.24.1
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : None
qtpy             : None
pyqt5            : None

Comment From: dicristina

The result is a consequence of floating point arithmetic. In the resample step, new values appear in the groupby_value column because you take the mean of a floating point value repeated a different number of times: 60 times for each of the first three hours and 20 times for the remaining 20 minutes. These results will not necessarily be equal.


Consider the following pure numpy example:

import numpy as np
val = 0.03
np.full(20, val).mean() == np.full(60, val).mean()

On my system the arrays are of type float64 and the result is False.
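
The underlying reason is that mean() sums the terms and then divides, and the rounding error accumulated in the sum depends on the number of terms. A short sketch (exact results may vary with platform and summation order):

import numpy as np

val = 0.03  # stored as the nearest binary float64, not exactly 3/100
short = np.full(20, val).mean()  # equals val here
long = np.full(60, val).mean()   # a value near, but not equal to, val
print(short == long)  # False on a typical float64 system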


Going back to the original example, here we can see that the new values arise from resample groups of different sizes:

>>> pd.set_option("display.precision", 20)

>>> df.set_index("Timestamp").groupby("groupby_value").resample("H").agg(mean=("groupby_value", "mean"), count=("groupby_value", "count"))
                                                              mean  count
groupby_value          Timestamp                                         
0.00000000000000000000 2020-01-01 00:00:00  0.00000000000000000000     60
                       2020-01-01 01:00:00  0.00000000000000000000     60
                       2020-01-01 02:00:00  0.00000000000000000000     60
                       2020-01-01 03:00:00  0.00000000000000000000     20
0.01000000000000000021 2020-01-01 00:00:00  0.01000000000000000021     60
                       2020-01-01 01:00:00  0.01000000000000000021     60
                       2020-01-01 02:00:00  0.01000000000000000021     60
                       2020-01-01 03:00:00  0.01000000000000000021     20
0.02000000000000000042 2020-01-01 00:00:00  0.02000000000000000042     60
                       2020-01-01 01:00:00  0.02000000000000000042     60
                       2020-01-01 02:00:00  0.02000000000000000042     60
                       2020-01-01 03:00:00  0.02000000000000000042     20
0.02999999999999999889 2020-01-01 00:00:00  0.02999999999999999542     60
                       2020-01-01 01:00:00  0.02999999999999999542     60
                       2020-01-01 02:00:00  0.02999999999999999542     60
                       2020-01-01 03:00:00  0.02999999999999999889     20
0.04000000000000000083 2020-01-01 00:00:00  0.04000000000000000083     60
                       2020-01-01 01:00:00  0.04000000000000000083     60
                       2020-01-01 02:00:00  0.04000000000000000083     60
                       2020-01-01 03:00:00  0.04000000000000000083     20
0.05000000000000000278 2020-01-01 00:00:00  0.05000000000000000278     60
                       2020-01-01 01:00:00  0.05000000000000000278     60
                       2020-01-01 02:00:00  0.05000000000000000278     60
                       2020-01-01 03:00:00  0.05000000000000000278     20
0.05999999999999999778 2020-01-01 00:00:00  0.05999999999999999084     60
                       2020-01-01 01:00:00  0.05999999999999999084     60
                       2020-01-01 02:00:00  0.05999999999999999084     60
                       2020-01-01 03:00:00  0.05999999999999999778     20
0.07000000000000000666 2020-01-01 00:00:00  0.07000000000000000666     60
                       2020-01-01 01:00:00  0.07000000000000000666     60
                       2020-01-01 02:00:00  0.07000000000000000666     60
                       2020-01-01 03:00:00  0.07000000000000000666     20
0.08000000000000000167 2020-01-01 00:00:00  0.08000000000000000167     60
                       2020-01-01 01:00:00  0.08000000000000000167     60
                       2020-01-01 02:00:00  0.08000000000000000167     60
                       2020-01-01 03:00:00  0.08000000000000000167     20
0.08999999999999999667 2020-01-01 00:00:00  0.08999999999999999667     60
                       2020-01-01 01:00:00  0.08999999999999999667     60
                       2020-01-01 02:00:00  0.08999999999999999667     60
                       2020-01-01 03:00:00  0.08999999999999999667     20

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.10.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.0-25-generic
Version          : #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.2.1
dateutil         : 2.8.2
setuptools       : 59.6.0
pip              : 22.0.2
Cython           : None
pytest           : 7.1.3
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.5.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.6.2
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Comment From: takarzyna

Thank you for pointing out the root cause; that makes sense. What would be the suggested way to deal with this, then? And will it become part of pandas, or at least be mentioned in the documentation? I am asking because pandas is used by data scientists who are not necessarily computer scientists expecting to deal with floating point arithmetic, though I understand that the decision about what to do with it belongs to the pandas contributors.

Comment From: rhshadrach

What would be the suggested way to deal with this, then?

That is very much dependent on the purpose of the computation. For most computations, in my experience, it is okay to ignore. Other times, rounding prior to output is sufficient. If you are attempting to count distinct values, then binning may be appropriate (e.g. with pandas.cut: https://pandas.pydata.org/docs/reference/api/pandas.cut.html). See the sketch below.
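
For concreteness, a minimal sketch of both workarounds; the near-duplicate values below are illustrative, mirroring those in the original report:

import pandas as pd

# Near-duplicates of the kind produced by the groupby/resample above:
s = pd.Series([0.03, 0.029999999999999995, 0.06, 0.05999999999999999])

# Option 1: round prior to output or further grouping.
print(s.round(2).value_counts())  # 0.03 and 0.06 each counted twice

# Option 2: bin explicitly when counting distinct values.
print(pd.cut(s, bins=[0.0, 0.045, 0.09]).value_counts())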

And will it become a part of Pandas or at least be mentioned in documentation?

I wouldn't be opposed to a section in the Basic Functionality page of the User Guide. It would be great to link to an authoritative source if possible.