Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"a": ["a", "b", float("nan")]})

df["a"] = df["a"].astype("category")

df = df.fillna(float("nan")).replace([float("nan")], [None])

print(df["a"].loc[2])

> nan

Issue Description

As of pandas 2.0.0, pandas.DataFrame.replace silently fails to replace math.nan with None in columns of categorical dtype.

Expected Behavior

Either (1) nan should be replaced with None, or (2) an error should be raised.

Installed Versions

/Users/corynezin/.pyenv/versions/wearit/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.3.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : 2023.2.0
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: phofl

cc @lukemanley

Author: Luke Manley <lukemanley@gmail.com>
Date:   Tue Jan 24 14:02:33 2023 -0500

    BUG/PERF: Series(category).replace (#50857)

Comment From: lukemanley

Thanks for the report @corynezinstitchfix.

The 1.5.3 behavior seems a bit odd to me (adjusting your example a bit):

import pandas as pd

df = pd.DataFrame({"a": ["a", "b", float("nan")]})

df["a"] = df["a"].astype("category")

# converts to object dtype (loses category) and replaces nan with None
df.replace([float("nan")], [None])

# no effect (does not replace nan with "c")
df.replace([float("nan")], ["c"])

In 2.0, both of those replace calls have no effect so I think 2.0 is a bit more consistent when replacing nan. I'm not a huge fan of the 1.5.3 behavior that loses the category dtype. If you want that behavior in 2.0, I think you could just do df.astype(object).replace(np.nan, None).
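A minimal sketch of the workaround mentioned above, assuming pandas 2.x and that losing the category dtype is acceptable. The variable name `result` is only illustrative, and the printed values may vary between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", float("nan")]})
df["a"] = df["a"].astype("category")

# Cast to object first (this deliberately drops the category dtype),
# then replace the missing value with None.
result = df.astype(object).replace(np.nan, None)

# Inspect the resulting dtype and the value that used to be nan.
print(result["a"].dtype)
print(repr(result["a"].iloc[2]))
```

The explicit cast makes the dtype change visible at the call site instead of leaving it as a side effect of replace.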

I'm not sure about raising an error either. At the moment, replacing a category that does not exist in the existing categorical dtype has no effect:

# no effect ("c" is not an existing category)
df.replace(["c"], ["d"])

Since the nan is not actually one of the categories (it represents a missing value), I think you could argue it should also have no effect when trying to replace.
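To illustrate the point that nan is a missing-value marker and not a category, here is a small sketch that inspects a Categorical directly. The variable name `cat` is only illustrative:

```python
import pandas as pd

cat = pd.Categorical(["a", "b", float("nan")])

# Only the non-missing labels are stored as categories.
print(list(cat.categories))

# Each element is stored as an integer code into the categories;
# missing entries use the sentinel code -1.
print(list(cat.codes))

# The missing entry is still reported as NA.
print(list(cat.isna()))
```

Because the missing entry has no corresponding category, a replace that only looks at categories never sees it.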

All of that said, I don't feel too strongly here. Open to other thoughts.

@phofl, any thoughts?

Comment From: lukemanley

Looks like there is another open issue discussing replacing nan in a categorical: https://github.com/pandas-dev/pandas/issues/40472

Comment From: phofl

So we are avoiding losing the dtype now? This is a + imo. So sounds good. But should probably adjust whatsnew

Comment From: lukemanley

> So we are avoiding losing the dtype now? This is a + imo. So sounds good. But should probably adjust whatsnew

Correct, 2.0 no longer loses the dtype and casts to object. Replacing nan in a categorical now has no effect regardless of the replacement value. (1.5.3 had an edge case that would convert to object and replace if the replacement value was NA-like.)

Comment From: phofl

Sounds good to me.

@jbrockmendel any thoughts here?

Comment From: jbrockmendel

There was an issue a while back (not specific to categorical) about replacing nan with None and the conclusion IIRC was to respect that the user specifically asked for that. If we want to raise and tell the user to explicitly cast first I'd be OK with that, but I don't think silently ignoring what they specifically asked for is helpful.

Also note that trying to replace nan with a non-NA value such as 4, or even with an existing category "a", is silently a no-op. This is definitely a bug.

Comment From: lithomas1

@lukemanley @phofl Any action to take here?

Comment From: jbrockmendel

The special casing of CategoricalDtype in Series.replace was deprecated in 2.x and is gone in main. Closing as complete.