Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({"a": ["a", "b", float("nan")]})
df["a"] = df["a"].astype("category")
df = df.fillna(float("nan")).replace([float("nan")], [None])
print(df["a"].loc[2])
> nan
Issue Description
As of pandas 2.0.0, pandas.DataFrame.replace now silently fails to replace math.nan with None on categorical type columns.
Expected Behavior
Either:
1. nan should be replaced with None; or
2. an error should be raised.
Installed Versions
Comment From: phofl
cc @lukemanley
Author: Luke Manley <lukemanley@gmail.com>
Date: Tue Jan 24 14:02:33 2023 -0500
BUG/PERF: Series(category).replace (#50857)
Comment From: lukemanley
Thanks for the report @corynezinstitchfix.
The 1.5.3 behavior seems a bit odd to me (adjusting your example a bit):
import pandas as pd
df = pd.DataFrame({"a": ["a", "b", float("nan")]})
df["a"] = df["a"].astype("category")
# converts to object dtype (loses category) and replaces nan with None
df.replace([float("nan")], [None])
# no effect (does not replace nan with "c")
df.replace([float("nan")], ["c"])
In 2.0, both of those replace calls have no effect, so I think 2.0 is a bit more consistent when replacing nan. I'm not a huge fan of the 1.5.3 behavior that loses the category dtype. If you want that behavior in 2.0, I think you could just do df.astype(object).replace(np.nan, None).
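For reference, the workaround suggested above can be written out as a small self-contained sketch. It assumes a pandas 2.x install; the variable name `result` is only illustrative, and the exact behavior may differ on other versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", float("nan")]})
df["a"] = df["a"].astype("category")

# Explicitly give up the categorical dtype first, then replace NaN with None.
# The cast to object is what makes None a storable value in the column.
result = df.astype(object).replace(np.nan, None)

print(result["a"].dtype)   # dtype after the explicit cast
print(result["a"].loc[2])  # the entry that was NaN before the replace
```

The point of the explicit cast is that the loss of the category dtype becomes a visible choice by the caller instead of a side effect of replace.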
I'm not sure about raising an error either. At the moment, replacing a category that does not exist in the existing categorical dtype has no effect:
# no effect ("c" is not an existing category)
df.replace(["c"], ["d"])
Since the nan is not actually one of the categories (it represents a missing value), I think you could argue it should also have no effect when trying to replace.
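A minimal sketch of that distinction, using only the public `.cat` accessor: the missing entry does not appear among the categories and is stored with the sentinel code -1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# Only the real labels are categories; NaN is not one of them.
categories = list(s.cat.categories)

# Each value is stored as an integer code into the categories;
# missing values are stored with the sentinel code -1.
codes = list(s.cat.codes)

print(categories)
print(codes)
```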
All of that said, I don't feel too strongly here. Open to other thoughts.
@phofl, any thoughts?
Comment From: lukemanley
Looks like there is another open issue discussing replacing nan in a categorical: https://github.com/pandas-dev/pandas/issues/40472
Comment From: phofl
So we are avoiding losing the dtype now? This is a + imo. So sounds good. But should probably adjust whatsnew
Comment From: lukemanley
> So we are avoiding losing the dtype now? This is a + imo. So sounds good. But should probably adjust whatsnew

Correct, 2.0 no longer loses the dtype by casting to object. Replacing nan in a categorical now has no effect regardless of the replacement value. (1.5.3 had an edge case that would convert to object and replace if the replacement value was NA-like.)
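As an aside that was not proposed in this thread: if the goal is to swap the missing values for a real label while keeping the categorical dtype, one possible approach is to add the label as a category and then use fillna. This is only a sketch; the label "c" and the name `filled` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# "c" must exist as a category before it can be used as a fill value,
# otherwise fillna on a categorical raises.
filled = s.cat.add_categories(["c"]).fillna("c")

print(filled.dtype)
print(filled.tolist())
```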
Comment From: phofl
Sounds good to me.
@jbrockmendel any thoughts here?
Comment From: jbrockmendel
There was an issue a while back (not specific to categorical) about replacing nan with None, and the conclusion IIRC was to respect that the user specifically asked for that. If we want to raise and tell the user to explicitly cast first I'd be OK with that, but I don't think silently ignoring what they specifically asked for is helpful.
Also note that trying to replace nan with a non-NA value such as 4, or even with an existing category "a", is silently a no-op. This is definitely a bug.
Comment From: lithomas1
@lukemanley @phofl Any action to take here?
Comment From: jbrockmendel
The special casing of CategoricalDtype in Series.replace was deprecated in 2.x and is gone in main. Closing as complete.