- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
import numpy as np
#good
good = pd.Series({1: 'a', 2: 'b'}).astype('category').replace('a', 'c')
#bad
bad = pd.Series({1: np.nan, 2: 'b'}).astype('category').replace(np.nan, 'c')
# does not replace and bad is instead:
# 1 NaN
# 2 b
# dtype: category
# Categories (1, object): [b]
Problem description
When replacing np.nan on a categorical series, the values are not modified. This is a breaking change introduced in 1.0 (it worked fine in 0.25.3).
My guess is that this was introduced by https://github.com/pandas-dev/pandas/pull/27026/files which does nothing when "to_replace in cat.categories" evaluates to False.
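A minimal sketch of why a membership test on the categories would come out False for NaN, assuming only the public `.cat` accessor (it does not reproduce the code in the linked PR): missing values in a categorical are stored as code -1 and never appear in `categories`.

```python
import numpy as np
import pandas as pd

s = pd.Series({1: np.nan, 2: "b"}).astype("category")

# NaN is not stored as a category; only 'b' is.
categories = list(s.cat.categories)

# Missing values are represented by the sentinel code -1.
codes = s.cat.codes.tolist()

# The membership test from the guess above is therefore False for NaN.
nan_in_categories = bool(np.nan in s.cat.categories)

print(categories, codes, nan_in_categories)
```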
Expected Output
pd.Series({1: 'c', 2: 'b'}).astype('category')
displaying like
# 1 c
# 2 b
# dtype: category
# Categories (2, object): [c, b]
Output of pd.show_versions()
Comment From: jenhseb
You need to use fillna for NaN values. Notice that np.nan == np.nan returns False. Thus, replace isn't able to match it.
import pandas as pd
import numpy as np
pd.Series({1: np.nan, 2: 'b'}).fillna('c')
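The comparison mentioned above can be checked directly. This short sketch shows that NaN never compares equal to itself, and that `isna` and `fillna` detect missing values without relying on equality:

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself under IEEE 754 floating-point rules.
nan_equals_itself = np.nan == np.nan

s = pd.Series({1: np.nan, 2: "b"})

# isna() finds missing values without an equality comparison.
mask = s.isna().tolist()

# fillna() replaces the missing entries on an object-dtype Series.
filled = s.fillna("c").tolist()

print(nan_equals_itself, mask, filled)
```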
Comment From: dsaxton
> You need to use fillna for NaN values. Notice that np.nan == np.nan returns False. Thus, replace isn't able to match it.
> ```
> import pandas as pd
> import numpy as np
> pd.Series({1: np.nan, 2: 'b'}).fillna('c')
> ```
This actually isn't true in general. The replace works for object but not category:
[ins] In [8]: pd.Series([np.nan, "a"]).replace(np.nan, "a")
Out[8]:
0 a
1 a
dtype: object
[ins] In [9]: pd.Series([np.nan, "a"], dtype="category").replace(np.nan, "a")
Out[9]:
0 NaN
1 a
dtype: category
Categories (1, object): ['a']
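As a possible workaround for the categorical case, here is a sketch that uses `fillna` instead of `replace`. It assumes the fill value must already be a category, so a new value is registered with `cat.add_categories` first.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, "a"], dtype="category")

# Filling with a value that is already a category works directly.
filled_existing = s.fillna("a")

# A value that is not yet a category is added first, then used to fill.
filled_new = s.cat.add_categories("c").fillna("c")

print(filled_existing.tolist())
print(filled_new.tolist(), list(filled_new.cat.categories))
```

Both results keep the category dtype, and `add_categories` appends the new category after the existing ones.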
Comment From: MaximeLaurenty
The replace works for object & float:
In [7]: pd.Series([np.nan, 2]).replace(np.nan, 1)
Out[7]:
0 1.0
1 2.0
dtype: float64
Hence why I find it surprising it doesn't work for category anymore.
Comment From: mzeitlin11
@MaximeLaurenty I have put up a PR which restores this behavior, but also explains a potential rationale for deprecating it and forcing use of fillna. Please let me know if you have any thoughts on what makes more sense!
Comment From: jreback
hmm, we have the default of replace=None so accepting np.nan here is a bit odd. we could change this to use no_default and then this might be reasonable.
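To illustrate the idea of a no_default sentinel, here is a generic Python sketch. The names `_NO_DEFAULT` and `replace_sketch` are hypothetical and are not pandas internals. The point is that a dedicated sentinel lets a function tell "argument omitted" apart from "None (or NaN) passed explicitly".

```python
# Hypothetical sentinel; pandas has its own no_default object, not shown here.
_NO_DEFAULT = object()


def replace_sketch(values, to_replace, value=_NO_DEFAULT):
    """Replace items equal to `to_replace` in a plain list (illustration only)."""
    if value is _NO_DEFAULT:
        # The caller omitted `value`; None would have been a legitimate choice.
        raise TypeError("value was not supplied")
    return [value if v == to_replace else v for v in values]


# None is now an ordinary replacement value, not a stand-in for "missing".
print(replace_sketch(["a", "b"], "a", None))
```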
Comment From: MaximeLaurenty
I agree fillna is better (and I changed our code to use it right after spotting our issue).
I don't think fixing this regression is worth changing the default to_replace.
In my opinion having:
- a warning that it'll be deprecated, as suggested by @mzeitlin11
- and a fix until then that doesn't interfere with None
would be the best. (But I've no experience in maintaining open-source libraries, hence it's not a strong opinion)
Comment From: roib20
> hmm, we have the default of replace=None so accepting np.nan here is a bit odd. we could change this to use no_default and then this might be reasonable.
In 1.4.0 this is now the default behavior of replace: the value parameter is declared as value: NoDefault = lib.no_default. But this specific bug is still present from my testing.
Comment From: jbrockmendel
The special-casing of CategoricalDtype in Series.replace was deprecated in 2.x and is gone in main. That fixes this issue (though the OP example raises as introducing a new category is not allowed). Closing.