Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
col = pd.Series(["a", "b", "c"], dtype=str)
cat = pd.api.types.CategoricalDtype(categories=["a", "b"])
col = col.astype(dtype=cat, errors="raise")
print(col)
0 a
1 b
2 NaN
dtype: category
Categories (2, object): ['a', 'b']
Issue Description
No error is raised when recasting as a category, despite the presence of an undefined value, c. Rather, c is coerced to NaN.
This behavior appears inconsistent with that of other data types, such as int.
Expected Behavior
I believe an error should be raised.
Installed Versions
Comment From: asishm
Thanks for the report. Could you please update the title to include a description?
That said, based on this comment (https://github.com/pandas-dev/pandas/issues/51074#issuecomment-1409344688), this is expected behavior.
Comment From: rhshadrach
This behavior appears inconsistent with that of other data types, such as int.
Can you give an example that demonstrates the inconsistency?
Comment From: noahblakesmith
Sure thing @rhshadrach. Here is an example using int, which throws an error. I also tested float, "Int64", and "int64[pyarrow]", which produced similar errors.
import pandas as pd
col = pd.Series(["a", "b", "c"])
col = col.astype(dtype=int, errors="raise")
Traceback (most recent call last):
File "./test.py", line 4, in <module>
col = col.astype(dtype=int, errors="raise")
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/generic.py", line 6643, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 430, in astype
return self.apply(
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 363, in apply
applied = getattr(b, f)(**kwargs)
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
return arr.astype(dtype, copy=True)
ValueError: invalid literal for int() with base 10: 'a'
Comment From: rhshadrach
Thanks @noahblakesmith. I would not call this inconsistent since categorical dtype has its own specialized semantics, as @asishm mentioned. This is well-established and purposeful behavior, so it is also not a bug.
That said, there is agreement this is undesired behavior. This is very closely related, and may even be fixed by, #40996.
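In the meantime, a user-side check can give the raising behavior requested above. The following is only a minimal sketch: `astype_category_strict` is a hypothetical helper name, not part of the pandas API, and it assumes that values already missing before the cast should be allowed to stay missing.

```python
import pandas as pd


def astype_category_strict(values: pd.Series, dtype: pd.CategoricalDtype) -> pd.Series:
    """Cast to a categorical dtype, raising if a non-missing value is not a category.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Values that are already missing are allowed to stay missing.
    present = values.dropna()
    # Anything present but absent from the categories would otherwise become NaN.
    unknown = present[~present.isin(dtype.categories)].unique()
    if len(unknown) > 0:
        raise ValueError(f"values not in categories: {list(unknown)}")
    return values.astype(dtype)


cat = pd.CategoricalDtype(categories=["a", "b"])

# Every value is a known category, so the cast goes through.
ok = astype_category_strict(pd.Series(["a", "b", "a"]), cat)

# "c" is not a category, so the helper raises instead of coercing it to NaN.
try:
    astype_category_strict(pd.Series(["a", "b", "c"]), cat)
except ValueError as exc:
    message = str(exc)
```

The check runs before the cast rather than comparing missing counts afterwards, so the error message can name the offending values.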