Pandas version checks
- [x] I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
This concerns the 3.0 migration guide: https://pandas.pydata.org/docs/user_guide/migration-3-strings.html
Documentation problem
The string migration guide suggests using "str" in place of "object" to write compatible code. The example only showcases this suggestion for the Series constructor, where it indeed works as intended (Pandas 2.3.0):
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["a", None, np.nan, pd.NA], dtype="str").array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
However, the semantics of "str" are different if the series has already been initialized with an "object" dtype and the user calls astype("str") on it:
>>> series = pd.Series(["a", None, np.nan, pd.NA])
>>> series.array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
>>> series.astype("str").array
<NumpyExtensionArray>
['a', 'None', 'nan', '<NA>']
Length: 4, dtype: object
Note that all values, including the missing ones, have been cast to strings. In fact, this appears to be the behavior of passing the literal str as the data type, which is mentioned later in the bug-fix section.
Suggested fix for documentation
I believe this subtle difference should be pointed out in the migration guide. Ideally, the guide should also suggest how to write 3.0-compatible code using astype. In my case, the current Pandas 2 code casts a categorical column (with string categories) to an object column, but I'd like to write it so that the result becomes a string column in Pandas 3.
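For illustration, here is a minimal sketch of one possible workaround. It is not taken from the migration guide: it records the missing-value mask before calling astype("str") and restores those entries afterwards. The helper name astype_str_keep_missing is hypothetical. The resulting dtype is expected to still differ between versions (object on Pandas 2 without infer_string, the string dtype on Pandas 3), so only the handling of missing values is aligned.

```python
import numpy as np
import pandas as pd


def astype_str_keep_missing(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast to "str" but keep missing values missing."""
    # Record which entries are missing before the cast can stringify them.
    missing = series.isna()
    converted = series.astype("str")
    # Put NaN back wherever the input was missing. On Pandas 2 this undoes
    # the 'None' / 'nan' / '<NA>' strings produced by astype("str").
    return converted.mask(missing, np.nan)


# The object-dtype series from the report above.
series = pd.Series(["a", None, np.nan, pd.NA])
result = astype_str_keep_missing(series)
print(result.isna().tolist())

# A categorical column with string categories, as in the use case described.
categorical = pd.Series(["x", None, "y"], dtype="category")
cat_result = astype_str_keep_missing(categorical)
print(cat_result.isna().tolist())
```

Whether something along these lines is the recommended pattern is for the maintainers to say; it only shows that the difference can be worked around without version checks.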
Comment From: rhshadrach
Thanks for the report. Agreed this difference should be highlighted. With infer_string being set to True, these now give ['a', nan, nan, nan]. I'm thinking this should be added to the astype(str) section and not be called a bugfix.
cc @jorisvandenbossche