Pandas version checks
- [x] I have checked that the issue still exists on the latest versions of the docs on main.
Location of the documentation
This concerns the 3.0 migration guide: https://pandas.pydata.org/docs/user_guide/migration-3-strings.html
Documentation problem
The string migration guide suggests using "str" in place of "object" to write compatible code. The example only showcases this suggestion for the Series constructor, where it indeed works as intended (Pandas 2.3.0):
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["a", None, np.nan, pd.NA], dtype="str").array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
However, the semantics of using "str" are different if the series has already been initialized with an "object" dtype and the user calls astype("str") on it:
>>> series = pd.Series(["a", None, np.nan, pd.NA])
>>> series.array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
>>> series.astype("str").array
<NumpyExtensionArray>
['a', 'None', 'nan', '<NA>']
Length: 4, dtype: object
Note that all values, including the missing ones, have been cast to strings. In fact, this appears to be the behavior of passing the built-in str as the data type, which the guide mentions later in its bug-fix section.
Suggested fix for documentation
I believe this subtle difference should be pointed out in the migration guide. Ideally, a suggestion should be made on how one may write 3.0-compatible code using astype. In my case, the current Pandas 2 code is casting a categorical column (with string categories) into an object column, but I'd like to write code such that this operation becomes a string column in Pandas 3.
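One possible version-independent workaround, offered only as a sketch and not something the migration guide recommends, is to cast with astype("str") and then restore the missing values from the original mask. It assumes the non-missing values are already strings, as with string categories:

```python
import pandas as pd

ser = pd.Series(["a", "b", "a", None], dtype="category")

# Remember which entries were present before the cast.
mask = ser.notna()

# On pandas 2.x the cast turns the missing entry into the string "nan";
# where() replaces every entry outside the mask with a missing value again.
result = ser.astype("str").where(mask)
```

The resulting dtype still depends on the installed pandas version; only the handling of missing values is made consistent.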
Comment From: rhshadrach
Thanks for the report. Agreed this difference should be highlighted. With infer_string being set to True, these now give ['a', nan, nan, nan]. I'm thinking this should be added to the astype(str) section and not be called a bugfix.
cc @jorisvandenbossche
Comment From: jorisvandenbossche
Good point @cbourjau!
> In my case, the current Pandas 2 code is casting a categorical column (with string categories) into an object column, but I'd like to write code such that this operation becomes a string column in Pandas 3.
So you currently have something like:
>>> pd.__version__
'2.3.1'
>>> ser = pd.Series(["a", "b", "a", None], dtype="category")
>>> ser.astype("object").values
array(['a', 'b', 'a', nan], dtype=object)
and then the question is how to write that such that it stays object dtype in 2.3 and becomes string dtype in 3.0.
And indeed doing astype("str") does not work as desired because of that "bug" of also stringifying missing values:
>>> ser.astype(str).values
array(['a', 'b', 'a', 'nan'], dtype=object)
>>> ser.astype("str").values
array(['a', 'b', 'a', 'nan'], dtype=object)
Somehow I thought that this was only the case for str and not "str"... (given that I wrote exactly that in the migration guide in the section about the astype bug: "when using astype(str) (using the built-in str, not astype("str")!)", so that section is clearly wrong)
In that case I don't think there is any alternative to some conditional behaviour depending on the version, like:
ser.astype("str" if pd.__version__ > "3" else "object").values
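The string comparison above works for 2.x and 3.x version strings, but it compares lexicographically, so for example "10.0" > "3" is False. A slightly more defensive sketch parses the major version instead. The helper name compat_string_dtype is hypothetical and not part of pandas:

```python
import pandas as pd


def compat_string_dtype(version: str = pd.__version__) -> str:
    """Return "str" for pandas >= 3 and "object" for older versions.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    major = int(version.split(".")[0])
    return "str" if major >= 3 else "object"


ser = pd.Series(["a", "b", "a", None], dtype="category")

# object dtype on pandas 2.x, string dtype on pandas 3.x;
# the missing value stays missing in both cases.
result = ser.astype(compat_string_dtype())
```

Keeping the version check in one place means it can be deleted in a single edit once pandas 2 support is dropped.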
Comment From: jorisvandenbossche
I opened a PR to rewrite the section about astype(str): https://github.com/pandas-dev/pandas/pull/62147. Feedback very welcome!