Pandas version checks

  • [x] I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

This concerns the 3.0 migration guide: https://pandas.pydata.org/docs/user_guide/migration-3-strings.html

Documentation problem

The string migration guide suggests using "str" in place of "object" to write compatible code. The example only demonstrates this suggestion for the Series constructor, where it indeed works as intended (pandas 2.3.0):

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["a", None, np.nan, pd.NA], dtype="str").array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object

However, the semantics of using "str" are different if the series has already been initialized with an "object" dtype and the user calls astype("str") on it:

>>> series = pd.Series(["a", None, np.nan, pd.NA])
>>> series.array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
>>> series.astype("str").array
<NumpyExtensionArray>
['a', 'None', 'nan', '<NA>']
Length: 4, dtype: object

Note that all values, including the missing ones, have been cast to strings. This appears to be the same behavior as passing the literal str as the data type, which the guide mentions later in its bug-fix section.

Suggested fix for documentation

I believe this subtle difference should be pointed out in the migration guide. Ideally, the guide would also suggest how to write 3.0-compatible code using astype. In my case, the current pandas 2 code casts a categorical column (with string categories) to an object column, and I'd like to write it so that the same operation yields a string column in pandas 3.
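To make the request concrete, here is a minimal sketch of one possible workaround. It assumes that restoring the missing values after the cast is acceptable. The helper name astype_str_keep_missing is hypothetical and not part of pandas, and this is not what the migration guide currently recommends:

```python
import numpy as np
import pandas as pd


def astype_str_keep_missing(series: pd.Series) -> pd.Series:
    """Hypothetical helper (not a pandas API): cast to "str" without
    turning missing values into the strings 'None', 'nan' or '<NA>'."""
    # Record which entries are missing before the cast.
    missing = series.isna()
    converted = series.astype("str")
    # If the cast stringified the missing values, put NaN back.
    # If the cast already preserved them, this changes nothing.
    return converted.mask(missing, np.nan)


series = pd.Series(["a", None, np.nan, pd.NA])
result = astype_str_keep_missing(series)

# The categorical case described above (string categories, one missing value).
categorical = pd.Series(["a", None, "b"], dtype="category")
cat_result = astype_str_keep_missing(categorical)
```

With this, the missing entries stay missing regardless of how astype("str") treats them. The resulting dtype still depends on the pandas version and the infer_string setting, so I would expect object on pandas 2 defaults and the new string dtype on pandas 3, though I have not checked every version.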

Comment From: rhshadrach

Thanks for the report. Agreed, this difference should be highlighted. With infer_string set to True, these examples now give ['a', nan, nan, nan]. I'm thinking this should be added to the astype(str) section and not be called a bugfix.

cc @jorisvandenbossche