Pandas version checks

  • [x] I have checked that the issue still exists on the latest versions of the docs on main

Location of the documentation

This concerns the 3.0 migration guide: https://pandas.pydata.org/docs/user_guide/migration-3-strings.html

Documentation problem

The string migration guide suggests using "str" in place of "object" to write compatible code. The example only showcases this suggestion for the Series constructor, where it indeed works as intended (Pandas 2.3.0):

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["a", None, np.nan, pd.NA], dtype="str").array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object

However, the semantics of using "str" are different if the series has already been initialized with an "object" dtype and the user calls astype("str") on it:

>>> series = pd.Series(["a", None, np.nan, pd.NA])
>>> series.array
<NumpyExtensionArray>
['a', None, nan, <NA>]
Length: 4, dtype: object
>>> series.astype("str").array
<NumpyExtensionArray>
['a', 'None', 'nan', '<NA>']
Length: 4, dtype: object

Note that all values, including the missing ones, have been cast to strings. This appears to be the same behavior as passing the built-in str as the dtype, which the guide mentions later in its bug-fix section.

Suggested fix for documentation

I believe this subtle difference should be pointed out in the migration guide. Ideally, a suggestion should be made on how one may write 3.0-compatible code using astype. In my case, the current Pandas 2 code is casting a categorical column (with string categories) into an object column, but I'd like to write code such that this operation becomes a string column in Pandas 3.

Comment From: rhshadrach

Thanks for the report. Agreed this difference should be highlighted. With infer_string being set to True, these now give ['a', nan, nan, nan]. I'm thinking this should be added to the astype(str) section and not be called a bugfix.

cc @jorisvandenbossche

Comment From: jorisvandenbossche

Good point @cbourjau!

> In my case, the current Pandas 2 code is casting a categorical column (with string categories) into an object column, but I'd like to write code such that this operation becomes a string column in Pandas 3.

So you currently have something like:

>>> pd.__version__
'2.3.1'
>>> ser = pd.Series(["a", "b", "a", None], dtype="category")
>>> ser.astype("object").values
array(['a', 'b', 'a', nan], dtype=object)

and then the question is how to write that such that it stays object dtype in 2.3 and becomes string dtype in 3.0. And indeed doing astype("str") does not work as desired because of that "bug" of also stringifying missing values:

>>> ser.astype(str).values
array(['a', 'b', 'a', 'nan'], dtype=object)
>>> ser.astype("str").values
array(['a', 'b', 'a', 'nan'], dtype=object)

Somehow I thought that this was only the case for str and not "str" ... (I wrote exactly that in the migration guide, in the section about the astype bug: "when using astype(str) (using the built-in str, not astype("str")!)", so that section is clearly wrong.)

In that case I don't think there is any alternative other than some conditional behaviour depending on the version, like:

ser.astype("str" if pd.__version__ > "3" else "object").values
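One caveat with the one-liner above: comparing version strings lexicographically works for "2.x" versus "3.x", but it would misorder a hypothetical "10.0.0" (which sorts before "3"). A minimal sketch of a slightly more defensive variant, assuming only the major version matters; the helper names major_version and compatible_str_dtype are made up for illustration and are not part of pandas:

```python
import re


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.3.1' or '3.0.0rc0'."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"cannot parse a major version from {version!r}")
    return int(match.group())


def compatible_str_dtype(pandas_version: str) -> str:
    """Pick the astype() target: "str" from pandas 3 onwards, "object" before that.

    Hypothetical helper: it looks only at the version number and ignores
    whether the infer_string option was switched on in pandas 2.x.
    """
    return "str" if major_version(pandas_version) >= 3 else "object"


# Plain string comparison misorders multi-digit major versions:
print("10.0.0" > "3")                  # False
print(compatible_str_dtype("10.0.0"))  # str
print(compatible_str_dtype("2.3.1"))   # object
```

With pandas imported as pd, the cast would then read ser.astype(compatible_str_dtype(pd.__version__)). This still keys purely on the version; code that opts into the new string dtype on pandas 2.3 would need its own handling.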

Comment From: jorisvandenbossche

I opened a PR to rewrite the section about astype(str): https://github.com/pandas-dev/pandas/pull/62147. Feedback very welcome!