Pandas String dtype: backwards compatibility of selecting "object" vs "str" columns in select_dtypes

We provide the DataFrame.select_dtypes() method to easily subset columns based on data types (groups). See https://pandas.pydata.org/pandas-docs/version/2.3/user_guide/basics.html#selecting-columns-based-on-dtype

At the moment, as documented, to select string columns you must use the object dtype:

>>> pd.options.future.infer_string = False
>>> df = pd.DataFrame(
...     {
...         "string": list("abc"),
...         "int64": list(range(1, 4)),
...     }
... )
>>> df.dtypes
string    object
int64      int64
dtype: object
>>> df.select_dtypes(include=[object])
  string
0      a
1      b
2      c

On current main, with the string dtype enabled, the above dataframe now has a str column, and so selecting object dtype columns gives an empty result. One can use str instead:

>>> pd.options.future.infer_string = True
>>> df = pd.DataFrame(
...     {
...         "string": list("abc"),
...         "int64": list(range(1, 4)),
...     }
... )
>>> df.dtypes
string      str
int64     int64
dtype: object
>>> df.select_dtypes(include=[object])
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>> df.select_dtypes(include=[str])
  string
0      a
1      b
2      c

On the one hand, that is an "obvious" behaviour change as a consequence of the column now having a different dtype. But on the other hand, this will also break all code currently using select_dtypes to select string columns (and potentially silently, since it simply no longer selects them).

How to write compatible code?

One can select both object and string dtypes, so that those columns are selected in both older and newer pandas. One gotcha is that df.select_dtypes(include=[str]) is not allowed in pandas<=2.3 ("string dtypes are not allowed, use 'object' instead"), so one has to use "string" instead of "str" (although the default dtype is str ..). This selects the opt-in nullable string columns as well as the new default str dtype:

# this gives the same result with infer_string=True or False
>>> df.select_dtypes(include=[object, "string"])
  string
0      a
1      b
2      c
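
A minimal sketch of wrapping that pattern in a helper, so the compatibility logic lives in one place. The name select_string_columns is hypothetical and not a pandas API. Note that object still matches object columns holding non-string values, just as include=[object] always did.

```python
import pandas as pd


def select_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of pandas).

    Selects columns that hold strings under either setting of
    pd.options.future.infer_string:
    - object covers the legacy default for string data
    - "string" covers StringDtype columns, which per the discussion
      above includes the new default str dtype
    """
    return df.select_dtypes(include=[object, "string"])


df = pd.DataFrame(
    {
        "string": list("abc"),
        "int64": list(range(1, 4)),
    }
)
string_columns = select_string_columns(df)
```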

TODO: this should be added to the migration guide in https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#the-dtype-is-no-longer-object-dtype

Can we make this upgrade experience smoother?

Given that this will essentially break every use case of select_dtypes that involves selecting string columns (and given the fact this is a method, so we are more flexible compared to ser.dtype == object), I am wondering if we should provide some better upgrading behaviour. Some options:

For now, let select_dtypes(include=[object]) keep selecting string columns as well, for backwards compatibility (and we can later add a warning that we will stop doing that in the future).
When a user does select_dtypes(include=[object]) in pandas 3.0, and we see that there are str columns, raise a warning telling the user that they likely want to do include=[str] instead.

In both cases, it gets annoying if you actually want to select object columns, because you then get a (false positive) warning that you can't really do anything about (except ignoring/suppressing it).

And in any case, we should probably still add a warning to pandas 2.3 about this when the string mode is enabled (in case we do a 2.3.2 release).
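
To make the second option concrete, here is a rough user-side sketch of the check it describes. It is not how pandas implements or would implement this. The function name select_dtypes_with_hint, the choice of UserWarning, and the message text are all assumptions for illustration. The example column uses dtype="string" as a stand-in for a str column so that the sketch does not depend on the infer_string setting.

```python
import warnings

import pandas as pd


def select_dtypes_with_hint(df: pd.DataFrame, include):
    """Hypothetical wrapper (not a pandas API) around select_dtypes.

    Warns when object is requested without a string dtype while the
    frame contains StringDtype columns.
    """
    include = list(include) if isinstance(include, (list, tuple)) else [include]
    wants_object = any(
        d is object or (isinstance(d, str) and d == "object") for d in include
    )
    wants_string = any(
        d is str or (isinstance(d, str) and d in ("str", "string"))
        for d in include
    )
    string_cols = [
        col for col, dtype in df.dtypes.items()
        if isinstance(dtype, pd.StringDtype)
    ]
    if wants_object and not wants_string and string_cols:
        warnings.warn(
            f"include contains object but columns {string_cols} have a string "
            "dtype; add 'string' to include if you want to select them",
            UserWarning,  # assumption: the category is undecided in the issue
            stacklevel=2,
        )
    return df.select_dtypes(include=include)


df = pd.DataFrame({"s": pd.array(["a", "b"], dtype="string"), "i": [1, 2]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    select_dtypes_with_hint(df, [object])
```

The false-positive problem shows up directly here: a caller who really wants only object columns has no way to say so other than filtering the warning.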

Comment From: arthurlw

Since a lot of systems likely rely on select_dtypes(include=[object]) returning string columns, I think we should maintain backwards compatibility in 3.0, but emit a FutureWarning when str columns are implicitly selected. That avoids silent breakage while giving users time to update. In future versions, we can deprecate this behavior cleanly.