Because of the new string dtype, we also implicitly change the representation of the unique categories in the Categorical dtype repr (aside from the object -> str change for the dtype):
>>> pd.options.future.infer_string = False
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, object): ['a', 'b', 'c']
>>> pd.options.future.infer_string = True
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, str): [a, b, c]
So the actual array values are always quoted, but the list of unique categories in the dtype repr goes from ['a', 'b', 'c'] to [a, b, c].
Brock already fixed a bunch of xfails in the tests because of this in https://github.com/pandas-dev/pandas/pull/61727. And we also run into this issue for the failing doctests (https://github.com/pandas-dev/pandas/issues/61886).
@jbrockmendel mentioned there:
It isn't 100% obvious that the new repr for Categoricals is an improvement, but it's non-crazy.
I agree with that, and I also have no strong opinion either way.
But before we go and fix the doctests as well, let's confirm that we are OK with this change. If we don't have a strong opinion that it is an improvement, we could also leave it how it was originally, and avoid some breakage for downstream projects or users (e.g. those who also have doctests).
Comment From: jorisvandenbossche
The technical explanation of this change is that for Categorical.__repr__, we have a Categorical._repr_categories helper method that creates this data (called from Categorical._get_repr_footer, which is used in the categorical repr but also in the Series repr if the dtype is categorical).
This function calls format_array with QUOTE_NONNUMERIC:
>>> import numpy as np
>>> import pandas as pd
>>> from pandas.io.formats import format as fmt
>>> from csv import QUOTE_NONNUMERIC
>>> fmt.format_array(np.array(["a", "b"], dtype=object), formatter=None, quoting=QUOTE_NONNUMERIC)
[" 'a'", " 'b'"]
>>> fmt.format_array(pd.array(["a", "b"], dtype="str"), formatter=None, quoting=QUOTE_NONNUMERIC)
[' a', ' b']
But in the case of the string dtype, being an extension dtype, format_array uses the values._formatter(boxed=True) of the ExtensionArray, and in the case of strings, when boxed=True, those values are not quoted (e.g. as used in the Series repr, in contrast to the array repr). And so for extension dtypes, QUOTE_NONNUMERIC is also ignored.
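To make the mechanism concrete, here is a toy model in plain Python. It is not pandas' actual code: object_path, extension_formatter and extension_path are hypothetical stand-ins for the two code paths described above, and the model leaves out the leading-space padding that the real format_array output shows.

```python
from csv import QUOTE_NONNUMERIC


def object_path(values, quoting=None):
    """Toy stand-in for the object-dtype path: it honours the quoting flag."""
    if quoting == QUOTE_NONNUMERIC:
        # Quote strings only, leave everything else unquoted.
        return [repr(v) if isinstance(v, str) else str(v) for v in values]
    return [str(v) for v in values]


def extension_formatter(boxed):
    """Toy stand-in for an array-level formatter: unquoted when boxed."""
    return str if boxed else repr


def extension_path(values, quoting=None):
    """Toy stand-in for the extension-dtype path."""
    # The quoting argument is accepted but never consulted here,
    # which is the behaviour described in the text above.
    formatter = extension_formatter(boxed=True)
    return [formatter(v) for v in values]


object_result = object_path(["a", "b"], quoting=QUOTE_NONNUMERIC)
extension_result = extension_path(["a", "b"], quoting=QUOTE_NONNUMERIC)
print(object_result)     # ["'a'", "'b'"]
print(extension_result)  # ['a', 'b']
```

Both calls pass the same quoting flag, but only the first path looks at it, which is why the quotes disappear once the categories are backed by an extension array.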
So given that we also don't quote other types (i.e. we show the "boxed" repr for them; for example, we don't use quoted strings to represent timestamp categories), the new behaviour seems a little bit more consistent.
But right now we essentially already special-case strings in the categorical repr by passing QUOTE_NONNUMERIC. Thus I think it is also perfectly reasonable to update that existing special case to cover the string dtype in addition to object dtype, to preserve the existing behaviour and minimize the repr changes.
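As a rough illustration of what extending the special case would mean, here is a plain-Python sketch. It is not the change in any pandas PR: format_categories and footer are hypothetical helpers, and the only point is that quoting is decided by the value being a str, independent of the dtype name shown in the footer.

```python
def format_categories(categories):
    """Hypothetical helper: quote str categories whatever the backing dtype."""
    return [repr(c) if isinstance(c, str) else str(c) for c in categories]


def footer(categories, dtype_name):
    """Hypothetical helper that builds a 'Categories (...)' footer line."""
    body = ", ".join(format_categories(categories))
    return f"Categories ({len(categories)}, {dtype_name}): [{body}]"


old_style = footer(["a", "b", "c"], "object")
new_style = footer(["a", "b", "c"], "str")
print(old_style)  # Categories (3, object): ['a', 'b', 'c']
print(new_style)  # Categories (3, str): ['a', 'b', 'c']
```

With this approach only the dtype name changes between the two footers, so existing doctests would need at most the object -> str update.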
Comment From: jbrockmendel
There are some remaining CategoricalIndex repr tests that are xfailed because the padding changes. Would re-enabling the special casing here also get us the old padding behavior? I think the old padding is a little nicer.
Comment From: jorisvandenbossche
I think that is still something else, because that is in formatting the data part of the array/index, not the dtype
But it turned out to be a simple fix -> https://github.com/pandas-dev/pandas/pull/61894
Comment From: jorisvandenbossche
And FWIW I also have a PR with the necessary small change to preserve the special-case quoting for string categories in https://github.com/pandas-dev/pandas/pull/61891 (although, if we want that, I still have to update the tests again to get that PR green)