When creating a pandas Series/Index/DataFrame, I think we generally differentiate between passing a pandas object with object dtype and a numpy array with object dtype:
>>> pd.options.future.infer_string = True
>>> pd.Index(pd.Series(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='object')
>>> pd.Index(np.array(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='str')
So for pandas objects we preserve the dtype, whereas for numpy arrays of object dtype we essentially treat the input as a sequence of Python objects and infer the dtype (@jbrockmendel that's also your understanding?)
But for categorical that doesn't seem to happen:
>>> pd.options.future.infer_string = True
>>> pd.Categorical(pd.Series(["foo", "bar", "baz"], dtype="object"))
['foo', 'bar', 'baz']
Categories (3, str): [bar, baz, foo] # <--- categories inferred as str
So we want to preserve the dtype for the categories here as well?
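In the meantime, a possible workaround (a minimal sketch, not a proposal for the fix itself) is to pass an explicit `CategoricalDtype` whose categories are already an object-dtype Index, so nothing is left to infer. The helper name `categorical_with_object_categories` is made up for illustration, and sorting the categories is just an assumption to mirror the output above:

```python
import pandas as pd


def categorical_with_object_categories(values):
    """Hypothetical helper: build a Categorical whose categories stay object dtype."""
    ser = pd.Series(values, dtype="object")
    # Build the categories explicitly as an object-dtype Index, so no
    # string inference is applied to them.
    categories = pd.Index(sorted(ser.dropna().unique()), dtype="object")
    return pd.Categorical(ser, dtype=pd.CategoricalDtype(categories=categories))


cat = categorical_with_object_categories(["foo", "bar", "baz"])
print(cat.categories.dtype)  # object
```

This only sidesteps the inference by spelling out the categories; it does not change what `pd.Categorical` does by default when given an object-dtype Series.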
Comment From: jbrockmendel
(@jbrockmendel that's also your understanding?)
Yes.
So we want to preserve the dtype for the categories here as well?
Makes sense.