When creating a pandas Series/Index/DataFrame, I think we generally differentiate between passing a pandas object with object dtype and a numpy array with object dtype:
>>> pd.options.future.infer_string = True
>>> pd.Index(pd.Series(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='object')
>>> pd.Index(np.array(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='str')
So for pandas objects we preserve the dtype, while for numpy arrays of object dtype we essentially treat the input as a sequence of Python objects and infer the dtype (@jbrockmendel, is that also your understanding?).
But for categorical that doesn't seem to happen:
>>> pd.options.future.infer_string = True
>>> pd.Categorical(pd.Series(["foo", "bar", "baz"], dtype="object"))
['foo', 'bar', 'baz']
Categories (3, str): [bar, baz, foo] # <--- categories inferred as str
So do we want to preserve the dtype for the categories here as well?
Comment From: jbrockmendel
> (@jbrockmendel that's also your understanding?)
Yes.
> So do we want to preserve the dtype for the categories here as well?
Makes sense.
Comment From: niruta25
How about we modify the Categorical constructor to distinguish between:
- Pandas objects (Index/Series) with object dtype → preserve the object dtype
- NumPy arrays with object dtype → normal inference (existing behavior)
- Raw Python sequences → normal inference (existing behavior)
We can implement the change where dtype validation occurs. This change will preserve existing behavior for numpy arrays and raw sequences while fixing the inconsistency for pandas objects.
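As a rough illustration of the proposed distinction (not the actual pandas internals — `infer_categories_dtype` is a hypothetical helper name, and the real change would live inside the `Categorical` constructor's dtype-validation path):

```python
import numpy as np
import pandas as pd

def infer_categories_dtype(values):
    """Hypothetical sketch of the proposed branching.

    Pandas objects (Series/Index) keep their dtype; numpy object
    arrays and raw Python sequences return None, meaning the
    constructor falls through to normal inference as it does today.
    """
    if isinstance(values, (pd.Series, pd.Index)):
        # Pandas object: preserve its dtype for the categories.
        return values.dtype
    # NumPy array or raw sequence: leave dtype inference unchanged.
    return None

# Pandas object with object dtype -> dtype is preserved.
infer_categories_dtype(pd.Series(["foo", "bar"], dtype="object"))  # == np.dtype("object")

# NumPy object array and raw sequence -> fall back to inference.
infer_categories_dtype(np.array(["foo", "bar"], dtype="object"))   # None
infer_categories_dtype(["foo", "bar"])                             # None
```

With `pd.options.future.infer_string = True`, the `None` branch is where the categories would still be inferred as `str`, matching the existing behavior shown above.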
If you all agree with the solution, I can take it up.
Comment From: jbrockmendel
That's the right idea, give it a try.
Comment From: niruta25
take