When creating a pandas Series/Index/DataFrame, I think we generally differentiate between passing a pandas object with object dtype and a numpy array with object dtype:

>>> pd.options.future.infer_string = True
>>> pd.Index(pd.Series(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='object')
>>> pd.Index(np.array(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='str')

So for pandas objects we preserve the dtype; for numpy arrays of object dtype, we essentially treat them as a sequence of Python objects and infer the dtype (@jbrockmendel, is that also your understanding?)

But for Categorical that doesn't seem to happen:

>>> pd.options.future.infer_string = True
>>> pd.Categorical(pd.Series(["foo", "bar", "baz"], dtype="object"))
['foo', 'bar', 'baz']
Categories (3, str): [bar, baz, foo]   # <--- categories inferred as str

So do we want to preserve the dtype for the categories here as well?

Comment From: jbrockmendel

> (@jbrockmendel, is that also your understanding?)

Yes.

> So do we want to preserve the dtype for the categories here as well?

Makes sense.

Comment From: niruta25

How about we modify the Categorical constructor to distinguish between:

  • Pandas objects (Index/Series) with object dtype → preserve object dtype
  • Numpy arrays with object dtype → allow normal inference (existing behavior)
  • Raw Python sequences → allow normal inference (existing behavior)

We can implement the change at the point in the Categorical constructor where dtype validation occurs. This would preserve existing behavior for numpy arrays and raw sequences while fixing the inconsistency for pandas objects.

If you all agree with the solution, I can take it up.

Comment From: jbrockmendel

That's the right idea, give it a try.

Comment From: niruta25

take