Pandas API: Expand read_csv dtype for categoricals

In https://github.com/pandas-dev/pandas/pull/13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.

# Proposed syntax (not currently supported)
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
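A minimal sketch of that post-processing step, written as a wrapper rather than a change to read_csv itself. The function name `read_csv_with_categories` and its `categories` and `ordered` parameters are hypothetical, chosen only for illustration; the pandas calls it relies on (`read_csv` with `dtype='category'` and `Series.cat.set_categories`) are existing API.

```python
import io
import pandas as pd


def read_csv_with_categories(source, categories, ordered=(), **kwargs):
    """Hypothetical helper: parse with inferred categories, then fix them up.

    categories: dict mapping column name -> list of categories
    ordered: collection of column names whose categoricals should be ordered
    """
    # Parse the listed columns as categoricals, letting pandas infer categories.
    dtype = {col: "category" for col in categories}
    df = pd.read_csv(source, dtype=dtype, **kwargs)
    # Post-process: replace the inferred categories with the requested ones.
    for col, cats in categories.items():
        df[col] = df[col].cat.set_categories(cats, ordered=col in ordered)
    return df


df = read_csv_with_categories(
    io.StringIO("col,x\na,1\nb,2\na,3\n"),
    categories={"col": ["a", "b", "c"]},
    ordered={"col"},
)
print(df["col"].cat.categories.tolist(), df["col"].cat.ordered)
```

Because the fix-up happens after parsing, this variant still pays for inferring and sorting the categories first.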

This would help a bit in dask, where category type inference can fail if the first partition doesn't contain all the categories (see https://github.com/dask/dask/issues/1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories call.
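The partition problem can be reproduced with pandas alone by reading a file in chunks. This is only an illustration of the failure mode, not dask's actual code path: the chunks stand in for partitions, and the variable names are made up for the example.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

csv_text = "col\na\na\nb\nc\n"


def read_chunks():
    # Two rows per chunk, standing in for two partitions.
    reader = pd.read_csv(io.StringIO(csv_text), dtype={"col": "category"}, chunksize=2)
    return list(reader)


# Each chunk infers categories only from the rows it sees.
inferred = read_chunks()
inferred_categories = [chunk["col"].cat.categories.tolist() for chunk in inferred]

# The follow-up the user currently has to do by hand: align every chunk.
all_categories = ["a", "b", "c"]
aligned = [
    chunk.assign(col=chunk["col"].cat.set_categories(all_categories))
    for chunk in read_chunks()
]
aligned_categories = [chunk["col"].cat.categories.tolist() for chunk in aligned]

# Concatenation keeps the categorical dtype only when the categories match.
mismatched = pd.concat(inferred, ignore_index=True)
combined = pd.concat(aligned, ignore_index=True)
print(inferred_categories, aligned_categories)
print(isinstance(combined["col"].dtype, CategoricalDtype))
```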

Comment From: chris-b1

If it matters, there would be at least a little performance to be picked up by modifying the parsing code - you could pass the categories in here:

https://github.com/pandas-dev/pandas/blob/7a2bcb6605bacea858ec14cfac424898deb568b3/pandas/parser.pyx#L1523

And shortcut building the categories array, sorting, etc. It would also cause an error to be thrown much earlier if the data has a value not in the specified categories.
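To make the difference concrete, here is a pure-Python sketch of the two encoding strategies. It is not the parser code linked above; both function names are hypothetical, and the real implementation would live in Cython and work on the tokenized column.

```python
def encode_with_inferred_categories(values):
    """Discover the categories, sort them, then encode (two passes)."""
    categories = sorted(set(values))
    code_of = {cat: i for i, cat in enumerate(categories)}
    return categories, [code_of[v] for v in values]


def encode_with_known_categories(values, categories):
    """Encode against a fixed category list in one pass, failing early."""
    code_of = {cat: i for i, cat in enumerate(categories)}
    codes = []
    for pos, value in enumerate(values):
        if value not in code_of:
            # The error surfaces at the first offending row.
            raise ValueError(f"row {pos}: {value!r} is not in the specified categories")
        codes.append(code_of[value])
    return codes


print(encode_with_inferred_categories(["b", "a", "b"]))
print(encode_with_known_categories(["b", "a", "b"], ["a", "b", "c"]))
```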

Comment From: TomAugspurger

Yes, I think you're right that it'd be better to do it in the Cython code. I was digging through there last week, and it shouldn't be too much extra effort.

Though I guess we'll need to have an API discussion about what to do if the user passes dtype={'A': pd.Categorical(['a', 'b'])} and a value outside those categories shows up:

1. Throw an exception
2. Set to NA

The second option would be consistent with set_categories.

In [1]: pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])
Out[1]:
[a, b, NaN]
Categories (2, object): [a, b]
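For the sake of discussion, the two behaviours could sit behind a single switch. The sketch below is an assumption about how that might look, not a proposed signature: `apply_categories` and its `on_unknown` parameter are hypothetical names. It relies on the existing behaviour of `pd.Categorical(values, categories=...)`, which maps values outside the given categories to NaN.

```python
import pandas as pd


def apply_categories(values, categories, on_unknown="nan"):
    """Hypothetical helper showing both policies for out-of-category values."""
    if on_unknown not in ("nan", "raise"):
        raise ValueError("on_unknown must be 'nan' or 'raise'")
    values = pd.Series(values, dtype=object)
    unknown = sorted(set(values.dropna()) - set(categories))
    if unknown and on_unknown == "raise":
        # Option 1: throw an exception.
        raise ValueError(f"values not in the specified categories: {unknown}")
    # Option 2: values outside `categories` become NaN, as with set_categories.
    return pd.Categorical(values, categories=categories)


result = apply_categories(["a", "b", "c"], ["a", "b"])
print(result)
```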

Comment From: jbrockmendel

Users can specify CategoricalDtype(...). I'd rather not complicate the API to save a few keystrokes.
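For reference, a short example of the CategoricalDtype route, assuming a pandas version where read_csv accepts a CategoricalDtype instance in `dtype`. The CSV contents and column name are made up for the example.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories and ordering are declared up front instead of being inferred.
col_dtype = CategoricalDtype(categories=["a", "b", "c"], ordered=True)

df = pd.read_csv(io.StringIO("col\na\nb\na\n"), dtype={"col": col_dtype})
print(df["col"].dtype == col_dtype)

# A value outside the declared categories ('d' here) is read as missing.
df_unknown = pd.read_csv(io.StringIO("col\na\nd\n"), dtype={"col": col_dtype})
print(df_unknown["col"].isna().tolist())
```

This corresponds to the second option discussed earlier in the thread: out-of-category values become NA rather than raising.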