In https://github.com/pandas-dev/pandas/pull/13406 Chris added support for read_csv(..., dtype={'col': 'category'})
(thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']}) # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories
Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see https://github.com/dask/dask/issues/1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories.
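A rough sketch of the post-processing approach described above: parse with `dtype='category'` as today, then loop over the requested columns and call `set_categories`. This is only an illustration of the idea, not the eventual pandas implementation; `apply_categories` and the shape of `spec` are hypothetical names made up for this example.

```python
import io

import pandas as pd


def apply_categories(df, spec):
    """Hypothetical helper: `spec` maps column name -> (categories, ordered)."""
    for col, (categories, ordered) in spec.items():
        # set_categories keeps existing values and adds/removes categories;
        # values outside `categories` would become NaN.
        df[col] = df[col].cat.set_categories(categories, ordered=ordered)
    return df


# Stand-in for a file (or a dask partition) that never contains 'c'.
csv = io.StringIO("col\na\nb\na\n")
df = pd.read_csv(csv, dtype={"col": "category"})  # categories inferred: a, b
df = apply_categories(df, {"col": (["a", "b", "c"], True)})
```

After the call, `col` carries all three categories even though `c` never appeared in the data, which is the property the dask use case needs.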
Comment From: chris-b1
If it matters, there would be at least a little performance to be picked up by modifying the parsing code - you could pass the categories in here:
https://github.com/pandas-dev/pandas/blob/7a2bcb6605bacea858ec14cfac424898deb568b3/pandas/parser.pyx#L1523
And shortcut building the categories array, sorting, etc. It would also cause an error to be thrown much earlier if the data has a value not in the specified categories.
Comment From: TomAugspurger
Yes, I think you're right that it'd be better to do it in the Cython code. I was digging through there last week, and it shouldn't be too much extra effort.
Though I guess we'll need to have an API discussion about what to do if the user passes dtype={'A': pd.Categorical(['a', 'b'])} and a value outside those shows up.
1. Throw an exception
2. Set to NA
The second option would be consistent with set_categories.
In [1]: pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])
Out[1]:
[a, b, NaN]
Categories (2, object): [a, b]
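To make the two options concrete, here is a small sketch of both behaviours outside the parser. The function name `to_categorical` and the `on_unknown` parameter are hypothetical, invented for this illustration; nothing like them is proposed for read_csv here.

```python
import pandas as pd


def to_categorical(values, categories, on_unknown="na"):
    """Hypothetical helper showing both ways to treat out-of-category values."""
    values = pd.Series(values, dtype=object)
    # Missing values are not "unknown"; only flag real values outside the list.
    unknown = values.notna() & ~values.isin(categories)
    if on_unknown == "raise" and unknown.any():
        bad = sorted(values[unknown].unique())
        raise ValueError(f"values not in categories: {bad}")
    # Constructing with explicit categories maps anything else to NaN,
    # matching what set_categories does.
    return pd.Categorical(values, categories=categories)


as_na = to_categorical(["a", "b", "c"], ["a", "b"])  # option 2: 'c' -> NaN

try:
    to_categorical(["a", "b", "c"], ["a", "b"], on_unknown="raise")  # option 1
    raised = False
except ValueError:
    raised = True
```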
Comment From: jbrockmendel
Users can specify CategoricalDtype(...). I'd rather not complicate the API to save a few keystrokes.
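For reference, a minimal example of the CategoricalDtype spelling, using an in-memory CSV in place of `path`. The column name and data are made up for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories and ordering are fixed up front instead of inferred from the data.
col_dtype = CategoricalDtype(["a", "b", "c"], ordered=True)

csv = io.StringIO("col\na\nb\na\n")  # stand-in for a real file path
df = pd.read_csv(csv, dtype={"col": col_dtype})
```

This covers the ordered case from the original proposal without adding new dtype syntax to read_csv.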