Code Sample, a copy-pastable example if possible
```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas.api.types import CategoricalDtype

In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')

In [5]: s1
Out[5]:
0   NaN
1   NaN
dtype: category
Categories (0, float64): []

In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))

In [7]: s2
Out[7]:
0   NaN
1   NaN
dtype: category
Categories (0, object): []

In [8]: pd.api.types.union_categoricals([s1, s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1, s2])

C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
    361     if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
    362                for other in to_union[1:]):
--> 363         raise TypeError("dtype of categories must be the same")
    364
    365     ordered = False

TypeError: dtype of categories must be the same
```
Problem description
In the above, if you convert a Series using `astype('category')`, and the Series has all `NaN` values, the underlying dtype of the categories is `float64`, while if you pass `CategoricalDtype([])`, the underlying dtype is `object`.
There are a couple of issues that I don't know how to deal with:

- If you have categories of a certain underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the underlying dtype of the categories backing `s1` to be `object`)
- You can't specify the dtype of the underlying categories in the `CategoricalDtype` constructor
Now, you might ask, why does this matter? Let's suppose I have data that I know to be categorical, and I have missing values, and I want to use `union_categoricals()` to merge the categories of two different Series that are both category dtype, and each Series was constructed using `astype('category')`. Let's say that one had all missing values and has underlying dtype `float64`, and the second one had strings and missing values, so it ends up with dtype `O`; then I can't do `union_categoricals()` on them.

I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a `CategoricalDtype`.

Alternatively, maybe `union_categoricals()` should be smart enough that, when one of the categoricals being unioned has no categories at all, it ignores that side's dtype when doing the union.
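One workaround along those lines is to rebuild the all-NaN side with empty `object`-dtype categories before the union, so both sides agree — a sketch, not an official recipe:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series([np.nan, np.nan]).astype('category')  # categories dtype: float64
s2 = pd.Series(['a', np.nan]).astype('category')     # categories dtype: object

# Rebuild the all-NaN side with empty object-dtype categories so both
# sides of the union agree on the categories' dtype.
s1_fixed = s1.cat.set_categories(pd.Index([], dtype=object))

result = union_categoricals([s1_fixed, s2])
print(result.categories.dtype)  # object
```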
Output of pd.show_versions()
Comment From: TomAugspurger
> You can't specify the dtype of the underlying categories in the `CategoricalDtype` constructor
You can with

```python
In [24]: pd.api.types.CategoricalDtype(categories=pd.Index([], dtype=int)).categories
Out[24]: Int64Index([], dtype='int64')
```
`CategoricalDtype.categories` is just an index. Would you want to accept a `dtype` parameter in `CategoricalDtype` that's passed through?
> are both category dtype, and each Series was constructed using `astype('category')`
I think that's the root issue. `.astype('category')` is going to use inference, which can fail. If you want full control you'll have to be explicit.
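Being explicit could mean attaching the categories (and therefore their dtype) to the dtype object up front, so an all-NaN chunk no longer depends on inference — a minimal sketch, with the empty `object` index as the assumption:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Carry the categories' dtype explicitly: an empty object-dtype Index
# instead of letting astype('category') infer float64 from all-NaN data.
explicit = CategoricalDtype(categories=pd.Index([], dtype=object))

s = pd.Series([np.nan, np.nan]).astype(explicit)
print(s.cat.categories.dtype)  # object
```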
Comment From: Dr-Irv
> Would you want to accept a `dtype` parameter in `CategoricalDtype` that's passed through?
Yes, I think that would help.
The other thing that would help is if `union_categoricals` would accept the union of two categoricals where the `dtype` was different and one of the categoricals was empty. Then the result could have the `dtype` of the one that had items in it. The reason I need this is that I'm reading a large file in chunks, and I know which columns are category columns, and want to keep doing `union_categoricals` as new categories are discovered, and if a chunk was all missing values, have the types correctly inferred. (See my comment here: https://github.com/pandas-dev/pandas/issues/14177#issuecomment-417351304)
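The chunked workflow described here can be sketched with in-memory data standing in for the large file (the column name and values are made up for illustration):

```python
import io
import pandas as pd
from pandas.api.types import union_categoricals

# Stand-in for a large file read in chunks.
csv = "city\nParis\nLondon\nParis\nTokyo\n"

merged = None
for chunk in pd.read_csv(io.StringIO(csv), chunksize=2):
    cat = chunk['city'].astype('category')
    if merged is None:
        merged = cat.values
    else:
        # Merge the categories discovered so far with this chunk's.  This
        # is the call that raises TypeError when one chunk was all-NaN
        # (float64 categories) and another had strings (object categories).
        merged = union_categoricals([merged, cat.values])

print(sorted(merged.categories))  # ['London', 'Paris', 'Tokyo']
```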
Comment From: TomAugspurger
I'd prefer to avoid special casing empty / all-NaN columns.
I think adding a `dtype` keyword to the `CategoricalDtype` constructor would be fine, with a default of float for backwards compatibility.
Comment From: Dr-Irv
> I think adding a `dtype` keyword to the CategoricalDtype constructor would be fine, with a default of float for backwards compatibility.
I think the default would have to be `infer`, since if you pass no NaNs, then the dtype is inferred from the type of the passed categories. Then if all the values are NaN, it defaults to float.
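For reference, the inference behavior an `infer` default would preserve — values present drive the category dtype, and all-NaN input falls back to float:

```python
import numpy as np
import pandas as pd

# Values present: category dtype inferred from the data (object here).
inferred = pd.Series(['a', 'b']).astype('category')
print(inferred.cat.categories.dtype)  # object

# All NaN: nothing to infer from, so categories fall back to float64.
all_nan = pd.Series([np.nan, np.nan]).astype('category')
print(all_nan.cat.categories.dtype)   # float64
```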