Code Sample, a copy-pastable example if possible
```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas.api.types import CategoricalDtype

In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')

In [5]: s1
Out[5]:
0   NaN
1   NaN
dtype: category
Categories (0, float64): []

In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))

In [7]: s2
Out[7]:
0   NaN
1   NaN
dtype: category
Categories (0, object): []

In [8]: pd.api.types.union_categoricals([s1, s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1, s2])

C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
    361     if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
    362                for other in to_union[1:]):
--> 363         raise TypeError("dtype of categories must be the same")
    364
    365     ordered = False

TypeError: dtype of categories must be the same
```
Problem description
In the above, if you convert a Series using `astype('category')`, and the Series has all `NaN` values, the underlying dtype of the categories is `float64`, while if you pass `CategoricalDtype([])`, the underlying dtype is `object`.
There are a couple of issues that I don't know how to deal with:

- If you have categories of a certain underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the underlying dtype of the categories backing `s1` to be `object`)
- You can't specify the dtype of the underlying categories in the `CategoricalDtype` constructor
Now, you might ask, why does this matter? Let's suppose I have data that I know to be categorical, and I have missing values, and I want to use `union_categoricals()` to merge the categories of two different Series that are both category dtype, and each Series was constructed using `astype('category')`. Let's say that one had all missing values and has underlying dtype `float64`, and the second one had strings and missing values, so it ends up with dtype `O`; then I can't do `union_categoricals()` on them.

I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a `CategoricalDtype`.

Alternatively, maybe `union_categoricals()` should be smart enough that, when one of the categoricals being unioned has no categories at all, it ignores that side's dtype when doing the union.
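One workaround along those lines is to rebuild the all-NaN side with empty `object`-dtype categories before the union, so both sides agree — a sketch, not an official recipe:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series([np.nan, np.nan]).astype('category')  # categories dtype: float64
s2 = pd.Series(['a', np.nan]).astype('category')     # categories dtype: object

# Rebuild the all-NaN side with empty object-dtype categories so both
# sides of the union agree on the categories' dtype.
s1_fixed = s1.cat.set_categories(pd.Index([], dtype=object))

result = union_categoricals([s1_fixed, s2])
print(result.categories.dtype)  # object
```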
Output of pd.show_versions()
Comment From: TomAugspurger
> You can't specify the dtype of the underlying categories in the `CategoricalDtype` constructor
You can with

```python
In [24]: pd.api.types.CategoricalDtype(categories=pd.Index([], dtype=int)).categories
Out[24]: Int64Index([], dtype='int64')
```
`CategoricalDtype.categories` is just an index. Would you want to accept a `dtype` parameter in `CategoricalDtype` that's passed through?
> are both category dtype, and each Series was constructed using `astype('category')`
I think that's the root issue. `.astype('category')` is going to use inference, which can fail. If you want full control you'll have to be explicit.
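Being explicit could mean attaching the categories (and therefore their dtype) to the dtype object up front, so an all-NaN chunk no longer depends on inference — a minimal sketch, with the empty `object` index as the assumption:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Carry the categories' dtype explicitly: an empty object-dtype Index
# instead of letting astype('category') infer float64 from all-NaN data.
explicit = CategoricalDtype(categories=pd.Index([], dtype=object))

s = pd.Series([np.nan, np.nan]).astype(explicit)
print(s.cat.categories.dtype)  # object
```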
Comment From: Dr-Irv
> Would you want to accept a `dtype` parameter in `CategoricalDtype` that's passed through?
Yes, I think that would help.
The other thing that would help is if `union_categoricals` would accept the union of two categoricals where the `dtype` was different and one of the categoricals was empty. Then the result could have the `dtype` of the one that had items in it. The reason I need this is that I'm reading a large file in chunks, and I know which columns are category columns, and want to keep doing `union_categoricals` as new categories are discovered, and if a chunk was all missing values, have the types correctly inferred. (See my comment here: https://github.com/pandas-dev/pandas/issues/14177#issuecomment-417351304)
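The chunked workflow described here can be sketched with in-memory data standing in for the large file (the column name and values are made up for illustration):

```python
import io
import pandas as pd
from pandas.api.types import union_categoricals

# Stand-in for a large file read in chunks.
csv = "city\nParis\nLondon\nParis\nTokyo\n"

merged = None
for chunk in pd.read_csv(io.StringIO(csv), chunksize=2):
    cat = chunk['city'].astype('category')
    if merged is None:
        merged = cat.values
    else:
        # Merge the categories discovered so far with this chunk's.  This
        # is the call that raises TypeError when one chunk was all-NaN
        # (float64 categories) and another had strings (object categories).
        merged = union_categoricals([merged, cat.values])

print(sorted(merged.categories))  # ['London', 'Paris', 'Tokyo']
```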
Comment From: TomAugspurger
I'd prefer to avoid special casing empty / all-NaN columns.
I think adding a `dtype` keyword to the `CategoricalDtype` constructor would be fine, with a default of float for backwards compatibility.
Comment From: Dr-Irv
> I think adding a `dtype` keyword to the CategoricalDtype constructor would be fine, with a default of float for backwards compatibility.
I think the default would have to be `infer`, since if you pass no NaNs, then the dtype is inferred from the type of the passed categories. Then if all the values are NaN, it defaults to float.
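For reference, the inference behavior an `infer` default would preserve — values present drive the category dtype, and all-NaN input falls back to float:

```python
import numpy as np
import pandas as pd

# Values present: category dtype inferred from the data (object here).
inferred = pd.Series(['a', 'b']).astype('category')
print(inferred.cat.categories.dtype)  # object

# All NaN: nothing to infer from, so categories fall back to float64.
all_nan = pd.Series([np.nan, np.nan]).astype('category')
print(all_nan.cat.categories.dtype)   # float64
```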