In[14]: import pandas as pd
In[15]: import numpy as np
In[16]: s = pd.Series([1,2,3,1,2,3]).astype("category")
In[17]: s
Out[17]:
0 1
1 2
2 3
3 1
4 2
5 3
dtype: category
Categories (3, int64): [1 < 2 < 3]
In[18]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-18-607a91e2a828>", line 1, in <module>
np.issubdtype(s.dtype, np.bool_)
File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
This is a problem in https://github.com/pydata/patsy/pull/47
Not sure if there is an easy way to get numpy to understand this (I've absolutely no numpy-fu :-/ ). If not, this means that every patsy/statsmodels method which does dtype magic has to guard against the category dtype :-/
Comment From: jreback
numpy doesn't understand this
you could do
isinstance(s.dtype, np.dtype) and np.issubdtype(s.dtype, np.bool_)
Comment From: jankatins
Just understood that this is again the dtype("category")
problem :-(
Comment From: jreback
yep, you can also do com.is_categorical_dtype(....)
which is safe.
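A minimal sketch of that guard, using the public isinstance check against pd.CategoricalDtype (the modern spelling of the pandas.core.common helper mentioned above); is_bool_dtype_safe is a hypothetical name, not a pandas API:

```python
import numpy as np
import pandas as pd

def is_bool_dtype_safe(dtype):
    """Like np.issubdtype(dtype, np.bool_), but tolerates categoricals."""
    # Dispatch on the pandas extension dtype first, before numpy sees it
    if isinstance(dtype, pd.CategoricalDtype):
        return False
    return np.issubdtype(dtype, np.bool_)

s = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")
print(is_bool_dtype_safe(s.dtype))          # False, instead of a TypeError
print(is_bool_dtype_safe(np.dtype(bool)))   # True
```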
Comment From: jankatins
The problem is more that every other dtype check is prone to raise :-(
This is the line which raises (https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L185)
# fastpath to avoid doing an item-by-item iteration over boolean
# arrays, as requested by #44
if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_): # <= !!! this line !!!
self._level_set = set([True, False])
return True
-> so if you want to work with pandas categorical data, you now have to watch out for any (arbitrary) method that errors out when it gets a non-official dtype :-/
Comment From: jankatins
IMO the most user friendly solution would be to monkey-patch this numpy function to check for categorical first...
Comment From: jreback
you need to dispatch on categorical before this and treat it separately. Code that deals with multiple dtypes needs to be aware. You can simply np.asarray everything if you want.
You cannot patch a c-level function.
Comment From: jankatins
np.issubdtype
is not a c level function :-)
https://github.com/numpy/numpy/blob/master/numpy/core/numerictypes.py#L736
def issubdtype(arg1, arg2):
    if issubclass_(arg2, generic):
        return issubclass(dtype(arg1).type, arg2)
    mro = dtype(arg2).type.mro()
    if len(mro) > 1:
        val = mro[1]
    else:
        val = mro[0]
    return issubclass(dtype(arg1).type, val)
Comment From: jankatins
The problem is that this (and probably other) code worked great before pandas 0.15, and still works great as long as no categorical comes along, but breaks if you pass in a categorical. So, more or less, the categorical introduction breaks unrelated code. :-(
Comment From: jreback
not sure what to tell you
the numpy functions are just not safe
you might have better luck doing
np.asarray(...)
then checking that dtype
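A sketch of that np.asarray route: coerce first, then inspect the resulting plain ndarray's dtype (for the example series above the categories are ints, so the coerced dtype is integral):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")

# np.asarray materializes the category values as a plain ndarray,
# whose dtype numpy does understand
arr = np.asarray(s)
print(np.issubdtype(arr.dtype, np.bool_))     # False, no TypeError
print(np.issubdtype(arr.dtype, np.integer))   # True: the categories are ints
```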
Comment From: jankatins
So, push it to numpy-land?
Comment From: jorisvandenbossche
Couldn't this be solved by returning the CategoricalDtype object instead of a string for s.dtype?
In [86]: catdt = pd.core.categorical.CategoricalDtype
In [87]: np.issubdtype(catdt, np.bool_)
Out[87]: False
EDIT: I mean an instance of the object instead of string -> so returning the class instead of an instance for s.dtype
Comment From: jreback
@JanSchulz I would explicitly check for a Categorical in patsy. You need to handle them specially anyhow.
Comment From: jreback
@jorisvandenbossche It's already a dtype, just not in the numpy hierarchy (and they are not careful about the check)
In [1]: s = Series(list('abc')).astype('category')
In [2]: s.dtype
Out[2]: category
In [3]: type(s.dtype)
Out[3]: pandas.core.common.CategoricalDtype
Comment From: jankatins
I'm confused:
In[25]: type(s.dtype)
Out[25]: pandas.core.common.CategoricalDtype
In[26]: catdt = pd.core.categorical.CategoricalDtype
In[27]: np.issubdtype(catdt, np.bool_)
Out[27]: False
In[28]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-607a91e2a828>", line 1, in <module>
np.issubdtype(s.dtype, np.bool_)
File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
In[29]: type(s.dtype) is type(catdt)
Out[29]: False
In[30]: type(s.dtype), type(catdt)
Out[30]: (pandas.core.common.CategoricalDtype, type)
In[31]: type(np.bool_)
Out[31]: type
Comment From: jankatins
So if we used the class instead of an instance, it would work?
Comment From: njsmith
Pandas did just go ahead and totally break tons of unrelated code here :-( It's pretty ugly to say that even simple checks that have nothing to do with categoricals (like "is this a boolean?"), and which have been written in the same way for as long as numpy has existed, suddenly must always check for categoricals explicitly.
Comment From: njsmith
E.g. patsy has 12 calls to issubdtype, and every one of them now needs to be audited to make sure they don't start raising random errors.
Comment From: njsmith
Also, sorry for comment spam, but I just noticed this while writing a safe issubdtype wrapper: the suggestion of replacing np.issubdtype(foo, bar) with isinstance(foo, np.dtype) and np.issubdtype(foo, bar) is not correct in general -- e.g. issubdtype(int, np.integer) is True, but the above expression will return False.
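One way to write such a wrapper that avoids both failure modes is to fall back on exception handling; safe_issubdtype is a hypothetical name, not a numpy or patsy API:

```python
import numpy as np
import pandas as pd

def safe_issubdtype(arg1, arg2):
    """np.issubdtype, but returns False for dtypes numpy cannot interpret."""
    try:
        return np.issubdtype(arg1, arg2)
    except TypeError:
        # e.g. pandas' CategoricalDtype, which numpy's dtype() rejects
        return False

cat = pd.Series(list("abc"), dtype="category")
print(safe_issubdtype(cat.dtype, np.bool_))  # False rather than a TypeError
print(safe_issubdtype(int, np.integer))      # True: the case the isinstance guard broke
```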
Comment From: jorisvandenbossche
@njsmith Do you see other possible solutions to this problem, apart from making CategoricalDtype a real numpy dtype in C?
Comment From: jankatins
Is it actually possible to extend (by writing c code) numpy dtypes from outside numpy?
The other way I see is to really take every numpy dtype function, test whether it handles the new dtype correctly, and if not write a wrapper and monkey-patch numpy when pandas is imported...
Or write a real categorical numpy array... :-/
See also https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/doc/newdtype_example/floatint.c and https://github.com/scipy/scipy/wiki/GSoC-project-ideas
Comment From: jreback
you need to write something like
isinstance(obj, np.ndarray) and issubdtype(obj.dtype,np.bool_)
in your code
np.issubdtype is prob ok, though it should prob handle array-likes in a safer manner
Comment From: shoyer
I suppose the numpy-friendly way to have handled this would have been to use dtype=object for Categorical, and add some custom attribute like pandas_dtype. But that's pretty damn awkward, too... and not really something I want to expose in our API.
Comment From: cequencer
I am not sure if the issue I am running into below is identical to this issue or not. Here are the steps that raise the TypeError: data type not understood when it encounters this class 'pandas.core.dtypes.dtypes.CategoricalDtype':
In[1]: pd.show_versions(as_json=False)
Out[1]:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.10.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
In[2]: dframe = pd.DataFrame({"A":["a","b","c","a"]})
In[3]: dframe["B"] = dframe["A"].astype('category')
In[4]: dframe.dtypes
Out[4]:
A object
B category
dtype: object
In[5]: dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
Out[5]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-67-78973e46a59c> in <module>()
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
/lib/python2.7/site-packages/pandas/core/series.pyc in map(self, arg, na_action)
2352 else:
2353 # arg is a function
-> 2354 new_values = map_f(values, arg)
2355
2356 return self._constructor(new_values,
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-67-78973e46a59c> in <lambda>(x)
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2
TypeError: data type not understood
In[6]:
for i, v in dframe.dtypes.iteritems():
    print i
    print v
    t = type(v)
    print t
    if np.issubdtype(v, np.datetime64):
        print i
    print "----------------------"
Out[6]:
A
object
<type 'numpy.dtype'>
----------------------
B
category
<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-71-24466a088749> in <module>()
4 t=type(v)
5 print t
----> 6 if np.issubdtype(v, np.datetime64):
7 print i
8 print "----------------------"
/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2
TypeError: data type not understood
The TypeError raised in step 5 did not make the issue clear to me until I iterated through all the data types in step 6 and saw it was choking on the 'CategoricalDtype'.
I am not sure if this is the correct forum to provide such information and I apologize in advance if it is not. If there is a better, more Pythonic way to do what I need, please let me know as well. Thank you!
Comment From: jreback
this is fundamentally a numpy issue; IIRC we opened one a few years ago about this
in any event pandas has lots of methods to introspect dtypes via the pandas.api.types namespace, so you shouldn't need to do this
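For example, the failing dtypes.map call from above can be written with the pandas.api.types helpers, which accept extension dtypes without raising (the "C" datetime column is added here just to show a positive match):

```python
import pandas as pd
from pandas.api import types as ptypes

dframe = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dframe["B"] = dframe["A"].astype("category")
dframe["C"] = pd.to_datetime(["2018-01-01"] * 4)

# is_datetime64_any_dtype handles object, category and datetime dtypes alike
result = dframe.dtypes.map(ptypes.is_datetime64_any_dtype)
print(result.tolist())  # [False, False, True]
```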
Comment From: toobaz
My understanding is that the way to check for dtypes (including custom ones) that works for both numpy and pandas and never raises is looking at their type attribute:
In [2]: issubclass(np.array([1, 2, 3]).dtype.type, np.integer)
Out[2]: True
In [3]: issubclass(np.array('abc').dtype.type, str)
Out[3]: True
... except that:
In [4]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.CategoricalDtype)
Out[4]: False
In [5]: pd.Series('abc', dtype='category').dtype.type == pd.CategoricalDtype
Out[5]: False
because
In [6]: pd.Series('abc', dtype='category').dtype.type.__mro__
Out[6]: (pandas.core.dtypes.dtypes.CategoricalDtypeType, type, object)
and indeed:
In [7]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.core.dtypes.dtypes.CategoricalDtypeType)
Out[7]: True
Maybe that's something we can easily fix on the pandas side, so that the check doesn't require digging beyond the pandas API?
What I mean is:
In [7]: np.array('abc', dtype=np.dtype('<U').type)
Out[7]: array('abc', dtype='<U3')
while:
In [8]: pd.Series('abc', dtype=pd.CategoricalDtype.type)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-b3184e65f353> in <module>
----> 1 pd.Series('abc', dtype=pd.CategoricalDtype.type)
~/nobackup/repo/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
237 data = {}
238 if dtype is not None:
--> 239 dtype = self._validate_dtype(dtype)
240
241 if isinstance(data, MultiIndex):
~/nobackup/repo/pandas/pandas/core/generic.py in _validate_dtype(self, dtype)
262
263 if dtype is not None:
--> 264 dtype = pandas_dtype(dtype)
265
266 # a compound dtype
~/nobackup/repo/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
1886 return npdtype
1887 elif npdtype.kind == "O":
-> 1888 raise TypeError(f"dtype '{dtype}' not understood")
1889
1890 return npdtype
TypeError: dtype '<class 'pandas.core.dtypes.dtypes.CategoricalDtypeType'>' not understood
... that is, our distinction between a dtype and its type is something that numpy avoids, and maybe we can avoid it too? (That's a real question, as I don't know the code well)
I would go as far as saying that if we can fix this, and consider the above as the recommended way for type checks, then we could maybe:
- provide some syntactic sugar for it (e.g. pd.Series.is_type(pd.CategoricalDtype))
- sooner or later, deprecate the pd.Series.dtype == 'category' comparison, which is incompatible with numpy string aliases (as it breaks the rule that if a == 'repr' and b == 'repr', then a == b)
Comment From: toobaz
Related: #8814
That settled the question for dataframe.dtypes == 'category'
, which however behaves differently from dataframe.dtypes.iloc[0] == 'category'
if the first column is for instance a bool (and we know that won't change on the numpy side), which is sad.
Comment From: jbrockmendel
There are now a bunch of ExtensionDtypes and we shouldn't expect any of them to play nicely with np.issubdtype. Closing as no-action.