In[14]: import pandas as pd
In[15]: import numpy as np
In[16]: s = pd.Series([1,2,3,1,2,3]).astype("category")
In[17]: s
Out[17]:
0 1
1 2
2 3
3 1
4 2
5 3
dtype: category
Categories (3, int64): [1 < 2 < 3]
In[18]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-18-607a91e2a828>", line 1, in <module>
np.issubdtype(s.dtype, np.bool_)
File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
This is a problem in https://github.com/pydata/patsy/pull/47
Not sure if there is an easy way to get numpy to understand this (I've absolutely no numpy-fu :-/ ). If not, this means that every patsy/statsmodels method which does dtype magic has to guard against the category dtype :-/
Comment From: jreback
numpy doesn't understand this
you could do
isinstance(s.dtype, np.dtype) and np.issubdtype(s.dtype, np.bool_)
Comment From: jankatins
Just understood that this is again the dtype("category")
problem :-(
Comment From: jreback
yep, you can also do com.is_categorical_dtype(....)
which is safe.
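A minimal sketch of that guard, using the public isinstance check against pd.CategoricalDtype (the modern spelling of the pandas.core.common helper mentioned above); is_bool_dtype_safe is a hypothetical name, not a pandas API:

```python
import numpy as np
import pandas as pd

def is_bool_dtype_safe(dtype):
    """Like np.issubdtype(dtype, np.bool_), but tolerates categoricals."""
    # Dispatch on the pandas extension dtype first, before numpy sees it
    if isinstance(dtype, pd.CategoricalDtype):
        return False
    return np.issubdtype(dtype, np.bool_)

s = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")
print(is_bool_dtype_safe(s.dtype))          # False, instead of a TypeError
print(is_bool_dtype_safe(np.dtype(bool)))   # True
```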
Comment From: jankatins
The problem is more that every other dtype check is prone to raise :-(
This is the line which raises (https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L185)
# fastpath to avoid doing an item-by-item iteration over boolean
# arrays, as requested by #44
if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_): # <= !!! this line !!!
self._level_set = set([True, False])
return True
-> so if you want to work with pandas categorical data, you now have to watch out for any (arbitrary) method that errors out when it gets a non-official dtype :-/
Comment From: jankatins
IMO the most user friendly solution would be to monkey-patch this numpy function to check for categorical first...
Comment From: jreback
you need to dispatch on categorical before this and treat it separately. Code that deals with multiple dtypes needs to be aware. You can simply np.asarray everything if you want.
You cannot patch a c-level function.
Comment From: jankatins
np.issubdtype
is not a c level function :-)
https://github.com/numpy/numpy/blob/master/numpy/core/numerictypes.py#L736
def issubdtype(arg1, arg2):
    if issubclass_(arg2, generic):
        return issubclass(dtype(arg1).type, arg2)
    mro = dtype(arg2).type.mro()
    if len(mro) > 1:
        val = mro[1]
    else:
        val = mro[0]
    return issubclass(dtype(arg1).type, val)
Comment From: jankatins
The problem is that this (and probably other) code worked great before pandas 0.15, and still works great as long as no categorical comes along, but breaks if you pass in a categorical. So, more or less, the categorical introduction breaks unrelated code. :-(
Comment From: jreback
not sure what to tell you
the numpy functions are just not safe
you might have better luck doing
np.asarray(...)
then checking that dtype
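A sketch of that np.asarray route: coerce first, then inspect the resulting plain ndarray's dtype (for the example series above the categories are ints, so the coerced dtype is integral):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1, 2, 3]).astype("category")

# np.asarray materializes the category values as a plain ndarray,
# whose dtype numpy does understand
arr = np.asarray(s)
print(np.issubdtype(arr.dtype, np.bool_))     # False, no TypeError
print(np.issubdtype(arr.dtype, np.integer))   # True: the categories are ints
```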
Comment From: jankatins
So, push it to numpy-land?
Comment From: jorisvandenbossche
Couldn't this be solved by returning the CategoricalDtype object instead of a string for s.dtype?
In [86]: catdt = pd.core.categorical.CategoricalDtype
In [87]: np.issubdtype(catdt, np.bool_)
Out[87]: False
EDIT: I mean an instance of the object instead of string -> so returning the class instead of an instance for s.dtype
Comment From: jreback
@JanSchulz I would explicitly check for a Categorical in patsy. You need to handle them specially anyhow.
Comment From: jreback
@jorisvandenbossche It's already a dtype, just not in the numpy hierarchy (and they are not careful about the check)
In [1]: s = Series(list('abc')).astype('category')
In [2]: s.dtype
Out[2]: category
In [3]: type(s.dtype)
Out[3]: pandas.core.common.CategoricalDtype
Comment From: jankatins
I'm confused:
In[25]: type(s.dtype)
Out[25]: pandas.core.common.CategoricalDtype
In[26]: catdt = pd.core.categorical.CategoricalDtype
In[27]: np.issubdtype(catdt, np.bool_)
Out[27]: False
In[28]: np.issubdtype(s.dtype, np.bool_)
Traceback (most recent call last):
File "c:\data\external\ipython\IPython\core\interactiveshell.py", line 3032, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-607a91e2a828>", line 1, in <module>
np.issubdtype(s.dtype, np.bool_)
File "C:\portabel\miniconda\envs\ipython\lib\site-packages\numpy\core\numerictypes.py", line 763, in issubdtype
return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood
In[29]: type(s.dtype) is type(catdt)
Out[29]: False
In[30]: type(s.dtype), type(catdt)
Out[30]: (pandas.core.common.CategoricalDtype, type)
In[31]: type(np.bool_)
Out[31]: type
Comment From: jankatins
So if we used the class instead of an instance, it would work?
Comment From: njsmith
Pandas did just go ahead and totally break tons of unrelated code here :-( It's pretty ugly to say that even simple checks that have nothing to do with categoricals (like "is this a boolean?"), and which have been written in the same way for as long as numpy has existed, suddenly must always check for categoricals explicitly.
Comment From: njsmith
E.g. patsy has 12 calls to issubdtype, and every one of them now needs to be audited to make sure they don't start raising random errors.
Comment From: njsmith
Also, sorry for comment spam, but I just noticed this while writing a safe issubdtype wrapper: the suggestion of replacing np.issubdtype(foo, bar) with isinstance(foo, np.dtype) and np.issubdtype(foo, bar) is not correct in general -- e.g. issubdtype(int, np.integer) is True, but the above expression will return False.
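One way to write such a wrapper that avoids both failure modes is to fall back on exception handling; safe_issubdtype is a hypothetical name, not a numpy or patsy API:

```python
import numpy as np
import pandas as pd

def safe_issubdtype(arg1, arg2):
    """np.issubdtype, but returns False for dtypes numpy cannot interpret."""
    try:
        return np.issubdtype(arg1, arg2)
    except TypeError:
        # e.g. pandas' CategoricalDtype, which numpy's dtype() rejects
        return False

cat = pd.Series(list("abc"), dtype="category")
print(safe_issubdtype(cat.dtype, np.bool_))  # False rather than a TypeError
print(safe_issubdtype(int, np.integer))      # True: the case the isinstance guard broke
```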
Comment From: jorisvandenbossche
@njsmith Do you see other possible solutions to this problem, apart from making CategoricalDtype a real numpy dtype in C?
Comment From: jankatins
Is it actually possible to extend (by writing c code) numpy dtypes from outside numpy?
The other way I see is to really take every numpy dtype function, test whether it handles the new dtype correctly, and if not write a wrapper and monkey-patch numpy when pandas is imported...
Or write a real categorical numpy array... :-/
See also https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/doc/newdtype_example/floatint.c and https://github.com/scipy/scipy/wiki/GSoC-project-ideas
Comment From: jreback
you need to write something like
isinstance(obj, np.ndarray) and issubdtype(obj.dtype,np.bool_)
in your code
np.issubdtype is prob ok, though it should prob handle array-likes in a safer manner
Comment From: shoyer
I suppose the numpy-friendly way to have handled this would have been to use dtype=object for Categorical, and add some custom attribute like pandas_dtype. But that's pretty damn awkward, too... and not really something I want to expose in our API.
Comment From: cequencer
I am not sure if the issue I am running into below is identical to this issue or not. Here are the steps that raise the TypeError: data type not understood when it encounters this class 'pandas.core.dtypes.dtypes.CategoricalDtype':
In[1]: pd.show_versions(as_json=False)
Out[1]:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.10.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
In[2]: dframe = pd.DataFrame({"A":["a","b","c","a"]})
In[3]: dframe["B"] = dframe["A"].astype('category')
In[4]: dframe.dtypes
Out[4]:
A object
B category
dtype: object
In[5]: dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
Out[5]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-67-78973e46a59c> in <module>()
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
/lib/python2.7/site-packages/pandas/core/series.pyc in map(self, arg, na_action)
2352 else:
2353 # arg is a function
-> 2354 new_values = map_f(values, arg)
2355
2356 return self._constructor(new_values,
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-67-78973e46a59c> in <lambda>(x)
----> 1 dframe.dtypes.map((lambda x: np.issubdtype(x, np.datetime64)))
/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2
TypeError: data type not understood
In[6]:
for i, v in dframe.dtypes.iteritems():
    print i
    print v
    t = type(v)
    print t
    if np.issubdtype(v, np.datetime64):
        print i
    print "----------------------"
Out[6]:
A
object
<type 'numpy.dtype'>
----------------------
B
category
<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-71-24466a088749> in <module>()
4 t=type(v)
5 print t
----> 6 if np.issubdtype(v, np.datetime64):
7 print i
8 print "----------------------"
/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
724 """
725 if not issubclass_(arg1, generic):
--> 726 arg1 = dtype(arg1).type
727 if not issubclass_(arg2, generic):
728 arg2_orig = arg2
TypeError: data type not understood
The TypeError raised in step 5 did not make the issue clear to me until I iterated through all the data types in step 6 and saw it was choking on the 'CategoricalDtype'.
I am not sure if this is the correct forum to provide such information and I apologize in advance if it is not. If there is a better, more Pythonic way to do what I need, please let me know as well. Thank you!
Comment From: jreback
this is fundamentally a numpy issue; IIRC we opened one a few years ago about this
in any event pandas has lots of methods to introspect dtypes via the pandas.api.types namespace, so you shouldn't need to do this
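For example, the failing dtypes.map call from above can be written with the pandas.api.types helpers, which accept extension dtypes without raising (the "C" datetime column is added here just to show a positive match):

```python
import pandas as pd
from pandas.api import types as ptypes

dframe = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dframe["B"] = dframe["A"].astype("category")
dframe["C"] = pd.to_datetime(["2018-01-01"] * 4)

# is_datetime64_any_dtype handles object, category and datetime dtypes alike
result = dframe.dtypes.map(ptypes.is_datetime64_any_dtype)
print(result.tolist())  # [False, False, True]
```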
Comment From: toobaz
My understanding is that the way to check for dtypes (including custom ones) that works for both numpy and pandas and never raises is looking at their type attribute:
In [2]: issubclass(np.array([1, 2, 3]).dtype.type, np.integer)
Out[2]: True
In [3]: issubclass(np.array('abc').dtype.type, str)
Out[3]: True
... except that:
In [4]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.CategoricalDtype)
Out[4]: False
In [5]: pd.Series('abc', dtype='category').dtype.type == pd.CategoricalDtype
Out[5]: False
because
In [6]: pd.Series('abc', dtype='category').dtype.type.__mro__
Out[6]: (pandas.core.dtypes.dtypes.CategoricalDtypeType, type, object)
and indeed:
In [7]: issubclass(pd.Series('abc', dtype='category').dtype.type, pd.core.dtypes.dtypes.CategoricalDtypeType)
Out[7]: True
Maybe that's something we can easily fix on the pandas side, so that the check doesn't require digging beyond the pandas API?
What I mean is:
In [7]: np.array('abc', dtype=np.dtype('<U').type)
Out[7]: array('abc', dtype='<U3')
while:
In [8]: pd.Series('abc', dtype=pd.CategoricalDtype.type)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-b3184e65f353> in <module>
----> 1 pd.Series('abc', dtype=pd.CategoricalDtype.type)
~/nobackup/repo/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
237 data = {}
238 if dtype is not None:
--> 239 dtype = self._validate_dtype(dtype)
240
241 if isinstance(data, MultiIndex):
~/nobackup/repo/pandas/pandas/core/generic.py in _validate_dtype(self, dtype)
262
263 if dtype is not None:
--> 264 dtype = pandas_dtype(dtype)
265
266 # a compound dtype
~/nobackup/repo/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
1886 return npdtype
1887 elif npdtype.kind == "O":
-> 1888 raise TypeError(f"dtype '{dtype}' not understood")
1889
1890 return npdtype
TypeError: dtype '<class 'pandas.core.dtypes.dtypes.CategoricalDtypeType'>' not understood
... that is, our distinction between a dtype and its type is something that numpy avoids, and maybe we can avoid it too? (That's a real question, as I don't know the code well)
I would go as far as saying that if we can fix this, and consider the above as the recommended way for type checks, then we could maybe:
- provide some syntactic sugar for it (e.g. pd.Series.is_type(pd.CategoricalDtype))
- sooner or later, deprecate the pd.Series.dtype == 'category' comparison, which is incompatible with numpy string aliases (as it breaks the rule that if a == 'repr' and b == 'repr', then a == b)
Comment From: toobaz
Related: #8814
That settled the question for dataframe.dtypes == 'category'
, which however behaves differently from dataframe.dtypes.iloc[0] == 'category'
if the first column is for instance a bool (and we know that won't change on the numpy side), which is sad.
Comment From: jbrockmendel
There are now a bunch of ExtensionDtypes and we shouldn't expect any of them to play nicely with np.issubdtype. Closing as no-action.