Pandas indexing with a Categorical of Intervals is inefficient

This line converts the IntervalIndex into a numpy object array:

https://github.com/pandas-dev/pandas/blob/faf3bbb1d7831f7db8fc72b36f3e83e7179bb3f9/pandas/core/dtypes/dtypes.py#L520

Then, in the block linked below, a TypeError is raised, which causes that object array to be converted into strings:

TypeError: (-0.00872, 0.439] of type <class 'pandas._libs.interval.Interval'> is not a valid type for hashing, must be string or null

https://github.com/pandas-dev/pandas/blob/faf3bbb1d7831f7db8fc72b36f3e83e7179bb3f9/pandas/core/util/hashing.py#L333-L339

Comment From: jbrockmendel

Does hash_array get called during indexing?

Comment From: flying-sheep

Yeah, when the indexed data frame’s .index is unique:

import pandas as pd

df = pd.DataFrame(dict(a=range(3)), pd.cut(range(3), 3))
assert df.index.is_unique  # bug only triggers if this is the case

df.loc[df.index.categories[:2]]

Set a breakpoint in the except TypeError branch of _hash_ndarray and run the above in a debugger; the breakpoint will be hit.

I discovered this because in some older versions of pandas or numpy, the vals.astype(str).astype(object) call emitted a RuntimeWarning about “invalid values encountered in cast”. This no longer happens, but I think the cast to strings probably shouldn't happen here at all.
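To make the string fallback visible without a debugger, here is a minimal sketch. It assumes that the public pd.util.hash_array with categorize=False reaches the same except TypeError branch as the internal call, and the variable names are only illustrative. If that assumption holds, hashing the object array of Intervals should give the same result as hashing their string representations:

```python
import numpy as np
import pandas as pd

# Same categories as in the reproducer above: an IntervalIndex of three Intervals.
categories = pd.cut(range(3), 3).categories

# Object array of Interval scalars, mirroring the conversion described in the OP.
vals = np.asarray(categories)

# categorize=False skips factorization, so the object array is hashed directly.
hashed = pd.util.hash_array(vals, categorize=False)

# Hash the string representations explicitly for comparison.
as_str = vals.astype(str).astype(object)
hashed_as_str = pd.util.hash_array(as_str, categorize=False)

# If the fallback is taken, both hashes should agree.
same = np.array_equal(hashed, hashed_as_str)
print(same)
```

If the two arrays agree, the Intervals are effectively being hashed as strings, which is the cast I think should be avoided.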

Comment From: jbrockmendel

Looks like in a .equals check we go through categories_match_up_to_permutation, which compares the hash of each dtype, and computing that hash goes through the path described in the OP.
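A rough sketch of what that comparison relies on, using only the public CategoricalDtype API. It assumes that hash() on a CategoricalDtype hashes its categories as described above; the internal categories_match_up_to_permutation helper is not called directly here:

```python
import pandas as pd

# Three intervals: (0, 1], (1, 2], (2, 3]
intervals = pd.interval_range(0, 3)

# Two unordered dtypes with the same categories in a different order.
dtype_a = pd.CategoricalDtype(categories=intervals)
dtype_b = pd.CategoricalDtype(categories=intervals[::-1])

# Each hash() call has to hash the Interval categories.
same_hash = hash(dtype_a) == hash(dtype_b)

# Unordered categorical dtypes compare equal regardless of category order.
equal = dtype_a == dtype_b

print(same_hash, equal)
```

Each hash() call here has to hash all the Interval categories, so any check built on these hashes pays for the string conversion every time it runs.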