Pandas Multiarray searchsorted fails

Code Sample, a copy-pastable example if possible

pandas.MultiIndex([[0],["a"]], [[0],[0]]).searchsorted((1,"b"))

Problem description

The entry (1,"b") should come after the existing (0,"a") in the MultiIndex. (Alternatively, MultiIndex could throw a clean error message.) Instead, an intransparent exception is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/site-packages/pandas/core/base.py", line 1156, in searchsorted
    return self.values.searchsorted(key, side=side, sorter=sorter)
TypeError: unorderable types: tuple() > str()

This is because Index.searchsorted naïvely passes its arguments to numpy.searchsorted, which is unaware that its second argument is a sequence of tuples, not a plain array just of dimension one higher.

Expected Output

Output of `pd.show_versions()`

# Paste the output here pd.show_versions() here NSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.8.10-1-ARCH machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8 pandas: 0.19.1 nose: 1.3.7 pip: 9.0.1 setuptools: 28.8.0 Cython: 0.25.1 numpy: 1.11.2 scipy: None statsmodels: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.6.0 pytz: 2016.7 blosc: None bottleneck: None tables: None numexpr: None matplotlib: 1.5.3 openpyxl: None xlrd: 1.0.0 xlwt: 0.8.0 xlsxwriter: None lxml: None bs4: 4.5.1 html5lib: 0.9999999 httplib2: 0.9.2 apiclient: None sqlalchemy: 1.1.4 pymysql: None psycopg2: None jinja2: 2.8 boto: None pandas_datareader: None

Comment From: jreback

I suppose could just disable this. numpy doesn't undertsand object array searchsorted generally

In [4]: pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b9352c7b8bea> in <module>()
----> 1 pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))

TypeError: unorderable types: tuple() > str()

maybe just raise a NotImplementedError. This is pretty much a useless operation anyhow, you always to search by levels via indexing

Comment From: ritwickdsouza

Is this issue still available to fix ?

Comment From: jreback

yes,

note that we should simply define .searchsorted in pandas/indexes/multi.py and use the direct indexers, .get_indexer, which is way more efficient (as its hashtable based).

Comment From: ritwickdsouza

@jreback I am new to pandas, could you throw some light on how i can use .get_indexer to implement .searchsorted ?

Comment From: jreback

.get_indexer returns the indexer , IOW the location of the point. -1 marks not found items. This works on any multi-index. Note these don't even have to be sorted (but is more efficient if they are).

In [2]: i = pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (1, 'a')])

In [3]: i
Out[3]: 
MultiIndex(levels=[[0, 1], ['a', 'b']],
           labels=[[0, 0, 1], [0, 1, 0]])

In [4]: i.values
Out[4]: array([(0, 'a'), (0, 'b'), (1, 'a')], dtype=object)

In [5]: i.get_indexer([(0,'b'), (1, 'a'), (2, 'c')])
Out[5]: array([ 1,  2, -1])

Here's what searchsorted does; I am using an interger array because numpy doesn't play nice with tuples. It returns the indexer of the match (IOW where it is in the array). Note if something is not found it returns the last index before that (which is really unintuitve!)

In [6]: np.array([1, 2, 3]).searchsorted([2, 3])
Out[6]: array([1, 2])

In [7]: np.array([1, 2, 3]).searchsorted([2, 3, 5])
Out[7]: array([1, 2, 3])

Comment From: bhavybarca

@jreback this issue still open ?

Comment From: jreback

yes

Comment From: bhavybarca

@TomAugspurger @jreback i am a little bit confused to what should i do exactly, i mean should i simply raise a NotImplementedError as said by @jreback ? or should i replace the searchsorted by get_indexer and somehow get a value for even tuples as said is issue example

Comment From: TomAugspurger

If supporting searchsorted is an option that makes sense. https://github.com/pandas-dev/pandas/issues/14833#issuecomment-276719307 indicates that that can be done using get_indexer.

Comment From: btel

Since there was no activity since March, I would be interested in working on this issue. It might be nice to implement searchsorted as suggested by @jreback, but the issue I see is that numpy's searchsorted can give the location of an element in a sorted array that would keep the sort order even if the element does not exist. Here is an example:

>>> np.searchsorted([1, 3, 5], [2])
array([1])

From what I understand .get_indexer will simply return -1 (element not found). Naive implementation of numpy's behaviour might use bisect module from Python's standard library for the not-found elements, but it would be rather inefficient.

Interestingly, numpy's searchsorted can also work with tuples if we define an appropriate dtype:

>>> dtype = [("int", 'i8'), ("str", "U1" )]
>>> arr = np.array([(0, 'a'), (0, 'b'), (1, 'c')], dtype=dtype)
>>> arr.searchsorted(np.array([(1, 'a')], dtype=dtype))
array([2])

A possible implementation of searchsorted would then coerce the multiindex to an ndarray with adapted dtype, and use the numpy's builtin searchsorted. What do you think?

Comment From: jreback

so this might be easier now as the implementation of MI was recently refactored to directly keep the underlying codes in the cython table

Comment From: SaturnFromTitan

take

Comment From: Condielj

take

Comment From: GSAUC3

Hi, is anyone still working on this, or may I take it up? if the answer is NO, i.e. no one is working on this, then i have a couple of question:

btel suggested in https://github.com/pandas-dev/pandas/issues/14833#issuecomment-430431631

This can be one way to handle it. but it assumes the input array to be of 2-dimensional. Should the input array be restricted to 2 dimensional ? May i go ahead with this implementation, or should i just simple raise NotImplementedError ?

Comment From: GSAUC3

take

Comment From: GSAUC3

@jreback hi, I am relatively new to open source, and i saw there was no activity on this, since march 2022, so i thought of taking this issue up. I see that @mroeschke has removed this from contributions welcome. I am not entirely sure what exactly it means. Does that mean no contributions will be accepted? Apologies if this is a silly doubt, and thanks for your guidance in advance.

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`