Code Sample, a copy-pastable example if possible
pandas.MultiIndex([[0],["a"]], [[0],[0]]).searchsorted((1,"b"))
Problem description
The entry (1,"b") should come after the existing (0,"a") in the MultiIndex. (Alternatively, MultiIndex could raise a clean error message.) Instead, an opaque exception is raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/site-packages/pandas/core/base.py", line 1156, in searchsorted
return self.values.searchsorted(key, side=side, sorter=sorter)
TypeError: unorderable types: tuple() > str()
This is because Index.searchsorted naïvely passes its arguments to numpy.searchsorted, which does not know that the key is a tuple (or a sequence of tuples) and instead treats it as a plain array with one extra dimension.
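The failure can be reproduced without pandas. The following is a minimal sketch, assuming that a one-dimensional object array of tuples is a fair stand-in for what MultiIndex.values returns; the variable names are illustrative only.

```python
import numpy as np

# Build a 1-D object array holding one tuple, standing in for MultiIndex.values.
values = np.empty(1, dtype=object)
values[0] = (0, "a")

key = (1, "b")

# NumPy turns the tuple key into a 2-element array, i.e. two separate needles.
key_shape = np.asarray(key).shape

try:
    values.searchsorted(key)
    error = None
except TypeError as exc:
    # Each needle is compared against the stored tuple, which is not orderable.
    error = exc

print(key_shape, type(error).__name__)
```

The exact wording of the error message differs between Python versions, so the sketch only records the exception type.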
Expected Output
1
Output of pd.show_versions()
Comment From: jreback
I suppose we could just disable this. numpy doesn't understand object array searchsorted generally:
In [4]: pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-b9352c7b8bea> in <module>()
----> 1 pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))
TypeError: unorderable types: tuple() > str()
Maybe just raise a NotImplementedError. This is pretty much a useless operation anyhow; you always want to search by levels via indexing.
Comment From: ritwickdsouza
Is this issue still available to fix?
Comment From: jreback
Yes. Note that we should simply define .searchsorted in pandas/indexes/multi.py and use the direct indexers, .get_indexer, which is way more efficient (as it's hashtable based).
Comment From: ritwickdsouza
@jreback I am new to pandas. Could you shed some light on how I can use .get_indexer to implement .searchsorted?
Comment From: jreback
.get_indexer returns the indexer, IOW the location of each point; -1 marks items that were not found. This works on any MultiIndex. Note that the index doesn't even have to be sorted (but it is more efficient if it is).
In [2]: i = pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (1, 'a')])
In [3]: i
Out[3]:
MultiIndex(levels=[[0, 1], ['a', 'b']],
labels=[[0, 0, 1], [0, 1, 0]])
In [4]: i.values
Out[4]: array([(0, 'a'), (0, 'b'), (1, 'a')], dtype=object)
In [5]: i.get_indexer([(0,'b'), (1, 'a'), (2, 'c')])
Out[5]: array([ 1, 2, -1])
Here's what searchsorted does; I am using an integer array because numpy doesn't play nice with tuples. It returns the indexer of the match (IOW where it is in the array). Note that if something is not found, it returns the position where the value would have to be inserted to keep the array sorted (which is really unintuitive!)
In [6]: np.array([1, 2, 3]).searchsorted([2, 3])
Out[6]: array([1, 2])
In [7]: np.array([1, 2, 3]).searchsorted([2, 3, 5])
Out[7]: array([1, 2, 3])
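A small sketch of the insertion-point semantics, using only NumPy. The array and variable names are illustrative; `side` is the standard searchsorted parameter that selects which end of a run of equal values is returned.

```python
import numpy as np

arr = np.array([1, 3, 5])

# Value present: side="left" gives the index of the match,
# side="right" gives the slot just after it.
left = arr.searchsorted(3, side="left")
right = arr.searchsorted(3, side="right")

# Values absent: the result is the position at which each value
# would have to be inserted to keep the array sorted.
missing = arr.searchsorted([0, 2, 6])

print(left, right, missing)
```

A result equal to len(arr) therefore means "larger than every element", not "found at the last position".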
Comment From: bhavybarca
@jreback is this issue still open?
Comment From: jreback
yes
Comment From: bhavybarca
@TomAugspurger @jreback I am a little confused about what exactly I should do. Should I simply raise a NotImplementedError, as suggested by @jreback? Or should I replace searchsorted with get_indexer and somehow return a value even for tuples that are not in the index, as in the issue example?
Comment From: TomAugspurger
If supporting searchsorted is an option, that makes sense. https://github.com/pandas-dev/pandas/issues/14833#issuecomment-276719307 indicates that it can be done using get_indexer.
Comment From: btel
Since there has been no activity since March, I would be interested in working on this issue. It might be nice to implement searchsorted as suggested by @jreback, but the issue I see is that numpy's searchsorted returns the position at which an element would have to be inserted to keep a sorted array sorted, even if the element does not exist. Here is an example:
>>> np.searchsorted([1, 3, 5], [2])
array([1])
From what I understand, .get_indexer will simply return -1 (element not found). A naive implementation of numpy's behaviour might use the bisect module from Python's standard library for the not-found elements, but it would be rather inefficient.
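For illustration, the bisect idea can be sketched on a plain, lexicographically sorted list of tuples. This is only a sketch under that assumption, not pandas code; `searchsorted_tuples` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right


def searchsorted_tuples(sorted_tuples, key, side="left"):
    """Hypothetical helper: insertion point of `key` in a sorted list of tuples."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    find = bisect_left if side == "left" else bisect_right
    # Python compares tuples element by element, i.e. lexicographically.
    return find(sorted_tuples, key)


entries = [(0, "a"), (0, "b"), (1, "a")]

print(searchsorted_tuples(entries, (0, "b")))                # existing entry
print(searchsorted_tuples(entries, (0, "b"), side="right"))  # slot after it
print(searchsorted_tuples(entries, (1, "b")))                # absent entry
print(searchsorted_tuples([(0, "a")], (1, "b")))             # case from the report
```

Each lookup is a binary search in pure Python, which is why this is likely to be slow for many keys compared with a vectorised approach.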
Interestingly, numpy's searchsorted can also work with tuples if we define an appropriate dtype:
>>> dtype = [("int", 'i8'), ("str", "U1" )]
>>> arr = np.array([(0, 'a'), (0, 'b'), (1, 'c')], dtype=dtype)
>>> arr.searchsorted(np.array([(1, 'a')], dtype=dtype))
array([2])
A possible implementation of searchsorted would then coerce the MultiIndex to an ndarray with an adapted dtype and use numpy's built-in searchsorted. What do you think?
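A rough sketch of that coercion, working on plain lists of tuples instead of a real MultiIndex. `searchsorted_structured` is a hypothetical helper, and the dtype is supplied by hand here; how to derive a suitable dtype from the index levels is left open.

```python
import numpy as np


def searchsorted_structured(tuples, keys, dtype, side="left"):
    """Hypothetical helper: search sorted tuples via a structured NumPy array."""
    # Both the haystack and the needles are coerced to the same structured
    # dtype, so NumPy compares records field by field.
    haystack = np.array(tuples, dtype=dtype)
    needles = np.array(keys, dtype=dtype)
    return haystack.searchsorted(needles, side=side)


# Assumed dtype: one 64-bit integer level and one single-character string level.
dtype = [("int", "i8"), ("str", "U1")]

reported = searchsorted_structured([(0, "a")], [(1, "b")], dtype)
example = searchsorted_structured([(0, "a"), (0, "b"), (1, "c")], [(1, "a")], dtype)

print(reported, example)
```

One caveat of this approach is that a fixed-width string field such as "U1" silently truncates longer values, so the dtype has to be wide enough for every level.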
Comment From: jreback
So this might be easier now, as the implementation of MI was recently refactored to directly keep the underlying codes in the cython table.
Comment From: SaturnFromTitan
take
Comment From: Condielj
take
Comment From: GSAUC3
Hi, is anyone still working on this, or may I take it up? If no one is working on this, then I have a couple of questions:
btel suggested an approach in https://github.com/pandas-dev/pandas/issues/14833#issuecomment-430431631
That can be one way to handle it, but it assumes the input array is 2-dimensional. Should the input array be restricted to 2 dimensions? May I go ahead with this implementation, or should I simply raise NotImplementedError?
Comment From: GSAUC3
take
Comment From: GSAUC3
@jreback Hi, I am relatively new to open source. I saw there had been no activity on this since March 2022, so I thought of taking this issue up. I see that @mroeschke has removed this from contributions welcome, and I am not entirely sure what that means. Does it mean no contributions will be accepted? Apologies if this is a silly question, and thanks in advance for your guidance.