Pandas Indexing on series where index is output from pd.cut

Code Sample, a copy-pastable example if possible

import pandas as pd

data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
grouped = data.groupby("quintiles")["values"]
samples = grouped.agg(lambda x: list(x.sample(3)))

samples
# quintiles
# (-0.019, 3.8]       [3, 2, 1]
# (3.8, 7.6]          [7, 4, 6]
# (7.6, 11.4]       [8, 10, 11]
# (11.4, 15.2]     [15, 14, 12]
# (15.2, 19.0]     [17, 16, 19]
# Name: values, dtype: object

samples[0]
# [3, 2, 1]
samples[1]
# [3, 2, 1]
samples[2.3]
# [3, 2, 1]
samples[19]
# [17, 16, 19]

samples.keys()
# CategoricalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2],
#                   (15.2, 19.0]],
#                  categories=[(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], ordered=True, name='quintiles', dtype='category')

Problem description

Indexing a series object that has pd.cut() intervals as its index results in membership test of the provided index on the intervals, which I found interesting, but it was very unexpected. Is this intentional and documented somewhere? When I give the labels names (labels=list("abcde") in pd.cut), this doesn't happen anymore.

Expected Output

I expectived it to fall back on .iloc[].

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.3.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-957.12.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8 pandas: 0.24.2 pytest: 4.4.1 pip: 19.1.1 setuptools: 41.0.1 Cython: 0.29.6 numpy: 1.16.3 scipy: 1.2.1 pyarrow: None xarray: 0.12.1 IPython: 7.4.0 sphinx: None patsy: 0.5.1 dateutil: 2.8.0 pytz: 2019.1 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 3.1.0 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: 4.3.3 bs4: 4.7.1 html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None

Comment From: mroeschke

This looks to work on master now. Could use a test

In [7]: samples
Out[7]:
quintiles
(-0.019, 3.8]       [0, 2, 3]
(3.8, 7.6]          [7, 6, 5]
(7.6, 11.4]       [11, 8, 10]
(11.4, 15.2]     [13, 12, 15]
(15.2, 19.0]     [18, 16, 17]
Name: values, dtype: object

In [8]: samples[0]
Out[8]: [0, 2, 3]

In [9]: samples[1]
Out[9]: [7, 6, 5]

In [10]: samples[19]
IndexError: index 19 is out of bounds for axis 0 with size 5

Comment From: feverbrew

Is this still open? Also want to confirm that this should go in test_cut.py?

Comment From: djgray780

take

Comment From: jackgoldsmith4

@djgray780 are you still working on this?

Comment From: jackgoldsmith4

take

Comment From: ryuusama09

Is this issue still open , if yes then i would like to contribute !

Comment From: tab1tha

@ryuusama09 No one is assigned to the issue so I think you can go ahead and work on it.

Comment From: mahaoyu

take

Comment From: sotiri-geo

take

Comment From: sotiri-geo

Just started looking into this issue (as I see its still open). As far as I can see the issue still exists. The behaviour using .iloc[index] is as expected but fails for series[index] and series.loc[index] as demonstrated below:

>>> import pandas as pd
>>> data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
>>> grouped = data.groupby("quintiles")["values"]
>>> samples = grouped.agg(lambda x: list(x.sample(3)))
>>> samples
quintiles
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
>>> samples[0]
[3, 2, 1]
>>> samples[1]
[3, 2, 1]
>>> samples.loc[0]
[3, 2, 1]
>>> samples.loc[1]
[3, 2, 1]
>>> samples.iloc[0]
[3, 2, 1]
>>> samples.iloc[1] # expected behaviour
[6, 4, 7]

Extending this, the bug also occurs when constructing a series from an instance of IntervalIndex (which might be the root of this problem). I demonstrate recreating a series object with the profile from above but this time using the index as a type of IntervalIndex.

>>> samples.index.categories
IntervalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], dtype='interval[float64, right]')
>>> new_series = pd.Series(samples.values, index=samples.index.categories)
>>> new_series
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
dtype: object
>>> new_series[0]
[3, 2, 1]
>>> new_series[1]
[3, 2, 1]

I will attempt to write up some unittests for this (seems the logical grouping for Series indexing tests is under tests/series/indexing/test_getitem.py). Just from a quick browse a lot of the index logic for Series objects are happening in the IndexingMixin base class?

Comment From: sotiri-geo

^^ I think having worked on this issue I realised this is expected behaviour and changing this ends up breaking a lot of tests. The expected behaviour was already implemented in tests a few months back here.

The call stack was roughly going from checking the fallback from series

To then checking the categorical _should_fallback, which just redirected the call to the categories object

Which obviously fell back to the IntervalIndex _should_fallback method

In this case I initially tried modifying the list of accepted strings to include i and f dtypes (i worked for some f for others) but broke many other tests.

This matter should be closed and will delete my PR above.

Comment From: liang3zy22

Just tried on master. This issue still exists.

In [3]: data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))

In [4]: grouped = data.groupby("quintiles", observed=False)["values"]

In [5]: samples = grouped.agg(lambda x: list(x.sample(3)))

In [6]: samples
Out[6]: 
quintiles
(-0.019, 3.8]       [2, 1, 0]
(3.8, 7.6]          [5, 7, 4]
(7.6, 11.4]       [11, 10, 8]
(11.4, 15.2]     [12, 15, 14]
(15.2, 19.0]     [17, 16, 19]
Name: values, dtype: object

In [7]: samples[0]
Out[7]: [2, 1, 0]

In [8]: samples[1]
Out[8]: [2, 1, 0]

In [9]: samples[19]
Out[9]: [17, 16, 19]

Pandas Indexing on series where index is output from pd.cut

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`