Code Sample, a copy-pastable example if possible
import pandas as pd
data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
grouped = data.groupby("quintiles")["values"]
samples = grouped.agg(lambda x: list(x.sample(3)))
samples
# quintiles
# (-0.019, 3.8] [3, 2, 1]
# (3.8, 7.6] [7, 4, 6]
# (7.6, 11.4] [8, 10, 11]
# (11.4, 15.2] [15, 14, 12]
# (15.2, 19.0] [17, 16, 19]
# Name: values, dtype: object
samples[0]
# [3, 2, 1]
samples[1]
# [3, 2, 1]
samples[2.3]
# [3, 2, 1]
samples[19]
# [17, 16, 19]
samples.keys()
# CategoricalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2],
# (15.2, 19.0]],
# categories=[(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], ordered=True, name='quintiles', dtype='category')
Problem description
Indexing a series object that has pd.cut() intervals as its index results in membership test of the provided index on the intervals, which I found interesting, but it was very unexpected. Is this intentional and documented somewhere? When I give the labels names (labels=list("abcde") in pd.cut), this doesn't happen anymore.
Expected Output
I expectived it to fall back on .iloc[].
Output of pd.show_versions()
Comment From: mroeschke
This looks to work on master now. Could use a test
In [7]: samples
Out[7]:
quintiles
(-0.019, 3.8] [0, 2, 3]
(3.8, 7.6] [7, 6, 5]
(7.6, 11.4] [11, 8, 10]
(11.4, 15.2] [13, 12, 15]
(15.2, 19.0] [18, 16, 17]
Name: values, dtype: object
In [8]: samples[0]
Out[8]: [0, 2, 3]
In [9]: samples[1]
Out[9]: [7, 6, 5]
In [10]: samples[19]
IndexError: index 19 is out of bounds for axis 0 with size 5
Comment From: feverbrew
Is this still open? Also want to confirm that this should go in test_cut.py
?
Comment From: djgray780
take
Comment From: jackgoldsmith4
@djgray780 are you still working on this?
Comment From: jackgoldsmith4
take
Comment From: ryuusama09
Is this issue still open , if yes then i would like to contribute !
Comment From: tab1tha
@ryuusama09 No one is assigned to the issue so I think you can go ahead and work on it.
Comment From: mahaoyu
take
Comment From: sotiri-geo
take
Comment From: sotiri-geo
Just started looking into this issue (as I see its still open). As far as I can see the issue still exists. The behaviour using .iloc[index]
is as expected but fails for series[index]
and series.loc[index]
as demonstrated below:
>>> import pandas as pd
>>> data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
>>> grouped = data.groupby("quintiles")["values"]
>>> samples = grouped.agg(lambda x: list(x.sample(3)))
>>> samples
quintiles
(-0.019, 3.8] [3, 2, 1]
(3.8, 7.6] [6, 4, 7]
(7.6, 11.4] [11, 8, 9]
(11.4, 15.2] [13, 14, 12]
(15.2, 19.0] [18, 19, 16]
>>> samples[0]
[3, 2, 1]
>>> samples[1]
[3, 2, 1]
>>> samples.loc[0]
[3, 2, 1]
>>> samples.loc[1]
[3, 2, 1]
>>> samples.iloc[0]
[3, 2, 1]
>>> samples.iloc[1] # expected behaviour
[6, 4, 7]
Extending this, the bug also occurs when constructing a series from an instance of IntervalIndex
(which might be the root of this problem). I demonstrate recreating a series object with the profile from above but this time using the index as a type of IntervalIndex
.
>>> samples.index.categories
IntervalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], dtype='interval[float64, right]')
>>> new_series = pd.Series(samples.values, index=samples.index.categories)
>>> new_series
(-0.019, 3.8] [3, 2, 1]
(3.8, 7.6] [6, 4, 7]
(7.6, 11.4] [11, 8, 9]
(11.4, 15.2] [13, 14, 12]
(15.2, 19.0] [18, 19, 16]
dtype: object
>>> new_series[0]
[3, 2, 1]
>>> new_series[1]
[3, 2, 1]
I will attempt to write up some unittests for this (seems the logical grouping for Series
indexing tests is under tests/series/indexing/test_getitem.py
). Just from a quick browse a lot of the index logic for Series
objects are happening in the IndexingMixin
base class?
Comment From: sotiri-geo
^^ I think having worked on this issue I realised this is expected behaviour and changing this ends up breaking a lot of tests. The expected behaviour was already implemented in tests a few months back here.
The call stack was roughly going from checking the fallback from series
To then checking the categorical _should_fallback
, which just redirected the call to the categories object
Which obviously fell back to the IntervalIndex _should_fallback
method
In this case I initially tried modifying the list of accepted strings to include i
and f
dtypes (i
worked for some f
for others) but broke many other tests.
This matter should be closed and will delete my PR above.
Comment From: liang3zy22
Just tried on master. This issue still exists.
In [3]: data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
In [4]: grouped = data.groupby("quintiles", observed=False)["values"]
In [5]: samples = grouped.agg(lambda x: list(x.sample(3)))
In [6]: samples
Out[6]:
quintiles
(-0.019, 3.8] [2, 1, 0]
(3.8, 7.6] [5, 7, 4]
(7.6, 11.4] [11, 10, 8]
(11.4, 15.2] [12, 15, 14]
(15.2, 19.0] [17, 16, 19]
Name: values, dtype: object
In [7]: samples[0]
Out[7]: [2, 1, 0]
In [8]: samples[1]
Out[8]: [2, 1, 0]
In [9]: samples[19]
Out[9]: [17, 16, 19]
This might be expected behavior.
Comment From: matthijsdewit111
take
Comment From: shriyase
Can I work on this if it's still open as I'm a beginner?
Comment From: duk7292
take
Comment From: Thanushri16
take
Comment From: sreeja97
take