Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
#!/usr/bin/env python3
import pandas as pd
midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=('x', 'y'))
midx = midx.set_levels([0] * len(midx), level='x', verify_integrity=False)
print(midx)
print('Duplicated: ', midx.duplicated())
print('Unique: ', midx.is_unique)
Issue Description
Pandas does not detect MultiIndex duplicates that were created using set_levels().
MRE outputs:
MultiIndex([(0, 0),
(0, 1),
(0, 0),
(0, 1)],
names=['x', 'y'])
Duplicated: [False False False False]
Unique: True
The Python debugger cuts out in multi.py::duplicated(), and I think the final error is somewhere in the autogenerated Cython hashtable bindings. I'm not sure how to debug from multi.py onward.
I found https://github.com/pandas-dev/pandas/issues/27035#issuecomment-505446429, which mentions a missing precondition check that might be related, but this is pure speculation on my part. Besides, tuples should be hashable.
Expected Behavior
Detect duplicates/non-uniqueness. MRE outputs:
MultiIndex([(0, 0),
(0, 1),
(0, 0),
(0, 1)],
names=['x', 'y'])
Duplicated: [False False True True]
Unique: False
Installed Versions
Comment From: rhshadrach
Thanks for the report. The levels that you are setting are not compatible with the codes in the MultiIndex. The set_levels method only changes the levels, and they must be compatible with the codes. You can see this error by passing verify_integrity=True.
As such, you wind up with a MultiIndex in an invalid state, and it is going to give you wrong answers.
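A minimal sketch of the mismatch described above, reusing the index from the MRE. The variable names (`broken`, `error_message`) are illustrative only. It inspects the levels and codes of the "x" level before and after the unchecked set_levels call, and captures the error that the default verify_integrity=True raises.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=("x", "y"))

# Each level stores its unique values once; codes are positions into that level.
original_levels = midx.levels[0].tolist()
original_codes = midx.codes[0].tolist()

# With checks disabled, only the level values are replaced. The codes stay as
# they were, so rows 0-1 and rows 2-3 still point at different positions.
broken = midx.set_levels([0] * len(midx), level="x", verify_integrity=False)
broken_levels = broken.levels[0].tolist()
broken_codes = broken.codes[0].tolist()

# With the default verify_integrity=True, the same call is rejected.
error_message = None
try:
    midx.set_levels([0] * len(midx), level="x")
except ValueError as err:
    error_message = str(err)

print(original_levels, original_codes)
print(broken_levels, broken_codes)
print(error_message)
```

Since the codes still differ between the rows, methods that work from the codes have no way to see that the displayed values now collide.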
Comment From: mmatous
If manipulating the index in a way that creates duplicates makes these methods return invalid results, then what is the purpose of is_unique or .duplicated()?
In any case what are the recommended steps here?
Basically, I need to do something similar to the MRE for my df: disable checks for its index, then recalculate values. I know this will result in duplicates. I then wanted to use pd.Index.duplicated() to drop those and bring the index back to a valid state.
I didn't come up with this, I took it from the docs.
Right now I'm resetting and setting the index as a workaround.
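The reset-and-set workaround might look roughly like the sketch below. The DataFrame, its "value" column, and the recalculation (setting every "x" to 0) are assumptions made up for illustration, not the actual data from this report.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=("x", "y"))
df = pd.DataFrame({"value": range(4)}, index=midx)

# Move the index levels into ordinary columns, recalculate "x" there,
# then build a fresh MultiIndex from the columns.
flat = df.reset_index()
flat["x"] = 0  # hypothetical recalculation that produces duplicates
rebuilt = flat.set_index(["x", "y"])

# The rebuilt index has consistent levels and codes, so duplicates are detected.
dup_mask = rebuilt.index.duplicated()
unique_before = rebuilt.index.is_unique

# Keep the first occurrence of each (x, y) pair.
deduped = rebuilt[~dup_mask]
unique_after = deduped.index.is_unique

print(dup_mask)
print(deduped)
```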
Comment From: rhshadrach
> If manipulating the index in a way that creates duplicates makes these methods return invalid results
It's not that you are creating duplicates. You are disabling safety checks, and then passing invalid data. That allows the index to get into an invalid state.
> In any case what are the recommended steps here?
Always pass verify_integrity=True.
Comment From: mmatous
> Always pass verify_integrity=True.
That results in
ValueError: Level values must be unique: [0, 0, 0, 0] on level 0
That is, of course, to be expected, but as I said, I need to perform a calculation that temporarily results in duplicate values, hence the verify_integrity=False.
Comment From: rhshadrach
You cannot use set_levels the way you are trying to. You must adhere to the requirements that pandas assumes of a MultiIndex, namely that the level values and codes must be consistent. You are making them inconsistent, and when you do, pandas of course gives wrong results.
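One way to change the values of a level while keeping levels and codes consistent is to build a new MultiIndex from the per-row values. This is only a sketch of one option, not a method recommended anywhere in this thread; `new_x` is a hypothetical stand-in for the recalculated values.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=("x", "y"))

# Hypothetical recalculated values for level "x", one per row.
new_x = [0] * len(midx)

# from_arrays derives fresh levels and codes from the row values, so the
# resulting index is internally consistent even though it has duplicates.
rebuilt = pd.MultiIndex.from_arrays(
    [new_x, midx.get_level_values("y")], names=midx.names
)

dup_mask = rebuilt.duplicated()
deduped = rebuilt[~dup_mask]

print(dup_mask)
print(rebuilt.is_unique, deduped.is_unique)
```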
Closing.
Comment From: mmatous
OK, so for anyone who finds this in the future (and to clarify for myself):
- set_levels() simply changes the labels, as in "displayed names", of the MIndex, not the values in a level.
- The "real values" of the MIndex are, in fact, the mentioned codes, which act as indices or pointers into the array of labels called levels. This is a bit counterintuitive if you ask me. I would expect that array to be called names or labels, .set_levels() to simply set values in a level, and the label/name to be set by .set_labels() or .set_names().
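The relationship summarized above can be checked with a small sketch. The string labels and variable names are chosen here purely for illustration. Each level holds the distinct labels, the codes hold one position per row, and a valid set_levels call swaps the labels without touching the codes.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([("a", "b"), (0, 1)], names=("x", "y"))

labels_x = midx.levels[0].tolist()  # distinct labels of level "x"
codes_x = midx.codes[0].tolist()    # one position per row into labels_x

# The per-row values are the labels looked up through the codes.
values_x = midx.levels[0].take(midx.codes[0]).tolist()

# A valid relabeling: same number of unique labels, codes left untouched.
relabeled = midx.set_levels(["A", "B"], level="x")
relabeled_codes_x = relabeled.codes[0].tolist()
relabeled_values_x = relabeled.get_level_values("x").tolist()

print(labels_x, codes_x, values_x)
print(relabeled_codes_x, relabeled_values_x)
```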
In my defence, this and @rhshadrach's talk about "codes" wasn't exactly clear because:
- There's only a single occurrence of "codes" in https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, where .set_codes() is mentioned with no further explanation.
- There's only a single occurrence of "codes" in https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-reindexing-and-alignment, where it is passed as a keyword, again without further explanation.
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html explains the codes arg as "Integers for each level designating which label at each location", which I'm not sure is even a valid English sentence; it feels like there's at least a verb missing.
- It's clearer after reading about categoricals, but that's 7 chapters later in the user guide, and UG chapters aren't exactly short. Understanding MultiIndex should come from reading the related chapter, not from looking for an obscure reference a long while later. Why should anyone expect that a MultiIndex is basically a Categorical?
You could argue that it's obvious from the examples in the user guide, and it is now, but it sure wasn't before I understood. Is the summary above correct? If so, I would like to make a clarifying PR to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-reindexing-and-alignment.