Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

#!/usr/bin/env python3

import pandas as pd

midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=('x', 'y'))
midx = midx.set_levels([0] * len(midx), level='x', verify_integrity=False)

print(midx)
print('Duplicated: ', midx.duplicated())
print('Unique: ', midx.is_unique)

Issue Description

Pandas does not detect MultiIndex duplicates that were created using set_levels(). The MRE outputs:

MultiIndex([(0, 0),
            (0, 1),
            (0, 0),
            (0, 1)],
           names=['x', 'y'])
Duplicated:  [False False False False]
Unique:  True

The Python debugger cuts out in multi.py::duplicated(), and I think the final error is somewhere in the autogenerated Cython hashtable bindings here? I'm not sure how to debug from multi.py onward.

I found https://github.com/pandas-dev/pandas/issues/27035#issuecomment-505446429, which mentions a missing precondition check that might be related, but this is pure speculation on my part. Besides, tuples should be hashable.

Expected Behavior

Detect duplicates/non-uniqueness. The MRE should output:

MultiIndex([(0, 0),
            (0, 1),
            (0, 0),
            (0, 1)],
           names=['x', 'y'])
Duplicated:  [False False True True]
Unique:  False

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 3f7bc81ae6839803ecc0da073fe83e9194759550
python                : 3.12.7
python-bits           : 64
OS                    : Linux
OS-release            : 6.11.4-gentoo
Version               : #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:38:14 CEST 2024
machine               : x86_64
processor             : AMD Ryzen 5 4500 6-Core Processor
byteorder             : little
LC_ALL                : None
LANG                  : en_IE.utf8
LOCALE                : en_IE.UTF-8
pandas                : 3.0.0.dev0+1654.g3f7bc81ae
numpy                 : 2.1.3
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.2
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

Comment From: rhshadrach

Thanks for the report. The set_levels method only changes the levels, and the levels you are setting are not compatible with the codes in the MultiIndex. You can see this error by passing verify_integrity=True.

As such, you wind up with a MultiIndex in an invalid state, and it is going to give you wrong answers.
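
Concretely, with the index from the MRE (the ValueError below is the message the check produces):

import pandas as pd

midx = pd.MultiIndex.from_product([(0, 1), (0, 1)], names=('x', 'y'))

# With the integrity check left on, the incompatible levels are rejected
# instead of silently corrupting the index:
midx.set_levels([0] * len(midx), level='x', verify_integrity=True)
# ValueError: Level values must be unique: [0, 0, 0, 0] on level 0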

Comment From: mmatous

If manipulating the index in a way that creates duplicates makes these methods return invalid results, then what is the purpose of is_unique or .duplicated()? In any case, what are the recommended steps here? Basically, I need to do something similar to the MRE for my df: disable checks for its index, recalculate values (knowing this will result in duplicates), and then use pd.Index.duplicated() to drop those duplicates and bring the index back to a valid state. I didn't come up with this myself; I took it from the docs. Right now I'm resetting and setting the index as a workaround (sketched below).
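
A minimal sketch of that reset/set_index workaround, assuming a toy frame shaped like the MRE; setting 'x' to 0 is just a stand-in for the real recalculation:

import pandas as pd

df = pd.DataFrame(
    {'value': range(4)},
    index=pd.MultiIndex.from_product([(0, 1), (0, 1)], names=('x', 'y')),
)

# Recalculate through the columns instead of mutating the index in place,
# then rebuild the index so pandas re-derives consistent levels and codes.
flat = df.reset_index()
flat['x'] = 0                                     # stand-in recalculation that creates duplicates
flat = flat[~flat.duplicated(subset=['x', 'y'])]  # drop the duplicated (x, y) pairs
df = flat.set_index(['x', 'y'])
print(df.index.is_unique)                         # True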

Comment From: rhshadrach

If manipulating the index in a way that creates duplicates makes these methods return invalid results

It's not that you are creating duplicates. You are disabling safety checks, and then passing invalid data. That allows the index to get into an invalid state.

In any case what are the recommended steps here?

Always pass verify_integrity=True.

Comment From: mmatous

Always pass verify_integrity=True.

That results in

ValueError: Level values must be unique: [0, 0, 0, 0] on level 0

That's to be expected, of course, but as I said, I need to perform a calculation that temporarily results in duplicate values. Hence the verify_integrity=False.

Comment From: rhshadrach

You cannot use set_levels the way you are trying to. You must adhere to the requirements pandas assumes of a MultiIndex, namely that the level values and codes are consistent. You are making them inconsistent, and when they are inconsistent, pandas of course gives wrong results.
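
To illustrate, "consistent" means the codes are valid positions into unique level values. A sketch spelling out the MRE's original index explicitly (equivalent to the from_product call):

import pandas as pd

# Each code is a position into the corresponding level's unique values.
midx = pd.MultiIndex(
    levels=[[0, 1], [0, 1]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=['x', 'y'],
)
print(midx.is_unique)  # True: the code pairs are all distinct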

Closing.

Comment From: mmatous

OK, so for anyone who finds this in the future (and to clarify for myself):

  • set_levels() simply changes the labels, as in the "displayed names", of the MultiIndex, not the values in a level.
  • the "real values" of a MultiIndex are, in fact, the aforementioned codes, which act as indices or pointers into the array of labels called levels (see the sketch below). This is a bit counterintuitive if you ask me: I would expect that array to be called names or labels, .set_levels() to simply set values in a level, and labels/names to be set by .set_labels() or .set_names().

In my defence, this and @rhshadrach's talk about "codes" weren't exactly clear, because:

  • there's only a single occurrence of "codes" in https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, mentioning .set_codes() with no further explanation.
  • there's only a single occurrence of "codes" in https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-reindexing-and-alignment, where it is passed as a keyword, again without further explanation.
  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html explains the codes arg as "Integers for each level designating which label at each location", which I'm not sure is even a valid English sentence; it feels like there's at least a verb missing.
  • It's clearer after reading about categoricals, but that's 7 chapters later in the user guide, and UG chapters aren't exactly short. Understanding MultiIndex should come from reading the related chapter, not from finding an obscure reference a long while later. Why should anyone expect that a MultiIndex is basically a Categorical?

You could argue that it's obvious from the examples in the user guide, and it is now, but it sure wasn't before I understood. Is the summary above correct? If so, I would like to make a clarifying PR to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-reindexing-and-alignment.