Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]

index = df.iloc[:-1].copy().index
assert all(index.levels[-1] == sorted(index.levels[-1]))

index2 = index.remove_unused_levels()
assert all(index2.levels[-1] == sorted(index2.levels[-1]))

Issue Description

Order or the multi index level is not kept. This causes issues with code like unstack being mis-ordered:

import pandas as pd

df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]

df.iloc[:-1].unstack(level=0)

Expected Behavior

I expect that the current order or the multi index level is kept.

Installed Versions

INSTALLED VERSIONS ------------------ commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457 python : 3.9.13.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_United States.1252 pandas : 2.2.0 numpy : 1.26.4 pytz : 2023.3 dateutil : 2.8.2 setuptools : 63.4.1 pip : 24.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.8.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : None IPython : 8.7.0 pandas_datareader : 0.10.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : 1.3.7 dataframe-api-compat : None fastparquet : None fsspec : 2024.10.0 gcsfs : None matplotlib : None numba : None numexpr : 2.8.8 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None python-calamine : None pyxlsb : None s3fs : 2024.10.0 scipy : 1.13.1 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

Comment From: mathman79

Naively I would expected remove_unused_index_levels do something like below (assuming we always want the levels to be sorted):

def remove_unused_index_levels(index: pd.MultiIndex) -> pd.MultiIndex:
    """Remove unused index levels, keeping levels ordered."""
    codes, levels, names = index_codes_levels_names(index)
    for i, (code, level) in enumerate(zip(codes, levels)):
        uniq_code = np.unique(code)
        codes[i] = np.searchsorted(uniq_code, code)
        levels[i] = level[uniq_code]
    return pd.MultiIndex(levels, codes, names=names)

Comment From: rhshadrach

Thanks for the report. Agreed with the expected behavior that removing unused index levels should not modify the output of other operations down the line. However it's not clear to me if the order of the index levels should be an implementation detail of MultiIndex (and thus, the issue is with unstack), or if the index levels should have an influence on things like sorting. Further investigations are welcome, marking this as Needs Discussion for now.

Comment From: mathman79

Any thoughts on this? In my example above the _lexsort_depth also changes:

index = df.iloc[:-1].copy().index
index2 = index.remove_unused_levels()
assert index._lexsort_depth == index2._lexsort_depth

This means that suddenly errors like below could occur, because unused index levels were removed:

import pandas as pd

df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]

df_ = df.iloc[:-1].copy()
df_.loc[:, "2000-01-01":"2024-12-24"]

df_.index = df_.index.remove_unused_levels()
df_.loc[:, "2000-01-01":"2024-12-24"]

which fails with

UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [1], lexsort depth 1'

on the second slice, but things work fine on the first slice.

This might be a better example showing that even basic well-defined operations break with the current remove_unused_levels implementation and that this should be fixed.

Comment From: mathman79

Simplied example with only the multi-index and similar to the example in the code

>>> mi = pd.MultiIndex.from_tuples([(0, "b"), (0, "c"), (1, "a"), (1, "c"), (1, "d")])
>>> mi
MultiIndex([(0, 'b'),
            (0, 'c'),
            (1, 'a'),
            (1, 'c'),
            (1, 'd')],
           )
>>> mi.levels
FrozenList([[0, 1], ['a', 'b', 'c', 'd']])
>>> mi[:-1]
MultiIndex([(0, 'b'),
            (0, 'c'),
            (1, 'a'),
            (1, 'c')],
           )
>>> mi2 = mi[:-1].remove_unused_levels()
>>> mi2.levels
FrozenList([[0, 1], ['b', 'c', 'a']])

Comment From: rhshadrach

Thanks @mathman79 for the loc example in https://github.com/pandas-dev/pandas/issues/61245#issuecomment-3250315979. I'm positive on treating the order of levels as user-facing and thus it should not change when removing unused levels.

@jbrockmendel - would you call this a bug as well?