Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]
index = df.iloc[:-1].copy().index
assert all(index.levels[-1] == sorted(index.levels[-1]))
index2 = index.remove_unused_levels()
assert all(index2.levels[-1] == sorted(index2.levels[-1]))
Issue Description
The order of the MultiIndex level values is not kept. This causes issues in code such as unstack, whose output ends up mis-ordered:
import pandas as pd
df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]
df.iloc[:-1].unstack(level=0)
Expected Behavior
I expect the current order of the MultiIndex level values to be kept.
Installed Versions
Comment From: mathman79
Naively, I would have expected remove_unused_index_levels to do something like the following (assuming we always want the levels to be sorted):
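import numpy as np
import pandas as pd

def remove_unused_index_levels(index: pd.MultiIndex) -> pd.MultiIndex:
    """Remove unused index levels, keeping levels ordered."""
    # index_codes_levels_names was not defined in the thread; reading the
    # pieces directly from the index is a stand-in for it.
    codes, levels, names = list(index.codes), list(index.levels), index.names
    for i, (code, level) in enumerate(zip(codes, levels)):
        uniq_code = np.unique(code)  # sorted codes that are actually used
        codes[i] = np.searchsorted(uniq_code, code)  # remap to 0..k-1
        levels[i] = level[uniq_code]  # used values, original level order
    return pd.MultiIndex(levels, codes, names=names)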
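One caveat with a sketch like the one above: a MultiIndex stores missing values as code -1, and np.unique would treat -1 as just another code. A hedged variant that leaves -1 untouched might look like the following. The function name remove_unused_levels_sorted is made up for illustration and is not a pandas API.

```python
import numpy as np
import pandas as pd


def remove_unused_levels_sorted(index: pd.MultiIndex) -> pd.MultiIndex:
    """Drop unused level values while keeping each level's existing order.

    Hypothetical helper for illustration; not part of pandas.
    """
    new_levels, new_codes = [], []
    for level, code in zip(index.levels, index.codes):
        code = np.asarray(code)
        used = np.unique(code[code >= 0])       # sorted positions actually in use
        remapped = np.searchsorted(used, code)  # old position -> new position
        remapped[code < 0] = -1                 # keep the missing-value marker
        new_levels.append(level[used])
        new_codes.append(remapped)
    return pd.MultiIndex(levels=new_levels, codes=new_codes, names=index.names)


mi = pd.MultiIndex.from_tuples([(0, "b"), (0, "c"), (1, "a"), (1, "c"), (1, "d")])
trimmed = remove_unused_levels_sorted(mi[:-1])
print(trimmed.levels)
```

Because the used positions are taken in increasing order, a level that was sorted before trimming should still be sorted afterwards, and the tuples themselves are unchanged.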
Comment From: rhshadrach
Thanks for the report. Agreed with the expected behavior that removing unused index levels should not modify the output of other operations down the line. However, it's not clear to me whether the order of the index levels should be an implementation detail of MultiIndex (and thus the issue is with unstack), or whether the index levels should have an influence on things like sorting. Further investigations are welcome; marking this as Needs Discussion for now.
Comment From: mathman79
Any thoughts on this? In my example above, the _lexsort_depth also changes:
index = df.iloc[:-1].copy().index
index2 = index.remove_unused_levels()
assert index._lexsort_depth == index2._lexsort_depth
This means that errors like the one below can suddenly occur, just because unused index levels were removed:
import pandas as pd
df = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]
df_ = df.iloc[:-1].copy()
df_.loc[:, "2000-01-01":"2024-12-24"]
df_.index = df_.index.remove_unused_levels()
df_.loc[:, "2000-01-01":"2024-12-24"]
which fails with
UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [1], lexsort depth 1'
on the second slice, but things work fine on the first slice.
This might be a better example, showing that even basic, well-defined operations break with the current remove_unused_levels implementation and that this should be fixed.
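Until this is settled, one possible workaround is to rebuild the index from its tuples instead of calling remove_unused_levels. This is only a sketch: it assumes the extra cost of materializing the tuples is acceptable, and it relies on MultiIndex.from_tuples factorizing each level afresh so that sortable levels come back sorted, which is how it behaves to my knowledge. The variable names s, s_ and window are illustrative.

```python
import pandas as pd

s = pd.DataFrame(
    [
        ("aap", "1991-01-02", 100.0000),
        ("aap", "2024-12-24", 75.7575),
        ("noot", "1960-01-04", 11.111),
        ("noot", "2024-12-24", 123.45),
        ("noot", "2024-12-30", 321.54),
    ],
    columns=["name", "date", "value"],
).set_index(["name", "date"])["value"]

s_ = s.iloc[:-1].copy()

# Rebuild the index from its tuples rather than trimming it in place.
# Only values that still occur end up in the new levels.
s_.index = pd.MultiIndex.from_tuples(list(s_.index), names=s_.index.names)

# The same slice that raised UnsortedIndexError in the report above.
window = s_.loc[:, "2000-01-01":"2024-12-24"]
print(window)
```

The rebuilt index holds the same tuples in the same row order, so only the level bookkeeping differs from the original.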
Comment From: mathman79
Simplified example with only the MultiIndex, similar to the example in the code:
>>> mi = pd.MultiIndex.from_tuples([(0, "b"), (0, "c"), (1, "a"), (1, "c"), (1, "d")])
>>> mi
MultiIndex([(0, 'b'),
            (0, 'c'),
            (1, 'a'),
            (1, 'c'),
            (1, 'd')],
           )
>>> mi.levels
FrozenList([[0, 1], ['a', 'b', 'c', 'd']])
>>> mi[:-1]
MultiIndex([(0, 'b'),
            (0, 'c'),
            (1, 'a'),
            (1, 'c')],
           )
>>> mi2 = mi[:-1].remove_unused_levels()
>>> mi2.levels
FrozenList([[0, 1], ['b', 'c', 'a']])
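To check whether a given index is in this state, a small diagnostic along these lines may help. It is a sketch only: unsorted_levels is a hypothetical helper, and the index is built by hand from the levels shown above so that the example does not depend on how remove_unused_levels behaves in a particular pandas version.

```python
import pandas as pd


def unsorted_levels(index: pd.MultiIndex) -> list[int]:
    """Return positions of levels whose values are not in increasing order.

    Hypothetical helper for illustration; not part of pandas.
    """
    return [
        i
        for i, level in enumerate(index.levels)
        if not level.is_monotonic_increasing
    ]


# Same tuples as mi[:-1], with the second level stored as ['b', 'c', 'a'].
mi2_like = pd.MultiIndex(
    levels=[[0, 1], ["b", "c", "a"]],
    codes=[[0, 0, 1, 1], [0, 1, 2, 1]],
)
print(unsorted_levels(mi2_like))  # → [1]
```

An empty list means every level is stored in increasing order, which is what an index freshly built with from_tuples gives for sortable values.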
Comment From: rhshadrach
Thanks @mathman79 for the loc example in https://github.com/pandas-dev/pandas/issues/61245#issuecomment-3250315979. I'm positive on treating the order of levels as user-facing and thus it should not change when removing unused levels.
@jbrockmendel - would you call this a bug as well?