Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
from natsort import index_natsorted
df = pd.DataFrame(
[[1, 2], [3, 4]],
columns=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
)
df.sort_index(axis=1, level=1) # Passes
df.sort_index(axis=1, level="B") # Passes
df.sort_index(axis=1, level=1, key=lambda x: np.argsort(index_natsorted(x))) # Passes
df.sort_index(axis=1, level="B", key=lambda x: np.argsort(index_natsorted(x))) # Fails with KeyError: 'Level B not found'
Issue Description
Calling sort_index on a MultiIndex with both a level name and a key raises KeyError: 'Level B not found'. The same call with the level given by position succeeds.
Early investigations:
* During sorting, the name of the level is dropped because of key.
* This happens in ensure_key_mapped (https://github.com/pandas-dev/pandas/blob/e209a35403f8835bbcff97636b83d2fc39b51e68/pandas/core/sorting.py#L547-592). The key is applied to the values of the index, and the result no longer carries the name.
* When sort_level is called from get_indexer_indexer, the sort is attempted on the level name, which has already been dropped.
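To illustrate the points above, here is a minimal sketch of how a level name can get lost when the key result is re-wrapped. It does not call pandas internals; it only mimics the step with public API. The natural_rank function is a hypothetical stand-in for the natsort-based key and assumes labels of the form "top<N>".

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B"))
level_a = mi.get_level_values("A")
level_b = mi.get_level_values("B")


def natural_rank(values):
    # Hypothetical stand-in for the natsort-based key in the report;
    # assumes every label looks like "top<N>".
    numbers = [int(label[3:]) for label in values]
    return np.argsort(np.argsort(numbers))


# A bare ndarray has no name, so wrapping it in an Index gives name=None.
mapped_unnamed = pd.Index(natural_rank(level_b))
# Passing the original name along keeps the level addressable by name.
mapped_named = pd.Index(natural_rank(level_b), name=level_b.name)

rebuilt_unnamed = pd.MultiIndex.from_arrays([level_a, mapped_unnamed])
rebuilt_named = pd.MultiIndex.from_arrays([level_a, mapped_named])

print(list(rebuilt_unnamed.names))
print(list(rebuilt_named.names))
```

In the first rebuilt MultiIndex the second level has no name, so a later lookup of "B" has nothing to match.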
Expected Behavior
sort_index should accept either a level id or a level name when key is set.
Installed Versions
Comment From: gnkl
I could take an initial look at this.
Comment From: gnkl
take
Comment From: gnkl
Noting that the same happens when a MultiIndex is used for the index:
import numpy as np
import pandas as pd
from natsort import index_natsorted
df = pd.DataFrame(
[[1, 2], [3, 4]],
index=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
# here the MultiIndex is on the index instead of the columns
)
df.sort_index(level=1) # Passes
df.sort_index(level="B") # Passes
df.sort_index(level=1, key=lambda x: np.argsort(index_natsorted(x))) # Passes
df.sort_index(level="B", key=lambda x: np.argsort(index_natsorted(x))) # Fails with KeyError: 'Level B not found'
Comment From: rhshadrach
Thanks for the report. I can get the code to run by changing the body of the lambda to:
pd.Index(np.argsort(index_natsorted(x)), name=x.name)
Also note that it's documented that this callable should return an Index instance. It seems to me that we should set the name internally and not require the user to do so. But if we don't wrap the return in an Index ourselves, it's possible that this could break some code, as the OP demonstrates that key can be successful even if it doesn't return an Index instance.
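For reference, a self-contained sketch of that workaround, with the key returning a named Index. To avoid the natsort dependency it uses a hypothetical natural_rank_key function that assumes labels of the form "top<N>"; it is not the key from the original report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
)


def natural_rank_key(level_values):
    # Hypothetical key; assumes every label looks like "top<N>".
    numbers = [int(label[3:]) for label in level_values]
    ranks = np.argsort(np.argsort(numbers))
    # Return an Index and carry the level name over explicitly.
    return pd.Index(ranks, name=level_values.name)


result = df.sort_index(axis=1, level="B", key=natural_rank_key)
print(result.columns.tolist())
```

With the name carried over, the level can be looked up by "B" again, and the columns come out in natural order (top2 before top10).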
@mroeschke @jbrockmendel - should we set the name internally here, and if so, also wrap the result in an Index if necessary?
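A rough sketch of what "set the name internally and wrap if necessary" could look like. The helper apply_key_keep_name is hypothetical and is not pandas code; it only shows the idea under the assumption that the key is applied to one level at a time.

```python
import numpy as np
import pandas as pd


def apply_key_keep_name(values: pd.Index, key) -> pd.Index:
    """Hypothetical helper: apply key, wrap the result, restore the name."""
    result = key(values.copy())
    if len(result) != len(values):
        raise ValueError("key must not change the length of the level values")
    # Wrap array-likes so callers always get an Index back.
    if not isinstance(result, pd.Index):
        result = pd.Index(result)
    # Re-attach the original level name so lookups by name keep working.
    return result.rename(values.name)


level_b = pd.Index(["top10", "top2"], name="B")
mapped = apply_key_keep_name(level_b, lambda x: np.array([1, 0]))
print(mapped.name)
```

One trade-off to keep in mind: this version overwrites any name the key deliberately set on its return value.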