Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
from natsort import index_natsorted
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
)
df.sort_index(axis=1, level=1) # Passes
df.sort_index(axis=1, level="B") # Passes
df.sort_index(axis=1, level=1, key=lambda x: np.argsort(index_natsorted(x))) # Passes
df.sort_index(axis=1, level="B", key=lambda x: np.argsort(index_natsorted(x))) # Fails with KeyError: 'Level B not found'
Issue Description
Calling sort_index on a MultiIndex with both a level name and a key raises KeyError: 'Level B not found'. The same call succeeds when the level is given by its integer position.
Early investigation:
* During the sort, the name of the level is dropped because of key.
* This happens in ensure_key_mapped https://github.com/pandas-dev/pandas/blob/e209a35403f8835bbcff97636b83d2fc39b51e68/pandas/core/sorting.py#L547-592. The key is applied to the values of the index, and the result no longer carries the name.
* When sortlevel is then called from get_indexer_indexer, the sort is attempted on the level name, which has already been dropped.
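The name loss described in the bullets above can be seen in isolation. This is a minimal sketch, not the pandas internals: `natural_rank` is a hypothetical stand-in for `np.argsort(index_natsorted(x))`, used here only so the snippet does not depend on natsort.

```python
import re

import numpy as np
import pandas as pd


def natural_rank(values):
    """Hypothetical helper: rank of each label under a natural sort."""
    labels = [str(v) for v in values]

    def split_key(text):
        # "top10" -> ["top", 10, ""], so digit runs compare as integers
        return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", text)]

    order = sorted(range(len(labels)), key=lambda i: split_key(labels[i]))
    return np.argsort(order)


level = pd.Index(["top10", "top2"], name="B")

raw = natural_rank(level)                        # plain ndarray, carries no name
unnamed = pd.Index(raw)                          # wrapping alone does not restore it
renamed = pd.Index(raw, name=level.name)         # the name has to be passed explicitly

print(type(raw).__name__, unnamed.name, renamed.name)
```

A key that returns a bare array therefore leaves the mapped level without a name, which is consistent with the later lookup by "B" failing.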
Expected Behavior
sort_index should accept either a level number or a level name when key is passed.
Installed Versions
Comment From: gnkl
I could take an initial look at this.
Comment From: gnkl
take
Comment From: gnkl
Noting that the same happens when the MultiIndex is on the index rather than the columns:
import numpy as np
import pandas as pd
from natsort import index_natsorted
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    # MultiIndex on the index this time, instead of the columns
    index=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
)
df.sort_index(level=1) # Passes
df.sort_index(level="B") # Passes
df.sort_index(level=1, key=lambda x: np.argsort(index_natsorted(x))) # Passes
df.sort_index(level="B", key=lambda x: np.argsort(index_natsorted(x))) # Fails with KeyError: 'Level B not found'
Comment From: rhshadrach
Thanks for the report. I can get the code to run by changing the lambda to:
lambda x: pd.Index(np.argsort(index_natsorted(x)), name=x.name)
Also note that the documentation says this callable should return an Index instance. It seems to me that we should set the name internally and not require the user to do so. But unless we also wrap the return value in an Index ourselves, this could break existing code: as the OP demonstrates, key can succeed today even if it doesn't return an Index instance.
@mroeschke @jbrockmendel - should we set the name internally here, and if so, also wrap the result in an Index if necessary?
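Until that is decided, the workaround can be applied on the user side. The sketch below is one way to do it, under stated assumptions: `keep_level_name` and `natural_rank` are hypothetical helpers (not part of pandas or natsort), and `natural_rank` stands in for `np.argsort(index_natsorted(x))`.

```python
import re

import numpy as np
import pandas as pd


def natural_rank(values):
    """Hypothetical helper: rank of each label under a natural sort."""
    labels = [str(v) for v in values]

    def split_key(text):
        # "top10" -> ["top", 10, ""], so digit runs compare as integers
        return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", text)]

    order = sorted(range(len(labels)), key=lambda i: split_key(labels[i]))
    return np.argsort(order)


def keep_level_name(key):
    """Hypothetical wrapper: return the key's result as an Index named like its input."""
    def wrapped(level_values):
        return pd.Index(key(level_values), name=level_values.name)
    return wrapped


df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_product([["a"], ["top10", "top2"]], names=("A", "B")),
)

# Sorting by level name now works because the mapped level keeps the name "B".
result = df.sort_index(axis=1, level="B", key=keep_level_name(natural_rank))
print(result)
```

Wrapping the key keeps the fix in one place, so the same callable can be reused with `level=1` or `level="B"`.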
Comment From: jbrockmendel
I'm fine with that, but I would also be OK with being strict about requiring the documented behavior.
Comment From: gnkl
hey @rhshadrach, will you work on this, or shall I look into it?
Comment From: rhshadrach
@gnkl - I am not planning to take this up, it's all yours!