Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Construct an index using MultiIndex.from_product():
import pandas as pd

levels1 = ['a', 'b']
levels2 = pd.Series([1, 2, pd.NA], dtype=pd.Int64Dtype())
index1 = pd.MultiIndex.from_product([levels1, levels2], names=['one', 'two'])
series1 = pd.Series([f'{i1}-{i2}' for i1, i2 in index1], index=index1)
series1
one  two
a    1          a-1
     2          a-2
     <NA>    a-<NA>
b    1          b-1
     2          b-2
     <NA>    b-<NA>
dtype: object
Split series by first index level and recombine using pd.concat():
series2 = pd.concat([series1.loc[i1] for i1 in levels1], keys=levels1, names=['one'])
series2
one  two
a    1          a-1
     2          a-2
     <NA>    a-<NA>
b    1          b-1
     2          b-2
     <NA>    b-<NA>
dtype: object
Lookups on series1 are OK:
def check(series):
    for ix in series.index:
        print(repr(ix), end=': ')
        print(repr(series.at[ix]))

check(series1)
('a', 1): 'a-1'
('a', 2): 'a-2'
('a', <NA>): 'a-<NA>'
('b', 1): 'b-1'
('b', 2): 'b-2'
('b', <NA>): 'b-<NA>'
Lookups on series2 raise KeyError:
check(series2)
('a', 1): 'a-1'
('a', 2): 'a-2'
('a', <NA>):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3072, in MultiIndex.get_loc(self, key)
3071 try:
-> 3072 return self._engine.get_loc(key)
3073 except KeyError as err:
File pandas/_libs/index.pyx:794, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:2152, in pandas._libs.hashtable.UInt64HashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:2176, in pandas._libs.hashtable.UInt64HashTable.get_item()
KeyError: 17
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[210], line 1
----> 1 check(series2)
Cell In[208], line 4, in check(series)
2 for ix in series.index:
3 print(repr(ix), end=': ')
----> 4 print(repr(series.at[ix]))
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexing.py:2576, in _AtIndexer.__getitem__(self, key)
2573 raise ValueError("Invalid call for scalar access (getting)!")
2574 return self.obj.loc[key]
-> 2576 return super().__getitem__(key)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexing.py:2528, in _ScalarAccessIndexer.__getitem__(self, key)
2525 raise ValueError("Invalid call for scalar access (getting)!")
2527 key = self._convert_key(key)
-> 2528 return self.obj._get_value(*key, takeable=self._takeable)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/series.py:1249, in Series._get_value(self, label, takeable)
1246 return self._values[label]
1248 # Similar to Index.get_value, but we do not fall back to positional
-> 1249 loc = self.index.get_loc(label)
1251 if is_integer(loc):
1252 return self._values[loc]
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3074, in MultiIndex.get_loc(self, key)
3072 return self._engine.get_loc(key)
3073 except KeyError as err:
-> 3074 raise KeyError(key) from err
3075 except TypeError:
3076 # e.g. test_partial_slicing_with_multiindex partial string slicing
3077 loc, _ = self.get_loc_level(key, list(range(self.nlevels)))
KeyError: ('a', <NA>)
Issue Description
This seems like a weird corner case, but somehow pd.concat() creates an invalid MultiIndex when the concatenated Series (example shown above) or DataFrames have indexes with Int64Dtype that contain NA values. When using .at or .loc with an index tuple containing NA, a KeyError is raised. This doesn't happen with what should be an identical Series/DataFrame.
The two indices in the example do not compare equal according to .equals() but do have equal values according to ==:
>>> series1.index.equals(series2.index)
False
>>> series1.index == series2.index
array([ True, True, True, True, True, True])
A difference can be seen in the levels and codes attributes:
>>> series1.index.levels
FrozenList([['a', 'b'], [1, 2]])
>>> series2.index.levels
FrozenList([['a', 'b'], [1, 2, <NA>]])
>>> series1.index.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, -1, 0, 1, -1]])
>>> series2.index.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
Expected Behavior
Lookup succeeds.
Installed Versions
Comment From: rhshadrach
Thanks for the report. I agree with the expectation; the lookup should succeed when the NA value is in the level values. Further investigations and PRs to fix are welcome.
That being said, putting NA values in an Index is likely to always be a footgun.
Comment From: jlumpe
In my use case, it definitely makes sense conceptually to use an NA in the index. The other index values are positive integers, so it would be possible to use a 0 in its place, and that is probably what I would do if it were only used in internal code. However, this is a table returned by a public API function as part of a report, and NA makes more sense for the user.
I don't think it would be an unreasonable choice for Pandas to forbid NA-like values in an index, but in that case I think the choice should be documented and attempting to create such an index should result in an explicit error.
Comment From: rhshadrach
@jlumpe - that would mean (among other things) that df.groupby(...) would not work on groupings with NA values unless you also pass as_index=False. Even in that case, many groupby paths implement this by just calling .reset_index() at the end. Similar remarks apply to DataFrame.value_counts, which uses groupby but exposes no as_index argument. It does not seem to me that we can simply forbid NA values in an index without a lot of changes or negative repercussions.
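For illustration, a minimal sketch with made-up data (the printed output in the comment is approximate):
import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 1, pd.NA], dtype="Int64"), "b": [10, 20, 30]})

# dropna=False keeps the NA group; with as_index=True (the default) that NA
# group key necessarily ends up as an index label.
result = df.groupby("a", dropna=False).sum()
print(result.index)  # roughly: Index([1, <NA>], dtype='Int64', name='a')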
Comment From: mohsinm-dev
I tried to reproduce this error. In concat's _make_concat_multiindex (when all input indexes are the same and the inner index is not a MultiIndex), we build the result with:
- levels = [new_index.unique()]
- codes = new_index.unique().get_indexer(new_index)
This assigns a non-negative code to NA instead of the -1 sentinel, so NA also ends up in the level values.
I think the minimal fix would be to use factorize_from_iterable(new_index) in that branch so NA is encoded as -1 and excluded from levels, matching our invariant and the "not-all-indexes-same" branch.
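To make the difference concrete, here is a sketch using the public pd.factorize (standing in for factorize_from_iterable; the first two printed results reflect the behaviour reported above):
import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

# Construction currently used in this branch: NA gets a real, non-negative
# code and stays in the level values.
uniques = idx.unique()
print(list(uniques))             # [1, 2, <NA>]
print(uniques.get_indexer(idx))  # [0 1 2]

# Factorizing instead drops NA from the level values and encodes it as -1.
codes, levels = pd.factorize(idx)
print(list(levels))              # [1, 2]
print(codes)                     # [ 0  1 -1]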
Related but separate edge case: NA in keys (outermost level) when levels=None and keys is 1-D, also keeps NA as a category today. If desired, we can mirror the same approach for keys (either factorize keys or remap NA key codes to -1) in a follow-up.
@rhshadrach, can you please confirm if I am thinking in the correct direction?
Comment From: zachyattack23
Hi, I'm a college student at the University of Michigan and I'm taking a course where the task is to submit a PR. I have a lot of familiarity with pandas, but no experience contributing to an open source project. Would this be a doable task I could do in max 7 hours worth of time? Thanks
Comment From: rhshadrach
@mohsinm-dev - it does seem like the right direction. But I think we can avoid factorize by calling dropna on the unique values. Then get_indexer will naturally return -1 for any missing NA values.
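A minimal sketch of that alternative (same toy index as above, repeated so the snippet is self-contained):
import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

# Dropping NA from the uniques before get_indexer yields the -1 sentinel for
# the NA positions, without an explicit factorize step.
uniques = idx.unique().dropna()
print(list(uniques))             # [1, 2]
print(uniques.get_indexer(idx))  # [ 0  1 -1]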
@zachyattack23
Would this be a doable task I could do in max 7 hours worth of time?
I do not think this is answerable, partly because of the large variability in how long an issue takes different people, and partly because one can start working on an issue thinking it will be easy only for it to take a very long time.
Comment From: parthava-adabala
If no one's taking up this issue, I would like to work on it.
Comment From: parthava-adabala
I have pushed PR #63050. I changed the logic to exclude pd.NA from the list of level values using .dropna(), as suggested, and to encode pd.NA values in the index codes as -1.
Comment From: rhshadrach
Apologies @parthava-adabala - looking at this again, I think fixing .at is a better solution than coding the MultiIndex differently. Note that .loc already works:
import pandas as pd

index = pd.MultiIndex(levels=[[1, 2], [2, pd.NA]], codes=[[0, 1], [0, 1]])
df = pd.DataFrame({"a": [1, 2]}, index=index)
print(df.loc[(2, pd.NA)])
# a 2
# Name: (2, nan), dtype: int64
print(df.at[(2, pd.NA)])
# KeyError: nan
# The above exception was the direct cause of the following exception:
# KeyError: <NA>
My reason for this is that one of the places where we must use non-negative codes is groupby(..., dropna=False), as the groups must be represented by non-negative codes. It is then most natural to use these in the resulting index, as we currently do.
df = pd.DataFrame({"a": [1, 2, pd.NA], "b": [1, pd.NA, pd.NA], "c": [4, 5, 6]})
print(df.groupby(["a", "b"], dropna=False).sum().index.codes)
# [[0, 1, 2], [0, 1, 1]]
Comment From: parthava-adabala
Sounds right! I will try to fix .at to correctly handle the lookup for such keys.