Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Construct an index using MultiIndex.from_product():
import pandas as pd

levels1 = ['a', 'b']
levels2 = pd.Series([1, 2, pd.NA], dtype=pd.Int64Dtype())
index1 = pd.MultiIndex.from_product([levels1, levels2], names=['one', 'two'])
series1 = pd.Series([f'{i1}-{i2}' for i1, i2 in index1], index=index1)
series1
one  two
a    1          a-1
     2          a-2
     <NA>    a-<NA>
b    1          b-1
     2          b-2
     <NA>    b-<NA>
dtype: object
Split series by first index level and recombine using pd.concat():
series2 = pd.concat([series1.loc[i1] for i1 in levels1], keys=levels1, names=['one'])
series2
one  two
a    1          a-1
     2          a-2
     <NA>    a-<NA>
b    1          b-1
     2          b-2
     <NA>    b-<NA>
dtype: object
Lookups on series1 are OK:
def check(series):
    for ix in series.index:
        print(repr(ix), end=': ')
        print(repr(series.at[ix]))

check(series1)
('a', 1): 'a-1'
('a', 2): 'a-2'
('a', <NA>): 'a-<NA>'
('b', 1): 'b-1'
('b', 2): 'b-2'
('b', <NA>): 'b-<NA>'
Lookups on series2 raise KeyError:
check(series2)
('a', 1): 'a-1'
('a', 2): 'a-2'
('a', <NA>):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3072, in MultiIndex.get_loc(self, key)
3071 try:
-> 3072 return self._engine.get_loc(key)
3073 except KeyError as err:
File pandas/_libs/index.pyx:794, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:2152, in pandas._libs.hashtable.UInt64HashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:2176, in pandas._libs.hashtable.UInt64HashTable.get_item()
KeyError: 17
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[210], line 1
----> 1 check(series2)
Cell In[208], line 4, in check(series)
2 for ix in series.index:
3 print(repr(ix), end=': ')
----> 4 print(repr(series.at[ix]))
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexing.py:2576, in _AtIndexer.__getitem__(self, key)
2573 raise ValueError("Invalid call for scalar access (getting)!")
2574 return self.obj.loc[key]
-> 2576 return super().__getitem__(key)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexing.py:2528, in _ScalarAccessIndexer.__getitem__(self, key)
2525 raise ValueError("Invalid call for scalar access (getting)!")
2527 key = self._convert_key(key)
-> 2528 return self.obj._get_value(*key, takeable=self._takeable)
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/series.py:1249, in Series._get_value(self, label, takeable)
1246 return self._values[label]
1248 # Similar to Index.get_value, but we do not fall back to positional
-> 1249 loc = self.index.get_loc(label)
1251 if is_integer(loc):
1252 return self._values[loc]
File ~/opt/mambaforge/envs/myenv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3074, in MultiIndex.get_loc(self, key)
3072 return self._engine.get_loc(key)
3073 except KeyError as err:
-> 3074 raise KeyError(key) from err
3075 except TypeError:
3076 # e.g. test_partial_slicing_with_multiindex partial string slicing
3077 loc, _ = self.get_loc_level(key, list(range(self.nlevels)))
KeyError: ('a', <NA>)
Issue Description
This seems like a weird corner case, but somehow pd.concat() creates an invalid MultiIndex when the concatenated Series (example shown above) or DataFrames have indexes with Int64Dtype that contain NA values. When using .at or .loc with an index tuple containing NA, a KeyError is raised. This doesn't happen with what should be an identical Series/DataFrame.
The two indices in the example do not compare equal according to .equals() but do have equal values according to ==:
>>> series1.index.equals(series2.index)
False
>>> series1.index == series2.index
array([ True, True, True, True, True, True])
A difference can be seen in the levels and codes attributes:
>>> series1.index.levels
FrozenList([['a', 'b'], [1, 2]])
>>> series2.index.levels
FrozenList([['a', 'b'], [1, 2, <NA>]])
>>> series1.index.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, -1, 0, 1, -1]])
>>> series2.index.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
Expected Behavior
Lookup succeeds.
Installed Versions
Comment From: rhshadrach
Thanks for the report. I agree with the expectation; the lookup should succeed when the NA value is in the level values. Further investigations and PRs to fix are welcome.
That being said, putting NA values in an Index is likely to always be a footgun.
Comment From: jlumpe
In my use case, it definitely makes sense conceptually to use an NA in the index. The other index values are positive integers, so it would be possible to use a 0 in its place, and that is probably what I would do if it were only used in internal code. However, this is a table returned by a public API function as part of a report, and NA makes more sense for the user.
I don't think it would be an unreasonable choice for Pandas to forbid NA-like values in an index, but in that case I think the choice should be documented and attempting to create such an index should result in an explicit error.
Comment From: rhshadrach
@jlumpe - that would mean (among other things) that df.groupby(...) would not work on groupings with NA values unless you also pass as_index=False. Even in that case, many groupby paths implement this by just calling .reset_index() at the end. Similar remarks apply to DataFrame.value_counts, which uses groupby but exposes no as_index argument. It does not seem to me that we can simply forbid NA values in an index without a lot of changes or negative repercussions.
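For illustration, a minimal sketch with made-up data (the printed output in the comment is approximate):
import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 1, pd.NA], dtype="Int64"), "b": [10, 20, 30]})

# dropna=False keeps the NA group; with as_index=True (the default) that NA
# group key necessarily ends up as an index label.
result = df.groupby("a", dropna=False).sum()
print(result.index)  # roughly: Index([1, <NA>], dtype='Int64', name='a')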
Comment From: mohsinm-dev
I tried to reproduce this error. In concat's _make_concat_multiindex (when all input indexes are the same and the inner index is not a MultiIndex), we build the result with:
- levels = [new_index.unique()]
- codes = new_index.unique().get_indexer(new_index)
This assigns a non-negative code to NA instead of the -1 sentinel, so NA also ends up in the level values.
I think the minimal fix would be to use factorize_from_iterable(new_index) in that branch so NA is encoded as -1 and excluded from levels, matching our invariant and the "not-all-indexes-same" branch.
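To make the difference concrete, here is a sketch using the public pd.factorize (standing in for factorize_from_iterable; the first two printed results reflect the behaviour reported above):
import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

# Construction currently used in this branch: NA gets a real, non-negative
# code and stays in the level values.
uniques = idx.unique()
print(list(uniques))             # [1, 2, <NA>]
print(uniques.get_indexer(idx))  # [0 1 2]

# Factorizing instead drops NA from the level values and encodes it as -1.
codes, levels = pd.factorize(idx)
print(list(levels))              # [1, 2]
print(codes)                     # [ 0  1 -1]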
Related but separate edge case: NA in keys (outermost level) when levels=None and keys is 1-D, also keeps NA as a category today. If desired, we can mirror the same approach for keys (either factorize keys or remap NA key codes to -1) in a follow-up.
@rhshadrach, can you please confirm if I am thinking in the correct direction?
Comment From: zachyattack23
Hi, I'm a college student at the University of Michigan and I'm taking a course where the task is to submit a PR. I have a lot of familiarity with pandas, but no experience contributing to an open source project. Would this be a doable task I could do in max 7 hours worth of time? Thanks
Comment From: rhshadrach
@mohsinm-dev - it does seem like the right direction. But I think we can avoid factorize by calling dropna on the unique values. Then get_indexer will naturally return -1 for any missing NA values.
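A minimal sketch of that alternative (same toy index as above, repeated so the snippet is self-contained):
import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

# Dropping NA from the uniques before get_indexer yields the -1 sentinel for
# the NA positions, without an explicit factorize step.
uniques = idx.unique().dropna()
print(list(uniques))             # [1, 2]
print(uniques.get_indexer(idx))  # [ 0  1 -1]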
@zachyattack23
Would this be a doable task I could do in max 7 hours worth of time?
I do not think this is answerable, partly because of the large variability in how long an issue takes different people, and partly because one can start working on an issue thinking it will be easy only for it to take a very long time.
Comment From: parthava-adabala
If no one's taking up this issue, I would like to work on it.
Comment From: parthava-adabala
I have pushed PR #63050. I changed the logic to exclude pd.NA from the list of level values using .dropna(), as suggested, and to encode pd.NA values in the index codes as -1.
Comment From: rhshadrach
Apologies @parthava-adabala - looking at this again, I think fixing .at is a better solution than coding the MultiIndex differently. Note that .loc already works:
import pandas as pd

index = pd.MultiIndex(levels=[[1, 2], [2, pd.NA]], codes=[[0, 1], [0, 1]])
df = pd.DataFrame({"a": [1, 2]}, index=index)
print(df.loc[(2, pd.NA)])
# a 2
# Name: (2, nan), dtype: int64
print(df.at[(2, pd.NA)])
# KeyError: nan
# The above exception was the direct cause of the following exception:
# KeyError: <NA>
My reason for this is that one of the places where we must use non-negative codes is groupby(..., dropna=False), as the groups must be represented by non-negative codes. It is then most natural to use these in the resulting index, as we currently do.
df = pd.DataFrame({"a": [1, 2, pd.NA], "b": [1, pd.NA, pd.NA], "c": [4, 5, 6]})
print(df.groupby(["a", "b"], dropna=False).sum().index.codes)
# [[0, 1, 2], [0, 1, 1]]
Comment From: parthava-adabala
Sounds right! I will try to fix .at to correctly handle the lookup for such keys.