Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(
    {
        "model": ["model1", "model2"],
        "second_index": [(1, 2), (3, 4)],
        "first_index": [0, 1],
    }
)
df = df.set_index(["first_index", "second_index"], append=True)
df.to_parquet("temp.parquet")
pd.read_parquet("temp.parquet") # >> KeyError
import polars as pl
pl.read_parquet("temp.parquet")  # --> OK
Issue Description
I am writing a DataFrame with a MultiIndex in which one level contains tuples.
I can save it to Parquet, and the resulting file appears to be valid, since Polars reads it correctly.
However, I can't load it back into pandas: read_parquet raises a KeyError.
Expected Behavior
I expected pd.read_parquet to give back the DataFrame written by df.to_parquet. The following workaround produces the correct result:
df = pl.read_parquet("temp.parquet").to_pandas()
df["second_index"] = df["second_index"].apply(lambda x: tuple(x))
df = df.set_index(["first_index", "second_index"])
Installed Versions
Comment From: Jopestpe
Replace
df.to_parquet("temp.parquet")
with
df.reset_index().to_parquet("temp.parquet")
It worked for me; maybe it works for you too.
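In full, something like this (a sketch based on the example above, without the extra appended RangeIndex level; the tuple level seems to come back as list-like values, so it needs converting back before rebuilding the index, as in the polars workaround):

import pandas as pd

df = pd.DataFrame(
    {
        "model": ["model1", "model2"],
        "second_index": [(1, 2), (3, 4)],
        "first_index": [0, 1],
    }
).set_index(["first_index", "second_index"])

# Drop the index before writing so no pandas index metadata is stored in the file.
df.reset_index().to_parquet("temp.parquet")

# Read back and rebuild the MultiIndex manually; the tuple level likely needs
# converting from list-like values back to tuples first.
restored = pd.read_parquet("temp.parquet")
restored["second_index"] = restored["second_index"].apply(tuple)
restored = restored.set_index(["first_index", "second_index"])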
Comment From: elbg
Yup, that's another workaround. However, I would rather have pandas handle the index than have to reset it when saving to Parquet and set it back again when loading.
Comment From: rhshadrach
Thanks for the report. Further investigations welcome. This may also be an upstream issue in PyArrow.
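One way to narrow it down could be to round-trip the same frame through pyarrow directly and check whether Table.to_pandas raises the same KeyError when rebuilding the index from the schema's pandas metadata (untested sketch):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        "model": ["model1", "model2"],
        "second_index": [(1, 2), (3, 4)],
        "first_index": [0, 1],
    }
).set_index(["first_index", "second_index"], append=True)

# Convert with the pandas index preserved in the Arrow schema metadata,
# write to Parquet, read back, and try to reconstruct the pandas DataFrame.
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, "temp_arrow.parquet")
pq.read_table("temp_arrow.parquet").to_pandas()  # does the same KeyError occur here?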