Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd

records = [
    {"name": "A", "project": "foo"},
    {"name": "A", "project": "bar"},
    {"name": "A", "project": pd.NA},
]
df = pd.DataFrame.from_records(records).astype({"project": "string"})
agg_df = df.groupby("name").agg({"project": "unique"})
agg_df.to_parquet("test.parquet")
```
Issue Description
It seems that when aggregating on a `"string"`-typed column (potentially nullable), the resulting dataframe uses `StringArray` to represent the aggregated value (here `project`). When attempting to write to parquet, this fails with the following error:

```
pyarrow.lib.ArrowInvalid: ("Could not convert <StringArray>\n['foo', 'bar', <NA>]\nLength: 3, dtype: string with type StringArray: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column project with type object')
```
I'm not sure why this is happening, because `StringArray` clearly implements the `__arrow_array__` protocol that should allow this conversion. The same failure occurs if the column is typed `"string[pyarrow]"`.
The workaround I've been using so far is to re-type the column as `object` prior to the `groupby`. This results in aggregated values of type `np.ndarray`, which are parquet-able. But this is suboptimal because we use `read_parquet` upstream, which types the column as `"string"`; we have to remember to retype it every time we do this aggregation, or it will fail downstream when writing to parquet. It seems like this should work with the column typed as `"string"`.
Expected Behavior
It should successfully write the parquet file without failures.
Installed Versions
Comment From: eduardo2715
take
Comment From: lithomas1
Hm, this looks like Arrow is not liking having a StringArray "scalar" inside an object dtype column.
Calling `pyarrow.array` on `agg_df['project']` fails (and seems to be the root error):
```
>>> pa.array(agg_df['project'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert <StringArray>
['foo', 'bar', <NA>]
Length: 3, dtype: string with type StringArray: did not recognize Python value type when inferring an Arrow data type
```
Tentatively labelling upstream issue, since the affected code paths go through pyarrow.