Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

records = [
    {"name": "A", "project": "foo"}, 
    {"name": "A", "project": "bar"}, 
    {"name": "A", "project": pd.NA},
]
df = pd.DataFrame.from_records(records).astype({"project": "string"})
agg_df = df.groupby("name").agg({"project": "unique"})

agg_df.to_parquet("test.parquet")

Issue Description

It seems that when aggregating on a "string"-dtype column (which is nullable), the resulting dataframe uses a StringArray to represent each aggregated value (here, project). When attempting to write to parquet, this fails with the following error:

pyarrow.lib.ArrowInvalid: ("Could not convert <StringArray>\n['foo', 'bar', <NA>]\nLength: 3, dtype: string with type StringArray: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column project with type object')

I'm not sure why this is happening, because StringArray clearly implements the __arrow_array__ protocol, which should allow this conversion. The same failure occurs if the column is typed as "string[pyarrow]".
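
For illustration, here is a minimal sketch (the nested Series is constructed by hand to mirror the aggregated column): pa.array handles the StringArray itself via __arrow_array__, but not the same array stored as an element of an object-dtype Series:

import pandas as pd
import pyarrow as pa

arr = pd.array(["foo", "bar", pd.NA], dtype="string")
pa.array(arr)  # succeeds: __arrow_array__ converts it to a pyarrow string array

nested = pd.Series([arr], dtype=object)  # object column holding a StringArray element
pa.array(nested)  # raises pyarrow.lib.ArrowInvalid, as in the error above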

The workaround I've been using so far is to re-type the column as object prior to the groupby (see the sketch below). This produces aggregated values of type np.ndarray, which write to parquet without issue. But it is suboptimal: our upstream read_parquet types the column as "string", so we have to remember to re-type it every time we do this aggregation, or the write fails downstream. It seems like this should just work with the column typed as "string".
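
A minimal sketch of that workaround, reusing df from the reproducible example above (the output filename is arbitrary):

df_obj = df.astype({"project": object})
agg_obj = df_obj.groupby("name").agg({"project": "unique"})
type(agg_obj["project"].iloc[0])  # np.ndarray rather than StringArray
agg_obj.to_parquet("test_workaround.parquet")  # writes without error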

Expected Behavior

to_parquet should write the file successfully.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.9.17.final.0
python-bits : 64
OS : Darwin
OS-release : 23.3.0
Version : Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 58.1.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Comment From: eduardo2715

take

Comment From: lithomas1

Hm, it looks like Arrow does not like having a StringArray "scalar" inside an object-dtype column.

Calling pyarrow.array on agg_df['project'] fails (and seems to be the root error):

>>> pa.array(agg_df['project'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert <StringArray>
['foo', 'bar', <NA>]
Length: 3, dtype: string with type StringArray: did not recognize Python value type when inferring an Arrow data type

Tentatively labelling this as an upstream issue, since the affected code paths go through pyarrow.
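
For what it's worth, one more data point (assuming pyarrow recognizes pd.NA as null during object conversion, which recent versions do): materializing each nested value as a plain Python list lets the same conversion succeed, which is consistent with the StringArray element being the trigger:

>>> pa.array(agg_df['project'].apply(list))  # infers list<string> and succeeds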