Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"A": [np.nan, 1]}, dtype="double[pyarrow]")
df.aggregate(lambda x: x.mean()).dtypes
Issue Description
The input is a DataFrame with a pyarrow-backed dtype, but the output uses numpy float64.
This is inconsistent, because calling df.aggregate("mean").dtypes on the same DataFrame
returns double[pyarrow].
Expected Behavior
I expect the returned dtype to be double[pyarrow], since a pyarrow dtype was given as input (based on https://github.com/pandas-dev/pandas/issues/53831).
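The expectation above can be written as a small check. This is only a sketch: the helper name `lambda_aggregate_keeps_dtype` and its `dtype` parameter are hypothetical, introduced here for illustration, and the default dtype assumes pyarrow is installed.

```python
import numpy as np
import pandas as pd


def lambda_aggregate_keeps_dtype(dtype="double[pyarrow]"):
    """Return True if aggregating with a lambda keeps the input column dtype.

    Hypothetical helper for illustration; the default dtype needs pyarrow.
    """
    df = pd.DataFrame(data={"A": [np.nan, 1]}, dtype=dtype)
    result = df.aggregate(lambda x: x.mean())
    # The result is a Series indexed by column name, so .dtypes is a single dtype.
    return bool(df["A"].dtype == result.dtypes)
```

Under the behavior described in this report, the call with the default dtype would return False, while the expected behavior would make it return True.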
Installed Versions
Comment From: AdrianoCLeao
I verified this locally (pandas 2.3.1, pyarrow 20.0.0, Python 3.10.12) and compared the result dtype across several call styles:
- df.aggregate(lambda x: x.mean()) → float64 (loses pyarrow dtype)
- df.aggregate("mean") → double[pyarrow] (preserves pyarrow dtype)
- df.aggregate(np.mean) → double[pyarrow] (preserves pyarrow dtype)
- df.mean() → double[pyarrow] (preserves pyarrow dtype)
So:
- Any aggregation using a lambda function loses the extension dtype.
- String aliases such as "mean" and numpy functions both preserve it.
I get the same fallback with df.apply(lambda x: x.mean()), so this isn't limited to .aggregate() but comes down to how callables are dispatched internally.
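For anyone who wants to repeat the comparison on their own setup, the call styles above can be collected in one place. This is a rough sketch rather than anything from the pandas codebase: the function name `result_dtypes_by_call_style` and its `dtype` parameter are made up for this example, and the default dtype assumes pyarrow is installed. No expected output is shown because the results appear to depend on the pandas version.

```python
import numpy as np
import pandas as pd


def result_dtypes_by_call_style(dtype="double[pyarrow]"):
    """Return the result dtype (as a string) for each way of computing the mean.

    Hypothetical helper for illustration; the default dtype needs pyarrow,
    but any pandas dtype string can be passed instead.
    """
    df = pd.DataFrame(data={"A": [np.nan, 1]}, dtype=dtype)
    calls = {
        "aggregate(lambda)": lambda: df.aggregate(lambda x: x.mean()),
        'aggregate("mean")': lambda: df.aggregate("mean"),
        "aggregate(np.mean)": lambda: df.aggregate(np.mean),
        "apply(lambda)": lambda: df.apply(lambda x: x.mean()),
        "mean()": lambda: df.mean(),
    }
    # Every call returns a Series indexed by column name; .dtypes gives its dtype.
    return {label: str(call().dtypes) for label, call in calls.items()}
```

Printing the returned dict on different pandas versions makes it easy to see which call styles fall back to float64.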
Comment From: arthurlw
Confirmed on main. PRs are welcome!
> df.aggregate(np.mean) → double[pyarrow] (preserves pyarrow dtype)
In my testing, this also returns a float64 dtype, so investigation of this case is welcome as well.
Thanks for raising this!