Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
pd.options.mode.dtype_backend = 'pyarrow'

df = pd.DataFrame({
    'tags': pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5], dtype='int64[pyarrow]'),
    'value': pd.Series(np.random.rand(15), dtype='double[pyarrow]'),
})

# transform: UDF (lambda) vs. built-in string alias
result1 = df.groupby('tags')['value'].transform(lambda x: x.sum())
print(result1.dtype)
result2 = df.groupby('tags')['value'].transform('sum')
print(result2.dtype)

# apply: UDF (lambda) vs. built-in string alias
result3 = df.groupby('tags')['value'].apply(lambda x: x.sum())
print(result3.dtype)
result4 = df.groupby('tags')['value'].apply('sum')
print(result4.dtype)

Issue Description

I'm currently trying out the RC on my work project. I noticed that when a lambda function is passed to groupby together with apply or transform, the dtype changes from double[pyarrow] back to float64.

Expected Behavior

I'd expect to consistently get the same dtype (double[pyarrow]) whether a lambda or the string alias is used.

Installed Versions

2.0.0rc0

Comment From: mroeschke

Thanks for the report. For a UDF (e.g. a custom lambda function) passed into transform/apply/agg, if the dtype isn't specified by the UDF, the dtype needs to be inferred, and numpy dtypes are still the default returned from inference.

This will be solved once pandas DataFrame/Series constructors can return pyarrow dtypes from inference. We plan to add this as a keyword in a future version, but for this case a global option will probably be needed.
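
One possible workaround until inference can return pyarrow dtypes, sketched under assumptions: cast the UDF result back to the input column's dtype. `transform_keep_dtype` is a hypothetical helper, not a pandas API, and the fallback to `Float64` is there only so the sketch runs without pyarrow installed.

```python
import pandas as pd


def transform_keep_dtype(df, by, column, func):
    """Hypothetical helper: run a UDF transform, then cast the result
    back to the original dtype of the transformed column."""
    result = df.groupby(by)[column].transform(func)
    return result.astype(df[column].dtype)


# Use the pyarrow-backed dtype when pyarrow is installed; otherwise fall
# back to pandas' nullable Float64 (assumption, only to keep this runnable).
try:
    import pyarrow  # noqa: F401
    value_dtype = "double[pyarrow]"
except ImportError:
    value_dtype = "Float64"

df = pd.DataFrame({
    "tags": [1, 1, 2, 2],
    "value": pd.Series([0.5, 1.5, 2.0, 3.0], dtype=value_dtype),
})

out = transform_keep_dtype(df, "tags", "value", lambda x: x.sum())
print(out.dtype)
```

The cast happens after the fact, so it does not change how pandas infers the dtype internally; it only restores the dtype the caller already knows.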

Comment From: jbrockmendel

There's a lot going on here. apply, transform, and agg have different expectations about what kind of output they see, and correspondingly different paths for inference. But they don't enforce those expectations (xref #51445). In this case, transform doesn't do any inference internally because it expects UDFs to return results that already have a dtype, whereas the UDF here returns a scalar.

-1 on a constructor keyword, xref #51846. The idiomatic way to do this is with maybe_cast_pointwise_result as close as possible to the UDF call.

xref #49163 for pyarrow dtype inference on UDF
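
To illustrate the point about transform expecting dtype-carrying results, here is a minimal sketch in which the UDF returns a Series built with the group's own dtype rather than a bare scalar. `typed_sum` is a hypothetical name, the `Float64` fallback exists only so the sketch runs without pyarrow, and the behavior is assumed from the description above rather than guaranteed across pandas versions.

```python
import pandas as pd

# Prefer the pyarrow-backed dtype; fall back to nullable Float64 if pyarrow
# is not installed (assumption, only to keep the sketch runnable).
try:
    import pyarrow  # noqa: F401
    value_dtype = "double[pyarrow]"
except ImportError:
    value_dtype = "Float64"

df = pd.DataFrame({
    "tags": [1, 1, 2, 2],
    "value": pd.Series([0.5, 1.5, 2.0, 3.0], dtype=value_dtype),
})


def typed_sum(group):
    """Hypothetical UDF: broadcast the group sum into a Series that
    explicitly carries the group's dtype, so nothing has to be inferred."""
    return pd.Series(group.sum(), index=group.index, dtype=group.dtype)


out = df.groupby("tags")["value"].transform(typed_sum)
print(out.dtype)
```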

Comment From: jbrockmendel

On main the transform examples are still float64, but the apply examples are double[pyarrow]. As I mentioned above, there is nothing we can do for the transform cases. Closing as complete.