Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [ ] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pyarrow import string

df = pd.DataFrame(
    [
        [0, "X", "A"],
        [1, "X", "A"],
        [2, "X", "A"],
        [3, "X", "B"],
        [4, "X", "B"],
        [5, "X", "B"],
    ],
    columns=["a", "b", "c"],
).astype({"a": int, "b": str, "c": pd.ArrowDtype(string())})

# Column "c" uses pd.ArrowDtype; this is the call that raises.
df.set_index("b").groupby("a").agg(lambda df: df.to_dict())
Issue Description
When a groupby aggregation is applied to a column whose dtype was defined with pd.ArrowDtype(), pandas tries to cast the output back to the original dtype, which can raise an error (e.g. pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<location_abbreviation: string> to utf8 using function cast_string for the example provided).
By contrast, if string[pyarrow] is used, this behaviour does not occur:
import pandas as pd

df = pd.DataFrame(
    [
        [0, "X", "A"],
        [1, "X", "A"],
        [2, "X", "A"],
        [3, "X", "B"],
        [4, "X", "B"],
        [5, "X", "B"],
    ],
    columns=["a", "b", "c"],
).astype({"a": int, "b": str, "c": "string[pyarrow]"})

# Column "c" uses the "string[pyarrow]" alias; no error is raised here.
df.set_index("b").groupby("a").agg(lambda df: df.to_dict())
Alternatively, if the user-defined function also takes *args or **kwargs, the coercion is not applied:
import pandas as pd

df = pd.DataFrame(
    [
        [0, "X", "A"],
        [1, "X", "A"],
        [2, "X", "A"],
        [3, "X", "B"],
        [4, "X", "B"],
        [5, "X", "B"],
    ],
    columns=["a", "b", "c"],
).astype({"a": int, "b": str, "c": "string[pyarrow]"})

# The extra positional argument ([]) is forwarded to the lambda as "_".
df.set_index("b").groupby("a").agg(lambda df, _: df.to_dict(), [])
Both return:
| a | c |
|---|---|
| 0 | {'X': 'A'} |
| 1 | {'X': 'A'} |
| 2 | {'X': 'A'} |
| 3 | {'X': 'B'} |
| 4 | {'X': 'B'} |
| 5 | {'X': 'B'} |
Expected Behavior
I would expect the code from the example to return:
| a | c |
|----:|:-----------|
| 0 | {'X': 'A'} |
| 1 | {'X': 'A'} |
| 2 | {'X': 'A'} |
| 3 | {'X': 'B'} |
| 4 | {'X': 'B'} |
| 5 | {'X': 'B'} |
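To make the expected result concrete without relying on pandas at all, here is a plain-Python sketch that builds the same per-group dictionaries from the rows of the example. The names `rows` and `expected` are illustrative and not part of any pandas API; the sketch only mirrors what grouping by "a" and mapping the index "b" to the value "c" should produce.

```python
# Rows of the example frame as (a, b, c) tuples.
rows = [
    (0, "X", "A"),
    (1, "X", "A"),
    (2, "X", "A"),
    (3, "X", "B"),
    (4, "X", "B"),
    (5, "X", "B"),
]

# Group by "a"; within each group map the index value "b" to the value "c",
# which is what Series.to_dict() returns for a Series indexed by "b".
expected = {}
for a, b, c in rows:
    expected.setdefault(a, {})[b] = c

for a, mapping in expected.items():
    print(a, mapping)
```
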
Installed Versions
Comment From: heoh
Thanks for describing the issue. I'd like to try working on it.
Comment From: heoh
take