Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import pyarrow as pa
s = pd.Series([1, 2], dtype=pd.ArrowDtype(pa.int32()))
r1 = s.rank(method="min")
df = s.to_frame(name="a")
r2 = df.rank(method="min")
>>> s
0 1
1 2
dtype: int32[pyarrow]
>>> df.dtypes
a int32[pyarrow]
dtype: object
>>> r1
0 1
1 2
dtype: uint64[pyarrow]
>>> r2
a
0 1.0
1 2.0
>>> r2.dtypes
a float64
dtype: object
Issue Description
When a DataFrame is backed by pyarrow dtypes, calling df.rank(method="min") returns a result that is no longer Arrow-backed: the column comes back as float64. Series.rank() does not have this problem; its result is still an Arrow-backed Series.
Incorrect:
>>> df.dtypes
a    int32[pyarrow]
dtype: object
>>> r2 = df.rank(method="min")
>>> r2.dtypes
a    float64
dtype: object
Correct:
>>> s
0    1
1    2
dtype: int32[pyarrow]
>>> r1 = s.rank(method="min")
>>> r1.dtype
uint64[pyarrow]
Expected Behavior
DataFrame.rank should return a pyarrow-backed DataFrame when the original DataFrame is backed by pyarrow dtypes.
Installed Versions
pd.__version__ '2.0.0'
Comment From: mroeschke
This appears to be a general issue with ExtensionArrays:
In [23]: pd.DataFrame([1], dtype="Int64").rank().dtypes
Out[23]:
0 float64
dtype: object
Comment From: mroeschke
Looks like this condition needs to account for EAs when ndim == 2
def ranker(data):
    if data.ndim == 2:
        # i.e. DataFrame, we cast to ndarray
        values = data.values
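A small sketch of why that branch matters, using only public API: for a DataFrame, `.values` produces a NumPy ndarray, while a single column still exposes its backing ExtensionArray through `.array`. It uses the `Int64` dtype from the comment above rather than pyarrow so that it runs without an extra dependency. The variable names are illustrative only, and the ndarray's exact dtype may differ between pandas versions, so it is printed rather than assumed.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Int64 (masked) dtype, as in the comment above; avoids a pyarrow dependency.
df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})

frame_values = df.values       # 2D path: the frame is materialized as an ndarray
column_values = df["a"].array  # 1D path: the column's backing ExtensionArray

frame_is_ndarray = isinstance(frame_values, np.ndarray)
column_is_ea = isinstance(column_values, ExtensionArray)

# Print the concrete types and dtypes instead of hard-coding them.
print(type(frame_values).__name__, frame_values.dtype)
print(type(column_values).__name__, column_values.dtype)
```

Once the frame has been cast to an ndarray, any later check of the form `isinstance(values, ExtensionArray)` can no longer succeed for the 2D case, which is consistent with the float64 result reported here.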
Comment From: oscar-garzon
take
Comment From: Julian048
@oscar-garzon Are you still working on this?
Comment From: jbrockmendel
This looks pretty easy: NDFrame.rank should go through self._mgr.apply. That'll also avoid a copy in data.values.
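Until something along those lines lands, a user-side workaround can be sketched with public API only: rank each column through `Series.rank` and reassemble the frame, so every column keeps whatever dtype `Series.rank` produces for it. This is not the internal `self._mgr.apply` change proposed above, and it makes no attempt to avoid copies. `rank_columnwise` is a hypothetical helper name, and the sketch assumes unique column labels.

```python
import pandas as pd


def rank_columnwise(df: pd.DataFrame, **rank_kwargs) -> pd.DataFrame:
    """Hypothetical helper: rank each column via Series.rank and reassemble.

    Assumes column labels are unique, since the results are keyed by label.
    """
    ranked = {name: df[name].rank(**rank_kwargs) for name in df.columns}
    return pd.DataFrame(ranked, index=df.index)


# Mixed frame: one extension-dtype column and one plain NumPy column.
df = pd.DataFrame(
    {"a": pd.array([3, 1, 2], dtype="Int64"), "b": [0.5, 0.1, 0.9]}
)
result = rank_columnwise(df, method="min")

# Each result column should carry the dtype Series.rank returns for it.
expected_dtypes = {name: df[name].rank(method="min").dtype for name in df.columns}
print(result.dtypes)
```

The same helper applies unchanged to pyarrow-backed columns such as `pd.ArrowDtype(pa.int32())`; the resulting dtypes depend on the installed pandas version, so they are printed here rather than stated.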