Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import pyarrow as pa
a = pa.array([1.0, float("NaN"), None, float("+inf")])
s = pd.Series(a, dtype=pd.Float64Dtype())
# In [7]: s
# Out[7]:
# 0 1.0
# 1 <NA>
# 2 <NA>
# 3 inf
sa = pd.Series(a, dtype=pd.ArrowDtype(a.type))
# In [11]: sa
# Out[11]:
# 0 1.0
# 1 NaN
# 2 <NA>
# 3 inf
# dtype: double[pyarrow]
sc = sa.astype(pd.Float64Dtype())
# In [12]: sa.astype(pd.Float64Dtype())
# Out[12]:
# 0 1.0
# 1 <NA>
# 2 <NA>
# 3 inf
# dtype: Float64
Issue Description
Presumably, the nullable Float64Dtype() is intended to let users distinguish NaN from NA. However, when constructing such a Series from a pyarrow array (or casting to it via astype(...)), all NaNs are converted to NA.
Expected Behavior
# In [12]: sa.astype(pd.Float64Dtype())
# Out[12]:
# 0 1.0
# 1 NaN
# 2 <NA>
# 3 inf
# dtype: Float64
Installed Versions
Comment From: tswast
~A workaround seems to be to construct the Series this way:~
# Edit: found a more compatible workaround. See below.
# series = column.to_pandas(integer_object_nulls=True, types_mapper=lambda _: dtype)
~Note: the integer_object_nulls=True argument is necessary, even for float64 arrays.~
Edit: The previous workaround doesn't work in pandas 1.5.x. Instead, the following works (and is likely faster too):
if dtype == pandas.Float64Dtype():
    # Preserve NA/NaN distinction. Note: This is currently needed, even if we use
    # nullable Float64Dtype in the types_mapper. See:
    # https://github.com/pandas-dev/pandas/issues/55668
    pd_array = pandas.arrays.FloatingArray(
        column.to_numpy(),
        pyarrow.compute.is_null(column).to_numpy(),
    )
    series = pandas.Series(pd_array, dtype=dtype)
Comment From: mroeschke
Thanks for the report. Yeah there's still an ongoing discussion regarding this behavior in https://github.com/pandas-dev/pandas/issues/32265
Comment From: tswast
That's quite the thread! (https://github.com/pandas-dev/pandas/issues/32265) Given that the nullable Float64Dtype was added (https://github.com/pandas-dev/pandas/pull/34307) and does distinguish NA from NaN in object dtype (I think this is intentional), I would expect it to do the same for other array types such as Arrow, where the distinction between NA values and NaN values is already built in.
Comment From: mroeschke
Yeah I would agree; would be happy to have that feedback in that thread!