Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa


a = pa.array([1.0, float("NaN"), None, float("+inf")])
s = pd.Series(a, dtype=pd.Float64Dtype())

# In [7]: s
# Out[7]: 
# 0     1.0
# 1    <NA>
# 2    <NA>
# 3     inf

sa = pd.Series(a, dtype=pd.ArrowDtype(a.type))

# In [11]: sa
# Out[11]: 
# 0     1.0
# 1     NaN
# 2    <NA>
# 3     inf
# dtype: double[pyarrow]

sc = sa.astype(pd.Float64Dtype())

# In [12]: sa.astype(pd.Float64Dtype())
# Out[12]: 
# 0     1.0
# 1    <NA>
# 2    <NA>
# 3     inf
# dtype: Float64

Issue Description

Presumably, the nullable Float64Dtype() is intended to allow users to disambiguate NaN from NA, but when constructing such a Series from a pyarrow array (or casting to it via astype(...)), all NaNs are converted to NA.

Expected Behavior

# In [12]: sa.astype(pd.Float64Dtype())
# Out[12]: 
# 0     1.0
# 1    NaN
# 2    <NA>
# 3     inf
# dtype: Float64

Installed Versions

In [3]: pd.show_versions()
/usr/local/google/home/swast/envs/bigframes/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : e86ed377639948c64c429059127bcf5b359ab6be
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.3-1rodete1-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.5.3-1rodete1 (2023-09-15)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.25.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: tswast

~A workaround seems to be to construct the Series this way:~

# Edit: found a more compatible workaround. See below.
# series = column.to_pandas(integer_object_nulls=True, types_mapper=lambda _: dtype)

~Note: the integer_object_nulls=True argument is necessary, even for float64 arrays.~

Edit: The previous workaround doesn't work in pandas 1.5.x. Instead, the following works (and is likely faster too):

if dtype == pandas.Float64Dtype():
    # Preserve NA/NaN distinction. Note: This is currently needed, even if we use
    # nullable Float64Dtype in the types_mapper. See:
    # https://github.com/pandas-dev/pandas/issues/55668
    pd_array = pandas.arrays.FloatingArray(
        column.to_numpy(),
        pyarrow.compute.is_null(column).to_numpy(),
    )
    series = pandas.Series(pd_array, dtype=dtype)

Comment From: mroeschke

Thanks for the report. Yeah there's still an ongoing discussion regarding this behavior in https://github.com/pandas-dev/pandas/issues/32265

Comment From: tswast

That's quite the thread! (https://github.com/pandas-dev/pandas/issues/32265) Given that the nullable Float64Dtype was added (https://github.com/pandas-dev/pandas/pull/34307) and does distinguish NA from NaN in object dtype (I think this is intentional), I would expect it to do the same for other array types such as Arrow, where the distinction between NA values and NaN values is built-in already.

Comment From: mroeschke

Yeah I would agree; would be happy to have that feedback in that thread!