Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas
import numpy

df_with_numpy_values = pandas.DataFrame(
    {
        "col_int": [numpy.int64(1), numpy.int64(2)],
        "col_float": [numpy.float64(1.5), numpy.float64(2.5)],
        "col_bool": [numpy.bool_(True), numpy.bool_(False)],
        "col_str": [numpy.str_("a"), numpy.str_("b")],
    }
)

df_as_object = df_with_numpy_values.astype(object)

for column in df_as_object.columns:
    for value in df_as_object[column]:
        assert type(value) in (
            int,
            float,
            str,
            bool,
        ), f"Value {value} in column {column} is not a Python type, but {type(value)}"

Issue Description

Calling .astype(object) on a DataFrame holding NumPy scalar values converts the values to their Python equivalents, except for numpy.str_.

Expected Behavior

I would expect numpy.str_ values to be converted to str.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.10.12
python-bits           : 64
OS                    : Linux
OS-release            : 5.15.0-126-generic
Version               : #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 2.1.3
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 22.0.2
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

Comment From: rhshadrach

Thanks for the report. pandas does not have any support for NumPy string dtype yet, so these values are treated purely as Python objects. That is why, even without any conversion, pandas gives this column object dtype.

import numpy as np
import pandas as pd

df_with_numpy_values = pd.DataFrame(
    {
        "col_int": [np.int64(1), np.int64(2)],
        "col_float": [np.float64(1.5), np.float64(2.5)],
        "col_bool": [np.bool_(True), np.bool_(False)],
        "col_str": [np.str_("a"), np.str_("b")],
    }
)
print(df_with_numpy_values.dtypes)
# col_int        int64
# col_float    float64
# col_bool        bool
# col_str       object
# dtype: object

@jorisvandenbossche @WillAyd - do we have any tracking issues for NumPy string dtype support? I'm not seeing any.

Comment From: jorisvandenbossche

@rhshadrach note that this is not about the new NumPy string dtype, but about the older unicode string dtype "U" and its scalars (there is an issue about the new NumPy string dtype, though: https://github.com/pandas-dev/pandas/issues/58503)

Calling .astype(object) on a dataframe with numpy values converts the types of the values to the python equivalents, except for numpy.str_.

The issue here is that this column is already object dtype (storing those np.str_ scalars), as @rhshadrach showed above, and therefore the astype(object) step does nothing.

With the upcoming pandas 3.0 (or when testing on main with the future option enabled), we will start to infer these NumPy scalars as a proper string dtype instead of object dtype, and at that point astype(object) will also convert them to Python strings:

In [13]: pd.options.future.infer_string = True

In [14]: df_with_numpy_values = pd.DataFrame(
    ...:     {
    ...:         "col_int": [np.int64(1), np.int64(2)],
    ...:         "col_float": [np.float64(1.5), np.float64(2.5)],
    ...:         "col_bool": [np.bool_(True), np.bool_(False)],
    ...:         "col_str": [np.str_("a"), np.str_("b")],
    ...:     }
    ...: )

In [15]: print(df_with_numpy_values.dtypes)
col_int        int64
col_float    float64
col_bool        bool
col_str          str
dtype: object

In [16]: type(df_with_numpy_values["col_str"].astype(object)[0])
Out[16]: str
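
For releases where the column is still object dtype, the following is a minimal sketch of one possible workaround, not an official recommendation. It assumes that calling str() on each element is acceptable for the data at hand, and it builds the column with an explicit dtype=object so that it behaves the same regardless of the string inference setting. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Explicit object dtype, mirroring what the column looks like in the report.
col = pd.Series([np.str_("a"), np.str_("b")], dtype=object)

# The column is already object dtype, so astype(object) has nothing to convert
# and the np.str_ scalars are kept as they are.
unchanged = col.astype(object)
types_before = {type(v) for v in unchanged}

# Calling str() on each element yields plain Python strings.
converted = col.map(str)
types_after = {type(v) for v in converted}

print(types_before)
print(types_after)
```

Note that numpy.str_ is a subclass of str, so an isinstance(value, str) check already passes on the unconverted values; only an exact type(value) comparison, as in the original example, tells the two apart.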

Comment From: rhshadrach

Thanks @jorisvandenbossche! With that, I don't see anything that needs to be done here. Marking as a closing candidate.

Comment From: jorisvandenbossche

Indeed, closing.