Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas
import numpy

df_with_numpy_values = pandas.DataFrame(
    {
        "col_int": [numpy.int64(1), numpy.int64(2)],
        "col_float": [numpy.float64(1.5), numpy.float64(2.5)],
        "col_bool": [numpy.bool_(True), numpy.bool_(False)],
        "col_str": [numpy.str_("a"), numpy.str_("b")],
    }
)
df_as_object = df_with_numpy_values.astype(object)
for column in df_as_object.columns:
    for value in df_as_object[column]:
        assert type(value) in (
            int,
            float,
            str,
            bool,
        ), f"Value {value} in column {column} is not a Python type, but {type(value)}"
Issue Description
Calling .astype(object) on a DataFrame with numpy values converts the values to their Python equivalents, except for numpy.str_.
Expected Behavior
I would expect that values with numpy.str_ would be turned into str.
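For context on why the assertion above trips only on the string column: numpy.str_ is a subclass of str, so an isinstance check passes while the exact type check used in the example does not. The following is a minimal sketch using only numpy; the variable names are illustrative and not part of the original report.

```python
import numpy

value = numpy.str_("a")

# numpy.str_ subclasses the built-in str, so isinstance accepts it.
is_str_instance = isinstance(value, str)

# The exact type is still numpy.str_, which is what the assertion compares.
is_exact_str = type(value) is str

# Calling str() on the scalar yields a plain Python string.
is_exact_after_str = type(str(value)) is str

print(is_str_instance, is_exact_str, is_exact_after_str)
```

If only string-like behavior matters, an isinstance check may be enough; the exact type matters mainly for code that compares types directly.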
Installed Versions
Comment From: rhshadrach
Thanks for the report. pandas does not yet support the NumPy string dtype, so these values are treated purely as Python objects. That is why, even without any conversion, pandas gives this column object dtype.
import numpy as np
import pandas as pd

df_with_numpy_values = pd.DataFrame(
    {
        "col_int": [np.int64(1), np.int64(2)],
        "col_float": [np.float64(1.5), np.float64(2.5)],
        "col_bool": [np.bool_(True), np.bool_(False)],
        "col_str": [np.str_("a"), np.str_("b")],
    }
)
print(df_with_numpy_values.dtypes)
# col_int        int64
# col_float    float64
# col_bool        bool
# col_str       object
# dtype: object
@jorisvandenbossche @WillAyd - do we have any tracking issues for NumPy string dtype support? I'm not seeing any.
Comment From: jorisvandenbossche
@rhshadrach note that this is not about the new numpy string dtype, but about the older unicode string dtype "U" and its scalars (but an issue about the new numpy string dtype is https://github.com/pandas-dev/pandas/issues/58503)
> Calling .astype(object) on a dataframe with numpy values converts the types of the values to the python equivalents, except for numpy.str_.
The issue here is that this column is already object dtype (storing those np.str_ scalars), as @rhshadrach showed above, and therefore the astype(object) step does nothing.
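As a rough illustration of that point, the sketch below builds the column with an explicit dtype=object (an assumption made here so the result does not depend on how a given pandas version infers np.str_ scalars) and shows one possible way to get plain Python strings by mapping str over the values. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Explicit object dtype, so the np.str_ scalars are stored as-is
# regardless of the pandas version's string inference.
col = pd.Series([np.str_("a"), np.str_("b")], dtype=object)

# astype(object) on an object column leaves the stored scalars untouched.
unchanged = col.astype(object)
still_numpy = all(isinstance(v, np.str_) for v in unchanged)

# Calling str() on each element produces plain Python strings; the final
# astype(object) keeps the result as an object column on any version.
converted = col.map(str).astype(object)
all_python_str = all(type(v) is str for v in converted)

print(still_numpy, all_python_str)
```

This is a workaround sketch for object columns only, not a statement of how pandas handles the conversion internally.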
With the upcoming pandas 3.0 (or on main, when testing with the future option enabled), we will start to infer the numpy scalars as a proper string dtype instead of object dtype, and at that point astype(object) will also convert them to Python strings:
In [13]: pd.options.future.infer_string = True

In [14]: df_with_numpy_values = pd.DataFrame(
    ...:     {
    ...:         "col_int": [np.int64(1), np.int64(2)],
    ...:         "col_float": [np.float64(1.5), np.float64(2.5)],
    ...:         "col_bool": [np.bool_(True), np.bool_(False)],
    ...:         "col_str": [np.str_("a"), np.str_("b")],
    ...:     }
    ...: )

In [15]: print(df_with_numpy_values.dtypes)
col_int        int64
col_float    float64
col_bool        bool
col_str          str
dtype: object

In [16]: type(df_with_numpy_values["col_str"].astype(object)[0])
Out[16]: str
Comment From: rhshadrach
Thanks @jorisvandenbossche! With that, I don't see anything that needs to be done here. Marking as a closing candidate.
Comment From: jorisvandenbossche
Indeed, closing.