Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
a = pd.Series(np.zeros(1000000), dtype="float32") + np.float32(1)
b = pd.Series(np.zeros(1000001), dtype="float32") + np.float32(1)
print(a.dtype, b.dtype)
Issue Description
Performing binary operations on larger Series with dtype == 'float32' leads to unexpected upcasts to float64.
The above example prints float32 float64.
Using to_numpy() on the series before the addition inhibits the implicit upcast.
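A minimal sketch of that workaround, assuming the same Series size as in the example above. The variable names are illustrative only. The addition runs on the plain float32 ndarray, and the result is wrapped back into a Series afterwards:

```python
import numpy as np
import pandas as pd

n = 1_000_001  # the size from the example above that printed float64
s = pd.Series(np.zeros(n), dtype="float32")

# to_numpy() returns a plain float32 ndarray, so the addition is carried out
# by NumPy directly; float32 array + float32 scalar stays float32.
values = s.to_numpy() + np.float32(1)

# Wrap the result back into a Series, reusing the original index.
result = pd.Series(values, index=s.index)
print(result.dtype)
```

This only sidesteps the upcast for one operation; it does not change how pandas handles the scalar internally.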
Expected Behavior
I expect the above snippet to print float32 float32.
Installed Versions
Comment From: stertingen
After stepping through with a debugger, I have the following insights to share:
With series larger than 1000000 items, Pandas uses NumExpr.
Also, pandas converts the numpy float32 scalar to a Python floating point number in ops.maybe_prepare_scalar_for_op.
Then, NumExpr behaves as described in https://numexpr.readthedocs.io/en/latest/user_guide.html#casting-rules, assuming a double precision floating point value.
Comment From: allecole
Hi, I'd like to tackle this issue. I am new to the project, but I agree that the conversion to a Python float is causing this bug. A fix I could implement would be to avoid the float() call in order to preserve the type, since the scalar is currently being converted to a Python float.
This code inside _array_ops.py:
    elif isinstance(obj, np.floating):
        return float(obj)
should return a float32 scalar instead, by skipping the conversion for float32:
    elif isinstance(obj, np.floating):
        if obj.dtype == np.float32:
            return obj
        else:
            return float(obj)
I will test this within my own fork prior to making a PR.
Comment From: allecole
take
Comment From: stertingen
A few thoughts from my side as a user, not a library maintainer:
Well, you could just remove the conversion for all numpy scalars in that case, not just float32.
However, I think this piece of code exists for a reason.
It was introduced in https://github.com/pandas-dev/pandas/pull/55739, referring to https://numpy.org/neps/nep-0050-scalar-promotion.html.
It looks like scalars are converted to Python scalars in order to invoke Numpy's introspective casting behavior (inspecting the values and determining the best Numpy value for the result), as documented in NEP 50.
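The NumPy side of this can be checked without pandas or Numexpr. The sketch below uses illustrative variable names. It shows that a Python float added to a float32 array leaves the dtype at float32, while a float64 array operand promotes the result:

```python
import numpy as np

arr32 = np.zeros(3, dtype=np.float32)

# A Python float does not carry a NumPy dtype of its own, so the result
# keeps the dtype of the float32 array.
with_python_float = arr32 + 1.0

# A float64 array has an explicit dtype, so the result is promoted.
with_float64_array = arr32 + np.ones(3, dtype=np.float64)

print(with_python_float.dtype, with_float64_array.dtype)  # float32 float64
```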
However, Numexpr does not have this introspective casting behavior and casts the result to float64.
So IMHO the fix would be to only cast to a Python scalar when using Numpy, not when using Numexpr.
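A rough sketch of that idea follows. The function name `prepare_float_scalar` and the `use_numexpr` flag are hypothetical and not part of pandas. The sketch also assumes the caller already knows which engine will evaluate the operation, which this thread does not establish:

```python
import numpy as np


def prepare_float_scalar(obj, use_numexpr):
    """Hypothetical helper, NOT the actual pandas implementation."""
    if isinstance(obj, np.floating):
        if use_numexpr:
            # Keep the NumPy scalar so that its dtype (e.g. float32) stays
            # visible to the engine instead of becoming a Python float.
            return obj
        # NumPy path: convert to a Python float, as the current code does.
        return float(obj)
    # Anything that is not a NumPy floating scalar is passed through unchanged.
    return obj


print(type(prepare_float_scalar(np.float32(1), use_numexpr=True)))
print(type(prepare_float_scalar(np.float32(1), use_numexpr=False)))
```

Whether the engine choice is actually available at the point where the scalar is prepared would need to be checked against the pandas source.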
Comment From: rhshadrach
Agreed this is a bug. I haven't looked into the details on the proposed fixes so can't give any feedback there, but PRs to fix would be welcome!
Comment From: jbrockmendel
Will not-upcasting prevent us from using numexpr?