Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
a = [-3.22, 4]
x = pd.Series(a)
np.maximum(x, 0, where=x > 2)
Issue Description
Segmentation fault (core dumped) when executing the above code.
np.maximum(...) goes into an infinite call cycle which eventually exceeds the maximum stack size.
Call stack (bottom up):
...
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171
__array_ufunc__, generic.py:2171 (core/generic.py):
class NDFrame
    ...
    @final
    def __array_ufunc__(
        self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
    ):
        return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs) <--
array_ufunc, arraylike.py:399 (core/arraylike.py):
    elif self.ndim == 1:
        # ufunc(series, ...)
        inputs = tuple(extract_array(x, extract_numpy=True) for x in inputs)
        result = getattr(ufunc, method)(*inputs, **kwargs) <--
    else:
        # ufunc(dataframe)
        if method == "__call__" and not kwargs:
Expected Behavior
No recursion and successful execution of the code. This used to work fine in pandas==2.1.1 (or perhaps even higher).
Installed Versions
Comment From: rhshadrach
Thanks for the report. I am not able to get the example working on pandas 2.1.1. Can you post the environment details where it works for you?
Versions
INSTALLED VERSIONS
------------------
commit : e86ed377639948c64c429059127bcf5b359ab6be
python : 3.11.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-49-generic
Version : #49~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 6 17:42:15 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.1
numpy : 1.26.4
pytz : 2024.2
dateutil : 2.9.0.post0
setuptools : 59.6.0
pip : 24.2
Cython : 3.0.11
pytest : 8.3.3
hypothesis : 6.112.1
sphinx : 8.0.2
blosc : 1.11.2
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.3.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : 8.27.0
pandas_datareader : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat: None
fastparquet : 2024.5.0
fsspec : 2024.9.0
gcsfs : 2024.9.0post1
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : 1.2.7
pyxlsb : 1.0.10
s3fs : 2024.9.0
scipy : 1.14.1
sqlalchemy : 2.0.35
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None
Comment From: ssche
Interesting. It works for me, right off the bat. See this:
>>> import numpy as np
>>> import pandas as pd
>>> a = [-3.22, 4]
>>> x = pd.Series(a)
>>> np.maximum(x, 0, where=x > 2)
0 6.900705e-310
1 4.000000e+00
dtype: float64
>>>
>>> pd.show_versions()
virtualenv/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
------------------
commit : e86ed377639948c64c429059127bcf5b359ab6be
python : 3.11.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.12.5-200.fc41.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Sun Dec 15 16:48:23 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 2.1.1
numpy : 1.24.3
pytz : 2020.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 24.0
Cython : 0.29.34
pytest : 7.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 2.11.2
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.2
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
sqlalchemy : 1.3.23
tables : 3.8.0
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None
I'm using numpy 1.24.3, while you tried with numpy 1.26.4. With numpy 1.26.4, I'm running into the same issue that I described (and which you are probably also experiencing with your venv).
Comment From: ssche
I ran some tests with pandas 2.1.1: the issue first occurs with numpy 1.25.0, so numpy 1.24.4 is the last version that works with pandas 2.1.1.
There have been some changes around __array_ufunc__ in numpy 1.25.0 which may have contributed to the regression. One I found which may be relevant is https://numpy.org/doc/stable/release/1.25.0-notes.html#array-likes-that-define-array-ufunc-can-now-override-ufuncs-if-used-as-where
If the where keyword argument of a numpy.ufunc is a subclass of numpy.ndarray or is a duck type that defines numpy.class.__array_ufunc__ it can override the behavior of the ufunc using the same mechanism as the input and output arguments. Note that for this to work properly, the where.__array_ufunc__ implementation will have to unwrap the where argument to pass it into the default implementation of the ufunc or, for numpy.ndarray subclasses before using super().__array_ufunc__.
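To illustrate what the release note asks for, here is a minimal sketch of a duck type whose __array_ufunc__ unwraps the where argument along with the inputs. The Wrapped class and its unwrap helper are hypothetical names made up for this illustration; this is not pandas code and not a proposed fix.

```python
import numpy as np


class Wrapped:
    """Hypothetical duck-typed container used only for illustration."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __gt__(self, other):
        # Comparison returns another Wrapped, like Series > scalar returns a Series.
        return Wrapped(self.data > other)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        def unwrap(obj):
            return obj.data if isinstance(obj, Wrapped) else obj

        inputs = tuple(unwrap(x) for x in inputs)
        # Unwrap `where` as well. If it stayed a Wrapped, NumPy >= 1.25 would
        # dispatch back to this method and the call would never terminate.
        if "where" in kwargs:
            kwargs["where"] = unwrap(kwargs["where"])
        # NumPy always hands `out` to __array_ufunc__ as a tuple.
        if "out" in kwargs:
            kwargs["out"] = tuple(unwrap(o) for o in kwargs["out"])
        result = getattr(ufunc, method)(*inputs, **kwargs)
        return Wrapped(result) if isinstance(result, np.ndarray) else result


x = Wrapped([-3.22, 4.0])
# `out` is supplied so the slot where the mask is False has a defined value.
res = np.maximum(x, 0, where=x > 2, out=np.zeros(2))
print(res.data)
```

The sketch only handles single-output ufuncs; a real implementation would also need to deal with tuple results and other keyword arguments.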
Indeed, when I use straight numpy arrays instead of series for the where mask and the first argument, the problem goes away.
>>> import numpy as np
>>> import pandas as pd
>>> a = [-3.22, 4]
>>> x = pd.Series(a)
>>> np.maximum(x.values, 0, where=(x > 2).values)
array([0., 4.])
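One caveat with this workaround: when where is given without out, NumPy leaves the entries where the mask is False uninitialized, which is likely where values such as 6.900705e-310 in the earlier output come from. A minimal sketch that makes those entries deterministic, assuming the masked-out positions should keep their original values (that choice is an assumption, not something stated in this report):

```python
import numpy as np
import pandas as pd

x = pd.Series([-3.22, 4.0])
mask = (x > 2).to_numpy()  # plain boolean ndarray, so no Series reaches the ufunc

# Pre-fill the output with a copy of the data; positions where mask is False
# are left untouched by the ufunc and therefore keep these values.
out = x.to_numpy(copy=True)
np.maximum(x.to_numpy(), 0, where=mask, out=out)

result = pd.Series(out, index=x.index)
print(result)
```

Passing np.zeros(len(x)) or another fill array as out would work the same way if a different default is wanted.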
Comment From: rhshadrach
Thanks @ssche - agreed that appears to be it. Further investigations and PRs to fix are welcome!
Comment From: ssche
This discussion in the PR for https://github.com/numpy/numpy/issues/23219 about compatibility with Dask (and downstream libs in general) may be relevant. I might try to see if I can observe any changes in the argument list of __array_ufunc__ to detect whether this is a where-call (to change the behaviour in that case to avoid the recursion).
Comment From: ssche
Would this be a viable start for a fix in arraylike.py (if "where" in kwargs and...)?
def array_ufunc(self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any):
    ...
    if method == "reduce":
        # e.g. test.series.test_ufunc.test_reduce
        result = dispatch_reduction_ufunc(self, ufunc, method, *inputs, **kwargs)
        if result is not NotImplemented:
            return result

    # We still get here with kwargs `axis` for e.g. np.maximum.accumulate
    # and `dtype` and `keepdims` for np.ptp

    if "where" in kwargs and isinstance(kwargs["where"], Series):
        where = kwargs["where"]
        kwargs["where"] = where.values

    if self.ndim > 1 and (len(inputs) > 1 or ufunc.nout > 1):
        # Just give up on preserving types in the complex case.
        # In theory we could preserve them for them.
        # * nout>1 is doable if BlockManager.apply took nout and
        #   returned a Tuple[BlockManager].
        # * len(inputs) > 1 is doable when we know that we have
        #   aligned blocks / dtypes.

        # e.g. my_ufunc, modf, logaddexp, heaviside, subtract, add
        inputs = tuple(np.asarray(x) for x in inputs)
        # Note: we can't use default_array_ufunc here bc reindexing means
        # that `self` may not be among `inputs`
        result = getattr(ufunc, method)(*inputs, **kwargs)
    ...
Comment From: jbrockmendel
We are recursing because extract_array is not getting called on the "where" entry of **kwargs. Getting this right in The General Case is a very hard problem. For example, in this particular case we could do kwargs["where"] = extract_array(kwargs["where"], extract_numpy=True), but if the Series passed happened to be not-aligned that would give silently-incorrect results.
Properly dealing with kwargs is why implementing __array_function__ never got off the ground.
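The alignment concern can be shown with a small sketch. It assumes a mask Series that carries the same labels as x in a different order, and it uses reindex as one possible way to align; neither is taken from the pandas code base, and the sketch avoids passing a Series as where so it does not trigger the recursion.

```python
import numpy as np
import pandas as pd

x = pd.Series([-3.22, 4.0], index=["a", "b"])
# Same labels as x, but stored in the opposite order.
mask = pd.Series([True, False], index=["b", "a"])

# Positional extraction ignores labels: True ends up paired with "a".
positional = mask.to_numpy()
# Label-based alignment first: True stays paired with "b".
# Assumes every label of x is present in mask.
aligned = mask.reindex(x.index).to_numpy(dtype=bool)

out_positional = np.maximum(x.to_numpy(), 0, where=positional, out=x.to_numpy(copy=True))
out_aligned = np.maximum(x.to_numpy(), 0, where=aligned, out=x.to_numpy(copy=True))

print(out_positional)  # ufunc applied at "a", the wrong label
print(out_aligned)     # ufunc applied at "b", the label the mask selected
```

Both calls succeed without any warning, which is the "silently-incorrect" outcome described above when the mask is unwrapped without aligning it first.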