Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
a = [-3.22, 4]
x = pd.Series(a)
np.maximum(x, 0, where=x > 2)

Issue Description

Segmentation fault (core dumped) when executing the above code.

np.maximum(...) enters an infinite call cycle between NDFrame.__array_ufunc__ and arraylike.array_ufunc, which eventually exceeds the maximum stack size.

Call stack (bottom up):

...
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171
array_ufunc, arraylike.py:399
__array_ufunc__, generic.py:2171

__array_ufunc__, generic.py:2171 (core/generic.py):

class NDFrame:
    ...
    @final
    def __array_ufunc__(
        self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
    ):
        return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)  <--

array_ufunc, arraylike.py:399 (core/arraylike.py):


    elif self.ndim == 1:
        # ufunc(series, ...)
        inputs = tuple(extract_array(x, extract_numpy=True) for x in inputs)
        result = getattr(ufunc, method)(*inputs, **kwargs)   <--
    else:
        # ufunc(dataframe)
        if method == "__call__" and not kwargs:

Expected Behavior

No recursion and successful execution of the code. This used to work fine with pandas==2.1.1 (and possibly later versions as well).

Installed Versions

INSTALLED VERSIONS
------------------
commit              : 0691c5cf90477d3503834d983f69350f250a6ff7
python              : 3.13.1
python-bits         : 64
OS                  : Linux
OS-release          : 6.12.5-200.fc41.x86_64
Version             : #1 SMP PREEMPT_DYNAMIC Sun Dec 15 16:48:23 UTC 2024
machine             : x86_64
processor           :
byteorder           : little
LC_ALL              : None
LANG                : en_AU.UTF-8
LOCALE              : en_AU.UTF-8

pandas              : 2.2.3
numpy               : 2.2.1
pytz                : 2020.4
dateutil            : 2.9.0.post0
pip                 : 24.3.1
Cython              : 3.0.11
sphinx              : None
IPython             : None
adbc-driver-postgresql: None
adbc-driver-sqlite  : None
bs4                 : None
blosc               : None
bottleneck          : 1.4.2
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
html5lib            : None
hypothesis          : None
gcsfs               : None
jinja2              : None
lxml.etree          : None
matplotlib          : None
numba               : None
numexpr             : 2.10.2
odfpy               : None
openpyxl            : 3.1.2
pandas_gbq          : None
psycopg2            : 2.9.10
pymysql             : None
pyarrow             : 18.1.0
pyreadstat          : None
pytest              : 8.3.4
python-calamine     : None
pyxlsb              : None
s3fs                : None
scipy               : 1.14.1
sqlalchemy          : None
tables              : 3.10.1
tabulate            : None
xarray              : None
xlrd                : 2.0.1
xlsxwriter          : None
zstandard           : None
tzdata              : 2024.2
qtpy                : None
pyqt5               : None

Comment From: rhshadrach

Thanks for the report. I am not able to get the example working on pandas 2.1.1 either. Can you post the environment details in which it works for you?

Versions
INSTALLED VERSIONS
------------------
commit              : e86ed377639948c64c429059127bcf5b359ab6be
python              : 3.11.11.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 6.8.0-49-generic
Version             : #49~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov  6 17:42:15 UTC 2
machine             : x86_64
processor           : x86_64
byteorder           : little
LC_ALL              : None
LANG                : en_US.UTF-8
LOCALE              : en_US.UTF-8

pandas              : 2.1.1
numpy               : 1.26.4
pytz                : 2024.2
dateutil            : 2.9.0.post0
setuptools          : 59.6.0
pip                 : 24.2
Cython              : 3.0.11
pytest              : 8.3.3
hypothesis          : 6.112.1
sphinx              : 8.0.2
blosc               : 1.11.2
feather             : None
xlsxwriter          : 3.2.0
lxml.etree          : 5.3.0
html5lib            : 1.1
pymysql             : 1.4.6
psycopg2            : 2.9.9
jinja2              : 3.1.4
IPython             : 8.27.0
pandas_datareader   : None
bs4                 : 4.12.3
bottleneck          : 1.4.0
dataframe-api-compat: None
fastparquet         : 2024.5.0
fsspec              : 2024.9.0
gcsfs               : 2024.9.0post1
matplotlib          : 3.9.2
numba               : 0.60.0
numexpr             : 2.10.1
odfpy               : None
openpyxl            : 3.1.5
pandas_gbq          : None
pyarrow             : 17.0.0
pyreadstat          : 1.2.7
pyxlsb              : 1.0.10
s3fs                : 2024.9.0
scipy               : 1.14.1
sqlalchemy          : 2.0.35
tables              : 3.10.1
tabulate            : 0.9.0
xarray              : 2024.9.0
xlrd                : 2.0.1
zstandard           : 0.23.0
tzdata              : 2024.1
qtpy                : None
pyqt5               : None

Comment From: ssche

Interesting. It works for me right off the bat. See this:

>>> import numpy as np
>>> import pandas as pd
>>> a = [-3.22, 4]
>>> x = pd.Series(a)
>>> np.maximum(x, 0, where=x > 2)
0    6.900705e-310
1     4.000000e+00
dtype: float64
>>> 
>>> pd.show_versions()
virtualenv/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit              : e86ed377639948c64c429059127bcf5b359ab6be
python              : 3.11.11.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 6.12.5-200.fc41.x86_64
Version             : #1 SMP PREEMPT_DYNAMIC Sun Dec 15 16:48:23 UTC 2024
machine             : x86_64
processor           : 
byteorder           : little
LC_ALL              : None
LANG                : en_AU.UTF-8
LOCALE              : en_AU.UTF-8

pandas              : 2.1.1
numpy               : 1.24.3
pytz                : 2020.4
dateutil            : 2.8.2
setuptools          : 67.7.2
pip                 : 24.0
Cython              : 0.29.34
pytest              : 7.3.1
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : 0.9.6
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : 2.9.6
jinja2              : 2.11.2
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : 1.3.5
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
gcsfs               : None
matplotlib          : 3.9.2
numba               : None
numexpr             : 2.8.4
odfpy               : None
openpyxl            : 3.1.2
pandas_gbq          : None
pyarrow             : 11.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : 1.10.1
sqlalchemy          : 1.3.23
tables              : 3.8.0
tabulate            : None
xarray              : None
xlrd                : 2.0.1
zstandard           : None
tzdata              : 2023.4
qtpy                : None
pyqt5               : None

I'm using numpy 1.24.3, while you tried numpy 1.26.4. With numpy 1.26.4, I run into the same segfault I described (which is probably what you are also seeing in your venv).

Comment From: ssche

I ran some tests with pandas 2.1.1: the issue first occurs with numpy 1.25.0, so numpy 1.24.4 is the last numpy version that works with pandas 2.1.1.

There have been some changes around __array_ufunc__ in numpy 1.25.0 that may have caused the regression. One that looks relevant: https://numpy.org/doc/stable/release/1.25.0-notes.html#array-likes-that-define-array-ufunc-can-now-override-ufuncs-if-used-as-where

If the where keyword argument of a numpy.ufunc is a subclass of numpy.ndarray or is a duck type that defines numpy.class.__array_ufunc__ it can override the behavior of the ufunc using the same mechanism as the input and output arguments. Note that for this to work properly, the where.__array_ufunc__ implementation will have to unwrap the where argument to pass it into the default implementation of the ufunc or, for numpy.ndarray subclasses before using super().__array_ufunc__.

Indeed, when I use plain numpy arrays instead of Series for the where mask and the first argument, the problem goes away.

>>> import numpy as np
>>> import pandas as pd
>>> a = [-3.22, 4]
>>> x = pd.Series(a)
>>> np.maximum(x.values, 0, where=(x > 2).values)
array([0., 4.])
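Note that with where= and no out=, NumPy leaves the masked-out positions of a freshly allocated result uninitialized (the 6.900705e-310 in the earlier session is such garbage). A safer sketch of this workaround pre-fills an explicit output buffer:

```python
import numpy as np
import pandas as pd

a = [-3.22, 4]
x = pd.Series(a)

# Pre-fill the output so positions where the mask is False hold a
# defined value (here: the original data) rather than uninitialized memory.
out = x.to_numpy(copy=True)
np.maximum(x.to_numpy(), 0, where=(x > 2).to_numpy(), out=out)
print(out)  # [-3.22  4.  ]
```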

Comment From: rhshadrach

Thanks @ssche - agreed that appears to be it. Further investigations and PRs to fix are welcome!

Comment From: ssche

The discussion in the PR for https://github.com/numpy/numpy/issues/23219 about compatibility with Dask (and downstream libraries in general) may be relevant. I might try to see whether I can detect from the argument list of __array_ufunc__ that this is a where-call, and change the behaviour in that case to avoid the recursion.

Comment From: ssche

Would this be a viable start for a fix in arraylike.py (the if "where" in kwargs and ... block)?

def array_ufunc(self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any):
    ...
    if method == "reduce":
        # e.g. test.series.test_ufunc.test_reduce
        result = dispatch_reduction_ufunc(self, ufunc, method, *inputs, **kwargs)
        if result is not NotImplemented:
            return result

    # We still get here with kwargs `axis` for e.g. np.maximum.accumulate
    #  and `dtype` and `keepdims` for np.ptp

    if "where" in kwargs and isinstance(kwargs["where"], Series):
        kwargs["where"] = kwargs["where"].values

    if self.ndim > 1 and (len(inputs) > 1 or ufunc.nout > 1):
        # Just give up on preserving types in the complex case.
        # In theory we could preserve them for preservable cases.
        # * nout>1 is doable if BlockManager.apply took nout and
        #   returned a Tuple[BlockManager].
        # * len(inputs) > 1 is doable when we know that we have
        #   aligned blocks / dtypes.

        # e.g. my_ufunc, modf, logaddexp, heaviside, subtract, add
        inputs = tuple(np.asarray(x) for x in inputs)
        # Note: we can't use default_array_ufunc here bc reindexing means
        #  that `self` may not be among `inputs`
        result = getattr(ufunc, method)(*inputs, **kwargs)
    ...

Comment From: jbrockmendel

We are recursing because extract_array is not getting called on the "where" entry of **kwargs. Getting this right in The General Case is a very hard problem. i.e. in this particular case we could do kwargs["where"] = extract_array(kwargs["where"], extract_numpy=True), but if the Series passed happened to be not-aligned that would give silently-incorrect results.
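As a sketch of the silently-incorrect case: extracting the mask with .values/.to_numpy() (as extract_array does) drops the index, so a mask that is not positionally aligned with the Series gets applied to the wrong rows:

```python
import numpy as np
import pandas as pd

x = pd.Series([-3.22, 4.0])                    # index [0, 1]
mask = pd.Series([True, False], index=[1, 0])  # by label: only label 1 is True

# Label-aligned intent: apply the ufunc only at label 1 (the 4.0).
# Positional extraction yields [True, False], which instead hits
# position 0 (label 0, value -3.22).
out = x.to_numpy(copy=True)
np.maximum(x.to_numpy(), 0, where=mask.to_numpy(), out=out)
print(out)  # [0. 4.] -- label-aligned semantics would give [-3.22  4.  ]
```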

Properly dealing with kwargs is why implementing __array_function__ never got off the ground.