Pandas Use BaseExecutionEngine for Python and Numba engines

In #61032 we created a new base class, BaseExecutionEngine, that engines can subclass to handle apply and map operations. The base class was initially created to allow third-party engines to be passed to DataFrame.apply(..., engine=third_party_engine), but our core engines, Python and Numba, can also be implemented as subclasses of this base class. This will make the code cleaner and more maintainable, and it may make it easy to move the Numba engine out of the pandas code base.

The whole migration to the new interface is quite a big change, so it's recommended to make the transition step by step, in small pull requests.
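
To make the subclassing idea concrete, here is a minimal, pandas-free sketch. The BaseEngine class below is a simplified stand-in and does not reproduce the actual BaseExecutionEngine signature from #61032; ListEngine, apply_with_engine, and their parameters are hypothetical names used only for illustration.

```python
import abc
from typing import Any, Callable, Optional


class BaseEngine(abc.ABC):
    """Simplified stand-in for BaseExecutionEngine (assumed shape, not the real API)."""

    @staticmethod
    @abc.abstractmethod
    def map(data: list, func: Callable[[Any], Any]) -> list:
        """Apply func to every element of data."""

    @staticmethod
    @abc.abstractmethod
    def apply(data: list, func: Callable[[list], Any]) -> Any:
        """Apply func to data as a whole."""


class ListEngine(BaseEngine):
    """Hypothetical engine that operates on plain Python lists."""

    @staticmethod
    def map(data, func):
        return [func(item) for item in data]

    @staticmethod
    def apply(data, func):
        return func(data)


def apply_with_engine(data, func, engine: Optional[type] = None):
    """Dispatch to the engine if one is given, otherwise use the default path."""
    if engine is not None:
        return engine.apply(data, func)
    return func(data)


total = apply_with_engine([1, 2, 3], sum, engine=ListEngine)
doubled = ListEngine.map([1, 2, 3], lambda x: x * 2)
```

The point of the pattern is that the caller only needs to know the two entry points (map and apply), so a core engine and a third-party engine can be swapped without touching the dispatch code.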

Comment From: arthurlw

Thanks for assigning me this @datapythonista ! This looks interesting to work on and I'll start looking into it.

Comment From: datapythonista

Thanks @arthurlw. A possible approach is to start with Numba only. The Numba engine is currently implemented only for DataFrame.apply, and only for certain parameter types. For example, it doesn't work with ufuncs.

I think the whole Numba engine was introduced in two PRs, https://github.com/pandas-dev/pandas/pull/54666 and https://github.com/pandas-dev/pandas/pull/55104, and hasn't changed much since. So it should be easy to see all the changes implemented for the engine.

The main logic is implemented here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/apply.py#L1096

I think having the whole Numba engine as a subclass of the base executor would already be quite valuable, and much easier than refactoring all the Python engine code.

For reference, you have an implementation of a third-party executor engine in this PR: https://github.com/bodo-ai/Bodo/pull/410/files

Comment From: arthurlw

Hey @datapythonista I’ve been thinking about how to best organize the engine subclasses and avoid circular imports. One option is to move the base class and all engine implementations into a new pandas/core/engines/ sub-package:

pandas/core/
├─ apply.py
└─ engines/
   ├─ base.py              # BaseExecutionEngine
   ├─ python_engine.py     # PythonExecutionEngine
   └─ numba_engine.py      # NumbaExecutionEngine

This keeps each engine in its own file and provides a clear plugin point for third-party engines. What do you think?

Comment From: datapythonista

This looks reasonable. I'd probably start by creating the NumbaExecutionEngine class in apply.py for now, as I think it'll be fairly small. Being in the same file also avoids circular imports. But once we properly split the Python and the Numba engines, I think it makes sense to split it this way. It may be clearer to name the directory/module apply, since engine can mean different things in pandas.

Comment From: arthurlw

Hey @datapythonista, I’m working on the PythonExecutionEngine and wanted to propose a plan for splitting the work into PRs:

1. Add support for third-party execution engines for DataFrame.map, similar to what's done in #61467
2. Implement PythonExecutionEngine in apply.py
3. (Optional) Split engines into submodules as proposed above

One question I had: for PythonExecutionEngine.apply, should we follow the approach used for NumbaExecutionEngine and lift logic from apply_raw, or should we call back into the logic defined in frame.py?

Comment From: datapythonista

Thanks @arthurlw for working on this, good questions.

What you propose sounds good to me. What makes sense to me is that we add the engine keyword, with the existing behavior, not only to DataFrame.map but also to Series.apply and to the .pipe method of both Series and DataFrame.

Before starting with the implementation of PythonExecutionEngine I think we should have the NumbaExecutionEngine merged. I think using the new interface for Numba is way easier, and it should also make it easier to implement the Python engine. But PythonExecutionEngine should follow the same API as the Numba one. It's not only about apply_raw, but the whole apply.py. The raw in apply_raw means NumPy arrays as opposed to pandas Series: when you apply a function to data, applying it to the Series is the "normal" case, and applying it to the "raw" data means applying it to the underlying NumPy array. Numba only understands NumPy, not pandas, which is why most of the logic of the Numba engine lives in apply_raw. For the Python engine, though, it's not only that method but the rest of the class where things happen.
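
The Series-versus-raw distinction can be observed directly through the existing raw parameter of DataFrame.apply. The following is a small illustration rather than engine code; make_recorder is a hypothetical helper that records the type of whatever the applied function receives.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def make_recorder(seen):
    """Return a function that records the type name of its input, then sums it."""

    def record_and_sum(column):
        seen.append(type(column).__name__)
        return column.sum()

    return record_and_sum


series_types = []
raw_types = []

# Default (raw=False): the function receives each column as a pandas Series.
result_series = df.apply(make_recorder(series_types))

# raw=True: the function receives each column as a NumPy array.
result_raw = df.apply(make_recorder(raw_types), raw=True)
```

Both calls return the same column sums; only the object handed to the function differs, which is what matters for an engine such as Numba that works on NumPy arrays.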

For the Numba engine, the idea is that DataFrame.apply, instead of always calling Apply.apply, will call NumbaExecutionEngine.apply. This method should do all the checks for things unsupported by Numba that now live in Apply.apply. For example:

class Apply:
    def apply(self):
        if is_list_like(self.func):
            if self.engine == "numba":
                raise NotImplementedError(
                    "the 'numba' engine doesn't support lists of callables yet"
                )

will be something like:

class NumbaExecutionEngine:
    def apply(...):
        if is_list_like(func):
            raise NotImplementedError(...)

So the Apply class shouldn't receive the engine at all; it should be called directly only for the default engine. For the things that the Numba engine does support, NumbaExecutionEngine can call the methods in Apply (you may need to turn them into functions outside the class, and call them from both Apply and NumbaExecutionEngine).
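
The refactor described above could look roughly like the following pandas-free sketch. All names here (is_list_like_sketch, apply_func_to_rows, DefaultApply, NumbaLikeEngine) are hypothetical stand-ins; the real code would use is_list_like, Apply, and NumbaExecutionEngine, whose full signatures are not spelled out in this thread.

```python
from typing import Any


def is_list_like_sketch(obj: Any) -> bool:
    """Rough stand-in for pandas' is_list_like (assumption: lists and tuples only)."""
    return isinstance(obj, (list, tuple))


def apply_func_to_rows(rows, func):
    """Shared helper, moved out of the class so both code paths can call it."""
    return [func(row) for row in rows]


class DefaultApply:
    """Stand-in for Apply: it no longer receives or inspects an engine."""

    def __init__(self, rows, func):
        self.rows = rows
        self.func = func

    def apply(self):
        if is_list_like_sketch(self.func):
            # The default path supports several callables.
            return [apply_func_to_rows(self.rows, f) for f in self.func]
        return apply_func_to_rows(self.rows, self.func)


class NumbaLikeEngine:
    """Stand-in for NumbaExecutionEngine: it owns its own "unsupported" checks."""

    @staticmethod
    def apply(rows, func):
        if is_list_like_sketch(func):
            raise NotImplementedError(
                "the 'numba' engine doesn't support lists of callables yet"
            )
        # A real engine would JIT-compile func here; the sketch calls it directly.
        return apply_func_to_rows(rows, func)


rows = [[1, 2], [3, 4]]
default_result = DefaultApply(rows, [sum, max]).apply()
engine_result = NumbaLikeEngine.apply(rows, sum)
```

With this split, the engine-specific check disappears from the default class, and the shared logic lives in a plain function that neither side owns.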

As far as I know, all the if self.engine == "numba": checks in apply.py were implemented for DataFrame.apply. But the apply of group by and window operations also supports the Numba engine. I think this code lives elsewhere, but I'm not too sure. You may want to have a look.