In #61032 we created a new base class, BaseExecutionEngine, that engines can subclass to handle apply and map operations. The base class was initially created to allow third-party engines to be passed to DataFrame.apply(..., engine=third_party_engine). But our core engines, Python and Numba, can also be implemented on top of this base class. This will make the code cleaner and more maintainable, and it may make it easy to move the Numba engine out of the pandas code base.
The whole migration to the new interface is quite a big change, so it's recommended to make the transition step by step, in small pull requests.
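To make the idea concrete, here is a minimal sketch of an engine subclass and of how an engine= argument could be dispatched. It does not import pandas: the BaseExecutionEngine below is a hypothetical stand-in with simplified signatures, and ToyEngine and run_apply are illustrative names, not pandas API. Data is a plain list of rows.

```python
class BaseExecutionEngine:
    """Hypothetical stand-in for the base class added in #61032 (simplified)."""

    @staticmethod
    def map(data, func, args, kwargs):
        raise NotImplementedError

    @staticmethod
    def apply(data, func, args, kwargs, axis):
        raise NotImplementedError


class ToyEngine(BaseExecutionEngine):
    """Hypothetical engine that works on a list of rows instead of a DataFrame."""

    @staticmethod
    def map(data, func, args, kwargs):
        # Element-wise: call func on every value.
        return [[func(value, *args, **kwargs) for value in row] for row in data]

    @staticmethod
    def apply(data, func, args, kwargs, axis):
        # axis=0: one call per column; axis=1: one call per row.
        groups = zip(*data) if axis == 0 else data
        return [func(list(group), *args, **kwargs) for group in groups]


def run_apply(data, func, engine, axis=0, args=(), kwargs=None):
    """Hypothetical dispatcher: hand the whole operation to the engine."""
    return engine.apply(data, func, args, kwargs or {}, axis)


table = [[1, 2], [3, 4]]
column_sums = run_apply(table, sum, engine=ToyEngine)
doubled = ToyEngine.map(table, lambda value: value * 2, (), {})
```

The point of the design is that the caller only needs to know the two entry points, so the Python and Numba engines could plug in the same way a third-party engine does.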
Comment From: arthurlw
Thanks for assigning me this @datapythonista ! This looks interesting to work on and I'll start looking into it.
Comment From: datapythonista
Thanks @arthurlw. A possible approach could be starting with numba only. The numba engine is only implemented for DataFrame.apply for now, and only for certain types of the parameters. For example, it doesn't work with ufuncs.
I think the whole numba engine was introduced in two PRs, https://github.com/pandas-dev/pandas/pull/54666 and https://github.com/pandas-dev/pandas/pull/55104, and hasn't changed much. So it should be easy to see all the changes implemented for the engine.
The main logic is implemented here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/apply.py#L1096
I think having the whole numba engine as a subclass of the base executor would already be quite valuable, and much easier than refactoring all the Python engine code.
For reference, you have an implementation of a third-party executor engine in this PR: https://github.com/bodo-ai/Bodo/pull/410/files
Comment From: arthurlw
Hey @datapythonista, I’ve been thinking about how to best organize the engine subclasses and avoid circular imports. One option is to move the base class and all engine implementations into a new pandas/core/engines/ sub-package:
pandas/core/
├─ apply.py
└─ engines/
   ├─ base.py            # BaseExecutionEngine
   ├─ python_engine.py   # PythonExecutionEngine
   └─ numba_engine.py    # NumbaExecutionEngine
This keeps each engine in its own file and provides a clear plugin point for third-party engines. What do you think?
Comment From: datapythonista
This looks reasonable. I'd probably start by creating the NumbaExecutionEngine class in apply.py for now, as I think it'll be fairly small. And being in the same file, you'll also avoid circular imports. But once we properly split the Python and the Numba engines, I think it makes sense to split this way. Maybe it'd be clearer to name the directory/module apply, since engine can mean different things in pandas.
Comment From: arthurlw
Hey @datapythonista, I’m working on the PythonExecutionEngine and wanted to propose a plan for splitting the work into PRs:
- Add support for third-party execution engines for DataFrame.map, similar to what's done in #61467
- Implement PythonExecutionEngine in apply.py
- (Optional) Split engines into submodules proposed here
One question I had: for PythonExecutionEngine.apply, should we follow the approach used for NumbaExecutionEngine and lift logic from apply_raw, or should we call back into the logic defined in frame.py?
Comment From: datapythonista
Thanks @arthurlw for working on this, good questions.
What you propose sounds good to me. What makes sense to me is that we add the engine keyword with the existing behavior not only to DataFrame.map, but also to Series.apply and .pipe of both methods.
Before starting with the implementation of PythonExecutionEngine, I think we should have the NumbaExecutionEngine merged. I think using the new interface for Numba is way easier, and it should also make it easier to implement the Python engine. But PythonExecutionEngine should follow the same API as the Numba one. It's not only about apply_raw, but the whole apply.py.
The raw in apply_raw means NumPy arrays as opposed to pandas Series. When you apply a function to the Series, that is the "normal" case; when you apply it to the "raw" data, the function receives the underlying NumPy array. Numba only understands NumPy, not pandas, which is why most of the logic of the numba engine lives in apply_raw. But when dealing with the Python engine, it's not only that method, but the rest of the class where things happen.
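As a quick illustration of the "raw" distinction, the sketch below uses the public raw parameter of DataFrame.apply and records what type the applied function receives in each mode. The make_recorder helper is only for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})


def make_recorder(seen):
    """Build a function that records the type of its argument, then sums it."""
    def record(values):
        seen.append(type(values).__name__)
        return values.sum()
    return record


seen_default, seen_raw = [], []
df.apply(make_recorder(seen_default))        # default: each column is a Series
df.apply(make_recorder(seen_raw), raw=True)  # raw: each column is a NumPy array
```

After running this, seen_default contains only Series entries and seen_raw contains only ndarray entries.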
For the Numba engine, the idea is that DataFrame.apply, instead of always calling Apply.apply, will call NumbaExecutionEngine.apply. This method should do all the checks of things unsupported by Numba that now live in Apply.apply. For example:
    class Apply:
        def apply(self):
            if is_list_like(self.func):
                if self.engine == "numba":
                    raise NotImplementedError(
                        "the 'numba' engine doesn't support lists of callables yet"
                    )
will be something like:
    class NumbaExecutionEngine:
        def apply(...):
            if is_list_like(func):
                raise NotImplementedError(...)
So the class Apply shouldn't receive the engine, but be called directly only for the default engine. And for the things that the numba engine does support, from NumbaExecutionEngine you can call the methods in apply (you may need to make them functions outside the class, and call them from both Apply and NumbaExecutionEngine instead).
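Putting the pieces together, a runnable sketch of that structure could look as follows. It is not the pandas implementation: is_list_like is a simplified stand-in, apply_columns and dataframe_apply are hypothetical names, data is a plain list of rows, and numba itself is not used.

```python
def is_list_like(obj):
    """Simplified stand-in for pandas' is_list_like."""
    return isinstance(obj, (list, tuple, set))


def apply_columns(data, func):
    """Hypothetical shared helper moved out of the class: one call per column."""
    return [func(list(column)) for column in zip(*data)]


class Apply:
    """Default path only: it no longer receives or checks an engine."""

    def __init__(self, data, func):
        self.data = data
        self.func = func

    def apply(self):
        if is_list_like(self.func):
            return [apply_columns(self.data, f) for f in self.func]
        return apply_columns(self.data, self.func)


class NumbaExecutionEngine:
    @staticmethod
    def apply(data, func):
        # Engine-specific checks live in the engine, not in Apply.
        if is_list_like(func):
            raise NotImplementedError(
                "the 'numba' engine doesn't support lists of callables yet"
            )
        # A real engine would JIT-compile func here; this sketch just reuses the helper.
        return apply_columns(data, func)


def dataframe_apply(data, func, engine=None):
    """Hypothetical stand-in for DataFrame.apply's dispatch."""
    if engine is None:
        return Apply(data, func).apply()
    return engine.apply(data, func)


table = [[1, 2], [3, 4]]
default_result = dataframe_apply(table, sum)
engine_result = dataframe_apply(table, sum, engine=NumbaExecutionEngine)
```

With this split, passing a list of callables works on the default path and raises NotImplementedError only when the numba engine is selected.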
As far as I know, all the if self.engine == "numba": checks in apply.py were implemented for DataFrame.apply. But the apply of group by and window operations also supports the numba engine. I think this code lives elsewhere, but I'm not too sure. You may want to have a look.