Is your feature request related to a problem?

When a function is passed to DataFrame.apply (with any axis), it unfortunately does not take advantage of the multiprocessing module, making it inefficient for large datasets, especially when more cores are available to do the work.

Other libraries like Modin or Dask already implement this, but I think pandas should implement it itself, if asked to.

Describe the solution you'd like

It should work with the multiprocessing module out of the box as an initial enhancement, and then in the future support other possible backends like joblib.

An example of application would be:

df.apply(lambda x: x['A'] * x['B'], axis=1, multiprocessing=True)

API breaking implications

This should not change established behavior, since the "multiprocessing" argument would default to None.

The only concern is that the DataFrame must be returned with its index unchanged, and the result must be identical to what apply produces without multiprocessing.

Describe alternatives you've considered

I have also considered extra backend options, like joblib, Ray, or Dask, as future enhancements of this implementation.

Additional context

NOTE: I already have a proof of concept for the solution, so I can develop it a bit further.

Comment From: anilkumarKanasani

Hello @nf78 ,

I would like to work on this issue and will provide a POC soon.

Comment From: TomAugspurger

@anilkumarKanasani there needs to be some discussion on the design first.

One problem here:

df.apply(lambda x: x['A'] * x['B'], axis=1, multiprocessing=True)

That doesn't allow any flexibility in how the parallelism is achieved. If we add this, it'd be better to standardize around something like concurrent.futures's API. There are some upstream issues with that (you can't use concurrent.futures.wait() on futures that don't subclass concurrent.futures.Future, IIRC), but it's at least more flexible than simply multiprocessing or nothing.
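To make that concrete, here is a minimal sketch of what an executor-based API could look like, assuming the standard concurrent.futures interface; apply_with_executor, _apply_chunk, and their parameters are hypothetical names for illustration, not an existing pandas API:

import concurrent.futures
import numpy as np
import pandas as pd

def _apply_chunk(chunk, func, axis):
    # Module-level helper so it can be pickled by a process pool;
    # func must also be picklable when a process pool is used.
    return chunk.apply(func, axis=axis)

def apply_with_executor(df, func, executor, axis=1, n_chunks=4):
    # Any concurrent.futures.Executor works here: ProcessPoolExecutor,
    # ThreadPoolExecutor, or a third-party executor with the same interface.
    chunks = np.array_split(df, n_chunks)
    futures = [executor.submit(_apply_chunk, c, func, axis) for c in chunks]
    # Collect results in submission order so the concatenated frame
    # keeps the original index.
    return pd.concat(f.result() for f in futures)

def row_product(row):
    return row['A'] * row['B']

if __name__ == "__main__":  # guard required for process pools on some platforms
    df = pd.DataFrame({'A': range(8), 'B': range(8)})
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as ex:
        result = apply_with_executor(df, row_product, ex, axis=1)

The point of this design is that the caller chooses the executor, so the mechanism of parallelism stays outside pandas itself.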

Comment From: nf78

@TomAugspurger and @anilkumarKanasani ,

Thanks for your replies and interest in this issue.

When I suggested the argument "multiprocessing=True", I didn't mean it as a hard rule. It was just an example of delegating the work to a module like multiprocessing or joblib, and the argument could indeed be a boolean True/False if we choose to support only one backend such as multiprocessing (for simplicity).

On the other hand, we could instead let the argument take a value like None, "multiprocessing", or "joblib", to allow the user to select the preferred backend. This would give the user more flexibility in how the parallelism is achieved.

I already have a POC based on joblib that splits the DataFrame according to the number of available cores (this could also be a separate argument like "n_jobs"). For example, a dataset with 1 million rows would be split into 4 DataFrames of 250 thousand rows each; apply() then runs on each DataFrame, and the results are finally concatenated. This is similar to what other backends like Dask and Modin already do. A sketch of the idea follows below.
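This is not the actual POC, just a minimal sketch of the split-apply-concatenate approach with a backend argument, assuming joblib is installed; parallel_apply and its parameters are illustrative names:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def parallel_apply(df, func, axis=1, backend=None, n_jobs=4):
    # backend=None keeps the current serial behavior, so nothing breaks.
    if backend is None:
        return df.apply(func, axis=axis)
    if backend == "joblib":
        # Split into one chunk per worker and run apply() on each
        # chunk in parallel.
        chunks = np.array_split(df, n_jobs)
        results = Parallel(n_jobs=n_jobs)(
            delayed(chunk.apply)(func, axis=axis) for chunk in chunks
        )
        # Chunks come back in submission order, so concatenation
        # preserves the original index.
        return pd.concat(results)
    raise ValueError(f"unknown backend: {backend!r}")

def row_product(row):
    return row['A'] * row['B']

df = pd.DataFrame({'A': range(100_000), 'B': range(100_000)})
assert parallel_apply(df, row_product, axis=1, backend="joblib").equals(
    df.apply(row_product, axis=1)
)

The assert demonstrates the invariant from the "API breaking implications" section: the parallel result is identical to the serial one, index included.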

Let me know your comments. Thanks.

Comment From: anilkumarKanasani

@nf78 ,

Thanks for your quick response. I understand the requirement for this issue.

I will start working on this. Do we have any discussion forum or virtual meeting rooms to discuss with community members (for any questions or suggestions)?

Can I know the process for contributing to the pandas package? Do we need to claim this issue?

Thanks, Anil Kumar Kanasani

Comment From: nf78

Hi @anilkumarKanasani,

Thanks for your interest in contributing, but as I mentioned before in other notes and under "Additional context", I already have a proof of concept that I can develop a bit further.

I just need to know if it is OK to go ahead, @TomAugspurger. Is it acceptable to let the argument take a value like None, "multiprocessing", or "joblib", to allow the user to select the preferred backend?

Thanks.