Pandas ENH: parallelize DataFrame.corr

Is your feature request related to a problem?

DataFrame.corr(method="spearman") is extremely slow, and method="pearson" is quite slow too. My machine's resource monitor shows that the implementation is single-threaded. Is that a design choice? If so, there should at least be an optional argument to parallelize it (at the C++ level, of course). I have not checked the actual code implementing this method.

Describe the solution you'd like

scipy.stats.spearmanr performs this computation on a NumPy array in 1/20 of the time on my 6-core machine.

API breaking implications

None.

Describe alternatives you've considered

Add an optional argument (e.g. parallelize=True/False) so that the user has this option. The method should then either be reimplemented from scratch at the C++ level, or call the existing scipy.stats function on DataFrame.values and wrap the returned array in a new DataFrame.
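
The second alternative, calling the existing SciPy function on the underlying array and wrapping the result, could look roughly like the sketch below. The helper name `spearman_corr_scipy` is hypothetical and not part of pandas or SciPy. The sketch assumes an all-numeric DataFrame with at least two columns and no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_corr_scipy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlation matrix computed by SciPy."""
    # spearmanr treats each column as one variable (axis=0 is the default)
    corr, _p_value = spearmanr(df.to_numpy())
    # With exactly two columns SciPy returns a scalar, so build the 2x2 matrix by hand
    if np.ndim(corr) == 0:
        corr = np.array([[1.0, corr], [corr, 1.0]])
    # Reattach the column labels so the result looks like DataFrame.corr output
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)), columns=list("abcde"))
result = spearman_corr_scipy(df)
```

The returned frame carries the same row and column labels as `df.corr(method="spearman")`, so it can be dropped in wherever the labels matter.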

Additional context

IMPORTANT: DataFrame.corr and spearmanr give slightly different results (a small rounding difference on the order of 1e-15).

import numpy as np
from scipy.stats import spearmanr
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman')  # a few seconds
scipy_corr, p_value = spearmanr(df.values)  # <1 sec

np.array_equal(pd_corr.values, scipy_corr)  # False: the arrays are not bit-for-bit identical
np.sum(np.abs(pd_corr.values - scipy_corr) > 1e-15)  # 0
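
Because the two results differ only at the level of floating-point rounding, an exact equality check is not very informative. A tolerance-based comparison is probably the more useful check. The sketch below uses a smaller random frame so that it runs quickly; the shape (200, 50) and the seed are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
small_df = pd.DataFrame(rng.random((200, 50)))

pd_small = small_df.corr(method="spearman").to_numpy()
scipy_small, _ = spearmanr(small_df.to_numpy())

# Largest element-wise disagreement between the two implementations
max_abs_diff = np.max(np.abs(pd_small - scipy_small))

# np.allclose uses rtol=1e-05 and atol=1e-08 by default
results_agree = np.allclose(pd_small, scipy_small)
```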

Comment From: jreback

@Vysybyl you're welcome to contribute an implementation

Comment From: lithomas1

@Vysybyl We use cython for the corr code not C++. The relevant code can be found here. https://github.com/pandas-dev/pandas/blob/84d9c5ed7965b3cc667b5a3a61d5911b1748af49/pandas/_libs/algos.pyx#L324

Comment From: GF-Huang

So how can this be parallelized?

Comment From: astrojuanlu

Was this fixed in #42761?

Comment From: jbrockmendel

Looks like we have a min_periods keyword that SciPy doesn't. Other than that I don't see why we couldn't do e.g.:

try:
    import scipy.stats
except ImportError:
    result = do_what_we_do_now()
else:
    result, _ = scipy.stats.spearmanr(values)  # spearmanr returns (correlation, p-value)
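
A slightly fuller version of this idea, written as a sketch rather than as the actual pandas implementation, might take the SciPy path only when it cannot change the result: no missing values and at least min_periods rows. The function name `spearman_with_fallback` is hypothetical, the input is assumed to be all-numeric, and the fallback here is simply the existing `DataFrame.corr` call.

```python
import numpy as np
import pandas as pd


def spearman_with_fallback(df: pd.DataFrame, min_periods: int = 1) -> pd.DataFrame:
    """Hypothetical wrapper: use SciPy when safe, otherwise defer to pandas."""
    try:
        from scipy.stats import spearmanr
    except ImportError:
        spearmanr = None

    values = df.to_numpy(dtype=float)
    can_use_scipy = (
        spearmanr is not None
        and values.shape[1] > 2            # spearmanr returns a scalar for two columns
        and len(values) >= max(min_periods, 2)
        and not np.isnan(values).any()     # SciPy has no pairwise min_periods handling
    )
    if not can_use_scipy:
        return df.corr(method="spearman", min_periods=min_periods)

    corr, _ = spearmanr(values)
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)


rng = np.random.default_rng(1)
clean = pd.DataFrame(rng.random((100, 4)))
with_gaps = clean.copy()
with_gaps.iloc[0, 0] = np.nan

fast = spearman_with_fallback(clean)                      # SciPy path
slow = spearman_with_fallback(with_gaps, min_periods=10)  # pandas path
```

Routing any frame that contains NaN back to pandas keeps the pairwise-complete behaviour that min_periods depends on, at the cost of giving up the speed-up for those inputs.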

Comment From: CangyuanLi

Hi, I am interested in contributing to this issue! In my own project, I modified the code slightly to remove the nested loop and use cython.parallel.prange. However, this requires OpenMP, which, if I understand correctly, pandas doesn't rely on at the moment? If this isn't an issue, I would be happy to submit a PR!