Is your feature request related to a problem?
DataFrame.corr(method="spearman") is extremely slow, and method="pearson" is quite slow too. My machine's resource monitor shows that the implementation is single-threaded. Is that a design choice? If so, there should at least be an optional argument to parallelize it (at the C++ level, of course). I have not checked the actual code implementing this method.
Describe the solution you'd like
scipy.stats.spearmanr performs this computation on a NumPy array in 1/20 of the time on my 6-core machine.
API breaking implications
None.
Describe alternatives you've considered
Add an optional argument (e.g. parallelize=True/False) to give the user this option. The method would then either be reimplemented from scratch at the C++ level, or it would call the existing scipy.stats function on DataFrame.values and wrap the returned array in a new DataFrame.
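The second option could look roughly like the sketch below. The helper name spearman_corr_scipy is hypothetical and not part of pandas or SciPy. It assumes an all-numeric DataFrame with at least two columns and no missing values, and it ignores the min_periods keyword that DataFrame.corr accepts.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_corr_scipy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlation matrix computed by SciPy."""
    # spearmanr treats each column as one variable (axis=0 is the default).
    corr, _pvalue = spearmanr(df.to_numpy())
    corr = np.asarray(corr)
    # With exactly two columns SciPy returns a scalar instead of a 2x2 matrix.
    if corr.ndim == 0:
        corr = np.array([[1.0, float(corr)], [float(corr), 1.0]])
    # Restore the column labels that were lost when converting to an array.
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)
```

The result can be compared against df.corr(method="spearman") with np.allclose rather than exact equality, since the two code paths may differ by floating-point rounding.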
Additional context
IMPORTANT: DataFrame.corr and spearmanr give slightly different results (some kind of small rounding error of about 10e-15)
import numpy as np
from scipy.stats import spearmanr
import pandas as pd
df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman') # a few seconds
scipy_corr, p_value = spearmanr(df.values) # <1 sec
np.array_equal(pd_corr.values, scipy_corr) # False
np.sum(np.abs(pd_corr.values - scipy_corr) > 1e-15) # 0
Comment From: jreback
@Vysybyl you're welcome to contribute an implementation
Comment From: lithomas1
@Vysybyl We use Cython, not C++, for the corr code. The relevant code can be found here: https://github.com/pandas-dev/pandas/blob/84d9c5ed7965b3cc667b5a3a61d5911b1748af49/pandas/_libs/algos.pyx#L324
Comment From: GF-Huang
So how can this be parallelized?
Comment From: astrojuanlu
Was this fixed in #42761?
Comment From: jbrockmendel
Looks like we have a min_periods keyword that scipy doesn't. Other than that, I don't see why we couldn't do e.g.:
try:
    import scipy.stats
except ImportError:
    result = do_what_we_do_now()
else:
    # spearmanr returns (correlation, pvalue); keep only the correlation
    result, _ = scipy.stats.spearmanr(values)
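A slightly fuller sketch of that idea, taking the min_periods difference into account, is shown below. The function name spearman_with_fallback is hypothetical, the existing DataFrame.corr stands in for "what we do now", and the input is assumed to be all-numeric. The SciPy path is only taken when there are no missing values, because that is the case where min_periods cannot change individual entries.

```python
import numpy as np
import pandas as pd


def spearman_with_fallback(df: pd.DataFrame, min_periods: int = 1) -> pd.DataFrame:
    """Hypothetical helper: use SciPy when it is safe, else the pandas path."""
    try:
        from scipy.stats import spearmanr
    except ImportError:
        spearmanr = None

    values = df.to_numpy(dtype=float)
    use_scipy = (
        spearmanr is not None
        and not np.isnan(values).any()  # SciPy has no pairwise NaN handling like pandas
        and len(df) >= min_periods      # otherwise pandas would return all NaN
        and df.shape[1] >= 3            # SciPy returns a scalar for two columns
    )
    if not use_scipy:
        # Existing behaviour, including min_periods and NaN handling.
        return df.corr(method="spearman", min_periods=min_periods)

    corr, _pvalue = spearmanr(values)
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)
```

Whether such a dispatch is acceptable would also depend on the small rounding differences between the two implementations reported earlier in this thread.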
Comment From: CangyuanLi
Hi, I am interested in contributing to this issue! In my own project, I modified the code slightly to remove the nested loop and use cython.parallel.prange. However, this requires OpenMP, which, if I understand correctly, pandas doesn't rely on at the moment. If this isn't an issue, I would be happy to submit a PR!
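For comparison, here is a way to avoid the per-pair loop without OpenMP. This is not the prange approach described above but a different technique: rank every column once, then compute an ordinary Pearson correlation on the ranks with NumPy. The matrix product inside np.corrcoef is handled by the BLAS library NumPy is linked against, which is often multithreaded, though that depends on the installation. The helper name spearman_via_ranks is hypothetical, and the sketch assumes an all-numeric DataFrame with at least two columns and no missing values.

```python
import numpy as np
import pandas as pd


def spearman_via_ranks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlation as Pearson correlation of ranks."""
    # Rank each column independently; ties receive their average rank.
    ranks = df.rank(method="average").to_numpy()
    # rowvar=False tells NumPy that columns, not rows, are the variables.
    corr = np.corrcoef(ranks, rowvar=False)
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)
```

Because there are no missing values, every pair of columns uses the same rows, so each column only needs to be ranked once instead of once per pair.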