I think the title says it all. The pd.merge function can be compute-intensive and could (I think) benefit from parallel computing.
It does not appear to currently release the GIL: I can easily push my CPU to 100%, but no higher, when performing parallel joins.
Comment From: jorisvandenbossche
@mrocklin do you have an example of such an intensive merge where you do not see a speedup from parallelizing? On some examples I tried, I already do see some speedup (though it can probably be improved).
For example, with

```python
import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})
```
I already see some speedup:
```python
from pandas.util.testing import test_parallel

def f():
    left.merge(right, how='inner')

# serial baseline: four merges, one after another
def g4():
    for i in range(4):
        f()

# the same four merges, run simultaneously on four threads
@test_parallel(num_threads=4)
def pg4():
    f()
```
```
In [21]: %timeit g4()
10 loops, best of 3: 149 ms per loop

In [22]: %timeit pg4()
10 loops, best of 3: 99.2 ms per loop
```
When I profile this merge operation (prof_merge3.out), the main operations that take time are the following (the numbers are for this specific example, but I see similar trends with others; a sketch for reproducing such a profile follows the list):
- factorization (ca 36%) -> hashtable `Factorizer` -> this already releases the GIL where possible, I think
- the actual inner join (ca 31%):
  - ca 2/3 of the time is spent in `algos.groupsort_indexer` -> this also already releases the GIL (code)
  - the remaining logic in the `_join.inner_join` function itself -> this could further release the GIL, but I think it is only ca 10% of the overall time of the merge operation
- combining the results (ca 20%) -> comes down to mainly the `take_1d/2d` algos -> these also already release the GIL to some extent (at least the 1d ones; the 2d ones for some reason do not)
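For reference, a minimal sketch of how such a profile can be generated (not the attached prof_merge3.out itself; `prof_merge.out` here is just an illustrative output file name, and the frames are the same example as above):

```python
import cProfile
import pstats

import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})

# dump the profile to a file, then print the heaviest calls; the
# factorization, join, and take steps should show up near the top
cProfile.runctx("left.merge(right, how='inner')",
                globals(), locals(), 'prof_merge.out')
pstats.Stats('prof_merge.out').sort_stats('cumulative').print_stats(15)
```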
So from a first quick exploration: there are certainly some small improvements to be made, but it seems the bigger ones are already done (though with further analysis it is quite possible that it can be improved further).
Comment From: mrocklin
OK. Let me come up with a few examples and get back to you. If, as you say, most of this is already done, then I'll be quite happy to be incorrect here :)
Comment From: jreback
FYI: https://github.com/jreback/pandas/commit/a295e83cdd1b0e26afdd0fac2ac2b56e7ced8eda
this makes factorization about 30% faster and releases the GIL in the core parts (but it currently breaks other things).
Comment From: jorisvandenbossche
But still, I only get a speed improvement of about a factor of 1.5 on 4 cores, so it is also not that impressive.
Comment From: jreback
@mrocklin I think that to make this a truly parallel merge, you would need to change the problem a bit, e.g. partition across workers, replicate the DataFrame, then concat?
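A rough sketch of that idea, assuming threads, with the large frame partitioned and the small frame shared read-only (`parallel_merge` is an illustrative helper, not a pandas API):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def parallel_merge(left, right, num_threads=4, **kwargs):
    # partition the large frame; each worker merges its chunk against
    # the (shared, read-only) small frame, and the pieces are concatenated
    chunks = np.array_split(left, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda c: c.merge(right, **kwargs), chunks))
    return pd.concat(parts, ignore_index=True)

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})
result = parallel_merge(left, right, how='inner')
```

Note that partitioning only the left side like this preserves the result for inner and left joins, where each left row matches independently; outer joins would need more care.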
Comment From: mrocklin
@jreback yes, it could be that by operating on different dataframes we have less memory contention and would see larger speedups?
@jorisvandenbossche I'm hearing two things:
- We can get about a 50% speedup on 4 cores
- Most of the gains have already occurred
This raises the fundamental question: why isn't something closer to a 4x speedup possible? Is this a memory-hierarchy-bound operation?
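One way to probe that question (a sketch, assuming the same example frames as above): time a fixed batch of identical merges at increasing thread counts and see where throughput stops scaling.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})

def merge_once(_):
    left.merge(right, how='inner')

# a fixed batch of 8 identical merges at increasing thread counts;
# near-linear scaling suggests a CPU-bound kernel, while early
# saturation points at memory bandwidth (or GIL hold time) instead
for n in (1, 2, 4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(merge_once, range(8)))
    print('%d threads: %.3f s' % (n, time.perf_counter() - start))
```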
Comment From: jreback
@jorisvandenbossche is your test with processes? or threads?
Comment From: mrocklin
```python
from pandas.util.testing import test_parallel

@test_parallel(num_threads=4)
def pg4():
    f()
```
Comment From: jorisvandenbossche
Yes, I was using the `test_parallel` decorator, so I was testing with threads.
I don't have much experience with this, but the GIL-free operations are spread throughout the merge operation (the full merge separately releases the GIL in potentially 5 or 6 different algos). Could this be a reason for overhead and less efficient use of multiple threads, i.e. less speedup?
Comment From: jbrockmendel
Not clear there is anything more we can do here.