I think the title says it all. The pd.merge function can be compute-intensive and could (I think) benefit from parallel computing.
It does not appear to currently release the GIL: I can easily push my CPU to 100%, but no higher, when performing parallel joins.
Comment From: jorisvandenbossche
@mrocklin do you have an example of such an intensive merge where you do not see a speedup from parallelizing? On some examples I tried, I already do see some speedup (though it can probably be improved).
For example, with

```python
import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})
```
I already see some speedup:
```python
from pandas.util.testing import test_parallel

def f():
    left.merge(right, how='inner')

# serial baseline: four merges, one after another
def g4():
    for i in range(4):
        f()

# the same four merges, run simultaneously on four threads
@test_parallel(num_threads=4)
def pg4():
    f()
```
```
In [21]: %timeit g4()
10 loops, best of 3: 149 ms per loop

In [22]: %timeit pg4()
10 loops, best of 3: 99.2 ms per loop
```
When I profile this merge operation (prof_merge3.out), the main operations that take time are the following (the numbers are for this specific example, but I see similar trends with others; a sketch for reproducing such a profile follows the list):
- factorization (ca 36%) -> hashtable `Factorizer` -> this already releases the GIL where possible, I think
- the actual inner join (ca 31%):
  - ca 2/3 of the time is spent in `algos.groupsort_indexer` -> this also already releases the GIL (code)
  - the remaining logic in the `_join.inner_join` function itself -> this could further release the GIL, but I think it is only ca 10% of the overall time of the merge operation
- combining the results (ca 20%) -> comes down to mainly the `take_1d/2d` algos -> these also already release the GIL to some extent (at least the 1d ones; the 2d ones for some reason do not)
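For reference, a minimal sketch of how such a profile can be generated (not the attached prof_merge3.out itself; `prof_merge.out` here is just an illustrative output file name, and the frames are the same example as above):

```python
import cProfile
import pstats

import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})

# dump the profile to a file, then print the heaviest calls; the
# factorization, join, and take steps should show up near the top
cProfile.runctx("left.merge(right, how='inner')",
                globals(), locals(), 'prof_merge.out')
pstats.Stats('prof_merge.out').sort_stats('cumulative').print_stats(15)
```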
So from a first quick exploration: there are certainly some small improvements to be made, but it seems the bigger ones are already done (though with further analysis it is quite possible that it can be improved further).
Comment From: mrocklin
OK. Let me come up with a few examples and get back to you. If, as you say, most of this is already done, then I'll be quite happy to be incorrect here :)
Comment From: jreback
FYI: https://github.com/jreback/pandas/commit/a295e83cdd1b0e26afdd0fac2ac2b56e7ced8eda
this makes factorization about 30% faster and releases the GIL in the core parts (but it currently breaks other things).
Comment From: jorisvandenbossche
But still, I only get a speed improvement of about a factor of 1.5 on 4 cores, so it is also not that impressive.
Comment From: jreback
@mrocklin I think that to make this a truly parallel merge, you would need to change the problem a bit, e.g. partition across workers, replicate the DataFrame, then concat?
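A rough sketch of that idea, assuming threads, with the large frame partitioned and the small frame shared read-only (`parallel_merge` is an illustrative helper, not a pandas API):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def parallel_merge(left, right, num_threads=4, **kwargs):
    # partition the large frame; each worker merges its chunk against
    # the (shared, read-only) small frame, and the pieces are concatenated
    chunks = np.array_split(left, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda c: c.merge(right, **kwargs), chunks))
    return pd.concat(parts, ignore_index=True)

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})
result = parallel_merge(left, right, how='inner')
```

Note that partitioning only the left side like this preserves the result for inner and left joins, where each left row matches independently; outer joins would need more care.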
Comment From: mrocklin
@jreback yes, it could be that by operating on different dataframes we have less memory contention and would see larger speedups?
@jorisvandenbossche I'm hearing two things:
- We can get about a 50% speedup on 4 cores
- Most of the gains have already occurred
This raises the fundamental question: why isn't something closer to a 4x speedup possible? Is this a memory-hierarchy-bound operation?
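One way to probe that question (a sketch, assuming the same example frames as above): time a fixed batch of identical merges at increasing thread counts and see where throughput stops scaling.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

left = pd.DataFrame({'key': list(range(1, 11)) * 100000})
right = pd.DataFrame({'key': range(10), 'val': range(10)})

def merge_once(_):
    left.merge(right, how='inner')

# a fixed batch of 8 identical merges at increasing thread counts;
# near-linear scaling suggests a CPU-bound kernel, while early
# saturation points at memory bandwidth (or GIL hold time) instead
for n in (1, 2, 4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(merge_once, range(8)))
    print('%d threads: %.3f s' % (n, time.perf_counter() - start))
```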
Comment From: jreback
@jorisvandenbossche is your test with processes? or threads?
Comment From: mrocklin
```python
from pandas.util.testing import test_parallel

@test_parallel(num_threads=4)
def pg4():
    f()
```
Comment From: jorisvandenbossche
Yes, I was using the `test_parallel` decorator, so I was testing with threads.
I don't have much experience with this, but the GIL-free operations are spread throughout the merge operation (the full merge separately releases the GIL in potentially 5 or 6 different algos). Could this be a reason for overhead and less efficient use of multiple threads, i.e. less speedup?
Comment From: jbrockmendel
Not clear there is anything more we can do here.