import numpy as np
import pandas as pd
import pandas._libs.join as libjoin
left = np.random.randint(0, 10**8, size=10**8)
right = np.random.randint(0, 10**8, size=10**8)
left.sort()
right.sort()
# The current usage
In [28]: %timeit result = libjoin.inner_join_indexer(left, right)
2.89 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# An alternative that _may_ be parallelizable
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc]) # 734 ns ± 16.6 ns
chunk1 = libjoin.inner_join_indexer(left[:lloc], right[:rloc])
chunk2 = libjoin.inner_join_indexer(left[lloc:], right[rloc:])
# note: the chunk2 indexers need re-offsetting into the full arrays
result = (
    np.r_[chunk1[0], chunk2[0]],
    np.r_[chunk1[1], chunk2[1] + lloc],
    np.r_[chunk1[2], chunk2[2] + rloc],
)
# totals 3.9 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The chunk1/chunk2 calls are each 1.47 s ± 40 ms and the concat is 1.05 s ± 179 ms. In a pyarrow or otherwise-chunked context the concat might not be needed.
For the non-object-dtype case, I'm pretty sure we could release the GIL in inner_join_indexer. Could we then do the chunk1/chunk2 calls in parallel? If so, could we split it further? cc @WillAyd?
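To check that the split-and-recombine idea is sound, here is a sketch using `np.intersect1d` as a stand-in for `libjoin.inner_join_indexer` (an assumption: `intersect1d` only handles unique values, while the real routine also handles duplicates). It verifies that the two chunk results, with the indexers re-offset into the full arrays, match the monolithic call:

```python
import numpy as np

def inner_join_indexer(left, right):
    # stand-in: (joined_values, left_indexer, right_indexer) for
    # sorted arrays of unique values
    return np.intersect1d(left, right, assume_unique=True,
                          return_indices=True)

rng = np.random.default_rng(0)
left = np.sort(rng.choice(10**6, size=10**4, replace=False))
right = np.sort(rng.choice(10**6, size=10**4, replace=False))

# monolithic result
expected = inner_join_indexer(left, right)

# split at the midpoint of `left`; searchsorted finds the matching
# split point in `right`
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc])
chunk1 = inner_join_indexer(left[:lloc], right[:rloc])
chunk2 = inner_join_indexer(left[lloc:], right[rloc:])

# re-offset the chunk2 indexers into the full arrays before concatenating
result = (
    np.r_[chunk1[0], chunk2[0]],
    np.r_[chunk1[1], chunk2[1] + lloc],
    np.r_[chunk1[2], chunk2[2] + rloc],
)
assert all(np.array_equal(a, b) for a, b in zip(result, expected))
```

With unique values the split point is clean: everything in `left[:lloc]` is strictly less than `left[lloc]`, so its matches all land in `right[:rloc]`. Duplicates that straddle the boundary would need extra care.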
Comment From: WillAyd
Do we know that our current GIL-releasing usage actually improves performance? AFAIK releasing the GIL allows for the use of multiple threads, but I'm not sure that helps most of our calculations, which are single-threaded and CPU-bound?
Comment From: jbrockmendel
I've wondered this myself
Comment From: WillAyd
Without opening a can of worms: if you wanted concurrency here, I think you could map the two chunks to different processes and communicate the results back to the parent process. Of course that makes the code more complicated, and there's a break-even point below which it may not be worth it, but that's my best guess on how to tackle this.
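That process-based approach might look roughly like this (a sketch, not a proposal: it again uses `np.intersect1d` as a stand-in for `libjoin.inner_join_indexer`, and it pins the fork start method so the top-level worker function is usable without a `__main__` guard, which limits it to Unix):

```python
import multiprocessing as mp
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def join_chunk(args):
    # worker: inner join one (left_chunk, right_chunk) pair
    left_chunk, right_chunk = args
    return np.intersect1d(left_chunk, right_chunk, assume_unique=True,
                          return_indices=True)

rng = np.random.default_rng(1)
left = np.sort(rng.choice(10**6, size=10**4, replace=False))
right = np.sort(rng.choice(10**6, size=10**4, replace=False))

lloc = len(left) // 2
rloc = int(right.searchsorted(left[lloc]))

# farm the two chunks out to worker processes (Unix-only fork context)
ctx = mp.get_context("fork")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
    chunk1, chunk2 = pool.map(
        join_chunk,
        [(left[:lloc], right[:rloc]), (left[lloc:], right[rloc:])],
    )

# stitch the joined values back together in the parent; the indexers
# would need re-offsetting by lloc/rloc as in the issue body
values = np.r_[chunk1[0], chunk2[0]]
```

The pickling of the chunk arrays to and from the workers is exactly the overhead that sets the break-even point.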
Comment From: TomAugspurger
https://github.com/jcrist/ptime is handy for checking if stuff is being hampered by the GIL.
> but I'm not sure that helps most of our calculations which are single-threaded and CPU-bound?
I'm not aware of anywhere in pandas that's currently multi-threaded, but I haven't followed things closely in a bit. GIL-releasing is crucial for libraries that use pandas in parallel, like Dask.
> if you wanted concurrency here I think you could map the two chunks to different processes and communicate the result back to the parent process.
Just to confirm: we're probably talking about threads here, not OS-level processes.
Is there a larger discussion around parallelizing things within pandas itself, rather than deferring to libraries like Dask? It is indeed a can of worms.
Comment From: WillAyd
Nice, thanks for the insight @TomAugspurger. Yeah, sorry, I meant to say parallelism, not concurrency.
Comment From: jbrockmendel
> Is there a larger discussion around parallelizing things within pandas itself, rather than deferring to libraries like Dask? It is indeed a can of worms.
Not really. I've been looking into what we can add to EAs to make it more appealing for dask/modin/cudf/etc to implement EAs. That's what got me looking at this chunk of code.
Comment From: WillAyd
Interesting conversation though - so do we think the current GIL-releasing stuff we have in pandas definitely helps Dask downstream? Is that tested / benchmarked there?
There are definitely some cases where we have a strange mix of Tempita / fused types that I think traces back to a lack of support for conditionally releasing the GIL for object dtypes. If someone took it up as a passion project we might be able to revisit that and simplify a good deal.
Comment From: jbrockmendel
IIRC we got rid of most of the Tempita. Most of what's left are cases with two separately-dtyped arguments that weren't necessarily the same dtype, e.g. take_1d_{{in_dtype}}_{{out_dtype}}. If there's a nice way to move that to fused types, I'd be all for it.
The conditional GIL release is, I think, mostly already using fused types; there will be a wave of additional cleanup once Cython 3 is released.
Comment From: lithomas1
xref https://github.com/pandas-dev/pandas/issues/43313 for I/O.
We should really start thinking of a keyword to control parallelism in pandas. I believe Arrow operations are parallel by default.
For Cython, I wouldn't mind if we used Cython's OpenMP capabilities, which seems like just swapping prange in for range (though it would make my life a lot harder with packaging pandas). Personally, I think it's about time we added some sort of parallel capability to pandas (not the fancy out-of-core stuff that dask does, though).
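For reference, the prange-for-range swap is roughly this (a hypothetical sketch, not pandas code; `nansum_sketch` is an invented name, and the extension must be compiled with the OpenMP flags discussed later in the thread):

```cython
# Hypothetical sketch: parallelizing a reduction by swapping prange for range.
# The loop body must be nogil-safe for this to compile.
from cython.parallel import prange

def nansum_sketch(const double[:] arr):
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in prange(arr.shape[0], nogil=True):
        total += arr[i]   # Cython infers this as a parallel reduction
    return total
```

The source change really is that small; the packaging cost is all in shipping OpenMP-enabled builds across platforms.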
Comment From: jbrockmendel
> Personally, I think it's about time we added some sort of parallel capability to pandas (not the fancy out-of-core stuff etc. that dask does though).
+1. Aside from the prange you mentioned, I don't have a clear picture of what else is available.
Comment From: jreback
there is an effort to standardize the parallel user-facing APIs across numpy, scikit-learn, and pandas (this is NASA-funded; @MarcoGorelli can hopefully point to the docs that exist)
so certainly experiment, but we should tread carefully here
generally +1 on internal parallelism where we can and where it's appropriate
Comment From: jbrockmendel
> would make my life a lot harder? with packaging pandas
@lithomas1 here's your first request for help on this topic. Trying to add prange to libalgos.nancorr, I added "-fopenmp" to extra_compile_args and extra_link_args in setup.py and am getting clang: error: unsupported option '-fopenmp' at build time on Mac. Google suggested doing brew install llvm libomp, but all I managed to do was mess up my dev environment. Suggestions?
Comment From: lithomas1
Install the conda-forge compilers package. Does that help?
I don't think conda-forge and brew play nicely with each other.
Comment From: WillAyd
@jbrockmendel macOS confusingly aliases gcc to clang, so you may think you are running the former when you are actually running the latter. It looks like clang might have a different command line argument to make that work
https://stackoverflow.com/a/33400688/621736
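The workaround in that answer boils down to passing the OpenMP flag through the preprocessor and linking libomp explicitly, since Apple clang rejects a bare -fopenmp. A hypothetical setup.py fragment (assumes libomp is installed, e.g. via brew install libomp):

```python
# Flags for Apple clang, which rejects a bare -fopenmp:
# pass the flag through the preprocessor and link libomp explicitly.
extra_compile_args = ["-Xpreprocessor", "-fopenmp"]
extra_link_args = ["-lomp"]
```

With GCC (or conda-forge's compilers) the plain -fopenmp in both lists should keep working as-is.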
Comment From: jbrockmendel
Is there a chance that with numba this might Just Work?
Comment From: MarcoGorelli
> there is an effort to standardize the parallel user facing apis across numpy, scikit-learn and pandas (this is nasa funded ; @MarcoGorelli can hopefully point to the docs that exist)
yup, here you go: https://thomasjpfan.github.io/parallelism-python-libraries-design/