The performance of array reductions in nanops/bottleneck can be significantly improved for large data using numba. The improvements are due to two factors (a minimal sketch of both follows this list):

- single-pass algorithms when null values are present, avoiding any copies;
- multi-threading over chunks of the array, or over an axis in a single-axis reduction.
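For illustration, here is a minimal sketch of the idea, not the actual implementation on the branch: a single numba-compiled pass skips NaNs inline (no mask materialisation, no copy), and `prange` parallelises the loop across threads. The function name is illustrative only:

```python
import numba
import numpy as np


@numba.njit(parallel=True)
def nansum_single_pass(arr):
    # One pass over the data: skip NaNs as we go instead of building
    # an intermediate mask or a copy of the valid values.
    total = 0.0
    for i in numba.prange(arr.shape[0]):
        x = arr[i]
        if not np.isnan(x):
            total += x  # numba recognises this as a parallel reduction
    return total


arr = np.random.default_rng(0).normal(size=10_000_000)
arr[::7] = np.nan
assert np.isclose(nansum_single_pass(arr), np.nansum(arr))
```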

This screenshot demonstrates a potential 4x improvement on a DataFrame of 10 million rows and 5 columns of various types.

[screenshot of benchmark results]
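For reference, a sketch of a comparable setup for anyone wanting to reproduce the comparison (the column mix and null pattern here are illustrative, not the exact ones in the screenshot):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000_000
df = pd.DataFrame(
    {
        "f64": rng.normal(size=n),
        "f32": rng.normal(size=n).astype("float32"),
        "i64": rng.integers(0, 100, size=n),
        "i32": rng.integers(0, 100, size=n, dtype="int32"),
        "flag": rng.random(size=n) > 0.5,
    }
)
df.loc[::11, "f64"] = np.nan  # sprinkle in some nulls

# Compare on each branch, e.g. in IPython:
# %timeit df.sum()
# %timeit df.mean()
```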

I am running the code on a feature branch, and all unit tests for the feature branch are passing locally: https://github.com/eoincondron/pandas/tree/nanops-numba-implementation

The hardware is a new MacBook Pro with 8 cores.

The performance is still slightly better at 1 million rows, and the gains are even greater at larger scales (8x at 100 million rows). The caveat is that all JIT compilation has already been completed. I have carried out a more comprehensive performance comparison and these results hold up.
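This caveat is ordinary numba behaviour rather than anything specific to the branch: each function is compiled on its first call per signature, so a warm-up call on a small slice keeps the compilation cost out of the timings. For example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).normal(size=(1_000_000, 5)))

# The first call per dtype/signature pays the JIT compilation cost ...
df.head(1000).sum()

# ... so only subsequent calls reflect steady-state performance:
# %timeit df.sum()
```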

As with bottleneck, these codepaths can be toggled on and off.
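For example (`compute.use_bottleneck` is an existing pandas option; the option name shown for the numba paths is illustrative only, not taken from the branch):

```python
import pandas as pd

# Existing pandas option controlling the bottleneck codepaths:
pd.set_option("compute.use_bottleneck", False)

# The numba codepaths would get an analogous switch; this option
# name is hypothetical:
# pd.set_option("compute.use_numba", True)
```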