The same code with DataFrame.apply is >4x slower when the data is in Arrow dtypes versus NumPy.
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this issue exists on the latest version of pandas.
- [x] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
import pyarrow as pa
import time
NUM_ROWS = 500_000
df = pd.DataFrame({"A": np.arange(NUM_ROWS) % 30, "B": np.arange(NUM_ROWS)+1.0})
print(df.dtypes)
df2 = df.astype({"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.float64())})
print(df2.dtypes)
t0 = time.time()
df.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Non-Arrow time: {time.time() - t0:.2f} seconds")
t0 = time.time()
df2.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Arrow time: {time.time() - t0:.2f} seconds")
Output with pandas 2.3 on a local M1 Mac (also tested on the main branch):
A int64
B float64
dtype: object
A int64[pyarrow]
B double[pyarrow]
dtype: object
Non-Arrow time: 3.21 seconds
Arrow time: 16.66 seconds
Comment From: jbrockmendel
Any hotspots show up in profiling?
Comment From: ehsantn
Profiler output is a bit hard to read, as usual. Here are some snakeviz screenshots. The fast_xs function has different code paths for ExtensionDtype that look suspicious to me. It looks like find_common_type and _from_sequence stand out.
Comment From: jbrockmendel
That's a tough one. In core.apply.FrameColumnApply.series_generator we have a fastpath that only works with numpy dtypes.
We might be able to get some mileage for EA Dtypes by changing
for i in range(len(obj)):
    yield obj._ixs(i, axis=0)
to something like
dtype = ser.dtype
for i in range(len(obj)):
    new_vals = the_part_of_fast_xs_after_interleaved_dtype_is_called()
    new_arr = type(ser.array)._from_sequence(new_vals, dtype=dtype)
    yield obj._constructor(new_arr, name=name)
That would save 20% of the runtime by avoiding the interleaved_dtype calls.