Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Hi, it seems some interaction between copy-on-write and .attrs data leads to extremely slow performance, at least with custom aggregations. In the code below, the timed aggregations all perform identically in v2.1. But in v2.2, the last one, with custom .attrs data and copy-on-write enabled, is about 10x slower. With my original dataset, which I cannot share but which is simply larger in both dimensions, the result was even more extreme: almost 50x slower (from less than a second to 40s).
import numpy as np
import pandas as pd
from sklearn import datasets
from pandas import options as pdopt

print(f"{pd.__version__=}")

# Load a reasonably large dataset and add a random grouping column
X, y = datasets.fetch_covtype(return_X_y=True, as_frame=True)
X["group"] = np.random.choice(range(20_000), size=len(X))

print("\nExecution times with and without metadata before setting copy_on_write to 'warn'")
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))

# Attach nested metadata via .attrs and time the same aggregation again
X.attrs["metadata"] = {col: {"hello": {"world": "foobar"}} for col in X.columns}
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))

# Repeat both timings with copy-on-write enabled, on a fresh copy of the data
pdopt.mode.copy_on_write = True  # "warn"
X, y = datasets.fetch_covtype(return_X_y=True, as_frame=True)
X["group"] = np.random.choice(range(20_000), size=len(X))

print("\nExecution times with and without metadata after setting copy_on_write to 'warn'")
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))
X.attrs["metadata"] = {col: {"hello": {"world": "foobar"}} for col in X.columns}
%timeit -n1 -r1 X.groupby("group").Elevation.apply(lambda ser: (ser >= 3000).sum() / len(ser))
The output:
pd.__version__='2.2.3'
Execution times with and without metadata before setting copy_on_write to 'warn'
661 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
667 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Execution times with and without metadata after setting copy_on_write to 'warn'
671 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.22 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Installed Versions
Prior Performance
pd.__version__='2.1.4'
Execution times with and without metadata before setting copy_on_write to 'warn'
703 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
695 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Execution times with and without metadata after setting copy_on_write to 'warn'
694 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
691 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Comment From: buhrmann
On a related note, reading the docs for 2.2 I was under the impression that copy_on_write = "warn" would only warn about certain cases, not actually enable copy-on-write mode, which seems to be what's happening here. If so, perhaps the docs could make that clearer...
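For reference, here is how I understand the three possible settings (a sketch based on my reading of the 2.2 docs; the "warn" behavior is exactly the part I'm unsure about):

import pandas as pd

# Default in 2.x: copy-on-write disabled
pd.options.mode.copy_on_write = False

# Opt in to copy-on-write behavior outright
pd.options.mode.copy_on_write = True

# Per the docs, "warn" should emit warnings for operations whose behavior
# will change under copy-on-write; whether it also switches the actual
# behavior is what's unclear to me
pd.options.mode.copy_on_write = "warn"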
Comment From: buhrmann
Hm, the problem even seems to occur in some cases in v2.2 with copy_on_write=False, though I haven't managed to create a minimal reproducible example yet. For now, the only safe option seems to be to stick to pandas <2.2.
Comment From: rhshadrach
Thanks for the report. The issue here is that (ser >= 3000).sum() / len(ser) needs to copy the attrs data for every group. I don't think there is a way around this. The solution to the performance issue is to not use apply:
%timeit (X["Elevation"] >= 3000).groupby(X["group"]).mean()
# 7.99 ms ± 65.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
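If you do need apply, one possible workaround (an untested sketch on my part) is to temporarily clear attrs around the groupby, so nothing has to be deep-copied per group:

# Stash the metadata, run the apply without it, then restore it
attrs_backup = X.attrs
X.attrs = {}
try:
    result = X.groupby("group").Elevation.apply(
        lambda ser: (ser >= 3000).sum() / len(ser)
    )
finally:
    X.attrs = attrs_backup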
Does this solve your issue?
Comment From: timhoffm
I haven't looked into the details, but note that since 2.2 (https://github.com/pandas-dev/pandas/pull/55314) attrs are always deep-copied to prevent accidental data sharing (motivation: safety over performance). It should be fast if attrs is just a small dict with a handful of metadata. If performance is critical and you have a lot of context data, attrs is likely not suited and you should manage that state separately.
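For example (just a sketch, the names are illustrative), keep the context in a plain dict next to the frame instead of attaching it:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Heavyweight context lives outside the frame, so pandas never has to
# deep-copy it when propagating attrs through operations
metadata = {col: {"hello": {"world": "foobar"}} for col in df.columns}

# df.attrs stays empty (or holds only small, cheap-to-copy entries)
assert df.attrs == {}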
Comment From: mroeschke
Seems like there's nothing left to do here, so closing.