Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
from numpy.random import default_rng
import time
rng = default_rng()
r = 1000000000
data = rng.integers(0, r, size=(r, 2))
df = pd.DataFrame(data).add_prefix("col")
t1 = time.time()
df.sum()
t2 = time.time()
print((t2-t1)*1000)
df = pd.DataFrame(data, columns=["col0", "col1"])
t1 = time.time()
df.sum()
t2 = time.time()
print((t2-t1)*1000)
Issue Description
Ref: example.
I'm creating a df from a numpy array of shape (10^9, 2) and then calling df.sum().
When I create the df with df = pd.DataFrame(data).add_prefix("col"), df.sum() takes 1502 ms.
But when I create the df with df = pd.DataFrame(data, columns=["col0", "col1"]), df.sum() takes 11979 ms! :astonished:
Why would there be such a drastic timing difference?
Expected Behavior
Both cases should take similar times
Installed Versions
Comment From: jbrockmendel
> Why would there be such a drastic timing difference?
Tentatively looks like the .add_prefix("col") is making a copy, which is changing the layout of the data in a way that makes .sum faster.
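One way to look at the layout point is to compare the contiguity flags of the values behind each DataFrame. The snippet below is a minimal sketch on a scaled-down array, not the method used in the thread. The helper name `layout` is hypothetical, and `df.to_numpy()` is only a proxy for how pandas stores the data internally. Whether `add_prefix` copies at all depends on the pandas version and on Copy-on-Write settings, so the printed flags may differ between installations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # small stand-in for the 10**9 rows used in the report
data = rng.integers(0, n, size=(n, 2))

# The two construction paths from the report.
df_named = pd.DataFrame(data, columns=["col0", "col1"])
df_prefixed = pd.DataFrame(data).add_prefix("col")


def layout(df):
    """Hypothetical helper: contiguity flags of the 2D values exposed by `df`."""
    flags = df.to_numpy().flags
    return {"C": bool(flags["C_CONTIGUOUS"]), "F": bool(flags["F_CONTIGUOUS"])}


# The input array itself is row-major (C order).
print("input array:", {"C": bool(data.flags["C_CONTIGUOUS"]),
                       "F": bool(data.flags["F_CONTIGUOUS"])})
print("columns=[...]:", layout(df_named))
print("add_prefix:", layout(df_prefixed))

# Both frames hold the same values, so the sums agree regardless of layout.
same_result = df_named.sum().equals(df_prefixed.sum())
print("same sums:", same_result)
```

If the two frames report different flags, that is consistent with the copy having rearranged the data. If they report the same flags, the installed version probably does not copy on this path.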
Comment From: nirandaperera
> Why would there be such a drastic timing difference?
>
> Tentatively looks like the .add_prefix("col") is making a copy, which is changing the layout of the data in a way that makes .sum faster.

But could that cause an 8x performance difference?
Comment From: nirandaperera
I checked np.sum(data, axis=0), and it also seems to take a similar amount of time (~12s).
And when I changed data to Fortran order with np.asfortranarray(data), the timings are consistently around 1.5s.
So, I'm guessing this is expected behavior?
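A scaled-down sketch of that check, using NumPy only, is below. The array size, the `time_sum` helper, and the repeat count are assumptions chosen so that it runs on an ordinary machine. The absolute timings and the ratio between them will vary with hardware and NumPy version, so no expected numbers are given.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000  # far smaller than the 10**9 rows in the report
data_c = rng.integers(0, n, size=(n, 2))  # row-major (C order), as in the report
data_f = np.asfortranarray(data_c)        # same values, column-major (Fortran order)


def time_sum(arr, repeats=3):
    """Hypothetical helper: best-of-`repeats` time in ms for a column-wise sum."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        arr.sum(axis=0)
        best = min(best, time.perf_counter() - start)
    return best * 1000


print(f"C order:       {time_sum(data_c):.2f} ms")
print(f"Fortran order: {time_sum(data_f):.2f} ms")

# The layout changes only how memory is traversed, not the result.
assert np.array_equal(data_c.sum(axis=0), data_f.sum(axis=0))
```

With a column-wise reduction, the Fortran-ordered array keeps each column contiguous in memory, while the C-ordered array interleaves the two columns.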
Comment From: jbrockmendel
> So, I'm guessing this is expected behavior?
Yah, I don't see what we could do differently. Open to ideas.
Comment From: mishra222shreya
The big difference in time you're seeing between the two ways of making the DataFrame might be because of how pandas does its work.
When you use add_prefix("col"), pandas first makes a DataFrame with default names for the columns (like "0" and "1"), then adds "col" to each name. This might be faster because pandas can do it in a way that's better for the computer to handle.
But when you say columns=["col0", "col1"], pandas has to do more work to set up the DataFrame. It has to line up the names you gave with the data you have. This extra work could make it slower when you do things like df.sum().