Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from numpy.random import default_rng
import time

rng = default_rng()

r = 1000000000
data = rng.integers(0, r, size=(r, 2))

# Case 1: default integer labels, then renamed via add_prefix
df = pd.DataFrame(data).add_prefix("col")

t1 = time.time()
df.sum()
t2 = time.time()
print((t2-t1)*1000)

# Case 2: explicit column labels passed to the constructor
df = pd.DataFrame(data, columns=["col0", "col1"])
t1 = time.time()
df.sum()
t2 = time.time()
print((t2-t1)*1000)

Issue Description

Ref: the example above. I'm creating a DataFrame from a NumPy array of shape (10^9, 2) and then calling df.sum(). When I create the DataFrame with df = pd.DataFrame(data).add_prefix("col"), df.sum() takes 1502 ms. But when I create it with df = pd.DataFrame(data, columns=["col0", "col1"]), the same df.sum() takes 11979 ms! :astonished:

Why would there be such a drastic timing difference?

Expected Behavior

Both cases should take a similar amount of time.

Installed Versions

> pd.show_versions()
Traceback (most recent call last):
  File "", line 1, in
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/pandas/util/_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/pandas/util/_print_versions.py", line 88, in _get_dependency_info
    mod = import_optional_dependency(modname, errors="ignore")
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/pandas/compat/_optional.py", line 126, in import_optional_dependency
    module = importlib.import_module(name)
  File "******/.conda/envs/cylon_dev/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 1014, in _gcd_import
  File "", line 991, in _find_and_load
  File "", line 975, in _find_and_load_unlocked
  File "", line 671, in _load_unlocked
  File "", line 843, in exec_module
  File "", line 219, in _call_with_frames_removed
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/setuptools/__init__.py", line 8, in
    import _distutils_hack.override  # noqa: F401
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/_distutils_hack/override.py", line 1, in
    __import__('_distutils_hack').do_override()
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/_distutils_hack/__init__.py", line 71, in do_override
    ensure_local_distutils()
  File "******/.conda/envs/cylon_dev/lib/python3.8/site-packages/_distutils_hack/__init__.py", line 59, in ensure_local_distutils
    assert '_distutils' in core.__file__, core.__file__
AssertionError: ******/.conda/envs/cylon_dev/lib/python3.8/distutils/core.py

> pd.__version__
'1.4.0'

Comment From: jbrockmendel

Why would there be such a drastic timing difference?

Tentatively looks like the .add_prefix("col") is making a copy, which is changing the layout of the data in a way that makes .sum faster.
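
A quick way to sanity-check that hypothesis (a sketch, not a definitive probe: it assumes a homogeneous-dtype frame so that .values is a view of the single block, and it reflects pandas 1.4 behavior as in the report; versions with copy-on-write may behave differently):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(1_000_000, 2))  # same shape, smaller than the report

df_view = pd.DataFrame(data, columns=["col0", "col1"])
df_copy = pd.DataFrame(data).add_prefix("col")

# If construction kept a view, the frame's values share memory with `data`.
print(np.shares_memory(data, df_view.values))  # expected: True  (no copy)
print(np.shares_memory(data, df_copy.values))  # expected: False (add_prefix copied)

# The copy ends up column-contiguous (Fortran order for the 2-D array),
# which is the access pattern a column-wise sum wants.
print(df_view.values.flags["F_CONTIGUOUS"])  # expected: False
print(df_copy.values.flags["F_CONTIGUOUS"])  # expected: True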

Comment From: nirandaperera

Why would there be such a drastic timing difference?

Tentatively looks like the .add_prefix("col") is making a copy, which is changing the layout of the data in a way that makes .sum faster.

But could that cause an 8x performance difference?

Comment From: nirandaperera

I checked np.sum(data, axis=0), and it also seems to take a similar amount of time (~12 s). And when I change data to Fortran order with np.asfortranarray(data), the timing is consistently around 1.5 s. So I'm guessing this is expected behavior?
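
For reference, the layout effect reproduces in pure NumPy at a more manageable size (a sketch; the array size here is illustrative, not the reporter's 10^9 rows):

import time
import numpy as np

rng = np.random.default_rng(0)
c_order = rng.integers(0, 10**7, size=(10**7, 2))  # C order: each row contiguous
f_order = np.asfortranarray(c_order)               # F order: each column contiguous

for name, arr in [("C order", c_order), ("F order", f_order)]:
    t0 = time.time()
    arr.sum(axis=0)  # column-wise reduction, the same pattern as df.sum()
    print(name, round((time.time() - t0) * 1000, 1), "ms")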

Comment From: jbrockmendel

So, I'm guessing this is expected behavior?

Yah, I don't see what we could do differently. Open to ideas.

Comment From: mishra222shreya

The big difference in time you're seeing between the two ways of making the DataFrame comes down to how the data ends up laid out in memory, not the column names themselves.

When you use add_prefix("col"), pandas first makes a DataFrame with default column labels (0 and 1), and then the rename makes a copy of the underlying data. The copy is laid out so that each column's values sit next to each other in memory, which is exactly the access pattern df.sum() wants.

But when you say columns=["col0", "col1"], pandas can keep your C-ordered array as-is, without copying. Summing down the columns of a C-ordered array means striding across rows in memory, and that extra memory traffic is what makes df.sum() slower.
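
Building on the np.asfortranarray observation earlier in the thread, one way to get the fast path without relying on add_prefix making a copy is to hand pandas a column-contiguous array up front (a sketch; whether the constructor keeps it as a view depends on the pandas version):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.integers(0, 10**7, size=(10**7, 2))

# Convert once, up front: each column becomes contiguous in memory, so
# column-wise reductions stream through memory instead of striding.
df = pd.DataFrame(np.asfortranarray(data), columns=["col0", "col1"])
df.sum()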