Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(["hello world"] * int(5e6))
df.mean()

It concatenates all 5 million strings, which is very slow, and then fails anyway because the resulting string is not numeric. Would it make sense to check whether the data is numeric first?
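To illustrate the behaviour described above, here is a small sketch with 3 rows instead of 5 million. It shows that summing an object column of strings concatenates them, and that the failure only surfaces afterwards. `dtype=object` is spelled out as an assumption so the sketch behaves the same across pandas versions, and the exact exception type may differ between versions, so both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

# Small stand-in for the 5-million-row frame; dtype=object is made explicit.
small = pd.DataFrame(["hello world"] * 3, dtype=object)

# Summing an object column of strings uses Python's `+`, i.e. concatenation.
total = small[0].sum()
print(repr(total))

# The mean only fails after the sum has been built.
try:
    small.mean()
    mean_failed = False
except (TypeError, ValueError) as exc:
    mean_failed = True
    print(type(exc).__name__)
```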

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0f437949513225922d851e9581723d82120684a6
python : 3.8.17.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:22 PDT 2023; root:xnu-8796.121.3~7/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 56.0.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.0.3
IPython : 8.12.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.20
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

Comment From: lithomas1

Your DF is object dtype because of the strings. I think a check might be moderately expensive when an object-dtype column holds e.g. numbers, since you would need to iterate over every element to check whether it's numeric.

Also, note that the default value of numeric_only is False for mean. You might want to pass in numeric_only=True if you want to exclude the strings.

Comment From: Tialo

I am not sure that numeric_only works that way. I just checked:

import pandas as pd

df = pd.DataFrame(["asd", 1, 4])
mean = df.mean(numeric_only=True)

This code gives mean = Series([], dtype: float64). It skips columns, not individual values.

So if it does not check each value, maybe it would be inexpensive to just check the dtype of the column?
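For comparison, here is a minimal sketch of the column-level behaviour of `numeric_only=True` next to one way of skipping individual values instead. Using `pd.to_numeric(..., errors="coerce")` is an alternative suggested here for illustration, not something `mean` does on its own.

```python
import pandas as pd

df = pd.DataFrame(["asd", 1, 4])

# numeric_only=True drops the whole object column, so nothing is left.
by_column = df.mean(numeric_only=True)
print(len(by_column))  # 0

# Coercing element by element turns "asd" into NaN, which mean() skips.
coerced = pd.to_numeric(df[0], errors="coerce")
by_value = coerced.mean()
print(by_value)  # 2.5
```

Note that coercion silently discards the non-numeric entries, which may or may not be what you want.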

Comment From: lithomas1

Your DF is of object dtype though.

Anything can be in an object dtype ndarray, which is why you can't just check only the dtype here.
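A short sketch of that point: two frames with the same object dtype, one of which has a perfectly valid mean. The frames `numbers` and `mixed` are made up for illustration.

```python
import pandas as pd

# Both frames report object dtype, but their contents differ.
numbers = pd.DataFrame([1, 2, 3], dtype=object)
mixed = pd.DataFrame(["asd", 1, 4], dtype=object)

print(numbers.dtypes.iloc[0], mixed.dtypes.iloc[0])  # object object

# Object dtype alone cannot be rejected: this column averages fine.
numbers_mean = float(numbers.mean().iloc[0])
print(numbers_mean)  # 2.0
```

Rejecting every object column up front would therefore break cases like `numbers`.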

Comment From: Tialo

Oh, I see. Well, if there is a way to check whether a column contains strings, that would be great to know before trying to concatenate millions of them.
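One possible user-side check, sketched under assumptions: `pandas.api.types.infer_dtype` inspects the values of a column and returns a label such as `"string"`, `"integer"`, or `"mixed-integer"`. The helper `mean_if_numeric` below is hypothetical and not part of pandas, and the set of labels treated as numeric is a choice made for this sketch. The inference still has to look at the elements, so it is not free, but it avoids building one huge string first.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Labels from infer_dtype that this sketch treats as safe to average.
NUMERIC_KINDS = {"integer", "floating", "mixed-integer-float"}


def mean_if_numeric(column):
    """Hypothetical helper: return the mean, or None if values are not numeric."""
    kind = infer_dtype(column, skipna=True)
    if kind not in NUMERIC_KINDS:
        return None
    return column.mean()


strings = pd.Series(["hello world"] * 3)
numbers = pd.Series([1, 2, 3], dtype=object)
mixed = pd.Series(["asd", 1, 4], dtype=object)

print(infer_dtype(strings))  # string
print(infer_dtype(mixed))    # mixed-integer
print(mean_if_numeric(strings), mean_if_numeric(numbers))
```

Here `"mixed-integer"` means integers mixed with non-integer values such as strings, so it is deliberately left out of `NUMERIC_KINDS`.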