Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(["hello world"] * int(5e6))
df.mean()
It concatenates millions of strings, which is far too slow. Then it fails anyway, because the resulting string is not numeric. Maybe it would be better to check whether the data is numeric first?
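The behaviour described here can be seen on a much smaller input. This is a minimal sketch, assuming pandas 2.x semantics for object-dtype columns; the three-element Series is a hypothetical stand-in for the 5e6-row frame.

```python
import pandas as pd

# Small stand-in for the large frame in the report; object dtype is set explicitly.
s = pd.Series(["ab", "cd", "ef"], dtype=object)

# Summing an object column of strings applies "+", i.e. string concatenation.
total = s.sum()

# Assumption: mean() on such a column raises TypeError, because the
# values cannot be converted to a number.
try:
    s.mean()
    mean_failed = False
except TypeError:
    mean_failed = True
```

With millions of rows the same concatenation builds one very large string before the error surfaces, which would explain the slowdown reported here.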
Installed Versions
Prior Performance
No response
Comment From: lithomas1
Your DF is object dtype because of the strings. I think a check might be moderately expensive in the case where you have an object dtype column of e.g. numbers (since you would need to iterate over every element to check if it's numeric).
Also, note that the default value of numeric_only is False for mean. You might want to pass in numeric_only=True if you want to exclude the strings.
Comment From: Tialo
I am not sure that numeric_only will work that way. Just checked:
import pandas as pd
df = pd.DataFrame(["asd", 1, 4])
mean = df.mean(numeric_only=True)
This code gives mean = Series([], dtype: float64). It skips whole columns, not individual values.
So if it does not check each value, maybe it would be inexpensive to just check the dtype of a column?
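To illustrate the column-level filtering described above, here is a small sketch. The two-column frame and its column names are hypothetical and only there for contrast with the single object column.

```python
import pandas as pd

# One object column holding mixed values, as in the example above.
mixed = pd.DataFrame(["asd", 1, 4])
empty = mixed.mean(numeric_only=True)  # the whole column is dropped

# Hypothetical two-column frame: only the integer column is kept.
df = pd.DataFrame({"num": [1, 2, 3], "text": ["a", "b", "c"]})
means = df.mean(numeric_only=True)
```

In the second frame, `means` has a single entry for `num`. The `text` column is excluded as a whole, not value by value.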
Comment From: lithomas1
Your DF is of object dtype though.
Anything can be in an object dtype ndarray, which is why you can't just check only the dtype here.
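A short sketch of that point: two Series with the same object dtype, only one of which can be averaged. The variable names are illustrative, and the mean on the object column of integers assumes current pandas behaviour.

```python
import pandas as pd

# Plain integers stored as Python objects: object dtype, but still averageable.
numbers_as_objects = pd.Series([1, 2, 3], dtype=object)

# Strings mixed with integers: also object dtype.
mixed = pd.Series(["asd", 1, 4])

# The dtype alone cannot tell these two apart.
same_dtype = numbers_as_objects.dtype == mixed.dtype == object

# The all-integer object column still has a well-defined mean.
mean_of_objects = numbers_as_objects.mean()
```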
Comment From: Tialo
Oh, I see. Well, if there is a way to check whether a column contains strings, that would be great to know before trying to concatenate millions of them.
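One possible way to do such a check from user code is `pandas.api.types.infer_dtype`, sketched below. The helper `object_column_is_numeric` and the set `NUMERIC_KINDS` are hypothetical names, not part of pandas. Note that `infer_dtype` inspects the values themselves, so it pays the per-element cost mentioned earlier in the thread, and nothing here implies that pandas does this internally.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical helper, not part of pandas: labels treated as numeric here.
NUMERIC_KINDS = {"integer", "floating", "mixed-integer-float"}


def object_column_is_numeric(col: pd.Series) -> bool:
    """Return True if the values in `col` are inferred to be numeric."""
    return infer_dtype(col, skipna=True) in NUMERIC_KINDS


strings = pd.Series(["hello world"] * 5)
mixed = pd.Series(["asd", 1, 4])
numbers = pd.Series([1, 2, 3], dtype=object)

strings_kind = infer_dtype(strings)  # "string"
mixed_kind = infer_dtype(mixed)      # "mixed-integer"
```

A caller could run `object_column_is_numeric` on each object column before calling `mean()` and skip or convert the columns that fail the check.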