Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from typing import Literal

import pandas as pd
import pyarrow as pa

pd.options.mode.copy_on_write = True

def df_merge(
    left,
    right,
    how: Literal["left", "right", "inner", "outer", "cross"] = "inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index: bool = False,
    right_index: bool = False,
    sort: bool = False,
    suffixes=("_x", "_y"),
    copy: bool = True,
    indicator: bool = False,
    validate=None,
):
    # Round-trip each frame through Arrow unless it is already pyarrow-backed.
    if not pd.api.types.is_dtype_backend(left, "pyarrow"):
        left = pa.Table.from_pandas(left).to_pandas()
    if not pd.api.types.is_dtype_backend(right, "pyarrow"):
        right = pa.Table.from_pandas(right).to_pandas()
Issue Description
I have a Python project that loads pickle files with pandas 2.1. On x86 CentOS 7 a run takes 107 s, but on a Mac M2 it takes only 71 s. I then upgraded pandas to 2.2.3, set pd.options.mode.copy_on_write = True, and edited df_merge (the most time-consuming function) to convert the DataFrames to Arrow first. After that the run takes only 35 s on the Mac, but the same code on x86 CentOS 7 still takes 100 s. Why does Arrow have no effect there?
Expected Behavior
Using Arrow should make the run about twice as fast on x86 CentOS 7 as well, but it has no effect.
Installed Versions
2.2.3
Comment From: speco29
There can be a few reasons for that, for example:
Pandas options: Double-check the pandas options you're using. Sometimes, tweaking options like pd.options.mode.chained_assignment or pd.options.mode.use_inf_as_na can have an impact on performance.
Arrow installation: Ensure that Arrow is properly installed and configured on your CentOS 7 system. Sometimes, missing dependencies or incorrect configurations can affect performance.
Comment From: wonb168
But with the same data and code, it is faster on the Mac and has no effect on the x86 server.
Comment From: wonb168
How do I set the default backend to Arrow for DataFrames in pandas 2.2.3? And if a df is not Arrow-backed, how do I convert it?
Comment From: jbrockmendel
> And if a df is not Arrow-backed, how do I convert it?
There is no flag to change the default. You can specify dtype="int64[pyarrow]" explicitly or convert to arrow with df = df.convert_dtypes(dtype_backend="pyarrow").
To diagnose the performance difference it would be helpful to see cProfile output.
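As a rough sketch of how such output could be collected, the snippet below profiles a single merge with the standard-library cProfile and pstats modules. The two small frames are made up and only stand in for the real data; in practice the profiled call would be the df_merge call on the actual inputs.

```python
import cProfile
import io
import pstats

import pandas as pd

# Made-up frames standing in for the real data.
left = pd.DataFrame({"key": range(10_000), "a": range(10_000)})
right = pd.DataFrame({"key": range(0, 20_000, 2), "b": range(10_000)})

# Profile only the merge itself.
profiler = cProfile.Profile()
profiler.enable()
merged = left.merge(right, on="key", how="inner")
profiler.disable()

# Collect the 15 entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

Comparing the reports from the Mac and the CentOS 7 server side by side should show which calls account for the extra time.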