Feature Type
- [X] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
When comparing pandas DataFrames that contain floating-point numbers, it can be extremely useful to compare with an absolute tolerance (atol), as pandas.testing.assert_frame_equal already allows.
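For context, here is a minimal sketch of the tolerance support that already exists in pandas.testing.assert_frame_equal, next to the exact behaviour of DataFrame.compare(). The two frames are made-up illustration data, not taken from a real use case.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustration data: the second row differs by 0.0005.
left = pd.DataFrame({"x": [1.0, 2.0]})
right = pd.DataFrame({"x": [1.0, 2.0005]})

# Passes: the difference is within the absolute tolerance.
assert_frame_equal(left, right, check_exact=False, rtol=0, atol=1e-3)

# Fails: the same difference exceeds a tighter tolerance.
try:
    assert_frame_equal(left, right, check_exact=False, rtol=0, atol=1e-6)
    raised = False
except AssertionError:
    raised = True

# DataFrame.compare() has no tolerance, so it always reports the row.
exact_diff = left.compare(right)
```

assert_frame_equal can only pass or raise, while compare() returns the differing cells, which is why a tolerance on compare() would be useful on its own.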
Feature Description
I propose we add an argument to the function signature of pd.DataFrame.compare() as follows:
class DataFrame(NDFrame, OpsMixin):
    def __init__(...):
        ...
    def compare(self, ..., atol: float | None = None):
        # compare numeric values within the given absolute tolerance
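One way the proposed behaviour could be approximated today is sketched below. The helper name compare_with_atol is hypothetical and not part of pandas. It assumes both frames are identically labeled and use plain NumPy-backed columns. Values in the right frame that are within atol of the left frame are overwritten with the left value, so the existing exact compare() no longer flags them.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def compare_with_atol(
    left: pd.DataFrame, right: pd.DataFrame, atol: float = 0.0, **kwargs
) -> pd.DataFrame:
    """Hypothetical helper: DataFrame.compare() that ignores small numeric gaps."""
    adjusted = right.copy()
    for col in left.columns:
        both_numeric = is_numeric_dtype(left[col]) and is_numeric_dtype(right[col])
        any_bool = is_bool_dtype(left[col]) or is_bool_dtype(right[col])
        if both_numeric and not any_bool:
            close = np.isclose(
                left[col].to_numpy(), right[col].to_numpy(), rtol=0.0, atol=atol
            )
            # Keep the right value where it differs, else copy the left value.
            adjusted[col] = right[col].where(~close, left[col])
    # Extra keyword arguments are passed through to the real compare().
    return left.compare(adjusted, **kwargs)


# Illustration data only.
left = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1.0000001, 2.5, 3.0], "b": ["x", "y", "w"]})
result = compare_with_atol(left, right, atol=1e-3)
```

Here row 0 is dropped from the result because its only difference is below the tolerance, while rows 1 and 2 are still reported.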
Alternative Solutions
This is some workaround code that works for my specific use case, but it is most definitely not general:
import numpy as np
import pandas as pd


def deep_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Compare two pandas dataframes at a deep level. This will
    return a dataframe with the differences between the two frames
    explicitly shown.

    Args:
        df1 (pd.DataFrame): The left dataframe
        df2 (pd.DataFrame): The right dataframe
        atol (float): Absolute tolerance

    Returns:
        pd.DataFrame: A dataframe with the differences between the two frames
    """
    diff_df = pd.DataFrame(index=df1.index, columns=df1.columns)
    for col in df1.columns:
        # Numeric columns are compared with a tolerance, everything else exactly.
        if check_cols_are_numeric(df1, df2, col):
            diff_df[col] = tolerance_compare(df1, df2, atol, col)
        else:
            diff_df[col] = exact_compare(df1, df2, col)
    # Keep only the rows and columns that contain at least one difference.
    diff_df = remove_rows_cols_all_na(diff_df)
    diff_columns = diff_df.columns
    right_df = df2[diff_columns]
    # Put the df2 values next to the differing df1 values.
    diff_df = diff_df.merge(
        right_df, left_index=True, right_index=True, suffixes=("_pg", "_snf")
    )
    return diff_df


def exact_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> np.ndarray:
    # Keep the df1 value where the two frames differ, NaN elsewhere.
    return np.where(df1[col] != df2[col], df1[col], np.nan)


def tolerance_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float, col: str
) -> np.ndarray:
    # Keep the df1 value where the gap exceeds atol, NaN elsewhere.
    return np.where(np.abs(df1[col] - df2[col]) > atol, df1[col], np.nan)


def remove_rows_cols_all_na(diff_df: pd.DataFrame) -> pd.DataFrame:
    diff_df = diff_df.dropna(how="all")
    diff_df = diff_df.dropna(axis=1, how="all")
    return diff_df


def check_cols_are_numeric(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> bool:
    return pd.api.types.is_numeric_dtype(
        df1[col]
    ) and pd.api.types.is_numeric_dtype(df2[col])
Additional Context
No response
Comment From: aanilpala
I'd rather use is_any_real_numeric_dtype to avoid tolerance comparison on boolean values.
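A short sketch of the difference between the two dtype checks, assuming pandas 2.0 or later (where pandas.api.types.is_any_real_numeric_dtype is available). The two Series are illustration data.

```python
import pandas as pd
from pandas.api.types import is_any_real_numeric_dtype, is_numeric_dtype

flags = pd.Series([True, False])  # boolean column
values = pd.Series([1.5, 2.5])    # float column

# is_numeric_dtype treats booleans as numeric ...
bool_is_numeric = is_numeric_dtype(flags)
# ... while is_any_real_numeric_dtype excludes them.
bool_is_real = is_any_real_numeric_dtype(flags)
float_is_real = is_any_real_numeric_dtype(values)
```

With the stricter check, a boolean column would fall through to the exact comparison instead of being subtracted and compared against atol.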
Comment From: tomhoq
@mroeschke Hi! I would love to work on this enhancement. Would it be OK to start working on it even though it has not been reviewed yet? Also, if someone could review it in the meantime, I would appreciate it.
Thank you!
Comment From: mroeschke
I would say any issue that has not been triaged yet should not be worked on until a core team member has reviewed the issue.
Comment From: jlefeaux
Floating-point difference tolerance would be nice in DataFrame.eq() and DataFrame.equals() as well as DataFrame.compare().
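A rough sketch of what tolerant versions of eq() and equals() might look like for all-numeric frames. The names eq_with_atol and equals_with_atol are hypothetical and not part of pandas. The NaN handling is an assumption chosen to mirror the existing methods: eq() treats NaN as unequal to NaN, while equals() treats NaNs in the same location as equal.

```python
import numpy as np
import pandas as pd


def eq_with_atol(left: pd.DataFrame, right: pd.DataFrame, atol: float) -> pd.DataFrame:
    """Hypothetical element-wise comparison within an absolute tolerance."""
    close = np.isclose(
        left.to_numpy(dtype=float),
        right.to_numpy(dtype=float),
        rtol=0.0,
        atol=atol,
        equal_nan=False,  # like DataFrame.eq, NaN never equals NaN
    )
    return pd.DataFrame(close, index=left.index, columns=left.columns)


def equals_with_atol(left: pd.DataFrame, right: pd.DataFrame, atol: float) -> bool:
    """Hypothetical whole-frame comparison within an absolute tolerance."""
    if not (left.index.equals(right.index) and left.columns.equals(right.columns)):
        return False
    close = np.isclose(
        left.to_numpy(dtype=float),
        right.to_numpy(dtype=float),
        rtol=0.0,
        atol=atol,
        equal_nan=True,  # like DataFrame.equals, matching NaNs count as equal
    )
    return bool(close.all())


# Illustration data only.
left = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})
right = pd.DataFrame({"a": [1.0004, 2.0], "b": [np.nan, 4.1]})

mask = eq_with_atol(left, right, atol=1e-3)
tight = equals_with_atol(left, right, atol=1e-3)
loose = equals_with_atol(left, right, atol=0.2)
```

Both helpers assume identically shaped frames whose values can be cast to float; mixed-dtype frames would need the per-column dispatch discussed earlier in the thread.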