Feature Type
- [X] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
The tolerance our internal validation tool applies needs to depend on the metric being compared. For example, when obtaining results from an analytical database with a query like
SELECT count(distinct device_id) as device_count, avg(score) as score GROUP BY ...
We expect device_count to always be exact, but score is expected to carry small floating-point inaccuracies.
My old code ran assert_frame_equal several times on different subsets of columns, which is cumbersome and doesn't express the intent well.
I recently refactored it by extracting assert_frame_equal's implementation and adding extra arguments to support per-column customizable rtol and atol.
It would be nice if this ability were built into pandas.
Note that this overlaps a bit with feature request https://github.com/pandas-dev/pandas/issues/54861 .
Feature Description
One way is to add extra arguments to assert_frame_equal, usable like so:
assert_frame_equal(
    left,
    right,
    rtol=1e-5,
    atol=1e-8,
    rtols={'device_count': 0, 'score': 1e-6},
    atols={'device_count': 0},  # for unspecified columns, the rtol/atol argument is used as default
)
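Until something like this exists, the proposed rtols/atols behaviour can be approximated with the public testing API. The following is only a minimal sketch: the helper name assert_frame_equal_per_column is hypothetical (it is not part of pandas), it assumes unique column labels, and it skips frame-level checks such as column order tolerance that assert_frame_equal offers.

```python
import pandas as pd
from pandas.testing import assert_index_equal, assert_series_equal


def assert_frame_equal_per_column(
    left, right, rtol=1e-5, atol=1e-8, rtols=None, atols=None
):
    """Hypothetical helper (not a pandas API): compare frames column by column.

    Columns missing from ``rtols``/``atols`` fall back to ``rtol``/``atol``.
    Assumes column labels are unique.
    """
    rtols = rtols or {}
    atols = atols or {}
    # Both frames must have the same column labels in the same order.
    assert_index_equal(left.columns, right.columns)
    for name in left.columns:
        assert_series_equal(
            left[name],
            right[name],
            check_exact=False,  # always go through the tolerance-based path
            rtol=rtols.get(name, rtol),
            atol=atols.get(name, atol),
            obj=f"column {name!r}",
        )


left = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
right = pd.DataFrame({"device_count": [10, 20], "score": [0.5 + 1e-9, 0.25]})

# device_count must match exactly; score may differ within rtol=1e-6.
assert_frame_equal_per_column(
    left,
    right,
    rtols={"device_count": 0, "score": 1e-6},
    atols={"device_count": 0},
)
```

Setting both tolerances to 0 for a column makes the tolerance-based comparison behave like an exact one, which is how the sketch expresses "device_count must be accurate".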
Or the entire comparison configuration (check_exact, check_datetimelike_compat, etc.) could be overridden per-series, for example:
assert_frame_equal(
    left,
    right,
    overrides={
        'device_count': {'check_exact': True},
        'score': {'rtol': 1e-6},
    }
)
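The overrides variant can be emulated in a similar way by forwarding per-column keyword arguments to assert_series_equal. This is a rough sketch under the same assumptions as before: assert_frame_equal_with_overrides is a hypothetical name, column labels are assumed unique, and only keyword arguments that assert_series_equal actually accepts will work.

```python
import pandas as pd
from pandas.testing import assert_index_equal, assert_series_equal


def assert_frame_equal_with_overrides(left, right, overrides=None, **defaults):
    """Hypothetical helper (not a pandas API).

    ``defaults`` are keyword arguments for ``assert_series_equal`` applied to
    every column; ``overrides`` maps a column label to keyword arguments that
    take precedence for that column only.
    """
    overrides = overrides or {}
    # Both frames must have the same column labels in the same order.
    assert_index_equal(left.columns, right.columns)
    for name in left.columns:
        # Per-column settings win over the frame-wide defaults.
        kwargs = {**defaults, **overrides.get(name, {})}
        assert_series_equal(left[name], right[name], **kwargs)


left = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
right = pd.DataFrame({"device_count": [10, 20], "score": [0.5 + 1e-9, 0.25]})

assert_frame_equal_with_overrides(
    left,
    right,
    overrides={
        "device_count": {"check_exact": True},
        "score": {"rtol": 1e-6},
    },
)
```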
Alternative Solutions
With the public API, the current way to do this is something like:
for column_names, rtol in [(["device_count", ...], 0.0), (["score", ...], 1e-6), ...]:
    # select the same columns from both frames; the index is kept as-is
    left_subset = left[column_names]
    right_subset = right[column_names]
    assert_frame_equal(left_subset, right_subset, rtol=rtol)
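For reference, here is a self-contained version of that loop with concrete data. The frames, the column grouping, and the wrapper name assert_column_groups_equal are illustrative assumptions; only assert_frame_equal and its check_exact, rtol and atol parameters come from pandas.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_column_groups_equal(left, right, column_groups):
    """Illustrative wrapper: run assert_frame_equal once per column group.

    ``column_groups`` is a list of (column_names, rtol, atol) tuples.
    """
    for column_names, rtol, atol in column_groups:
        assert_frame_equal(
            left[column_names],
            right[column_names],
            check_exact=False,  # compare through rtol/atol for every group
            rtol=rtol,
            atol=atol,
        )


left = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
right = pd.DataFrame({"device_count": [10, 20], "score": [0.5 + 1e-9, 0.25]})

assert_column_groups_equal(
    left,
    right,
    [
        (["device_count"], 0.0, 0.0),  # counts must match exactly
        (["score"], 1e-6, 1e-8),       # averages may differ slightly
    ],
)
```

The grouping has to be maintained by hand, and columns left out of every group are silently not compared, which is part of why the request calls this approach cumbersome.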
Comment From: specialkapa
take
Comment From: rhshadrach
Thanks for the request!
Compared to DataFrame methods, what makes assert_frame_equal unique in that it should support by-column arguments?
It does not seem to me to be maintainable to allow by-column specific arguments across the API for DataFrame methods, and therefore we should not do so here for API consistency. The alternative solution in the OP appears to me to be the right, sustainable, approach.
Comment From: specialkapa
Hi @rhshadrach. Thanks for the comment. I have done some work on this and I think the solution I've come up with is sustainable going forward. I'm just ironing out a few details. A few tests are failing, but they appear unrelated to 'assert_frame_equal'. I should be ready to open a PR this week. Perhaps we can discuss whether the solution is compatible with the API during the PR review?
Comment From: rhshadrach
@specialkapa - without an answer to the above question, I am opposed to adding this feature. The issue I have with sustainability is not for this one particular feature, but rather having to add similar things to other methods for DataFrames.
Comment From: specialkapa
That is a good point. Thanks for getting back to me.
Comment From: mroeschke
Appears there hasn't been maintainer interest in including this feature so closing