AsOfMerge._get_join_indexers calls to_numpy() on EAs, which can be costly. _MergeOperation._get_join_indexers is a bit more forgiving, using _values_for_factorize in _factorize_keys. The latter still requires a cast to numpy, which makes this a non-starter for (hypothetical) distributed/GPU EAs.
Can we push this up into an EA method that can be potentially overridden? This might be a use case for @mroeschke 's "ExtensionManager" idea.
Comment From: MichaelTiemannOSC
As an FYI, my EA (which extends PintArray to support uncertain magnitudes, https://github.com/hgrecco/pint-pandas/pull/140) is getting hung up on a mismatch between what I understood to be the interface to _values_from_factorize and what rizer._value_from_factorize is returning. My grief comes from here (pandas/tests/extension/base/reshaping.py):
def test_merge_on_extension_array_duplicates(self, data):
# GH 23020
a, b = data[:2]
key = type(data)._from_sequence([a, b, a], dtype=data.dtype)
df1 = pd.DataFrame({"key": key, "val": [1, 2, 3]})
df2 = pd.DataFrame({"key": key, "val": [1, 2, 3]})
result = pd.merge(df1, df2, on="key")
expected = pd.DataFrame(
{
"key": key.take([0, 0, 0, 0, 1]),
"val_x": [1, 1, 3, 3, 2],
"val_y": [1, 3, 1, 3, 2],
}
)
self.assert_frame_equal(result, expected)
data
is [1.0, 2.0, 1.0]
for both df1
and df2
we get
key val
0 1.0 1
1 2.0 2
2 1.0 3
Down in the merge we have:
> /Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/reshape/merge.py(2396)_factorize_keys()
-> isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
(Pdb) lk # rk is the same
<PintArray>
[1.0, 2.0, 1.0]
Length: 3, dtype: pint[nanometer]
(Pdb)
and the obviously NONUNIQUE values here:
(Pdb) lk._values_for_factorize() # rk._values_for_factorize() is the same
(array([1.0, 2.0, 1.0], dtype=object), nan)
My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.
Clearly this code (from pandas/core/arrays/base.py) is doing nothing to unique-ify the values of the data, though the documentation snippet does refer to uniques
(which are not present in the code):
def _values_for_factorize(self) -> tuple[np.ndarray, Any]:
"""
Return an array and missing value suitable for factorization.
Returns
-------
values : ndarray
An array suitable for factorization. This should maintain order
and be a supported dtype (Float64, Int64, UInt64, String, Object).
By default, the extension array is cast to object dtype.
na_value : object
The value in `values` to consider missing. This will be treated
as NA in the factorization routines, so it will be coded as
`-1` and not included in `uniques`. By default,
``np.nan`` is used.
Notes
-----
The values returned by this method are also used in
:func:`pandas.util.hash_pandas_object`. If needed, this can be
overridden in the ``self._hash_pandas_object()`` method.
"""
return self.astype(object), np.nan
Help?
Comment From: jbrockmendel
My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.
An EA's values_for_factorize does _not need to be unique. The reference to "uniques" in the _values_for_factorize docstring is about what is returned by factorize. I agree this is confusing, and in fact am of the opinion that _values_for_factorize is a bad pattern that should go in general (xref #53501). As for things failing when you make values_for_factorize return something non-unique, let's find a dedicated place to discuss that. Is pint-pandas#140 appropriate for that?
Comment From: MichaelTiemannOSC
I got it sorted by aligning the non-unique _values_for_factorize with my EA factorize code that does the unique-ification itself. All good. I'm now passing. Please close.