Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# Create a datetime index
t = pd.date_range("2025-07-06", periods=3, freq="h")
# Left dataframe: one row per timestamp
df1 = pd.DataFrame({"time": t, "val1": [1, 2, 3]})
# Right dataframe: two rows per timestamp (duplicates)
df2 = pd.DataFrame({"time": t.repeat(2), "val2": [10, 20, 30, 40, 50, 60]})
# This works
print(pd.merge(df1, df2, on="time", how="left"))
# This fails
print(
    pd.merge(
        df1.convert_dtypes(dtype_backend="pyarrow"),
        df2.convert_dtypes(dtype_backend="pyarrow"),
        on="time",  # pyarrow datetime column causes error
        how="left",
    )
)
Issue Description
Error message:
ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements
Expected Behavior
The merge should succeed and return 6 rows, like it does when not using dtype_backend="pyarrow".
Installed Versions
Comment From: rhshadrach
Thanks for the report! This is an issue in pandas.core.indexes.base.Index._get_join_target. There we convert to NumPy for PyArrow-backed data, but do not view as i8. However, for NumPy-backed datetimes we do view as i8 in DatetimeIndexOpsMixin._get_engine_target.
@jbrockmendel - any suggested design for solving this? It seems we could either add logic specifically in _get_join_target or perhaps add/use a method on the Arrow EAs.
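For concreteness, here is a small illustration of the i8 view in question; this is a sketch of the concept only, not the engine code itself:

import numpy as np
import pandas as pd

idx = pd.date_range("2025-07-06", periods=3, freq="h")

# NumPy-backed datetimes: the join engine ends up comparing int64 nanoseconds
as_i8 = np.asarray(idx).view("i8")
print(as_i8.dtype)  # int64

# Per the description above, the PyArrow-backed path converts to NumPy in
# Index._get_join_target but skips the equivalent of this .view("i8") step,
# which is what confuses the downstream join machinery.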
Comment From: hasrat17
Hi @rhshadrach, thanks for identifying the root cause! I'd like to help with this issue. I'm happy to implement the fix in _get_join_target or via an Arrow EA method, depending on which design is preferred.
Comment From: jbrockmendel
I think you're right. In Index.join we have a try/except for self._join_monotonic. That raises because we don't cast to i8, and so it falls through to self._join_via_get_indexer, which returns a result with only 3 elements.
Patching _get_join_target fixes the OP example, but I'm confused by _join_via_get_indexer. The 3 elements it returns match what I expect a left-join to look like. Is my "join" intuition off? Or do I need more caffeine?
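Roughly the flow being described, as a simplified sketch (this paraphrases Index.join; the private calls and the exception type are approximations, not the actual source):

import pandas as pd

def join_sketch(left: pd.Index, right: pd.Index, how: str = "left"):
    try:
        # fast path: fine when _get_join_target hands back comparable (i8) values
        return left._join_monotonic(right, how=how)
    except TypeError:
        # Arrow-backed datetimes are not viewed as i8, so the fast path raises
        # and we fall through ...
        pass
    # ... to the generic path, which in the OP example yields only 3 elements
    return left._join_via_get_indexer(right, how, sort=False)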
Comment From: rhshadrach
@jbrockmendel -
The 3 elements it returns match what I expect a left-join to look like. Is my "join" intuition off? Or do I need more caffeine?
Duplicates on the right will cause there to be more rows.
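For example, a minimal illustration of this left-join fan-out (standalone, not tied to the OP data):

import pandas as pd

left = pd.DataFrame({"k": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"k": [1, 1, 2], "b": [10, 11, 20]})

# A left join keeps every left row, but each duplicated right key fans out
# into multiple result rows: 4 rows here, not 3.
print(pd.merge(left, right, on="k", how="left"))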
Comment From: jbrockmendel
My understanding is that _join_monotonic is a fast path but shouldn't actually have different behavior than _join_via_get_indexer.
Comment From: rhshadrach
@jbrockmendel - I haven't checked the history here, but my guess is that _join_via_get_indexer was only meant to be called when both self and other are unique. From
https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/core/indexes/base.py#L4435-L4450
I suspect we should use _join_non_unique in this case when _join_monotonic fails.
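That is, something along these lines (a hedged sketch of the suggested dispatch, not the code at the linked lines):

import pandas as pd

def join_dispatch_sketch(self: pd.Index, other: pd.Index, how: str = "left", sort: bool = False):
    # if either side has duplicate keys, take the non-unique join path,
    # which handles the fan-out correctly
    if not self.is_unique or not other.is_unique:
        return self._join_non_unique(other, how=how)
    # only the both-unique case should reach _join_via_get_indexer
    return self._join_via_get_indexer(other, how, sort)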
Comment From: aijams
I ran this example using the latest development version of pandas and the output was correct.
Comment From: rhshadrach
Thanks @aijams - a git bisect shows this was fixed by #62276.
@jbrockmendel - do you think my suspicion in https://github.com/pandas-dev/pandas/issues/61926#issuecomment-3138353671 is incorrect? If so, we can just mark this as needs tests.
Comment From: jbrockmendel
I suspect that you are right that only both-unique cases should go through _join_via_get_indexer. But I think in practice, cases that raise on L4443 go through it with only one side unique.
Comment From: rhshadrach
In that case, I'd suggest making
https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/core/indexes/base.py#L4447-L4448
just if not self.is_unique or not other.is_unique: and adding a test.
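A regression test could look roughly like the following (hypothetical name and placement; it just asserts the 6-row fan-out from the OP example):

import pandas as pd

def test_merge_left_pyarrow_datetime_duplicate_right_keys():
    # left merge on a pyarrow-backed datetime key, with each key duplicated
    # on the right, should fan out to 6 rows (regression test for this issue)
    t = pd.date_range("2025-07-06", periods=3, freq="h")
    df1 = pd.DataFrame({"time": t, "val1": [1, 2, 3]}).convert_dtypes(dtype_backend="pyarrow")
    df2 = pd.DataFrame(
        {"time": t.repeat(2), "val2": [10, 20, 30, 40, 50, 60]}
    ).convert_dtypes(dtype_backend="pyarrow")

    result = pd.merge(df1, df2, on="time", how="left")

    assert len(result) == 6
    assert result["val1"].tolist() == [1, 1, 2, 2, 3, 3]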
Comment From: 13muskanp
take