Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# Create a datetime index
t = pd.date_range("2025-07-06", periods=3, freq="h")
# Left dataframe: one row per timestamp
df1 = pd.DataFrame({"time": t, "val1": [1, 2, 3]})
# Right dataframe: two rows per timestamp (duplicates)
df2 = pd.DataFrame({"time": t.repeat(2), "val2": [10, 20, 30, 40, 50, 60]})
# This works
print(pd.merge(df1, df2, on="time", how="left"))
# This fails
print(
    pd.merge(
        df1.convert_dtypes(dtype_backend="pyarrow"),
        df2.convert_dtypes(dtype_backend="pyarrow"),
        on="time",  # pyarrow datetime column causes error
        how="left",
    )
)
Issue Description
Error message:
ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements
Expected Behavior
The merge should succeed and return 6 rows, like it does when not using dtype_backend="pyarrow".
Installed Versions
Comment From: rhshadrach
Thanks for the report! This is an issue in pandas.core.indexes.base.Index._get_join_target. There we convert to NumPy for PyArrow-backed data, but do not view as i8. However, for NumPy-backed datetimes we do view as i8 in DatetimeIndexOpsMixin._get_engine_target.
@jbrockmendel - any suggested design for solving this? It seems we could either add logic specifically in _get_join_target or perhaps add/use a method on the Arrow EAs.
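For concreteness, here is a small illustration of the i8 view in question; this is a sketch of the concept only, not the engine code itself:

import numpy as np
import pandas as pd

idx = pd.date_range("2025-07-06", periods=3, freq="h")

# NumPy-backed datetimes: the join engine ends up comparing int64 nanoseconds
as_i8 = np.asarray(idx).view("i8")
print(as_i8.dtype)  # int64

# Per the description above, the PyArrow-backed path converts to NumPy in
# Index._get_join_target but skips the equivalent of this .view("i8") step,
# which is what confuses the downstream join machinery.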
Comment From: hasrat17
Hi @rhshadrach, thanks for identifying the root cause! I'd like to help with this issue. I'm happy to implement the fix in _get_join_target or via an Arrow EA method, depending on which design is preferred.
Comment From: jbrockmendel
I think you're right. In Index.join we have a try/except for self._join_monotonic. That raises because we don't cast to i8, and so it falls through to self._join_via_get_indexer, which returns a result with only 3 elements.
Patching _get_join_target fixes the OP example, but I'm confused by _join_via_get_indexer. The 3 elements it returns match what I expect a left-join to look like. Is my "join" intuition off? Or do I need more caffeine?
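Roughly the flow being described, as a simplified sketch (this paraphrases Index.join; the private calls and the exception type are approximations, not the actual source):

import pandas as pd

def join_sketch(left: pd.Index, right: pd.Index, how: str = "left"):
    try:
        # fast path: fine when _get_join_target hands back comparable (i8) values
        return left._join_monotonic(right, how=how)
    except TypeError:
        # Arrow-backed datetimes are not viewed as i8, so the fast path raises
        # and we fall through ...
        pass
    # ... to the generic path, which in the OP example yields only 3 elements
    return left._join_via_get_indexer(right, how, sort=False)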
Comment From: rhshadrach
@jbrockmendel -
The 3 elements it returns match what I expect a left-join to look like. Is my "join" intuition off? Or do I need more caffeine?
Duplicates on the right will cause there to be more rows.
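For example, a minimal illustration of this left-join fan-out (standalone, not tied to the OP data):

import pandas as pd

left = pd.DataFrame({"k": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"k": [1, 1, 2], "b": [10, 11, 20]})

# A left join keeps every left row, but each duplicated right key fans out
# into multiple result rows: 4 rows here, not 3.
print(pd.merge(left, right, on="k", how="left"))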
Comment From: jbrockmendel
My understanding is that _join_monotonic is a fast path but shouldn't actually have different behavior than _join_via_get_indexer.
Comment From: rhshadrach
@jbrockmendel - I haven't checked the history here, but my guess is that _join_via_get_indexer was only meant to be called when both self and other are unique. From
https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/core/indexes/base.py#L4435-L4450
I suspect we should use _join_non_unique in this case when _join_monotonic fails.
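That is, something along these lines (a hedged sketch of the suggested dispatch, not the code at the linked lines):

import pandas as pd

def join_dispatch_sketch(self: pd.Index, other: pd.Index, how: str = "left", sort: bool = False):
    # if either side has duplicate keys, take the non-unique join path,
    # which handles the fan-out correctly
    if not self.is_unique or not other.is_unique:
        return self._join_non_unique(other, how=how)
    # only the both-unique case should reach _join_via_get_indexer
    return self._join_via_get_indexer(other, how, sort)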
Comment From: aijams
I ran this example using the latest development version of pandas and the output was correct.
Comment From: rhshadrach
Thanks @aijams - a git bisect shows this was fixed by #62276.
@jbrockmendel - do you think my suspicion in https://github.com/pandas-dev/pandas/issues/61926#issuecomment-3138353671 is incorrect? If so, we can just mark this as needs tests.
Comment From: jbrockmendel
I suspect that you are right that only both-unique cases should go through _join_via_get_indexer. But I think in practice, cases that raise on L4443 go through it with only one side unique.
Comment From: rhshadrach
In that case, I'd suggest making
https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/core/indexes/base.py#L4447-L4448
just if not self.is_unique or not other.is_unique: and adding a test.
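A regression test could look roughly like the following (hypothetical name and placement; it just asserts the 6-row fan-out from the OP example):

import pandas as pd

def test_merge_left_pyarrow_datetime_duplicate_right_keys():
    # left merge on a pyarrow-backed datetime key, with each key duplicated
    # on the right, should fan out to 6 rows (regression test for this issue)
    t = pd.date_range("2025-07-06", periods=3, freq="h")
    df1 = pd.DataFrame({"time": t, "val1": [1, 2, 3]}).convert_dtypes(dtype_backend="pyarrow")
    df2 = pd.DataFrame(
        {"time": t.repeat(2), "val2": [10, 20, 30, 40, 50, 60]}
    ).convert_dtypes(dtype_backend="pyarrow")

    result = pd.merge(df1, df2, on="time", how="left")

    assert len(result) == 6
    assert result["val1"].tolist() == [1, 1, 2, 2, 3, 3]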
Comment From: 13muskanp
take