Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
s2.index = s2.index.astype('string')  # s1 keeps its original index dtype
s1 < s2  # fails
s1, s2 = s1.align(s2)
s1 < s2  # also fails
s1 = s1.reindex(s2.index)
s1 < s2  # succeeds
Issue Description
When two Series (or DataFrames) with otherwise identical indexes are compared, but one index is technically dtype(object) and the other dtype(string), element-wise comparison fails. In the debugger, it looks like the ExtensionArray method StringArray.equals returns False when comparing to a Python list of strings, causing Series._indexed_same to return False.
Expected Behavior
Ideally the string and object dtypes would be comparable. This in-between state for pandas dtypes has been quite awkward, with some libraries porting over to numpy-nullable / pyarrow dtype backends while the pandas library defaults do not use them yet.
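Until the two dtypes compare as equal, one possible workaround is to cast both indexes to a common dtype before comparing. The sketch below is not from the report: the helper name `lt_with_object_index` is hypothetical, and it assumes the labels are plain Python strings, so that casting to object loses no information.

```python
import pandas as pd


def lt_with_object_index(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: run `left < right` after casting both indexes to object."""
    # Work on copies so the callers' Series keep their original index dtypes.
    left = left.copy()
    right = right.copy()
    left.index = left.index.astype(object)
    right.index = right.index.astype(object)
    return left < right


s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 5, 6], index=["a", "b", "c"])
s2.index = s2.index.astype("string")

result = lt_with_object_index(s1, s2)
print(result)
```

This only sidesteps the dtype mismatch on the caller's side; it does not change how pandas itself decides whether two indexes are the same.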
Installed Versions
Comment From: sanggon6107
Hi @wahsmail,
I think this should work, since the Index.equals() docstring states that dtype is not compared.
https://github.com/pandas-dev/pandas/blob/5d9cf431f7b774a6724b1dd4c5e6f6fe95647aff/pandas/core/indexes/base.py#L5453-L5463
I also confirmed that the comparison doesn't raise when Index.equals() inside Series._indexed_same() returns True.
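A quick way to see what Index.equals() reports for the two dtypes on a given installation is sketched below. The variable names are illustrative only, and no expected output is given for the cross-dtype checks because the result may differ between pandas versions.

```python
import pandas as pd

# Two indexes with the same labels: one object dtype, one string dtype.
obj_idx = pd.Index(["a", "b", "c"], dtype=object)
str_idx = obj_idx.astype("string")
print(obj_idx.dtype, str_idx.dtype)

# Baseline: two separately built object-dtype indexes with equal labels.
same_dtype_equal = obj_idx.equals(pd.Index(["a", "b", "c"], dtype=object))

# The checks relevant to this issue, in both directions.
cross_dtype_equal = obj_idx.equals(str_idx)
reverse_equal = str_idx.equals(obj_idx)

print("object vs object:", same_dtype_equal)
print("object vs string:", cross_dtype_equal)
print("string vs object:", reverse_equal)
```

If either cross-dtype check prints False, Series._indexed_same would presumably also return False for Series built on these indexes, which matches the behavior described in the report.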
Comment From: sanggon6107
take
Comment From: MayurKishorKumar
take
Comment From: MayurKishorKumar
Hi @rhshadrach 👋
I’m working on fixing [https://github.com/pandas-dev/pandas/issues/61099] and ran into a failure in test_mixed_col_index_dtype.
My fix updates Index.equals so that StringDtype and object dtypes are treated as equivalent when comparing column indexes. As a result, this test now fails because result.columns.dtype becomes "string" while expected.columns.dtype remains object.
There are two options I’m considering:
- Update the test to explicitly cast expected.columns to "string" when using_infer_string=True, so it reflects the result.
- Adjust internal logic so the result stays object, but that might go against the spirit of treating string/object as equal.
Would updating the test be acceptable in this case?
Thanks!
Comment From: rhshadrach
@MayurKishorKumar - in that test I'm seeing that when using_infer_string=True, the expected is being explicitly cast to non-object.
https://github.com/pandas-dev/pandas/blob/5736b9647068d31fdf8673d3528cb64e35060bac/pandas/tests/frame/test_arithmetic.py#L2193-L2200
So I don't see how expected.columns.dtype remains object. It might be helpful to put up your PR as a draft.
Comment From: sanggon6107
Hi @MayurKishorKumar, are you still working on this? I would like to contribute if you're not.