### Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
### Reproducible Example
```python
import io
import pandas as pd

buf = io.StringIO("date,value\n2024-01-01 00:00:00,1\n2024-02-01 00:00:00,2")
df = pd.read_csv(buf, parse_dates=["date"])
df.set_index("date").loc["2024-01"]  # works

buf = io.StringIO("date,value\n2024-01-01 00:00:00,1\n2024-02-01 00:00:00,2")
df = pd.read_csv(buf, parse_dates=["date"], dtype_backend="pyarrow", engine="pyarrow")
df.set_index("date").loc["2024-01"]  # KeyError
```
### Issue Description
The pyarrow timestamp type gets put into a generic `Index` when assigned via `set_index`, so the datetime overloads (such as partial-string `.loc` lookups) do not work correctly.
### Expected Behavior
The pyarrow timestamp type should be wrapped by a DatetimeIndex
### Installed Versions
3.0.0.dev0+1696.gfae3e8034f
**Comment From: WillAyd**
I think this is another one to keep track of for PDEP-13 https://github.com/pandas-dev/pandas/pull/58455
**Comment From: AbhishekChaudharii**
take
**Comment From: robert-schmidtke**
Hi, I see that #58455 was closed but this is still open. What's the status of this or are there any recommended workarounds?
**Comment From: show981111**
@AbhishekChaudharii @WillAyd
Are you still working on this? If not I would love to take this issue.
**Comment From: show981111**
take
**Comment From: show981111**
I tried the example and documented some findings here. I just want to make sure we are aligned on the direction. Let me know what you think, @WillAyd.
## The issue
When we call `set_index` on a pyarrow timestamp column, it creates a regular `Index` instead of a `DatetimeIndex`.
For example, if I do
```python
import io
import pandas as pd

buf = io.StringIO("date,value\n2024-01-01 00:00:00,1\n2024-02-01 00:00:00,2")
df = pd.read_csv(buf, parse_dates=["date"])
res = df.set_index("date")
print(res.index)
# prints DatetimeIndex(['2024-01-01', '2024-02-01'], dtype='datetime64[s]', name='date', freq=None)
```
However, if I use pyarrow,
```python
buf = io.StringIO("date,value\n2024-01-01 00:00:00,1\n2024-02-01 00:00:00,2")
df = pd.read_csv(buf, parse_dates=["date"], dtype_backend="pyarrow", engine="pyarrow")
res = df.set_index("date")
print(res.index)
# prints Index([2024-01-01 00:00:00, 2024-02-01 00:00:00], dtype='timestamp[s][pyarrow]', name='date')
```
Therefore, if I do `loc` with `2024-01`, since it is a regular index, it doesn't perform the range search like it does for `DatetimeIndex`.
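To make "range search" concrete, here is a minimal sketch on the default (NumPy-backed) path only. It assumes the same two-row CSV as above; the names `partial`, `explicit`, `start`, and `end` are just illustrative. On a `DatetimeIndex`, a month string behaves like a half-open time range rather than an exact label lookup:

```python
import io

import pandas as pd

buf = io.StringIO("date,value\n2024-01-01 00:00:00,1\n2024-02-01 00:00:00,2")
df = pd.read_csv(buf, parse_dates=["date"]).set_index("date")

# Partial-string indexing: the month string selects every row in January 2024.
partial = df.loc["2024-01"]

# The same selection written as an explicit half-open range [start, end).
start = pd.Timestamp("2024-01-01")
end = pd.Timestamp("2024-02-01")
explicit = df[(df.index >= start) & (df.index < end)]

# Both selections should contain the same single January row.
print(partial.equals(explicit))
```

A generic `Index` has no such string-to-range resolution, so the same `"2024-01"` key is treated as a plain label, which would explain the `KeyError`.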
## Problem
- The first issue is that when constructing the index, it goes here https://github.com/pandas-dev/pandas/blob/e97a56e746f8cdeabf7e83ec83455cbf5386c909/pandas/core/indexes/base.py#L580, which ends up here, since the type of the array is `ArrowExtensionArray` (`isinstance(dtype, ExtensionDtype)` returns true): https://github.com/pandas-dev/pandas/blob/e97a56e746f8cdeabf7e83ec83455cbf5386c909/pandas/core/indexes/base.py#L609. Now the issue is that the above returns `Index`, not `DatetimeIndex`, even though the dtype is `timestamp[s][pyarrow]`. Over here, I assume the expectation is `DatetimeIndex` for the arrow timestamp dtype? (Correct me if I am wrong.) (I'm still debugging this... I'll add more findings after I find the issue.)
- I tried just returning `DatetimeIndex` at the above line, but it still doesn't solve the issue. It errors out here: https://github.com/pandas-dev/pandas/blob/e97a56e746f8cdeabf7e83ec83455cbf5386c909/pandas/core/indexes/base.py#L656, since `<class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'>` is not an instance of `<class 'pandas.core.arrays.datetimes.DatetimeArray'>`. For this issue, do we have to make another special class like `ArrowDatetimeArray`? I saw there is an `ArrowStringArray`. (PS: I even tried skipping this assert. `set_index` does work, but when it tries to print the result, it errors out saying `'ArrowExtensionArray' object has no attribute 'freq'`, which makes sense since `ArrowExtensionArray` doesn't implement `DatetimeIndexOpsMixin`.)
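For anyone who wants to poke at this without stepping through `base.py`, here is a rough sketch that inspects the same facts through the public API and then tries an explicit conversion as a stopgap. It is not a proposed fix. The helper name `inspect_arrow_index` and the returned keys are made up for illustration, and it assumes a pandas 2.x or later install with pyarrow available (it returns `None` otherwise). The conversion via `pd.DatetimeIndex(...)` hands the data to the NumPy-backed `DatetimeArray`, so the Arrow backing is lost on the index.

```python
import datetime as dt

import pandas as pd
from pandas.api.extensions import ExtensionDtype

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional; the sketch is skipped without it
    pa = None


def inspect_arrow_index():
    """Hypothetical helper: report facts about an Arrow-backed timestamp index."""
    if pa is None:
        return None

    # Build the Arrow-backed column directly instead of going through read_csv.
    dtype = pd.ArrowDtype(pa.timestamp("s"))
    dates = pd.array([dt.datetime(2024, 1, 1), dt.datetime(2024, 2, 1)], dtype=dtype)
    res = pd.DataFrame({"date": dates, "value": [1, 2]}).set_index("date")

    # Stopgap: wrap the values in a DatetimeIndex explicitly (NumPy-backed).
    converted = res.copy()
    converted.index = pd.DatetimeIndex(res.index)

    return {
        # Class chosen by set_index for the Arrow timestamp column.
        "index_type": type(res.index).__name__,
        # The branch discussed above is taken when this is True.
        "dtype_is_extension": isinstance(res.index.dtype, ExtensionDtype),
        # Whether the backing array exposes the `freq` attribute at all.
        "array_has_freq": hasattr(res.index.array, "freq"),
        "converted_index_type": type(converted.index).__name__,
        # Partial-string lookup on the converted frame.
        "january_values": converted.loc["2024-01"]["value"].tolist(),
    }


facts = inspect_arrow_index()
print(facts)
```

If the conversion behaves as assumed, `.loc["2024-01"]` on the converted frame should return the January row again, at the cost of leaving the pyarrow dtype behind for the index.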