Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
tz = 'America/Santiago'
start_date = datetime.datetime(2018, 8, 10, 0, 0, 0)
end_date = datetime.datetime(2018, 8, 14, 23, 0, 0)
freq = 'H'
times = pd.date_range(start=start_date, end=end_date, freq=freq)
times = times.tz_localize(tz=tz, ambiguous='infer',
                          nonexistent='shift_forward')
print(pd.infer_freq(times[:10]))  # H
pd.infer_freq(times)
print(pd.infer_freq(times[:10]))  # None (expected H)

Issue Description

The first call to pd.infer_freq(times[:10]) returns H. After pd.infer_freq is called on the full index, the same call on the first 10 items returns None. Version 2.0.3 shows the expected behavior.

Expected Behavior

Return H in both instances of pd.infer_freq(times[:10]) in the example.

Installed Versions

INSTALLED VERSIONS
------------------
commit : a60ad39b4a9febdea9a59d602dad44b1538b0ea5
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Oct 4 23:56:02 PDT 2023; root:xnu-8020.240.18.704.15~1/RELEASE_ARM64_T6000
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.2
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None

Comment From: kandersolar

Perhaps this was introduced in https://github.com/pandas-dev/pandas/pull/51738? A call to times._engine.clear_mapping() seems to fix things:

times = pd.date_range(start="2018-08-11 20:00", end="2018-08-12 04:00", freq="H")
times = times.tz_localize(tz="America/Santiago", ambiguous='infer',
                          nonexistent='shift_forward')

print(pd.infer_freq(times[:3]))  # H
pd.infer_freq(times)
print(pd.infer_freq(times[:3]))  # None
times._engine.clear_mapping()
print(pd.infer_freq(times[:3]))  # H

Here is a related example:

times = pd.date_range(start="2018-08-11 20:00", end="2018-08-12 04:00", freq="H")
times = times.tz_localize(tz="America/Santiago", ambiguous='infer', nonexistent='shift_forward')

print(times[:3]._is_unique)  # True
times._is_unique
print(times[:3]._is_unique)  # False
times._engine.clear_mapping()
print(times[:3]._is_unique)  # True

The full times DatetimeIndex contains equivalent/duplicate times. The tested slice does not, but it incorrectly inherits the cached determination of non-uniqueness from its parent. Perhaps a suitable fix is to stop slices from inheriting unique and need_unique_check from the parent index?
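To make the suspected mechanism concrete, here is a toy model in plain Python. It is only a sketch of the idea, not pandas' actual engine code: ToyEngine, slice_inheriting, and slice_fresh are hypothetical names invented for illustration, and the single cached flag stands in for the unique / need_unique_check state mentioned above.

```python
class ToyEngine:
    """Hypothetical stand-in for an index engine that caches uniqueness."""

    def __init__(self, values):
        self.values = list(values)
        self._unique = None  # None means "not computed yet"

    @property
    def is_unique(self):
        # Compute once, then reuse the cached answer.
        if self._unique is None:
            self._unique = len(set(self.values)) == len(self.values)
        return self._unique


def slice_inheriting(parent, stop):
    """Problematic variant: the slice copies the parent's cached flag."""
    child = ToyEngine(parent.values[:stop])
    child._unique = parent._unique
    return child


def slice_fresh(parent, stop):
    """Proposed variant: the slice starts with no cached flag."""
    return ToyEngine(parent.values[:stop])


parent = ToyEngine([1, 2, 3, 3])  # last value is a duplicate
before = slice_inheriting(parent, 3).is_unique  # parent has nothing cached yet
parent.is_unique                                # caches False on the parent
after = slice_inheriting(parent, 3).is_unique   # stale False copied from the parent
fixed = slice_fresh(parent, 3).is_unique        # recomputed for the slice itself
print(before, after, fixed)
```

A subset of a unique index is always unique, so inheriting a cached "unique" answer would be safe. The reverse does not hold, which is the case the examples above run into.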

Comment From: kandersolar

A slightly more minimal reproducer:

# last datetime is a duplicate
times = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-03'])

print(pd.infer_freq(times[:3]))  # D
pd.infer_freq(times)
print(pd.infer_freq(times[:3]))  # None
times._engine.clear_mapping()
print(pd.infer_freq(times[:3]))  # D
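Until this is fixed, one possible workaround that avoids the private _engine attribute is to rebuild the slice as a new index before inferring the frequency. This is a sketch under the assumption that a newly constructed DatetimeIndex does not share cached state with the index it was sliced from. infer_freq_fresh is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def infer_freq_fresh(index):
    """Infer the frequency from a rebuilt copy of `index` (hypothetical helper)."""
    # Building a new DatetimeIndex from the individual Timestamps is assumed
    # to give an object with its own cache, independent of the parent index.
    fresh = pd.DatetimeIndex(list(index))
    return pd.infer_freq(fresh)


# last datetime is a duplicate
times = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-03'])

pd.infer_freq(times)  # triggers the uniqueness check on the full index
print(infer_freq_fresh(times[:3]))  # D
```

Rebuilding the index copies the data, so this is only reasonable for small slices or as a temporary measure.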