Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from datetime import datetime
import pandas as pd
# Create a datetime object (midnight)
dt = datetime(2015, 1, 1, 0)
# Create a pd.Timedelta object (1 hour)
delta_1 = pd.Timedelta("1H")
# Create a pd.Timedelta object from a DateTimeIndex frequency (1 hour)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1H").freq)
# First check that the pd.Timedelta objects are equal
assert(delta_1 == delta_2) # True
# Then check whether adding each pd.Timedelta to the datetime object moves the datetime by one hour: this does not behave correctly since Pandas 2.0
assert(dt + delta_1 == dt + delta_2) # False -> AssertionError
# Left-hand side looks right: the hour was added
dt + delta_1 # datetime.datetime(2015, 1, 1, 1, 0)
# Right-hand side appears to be wrong: the hour was not added
dt + delta_2 # datetime.datetime(2015, 1, 1, 0, 0)
Issue Description
It seems that the pd.Timedelta object produced by calling pd.to_timedelta on a pd.DatetimeIndex frequency can no longer be added to a datetime.datetime object since pandas>1.5.3.
It's not evident to me why the two pd.Timedelta objects (which appear to be equal) behave differently.
Expected Behavior
Running the above example in pandas==1.5.3 (and also 1.4.4) works as expected.
Installed Versions
Comment From: jbrockmendel
I have an explanation but not a solution.
delta_2 here has unit="s". Timedeltas with unit="s" can go outside the range supported by datetime.timedelta, so the way they are initialized passes seconds=0 to the parent class (datetime.timedelta) constructor. So the pytimedelta struct (which pandas ignores) looks like timedelta(0). When you do dt + delta_2, that calls the pydatetime's __add__ method, which treats delta_2 as a regular pytimedelta, not a pd.Timedelta. So it uses the 0, giving the incorrect result.
Reversing the operation and doing delta_2 + dt gives the correct result, as does pd.Timestamp(dt) + delta_2.
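To make this concrete, here is a minimal sketch (assuming pandas >= 2.0, where delta_2 gets unit="s"): calling the base-class total_seconds unbound reads the inherited struct directly and should expose the zeroed value, while pandas' own method reports the real duration.
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
delta_2.unit                               # 's' (non-nanosecond resolution)
delta_2.total_seconds()                    # 3600.0 -- pandas' own view of the value
datetime.timedelta.total_seconds(delta_2)  # 0.0    -- the inherited, zeroed pytimedelta struct
pd.Timestamp(dt) + delta_2                 # Timestamp('2015-01-01 01:00:00'), pandas handles the add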
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented.
Comment From: miccoli
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented
No way! pd.Timedelta claims to be a datetime.timedelta subclass:
>>> import datetime
>>> import pandas as pd
>>> issubclass(pd.Timedelta, datetime.timedelta)
True
The Liskov substitution principle requires that datetime.__add__(self, other) remain semantically valid whenever other is a datetime.timedelta subclass.
By the way, why is pd.Timedelta a datetime.timedelta subclass? They are mostly incompatible. The only way forward seems to me to completely drop the claim that pd.Timedelta should be a datetime.timedelta subclass.
Comment From: GiorgioBalestrieri
Just caught a bug caused by this while trying to upgrade to pandas 2.0. This is both quite concerning (as mentioned above, pd.Timedelta should behave like dt.timedelta, since it claims to be a subclass) and potentially a blocker, because we'll have to go and ensure that each single pd.Timedelta is converted to dt.timedelta before doing any arithmetic on it.
I really don't want to sound like I'm demanding a fix for a bug/issue in an open source library/project, but I think solving this should be a priority.
Comment From: Flix6x
Maybe a silly suggestion, but could we not simply have pd.Timedelta.__add__ conditionally switch the order of self and other?
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented
Could you reformulate this solution as an issue with the stdlib? As it stands, I don't understand what would be wrong there.
They are mostly incompatible.
Since pd.Timedelta subclasses datetime.timedelta, I would actually expect them to be mostly compatible. Could you elaborate on this?
Comment From: miccoli
There is no easy solution right now: the problem is with datetime.datetime.__add__ (and not with pd.Timedelta.__add__).
See this example:
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
assert isinstance(delta_2, datetime.timedelta)
print(dt.__add__(delta_2)) # 2015-01-01 00:00:00
print(delta_2.__add__(dt)) # 2015-01-01 01:00:00
So the Python stdlib datetime.datetime.__add__ method will see that delta_2 is a datetime.timedelta instance and assume that it behaves like datetime.timedelta(seconds=3600), which is not true.
As a general comment, maybe datetime.datetime.__add__ is a little too optimistic, since for efficiency it assumes that the pytimedelta struct holds a valid value, while of course it could also query the higher level property pd.Timedelta.seconds...
By the way, there is a general feature of Python that allows subclasses to override their ancestors' operations; see the note at the end of https://docs.python.org/3/reference/datamodel.html#emulating-numeric-types.
In fact
delta_1 = datetime.timedelta(seconds=3600)
print(delta_1 + delta_2) # 0 days 02:00:00
works as expected, because delta_1 + delta_2 will call the reflected delta_2.__radd__(delta_1) first, even if delta_1.__add__(delta_2) does not return NotImplemented:
repr(delta_1 + delta_2) # "Timedelta('0 days 02:00:00')"
repr(delta_1.__add__(delta_2)) # 'datetime.timedelta(seconds=3600)'
repr(delta_2.__radd__(delta_1)) # "Timedelta('0 days 02:00:00')"
So maybe a way forward is to convince the CPython developers that special code should be added to datetime.datetime.__add__(x) to check first for a reflected operator if type(x) is a subclass of datetime.timedelta... but I think it is highly improbable that such a fix will be accepted on the CPython side.
From the pandas side, the obvious solution would be to ensure that issubclass(pandas._libs.tslibs.timedeltas.Timedelta, datetime.timedelta) returns False, so that CPython is forced to use the reflected
delta_2.__radd__(dt) # Timestamp('2015-01-01 01:00:00')
which works as expected.
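A plain-Python toy (no pandas involved; the class names here are invented purely for illustration) showing why dropping the subclass relationship would force the reflected path:
import datetime

class PlainDuration:
    # Not a timedelta subclass: datetime.__add__ returns NotImplemented,
    # so Python falls back to the reflected method.
    def __radd__(self, other):
        return "reflected __radd__ was used"

class SubclassDuration(datetime.timedelta):
    # A timedelta subclass: datetime.__add__ handles it via the base struct,
    # and the reflected method is never consulted.
    def __radd__(self, other):
        return "reflected __radd__ was used"

dt = datetime.datetime(2015, 1, 1)
dt + PlainDuration()            # 'reflected __radd__ was used'
dt + SubclassDuration(hours=1)  # datetime.datetime(2015, 1, 1, 1, 0)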
Bottom line: do not mix Python datetime.* objects with pandas time objects, because they are largely incompatible and could give rise to strange behaviour.
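For code that has to mix the two today, one possible workaround (a sketch, not an official recommendation) is to convert explicitly before doing the arithmetic:
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
pd.Timestamp(dt) + delta_2     # Timestamp('2015-01-01 01:00:00') -- stay in pandas types
dt + delta_2.to_pytimedelta()  # datetime.datetime(2015, 1, 1, 1, 0) -- stay in stdlib types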
Comment From: giuliobeseghi
Do we understand why this wasn't an issue in pandas<2.0?
Comment From: GiorgioBalestrieri
Timedeltas with unit="s" can go outside the range supported by datetime.timedelta
curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway, wouldn't it be reasonable to constrain pd.Timedelta to the same range of values as datetime.timedelta, and represent the timedelta object "correctly" internally, which would effectively solve this issue?
Code in question: https://github.com/pandas-dev/pandas/blob/adec21f3fa896684cad04ecf1878a9f2492370ea/pandas/_libs/tslibs/timedeltas.pyx#L965-L976
This was introduced in: https://github.com/pandas-dev/pandas/pull/47641
Comment From: miccoli
curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway,
Pandas datetime arithmetic is guaranteed to work: it is mixed Python and pandas arithmetic which is problematic. Simply put, the pytimedelta structure could overflow, so it is simply not initialised for resolutions coarser than µs. Note that this structure is not needed for pandas temporal arithmetic, but only to claim that pd.Timedelta is a datetime.timedelta subclass...
Here is a comparison to numpy behaviour, which was the model for the pandas implementation.
>>> issubclass(np.timedelta64, datetime.timedelta)
False
>>> np.timedelta64(1, 'ns').astype(datetime.timedelta)
1
>>> np.timedelta64(1, 'ms').astype(datetime.timedelta)
datetime.timedelta(microseconds=1000)
>>> np.timedelta64(10**9*24*3600*1000-1, 'ms').astype(datetime.timedelta)
datetime.timedelta(days=999999999, seconds=86399, microseconds=999000)
>>> np.timedelta64(10**9*24*3600*1000, 'ms').astype(datetime.timedelta)
86400000000000000
Main takeaways:
- np.timedelta64 does not claim to be a datetime.timedelta subclass
- when an explicit cast to datetime.timedelta is possible without resolution loss, overflow, or underflow, a datetime.timedelta object is returned; otherwise an int with the unit information lost.
Maybe a mitigation could be to initialise the pytimedelta structure when this is possible without overflow, even if the resolution is coarser than µs?
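A rough, hypothetical sketch of that mitigation at the Python level (the helper name is invented here; the real change would live in pandas' Cython constructor):
import datetime

def base_struct_kwargs(value_us: int) -> dict:
    # value_us: the Timedelta value already converted to microseconds.
    # Populate the inherited datetime.timedelta fields when the value fits,
    # otherwise fall back to the current zeroed behaviour.
    try:
        datetime.timedelta(microseconds=value_us)
    except OverflowError:
        return {"seconds": 0}
    return {"microseconds": value_us}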
Moreover, here is another problem, which shows up at the other end of the spectrum:
>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D'))
Traceback (most recent call last):
File "timedeltas.pyx", line 1682, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
File "np_datetime.pyx", line 683, in pandas._libs.tslibs.np_datetime.convert_reso
OverflowError: result would overflow
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "timedeltas.pyx", line 806, in pandas._libs.tslibs.timedeltas._binary_op_method_timedeltalike.f
File "timedeltas.pyx", line 1685, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
pandas._libs.tslibs.np_datetime.OutOfBoundsTimedelta: Cannot cast 999999999 days 00:00:00 to unit='us' without overflow.
but
>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D')).to_pytimedelta()
datetime.timedelta(days=999999999)
In fact
>>> datetime.timedelta.max
datetime.timedelta(days=999999999, seconds=86399, microseconds=999999)
is 86399999999999999999 µs, which is 67 bits long.
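A quick check of that arithmetic:
>>> us_max = (999_999_999 * 86_400 + 86_399) * 10**6 + 999_999
>>> us_max
86399999999999999999
>>> us_max.bit_length()
67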
And here we are again at the starting point:
- Python has a datetime.timedelta object with a fixed 1 µs resolution and a range of about 67 bits (plus sign).
- numpy.timedelta64 values are 64 bits (including the sign) with a user-chosen resolution, from 1 as (one attosecond) to 1 Y (one year).
So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.
Comment From: GiorgioBalestrieri
So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.
I'm probably missing some context here, but it sounds like making pd.Timedelta not a subclass of dt.timedelta is the least bad option here, and actually reflects what's going on? This would be a breaking change (in terms of type checking, at least, and any logic using isinstance, I guess), but at least it should alleviate the concerns that there can be silent bugs in upgrading to pandas 2.0?
Could any devs chime in on the feasibility of this? Could it be rolled out as part of pandas 2.0, or would it have to wait for pandas 3.0?
One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead cap them at the largest possible value for dt.timedelta if they are out of range. My impression is that this would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).
Comment From: miccoli
One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead cap them at the largest possible value for dt.timedelta if they are out of range. My impression is that this would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).
I agree: silently clipping at 86399999999999999999 µs would be a bad option.
I fear that there is no easy way forward here, as also
>>> issubclass(pd.Timestamp, datetime.datetime)
True
so a complete overhaul of the pandas datetime system would be required. But here the opinion of a pandas developer is needed.