Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from datetime import datetime
import pandas as pd
# Create a datetime object (midnight)
dt = datetime(2015, 1, 1, 0)
# Create a pd.Timedelta object (1 hour)
delta_1 = pd.Timedelta("1H")
# Create a pd.Timedelta object from a DateTimeIndex frequency (1 hour)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1H").freq)
# First check that the pd.Timedelta objects are equal
assert(delta_1 == delta_2) # True
# Then check whether adding each pd.Timedelta to the datetime object moves the datetime by one hour: this does not behave correctly since Pandas 2.0
assert(dt + delta_1 == dt + delta_2) # False -> AssertionError
# Left-hand side looks right: the hour was added
dt + delta_1 # datetime.datetime(2015, 1, 1, 1, 0)
# Right-hand side appears to be wrong: the hour was not added
dt + delta_2 # datetime.datetime(2015, 1, 1, 0, 0)
Issue Description
It seems that the pd.Timedelta object produced by calling pd.to_timedelta on a pd.DatetimeIndex frequency can no longer be added to a datetime.datetime object since pandas>1.5.3.
It's not evident to me why the two pd.Timedelta objects (which appear to be equal) behave differently.
Expected Behavior
Running the above example in pandas==1.5.3 (and also 1.4.4) works as expected.
Installed Versions
Comment From: jbrockmendel
I have an explanation but not a solution.
delta_2 here has unit="s". Timedeltas with unit="s" can go outside the range supported by datetime.timedelta, so the way they are initialized passes seconds=0 to the parent class (datetime.timedelta) constructor. So the pytimedelta struct (which pandas ignores) looks like timedelta(0). When you do dt + delta_2, that calls the pydatetime's __add__ method, which treats delta_2 as a regular pytimedelta, not a pd.Timedelta. So it uses the 0, giving the incorrect result.
Reversing the operation and doing delta_2 + dt gives the correct result, as does pd.Timestamp(dt) + delta_2.
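To make this concrete, here is a minimal sketch (assuming pandas >= 2.0, where delta_2 gets unit="s"): calling the base-class total_seconds unbound reads the inherited struct directly and should expose the zeroed value, while pandas' own method reports the real duration.
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
delta_2.unit                               # 's' (non-nanosecond resolution)
delta_2.total_seconds()                    # 3600.0 -- pandas' own view of the value
datetime.timedelta.total_seconds(delta_2)  # 0.0    -- the inherited, zeroed pytimedelta struct
pd.Timestamp(dt) + delta_2                 # Timestamp('2015-01-01 01:00:00'), pandas handles the add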
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented.
Comment From: miccoli
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented
No way! pd.Timedelta claims to be a datetime.timedelta subclass:
>>> import datetime
>>> import pandas as pd
>>> issubclass(pd.Timedelta, datetime.timedelta)
True
The Liskov substitution principle requires that datetime.__add__(self, other) remain semantically valid whenever other is a datetime.timedelta subclass.
By the way, why is pd.Timedelta a datetime.timedelta subclass? They are mostly incompatible. The only way forward seems to me to completely drop the claim that pd.Timedelta should be a datetime.timedelta subclass.
Comment From: GiorgioBalestrieri
Just caught a bug caused by this while trying to upgrade to pandas 2.0. This is both quite concerning (as mentioned above, pd.Timedelta should behave like dt.timedelta, since it claims to be a subclass) and potentially a blocker, because we'll have to go and ensure that each single pd.Timedelta is converted to dt.timedelta before doing any arithmetic on it.
I really don't want to sound like I'm demanding a fix for a bug/issue in an open source library/project, but I think solving this should be a priority.
Comment From: Flix6x
Maybe a silly suggestion, but could we not simply have pd.Timedelta.__add__ conditionally switch the order of self and other?
A solution here would need to be in the stdlib datetime.__add__ returning NotImplemented
Could you reformulate this solution as an issue with the stdlib? As it stands, I don't understand what would be wrong there.
They are mostly incompatible.
Since pd.Timedelta subclasses datetime.timedelta, I would actually expect them to be mostly compatible. Could you elaborate on this?
Comment From: miccoli
There is no easy solution right now: the problem is with datetime.datetime.__add__ (and not with pd.Timedelta.__add__).
See this example:
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
assert isinstance(delta_2, datetime.timedelta)
print(dt.__add__(delta_2)) # 2015-01-01 00:00:00
print(delta_2.__add__(dt)) # 2015-01-01 01:00:00
So the Python stdlib datetime.datetime.__add__ method will see that delta_2 is a datetime.timedelta instance and assume that it behaves like datetime.timedelta(seconds=3600), which is not true.
As a general comment, maybe datetime.datetime.__add__ is a little too optimistic, since for efficiency it assumes that the pytimedelta struct holds a valid value, while of course it could also query the higher level property pd.Timedelta.seconds...
By the way, there is a general feature of Python that allows subclasses to override their ancestors' operations; see the note at the end of https://docs.python.org/3/reference/datamodel.html#emulating-numeric-types.
In fact
delta_1 = datetime.timedelta(seconds=3600)
print(delta_1 + delta_2) # 0 days 02:00:00
works as expected, because delta_1 + delta_2 will call the reflected delta_2.__radd__(delta_1) first, even if delta_1.__add__(delta_2) does not return NotImplemented:
repr(delta_1 + delta_2) # "Timedelta('0 days 02:00:00')"
repr(delta_1.__add__(delta_2)) # 'datetime.timedelta(seconds=3600)'
repr(delta_2.__radd__(delta_1)) # "Timedelta('0 days 02:00:00')"
So maybe a way forward is to convince the CPython developers that special code should be added to datetime.datetime.__add__(x) to check first for a reflected operator if type(x) is a subclass of datetime.timedelta... but I think it is highly improbable that such a fix will be accepted on the CPython side.
From the pandas side, the obvious solution would be to ensure that issubclass(pandas._libs.tslibs.timedeltas.Timedelta, datetime.timedelta) returns False, so that CPython is forced to use the reflected
delta_2.__radd__(dt) # Timestamp('2015-01-01 01:00:00')
which works as expected.
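A plain-Python toy (no pandas involved; the class names here are invented purely for illustration) showing why dropping the subclass relationship would force the reflected path:
import datetime

class PlainDuration:
    # Not a timedelta subclass: datetime.__add__ returns NotImplemented,
    # so Python falls back to the reflected method.
    def __radd__(self, other):
        return "reflected __radd__ was used"

class SubclassDuration(datetime.timedelta):
    # A timedelta subclass: datetime.__add__ handles it via the base struct,
    # and the reflected method is never consulted.
    def __radd__(self, other):
        return "reflected __radd__ was used"

dt = datetime.datetime(2015, 1, 1)
dt + PlainDuration()            # 'reflected __radd__ was used'
dt + SubclassDuration(hours=1)  # datetime.datetime(2015, 1, 1, 1, 0)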
Bottom line: do not mix Python datetime.* objects with pandas time objects, because they are largely incompatible and could give rise to strange behaviour.
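For code that has to mix the two today, one possible workaround (a sketch, not an official recommendation) is to convert explicitly before doing the arithmetic:
import datetime
import pandas as pd
dt = datetime.datetime(2015, 1, 1, 0)
delta_2 = pd.to_timedelta(pd.date_range('2015-01-01 00', '2015-01-01 01', freq="1h").freq)
pd.Timestamp(dt) + delta_2     # Timestamp('2015-01-01 01:00:00') -- stay in pandas types
dt + delta_2.to_pytimedelta()  # datetime.datetime(2015, 1, 1, 1, 0) -- stay in stdlib types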
Comment From: giuliobeseghi
Do we understand why this wasn't an issue in pandas<2.0?
Comment From: GiorgioBalestrieri
Timedeltas with unit="s" can go outside the range supported by datetime.timedelta
curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway, wouldn't it be reasonable to constrain pd.Timedelta to the same range of values as datetime.timedelta, and represent the timedelta object "correctly" internally, which would effectively solve this issue?
Code in question: https://github.com/pandas-dev/pandas/blob/adec21f3fa896684cad04ecf1878a9f2492370ea/pandas/_libs/tslibs/timedeltas.pyx#L965-L976
This was introduced in: https://github.com/pandas-dev/pandas/pull/47641
Comment From: miccoli
curious: what is the use case for this? If datetime arithmetic is not guaranteed to work anyway,
Pandas datetime arithmetic is guaranteed to work: it is mixed Python and pandas arithmetic which is problematic. Simply put, the pytimedelta structure could overflow, so it is simply not initialised for resolutions coarser than µs. Note that this structure is not needed for pandas temporal arithmetic, but only to claim that pd.Timedelta is a datetime.timedelta subclass...
Here is a comparison to numpy behaviour, which was the model for the pandas implementation.
>>> issubclass(np.timedelta64, datetime.timedelta)
False
>>> np.timedelta64(1, 'ns').astype(datetime.timedelta)
1
>>> np.timedelta64(1, 'ms').astype(datetime.timedelta)
datetime.timedelta(microseconds=1000)
>>> np.timedelta64(10**9*24*3600*1000-1, 'ms').astype(datetime.timedelta)
datetime.timedelta(days=999999999, seconds=86399, microseconds=999000)
>>> np.timedelta64(10**9*24*3600*1000, 'ms').astype(datetime.timedelta)
86400000000000000
Main takeaways:
- np.timedelta64 does not claim to be a datetime.timedelta subclass
- when an explicit cast to datetime.timedelta is possible without resolution loss, overflow, or underflow, a datetime.timedelta object is returned; otherwise an int with the unit information lost.
Maybe a mitigation could be to initialise the pytimedelta structure when this is possible without overflow, even if the resolution is coarser than µs?
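A rough, hypothetical sketch of that mitigation at the Python level (the helper name is invented here; the real change would live in pandas' Cython constructor):
import datetime

def base_struct_kwargs(value_us: int) -> dict:
    # value_us: the Timedelta value already converted to microseconds.
    # Populate the inherited datetime.timedelta fields when the value fits,
    # otherwise fall back to the current zeroed behaviour.
    try:
        datetime.timedelta(microseconds=value_us)
    except OverflowError:
        return {"seconds": 0}
    return {"microseconds": value_us}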
Moreover, here is another problem, which shows up at the other end of the spectrum:
>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D'))
Traceback (most recent call last):
File "timedeltas.pyx", line 1682, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
File "np_datetime.pyx", line 683, in pandas._libs.tslibs.np_datetime.convert_reso
OverflowError: result would overflow
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "timedeltas.pyx", line 806, in pandas._libs.tslibs.timedeltas._binary_op_method_timedeltalike.f
File "timedeltas.pyx", line 1685, in pandas._libs.tslibs.timedeltas._Timedelta._as_creso
pandas._libs.tslibs.np_datetime.OutOfBoundsTimedelta: Cannot cast 999999999 days 00:00:00 to unit='us' without overflow.
but
>>> datetime.timedelta(0) + pd.Timedelta(np.timedelta64(999_999_999, 'D')).to_pytimedelta()
datetime.timedelta(days=999999999)
In fact
>>> datetime.timedelta.max
datetime.timedelta(days=999999999, seconds=86399, microseconds=999999)
is 86399999999999999999 µs, which is 67 bits long.
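A quick check of that arithmetic:
>>> us_max = (999_999_999 * 86_400 + 86_399) * 10**6 + 999_999
>>> us_max
86399999999999999999
>>> us_max.bit_length()
67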
And here we are again at the starting point:
- Python has a datetime.timedelta object with a fixed 1 µs resolution and a range of about 67 bits (plus sign).
- numpy.timedelta64 values are 64 bits (including the sign) with a user-chosen resolution, from 1 as (one attosecond) to 1 Y (one year).
So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.
Comment From: GiorgioBalestrieri
So there is no doubt that it is impossible to have these two types mix and match smoothly... Pandas tried to obtain this compatibility, but there are a lot of rough edges.
I'm probably missing some context here, but it sounds like making pd.Timedelta not a subclass of dt.timedelta is the least bad option here, and actually reflects what's going on? This would be a breaking change (in terms of type checking, at least, and any logic using isinstance, I guess), but at least it should alleviate the concerns that there can be silent bugs in upgrading to pandas 2.0?
Could any devs chime in on the feasibility of this? Could it be rolled out as part of pandas 2.0, or would it have to wait for pandas 3.0?
One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead cap them at the largest possible value for dt.timedelta if they are out of range. My impression is that this would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).
Comment From: miccoli
One alternative might be to make the internal representation not 0 for timedeltas with second/millisecond "frequency", and instead cap them at the largest possible value for dt.timedelta if they are out of range. My impression is that this would make bugs even harder to catch, since the behavior would only be incorrect for very large timedeltas (less likely to be caught by a test).
I agree: silently clipping at 86399999999999999999 µs would be a bad option.
I fear that there is no easy way forward here, as also
>>> issubclass(pd.Timestamp, datetime.datetime)
True
so a complete overhaul of the pandas datetime system would be required. But here the opinion of a pandas developer is needed.