After https://github.com/pandas-dev/pandas/pull/55901, we now infer the best resolution, and so allow creating non-nanosecond data by default (instead of raising for out-of-bounds data).
To be clear, it is a very nice improvement to stop raising those OutOfBounds errors when the timestamp would fit perfectly in another resolution. But I do think we could reconsider the exact logic for determining the resolution.
With the latest changes you get the following:

```python
>>> pd.to_datetime(["2024-03-22 11:43:01"]).dtype
dtype('<M8[s]')
>>> pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
dtype('<M8[ms]')
>>> pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
dtype('<M8[us]')
```
The resulting dtype instance depends on the exact input value (not type). I do think this has some downsides:
- The result dtype becomes very data-dependent (while in general we want to avoid value-dependent behavior)
- You can very easily get multiple datetime dtypes in a workflow, causing more casting (to a different unit) than necessary

The fact that pandas by default truncates the string repr of datetimes (i.e. we don't show the subsecond parts if they are all zero, regardless of the actual resolution), in contrast to numpy, also means that round-tripping through a text representation (e.g. CSV) will very often lead to a change in dtype.
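To make the round-tripping point concrete, here is a minimal sketch (using `str()` as a stand-in for writing and re-reading a CSV, and assuming the value-dependent inference shown above):

```python
import pandas as pd

# Nanosecond data whose values happen to fall on whole seconds ...
idx = pd.to_datetime(["2024-03-22 11:43:01"]).astype("datetime64[ns]")
print(idx.dtype)  # datetime64[ns]

# ... is rendered without a subsecond part in its text form ...
as_text = [str(ts) for ts in idx]
print(as_text)  # ['2024-03-22 11:43:01']

# ... so parsing that text back infers a coarser (second) resolution.
print(pd.to_datetime(as_text).dtype)  # datetime64[s] under value-dependent inference
```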
As a potential alternative, we could also decide to have a fixed default resolution (e.g. microseconds), and then the logic for inferring the resolution could be: try to use the default resolution, and only if that does not work (either out of bounds or too much precision, i.e. nanoseconds present), use the resolution inferred from the data.
That still gives some value-dependent behaviour, but I think this would make it a lot less common to see. And a resolution like microseconds is sufficient for by far most use cases (in terms of the bounds it supports: [290301 BC, 294241 AD]).
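A rough sketch of that fallback order (a hypothetical helper for illustration, not the actual pandas internals; it only relies on the existing `DatetimeIndex.as_unit`/`.unit` API and the `OutOfBoundsDatetime` error):

```python
import pandas as pd
from pandas.errors import OutOfBoundsDatetime


def convert_with_default_unit(values, default="us"):
    """Hypothetical illustration of the proposed logic: prefer a fixed
    default resolution, and only fall back to the data-inferred one when
    the default is out of bounds or would lose precision."""
    inferred = pd.to_datetime(values)  # value-dependent inference as of today
    try:
        converted = inferred.as_unit(default)
    except OutOfBoundsDatetime:
        return inferred  # e.g. dates far outside the microsecond range
    # If converting to the default would drop precision (nanoseconds present),
    # keep the finer inferred resolution instead.
    if (converted.as_unit(inferred.unit) == inferred).all():
        return converted
    return inferred
```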
Comment From: jorisvandenbossche
cc @pandas-dev/pandas-core
Comment From: WillAyd
This sounds reasonable, and I think it could help simplify the implementation.
Comment From: bashtage
I think this would be an improvement. It seems like a good idea that anyone working with human-scale times (say, down to second precision) within the range of the modern era would get the same basis for the timestamp.
Comment From: jorisvandenbossche
> I think it could help simplify the implementation
I don't think it will simplify things generally, because we still need the current inference logic when the default unit does not fit, but from looking a bit into it, I also don't think it should make the code much more complex.
Comment From: Pranav-Wadhwa
take
Comment From: Pranav-Wadhwa
Based on discussions, I will update `to_datetime` to always use nanoseconds in the given scenarios.
Comment From: Pranav-Wadhwa
@jorisvandenbossche would updating the `to_datetime` function to accept a default `unit='ns'` be a feasible solution for this? Or are there cases where it wouldn't make sense to default to nanoseconds?
Comment From: WillAyd
@Pranav-Wadhwa nanoseconds is what we used previously, so I don't think we want to go back to that. The OP suggests microseconds as a default resolution, although I'm not sure it's as simple as changing the `to_datetime` signature either.
Before diving into the details, I think we should get some more agreement from the pandas core team. @jbrockmendel is our datetime guru, so let's see if he has any thoughts first.
Comment From: jbrockmendel
I'm fine with the OP suggestion as long as we are internally consistent, i.e. the `Timestamp` constructor too.
Comment From: Pranav-Wadhwa
@jbrockmendel what do you mean by the Timestamp constructor? If we set the default value of `unit='us'` in `to_datetime`, it resolves the example that the OP mentioned, but it would require changing many test cases in `tests/tools/test_to_datetime.py`.
Comment From: rhshadrach
I would prefer going even further: don't automatically fall back if `us` is not suitable - require the user to manually override if they are dealing with dates beyond that range. That said, I'm still in favor of the OP as-is, which I see as a step in the right direction.
@Pranav-Wadhwa - I believe there are many places this would need to change in pandas to be consistent, beyond just `to_datetime`.
Comment From: Pranav-Wadhwa
> @Pranav-Wadhwa - I believe there are many places this would need to change in pandas to be consistent, beyond just `to_datetime`.

I believe the scope of this work is outside my knowledge, as this is my first issue with pandas. If it's helpful to future assignees, I found that the `to_datetime` function would default to the appropriate unit if no unit was passed in, but if `us` was passed in as the unit, it would create an object with dtype `ns` inside the `_to_datetime_with_unit` function.
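For reference, a small reproduction of the behaviour described above (as observed on the development branch at the time of writing; exact outputs may differ between versions):

```python
import pandas as pd

# No unit passed: the resolution is inferred from the input strings ...
print(pd.to_datetime(["2024-03-22 11:43:01"]).dtype)  # datetime64[s]

# ... but numeric input with an explicit unit still lands on nanoseconds.
print(pd.to_datetime([1_700_000_000_000_000], unit="us").dtype)  # datetime64[ns]
```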
Comment From: jbrockmendel
If we do change the behavior, wouldn't it make more sense to keep "ns" as the default for backward compat?
Comment From: jorisvandenbossche
> If we do change the behavior, wouldn't it make more sense to keep "ns" as the default for backward compat?
That would also be an option, yes. I think that, long term, microseconds is a better default (a larger fraction of use cases will fit in the default resolution, and it gives better compatibility with other tools), so personally I would like to switch to that at some point. But that doesn't necessarily have to happen now (e.g. it could also happen when we switch to a nullable extension dtype for datetime64).
I had actually started looking at this a year ago when I opened the issue, but never got around to opening a PR for it. I revived that branch and got it to a basic working state -> https://github.com/pandas-dev/pandas/pull/62031. Currently that PR uses microseconds, but it should be relatively easy to change it to default to nanoseconds (with the logic of falling back to `us` when the data does not fit in the nanosecond range).
Comment From: rhshadrach
> better compatibility with other tools
NumPy, PySpark, and Polars all default to microseconds, and PySpark doesn't support nano. This has been a bit of an annoyance for me personally at times (particularly PySpark). I'm positive on switching pandas over to micro, but I have no sense as to how big of a code change this would be.
Comment From: jorisvandenbossche
> I'm positive on switching pandas over to micro, but I have no sense as to how big of a code change this would be.
For pandas itself, it is not that big; you can check it in https://github.com/pandas-dev/pandas/pull/62031 (that PR is not yet entirely complete, but it already changes the default for most cases in `pd.to_datetime` and `pd.Timestamp`. Most additional code changes will be needed in updating tests).
Comment From: jorisvandenbossche
@rhshadrach suggested at the dev meeting to make this change more gradually (because no longer inferring nanoseconds by default is a breaking change if you relied on the integer representation; although this already happens if your data source has type information that we preserve, e.g. reading a Parquet file that uses non-nanosecond resolution will already be read as such in pandas 2.x).
The idea would be to still default to nanoseconds (when converting from strings, stdlib datetime objects, etc.) as in pandas 2.x, but introduce a future flag to globally enable the future default of microseconds.
General thoughts on that?
Some practical questions here:
- How do we want to name this? Something like `pd.options.future.infer_microseconds = True`? (Is that clear enough that it is about datetimelike data? It could also control the timedelta64 behaviour when that is implemented, so it can also be a plus that it is not too specific.) A usage sketch follows after this list.
- What to do for cases that currently raise an out-of-bounds error for nanoseconds (for example, `pd.to_datetime("2300-01-01")`)? Options:
  - Since this is currently raising an error, it is less of a backwards compatibility issue to already start using microseconds here by default. And this already enables the nice improvement to more easily handle out-of-bounds timestamps in pandas 3.0 by default (without having to enable a feature flag).
  - Keep raising an error as we do for 2.x, but then we can update the error message to point the user to enable the future flag to have it work.
- For `pd.Timestamp("..")`, the data-dependent resolution inference was actually already released in pandas 2.x. In the current PR I made for defaulting to microseconds, I also updated this constructor to consistently use microseconds (if possible). If we change to keep defaulting to nanoseconds and put the microseconds behind a future flag, I would also update `pd.Timestamp("..")` to follow this (but this does mean a bit of back and forth for the behaviour of this constructor ..).
- If we do this, how do we test it? Add one build that enables the future default? I assume so, but given the amount of test changes currently in #62031, that might add a lot of if/else cases in the tests .. (although in some tests that are not about testing the default, we could probably fix the unit of the input data to not have it depend on the default inference).
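For illustration, a purely hypothetical sketch of what enabling such a flag could look like (the option name is just the one floated above and is not final; it mirrors existing flags like `pd.options.future.infer_string`):

```python
import pandas as pd

# Hypothetical: name and availability depend on the outcome of this discussion.
pd.set_option("future.infer_microseconds", True)

# With the flag enabled, string parsing would use microseconds by default ...
pd.to_datetime(["2024-03-22 11:43:01"]).dtype  # expected: datetime64[us]

# ... and values out of bounds for nanoseconds would work without an error.
pd.to_datetime("2300-01-01")  # expected: Timestamp('2300-01-01 00:00:00')
```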
Comment From: Dr-Irv
My answers:
- I do think the change should be gradual.
- Maybe `pd.options.future.default_microseconds` might communicate it better?
- Don't raise if we don't have to.
- I think any place where we are inferring a default resolution should be changed based on the flag. So if `to_datetime()` is changed, then `Timestamp` and `Timedelta` should probably be changed as well.
- On testing, I looked at a few of the tests that you changed, and I'm not sure that the test needed to be changed. If a test specified a `dtype` of `datetime64[s]`, that should work independent of this change of default resolution. I think it would be best to really determine which test results are dependent on the default resolution (I would hope very few!), duplicate those tests to another test file, and turn on the option within that file. I guess it would depend on how many tests are really dependent on this resolution.
Comment From: jorisvandenbossche
> I think it would be best to really determine which test results are dependent on the default resolution (I would hope very few!)
In practice we have a lot of tests that depend on the default resolution, but only because they create test data (or the expected result) using strings/timestamps/datetimes, and the conversion of those thus depends on the default resolution. But typically it does not matter for what is being tested itself, so in some cases we can indeed specify the dtype so the test does not depend on the default.
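As an illustration of that last point, a test can pin the unit explicitly so the expected values do not depend on the default inference (a sketch using the internal `pandas._testing` helpers that the test suite already relies on):

```python
import pandas as pd
import pandas._testing as tm

# Construct the expected index with an explicit dtype instead of relying on
# whatever resolution string conversion happens to infer.
result = pd.to_datetime(["2024-03-22 11:43:01"]).as_unit("ns")
expected = pd.DatetimeIndex(["2024-03-22 11:43:01"], dtype="datetime64[ns]")
tm.assert_index_equal(result, expected)
```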
Comment From: rhshadrach
> - Maybe `pd.options.future.default_microseconds` might communicate it better?

Agree on `default` over `infer`. I don't see it as inference when a user does `pd.Series([dt.timedelta(20)])`.
Comment From: jbrockmendel
> I don't see it as inference when a user does pd.Series([dt.timedelta(20)]).
nitpicky heads-up: resolution inference is implemented for datetime64 but not for timedelta64
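For comparison, a quick check of the behaviour referred to here (as of pandas 2.x, and assumed unchanged while timedelta64 inference is not implemented):

```python
import datetime as dt

import pandas as pd

# timedelta input currently does not go through resolution inference and
# still lands on nanoseconds.
print(pd.Series([dt.timedelta(20)]).dtype)  # timedelta64[ns]
```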