Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.to_datetime(['31 May 2023', '1 Apr 2023'])
Issue Description
When the first date entries have "May" as the month, the parser assumes the months are written out as "month as locale’s full name" (format code %B
), and then, when it hits an abbreviated month (correct format code %b
- abbreviated name), it gives an error:
ValueError: time data "1 Apr 2023" doesn't match format "%d %B %Y", at position 1. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
Expected Behavior
It would be very good if the parser remained agnostic as to whether the month specifier was abbreviated until after it had seen a month other than "May".
Installed Versions
Comment From: Aloqeely
Thanks for the report! This is a duplicate of #57521 which got closed for "not a bug" But I think we should improve this, I can work on a fix but want to get @rhshadrach's opinion first
Comment From: rhshadrach
Thanks for the ping @Aloqeely. That issue was closed by its author while I sought the opinion of @MarcoGorelli. I'd like to do that here as well!
Comment From: MarcoGorelli
thanks for the ping - I wouldn't consider this a bug, this looks intended
if this was auto-inferred as %b, then someone else would complain when it would raise on "April"
But I think we should improve this
how, what do you suggest?
Comment From: asishm
I'd agree in general with @rhshadrach's comments in the linked issue (specifically around the edge cases where say most of the entries are May but one is April etc.). In the meantime, I think a doc update which contains language from the release notes (see below) would be helpful. Maybe add this specific caveat around inferring May which could lead to either %b or %B in the notes section.
As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).
That language is present in the current docs under, but the fact that this param is deprecated doesn't make it very clear as to what is the current behavior.
infer_datetime_formatbool, default False If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x. Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.
Comment From: matthew-brett
Just to say - I've read the previous issue - but I do think this is a bug in the guess of the initial date format, unless there is, for some reason, an absolute requirement that the guess is not revisited, and cannot be improved by later entries. I mean, I think it is reasonable to consider a risky guess as a bug. Here, when the first date has "May" as the month, the chances of full-month-format (%B
) being correct are - difficult to estimate - it would depend on how frequent full-month and abbreviated-month formats are in the wild. My guess is that abbreviated month formats are more common, so the current guess is more often wrong than right. Which is not to suggest that the default be reversed, but that it is in principle reasonable to consider ways of updating the guess if, as here, there is a high probability that the guess is wrong.
Comment From: matthew-brett
I don't know the code, and if that's an issue - I can take a look. I guess the problem is that there isn't currently a way of updating the guess once an error arises? If not - would that be practical to implement in restricted circumstances with high risk guesses and obvious alternatives, as here?
Comment From: MarcoGorelli
tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex
Comment From: asishm
A 2-pass approach might work. If the first pass errors out, then re-run with format='mixed'
- and should also raise a PerformanceWarning
Comment From: matthew-brett
@MarcoGorelli - thanks - I do not know the Pandas code-base very well, but if no-one else is interested to look at this, I could give it a go.
Comment From: Aloqeely
But I think we should improve this
how, what do you suggest?
Using the first element that doesn't have May as month. Opened #58336
Comment From: rhshadrach
tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex
Another requirement I'd add is predictable - not only in behavior but performance. It should be easy to describe to a user what pandas is doing to infer, and the procedure should be in the docstring.
Comment From: matthew-brett
Just thinking aloud here, but I wonder whether there is some way of maintaining an "alternative" format that would be triggered by an error. As in:
31-May-2023 -> guessed %B, alternative %b 1-Apr-2023 -> error for %B, switch to alternative %b, recast previous and continue
but
31-May-2023 -> guessed %B, alternative %b 1-January-2023 -> remove alternative as invalid 1-Apr-2023 -> raise usual error
This could also mean automatic correct parsing of the old dd-mm-yyyy issue:
12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy 13-10-2023 -> error for mm-dd-yyyy, switch to dd-mm-yyyy, recast all previous.
Where:
12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy 12-13-2023 -> remove alternative as invalid 13-10-2023 -> raise usual error
Of course this would have some parsing-time cost, but this would only be for a few elements, typically, before the alternative can be excluded.
Comment From: Aloqeely
The general "alternative format" idea sounds great but there will be an endless amount of alternative formats, and I'm not sure if it is considered a "simple solution", maybe you can try tweaking with it and see if it's easily doable.
Comment From: matthew-brett
I'll have a look at the code - thanks.
Sketching here, I think it need only catch some common cases, and others could argue for further extensions as they run into them. Then the code might look something like:
...
found_fmt = guess_format(first_entry)
alternatives = _find_alternatives(found_fmt)
...
for current_entry in entries:
...
if alternatives:
alternatives = _prune_alternatives(current_entry, alternatives)
try:
out = parse_entry(current_entry, found_fmt)
except ValueError as e:
if alternatives:
return dp.to_datetime(series, format=alternatives)
raise e
Comment From: MarcoGorelli
Using the first element that doesn't have May as month. Opened https://github.com/pandas-dev/pandas/pull/58336
thanks - tbh I feel uneasy about special-casing 'may' like this
Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain
In the end, this "guess datetime format" is just provided for convenience, if you want reliable parsing you should pass a format explicitly
Comment From: Aloqeely
thanks - tbh I feel uneasy about special-casing 'may' like this
I agree, Brett also mentioned (in my PR) that it wouldn't work for other locale's month names
Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain
I'm a big fan of your suggestion and Brett's suggestion and think we should implement both, I.E. take first 5 non-null elements, for each element see all possible guesses, and take the first format that works for all 5. Is this good?
Now I don't know if we can accomplish this without complicating the code, @matthew-brett are you able to do it? If not, I can just implement the idea of checking the first 5 non-null elements and call it a day
Comment From: asishm
I'm not necessarily a fan of using first K because there's nothing fundamentally different from K=1 (which is today's behavior) and you would still run into issues if "April" pops up on the 6th value instead of within the first K.
Comment From: Aloqeely
What do you suggest?
Comment From: Aloqeely
After having this conversation, one thing is clear to me, not only do we have to describe the procedure pandas does to infer (as Mr. Shadrach suggested), we should also highly encourage the use of the format
parameter.
Comment From: matthew-brett
@Aloqeely - thanks for looking at this - yes - I'm happy to take a go at this, and try and come up with something suitably simple, to make it as likely as possible that the guess will be correct. I have about 2 weeks of heavy work just now, but I will get to it soon afterwards.
Comment From: RedHeadphone
Not sure if it is right, but my idea for the fix would be as below
- checking if the first element has 'May' in the date string
- if so then get the date format as usual. Suppose the date format we got is "%B-%d-%Y"
- then try converting all dates with "%B-%d-%Y" and "%b-%d-%Y" (replace capital B with small b) . If one of them doesn't work, continue with the other format.
Something like below:
if "May" in data[0]:
expected_format = get_format(data[0]) # "%B-%d-%Y"
formats_to_try = [expected_format, expected_format.replace('%B', '%b')]
for fmt in formats_to_try:
try:
return pd.to_datetime(*same_args, format=fmt)
except ValueError:
continue
raise ValueError("Mixed date formats")
Comment From: attack68
Haven't looked at the code here but if it is not favoured to special branch "May" in the code for a second pass which I can understand, is it easier to detect the cause of the raise and branch to better, more informed error messages? In this case just describing what the likely problem is around "May" and how a user can quickly fix it?
A targeted error message might also allow a user to monkey patch any dynamic code.