Pandas BUG: incorrect assumption of full-name month format when "May" is initial month

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.to_datetime(['31 May 2023', '1 Apr 2023'])

Issue Description

When the first date entries have "May" as the month, the parser assumes the months are written out as "month as locale’s full name" (format code %B), and then, when it hits an abbreviated month (correct format code %b - abbreviated name), it gives an error:

ValueError: time data "1 Apr 2023" doesn't match format "%d %B %Y", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Expected Behavior

It would be very good if the parser remained agnostic as to whether the month specifier was abbreviated until after it had seen a month other than "May".

Installed Versions

INSTALLED VERSIONS ------------------ commit : 2fbabb11db42d3a109cd13e3377327c8c1834213 python : 3.10.12.final.0 python-bits : 64 OS : Darwin OS-release : 23.4.0 Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:11:05 PDT 2024; root:xnu-10063.101.17~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 3.0.0.dev0+736.g2fbabb11db numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 69.2.0 pip : 24.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.23.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pyarrow : None pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None

Comment From: Aloqeely

Thanks for the report! This is a duplicate of #57521 which got closed for "not a bug" But I think we should improve this, I can work on a fix but want to get @rhshadrach's opinion first

Comment From: rhshadrach

Thanks for the ping @Aloqeely. That issue was closed by its author while I sought the opinion of @MarcoGorelli. I'd like to do that here as well!

Comment From: MarcoGorelli

thanks for the ping - I wouldn't consider this a bug, this looks intended

if this was auto-inferred as %b, then someone else would complain when it would raise on "April"

But I think we should improve this

how, what do you suggest?

Comment From: asishm

I'd agree in general with @rhshadrach's comments in the linked issue (specifically around the edge cases where say most of the entries are May but one is April etc.). In the meantime, I think a doc update which contains language from the release notes (see below) would be helpful. Maybe add this specific caveat around inferring May which could lead to either %b or %B in the notes section.

As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

That language is present in the current docs under, but the fact that this param is deprecated doesn't make it very clear as to what is the current behavior.

infer_datetime_formatbool, default False If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x. Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

Comment From: matthew-brett

Just to say - I've read the previous issue - but I do think this is a bug in the guess of the initial date format, unless there is, for some reason, an absolute requirement that the guess is not revisited, and cannot be improved by later entries. I mean, I think it is reasonable to consider a risky guess as a bug. Here, when the first date has "May" as the month, the chances of full-month-format (%B) being correct are - difficult to estimate - it would depend on how frequent full-month and abbreviated-month formats are in the wild. My guess is that abbreviated month formats are more common, so the current guess is more often wrong than right. Which is not to suggest that the default be reversed, but that it is in principle reasonable to consider ways of updating the guess if, as here, there is a high probability that the guess is wrong.

Comment From: matthew-brett

I don't know the code, and if that's an issue - I can take a look. I guess the problem is that there isn't currently a way of updating the guess once an error arises? If not - would that be practical to implement in restricted circumstances with high risk guesses and obvious alternatives, as here?

Comment From: MarcoGorelli

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

Comment From: asishm

A 2-pass approach might work. If the first pass errors out, then re-run with format='mixed' - and should also raise a PerformanceWarning

Comment From: matthew-brett

@MarcoGorelli - thanks - I do not know the Pandas code-base very well, but if no-one else is interested to look at this, I could give it a go.

Comment From: Aloqeely

But I think we should improve this

how, what do you suggest?

Using the first element that doesn't have May as month. Opened #58336

Comment From: rhshadrach

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

Another requirement I'd add is predictable - not only in behavior but performance. It should be easy to describe to a user what pandas is doing to infer, and the procedure should be in the docstring.

Comment From: matthew-brett

Just thinking aloud here, but I wonder whether there is some way of maintaining an "alternative" format that would be triggered by an error. As in:

31-May-2023 -> guessed %B, alternative %b 1-Apr-2023 -> error for %B, switch to alternative %b, recast previous and continue

but

31-May-2023 -> guessed %B, alternative %b 1-January-2023 -> remove alternative as invalid 1-Apr-2023 -> raise usual error

This could also mean automatic correct parsing of the old dd-mm-yyyy issue:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy 13-10-2023 -> error for mm-dd-yyyy, switch to dd-mm-yyyy, recast all previous.

Where:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy 12-13-2023 -> remove alternative as invalid 13-10-2023 -> raise usual error

Of course this would have some parsing-time cost, but this would only be for a few elements, typically, before the alternative can be excluded.

Comment From: Aloqeely

The general "alternative format" idea sounds great but there will be an endless amount of alternative formats, and I'm not sure if it is considered a "simple solution", maybe you can try tweaking with it and see if it's easily doable.

Comment From: matthew-brett

I'll have a look at the code - thanks.

Sketching here, I think it need only catch some common cases, and others could argue for further extensions as they run into them. Then the code might look something like:

...
found_fmt = guess_format(first_entry)
alternatives = _find_alternatives(found_fmt)
...
for current_entry in entries:
    ...
    if alternatives:
        alternatives = _prune_alternatives(current_entry, alternatives)
    try:
        out = parse_entry(current_entry, found_fmt)
    except ValueError as e:
        if alternatives:
            return dp.to_datetime(series, format=alternatives)
        raise e

Comment From: MarcoGorelli

Using the first element that doesn't have May as month. Opened https://github.com/pandas-dev/pandas/pull/58336

thanks - tbh I feel uneasy about special-casing 'may' like this

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

In the end, this "guess datetime format" is just provided for convenience, if you want reliable parsing you should pass a format explicitly

Comment From: Aloqeely

thanks - tbh I feel uneasy about special-casing 'may' like this

I agree, Brett also mentioned (in my PR) that it wouldn't work for other locale's month names

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

I'm a big fan of your suggestion and Brett's suggestion and think we should implement both, I.E. take first 5 non-null elements, for each element see all possible guesses, and take the first format that works for all 5. Is this good?

Now I don't know if we can accomplish this without complicating the code, @matthew-brett are you able to do it? If not, I can just implement the idea of checking the first 5 non-null elements and call it a day

Comment From: asishm

I'm not necessarily a fan of using first K because there's nothing fundamentally different from K=1 (which is today's behavior) and you would still run into issues if "April" pops up on the 6th value instead of within the first K.

Comment From: Aloqeely

What do you suggest?

Comment From: Aloqeely

After having this conversation, one thing is clear to me, not only do we have to describe the procedure pandas does to infer (as Mr. Shadrach suggested), we should also highly encourage the use of the format parameter.

Comment From: matthew-brett

@Aloqeely - thanks for looking at this - yes - I'm happy to take a go at this, and try and come up with something suitably simple, to make it as likely as possible that the guess will be correct. I have about 2 weeks of heavy work just now, but I will get to it soon afterwards.

Comment From: RedHeadphone

Not sure if it is right, but my idea for the fix would be as below

checking if the first element has 'May' in the date string
if so then get the date format as usual. Suppose the date format we got is "%B-%d-%Y"
then try converting all dates with "%B-%d-%Y" and "%b-%d-%Y" (replace capital B with small b) . If one of them doesn't work, continue with the other format.

Something like below:

if "May" in data[0]:
    expected_format = get_format(data[0])  # "%B-%d-%Y"
    formats_to_try = [expected_format, expected_format.replace('%B', '%b')]

    for fmt in formats_to_try:
        try:
            return pd.to_datetime(*same_args, format=fmt)
        except ValueError:
            continue

    raise ValueError("Mixed date formats")

Comment From: attack68

Haven't looked at the code here but if it is not favoured to special branch "May" in the code for a second pass which I can understand, is it easier to detect the cause of the raise and branch to better, more informed error messages? In this case just describing what the likely problem is around "May" and how a user can quickly fix it?

A targeted error message might also allow a user to monkey patch any dynamic code.