Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import re
import pandas as pd
rex = re.compile("foo",flags=re.IGNORECASE)
l = ["Foo","foo","Bar","_Foo_","_foo_"]
s = pd.Series(l, index=l)
pd.DataFrame({
"Python Match":[bool(rex.match(x)) for x in l],
"Python Search":[bool(rex.search(x)) for x in l],
"Match Flat":s.str.match(rex),
"Match Case":s.str.match(rex,case=False),
"Contains Flat":s.str.contains(rex),
# "Contains Case":s.str.contains(rex,case=False),
}, index=l)
Issue Description
(0) if you uncomment the last line (Contains Case), you get an error
ValueError: cannot process flags argument with a compiled pattern
this looks like a bug in its own right, but I am not filing a separate issue because I think both bugs reside in the same code and a single patch will fix both.
(1) the code supplied returns
       Python Match  Python Search  Match Flat  Match Case  Contains Flat
Foo            True           True       False        True           True
foo            True           True        True        True           True
Bar           False          False       False       False          False
_Foo_         False           True       False       False           True
_foo_         False           True       False       False           True
Since I already passed re.IGNORECASE to re.compile, I expected Python Match, Match Flat and Match Case to be identical.
This is not the case.
Note that Contains Flat and Python Search are identical (good!)
Expected Behavior
I expected the columns Python Match, Match Flat and Match Case to be identical, because the compiled regexp should override the case argument.
If you argue that the case argument to str.match overrides the flags argument to re.compile,
then the behavior of str.contains is inconsistent, because there the default case does not override the flags argument to re.compile.
Thus, the default case behaviors of str.contains and str.match are inconsistent with each other.
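For reference, the plain-Python baseline that the report compares against can be checked with the standard library alone. This is a minimal sketch reusing the pattern and strings from the reproducible example (the variable names `values` and `ignore_case` are introduced here for illustration). It shows that re.IGNORECASE is stored on the compiled pattern, so no extra argument is needed for rex.match or rex.search to be case-insensitive.

```python
import re

rex = re.compile("foo", flags=re.IGNORECASE)
values = ["Foo", "foo", "Bar", "_Foo_", "_foo_"]

# The flag is stored on the compiled pattern object itself.
ignore_case = bool(rex.flags & re.IGNORECASE)

# match() only matches at the start of the string;
# search() scans the whole string for a match.
python_match = [bool(rex.match(x)) for x in values]
python_search = [bool(rex.search(x)) for x in values]

print(ignore_case)
print(python_match)
print(python_search)
```

These two lists correspond to the Python Match and Python Search columns in the table above.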
Installed Versions
Comment From: sam-s
this is a regression in pandas-2.3.2.
with 2.3.1, I get
(0) same error with compiled regexp for case=True in both str.match and str.contains
(1) Python Match is identical to Match Flat
Comment From: rhshadrach
Thanks for the report!
> (0) if you uncomment the last line (Contains Case), you get an error
This does not appear to be the case on main.
I am indeed seeing various places where the flags in a compiled regex are being ignored. We'll need a solid proposal before we can move forward here. Some options I see:
- Document that the flags in a compiled regex are ignored; ignore everywhere.
- Raise an error when the flags in a compiled regex are incompatible with the case argument supplied.
- Choose whether the flags in a compiled regex take priority over the case argument, or vice-versa.
- Change the default of case to lib.no_default, have it take precedence when provided, have it be the compiled regex flags when not provided, and have it be True when pat is a string.
It'd also be helpful to know just how many methods this impacts and what the current differences are in behavior.
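The last option in the list can be sketched in a few lines of plain Python. This is only an illustration of the precedence rule as described, not pandas code: the helper `resolve_case` and the sentinel `_NO_DEFAULT` (standing in for lib.no_default) are hypothetical names, and a real implementation would need to live inside the string accessor methods.

```python
import re

# Hypothetical sentinel standing in for pandas' lib.no_default.
_NO_DEFAULT = object()


def resolve_case(pat, case=_NO_DEFAULT):
    """Return the effective case sensitivity for a pattern (hypothetical helper)."""
    if case is not _NO_DEFAULT:
        # An explicitly provided case argument takes precedence.
        return bool(case)
    if isinstance(pat, re.Pattern):
        # No case given: defer to the flags stored on the compiled regex.
        return not (pat.flags & re.IGNORECASE)
    # Plain string pattern: keep the existing default of case=True.
    return True


rex = re.compile("foo", flags=re.IGNORECASE)
print(resolve_case(rex))             # flags decide
print(resolve_case(rex, case=True))  # explicit argument wins
print(resolve_case("foo"))           # string pattern default
```

Under this rule a compiled pattern never raises because of the case argument; the open design question is only which source wins, and here the explicit argument does.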
Comment From: sam-s
> - Change the default of case to lib.no_default, have it take precedence when provided, have it be the compiled regex flags when not provided, and have it be True when pat is a string.
I like this approach: more control for the user, fewer exceptions.
Comment From: jorisvandenbossche
> We'll need a solid proposal before we can move forward here
Separately, we should also fix the regression, which we should ideally do for 2.3.x (because that is actually giving wrong results where it was correct before, AFAIU)
> > (0) if you uncomment the last line (Contains Case), you get an error
>
> This does not appear to be the case on main.
It seems that this is because main uses the new string dtype by default. If you run the example on main but explicitly create an object-dtype Series (s = pd.Series(l, index=l, dtype=object)), then the contains case still gives an error.
So it is the pyarrow-based string dtype that does not give this error, vs the python-based one:
>>> s = pd.Series(l, index=l, dtype=pd.StringDtype("pyarrow"))
>>> s.str.contains(rex,case=False)
Foo True
foo True
Bar False
_Foo_ True
_foo_ True
dtype: boolean
>>> s = pd.Series(l, index=l, dtype=pd.StringDtype("python"))
>>> s.str.contains(rex,case=False)
...
ValueError: cannot process flags argument with a compiled pattern
We should of course fix this inconsistency, which requires deciding which behaviour we think is best (i.e. your list of options)
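To compare the three storage variants side by side on a given installation, something like the following could be used. It is a rough sketch: the helper `try_contains` and the `dtype_factories` mapping are names made up for this example, and the outcome depends on the installed pandas version and on whether pyarrow is available, so no expected output is shown.

```python
import re

import pandas as pd

rex = re.compile("foo", flags=re.IGNORECASE)
values = ["Foo", "foo", "Bar", "_Foo_", "_foo_"]

# Dtypes are built lazily so that a missing pyarrow install is reported
# for that entry only instead of aborting the whole comparison.
dtype_factories = {
    "object": lambda: object,
    "string[python]": lambda: pd.StringDtype("python"),
    "string[pyarrow]": lambda: pd.StringDtype("pyarrow"),
}


def try_contains(make_dtype):
    """Run str.contains(rex, case=False); return the values or the error text."""
    try:
        s = pd.Series(values, index=values, dtype=make_dtype())
        return s.str.contains(rex, case=False).tolist()
    except Exception as exc:  # broad on purpose: this only surveys behaviour
        return f"{type(exc).__name__}: {exc}"


results = {name: try_contains(factory) for name, factory in dtype_factories.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```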
Comment From: jorisvandenbossche
I opened a PR for just fixing the regression: https://github.com/pandas-dev/pandas/pull/62251 (it does not yet address the inconsistencies between the pyarrow-based and python-based string dtypes in how they handle a compiled regex combined with the case keyword)