Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import re
import pandas as pd

rex = re.compile("foo", flags=re.IGNORECASE)
l = ["Foo", "foo", "Bar", "_Foo_", "_foo_"]
s = pd.Series(l, index=l)
pd.DataFrame({
    "Python Match": [bool(rex.match(x)) for x in l],
    "Python Search": [bool(rex.search(x)) for x in l],
    "Match Flat": s.str.match(rex),
    "Match Case": s.str.match(rex, case=False),
    "Contains Flat": s.str.contains(rex),
    # "Contains Case": s.str.contains(rex, case=False),
}, index=l)
Issue Description
(0) If you uncomment the last line (`Contains Case`), you get an error:
ValueError: cannot process flags argument with a compiled pattern
This looks like a bug in its own right, but I am not filing a separate issue because I think both bugs reside in the same code and a single patch will fix both.
(1) The code supplied returns:
       Python Match  Python Search  Match Flat  Match Case  Contains Flat
Foo            True           True       False        True           True
foo            True           True        True        True           True
Bar           False          False       False       False          False
_Foo_         False           True       False       False           True
_foo_         False           True       False       False           True
Since I already passed `re.IGNORECASE` to `re.compile`, I expected `Python Match`, `Match Flat` and `Match Case` to be identical.
This is not the case.
Note that `Contains Flat` and `Python Search` are identical (good!).
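For reference, the baseline that the pandas columns are being compared against can be reproduced with the standard library alone. This is only a minimal sketch of the `re` semantics assumed in the report (`match` anchors at the start of the string, `search` scans anywhere, and flags compiled into the pattern apply to both); the variable name `values` is introduced here for illustration.

```python
import re

# Same pattern and inputs as in the reproducible example.
rex = re.compile("foo", flags=re.IGNORECASE)
values = ["Foo", "foo", "Bar", "_Foo_", "_foo_"]

# match() only succeeds at position 0; search() succeeds anywhere.
# Both honor the IGNORECASE flag baked into the compiled pattern.
python_match = [bool(rex.match(x)) for x in values]
python_search = [bool(rex.search(x)) for x in values]

print(python_match)   # [True, True, False, False, False]
print(python_search)  # [True, True, False, True, True]
```

These two lists correspond to the `Python Match` and `Python Search` columns in the table above.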
Expected Behavior
I expected the columns `Python Match`, `Match Flat` and `Match Case` to be identical, because the compiled regexp should override the `case` argument.
If you argue that the `case` argument to `str.match` overrides the `flags` argument to `re.compile`, then the behavior of `str.contains` is inconsistent, because there the default `case` does not override the `flags` argument to `re.compile`.
Thus, the default `case` behaviors of `str.contains` and `str.match` are inconsistent with each other.
Installed Versions
Comment From: sam-s
This is a regression in pandas-2.3.2.
With 2.3.1, I get:
(0) the same error with a compiled regexp for `case=True` in both `str.match` and `str.contains`
(1) `Python Match` is identical to `Match Flat`
Comment From: rhshadrach
Thanks for the report!
> (0) if you uncomment the last line (Contains Case), you get an error
This does not appear to be the case on main.
I am indeed seeing various places where the flags in a compiled regex are being ignored. We'll need a solid proposal before we can move forward here. Some options I see:
- Document that the flags in a compiled regex are ignored; ignore everywhere.
- Raise an error when the flags in a compiled regex are incompatible with the `case` argument supplied.
- Choose whether the flags in a compiled regex take priority over the `case` argument, or vice-versa.
- Change the default of `case` to `lib.no_default`, have it take precedence when provided, have it be the compiled regex flags when not provided, and have it be `True` when `pat` is a string.
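The last option in the list could be sketched roughly as follows. This is not pandas code: `_NO_DEFAULT` is a hypothetical stand-in for the `lib.no_default` sentinel, and `resolve_case` is a made-up helper name used only to illustrate the proposed precedence rules.

```python
import re

# Hypothetical stand-in for pandas' lib.no_default sentinel.
_NO_DEFAULT = object()


def resolve_case(pat, case=_NO_DEFAULT):
    """Return True if matching should be case sensitive (illustrative only)."""
    if case is not _NO_DEFAULT:
        # An explicitly provided `case` takes precedence.
        return bool(case)
    if isinstance(pat, re.Pattern):
        # Not provided: fall back to the flags of the compiled regex.
        return not (pat.flags & re.IGNORECASE)
    # Not provided and `pat` is a plain string: case sensitive by default.
    return True


rex = re.compile("foo", flags=re.IGNORECASE)
print(resolve_case(rex))                # False
print(resolve_case(rex, case=True))     # True
print(resolve_case("foo"))              # True
print(resolve_case("foo", case=False))  # False
```

Under this rule no combination of arguments raises; the open design question is only whether an explicit `case` should be allowed to contradict the compiled flags.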
It'd also be helpful to know just how many methods this impacts and what the current differences are in behavior.
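One way to start such a survey is a small script that calls several `Series.str` methods with the same compiled pattern and records either the result or the exception. This is only a sketch: the list of methods is an assumption about which ones are worth checking, `try_call` and `survey` are names made up here, and the output depends on the pandas version and dtype, so no expected results are shown.

```python
import re

import pandas as pd

rex = re.compile("foo", flags=re.IGNORECASE)
values = ["Foo", "foo", "Bar", "_Foo_", "_foo_"]
# Object dtype is used here; repeat with other dtypes to compare.
s = pd.Series(values, index=values, dtype=object)


def try_call(method_name):
    """Call s.str.<method_name>(rex) and return the result or the error text."""
    try:
        return list(getattr(s.str, method_name)(rex))
    except Exception as exc:  # record any failure instead of stopping
        return f"{type(exc).__name__}: {exc}"


# Assumed selection of regex-accepting methods; extend as needed.
methods = ("match", "fullmatch", "contains", "count", "findall")
survey = {name: try_call(name) for name in methods}

for name, result in survey.items():
    print(f"{name:>10}: {result}")
```

Comparing each row against the plain `re` result for the same pattern would show where the compiled flags are honored and where they are dropped.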
Comment From: sam-s
> - Change the default of `case` to `lib.no_default`, have it take precedence when provided, have it be the compiled regex flags when not provided, and have it be `True` when `pat` is a string.
I like this approach: more control for the user, fewer exceptions.
Comment From: jorisvandenbossche
> We'll need a solid proposal before we can move forward here

Separately, we should also fix the regression, ideally for 2.3.x (because that is actually giving wrong results where it was correct before, AFAIU).

> > (0) if you uncomment the last line (Contains Case), you get an error
>
> This does not appear to be the case on main.

It seems that this is because main uses the new string dtype. If you run the example on main but explicitly create an object-dtype Series (`s = pd.Series(l, index=l, dtype=object)`), then the contains case still gives an error.
So it is the pyarrow-based string dtype that does not give this error, vs the python-based one:
>>> s = pd.Series(l, index=l, dtype=pd.StringDtype("pyarrow"))
>>> s.str.contains(rex,case=False)
Foo True
foo True
Bar False
_Foo_ True
_foo_ True
dtype: boolean
>>> s = pd.Series(l, index=l, dtype=pd.StringDtype("python"))
>>> s.str.contains(rex,case=False)
...
ValueError: cannot process flags argument with a compiled pattern
We should of course fix this inconsistency, which requires deciding which behaviour we think is best (i.e. your list of options).
Comment From: jorisvandenbossche
I opened a PR that only fixes the regression: https://github.com/pandas-dev/pandas/pull/62251 (it does not yet address the inconsistencies between the pyarrow-based and python-based string dtypes in how they handle a compiled regex combined with the `case` keyword).