Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Enable pandas 3.0 options
pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True

pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
ser = pd.Series(['One Two Three', 'Foo Bar Baz'])

repl = r"\g<three> \g<two> \g<one>"
ser.str.replace(pat, repl, regex=True)

repl = r"\g<2>0"  # Note that this should be different from r"\20" according to the `re.sub` docs.
ser.str.replace(pat, repl, regex=True)

repl = r"\20"  # Should throw error since group 20 doesn't exist according to the `re.sub` docs.
ser.str.replace(pat, repl, regex=True)

Issue Description

The docs for .str.replace imply that, when regex=True, the repl string can be anything that re.sub supports. I think this holds for the current default behaviour (pd.options.future.infer_string = False), but it does not hold when using the pyarrow string dtype (pd.options.future.infer_string = True).

Specifically, the first two cases above raise the following error, but should work:

ArrowInvalid: Invalid replacement string: Rewrite schema error: '\' must be followed by a digit or '\'.

The last case above should raise an error, since the pattern has no group 20, but it is instead incorrectly treated as \g<2>0 (group 2 followed by a literal 0).

I suspect that this is really an issue in the PyArrow backend rather than in pandas itself, since the traceback shows repl being passed unchanged to pc.replace_substring_regex.
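
As a possible stopgap, the \g<name> and \g<N> references could be rewritten into the single-digit \N form that the error message suggests the backend accepts. The helper below is a hypothetical sketch, not part of pandas or PyArrow: the name to_single_digit_refs is made up for illustration, it assumes only group references 0 to 9 can be expressed, and it does not handle escaped backslashes in repl or bare multi-digit references such as \20.

```python
import re

# Matches re.sub-style references of the form \g<name> or \g<number>.
_GROUP_REF = re.compile(r"\\g<(\w+)>")


def to_single_digit_refs(pat: str, repl: str) -> str:
    """Rewrite \\g<...> references in repl as single-digit \\N references.

    Hypothetical helper: assumes the target engine only understands \\0..\\9.
    """
    compiled = re.compile(pat)

    def _convert(match: re.Match) -> str:
        ref = match.group(1)
        # Numeric references are used as-is; names are looked up in the pattern.
        index = int(ref) if ref.isdecimal() else compiled.groupindex.get(ref)
        if index is None or index > min(9, compiled.groups):
            raise ValueError(
                f"cannot express group reference {ref!r} as a single digit"
            )
        return "\\" + str(index)

    # A callable replacement is inserted literally, so the backslash survives.
    return _GROUP_REF.sub(_convert, repl)


pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
converted_named = to_single_digit_refs(pat, r"\g<three> \g<two> \g<one>")
converted_numbered = to_single_digit_refs(pat, r"\g<2>0")
```

The converted strings could then be passed as repl when the pyarrow string dtype is active. Note that the second result only means "group 2 followed by 0" under the single-digit interpretation; re.sub itself would read it as group 20.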

Expected Behavior

The behaviour should match that of pd.options.future.infer_string = False, which in turn matches re.sub.
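
For reference, this is what plain re.sub does with the three replacement strings on a single value. It is only a sketch of the expected results using the standard library; the variable names are mine and nothing here touches pandas or PyArrow.

```python
import re

pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
text = "One Two Three"

# Case 1: named group references reorder the words.
named = re.sub(pat, r"\g<three> \g<two> \g<one>", text)  # "Three Two One"

# Case 2: \g<2>0 means group 2 followed by a literal "0".
numbered = re.sub(pat, r"\g<2>0", text)  # "Two0"

# Case 3: \20 means group 20, which does not exist, so re.sub raises.
try:
    re.sub(pat, r"\20", text)
    bad_ref_error = None
except re.error as exc:
    bad_ref_error = str(exc)
```

With infer_string = True, I would expect .str.replace to give the same results per element, including the error in the third case.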

Installed Versions

INSTALLED VERSIONS
------------------
commit                : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python                : 3.11.7.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.22621
machine               : AMD64
processor             : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : English_United Kingdom.1252

pandas                : 2.2.0
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 69.0.3
pip                   : 24.0
Cython                : None
pytest                : 8.0.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.21.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : 1.3.7
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.2.0
gcsfs                 : None
matplotlib            : 3.8.2
numba                 : 0.59.0
numexpr               : 2.9.0
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None

Comment From: jamesmyatt

Full trace for first case:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[47], line 6
      3 ser = pd.Series(['One Two Three', 'Foo Bar Baz'])
      5 repl = r"\g<three> \g<two> \g<one>"
----> 6 ser.str.replace(pat, repl, regex=True)
      8 repl = r"\g<2>0"  # Note that this is different from r"\20".
      9 ser.str.replace(pat, repl, regex=True)

File ...\Lib\site-packages\pandas\core\strings\accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File ...\Lib\site-packages\pandas\core\strings\accessor.py:1567, in StringMethods.replace(self, pat, repl, n, case, flags, regex)
   1564 if case is None:
   1565     case = True
-> 1567 result = self._data.array._str_replace(
   1568     pat, repl, n=n, case=case, flags=flags, regex=regex
   1569 )
   1570 return self._wrap_result(result)

File ...\Lib\site-packages\pandas\core\arrays\string_arrow.py:417, in ArrowStringArray._str_replace(self, pat, repl, n, case, flags, regex)
    414     return super()._str_replace(pat, repl, n, case, flags, regex)
    416 func = pc.replace_substring_regex if regex else pc.replace_substring
--> 417 result = func(self._pa_array, pattern=pat, replacement=repl, max_replacements=n)
    418 return type(self)(result)

File ...\Lib\site-packages\pyarrow\compute.py:263, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    261 if args and isinstance(args[0], Expression):
    262     return Expression._call(func_name, list(args), options)
--> 263 return func.call(args, options, memory_pool)

File ...\Lib\site-packages\pyarrow\_compute.pyx:385, in pyarrow._compute.Function.call()

File ...\Lib\site-packages\pyarrow\error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ...\Lib\site-packages\pyarrow\error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Invalid replacement string: Rewrite schema error: '\' must be followed by a digit or '\'.

Comment From: jamesmyatt

Update: I've edited the issue to reflect that I think the repl string parsing has wider problems than just named groups.