Pandas API/DEPR: dtype=(str|bytes) interpret as pyarrow

In 2.0 we made a lot of progress in ensuring that passing dtype=foo or calling .astype(foo) actually returns the requested dtype rather than silently giving something else. bytes and str are the main remaining cases where we silently do something else (cast to object, but not as consistently as intended).

Instead, let's interpret dtype=str as string[pyarrow] and dtype=bytes as bytes[pyarrow] (after a deprecation cycle, and once we require pyarrow).
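To make the proposal concrete, here is a minimal sketch contrasting what dtype=str returns today with an explicit request for the pyarrow-backed dtype. The sample data is made up for illustration, and the dtype printed for astype(str) depends on the installed pandas version, so no output is asserted here.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# Per the discussion above, pandas 2.x casts to object here;
# other versions may report a different dtype.
as_str = ser.astype(str)
print(as_str.dtype)

# Explicitly requesting the pyarrow-backed string dtype (requires pyarrow).
try:
    as_arrow = ser.astype("string[pyarrow]")
    print(as_arrow.dtype)
except ImportError:
    as_arrow = None
    print("pyarrow is not installed")
```

Under the proposal, the first call would eventually behave like the second.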

Comment From: Dr-Irv

Another option is to interpret dtype=str as dtype=pd.StringDtype(). I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?

Comment From: simonjayhawkins

I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?

I'm also not clear (or up to date) on what the thinking is here (hence my comment in https://github.com/pandas-dev/pandas/issues/52509#issuecomment-1506617432).

Comment From: jbrockmendel

A simple case I ran into today where string[pyarrow] outperforms the other two by about 10x:

import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
ser = pd.Series(data * 10**6)  # object dtype
ser2 = ser.astype("string")  # pd.StringDtype()
ser3 = ser.astype("string[pyarrow]")  # pyarrow-backed strings

%timeit ser == "foo"
222 ms ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser2 == "foo"
249 ms ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser3 == "foo"
24.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
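For anyone who wants to rerun this comparison outside IPython, the following is a rough sketch using the standard-library timeit module in place of %timeit. The smaller series size, the loop count, and the candidates dict are choices made for this sketch, not part of the original measurement, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
# Smaller than the original 10**6 repeats so the sketch runs quickly.
ser = pd.Series(data * 10**4, dtype=object)

candidates = {"object": ser, "string": ser.astype("string")}
try:
    candidates["string[pyarrow]"] = ser.astype("string[pyarrow]")
except ImportError:
    pass  # pyarrow not installed; skip that candidate

loops = 20
for name, s in candidates.items():
    # Bind s as a default argument so each lambda times its own series.
    total = timeit.timeit(lambda s=s: s == "foo", number=loops)
    print(f"{name}: {total / loops * 1e3:.3f} ms per loop")
```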

Comment From: jbrockmendel

@datapythonista both here and in #52711 a request has been made to explain how great pyarrow string dtypes are. Want to sing their praises?

Comment From: jbrockmendel

Looking at #35864, it looks like "zfill" isn't implemented in arrow yet, so it is slightly slower there, but the other string methods mentioned later in that thread outperform by quite a bit:

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
ser1 = non_padded.astype(str)
ser2 = non_padded.astype("string")
ser3 = non_padded.astype("string[pyarrow]")

%timeit ser.str.zfill(5)  # ser is each of ser1, ser2, ser3 in turn
1.67 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
1.78 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
2.16 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- ser3

%timeit ser.str.upper()  # ser is each of ser1, ser2, ser3 in turn
1.97 ms ± 97.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
2.02 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
163 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <- ser3

Comment From: jbrockmendel

xref #49398

Comment From: jorisvandenbossche

For the str part of this issue:

Another option is to interpret dtype=str as dtype=pd.StringDtype()

With PDEP-14 accepted, the idea is that dtype=str will be an alias for the new future default string dtype (i.e. pd.StringDtype(na_value=np.nan)).
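As a rough illustration of the dtype named here, the sketch below constructs pd.StringDtype(na_value=np.nan) directly. It assumes a pandas version recent enough to accept the na_value keyword and falls back to doing nothing otherwise; the sample values are made up.

```python
import numpy as np
import pandas as pd

try:
    # The NaN-backed string dtype described in PDEP-14.
    dtype = pd.StringDtype(na_value=np.nan)
except (TypeError, ImportError):
    dtype = None  # assumption: this pandas does not support na_value
    print("this pandas version does not accept na_value")

if dtype is not None:
    ser = pd.Series(["a", None], dtype=dtype)
    print(ser.dtype)
    print(ser.isna().tolist())
```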

Comment From: jbrockmendel

Closing as completed.
