In 2.0 we made a lot of progress in ensuring that passing dtype=foo or calling .astype(foo) actually returns the requested dtype rather than silently giving something else. bytes and str are the main remaining cases where we silently do something else (cast to object, though not as consistently as intended).
Instead, let's interpret dtype=str as string[pyarrow] and dtype=bytes as bytes[pyarrow] (with a deprecation cycle, and once we require pyarrow).
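To make the behaviour under discussion concrete, here is a minimal sketch that only reports which dtype comes back from dtype=str and .astype(str). The helper name describe_str_cast is hypothetical (it is not pandas API), and the result is an assumption about the installed pandas version: object is expected on 2.x, and a dedicated string dtype wherever that has become the default.

```python
import pandas as pd


def describe_str_cast(values):
    """Return the dtype names produced by dtype=str and by .astype(str).

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Request str at construction time.
    via_constructor = pd.Series(values, dtype=str)
    # Request str by casting an already constructed Series.
    via_astype = pd.Series(values).astype(str)
    return str(via_constructor.dtype), str(via_astype.dtype)


if __name__ == "__main__":
    # The printed names depend on the pandas version in use.
    print(describe_str_cast(["foo", "bar"]))
```

If either name printed is object, the request for str was silently mapped to something else, which is the case this issue is about.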
Comment From: Dr-Irv
Another option is to interpret dtype=str as dtype=pd.StringDtype(). I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?
Comment From: simonjayhawkins
> I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?
I'm also not clear (up-to-date) on what the thinking is here. (hence my comment in https://github.com/pandas-dev/pandas/issues/52509#issuecomment-1506617432)
Comment From: jbrockmendel
A simple case I ran into today where string[pyarrow] outperforms by about 10x:
import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
ser = pd.Series(data * 10**6)
ser2 = ser.astype("string")
ser3 = ser.astype("string[pyarrow]")
%timeit ser == "foo"
222 ms ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser2 == "foo"
249 ms ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser3 == "foo"
24.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
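For anyone wanting to rerun this comparison outside IPython, a rough sketch using the standard-library timeit module might look like the following. The function names time_equality and compare_backends are hypothetical, the input is smaller than in the timings above, and the pyarrow-backed case is skipped on the assumption that pandas raises ImportError when pyarrow is not installed. Absolute numbers will differ by machine and version.

```python
import timeit

import pandas as pd


def time_equality(ser, repeat=3):
    """Best wall time in seconds for evaluating `ser == "foo"` once."""
    return min(timeit.repeat(lambda: ser == "foo", number=1, repeat=repeat))


def compare_backends(n_repeats=10**5):
    """Time the same comparison on differently typed copies of one Series.

    Hypothetical helper for illustration; returns {label: seconds}.
    """
    data = ["foo", "bar", "baz", "pow", "zap"]
    ser = pd.Series(data * n_repeats)

    results = {"default": time_equality(ser)}
    results["string"] = time_equality(ser.astype("string"))
    try:
        results["string[pyarrow]"] = time_equality(ser.astype("string[pyarrow]"))
    except ImportError:
        # pyarrow is an optional dependency; skip this case if it is missing.
        pass
    return results


if __name__ == "__main__":
    for label, seconds in compare_backends().items():
        print(f"{label}: {seconds * 1000:.2f} ms")
```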
Comment From: jbrockmendel
@datapythonista both here and in #52711 a request has been made to explain how great pyarrow string dtypes are. Want to sing their praises?
Comment From: jbrockmendel
Looking at #35864, it looks like "zfill" isn't implemented in arrow yet, so it is slightly slower, but other string methods mentioned later in the thread are considerably faster:
import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
ser1 = non_padded.astype(str)
ser2 = non_padded.astype("string")
ser3 = non_padded.astype("string[pyarrow]")
%timeit ser.str.zfill(5)
1.67 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # <- ser1
1.78 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # <- ser2
2.16 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # <- ser3
%timeit ser.str.upper()
1.97 ms ± 97.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # <- ser1
2.02 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # <- ser2
163 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) # <- ser3
Comment From: jbrockmendel
xref #49398
Comment From: jorisvandenbossche
For the str part of this issue:
> Another option is to interpret dtype=str as dtype=pd.StringDtype()
With PDEP-14 accepted, the idea is that dtype=str will be an alias for the new future default string dtype (i.e. pd.StringDtype(na_value=np.nan)).
Comment From: jbrockmendel
Closing as completed.