Since we won't get it for free via #61642, it would be good to add a Polars engine manually, so pandas users can benefit from state-of-the-art speed while reading CSVs.
@pandas-dev/pandas-core any objection?
Comment From: mroeschke
- How does it compare performance-wise to the PyArrow CSV parser?
- Compared to the PyArrow CSV reader, I'm less eager to add a Polars engine since it already has a `to_pandas` method, and pandas `read_csv` doesn't have a use for the intermediate Polars data structures, unlike PyArrow (i.e. `ArrowExtensionArray` using `pyarrow.ChunkedArray`s when `dtype_backend="pyarrow"`) — illustrated below.
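A minimal illustration of what "using the intermediate PyArrow data structures" means in practice (`data.csv` is a placeholder file):

```python
import pandas as pd

# With the PyArrow engine and dtype_backend="pyarrow", the
# pyarrow.ChunkedArrays produced by the parser are kept as-is,
# wrapped in ArrowExtensionArray-backed columns.
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow], ...
```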
Comment From: datapythonista
- Last time I checked, it took one third of the time compared to pandas with PyArrow.
- I'm not sure I understand what the problem is. Polars will return a Polars DataFrame that will be converted to a pandas DataFrame backed by ArrowExtensionArray and PyArrow, no? Do you mind expanding on what the issue is?
Comment From: jbrockmendel
No objection in principle.
I am curious if we can learn from what they've done to improve our engine.
Would the implementation be roughly `return pl.read_csv(...).to_pandas()`, or would the kwargs/outputs need some massaging like with the pyarrow engine?
Will the tests Just Work with this engine, or will they need a bunch of `if engine == "polars": ...` tweaks?
Comment From: datapythonista
I didn't check the mapping of all parameters in detail, but I'd use the lazy version with at least a `.select()` and a `.filter()` to support column pruning and predicate pushdown. So not a one-liner, but my expectation is that it's a simple wrapper (see the sketch below).
I'm hoping tests will pass. I guess not all kwargs may be supported as with pyarrow, so maybe something custom is needed.
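A rough sketch of what such a wrapper could look like; the function name, the parameter mapping, and the file path are hypothetical, and real kwarg handling would be considerably more involved:

```python
import polars as pl

def _read_csv_polars(path, usecols=None, dtype_backend="numpy_nullable"):
    # Hypothetical sketch: scan lazily so Polars can prune columns (and,
    # in principle, push filters down) before materializing anything.
    lf = pl.scan_csv(path)
    if usecols is not None:
        lf = lf.select(usecols)  # column pruning via .select()
    # A .filter(...) here would enable predicate pushdown as well.
    df = lf.collect()
    # Convert to pandas; keep Arrow-backed columns when requested.
    return df.to_pandas(
        use_pyarrow_extension_array=(dtype_backend == "pyarrow")
    )
```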
Comment From: datapythonista
> I am curious if we can learn from what they've done to improve our engine.
I checked some time ago and wrote about some of my findings in this blog post. It also contains benchmarks of different CSV readers: https://datapythonista.me/blog/how-fast-can-we-process-a-csv-file
I can tell you that Ritchie spent a huge amount of time optimizing the Polars reader. But if you have time and interest, improving our C engine sounds great.
Comment From: mroeschke
> Do you mind expanding on what the issue is?
More just that, as @jbrockmendel mentioned, `pd.read_csv(..., engine="polars")` would just be syntactic sugar for `pl.read_csv(...).to_pandas()`, correct?

While at least with `pd.read_csv(..., engine="pyarrow", dtype_backend="pyarrow")`, it's not just syntactic sugar, as we're still holding/using PyArrow objects after the reading of the CSV, i.e. there is more "use" for PyArrow here.

EDIT: I see you mentioned in https://github.com/pandas-dev/pandas/issues/61813#issuecomment-3049520253 that it might not just be a one-liner but rather fitting the right lazy APIs to our `read_csv` signature, so I would be more positive about including this now, as there's more "art" to it than just being a `pl.read_csv(...).to_pandas()` passthrough.
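To make the contrast concrete, a minimal sketch (`data.csv` is a placeholder):

```python
import polars as pl

# Pure passthrough: eager read, then convert to NumPy-backed pandas.
df1 = pl.read_csv("data.csv").to_pandas()

# The Arrow buffers Polars holds can also back the pandas columns
# directly, analogous to dtype_backend="pyarrow".
df2 = pl.read_csv("data.csv").to_pandas(use_pyarrow_extension_array=True)
```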
Comment From: jbrockmendel
I read the blog post and got curious about when `engine="python"` is necessary. Patching `read_csv` to change `engine in {"python", "python-fwf"}` to `"c"` breaks 112 tests (I initially counted 26, but had applied the patch incorrectly; my early guess that most failures were about string inference was also off). `on_bad_lines` being callable, regex separators, and `skipfooter` support are the main ones. It may be feasible to just get rid of the python engine.
Next up: patching to always use the pyarrow engine and seeing if it breaks the world. 3137 failures, mostly, it seems, about unsupported keywords like `low_memory` and `thousands`.
Comment From: WillAyd
Nice blog post. Those are some impressive benchmarks on the polars side.
Do you think it matters at all that Polars uses string views for storage whereas we are going to default to large strings? I think that gets doubly confusing when you try to mix the pyarrow backend with the polars engine, as I'm unsure what data type a user would expect in that case (probably `string_view`?)
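For reference, the three Arrow string layouts in play here (a minimal illustration; `pa.string_view()` requires a recent pyarrow):

```python
import pyarrow as pa

print(pa.string())        # variable-length strings with 32-bit offsets
print(pa.large_string())  # 64-bit offsets; the pandas default mentioned above
print(pa.string_view())   # view layout, which Polars uses for string storage
```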