Since we won't get it for free via #61642, it would be good to add a Polars engine manually, so pandas users can benefit from state-of-the-art speed while reading CSVs.
@pandas-dev/pandas-core any objection?
Comment From: mroeschke
- How does it compare performance-wise to the PyArrow csv parser?
- Compared to the PyArrow csv reader, I'm less eager to add a Polars engine since it already has a `to_pandas` method, and pandas `read_csv` doesn't have a use for the intermediate Polars data structures unlike PyArrow (i.e. `ArrowExtensionArray` using `pyarrow.ChunkedArray`s when `dtype_backend="pyarrow"`)
Comment From: datapythonista
- Last time I checked, Polars took about one third of the time of pandas with the PyArrow engine.
- Not sure I understand what the problem is. Polars will return a Polars dataframe that will be converted to a pandas dataframe backed by `ArrowExtensionArray` and PyArrow, no? Do you mind expanding on what the issue is? (A sketch of the conversion I mean is below.)
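For concreteness, a minimal sketch of that conversion path (`use_pyarrow_extension_array` is Polars' keyword for Arrow-backed conversion; the in-memory CSV is just for illustration):

```python
import io

import polars as pl

csv = io.BytesIO(b"a,b\n1,x\n2,y\n")
df_pl = pl.read_csv(csv)

# Convert via Arrow so the pandas result is backed by ArrowExtensionArray,
# the same backing pd.read_csv uses with dtype_backend="pyarrow".
pdf = df_pl.to_pandas(use_pyarrow_extension_array=True)
print(pdf.dtypes)  # ArrowDtype columns, e.g. int64[pyarrow]
```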
Comment From: jbrockmendel
No objection in principle.
I am curious if we can learn from what they've done to improve our engine.
Would the implementation be roughly `return pl.read_csv(...).to_pandas()`, or would the kwargs/outputs need some massaging like with the pyarrow engine?

Will the tests Just Work with this engine, or will they need a bunch of `if engine == "polars": ...` tweaks?
Comment From: datapythonista
I didn't check the mapping of all the parameters in detail, but I'd use the lazy version with at least a `.select()` and a `.filter()` to support column pruning and predicate pushdown. So, not a one-liner, but my expectation is that it's a simple wrapper (rough sketch below).
I'm hoping the tests will pass. I guess not all kwargs will be supported, as with pyarrow, so maybe something custom is needed.
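Something like this is what I mean by a simple wrapper (hypothetical function and kwarg mapping; only `usecols` is shown):

```python
import polars as pl

def _read_csv_polars(path, usecols=None, **kwargs):
    # Hypothetical engine entry point: scan lazily so Polars can prune
    # columns at read time; a .filter(...) would slot in here if read_csv
    # ever exposes predicates to push down.
    lf = pl.scan_csv(path)
    if usecols is not None:
        lf = lf.select(usecols)
    return lf.collect().to_pandas()
```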
Comment From: datapythonista
> I am curious if we can learn from what they've done to improve our engine.
I checked some time ago and wrote about some of my findings in this blog post. It also contains benchmarks of different CSV readers: https://datapythonista.me/blog/how-fast-can-we-process-a-csv-file
I can tell you that Ritchie spent a huge amount of time optimizing the Polars reader. But if you have time and interest, improving our C engine sounds great.
Comment From: mroeschke
> Do you mind expanding on what the issue is?
More just that, as @jbrockmendel mentioned, would `pd.read_csv(..., engine="polars")` just be syntactic sugar for `pl.read_csv(...).to_pandas()`, correct?

Whereas with `pd.read_csv(..., engine="pyarrow", dtype_backend="pyarrow")`, it's not just syntactic sugar: we're still holding/using PyArrow objects after the reading of the CSV, i.e. there is more "use" for PyArrow here.
EDIT: I see you mentioned in https://github.com/pandas-dev/pandas/issues/61813#issuecomment-3049520253 that it might not just be a one-liner but rather fitting the right lazy APIs to our `read_csv` signature, so I would be a bit more positive about including this now, as there's more "art" to it than just being a `pl.read_csv(...).to_pandas()` passthrough.
Comment From: jbrockmendel
I read the blog post and got curious about when `engine="python"` is actually necessary.

Patching read_csv to redirect `engine in {"python", "python-fwf"}` to `"c"` breaks 112 tests (my first count of 26, which tentatively pointed at string inference, came from an incorrectly applied patch). `on_bad_lines` being callable, regex separators, and `skipfooter` support are the main ones. It may be feasible to just get rid of the python engine.
Next up: patching to always use the pyarrow engine to see if it breaks the world. Result: 3137 failures, looking mostly like unsupported keywords such as `low_memory` and `thousands`.
Comment From: WillAyd
Nice blog post. Those are some impressive benchmarks on the polars side.
Do you think it matters at all that polars uses string views for storage whereas we are going to default to large strings? I think that gets doubly confusing when you try to mix the pyarrow backend with the polars engine, as I'm unsure what data type a user would expect in that case (probably `string_view`?)
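A quick way to see which Arrow string type actually comes out of a round-trip (illustrative; the exact type depends on the polars/pyarrow versions installed):

```python
import io

import polars as pl

# Inspect the Arrow schema Polars produces for a string column.
tbl = pl.read_csv(io.BytesIO(b"s\nfoo\nbar\n")).to_arrow()
print(tbl.schema.field("s").type)  # large_string or string_view, version-dependent
```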
Comment From: samukweku
Polars read_csv is very fast. Won't it be easier for the user to do a `pl.read_csv` or `pl.scan_csv` followed by `to_pandas`? There is also the maintenance aspect of it. Or maybe mention in the user guide that there are faster CSV readers that the user may access instead. Just an opinion.
Comment From: datapythonista
> Won't it be easier for the user to do a `pl.read_csv` or `pl.scan_csv` followed by `to_pandas`?
Easier for us, but not easier for users, in my opinion. I don't disagree with you, but in many parts of the pandas API we provide syntactic sugar to make user code look very simple and compact, for example allowing URLs or compressed files when reading. My preferred option would be the PR referenced in the description, but since there is no consensus for that, I think providing Polars in the same way as we provide PyArrow is fair. Otherwise we are encouraging users to use a much slower reader.
Comment From: kuri-menu
Most people won't be reading amounts of data large enough to require Polars' engine, and the current approach will be sufficient. If you want to use Polars, just read with `pl.read_csv` and then call `to_pandas`. I also tried reading 2,000 CSV files (100 million records in total) into a pandas DataFrame, and pyarrow was faster than Polars.
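For reference, my comparison was along these lines (the glob pattern is a placeholder; timings will vary with data shape and library versions):

```python
import glob
import time

import pandas as pd
import polars as pl

files = glob.glob("data/*.csv")  # placeholder for the ~2,000 files

t0 = time.perf_counter()
df_pa = pd.concat((pd.read_csv(f, engine="pyarrow") for f in files), ignore_index=True)
t1 = time.perf_counter()
df_pl = pl.concat([pl.read_csv(f) for f in files]).to_pandas()
t2 = time.perf_counter()

print(f"pyarrow engine: {t1 - t0:.1f}s, polars + to_pandas: {t2 - t1:.1f}s")
```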