Pandas version checks
- [X] I have checked that the issue still exists on the latest versions of the docs on `main` here
Location of the documentation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
Documentation problem
For the `pyarrow` engine, there are some important features behind the `kwargs` that aren't described here, and it might not be obvious to users where to look in PyArrow. For example:
- Using `filters`, users can prune which files and/or row groups are read.
- Using `filesystem`, users can configure a filesystem such as S3.
Suggested fix for documentation
At the very least, we should document for each engine where those `kwargs` are passed. But it might even be worthwhile to provide examples of filters, reading partitioned datasets, and configuring remote filesystems. Does that seem reasonable?
Comment From: rhshadrach
Thanks for the report! +1 on saying which function is called from pandas and linking to its documentation. However, if we were to document the individual kwargs, there's a bit more maintenance burden keeping it in sync (e.g. what can be passed for filters just changed in 10.0.0) and I don't think it provides significant benefit to the user.
Comment From: phofl
Yep, agreed. I would rather link to the functions themselves too.
Comment From: LuchiLucs
I'm interested in examples of:
1. Leveraging the `filters` argument to filter rows based on their index. For instance, if the index is a datetime, return the rows in a given datetime interval.
2. Leveraging the `columns` argument when it is not known whether the columns exist. For instance, if all the given columns exist, select them; if only a subset exists, select just the existing ones without raising.
Comment From: ProgerDav
take