### Feature Type

- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
### Problem Description
The current `pandas.read_csv()` implementation is designed for robust and complete CSV parsing. However, even when users request only a few lines using `nrows=X`, the function:
- Initializes the full parsing engine
- Performs column-wise type inference
- Scans for delimiter/header consistency
- May read a large portion or all of the file, even for small previews
For large datasets (10–100GB CSVs), this results in significant I/O, CPU, and memory overhead — all when the user likely just wants a quick preview of the data.
This is a common pattern in:
- Exploratory Data Analysis (EDA)
- Data cataloging and profiling
- Schema validation or column sniffing
- Dashboards and notebook tooling
Currently, users resort to workarounds like:

```python
next(pd.read_csv(..., chunksize=5))
```

or shell-level hacks like:

```sh
head -n 5 large_file.csv
```
These are non-intuitive, unstructured, or outside the pandas ecosystem.
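For concreteness, a micro-benchmark along the following lines could quantify the gap between these approaches; `large_file.csv` is a placeholder for a multi-GB CSV, and no numbers are claimed here:

```python
import timeit

import pandas as pd

PATH = "large_file.csv"  # assumption: a large CSV on local disk

def via_nrows():
    # Pandas' documented way to read only the first rows.
    return pd.read_csv(PATH, nrows=5)

def via_chunksize():
    # The iterator workaround shown above.
    return next(pd.read_csv(PATH, chunksize=5))

def via_raw_lines():
    # Baseline: read only the first 6 physical lines (header + 5 rows).
    with open(PATH, encoding="utf-8") as f:
        return [next(f) for _ in range(6)]

for fn in (via_nrows, via_chunksize, via_raw_lines):
    print(f"{fn.__name__}: {timeit.timeit(fn, number=10):.4f}s")
```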
### Feature Description
Introduce a new function:

```python
pandas.preview_csv(filepath_or_buffer, nrows=5, ...)
```
### Goals
- Read only the first `n` rows plus the header line
- Avoid loading or inferring types from the full dataset
- No full column validation
- Fall back to `object` dtype unless `dtype_infer=True`
- Support basic options like delimiter, encoding, and header presence
Proposed API:

```python
def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False,
) -> pd.DataFrame:
    ...
```
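For illustration only, a minimal pure-Python sketch of the intended semantics (a real implementation would presumably reuse pandas' I/O machinery; `as_generator` is omitted for brevity):

```python
import csv

import pandas as pd

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
) -> pd.DataFrame:
    """Sketch: read only the first nrows data rows; object dtype by default."""
    if isinstance(filepath_or_buffer, str):
        handle = open(filepath_or_buffer, encoding=encoding, newline="")
        should_close = True
    else:
        handle, should_close = filepath_or_buffer, False
    try:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader) if has_header else None
        # zip() stops the reader after nrows rows; the rest of the file
        # is never read.
        rows = [row for _, row in zip(range(nrows), reader)]
    finally:
        if should_close:
            handle.close()
    df = pd.DataFrame(rows, columns=header, dtype=object)
    if dtype_infer:
        # Opt-in, per-column numeric conversion; non-numeric columns stay object.
        for col in df.columns:
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                pass
    return df
```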
### Alternative Solutions
| Tool / Method | Behavior | Limitation |
|---|---|---|
| `pd.read_csv(nrows=X)` | Reads entire file into memory, performs dtype inference and column validation | Not optimized for quick previews; incurs overhead even for small `nrows` |
| `pd.read_csv(chunksize=X)` | Returns an iterator of chunks (DataFrames of size `X`) | Requires non-intuitive iterator handling; users often want a DataFrame directly |
| `csv.reader` + slicing | Python's built-in CSV reader is lightweight and fast | Returns raw lists, not a DataFrame; lacks header handling and column inference |
| `subprocess.run(["head", "-n"])` | OS-level utility that returns the first N lines | Not portable across platforms; doesn't integrate with the DataFrame workflow |
| Polars: `pl.read_csv(..., n_rows)` | Rust-based, blazing-fast CSV reader | Requires installing a new library; pandas users might not want to switch ecosystems |
| Dask: `dd.read_csv(...).head()` | Lazy, out-of-core loading with chunked processing | Overhead of a distributed engine is unnecessary for simple previews |
| `open(...).readlines(N)` | Naive Python read of the first N lines | Doesn't handle parsing, delimiters, or schema properly |
| `pyarrow.csv.read_csv(...)[0:X]` | Efficient Arrow-based preview | Requires using Apache Arrow APIs; returns Arrow tables unless converted |
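For reference, the third-party routes from the table look roughly like this (placeholder filename; both libraries are optional dependencies):

```python
import polars as pl
import pyarrow.csv as pacsv

# Polars stops reading after n_rows:
df_pl = pl.read_csv("large_file.csv", n_rows=5).to_pandas()

# PyArrow parses the whole file first; the preview is then sliced and converted:
df_pa = pacsv.read_csv("large_file.csv").slice(0, 5).to_pandas()
```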
While workarounds exist, none provide a clean, idiomatic, native pandas function to:
- Efficiently load the first N rows
- Return a DataFrame immediately
- Avoid dtype inference
- Skip full file validation
- Avoid requiring third-party dependencies

A dedicated `pandas.preview_csv()` would fill this gap and offer an elegant, performant solution for quick data previews.
### Additional Context
No response
Comment From: rhshadrach
Thanks for the request. Having to maintain an entirely different code path that does very similar things to `read_csv` seems to me to be a non-starter. I would like to understand why `read_csv` could not be improved to fit this purpose.
Comment From: visheshrwl
Thank you for the thoughtful feedback @rhshadrach!

I completely understand the reluctance to maintain a separate code path, especially in a core function like `read_csv()`, which already carries significant complexity.

`read_csv()` is designed for full-fidelity, schema-validated, and optionally type-inferred ingestion. Introducing conditional short circuits for preview-style use cases pollutes that logic and increases branching inside a hot, complex code path.
On the other hand, a dedicated `preview_csv()` function:
- Defines a minimal contract: "read the top `N` rows quickly with minimal parsing"
- Requires no inference or post-processing logic
- Makes the behaviour explicit, predictable, and easy to optimize separately
From a user-intent perspective:
- `read_csv(nrows=X)` implies: "I want a truncated but fully parsed and inferred subset of the data"
- `preview_csv(nrows=X)` would mean: "I just want to see the first X lines, as fast as possible, even if it's untyped or partially parsed"

This distinction matters, especially in workflows where previewing is decoupled from actual analysis, such as:
- Data cataloging
- EDA profiling
- Schema sniffing
- Logging pipelines
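A toy illustration of that split (`preview_csv` is hypothetical here; the commented behavior reflects this proposal, not anything shipped):

```python
import io

import pandas as pd

buf = io.StringIO("a,b\n1,x\n2,y\n3,z\n")

print(pd.read_csv(buf, nrows=2).dtypes)
# a     int64   <- full dtype inference ran
# b    object

# Proposed: preview_csv(buf, nrows=2).dtypes would be object for both
# columns unless dtype_infer=True is passed.
```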
Any performance optimization embedded in `read_csv()` must:
- Preserve dozens of edge cases
- Remain compatible with all backends (C, Python, Arrow-based readers)
- Honor ~50+ keyword arguments (`dtype`, `parse_dates`, `converters`, `skiprows`, etc.)
This would introduce non-trivial complexity and testing burden to a critical code path and create surface area for subtle regressions.
Both `polars.read_csv(..., n_rows=X)` and `vaex.open(...).head(X)` implement optimized preview semantics using fast readers with early stopping. These tools don't override their full `read_csv()` equivalents; they recognize that the preview use case is distinct.

Pandas could adopt a similar design without breaking the existing contract of `read_csv()`.
If approved, I'm happy to:
- Own the implementation of `preview_csv()`
- Benchmark it vs `read_csv()` under real workloads (10GB+); a possible harness is sketched after this list
- Keep it behind a dedicated namespace (e.g. `pandas.io.preview`)
- Ensure full test coverage and documentation
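As a starting point, a harness along these lines could generate sample data and time the preview path (row count is a placeholder to scale toward 10GB+; no results are claimed here):

```python
import time

import numpy as np
import pandas as pd

PATH = "synthetic.csv"

# Generate a synthetic numeric CSV; increase n_rows to approach the target size.
n_rows, n_cols = 1_000_000, 20
pd.DataFrame(
    np.random.default_rng(0).random((n_rows, n_cols)),
    columns=[f"col{i}" for i in range(n_cols)],
).to_csv(PATH, index=False)

start = time.perf_counter()
pd.read_csv(PATH, nrows=5)
print(f"read_csv(nrows=5): {time.perf_counter() - start:.3f}s")
```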
Would love your thoughts, and any preferred entry point you'd recommend so this remains modular and maintainable long-term.
Thanks again!
Comment From: rhshadrach
Can you post sample data and benchmarks demonstrating the performance issue with specifying `nrows=N`?
Comment From: jbrockmendel
I don't think this merits a new function in pandas.