Feature Type

  • [x] Adding new functionality to pandas

  • [x] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

When parsing CSVs with pd.read_csv, there is no function‑scoped way to (a) react to malformed rows as they happen and (b) capture the exact CSV line numbers for those rows, without introducing process‑global side effects or using a subprocess.

  • on_bad_lines='warn' emits a Python warning like “Skipping line N: …” (includes the line number). But to programmatically capture those line numbers during parsing, one must intercept warnings, which either delays them (warnings.catch_warnings) or redirects all process warnings (logging.captureWarnings), both undesirable in large applications.

  • on_bad_lines=<callable> allows immediate, local handling (ideal), but the callable only receives the parsed fields (list[str]) and does NOT receive the source line number. This prevents building a precise per‑line record of malformed rows in real time.

What would solve this is a local, non‑global mechanism that surfaces the line number of each bad line during parsing, so that callers can log immediately and record exactly which lines were affected without impacting the rest of the application.

Feature Description

Add a function‑scoped callback that is invoked for every malformed row and provides structured context including the CSV line number. Either of the following designs would solve the problem:

Option A (new parameter):

  • Introduce a new parameter to read_csv, e.g. bad_line_callback, called synchronously for each malformed row:

    def bad_line_callback(
        fields: list[str], *,
        line_number: int,
        raw_line: str | None = None,
        message: str | None = None
    ) -> list[str] | None:
        """
        Return None to skip the row (default), or return a corrected list[str] to keep it.
        Called per malformed record; function-scoped with no global side-effects.
        """

Usage (capturing exact line numbers):

    bad_line_numbers: list[int] = []

    def capture_bad_line(fields, *, line_number, raw_line=None, message=None):
        bad_line_numbers.append(line_number)
        # Optional: log or store message/raw_line if needed
        return None  # keep default skip behavior

    df = pd.read_csv(
        path,
        engine="python",
        sep=None,
        on_bad_lines="skip",          # existing semantics preserved
        bad_line_callback=capture_bad_line
    )
    # bad_line_numbers now contains the exact CSV line numbers seen as bad.

Option B (extend existing callable):

  • Enhance on_bad_lines=<callable> to accept optional keyword-only context parameters if supported by the user’s callable:

    def on_bad_lines_callable(fields, *, line_number=None, raw_line=None, message=None):
        ...
  • Backward compatible: if the user’s callable only accepts positional fields, pandas behaves exactly as today; if it accepts the kwargs, pandas supplies the line number and optional context.
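
One way the backward-compatible dispatch in Option B might be done is by inspecting the callable's signature before calling it. The sketch below is only an illustration under that assumption: call_bad_line_handler is a hypothetical helper, not existing pandas code, and the context names (line_number, raw_line, message) are the ones proposed above.

```python
import inspect


def call_bad_line_handler(handler, fields, **context):
    """Hypothetical helper: pass only the context kwargs the handler accepts."""
    try:
        params = inspect.signature(handler).parameters
    except (TypeError, ValueError):
        # Signature not introspectable: fall back to the existing calling convention.
        return handler(fields)

    # A handler declaring **kwargs receives the full context.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return handler(fields, **context)

    # Otherwise pass only the names the handler can take as keywords.
    keyword_kinds = (
        inspect.Parameter.KEYWORD_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    accepted = {
        name: value
        for name, value in context.items()
        if name in params and params[name].kind in keyword_kinds
    }
    return handler(fields, **accepted)


def legacy_handler(fields):
    # Existing style: positional fields only; returning None skips the row.
    return None


seen_line_numbers = []


def context_handler(fields, *, line_number=None, raw_line=None, message=None):
    # Proposed style: records the line number and repairs the row.
    seen_line_numbers.append(line_number)
    return fields[:2]


context = {"line_number": 3, "raw_line": "a,b,c", "message": "Expected 2 fields, saw 3"}
legacy_result = call_bad_line_handler(legacy_handler, ["a", "b", "c"], **context)
context_result = call_bad_line_handler(context_handler, ["a", "b", "c"], **context)
```

With this approach a callable written for the current API is called exactly as before, while a callable that declares any of the context keywords receives them.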

Common semantics (both options):

  • line_number is 1-based and matches the current warning text (“Skipping line N: …”).
  • The callback is function-scoped, synchronous, and has no process-global effects.
  • It works with engine="python" (which supports on_bad_lines) and with sep=None (sniffer). Behavior with chunksize should be documented and consistent.
  • The callback can be used purely for observability (logging/capture of exact line numbers) or to fix/replace malformed rows by returning a corrected list[str].
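
To make the intended semantics concrete, here is a minimal stand-in built on the standard-library csv module rather than pandas. It is a sketch under stated assumptions: read_rows is a hypothetical function, a "bad" row is taken to be one with more fields than the header, and csv.reader's line_num attribute plays the role of line_number (raw_line is omitted because csv.reader does not expose it).

```python
import csv
import io


def read_rows(text, bad_line_callback):
    """Hypothetical stand-in for the proposed hook, using csv instead of pandas."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if len(fields) > len(header):
            message = f"Expected {len(header)} fields, saw {len(fields)}"
            # reader.line_num counts source lines read so far, so it is 1-based.
            fixed = bad_line_callback(
                fields, line_number=reader.line_num, message=message
            )
            if fixed is None:
                continue  # None means: skip the row
            fields = fixed  # a returned list replaces the row
        rows.append(fields)
    return header, rows


bad_line_numbers = []


def capture_bad_line(fields, *, line_number, message=None):
    # Observability only: record the line number, keep the skip behavior.
    bad_line_numbers.append(line_number)
    return None


sample = "a,b\n1,2\n3,4,5\n6,7\n8,9,10\n"
header, rows = read_rows(sample, capture_bad_line)
```

Here the callback runs synchronously while the file is being read, and the only state it touches is the list owned by the caller. For records that span several physical lines, csv.reader reports the last line read, so a real implementation would need to define which line a multi-line record is attributed to.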

Alternative Solutions

  • on_bad_lines='warn' + warnings.catch_warnings(record=True): allows extracting line numbers post-parse by parsing warning text, but warnings are not emitted live and this approach is brittle.
  • logging.captureWarnings(True): routes all Python warnings process-wide; enables live capture but introduces global side-effects and potential interference in large apps.
  • Overriding warnings.showwarning: process-global, not thread-safe, and risky even if restored carefully.
  • Running parsing in a subprocess/worker: safe isolation but adds orchestration/ops overhead.

None of these provides a simple, function-scoped hook that delivers line numbers for immediate, per-row handling without global effects.
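
For reference, the first alternative can be sketched as follows. This is an illustration only: collect_skipped_lines and fake_parse are hypothetical names, the regular expression assumes the “Skipping line N: …” prefix quoted in this issue, and the demo emits simulated warnings instead of calling pandas so that it runs with the standard library alone.

```python
import re
import warnings

# Assumes the warning text starts each entry with "Skipping line N:".
SKIP_PATTERN = re.compile(r"Skipping line (\d+):")


def collect_skipped_lines(parse):
    """Run parse() and return (result, line numbers found in recorded warnings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = parse()

    line_numbers = []
    for warning in caught:
        # findall copes with a single message that lists several skipped lines.
        for match in SKIP_PATTERN.findall(str(warning.message)):
            line_numbers.append(int(match))
    return result, line_numbers


def fake_parse():
    # Simulated warnings standing in for what a parser might emit.
    warnings.warn("Skipping line 3: expected 2 fields, saw 3", UserWarning)
    warnings.warn("Skipping line 7: expected 2 fields, saw 4", UserWarning)
    return "parsed"


result, skipped = collect_skipped_lines(fake_parse)
```

With pandas, the callable passed in would be something like lambda: pd.read_csv(path, on_bad_lines="warn"). The line numbers only become available after parse() returns, and the extraction depends on the wording of the warning text, which is the brittleness described above.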

Additional Context

  • Typical current warning format: “Skipping line N: …”. Users might need to log these events as they occur and record the exact line numbers (for audit, remediation, or user-facing summaries) without altering application-wide logging/warnings behavior.
  • This enhancement would significantly improve operational robustness for ETL/ingestion pipelines and large applications that need precise, real-time observability of malformed input rows.

Comment From: sanggon6107

Hi @laelhalawani, I think this is a duplicate of #61838.