Pandas ENH: Include line number and number of fields when read_csv() callable raises ParserWarning

Feature Type

[ ] Adding new functionality to pandas
[x] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I wish I could use pandas to detect and repair issues in a CSV file, but raise an informative warning when an unrepairable issue is encountered.

I have written a function that identifies common issues (e.g. the field delimiter being used improperly within a field) and checks the surrounding fields to estimate the original intent of the data. When this logic cannot identify the issue, the function returns the original line, and the user should be directed to the problematic line.

Feature Description

Given a CSV with bad lines (e.g. line 3 having an extra "E"):

id,field_1,field_2
101,A,B
102,C,D,E
103,F,G

read_csv() will, with all defaults (on_bad_lines='error'), raise a ParserError:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

With on_bad_lines='warn', it will emit a ParserWarning with the same helpful information:

<stdin>:1: ParserWarning: Skipping line 3: expected 3 fields, saw 4

However, when using a callable (e.g. on_bad_lines=line_fixer), the ParserWarning message is very generic and indicates neither the line number, the expected number of fields, nor the number of fields seen:

>>> import pandas as pd
>>> def line_fixer(line):
...     return [1, 2, 3, 4, 5]
...
>>> df = pd.read_csv('test.csv', engine='python', on_bad_lines=line_fixer)
<stdin>:1: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.

Including these details would allow the user to find and fix the input CSV manually.

Alternative Solutions

Pre-process the CSV file separately from the read_csv() function.
Pass line number and expected field count to the callable function, which can raise its own descriptive warning.
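
The first alternative above could be sketched roughly as follows. This is only an illustration using the standard-library csv module, not part of pandas; the helper name find_bad_lines is hypothetical, and it assumes the first row is the header and that every data row should have the same number of fields as that header.

```python
import csv
import io


def find_bad_lines(text, delimiter=","):
    """Return (line_number, expected, seen, fields) for each row whose
    field count differs from the header's field count.

    Hypothetical helper: assumes the first row of `text` is the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    problems = []
    for fields in reader:
        if not fields:
            # Blank line: ignore it here (an assumption of this sketch).
            continue
        if len(fields) != expected:
            # reader.line_num is the number of physical lines read so far,
            # so for a multi-line quoted record it points at its last line.
            problems.append((reader.line_num, expected, len(fields), fields))
    return problems


sample = "id,field_1,field_2\n101,A,B\n102,C,D,E\n103,F,G\n"
for line_no, expected, seen, fields in find_bad_lines(sample):
    print(f"line {line_no}: expected {expected} fields, saw {seen}: {fields}")
```

Running such a scan before read_csv() gives the line number and field counts up front, at the cost of parsing the file twice.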

Additional Context

No response

Comment From: sanggon6107

Hi @matthewgottlieb, it seems the callable receives both the expected and the actual number of columns when engine='pyarrow', so maybe we can do the same thing for engine='python' as well.

import pandas as pd
import warnings

def on_bad_lines_pyarrow(arg):
    warnings.warn(
        f'Expected {arg[0]} columns, got {arg[1]}. Skip this row',
        pd.errors.ParserWarning
    )
    return "skip"

file = pd.read_csv('input.csv', on_bad_lines=on_bad_lines_pyarrow, engine='pyarrow')

# ParserWarning : Expected 3 columns, got 4. Skip this row

Could anyone confirm whether this would be acceptable, so that I can work on it?

I would expect the result to look something like this:

import pandas as pd
import warnings

def on_bad_lines_python(line, expected_col_num):
    warnings.warn(
        f"Expected {expected_col_num}, got {len(line)} : {line}",
        pd.errors.ParserWarning
    )
    return list(range(len(line)))

file = pd.read_csv('input.csv', on_bad_lines=on_bad_lines_python, engine='python')

# Expected 3, got 4 : ['102', 'C', 'D', 'E']
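
Until something like the signature above exists, a similar message might be produced today by binding the expected field count in a closure. This is a minimal sketch under stated assumptions: make_line_fixer is a hypothetical name, the expected count must be known in advance, RuntimeWarning is used only to keep the sketch free of dependencies (pd.errors.ParserWarning could be substituted), and truncating the line is just a placeholder repair.

```python
import warnings


def make_line_fixer(expected_fields):
    """Build a callable with the one-argument shape that on_bad_lines
    accepts for engine='python' (a list of strings for the bad line).

    Hypothetical helper: the expected field count is supplied by the
    caller because the callable itself does not receive it.
    """
    def line_fixer(bad_line):
        warnings.warn(
            f"Expected {expected_fields} fields, saw {len(bad_line)}: {bad_line}",
            RuntimeWarning,
            stacklevel=2,
        )
        # Placeholder repair: drop the extra trailing fields.
        return bad_line[:expected_fields]

    return line_fixer


fixer = make_line_fixer(3)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    repaired = fixer(["102", "C", "D", "E"])

print(repaired)
print(caught[0].message)
```

The result of make_line_fixer(3) could then be passed as on_bad_lines with engine='python'. The line number is still not available to the callable, which is the gap this request describes.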