Feature Type
- [ ] Adding new functionality to pandas
- [x] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I wish I could use pandas to detect and repair issues in a CSV file, but raise an informative warning when an unrepairable issue is encountered.
I have written a function that identifies common issues (e.g. the field delimiter being used improperly within a field) and checks the surrounding fields to estimate the original intent of the data. When an issue cannot be resolved with this logic, the function returns the original line, and the user should be directed to the problematic line.
Feature Description
Given a CSV with bad lines (e.g. line 3 having an extra "E"):
id,field_1,field_2
101,A,B
102,C,D,E
103,F,G
With all defaults (on_bad_lines='error'), read_csv() raises a ParserError:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
With on_bad_lines='warn', it issues a ParserWarning with the same helpful information:
<stdin>:1: ParserWarning: Skipping line 3: expected 3 fields, saw 4
However, when using a callable (e.g. on_bad_lines=line_fixer), the ParserWarning message is very generic and indicates neither the line number nor the expected or seen field counts:
>>> import pandas as pd
>>> def line_fixer(line):
...     return [1, 2, 3, 4, 5]
...
>>> df = pd.read_csv('test.csv', engine='python', on_bad_lines=line_fixer)
<stdin>:1: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
Including these details would allow the user to find and fix the input CSV manually.
Alternative Solutions
- Pre-process the CSV file separately from the read_csv() function.
- Pass line number and expected field count to the callable function, which can raise its own descriptive warning.
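The first alternative can be sketched with the standard library alone. The helper below, `find_bad_lines`, is a hypothetical name and not part of pandas. It reads the header to get the expected field count and reports every row whose count differs, mirroring the details in the ParserError message. It assumes the first row is the header and skips blank lines.

```python
import csv
import io


def find_bad_lines(handle, delimiter=","):
    """Return (line_number, expected, seen, fields) for each malformed row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    expected = len(header)
    problems = []
    for fields in reader:
        # reader.line_num is the number of physical lines read so far
        if fields and len(fields) != expected:
            problems.append((reader.line_num, expected, len(fields), fields))
    return problems


sample = "id,field_1,field_2\n101,A,B\n102,C,D,E\n103,F,G\n"
for line_no, expected, seen, fields in find_bad_lines(io.StringIO(sample)):
    print(f"Line {line_no}: expected {expected} fields, saw {seen}: {fields}")
```

This keeps the check separate from read_csv(), at the cost of reading the file twice.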
Additional Context
No response
Comment From: sanggon6107
Hi @matthewgottlieb,
It seems that with engine='pyarrow' the callable receives both the expected and the actual column count, so I think we could do the same for engine='python' as well.
import pandas as pd
import warnings

def on_bad_lines_pyarrow(arg):
    warnings.warn(
        f'Expected {arg[0]} columns, got {arg[1]}. Skip this row',
        pd.errors.ParserWarning
    )
    return "skip"

file = pd.read_csv('input.csv', on_bad_lines=on_bad_lines_pyarrow, engine='pyarrow')
# ParserWarning : Expected 3 columns, got 4. Skip this row
Could anyone confirm whether this would be acceptable? If so, I can work on it.
I would expect the usage to look like this:
import pandas as pd
import warnings

def on_bad_lines_python(line, expected_col_num):
    warnings.warn(
        f"Expected {expected_col_num}, got {len(line)} : {line}",
        pd.errors.ParserWarning
    )
    return [i for i in range(len(line))]

file = pd.read_csv('input.csv', on_bad_lines=on_bad_lines_python, engine='python')
# Expected 3, got 4 : ['102', 'C', 'D', 'E']
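Until a two-argument signature like this exists, the expected count can be bound in a closure, since a callable used with engine='python' receives only the list of fields. This is a minimal sketch that assumes the caller already knows the expected column count. `make_bad_line_handler` is a hypothetical helper, and it defaults to the built-in `UserWarning` so the block runs without pandas; pass `pd.errors.ParserWarning` as `category` to match the examples above. The line number is still unavailable to the callable, which is the gap this request is about.

```python
import warnings


def make_bad_line_handler(expected_col_num, category=UserWarning):
    """Build a one-argument callable that knows the expected field count."""

    def handler(line):
        warnings.warn(
            f"Expected {expected_col_num} fields, got {len(line)}: {line}",
            category,
            stacklevel=2,
        )
        # Returning None asks the parser to skip the line
        return None

    return handler


handler = make_bad_line_handler(3)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = handler(["102", "C", "D", "E"])
print(result, str(caught[0].message))
```

With pandas, the handler would be passed as `on_bad_lines=make_bad_line_handler(3)` together with `engine='python'`.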