Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# Row "3" has too few fields; row "4,5,extra" has too many.
content = "a,b\n1,2\n3\n4,5,extra\n6,7"
with open("malformed.csv", "w", newline="") as f:
    f.write(content)

path = "malformed.csv"

print(f"Processing file: {path} with chunksize=3")
for chunk in pd.read_csv(path, chunksize=3, on_bad_lines="skip"):
    print(chunk)
# Output:
# Processing file: malformed.csv with chunksize=3
#    a    b
# 0  1  2.0
# 1  3  NaN
# 2  6  7.0

print(f"Processing file: {path} with chunksize=2")
for chunk in pd.read_csv(path, chunksize=2, on_bad_lines="skip"):
    print(chunk)
# Output:
# Processing file: malformed.csv with chunksize=2
#    a    b
# 0  1  2.0
# 1  3  NaN
#    a  b
# 2  4  5
# 3  6  7
Issue Description
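Whether a malformed row is skipped under on_bad_lines="skip" depends on the chunksize parameter. In the example above, the row 4,5,extra is dropped with chunksize=3 but kept (as 4,5) with chunksize=2.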
Expected Behavior
The rows that are skipped should be independent of chunksize.
Installed Versions
INSTALLED VERSIONS
commit : c888af6d0bb674932007623c0867e1fbd4bdc2c6
python : 3.12.3
python-bits : 64
OS : Linux
OS-release : 6.6.87.2-microsoft-standard-WSL2
Version : #1 SMP PREEMPT_DYNAMIC Thu Jun 5 18:30:46 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : C.UTF-8

pandas : 2.3.1
numpy : 2.3.2
pytz : 2025.2
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None
None
Comment From: rhshadrach
Thanks for the report! Confirmed on main. Further investigation and PRs to fix are welcome!
Comment From: alexbra1
I also noticed that this issue does not happen when setting engine="python".
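A minimal sketch of how the engine="python" observation could be checked, assuming the same CSV content as in the report. The helper name read_in_chunks and the in-memory io.StringIO buffer are illustrative choices, not part of the original report. If the observation holds, the concatenated result should be the same for both chunk sizes, so the script should print True.

```python
import io

import pandas as pd

# Same content as in the report: "3" has too few fields, "4,5,extra" too many.
CONTENT = "a,b\n1,2\n3\n4,5,extra\n6,7"


def read_in_chunks(chunksize, engine):
    """Hypothetical helper: read CONTENT in chunks and concatenate the chunks."""
    reader = pd.read_csv(
        io.StringIO(CONTENT),
        chunksize=chunksize,
        on_bad_lines="skip",
        engine=engine,
    )
    return pd.concat(list(reader), ignore_index=True)


# Compare the rows that survive for two different chunk sizes.
by_chunksize = {n: read_in_chunks(n, engine="python") for n in (2, 3)}
print(by_chunksize[2].equals(by_chunksize[3]))
```

Passing engine="c" to the same helper should reproduce the mismatch described in the report on affected versions.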
Comment From: khemkaran10
take
Comment From: khemkaran10
The issue occurs when the bad line is the first line in a chunk:
content = "a,b\n1,2\n3\n4,5,extra\n6,7"
======= Chunk Size: 2 =======
a b
0 1 2.0
1 3 NaN
a b
2 4 5 # Bad Line
3 6 7
======= Chunk Size: 3 =======
a b
0 1 2.0
1 3 NaN
2 6 7.0
content = "a,b\n1,2\n3\n8,9\n4,5,extra\n6,7"  # added one more row 8,9 before 4,5,extra
======= Chunk Size: 2 =======
a b
0 1 2.0
1 3 NaN
a b
2 8 9
3 6 7
======= Chunk Size: 3 =======
a b
0 1 2.0
1 3 NaN
2 8 9.0
a b
3 4 5 # Bad Line
4 6 7
Test Script:
import pandas as pd

contents = [
    "a,b\n1,2\n3\n4,5,extra\n6,7",
    "a,b\n1,2\n3\n8,9\n4,5,extra\n6,7",
]

for i, content in enumerate(contents):
    # Write each test case to its own file.
    with open(f"malformed_{i}.csv", "w", newline="") as f:
        f.write(content)
    for chunksize in [2, 3]:
        print("======= Chunk Size: ", chunksize, "=======")
        for chunk in pd.read_csv(f"malformed_{i}.csv", chunksize=chunksize, on_bad_lines="skip"):
            print(chunk)
Comment From: khemkaran10
@rhshadrach The issue is with this code block: https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/_libs/src/parser/tokenizer.c#L416-L427
When the bad line is the first line in the chunk, self->lines is 1 and self->header_end is 0, so !(self->lines <= self->header_end + 1) evaluates to false and the line is not skipped. This is an edge case, and I'm not sure it's worth fixing.
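To make the condition easier to follow, here is a toy Python model of the guard as quoted in this comment. It is not the actual C source; the function name skip_branch_reachable is made up for illustration, and lines=2 is just an assumed value for a bad line that is not first in its chunk.

```python
def skip_branch_reachable(lines: int, header_end: int) -> bool:
    """Mirror the quoted guard: !(self->lines <= self->header_end + 1)."""
    return not (lines <= header_end + 1)


# Values reported above for a bad line that opens a chunk: the guard is false,
# so the skip logic is never reached.
print(skip_branch_reachable(lines=1, header_end=0))  # False

# Assumed value for a bad line later in the chunk: the guard is true.
print(skip_branch_reachable(lines=2, header_end=0))  # True
```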