Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

content = "a,b\n1,2\n3\n4,5,extra\n6,7"
with open("malformed.csv", "w", newline="") as f:
    f.write(content)
path = "malformed.csv"
print(f"Processing file: {path} with chunksize=3")
for chunk in pd.read_csv(path, chunksize=3, on_bad_lines="skip"):
    print(chunk)

# Output:
# Processing file: malformed.csv with chunksize=3
#   a    b
# 0  1  2.0
# 1  3  NaN
# 2  6  7.0

print(f"Processing file: {path} with chunksize=2")
for chunk in pd.read_csv(path, chunksize=2, on_bad_lines="skip"):
    print(chunk)

# Output:
# Processing file: malformed.csv with chunksize=2
#    a    b
# 0  1  2.0
# 1  3  NaN
#    a  b
# 2  4  5
# 3  6  7

Issue Description

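Whether a bad line is skipped depends on the chunksize parameter. In the example above, the row 4,5,extra is skipped with chunksize=3, but with chunksize=2 it is kept and parsed as 4,5.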

Expected Behavior

The rows that are skipped should be independent of chunksize.
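For reference, here is a minimal sketch of the chunk-independent result, using only the standard library rather than pandas. It assumes the documented meaning of a bad line (a line with too many fields) and treats short rows as valid; `split_rows` is a hypothetical helper, not part of any library.

```python
import csv
import io

content = "a,b\n1,2\n3\n4,5,extra\n6,7"


def split_rows(text):
    """Split CSV text into (kept, skipped) data rows.

    Assumption for this sketch: a row is "bad" only when it has more
    fields than the header; rows with fewer fields are kept.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    kept, skipped = [], []
    for row in reader:
        if len(row) > len(header):
            skipped.append(row)
        else:
            kept.append(row)
    return kept, skipped


kept, skipped = split_rows(content)
print("kept:", kept)
print("skipped:", skipped)
```

Since this classification looks at each row on its own, it cannot depend on where chunk boundaries fall, which is the property expected from read_csv here.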

Installed Versions

INSTALLED VERSIONS

commit : c888af6d0bb674932007623c0867e1fbd4bdc2c6
python : 3.12.3
python-bits : 64
OS : Linux
OS-release : 6.6.87.2-microsoft-standard-WSL2
Version : #1 SMP PREEMPT_DYNAMIC Thu Jun 5 18:30:46 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : C.UTF-8

pandas : 2.3.1
numpy : 2.3.2
pytz : 2025.2
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None

Comment From: rhshadrach

Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome!

Comment From: alexbra1

I also noticed that this issue does not happen when setting engine="python".
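A small sketch for comparing the two engines side by side, in case it helps anyone investigating. `values_of_a` is a hypothetical helper and the temporary file location is arbitrary; only read_csv with chunksize, on_bad_lines and engine is taken from the report. No output is shown because it depends on the pandas version in use.

```python
import os
import tempfile

import pandas as pd

content = "a,b\n1,2\n3\n4,5,extra\n6,7"


def values_of_a(path, chunksize, engine):
    """Collect the values of column "a" over all chunks, in file order."""
    values = []
    with pd.read_csv(
        path, chunksize=chunksize, on_bad_lines="skip", engine=engine
    ) as reader:
        for chunk in reader:
            values.extend(chunk["a"].tolist())
    return values


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "malformed.csv")
    with open(path, "w", newline="") as f:
        f.write(content)
    # One entry per (engine, chunksize) combination.
    results = {
        (engine, chunksize): values_of_a(path, chunksize, engine)
        for engine in ("c", "python")
        for chunksize in (2, 3)
    }

for key, values in results.items():
    print(key, values)
```

If the skipping were independent of chunksize, the two lists for a given engine would be identical.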

Comment From: khemkaran10

take

Comment From: khemkaran10

The issue occurs when the bad line is the first line in a chunk:

content = "a,b\n1,2\n3\n4,5,extra\n6,7"
======= Chunk Size:  2 =======
   a    b
0  1  2.0
1  3  NaN
   a  b
2  4  5         # Bad Line
3  6  7
======= Chunk Size:  3 =======
   a    b
0  1  2.0
1  3  NaN
2  6  7.0


content = "a,b\n1,2\n3\n8,9\n4,5,extra\n6,7"  # added one more row 8,9 before 4,5,extra
======= Chunk Size:  2 =======
   a    b
0  1  2.0
1  3  NaN
   a  b
2  8  9
3  6  7
======= Chunk Size:  3 =======
   a    b
0  1  2.0
1  3  NaN
2  8  9.0
   a  b
3  4  5       # Bad Line
4  6  7

Test Script:

import pandas as pd
contents = ["a,b\n1,2\n3\n4,5,extra\n6,7",
            "a,b\n1,2\n3\n8,9\n4,5,extra\n6,7"]
for i, content in enumerate(contents):
    with open(f"malformed_{i}.csv", "w", newline="") as f:
        f.write(content)
    for chunksize in [2, 3]:
        print("======= Chunk Size: ", chunksize, "=======")
        for chunk in pd.read_csv(f"malformed_{i}.csv", chunksize=chunksize, on_bad_lines="skip"):
            print(chunk)

Comment From: khemkaran10

@rhshadrach The issue is with this code block: https://github.com/pandas-dev/pandas/blob/d4ae6494f2c4489334be963e1bdc371af7379cd5/pandas/_libs/src/parser/tokenizer.c#L416-L427

When the bad line is the first line in the chunk, self->lines is 1 and self->header_end is 0, so !(self->lines <= self->header_end + 1) evaluates to false and the line is not skipped. This is an edge case, and I'm not sure whether it's worth fixing.
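To make the condition concrete, here is a simplified Python model of it. This is not the tokenizer code: `is_skipped`, `fields` and `expected_fields` are hypothetical names, and combining the guard with a too-many-fields check is an assumption of this sketch. Only the guard on `lines` and `header_end` comes from the analysis above.

```python
def is_skipped(lines, header_end, fields, expected_fields):
    """Simplified model of the skip decision described above.

    `lines` and `header_end` stand in for self->lines and self->header_end.
    `fields` and `expected_fields` are hypothetical names for the field
    count of the current line and of a well-formed line.
    """
    past_header = not (lines <= header_end + 1)
    too_many_fields = fields > expected_fields
    return past_header and too_many_fields


# Bad line "4,5,extra" as the first line of a chunk: lines == 1, header_end == 0.
first_in_chunk = is_skipped(lines=1, header_end=0, fields=3, expected_fields=2)

# The same bad line later in a chunk, modeled here with lines == 2.
later_in_chunk = is_skipped(lines=2, header_end=0, fields=3, expected_fields=2)

print(first_in_chunk)  # False: the guard blocks the skip
print(later_in_chunk)  # True: the line is skipped
```

In this model the guard cannot tell "first data line after the header" apart from "first line of a later chunk", which matches the behavior seen with chunksize=2 and chunksize=3.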