Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Example code to reproduce the issue
import pandas as pd
# Imagine this CSV content represents a large file with similar structure
csv_content = """
column1
0
1
2
3
...
'abcd' # A string value inserted here
...
999996
999997
999998
999999
"""
Issue Description
When reading a large CSV file (about 1,000,000 rows) with read_csv, a single string value in an otherwise numeric column changes how the numeric values in that column are parsed. The column is inferred as object dtype, but the values inside it are not parsed consistently: the row containing the string and a large number of rows around it (roughly 200,000 rows before and after, though the count differs on every run) are read as strings, while the remaining rows are read as integers. As a result, numeric values in different rows of the same column can have different Python types.
With read_csv(..., low_memory=False) the result is as expected: the values in the column all have a consistent type.
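The comparison described above can be sketched as follows. This is a minimal reproduction under stated assumptions, not the reporter's exact script: the helpers build_csv and python_types are hypothetical names introduced here, the column name key and the position of 'abcd' (row 599999) are taken from the output shown in this report, and the CSV is built in memory with StringIO instead of being read from test.csv. The default read may emit a DtypeWarning, and the exact split between int and str values depends on pandas' internal chunking, so no particular counts are claimed.

```python
from io import StringIO

import pandas as pd


def build_csv(n_rows=1_000_000, bad_row=599_999):
    """Build CSV text with one integer column `key` and one string inserted.

    Hypothetical helper: the row count and position mirror the report.
    """
    values = [str(i) for i in range(n_rows)]
    values.insert(bad_row, "abcd")
    return "key\n" + "\n".join(values) + "\n"


def python_types(series):
    """Return the set of Python type names found in a column (hypothetical helper)."""
    return set(series.map(lambda x: type(x).__name__))


csv_text = build_csv()

# Default settings (low_memory=True): the file is parsed in internal chunks,
# so an object column may end up holding a mix of int and str values.
chunked = pd.read_csv(StringIO(csv_text))
print(python_types(chunked["key"]))

# low_memory=False: the column is parsed in one pass, so every value
# ends up with the same type.
whole = pd.read_csv(StringIO(csv_text), low_memory=False)
print(python_types(whole["key"]))

# Declaring the dtype up front also avoids per-chunk inference.
as_text = pd.read_csv(StringIO(csv_text), dtype={"key": str})
print(python_types(as_text["key"]))
```

Passing dtype explicitly is generally the more predictable choice when a column is known to contain non-numeric values, since it does not rely on inference at all.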
Expected Behavior
raw = pd.read_csv('test.csv')
raw['type'] = raw['key'].map(lambda x: str(type(x)))
print(raw[raw.key=='abcd'])
key type
599999 abcd <class 'str'>
print(raw.type.value_counts())
type
<class 'int'> 524288
<class 'str'> 475713
print(raw[raw.type=="<class 'int'>"])
key type
0 0 <class 'int'>
1 1 <class 'int'>
2 2 <class 'int'>
3 3 <class 'int'>
4 4 <class 'int'>
... ... ...
524283 524283 <class 'int'>
524284 524284 <class 'int'>
524285 524285 <class 'int'>
524286 524286 <class 'int'>
524287 524287 <class 'int'>
print(raw[raw.type=="<class 'str'>"])
key type
524288 524288 <class 'str'>
524289 524289 <class 'str'>
524290 524290 <class 'str'>
524291 524291 <class 'str'>
524292 524292 <class 'str'>
... ... ...
999996 999995 <class 'str'>
999997 999996 <class 'str'>
999998 999997 <class 'str'>
999999 999998 <class 'str'>
1000000 999999 <class 'str'>
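Since the output above shows many numeric values stored as str, filtering on the Python type does not isolate the one value that is genuinely non-numeric. A minimal sketch of one way to locate it, assuming a small stand-in Series named mixed (hypothetical data, not the reporter's file), is to coerce with pd.to_numeric and keep the rows that fail to convert:

```python
import pandas as pd

# Stand-in for an object column holding ints, numeric strings, and one real string.
mixed = pd.Series([0, 1, "2", "abcd", "4"], dtype=object, name="key")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(mixed, errors="coerce")

# Rows that became NaN are the values that are not numeric at all.
bad_rows = mixed[as_numbers.isna()]
print(bad_rows)
```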
Installed Versions
pandas version: 2.1.4, system: Ubuntu 22.04
Comment From: anirudh-hegde
Hi @yinzhedfs, have you tried using StringIO to create a file-like object for csv_content?