Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Example code to reproduce the issue
import pandas as pd
# Imagine this CSV content represents a large file with similar structure
csv_content = """
column1
0
1
2
3
...
'abcd' # A string value inserted here
...
999996
999997
999998
999999
"""
Issue Description
I have encountered an issue where pandas infers different types for the same kind of numeric values when a single string value is present in a large CSV file (about 1,000,000 rows). The column is read as object dtype if a single string is inserted somewhere in it. This affects not only the row containing the string but also a large number of rows around it: roughly 200,000 rows before and after it (the count differs from run to run) are read as strings, so the numeric values in different rows of the same column end up with different Python types.
With read_csv(..., low_memory=False) the result is as expected: all values in the column have the same type.
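The following is a minimal sketch of how the report might be reproduced and worked around; it is not the reporter's original script. It assumes a single `key` column with the string `abcd` at row 599999, matching the output shown under Expected Behavior, and it builds the CSV in memory with `io.StringIO` instead of reading `test.csv`. The helper `type_counts` is a hypothetical name introduced for illustration. The pandas documentation for `low_memory` notes that the file is processed internally in chunks, which can lead to mixed type inference, and recommends either `low_memory=False` or an explicit `dtype` to avoid it.

```python
import io
import warnings

import pandas as pd

# Assumed layout: 1,000,000 integers in one "key" column, with one string inserted.
N_ROWS = 1_000_000
STRING_ROW = 599_999

values = [str(i) for i in range(N_ROWS)]
values.insert(STRING_ROW, "abcd")
csv_text = "key\n" + "\n".join(values) + "\n"


def type_counts(frame):
    """Count the Python types of the values stored in the 'key' column."""
    return frame["key"].map(lambda x: type(x).__name__).value_counts().to_dict()


# Default read (low_memory=True): dtypes are inferred chunk by chunk, so the
# column may end up holding both int and str objects (a DtypeWarning is typical).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.DtypeWarning)
    chunked = pd.read_csv(io.StringIO(csv_text))

# Workaround 1: parse the whole column in one pass.
whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Workaround 2: state the dtype explicitly so no inference is needed.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})

print("default:         ", type_counts(chunked))
print("low_memory=False:", type_counts(whole))
print("dtype=str:       ", type_counts(as_str))
```

With either workaround every value in `key` is a `str`, including the numeric-looking ones. The exact int/str split of the default read depends on where the parser's chunk boundaries fall, so it is not asserted here; the 524288 (2^19) integer rows in the reported output would be consistent with such a boundary.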
Expected Behavior
raw = pd.read_csv('test.csv')
raw['type'] = raw['key'].map(lambda x: str(type(x)))
print(raw[raw.key=='abcd'])
key type
599999 abcd <class 'str'>
print(raw.type.value_counts())
type
<class 'int'> 524288
<class 'str'> 475713
print(raw[raw.type=="<class 'int'>"])
key type
0 0 <class 'int'>
1 1 <class 'int'>
2 2 <class 'int'>
3 3 <class 'int'>
4 4 <class 'int'>
... ... ...
524283 524283 <class 'int'>
524284 524284 <class 'int'>
524285 524285 <class 'int'>
524286 524286 <class 'int'>
524287 524287 <class 'int'>
print(raw[raw.type=="<class 'str'>"])
key type
524288 524288 <class 'str'>
524289 524289 <class 'str'>
524290 524290 <class 'str'>
524291 524291 <class 'str'>
524292 524292 <class 'str'>
... ... ...
999996 999995 <class 'str'>
999997 999996 <class 'str'>
999998 999997 <class 'str'>
999999 999998 <class 'str'>
1000000 999999 <class 'str'>
Installed Versions
pandas version: 2.1.4
system: Ubuntu 22.04
Comment From: anirudh-hegde
Hi @yinzhedfs, have you tried using StringIO to create a file-like object for csv_content?