Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
"field1" ,"field2" ,"field3" ,"field4" ,"field5" ,"field6" ,"field7"
"1" , 14 , 6 , 21 ,"euia" , 0.54 , 1
"2" , 30 , 5 , 26 ,"euia" , 0.82 , 1
"2" , 1 , 0 , 0 ,"eua" , 0 , 0
"3" , 27 , 7 , 17 ,"euia" , 1 , 1
"4" , 14 , 0 , 9 ,"euia" , 0.64 , 0.92
"4" , 10 , 0 , 0 ,"eua" , 0 , 0
"9" , 17 , 1 , 6 ,"euia" , 0.65 , 0.58
"10" , 27 , 4 , 13 ,"eu" , 1 ,
"10" , , 0 , 0 ,"euia" , 0 ,
"12" , 14 , 1 , 13 ,"uia" , 1 , 0.75
"12" , 5 , 1 , 4 ,"ui eiuaea" , 1 , 1
"13" , 22 , 3 , 7 ," euia" , 0.89 , 1
"6" , 22 , 3 , 5 ,"euia" , 0.84 , 0.79
"7" , 23 , 5 , 4 ,"uia" , 0.78 , 1
"8" , 26 , 11 , 2 ,"euia" , 1.12 , 1.30
"5" , 28 , 3 , 3 ,"euia" , 0.72 , 0.68
```python
import pandas as pd

pd.set_option("display.max_columns", 1000)
pd.set_option("display.max_rows", 1000)
pd.set_option("display.width", 1000)
pd.set_option("display.max_colwidth", None)

# First try: default parsing
df = pd.read_csv("exemple.csv")
# df = pd.read_csv("exemple.csv", quoting=1)  # changes nothing
list(df.columns)
df.dtypes
list(df["field5 "])

# Second try: regex separator with the python engine
df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")
list(df.columns)
df.dtypes
list(df["field5"])

# Third try: quoting=2 (csv.QUOTE_NONNUMERIC)
df = pd.read_csv("exemple.csv", quoting=2)
list(df.columns)
df.dtypes
list(df["field5 "])

# Fourth try: quoting=3 (csv.QUOTE_NONE)
df = pd.read_csv("exemple.csv", quoting=3)
list(df.columns)
df.dtypes
list(df['"field5" '])

# Fifth try: explicit dtypes
df = pd.read_csv("exemple.csv", quoting=2, dtype={"field1 ": "object",
                                                  "field2 ": "Int32",  # fails
                                                  "field3 ": "int",
                                                  "field4 ": "int",
                                                  "field5 ": "object",
                                                  "field6 ": "float",
                                                  "field7": "float"})  # fails
```
Issue Description
Hello,
I tried to parse a file like the exemple given, and I spent an afternoon just on this. Nothing looks logical to me. So I am sorry, I will make one ticket for everything, cause it would be to long to make one for each problem. Fill free to divide it in several task.
The expected column dtypes look quite easy to guess: the user put quote marks around `field1` to force a string type. Fields 2-4 are expected to be integers. It would be almost understandable if `field2` were converted to a float, because NumPy's integer dtype cannot hold NA values; but pandas has a nullable integer type which can, so there is no reason. `field5` should be a string containing the text between the quote marks. Fields 6 and 7 are expected to be floats. Let's see what happens.
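The nullable integer type mentioned above does exist; a quick sketch of pandas' `Int32` extension dtype holding a missing value:

```python
import pandas as pd

# pandas' nullable integer dtype (capital "I") stores NA
# without falling back to float or object.
s = pd.array([14, None, 27], dtype="Int32")
print(s.dtype)        # Int32
print(pd.isna(s[1]))  # True
```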
First try: `df = pd.read_csv("exemple.csv")`
- The quote marks are removed from column names, but trailing spaces are kept. That is quite surprising, as there is no consistent logic: either you consider the quote marks to be text delimiters and remove them, but in that case, why keep characters outside the delimiters? Or you consider that everything is part of the string, in which case you must keep everything.
- dtypes are problematic:
  - `field1` has been implicitly converted to `int64`. The user explicitly asked for `str`. The convention "what is between quote marks is a string" is common to R, C++ and Python and widespread. Why not respect it?
  - `field2` is converted to a string. Missing values are a common case to handle; I would understand a conversion to float, or an error being raised, but why a conversion to a string?
  - `field5` has the same problem as the column names.
  - `field7` is converted to a string. Here it is not understandable at all, as `np.float64` handles NA values.
  - The other fields are correct, which is also a little surprising: so leading and trailing spaces cause problems in string fields and empty fields, but not in numeric fields?
Case: `df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")`
Here leading and trailing spaces are removed, but not the quote marks. This is probably already reported somewhere. The field types are fine, except for `field2`, which should be `Int32`.
Case: `df = pd.read_csv("exemple.csv", quoting=2)`
Here I tried to tell the method explicitly that quote marks mean string. Nonetheless it doesn't work. The integer fields are now floats, except for `field2` and `field7`, which are... strings!
Case: `df = pd.read_csv("exemple.csv", quoting=3)`
Here the parsing of column names and string fields is wrong, but at least consistent: it just keeps everything. Fields containing NA values are still converted to strings.
Case:

```python
df = pd.read_csv("exemple.csv", quoting=2, dtype={"field1 ": "object",
                                                  "field2 ": "Int32",  # fails
                                                  "field3 ": "int",
                                                  "field4 ": "int",
                                                  "field5 ": "object",
                                                  "field6 ": "float",
                                                  "field7": "float"})  # fails
```

This raises errors and does not handle the field names correctly.
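As a stopgap under the current behavior, one possibility (a sketch, not a fix for the underlying issue) is to drop the spaces after the delimiters, strip the column names, and then opt into nullable dtypes after the fact:

```python
import io

import pandas as pd

# Inline stand-in mimicking the file's layout: quoted first field,
# spaces around delimiters, a row with missing values.
data = '"f1" ,"f2" ,"f3"\n"1" , 14 , 0.54\n"2" , , \n'
df = pd.read_csv(io.StringIO(data), skipinitialspace=True)
df.columns = df.columns.str.strip()  # drop stray spaces in names
df = df.convert_dtypes()             # re-infer as nullable dtypes
print(df.dtypes)
```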
Expected Behavior
No implicit conversion. Never.
For string fields: I understand I may have to tweak the `quoting` and `quotechar` parameters, but once that is done, everything between quote marks should be a string, not an int or a float, and white space outside the quotes should be ignored.
For float fields containing NA values: they should be float fields with NA values.
For int fields containing NA values: ideally they should be parsed as pandas' nullable `IntXX`, which handles NA values; at minimum as `np.float64`. But never as strings.
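The nullable fallback asked for here is close to what `DataFrame.convert_dtypes` already does after the fact; a hedged sketch:

```python
import io

import pandas as pd

data = "a,b\n1,0.5\n,\n"
df = pd.read_csv(io.StringIO(data))
print(df.dtypes)     # both columns become float64 (NA forced the cast)

# Re-infer the best nullable dtype per column.
df2 = df.convert_dtypes()
print(df2.dtypes)    # a -> Int64, b -> Float64
```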
Installed Versions
Comment From: shyok21
Hi, I'd like to work on this issue. I've reproduced it locally and observed the problems as described. Before starting on a fix, I wanted to propose a few possible solution directions and ask for guidance on which approach aligns best with pandas' design philosophy.
✅ Observed Problems
1. Quoted values being cast to numeric types
Example CSV:

```
"field1","field2"
"1",2
"2",3
```

```python
df = pd.read_csv("example.csv")
print(df.dtypes)
```
Actual:
- `field1` → `int64`

Expected:
- Since the values are quoted, `field1` should be inferred as `object` (string)
Suggested Fix Options:

Option 1: [Proposed] Add a flag `infer_quoted_strings=True`

```python
df = pd.read_csv("example.csv", infer_quoted_strings=True)
print(df.dtypes)  # field1 → object
```

Option 2: Auto-infer quoted numeric values as strings. Internally, if a value was quoted, skip numeric coercion:

```
# Current:  "1" → 1   → int
# Proposed: "1" → "1" → str
```
2. Columns with missing values default to object
Example CSV row:

```
"10" , , 0 , 0 ,"euia", 0 ,
```

```python
df = pd.read_csv("example.csv")
print(df.dtypes)
```
Actual:
- `field2` → `object`
- `field7` → `object`

Expected:
- `field2` → `Int32` (nullable int)
- `field7` → `float64`
Suggested Fix Options:

Add a flag `dtype_backend="nullable"` such that if a column has numeric-looking values plus NA, the fallback should prefer:
- `Int32` for integers with NA
- `float64` for floats with NA

```python
df = pd.read_csv("example.csv", dtype_backend="nullable")
print(df.dtypes)  # field2 → Int32, field7 → float64
```
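Worth noting: since pandas 2.0, `read_csv` already accepts a `dtype_backend` parameter, with the value `"numpy_nullable"` selecting nullable extension dtypes, which is very close to this proposal. A sketch assuming pandas 2.x:

```python
import io

import pandas as pd

data = "a,b\n10,\n,0.5\n"
# numpy_nullable: integer-looking columns with NA stay integers
df = pd.read_csv(io.StringIO(data), dtype_backend="numpy_nullable")
print(df.dtypes)  # a -> Int64, b -> Float64
```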
3. Column names keep trailing spaces
CSV header:

```
"field1 ","field2 ","field5 "
```

```python
df = pd.read_csv("example.csv")
print(df.columns)  # ['field1 ', 'field2 ', 'field5 ']
```

Access like `df["field5"]` fails unless the user matches the exact spacing.
Suggested Fix Options:

Option 1: [Proposed] Add a flag `strip_column_names=True`

```python
df = pd.read_csv("example.csv", strip_column_names=True)
print(df.columns)  # ['field1', 'field2', 'field5']
```

Option 2: Just document a helper:

```python
df.columns = df.columns.str.strip()
```
🙋‍♂️ Request for Direction

Would love to hear your thoughts on:
- Which of the above ideas (if any) would be acceptable to implement?
- Should these be separated into multiple issues/PRs or handled together?
- Is it fine to add optional flags to control these behaviors?
Happy to start by writing tests first, or submitting a patch once the preferred approach is clear. Thanks!