Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

"field1" ,"field2" ,"field3" ,"field4" ,"field5"      ,"field6" ,"field7"
"1"      ,      14 ,       6 ,      21 ,"euia"        ,    0.54 ,    1
"2"      ,      30 ,       5 ,      26 ,"euia"        ,    0.82 ,    1
"2"      ,       1 ,       0 ,       0 ,"eua"         ,    0    ,    0
"3"      ,      27 ,       7 ,      17 ,"euia"        ,    1    ,    1
"4"      ,      14 ,       0 ,       9 ,"euia"        ,    0.64 ,    0.92
"4"      ,      10 ,       0 ,       0 ,"eua"         ,    0    ,    0
"9"      ,      17 ,       1 ,       6 ,"euia"        ,    0.65 ,    0.58
"10"     ,      27 ,       4 ,      13 ,"eu"          ,    1    ,     
"10"     ,         ,       0 ,       0 ,"euia"        ,    0    ,     
"12"     ,      14 ,       1 ,      13 ,"uia"         ,    1    ,    0.75
"12"     ,       5 ,       1 ,       4 ,"ui   eiuaea" ,    1    ,    1
"13"     ,      22 ,       3 ,       7 ," euia"       ,    0.89 ,    1
"6"      ,      22 ,       3 ,       5 ,"euia"        ,    0.84 ,    0.79
"7"      ,      23 ,       5 ,       4 ,"uia"         ,    0.78 ,    1
"8"      ,      26 ,      11 ,       2 ,"euia"        ,    1.12 ,    1.30
"5"      ,      28 ,       3 ,       3 ,"euia"        ,    0.72 ,    0.68



import pandas as pd


pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.width', 1000)
pd.set_option("display.max_colwidth", None)

df = pd.read_csv("exemple.csv")
# df = pd.read_csv("exemple.csv", quoting=1)  # changes nothing
list(df.columns)
df.dtypes
list(df["field5      "])

df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")
list(df.columns)
df.dtypes
list(df["field5"])

df = pd.read_csv("exemple.csv", quoting=2)
list(df.columns)
df.dtypes
list(df["field5      "])

df = pd.read_csv("exemple.csv", quoting=3)
list(df.columns)
df.dtypes
list(df['"field5"      '])

df = pd.read_csv("exemple.csv", quoting=2, dtype={"field1 ": "object",
                                                  "field2 ": "Int32",  # fail
                                                  "field3 ": "int",
                                                  "field4 ": "int",
                                                  "field5      ": "object",
                                                  "field6 ": "float",
                                                  "field7": "float"  # fail
                                                  })

Issue Description

Hello,

I tried to parse a file like the example given, and I spent an afternoon just on this. Nothing looks logical to me. So I am sorry, I will make one ticket for everything, because it would be too long to make one for each problem. Feel free to divide it into several tasks.

The expected column dtypes look quite easy to guess: the user used quotemarks on field1 to force a string type. Fields 2-4 are expected to be integers. It would be almost understandable if field2 were converted to a float, because the NumPy int dtypes don't handle NA values; but pandas has a nullable integer type which does, so there is no reason. field5 should be a string containing the text between the quotemarks. Fields 6 and 7 are expected to be floats. Let's see what happens.

First try: df = pd.read_csv("exemple.csv")

  • Column name quotemarks are removed, but trailing spaces are kept. That's quite surprising, as there is no consistent logic: either you consider quotemarks to be text delimiters and remove them, but in that case, why keep the characters outside the delimiters? Or you consider everything to be part of the string, and in that case you must keep everything.
  • dtypes are problematic:
    • field1 has been implicitly converted to int64. The user explicitly asked for a str. The convention "what is between quotemarks is a string" is common to R, C++ and Python, and widespread. Why not respect it?
    • field2 is converted to a string. Missing values are a common case to handle. I would understand a conversion to float, or an error being raised. But why a conversion to a string?
    • field5 has the same problem as the column names.
    • field7 is converted to a string. Here it is not understandable at all, as NumPy floats handle NA values.
    • The other fields are correct, which is also a little surprising. So leading and trailing spaces pose a problem in string fields and empty fields, but not in number fields?
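As an aside, the whitespace problems on the default engine can be worked around today by disabling inference entirely and stripping afterwards. This is only a sketch of a workaround, not the behavior being requested; the small inline CSV stands in for exemple.csv:

```python
import io

import pandas as pd

# Stand-in for exemple.csv: quoted values, padding around delimiters,
# and one empty field.
csv_text = '"field1" ,"field2"\n"1"      ,      14\n"2"      ,\n'

# dtype=str disables all numeric inference; skipinitialspace lets the
# parser recognise a quote that follows padding after a delimiter.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, dtype=str)

# Strip the padding the parser kept in headers and cells.
df.columns = df.columns.str.strip()
df = df.apply(lambda s: s.str.strip())
```

After this, field1 is ['1', '2'] as strings and the empty field2 cell stays a missing value instead of polluting the dtype.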

Case: df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")

Here leading and trailing spaces are removed, but not the quotemarks. This ticket is probably already open somewhere. Field types are OK, except for field2, which should be Int32.
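A regex separator forces the Python engine, which is documented to bypass quote handling, so the quotemarks stay in the data. Stripping them afterwards is a possible ad-hoc cleanup, sketched here on an inline stand-in for the file:

```python
import io

import pandas as pd

csv_text = '"field1" , "field2"\n"1" , 14\n"2" , 15\n'

# The regex separator eats the padding but ignores quoting entirely.
df = pd.read_csv(io.StringIO(csv_text), sep=r"\s*,\s*", engine="python")

# Remove the leftover quotemarks from headers and string columns.
df.columns = df.columns.str.strip('"')
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip('"')
```

Numeric columns such as field2 are unaffected by the loop, since they were already inferred as numbers.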

Case: df = pd.read_csv("exemple.csv", quoting=2)

Here I tried to explicitly tell the method that quotemarks mean string. Nonetheless it doesn't work. And the integer fields are now floats, except for field2 and field7, which are… strings!

Case: df = pd.read_csv("exemple.csv", quoting=3)

Here, the parsing of column names and string fields is wrong, but at least logical: it just keeps everything. Fields containing NA values are still converted to string.

Case: df = pd.read_csv("exemple.csv", quoting=2, dtype={"field1 ": "object", "field2 ": "Int32", "field3 ": "int", "field4 ": "int", "field5      ": "object", "field6 ": "float", "field7": "float"}) (the field2 and field7 entries fail)

This raises errors and doesn't handle the field names correctly.

Expected Behavior

No implicit conversion. Never.

For string fields: I understand I may have to tweak the quoting and quotechar parameters, but once that is done, everything between quotemarks should be a string, not an int or a float, and white space outside the quotemarks should be ignored.

For float fields containing NA values: they should be float fields with NA values.

For int fields containing NA values: ideally they should be parsed as a pandas IntXX dtype, which handles NA values; at minimum as a NumPy float. But never as a string.
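Part of this last point is already available behind an opt-in flag: since pandas 2.0, read_csv accepts dtype_backend="numpy_nullable", which yields nullable Int64/Float64 columns instead of falling back to float or object, at least when the fields themselves are clean (it does not fix the whitespace and quoting issues above). A minimal sketch on an inline CSV:

```python
import io

import pandas as pd

# Clean CSV: an integer column and a float column, each with a missing value.
csv_text = "a,b\n1,0.5\n,\n3,1.5\n"

df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

# a stays an integer column (nullable Int64) and b becomes Float64;
# missing entries are pd.NA rather than forcing a dtype change.
```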

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2cc37625532045f4ac55b27176454bbbc9baf213
python : 3.13.3
python-bits : 64
OS : Linux
OS-release : 6.12.34-1-MANJARO
Version : #1 SMP PREEMPT_DYNAMIC Thu, 19 Jun 2025 15:49:06 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 2.3.0
numpy : 2.3.1
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : 9.3.0
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : 3.10.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None

Comment From: shyok21

Hi, I'd like to work on this issue. I've reproduced it locally and observed the problems as described. Before starting on a fix, I wanted to propose a few possible solution directions and ask for guidance on which approach aligns best with pandas' design philosophy.

✅ Observed Problems

1. Quoted values being cast to numeric types

Example CSV:

"field1","field2"
"1",2
"2",3
df = pd.read_csv("example.csv")
print(df.dtypes)
Actual:

field1    int64

Expected:

Since values are quoted, field1 should be inferred as object (string)

Suggested Fix Options:

Option 1: [Proposed] Add a flag infer_quoted_strings=True

df = pd.read_csv("example.csv", infer_quoted_strings=True)
print(df.dtypes) # field1 → object

Option 2: Auto-infer quoted numeric as strings

Internally, if the value was quoted, skip numeric coercion:

# Current: "1" → 1 → int  
# Proposed: "1" → "1" → str
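Until something like this exists, the only reliable way to keep quoted numerals as strings is an explicit dtype. A sketch of the current workaround, using an inline copy of the example CSV:

```python
import io

import pandas as pd

csv_text = '"field1","field2"\n"1",2\n"2",3\n'

# Type inference ignores the quotes, so field1 must be pinned explicitly.
df = pd.read_csv(io.StringIO(csv_text), dtype={"field1": str})
```

field1 then comes back as object holding '1' and '2', while field2 stays int64.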

2. Columns with missing values default to object

Example CSV row:

"10" , , 0 , 0 ,"euia", 0 ,
df = pd.read_csv("example.csv")
print(df.dtypes)

Actual:

  • field2 → object
  • field7 → object

Expected:

  • field2 → Int32 (nullable int)
  • field7 → float64

Suggested Fix Options:

Add a flag dtype_backend="nullable" such that if a column has numeric-looking values plus NA, the fallback prefers:

  • Int32 for integers with NA
  • float64 for floats with NA

df = pd.read_csv("example.csv", dtype_backend="nullable") 
print(df.dtypes) # field2 → Int32, field7 → float64
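Worth noting: read_csv already exposes a dtype_backend parameter (since pandas 2.0) accepting "numpy_nullable", which covers much of this proposal on clean input; the remaining gap is that columns polluted by stray whitespace still come back as object. A quick check on an inline CSV assumed to be clean:

```python
import io

import pandas as pd

csv_text = "field2,field7\n14,1.0\n,\n27,0.5\n"

df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")
# field2 is inferred as a nullable integer and field7 as a nullable
# float, which is essentially what the proposal asks for.
```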

3. Column names keep trailing spaces

CSV header:

"field1 ","field2 ","field5      " 

df = pd.read_csv("example.csv")
print(df.columns) # ['field1 ', 'field2 ', 'field5      ']

Access like df["field5"] fails unless the user matches the exact spacing.

Suggested Fix Options:

Option 1: [Proposed] Add a flag strip_column_names=True

df = pd.read_csv("example.csv", strip_column_names=True)
print(df.columns) # ['field1', 'field2', 'field5']

Option 2: Just document a helper:

df.columns = df.columns.str.strip()
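For completeness, the Option 2 helper as a runnable sketch (the inline CSV stands in for the file, with the trailing spaces inside the quoted headers):

```python
import io

import pandas as pd

csv_text = '"field1 ","field2 "\n1,2\n'

df = pd.read_csv(io.StringIO(csv_text))
# The quoted headers contain real trailing spaces, so they survive parsing;
# stripping the column index removes them after the fact.
df.columns = df.columns.str.strip()
```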

🙋‍♂️ Request for Direction

Would love to hear your thoughts on:

  • Which of the above ideas (if any) would be acceptable to implement?
  • Should these be separated into multiple issues/PRs or handled together?
  • Is it fine to add optional flags to control these behaviors?

Happy to start by writing tests first, or submitting a patch once the preferred approach is clear. Thanks!