Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pathlib import Path

import pandas as pd

# Save this into a Excel sheet, adjust path if needed
#    1   True      c
#    a      b   True
# True      1      1
filex = Path(__file__).parent / "intbool.xlsx"

# Read with Excel, getting:
#    0     1     2
# 0  1  True     c
# 1  a     b  True
# 2  1  True  True
dfx = pd.read_excel(filex, header=None, dtype=str)
print(dfx)

Issue Description

Weird interaction between Python & Excel. When both integers and booleans of the same value are present on the same column (1/True or 0/False), read_excel will cast all these to the value that appears first in each series.

  • Behaviour tested with Openpyxl & Calamine.
  • I expected that setting dtype=str would fix the issue, but it had no effect
  • read_csv doesn't have this issue. The data would be read as strings
  • writing Excel file from dataframe of strings will format excel content as string, thus preventing the issue to happen

I tracked the issue down to sanitize_objects coded in Cython. I believe the issue is from this piece of code: * Iterations over the content of the column. memo stores the values already known. * When current value val is in memo, reuse it. * As 1 == True is true, the first value of 1/True found in the series is used for all the upcoming 1/True. Similar behaviour with 0/False

https://github.com/pandas-dev/pandas/blob/e87248e1a5d6d78a138039f2856a3aec6b9fef54/pandas/_libs/parsers.pyx#L2143-L2144

Expected Behavior

I may not have the full picture, but i guess * current behaviour is fine when there are no strings in the series * when strings are found in the series, read everything as strings ? Not sure of this one * setting dtype=str in read_excel should read the series as containing only strs, thus preventing this conversion

Installed Versions

INSTALLED VERSIONS ------------------ commit : 4665c10899bc413b639194f6fb8665a5c70f7db5 python : 3.12.2 python-bits : 64 OS : Windows OS-release : 11 Version : 10.0.22631 machine : AMD64 processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_United Kingdom.1252 pandas : 2.3.2 numpy : 1.26.2 pytz : 2023.3.post1 dateutil : 2.8.2 pip : 24.0 Cython : None sphinx : None IPython : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None html5lib : None hypothesis : None gcsfs : None jinja2 : None lxml.etree : None matplotlib : 3.9.1 numba : None numexpr : None odfpy : None openpyxl : 3.1.5 pandas_gbq : None psycopg2 : None pymysql : None pyarrow : None pyreadstat : None pytest : 8.0.0 python-calamine : None pyxlsb : None s3fs : None scipy : 1.14.0 sqlalchemy : None tables : None tabulate : None xarray : 2024.6.0 xlrd : None xlsxwriter : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

Comment From: asishm

Thanks for the report. This seems to be a duplicate of https://github.com/pandas-dev/pandas/issues/60088