Pandas BUG: Automatic transformation from int to bool with read_excel

Pandas version checks

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pathlib import Path

import pandas as pd

# Save this into an Excel sheet, adjust path if needed
#    1   True      c
#    a      b   True
# True      1      1
filex = Path(__file__).parent / "intbool.xlsx"

# Read with read_excel, getting:
#    0     1     2
# 0  1  True     c
# 1  a     b  True
# 2  1  True  True
dfx = pd.read_excel(filex, header=None, dtype=str)
print(dfx)

Issue Description

Weird interaction between Python & Excel. When integers and booleans that compare equal (1/True or 0/False) are both present in the same column, read_excel casts all of them to whichever form appears first in that column.

Behaviour tested with openpyxl & calamine.
I expected that setting dtype=str would fix the issue, but it had no effect.
read_csv doesn't have this issue: the data is read as strings.
Writing an Excel file from a DataFrame of strings formats the Excel content as strings, which prevents the issue from happening.

I tracked the issue down to sanitize_objects, which is written in Cython. I believe the issue comes from this piece of code:

* It iterates over the content of the column. memo stores the values already seen.
* When the current value val is in memo, the stored value is reused.
* As 1 == True is true, the first 1/True found in the series is used for all subsequent 1/True values. 0/False behaves similarly.

https://github.com/pandas-dev/pandas/blob/e87248e1a5d6d78a138039f2856a3aec6b9fef54/pandas/_libs/parsers.pyx#L2143-L2144
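
The memo behaviour described above can be sketched in pure Python. This is only an illustration of the mechanism, not the Cython code from pandas: the function name `dedupe_with_memo` is hypothetical, and a plain dict stands in for memo.

```python
def dedupe_with_memo(values):
    """Hypothetical pure-Python stand-in for the memo logic described above."""
    memo = {}
    out = []
    for val in values:
        if val in memo:
            # 1 and True hash alike and compare equal, so whichever was
            # stored first is returned for both.
            out.append(memo[val])
        else:
            memo[val] = val
            out.append(val)
    return out


# First and second columns of the sheet from the reproducible example.
col0 = dedupe_with_memo([1, "a", True])
col1 = dedupe_with_memo([True, "b", 1])

print([type(v).__name__ for v in col0])  # ['int', 'str', 'int']
print([type(v).__name__ for v in col1])  # ['bool', 'str', 'bool']
```

In the first column the trailing True comes back as the int 1, and in the second column the trailing 1 comes back as True, which matches the output shown in the reproducible example.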

Expected Behavior

I may not have the full picture, but I guess:

* The current behaviour is fine when there are no strings in the series.
* When strings are found in the series, read everything as strings? Not sure about this one.
* Setting dtype=str in read_excel should read the series as containing only strs, thus preventing this conversion.
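
To make the last point concrete, here is a minimal sketch of one way the values could stay distinct before a string conversion, assuming the memo were keyed on both type and value. The function name `dedupe_type_aware` is hypothetical, and this is not a claim about how pandas does or should implement it.

```python
def dedupe_type_aware(values):
    """Hypothetical memo keyed on (type, value), so 1 and True stay distinct."""
    memo = {}
    out = []
    for val in values:
        # (int, 1) and (bool, True) are different keys because the types differ.
        key = (type(val), val)
        out.append(memo.setdefault(key, val))
    return out


# First column of the example sheet, then converted as dtype=str would imply.
as_str = [str(v) for v in dedupe_type_aware([1, "a", True])]
print(as_str)  # ['1', 'a', 'True']
```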

Installed Versions

INSTALLED VERSIONS
------------------
commit : 4665c10899bc413b639194f6fb8665a5c70f7db5
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United Kingdom.1252
pandas : 2.3.2
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
pip : 24.0
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.0.0
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 2024.6.0
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: asishm

Thanks for the report. This seems to be a duplicate of https://github.com/pandas-dev/pandas/issues/60088