Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
from pathlib import Path
import pandas

out_path = Path('tmp.xml')

def generate_xml(N):
    record = '''
    <outer>
        <inner1>1</inner1>
        <inner2>b</inner2>
    </outer>
    '''
    with open(out_path, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<root_elem>')
        for i in range(1, N):
            f.write(record)
        f.write('</root_elem>')

## Bigger files take a long time to load, but for me the 'magic number'
## is between 10M and 20M records: 20M always fails to read, 10M always succeeds.
## I think the more accurate predictor is file size, which needs to be
## over roughly 1.2 GB.

## Small sanity check:
generate_xml(1000)
pandas.read_xml(out_path)

## The bug:
generate_xml(20000000)

from xml.etree import ElementTree as ET
## The file is obviously valid XML, and ElementTree can parse it
tmp = ET.parse(out_path)
## The following fails
pandas.read_xml(out_path)
```
Issue Description
Reading big XML files on Windows 11 (not reproduced on Linux) causes an `XMLSyntaxError: switching encoding: encoder error, line 1, column 1`.
The trigger size is between 1.2 and 1.6 GB.
Expected Behavior
To either parse the file or give an accurate error about running out of memory, etc. The misleading error is very frustrating when working with 3rd-party XMLs, where it's not inconceivable that they're in the wrong encoding, causing a futile search for a problem in the file.
Installed Versions
Comment From: rhshadrach
Thanks for the report - can you give this issue an informative title? Right now it is `BUG:`.
Comment From: liudvikasakelis
Sorry, my bad!
Comment From: ParfaitG
Thank you for your interesting report. Given the inconsistency of this behavior as you demonstrate across operating systems, client machines of varying memory levels, and XML documents, it may be difficult to raise a pandas-specific message to users. Additionally, pandas does not control the behavior or parsing messages of `etree` or `lxml` and simply reports out the underlying package's exceptions.

But overall, what is your real-world use case? While you point out an interesting exercise, do you really face a 20-million-node XML with incorrect encoding? For large GB-sized XML, consider the `iterparse` argument of `pandas.read_xml`, which avoids reading the entire XML into memory at one time. For wrongly encoded XML, check the creation of such markup. All well-formed XML should be created by W3C-compliant DOM libraries like Python's `etree` and `lxml`, using methods like `parse()` and `tree.write()` to read and write XML. Often, treating XML like simple text files causes such encoding or syntax issues.
Comment From: liudvikasakelis
Hi, I see now that I wasn't sufficiently direct in my original report. If you re-read it, I hope you'll see that
- the bug is triggered by perfectly valid input files in plain ASCII with no encoding issues
- the bug manifests on Windows
- the bug is triggered by modestly sized (~1 GB) files, nowhere near RAM capacity

It's up to the maintainers to decide the severity, but I don't feel your comment added clarity, as it misrepresented my claims about the bug. Thank you for your suggestions re encoding issues and `iterparse` :)