Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
from pathlib import Path
import pandas

out_path = Path('tmp.xml')

def generate_xml(N):
    record = '''
    <outer>
        <inner1>1</inner1>
        <inner2>b</inner2>
    </outer>
    '''
    with open(out_path, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<root_elem>')
        for i in range(1, N):
            f.write(record)
        f.write('</root_elem>')

## Bigger files take a long time to load, but for me the 'magic number'
## is between 10M and 20M records: 20M always fails to read, 10M always succeeds.
## I think the more accurate predictor is file size, which needs to be
## over roughly 1.2 GB.

## Small sanity check:
generate_xml(1000)
pandas.read_xml(out_path)

## The bug:
generate_xml(20000000)

from xml.etree import ElementTree as ET
## The file is obviously valid XML, and ElementTree can parse it
tmp = ET.parse(out_path)
## The following fails
pandas.read_xml(out_path)
```
Issue Description
Reading big XML files on Windows 11 (not reproduced on Linux) causes an `XMLSyntaxError: switching encoding: encoder error, line 1, column 1`.
The trigger size is between 1.2 and 1.6 GB.
Expected Behavior
To either parse the file or give an accurate error about running out of memory, etc. The misleading error is very frustrating when working with 3rd-party XMLs, where it's not inconceivable that they're in the wrong encoding, causing a futile search for a problem in the file.
Installed Versions
Comment From: rhshadrach
Thanks for the report - can you give this issue an informative title? Right now it is `BUG:`.
Comment From: liudvikasakelis
Sorry, my bad!
Comment From: ParfaitG
Thank you for your interesting report. Given the inconsistency of this behavior as you demonstrate across operating systems, client machines of varying memory levels, and XML documents, it may be difficult to raise a pandas-specific message to users. Additionally, pandas does not control the behavior or parsing messages of `etree` or `lxml` and simply reports out the underlying package's exceptions.

But overall, what is your real-world use case? While you point out an interesting exercise, do you really face a 20-million-node XML with incorrect encoding? For large GB-sized XML, consider the `iterparse` argument of `pandas.read_xml`, which avoids reading the entire XML into memory at one time. For wrongly encoded XML, check the creation of such markup. All well-formed XML should be created by W3C-compliant DOM libraries like Python's `etree` and `lxml`, using methods like `parse()` and `tree.write()` to read and write XML. Often, treating XML like simple text files causes such encoding or syntax issues.
Comment From: liudvikasakelis
Hi, I see now that I wasn't sufficiently direct in my original report. If you re-read it, I hope you'll see that
- the bug is triggered by perfectly valid input files in plain ASCII with no encoding issues
- the bug manifests on Windows
- the bug is triggered by modestly sized (~1 GB) files, nowhere near RAM capacity

It's up to the maintainers to decide the severity, but I don't feel your comment added clarity, as it misrepresented my claims about the bug. Thank you for your suggestions re encoding issues and `iterparse` :)