Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
read_xml() fails with the error message XMLSyntaxError: xmlSAX2Characters: huge text node.
A similar problem can be overcome when manually parsing the tree like so:
from lxml import etree

with open(filename) as f:
    tree = etree.parse(f, etree.XMLParser(huge_tree=True))
Feature Description
I am not sure what the best way to supply options to the parser would be.
Alternative Solutions
Right now, I have to read the file using the 'etree' parser like so:
df = pd.read_xml(
    filename,
    parser='etree',
)
Additional Context
Similarly, the following option could be passed to the parser: recover=True.
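For illustration, here is a minimal sketch of what recover=True does when lxml is used directly. The malformed XML string is made up for the example, and the exact shape of the recovered tree depends on libxml2's recovery heuristics, so treat the result as best effort rather than guaranteed.

```python
from lxml import etree

# Deliberately malformed sample (made up for this sketch): <b> is never closed.
broken = b"<root><a>1</a><b>2</root>"

# The default (strict) parser rejects the document.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# recover=True asks libxml2 to keep going and return whatever tree it can build;
# huge_tree=True additionally lifts the size limits behind "huge text node".
lenient = etree.XMLParser(recover=True, huge_tree=True)
root = etree.fromstring(broken, parser=lenient)

print(strict_ok, root.tag, [el.tag for el in root])
```

Both options are keyword arguments of lxml's XMLParser, which is why some way of forwarding parser options from read_xml would cover them at once.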
Comment From: ParfaitG
Thank you for your report. My thought is that read_xml provides a convenience method to parse shallow, flatter XML into DataFrames. Since XML is an open-ended format that can vary in dimensions while DataFrames require two-dimensional data, read_xml is not meant to cover exceptional cases like the one you point out. Also, as you show, lxml provides much more functionality to parse any kind of XML.
Possibly, we could incorporate a kwargs implementation for users to pass arguments to third-party connectors? But this may depart from the practice of other IO tools. Maintenance and testing could then become a concern, since kwargs would allow an open-ended number of arguments.
For special cases, consider using the upstream package (lxml) directly to parse XML. Then retrieve the content into lists, dicts, etc. to pass on to DataFrames.
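As a rough sketch of that route: the helper below parses with huge_tree=True and collects one dict per row element before building the DataFrame. The name read_xml_huge, its row_xpath parameter, and the sample XML are hypothetical and not part of pandas or lxml. It also assumes a shallow layout in which each child of the root is one row.

```python
from io import BytesIO

import pandas as pd
from lxml import etree


def read_xml_huge(source, row_xpath="./*"):
    """Hypothetical helper: parse with lxml's huge_tree option, return a DataFrame."""
    # huge_tree=True disables libxml2's safety limits on text node size and depth.
    parser = etree.XMLParser(huge_tree=True)
    root = etree.parse(source, parser).getroot()

    rows = []
    for row in root.xpath(row_xpath):
        # One dict per row element: child tag -> child text.
        # Comments and processing instructions have non-string tags and are skipped.
        rows.append(
            {child.tag: child.text for child in row if isinstance(child.tag, str)}
        )
    return pd.DataFrame(rows)


# Small made-up sample; a real use case would pass a path or an open binary file.
xml = b"""<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>circle</shape><sides/></row>
</data>"""

df = read_xml_huge(BytesIO(xml))
print(df)
```

Unlike read_xml, this sketch does no type inference, so every value arrives as a string (or None for empty elements) and would need converting afterwards.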