Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

read_xml() fails with the error message XMLSyntaxError: xmlSAX2Characters: huge text node when the file contains a very large text node.

The same error can be avoided when parsing the tree manually with lxml by enabling huge_tree:

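from lxml import etree
# Open in binary mode so lxml handles the encoding declaration itself.
with open(filename, "rb") as f:
    # huge_tree=True lifts libxml2's limit on very large text nodes.
    tree = etree.parse(f, etree.XMLParser(huge_tree=True))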

Feature Description

I am not sure what the best way to supply options to the parser would be.

Alternative Solutions

As a workaround, I currently have to read the file using the 'etree' parser instead of the default 'lxml' parser:

df = pd.read_xml(
    filename,
    parser='etree',
)

Additional Context

Similarly, other lxml parser options such as recover=True could be passed through in the same way.
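
For reference, here is a minimal sketch of what recover=True does when lxml is used directly. The malformed sample string is made up for illustration; a real file would be passed to etree.parse instead.

```python
from lxml import etree

# Hypothetical malformed input: the second <row> is never closed.
broken = b"<data><row>1</row><row>2</data>"

# Without recover=True, lxml raises XMLSyntaxError on this input.
try:
    etree.fromstring(broken)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# With recover=True, the parser keeps going and returns the tree it could build.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser)
first_row_text = root[0].text
```

recover=True and huge_tree=True can be combined in the same XMLParser call.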

Comment From: ParfaitG

Thank you for your report. My thought is that read_xml provides a convenience method to parse shallow, flatter XML into DataFrames. Since XML is an open-ended format that can vary in dimensions while DataFrames require two-dimensional data, read_xml is not meant to cover exceptional cases like the one you point out. Also, as you show, lxml provides much more functionality to parse any kind of XML.

Possibly, we could incorporate a kwargs implementation so that users can pass arguments to third-party connectors. But this may depart from the practice of other IO tools. Maintenance and testing could then become a concern, since kwargs would allow an open-ended number of arguments.

For special cases, consider using the upstream (lxml) package directly to parse the XML. Then retrieve the content in lists, dicts, etc. to pass on to DataFrames.
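
A minimal sketch of that approach, assuming a flat layout in which each row element has simple child elements. The helper name xml_to_frame, the row_xpath argument, and the sample document are hypothetical and only serve the illustration.

```python
import io

import pandas as pd
from lxml import etree


def xml_to_frame(source, row_xpath):
    """Parse XML with lxml's huge_tree option and flatten rows into a DataFrame."""
    # huge_tree=True disables libxml2's limits on very long text content;
    # other parser options (e.g. recover=True) could be added here as well.
    parser = etree.XMLParser(huge_tree=True)
    tree = etree.parse(source, parser)

    # One dict per row element, mapping child tag -> child text.
    rows = [
        {child.tag: child.text for child in row}
        for row in tree.xpath(row_xpath)
    ]
    return pd.DataFrame(rows)


# Hypothetical sample document standing in for a real file path.
sample = io.BytesIO(
    b"<data>"
    b"<row><id>1</id><note>first</note></row>"
    b"<row><id>2</id><note>second</note></row>"
    b"</data>"
)
df = xml_to_frame(sample, "//row")
```

All values arrive as strings, so any dtype conversion (for example with pd.to_numeric) would have to be done afterwards.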