Proposal Details
Background
As reported by @DemiMarie:
The encoding/xml package does not properly validate that the characters within comments, processing instructions, or directives are properly within the CharData range as defined by the XML specification.
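For illustration, a minimal sketch of the reported behavior (the document below is ill-formed XML 1.0 because U+0001 is outside the Char production, yet the decoder tokenizes it without complaint on affected Go versions):

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

func main() {
	// U+0001 inside a comment is forbidden by XML 1.0.
	doc := "<root><!-- \x01 --></root>"
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := d.Token()
		if err == io.EOF {
			fmt.Println("parsed without a validation error")
			return
		}
		if err != nil {
			fmt.Println("error:", err)
			return
		}
	}
}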
Proposal
Add a GODEBUG flag, xmlvalidatechars=1, which enables stricter validation of characters within comments, processing instructions, and directives. I understand that changing XML parsing behavior can sometimes lead to unexpected or breaking changes, but I have tested what would happen if this flag were enabled by default internally and ran into zero issues.
Comment From: DemiMarie
I’d prefer if strict validation were on by default, at least eventually if not initially. Would this be possible, or would it break the Go 1 stability guarantee?
Comment From: ianlancetaylor
Making encoding/xml do strict validation by default would be OK with the Go 1 compatibility guarantee. Adding the GODEBUG would give people an easy way to back off.
That said: why would strict validation be a good default? Who would that help in practice?
Comment From: DemiMarie
Making encoding/xml do strict validation by default would be OK with the Go 1 compatibility guarantee. Adding the GODEBUG would give people an easy way to back off.
That said: why would strict validation be a good default? Who would that help in practice?
XML with malformed characters is ill-formed, so the Principle of Least Astonishment suggests that it should be rejected.
Comment From: ianlancetaylor
XML with malformed characters is ill-formed, so the Principle of Least Astonishment suggests that it should be rejected.
If we were starting from scratch, I would certainly agree. But we aren't. Making this change will most likely break some existing working code, which in an ecosystem that stresses backward compatibility is astonishing in a different way.
Comment From: DemiMarie
XML with malformed characters is ill-formed, so the Principle of Least Astonishment suggests that it should be rejected.
If we were starting from scratch, I would certainly agree. But we aren't. Making this change will most likely break some existing working code, which in an ecosystem that stresses backward compatibility is astonishing in a different way.
Would it be better to create a new XML parsing entry point and mark the old entry points as deprecated? There are a ton of problems in encoding/xml, and absolutely strict well-formedness checking might break a lot of code.
Comment From: ianlancetaylor
I think that we should have a plan for encoding/xml/v2. It's clear that there are many problems with encoding/xml. But that is a big undertaking.
In the meantime, we need to decide about this proposal, which is not about v2.
Comment From: DemiMarie
I think that we should have a plan for encoding/xml/v2. It's clear that there are many problems with encoding/xml.
👍
But that is a big undertaking.
💯
In the meantime, we need to decide about this proposal, which is not about v2.
Fair!
You have much more experience than I do when it comes to the Go ecosystem, so I think it would be better for you to make this decision. Other options include adding an extra flag field or setter method.
Comment From: aclements
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings.
Comment From: aclements
Just to make sure I understand, the specific part of the XML specification you're talking about validating is the Char non-terminal as used in Comment, PI, and CData, right? This is what CL 601815 implements.
Comment From: aclements
There are many ways in which encoding/xml is not as strict as it ought to be. Making it stricter in any way presents compatibility problems, and I don't think we want to introduce separate GODEBUGs for every way it could be stricter. At the same time, we're not going to tackle all of the ways it could be stricter at once (we probably don't even know all of the ways!), so it's not practical to just have a big switch for stricter parsing.
It's also not clear that a GODEBUG is the right way to control this. GODEBUG is very coarse-grained. We already have decoder configuration in the encoding/xml.Decoder type, so this would seem to be a natural place to put control over this. However, encoding/xml.Decoder already has a Strict field and, unfortunately, it already defaults to true! We're not going to add a Stricter field. :)
So here's a counter-proposal: we add a Lax field to Decoder (I'm open to better names) that's a bit field of ways in which the parser should be non-strict. That way, the Strict field continues to default to true, but we tweak it to mean "be as strict as you can about parsing, except in the ways specified by the Lax field". For XML char parsing, we'd define a bit in the Lax bit field, say LaxChars. For compatibility, this bit would be set by default in the Lax field. Any other ways we make the parser more strict in the future would get their own bits. If an XML consumer wants the strictest possible parsing, it can set Lax to 0. If a consumer wants to be strict, but knows there are some issues in the data it's parsing, it can set Lax to just the bits that make sense for what it's parsing. If a strictness check in the XML decoder fails, it can report a helpful error like "consider setting LaxChars". Finally, we could define one GODEBUG, say xmllax, that can take a list of the bit names, such as xmllax=chars,etc, to override the default value of the Lax field.
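For concreteness, a minimal sketch of the shape of this idea; every name here is hypothetical, since this counter-proposal was not what was ultimately adopted:

// Hypothetical sketch of the Lax bit field; none of these names exist
// in encoding/xml.
type Lax uint

const (
	// LaxChars skips validation of characters in comments,
	// processing instructions, and directives.
	LaxChars Lax = 1 << iota
	// Future relaxations would get their own bits.
)

Under this scheme, NewDecoder would default Lax to LaxChars for compatibility, and a consumer wanting the strictest possible parsing would set d.Lax = 0.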
Comment From: Merovius
@aclements I don't think I understand your Lax suggestion.
For compatibility, this bit would be set by default in the Lax field. […] If an XML consumer wants the strictest possible parsing, it can set Lax to 0.
This doesn't really make sense to me. If Lax is 0, for compatibility's sake, the Decoder must behave as it currently does, because a new field in existing code will default to 0. Any additional checks performed must correspond to more bits being 1.
So, ISTM the straightforward way to do this is to have a bitfield ExtraChecks and a bunch of constants a user can | into it to enable extra checks (a sketch follows below). And if it is 0 (the default) it behaves as it does right now. In my opinion, at that point we don't even need a GODEBUG at all.
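A sketch of that alternative shape, with hypothetical names (nothing here exists in encoding/xml):

// Hypothetical ExtraChecks bit field: zero means today's behavior,
// and each set bit enables one additional check.
type ExtraChecks uint

const (
	CheckChars ExtraChecks = 1 << iota // validate chars in comments/PIs/directives
	CheckUTF8                          // example of a possible future bit
)

A caller would then opt in with d.ExtraChecks = CheckChars | CheckUTF8, and the zero value preserves current behavior.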
To me, the Lax thing you describe seems very confusing and full of double negatives, just to avoid the wart of having Strict bool; Stricter ExtraChecks (or whatever). Yes, it's a wart to have two fields. But it seems a preferable wart to expecting people to reason their way through inverting the meaning of "a bit being set" and what the "default" is, and introducing extra GODEBUGs.
If a strictness check in the XML decoder fails, it can report a helpful error like "consider setting LaxChars"
I am against this. I believe the error message should report the spec violation. We parse customer XML data and we pass the error message reported by encoding/xml to our customers. If they read "consider setting LaxChars", that is not helpful for them. They need to know how to fix their XML.
That is, the target audience of "set LaxChars" is the developer, not the consumer of the error message.
To be helpful to us, that is, to make it more discoverable that we need to set LaxChars, it is sufficient to have a constant or sentinel error in encoding/xml. That way, when the customer says "we are getting the error invalid character in input, but for weird reasons we can't fix our XML, can you disable that check", we can grep for invalid character in input, find the corresponding constant with the comment "if you want to disable this check, set LaxChars" (or whatever).
Comment From: aclements
If Lax is 0, for compatibility's sake, the Decoder must behave as it currently does, because a new field in existing code will default to 0.
The zero value of Decoder is already not usable. You have to call NewDecoder. So in this case we don't have the constraint that new fields need to default to 0.
But I definitely see your argument that Lax is sort of a double negative.
What I'm trying to avoid is having Strict: true mean the fairly arbitrary set of ways in which encoding/xml happens to be strict as of 2024. Maybe at some level that's unavoidable.
One thing I like about Lax is that it's really easy to ask for "the strictest parsing this version of Go can do" by just setting Lax to 0. With a "Stricter" approach, I think we'd need to define some other way to ask for the strictest thing, whether that's a const whose value can change in newer versions, or a function/method.
Orthogonally, this doesn't have to be a bit field. It could be a set of bool fields.
Comment From: aclements
Stepping back, I think the first question is, why is stricter XML char parsing actually necessary? I don't think that's been answered in this issue.
Comment From: DemiMarie
@aclements Programs might assume that encoding/xml has done validation and that therefore they don’t need to do it themselves.
Comment From: ianlancetaylor
@DemiMarie Thanks, but that kind of pushes the problem back one level. Why do those programs need to validate the XML?
I'm not trying to be obnoxious, we're just trying to understand the need.
Comment From: DemiMarie
@ianlancetaylor You aren’t being obnoxious 🙂. The general reason would be that they pass the XML, or data extracted from the XML, to something that would be safe if encoding/xml had validated the XML, but is not safe because it has not done such validation. For instance, if CharData excludes ANSI escape codes, then printing the data on a PTY would be safe if encoding/xml validated it, but is not safe in the absence of such validation.
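To make that scenario concrete, a sketch of the failure mode (payload and program are illustrative; it relies on the unvalidated-comment behavior this issue reports):

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

func main() {
	// \x1b is ESC, which is outside the XML 1.0 Char range, yet the
	// decoder passes it through inside a comment.
	doc := "<root><!-- \x1b[31mgotcha\x1b[0m --></root>"
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := d.Token()
		if err != nil {
			return // io.EOF; no validation error was reported
		}
		if c, ok := tok.(xml.Comment); ok {
			// Logging the supposedly validated token writes a live
			// ANSI escape sequence to the terminal.
			fmt.Printf("comment: %s\n", c)
		}
	}
}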
Comment From: aclements
Thanks @DemiMarie, that's reasonably convincing.
The tension between my proposed Lax approach and @Merovius's proposed Stricter approach comes from the "in between" strictness of the current Strict field.
@ianlancetaylor suggested that we effectively deprecate the Strict field and instead have a full Strictness that allows control over the ways we're currently strict as well as new ways. If Strictness is the zero value, then we use the Strict field, which we can redefine into a value of Strictness; otherwise, we ignore the Strict field.
The default returned by NewDecoder would still set Strict to true and would set Strictness to the zero value, maintaining the current behavior. But if the user sets Strictness, we'll ignore Strict. This gets a little ambiguous if they want no strictness: setting Strictness to the zero value wouldn't achieve anything, so that would be done by setting Strict to false, like today. I think this would resolve the question about whether we go "up" or "down" from Strict.
That leaves the question of how Strictness should be defined. I'm inclined to make it a struct of bool-typed fields. This is more readable than, say, a uint64 bit field, avoids limits, makes it easier to print, etc. We could map directly between GODEBUG flag names and the names of fields in this struct.
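A sketch of the precedence rule being described, with hypothetical names (Strictness does not exist in encoding/xml; the field set is illustrative):

// effectiveStrictness resolves the legacy Strict bool against a new
// Strictness struct: the zero value defers to Strict, anything else wins.
type Strictness struct {
	EndTags  bool
	Entities bool
	Chars    bool
}

func effectiveStrictness(strict bool, s Strictness) Strictness {
	if s == (Strictness{}) { // user never set Strictness
		if strict {
			// Redefine Strict as a particular Strictness value.
			return Strictness{EndTags: true, Entities: true}
		}
		return Strictness{} // Strict=false: no checks, like today
	}
	return s // Strictness was set explicitly; Strict is ignored
}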
Comment From: DemiMarie
@aclements I would like an easy way to get the strictest (read: most standards-compliant) parsing a given version of Go supports, without having to change the source code of the program.
In the long term, I recommend differential fuzzing of encoding/xml (or its standards-compliant successor) against libxml2 or another standards-compliant XML parser. This will catch bugs like these before they are released.
Comment From: ianlancetaylor
If we do tie these new flags to GODEBUG settings, then you will be able to set strictness defaults by setting an environment variable. Note that it will still be possible for the program to override those defaults.
Comment From: DemiMarie
What I mean is that as a program author, I want to get the strictest parsing the given version of Go supports, regardless of GODEBUG.
Comment From: ianlancetaylor
I see. If I understand you correctly, you want some way to ask for the strictest version of XML parsing, such that if some new Go release adds a new strictness flag, you get that turned on by default.
One approach would be to add a new function NewStrictDecoder, which returns a Decoder with maximal strictness. The existing NewDecoder function would continue to function as it does today.
Comment From: DemiMarie
@ianlancetaylor You understand correctly.
Comment From: rsc
People often say they want the strictest setting possible, but then when we add a new check and their program stops processing inputs it used to accept, they often change their mind. It's not clear we should offer a "be as strict as possible, including rejecting new things tomorrow" mode.
What if the "Strictness" field is called Check, as in
Check struct {
	ValidChars bool
	ValidUTF8  bool
	...
}
and the Strict bool just implies a couple of these are forced on. At that point Strict is maybe not quite deprecated but not necessary anymore.
Comment From: iwdgo
It might be worth including https://github.com/golang/go/issues/25755 in this proposal
Comment From: ianlancetaylor
@iwdgo #25755 is about XML version 1.1 support, which seems entirely unrelated to how we set the strictness level of XML parsing. What did you have in mind? Thanks.
Comment From: iwdgo
A comprehensive list of topics (namespaces, processing instructions, Unicode, ...) could be treated as stricter XML 1.0. As mentioned above, stricter is not always well accepted. Implementing XML 1.1, selected via the foreseen prolog, would introduce a version that adheres to a published standard.
The XML 1.1 prolog is only usable for full XML documents, which implies that partial documents would keep the current behaviour. That should avoid the permanent concern about breaking code. Quoting the XML 1.1 introduction to feed this discussion seems fair:
Therefore XML 1.1 adds NEL (#x85) to the list of line-end characters. For completeness, the Unicode line separator character, #x2028, is also supported.
Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
Finally, XML 1.1 defines a set of constraints called "full normalization" on XML documents, which document creators SHOULD adhere to, and document processors SHOULD verify. Using fully normalized documents ensures that identity comparisons of names, attribute values, and character content can be made correctly by simple binary comparison of Unicode strings.
Comment From: aclements
People often say they want the strictest setting possible, but then when we add a new check and their program stops processing inputs it used to accept, they often change their mind.
With a Check struct, if someone really wants maximal strictness, they can always use reflection to construct the maximally strict value of Check!
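A sketch of that reflection trick, against a hypothetical Checks struct of bool fields (field names illustrative):

package main

import (
	"fmt"
	"reflect"
)

type Checks struct {
	EndTag   bool
	Entities bool
	Chars    bool
}

// maxChecks sets every bool field to true, so checks added in future
// Go versions would be picked up automatically.
func maxChecks() Checks {
	var c Checks
	v := reflect.ValueOf(&c).Elem()
	for i := 0; i < v.NumField(); i++ {
		if f := v.Field(i); f.Kind() == reflect.Bool {
			f.SetBool(true)
		}
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", maxChecks()) // {EndTag:true Entities:true Chars:true}
}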
and the Strict bool just implies a couple of these are forced on. At that point Strict is maybe not quite deprecated but not necessary anymore.
This seems like a good simplification.
Comment From: aclements
Have all remaining concerns about this proposal been addressed?
The proposal is to add a Check field to xml.Decoder and redefine the Strict field:
// Strict enforces some validation requirements of the XML
// specification. Setting Strict is equivalent to setting
// Check.EndTag and Check.Entites to true.
//
// Strict defaults to true.
// If set to false, the parser allows input containing common
// mistakes:
// * If an element is missing an end tag, the parser invents
// end tags as necessary to keep the return values from Token
// properly balanced.
// * In attribute values and character data, unknown or malformed
// character entities (sequences beginning with &) are left alone.
//
// ...
Strict bool

// Check enables enforcement of specific validation requirements
// of the XML specification.
//
// If the Strict field is true, then EndTag and Entites are validated
// even if they are set to false here.
Check struct {
	// EndTag validates that all tags have matching end tags.
	// If both EndTag and Strict are false and an element is missing
	// an end tag, the parser invents end tags as necessary to keep
	// the return values from Token properly balanced.
	EndTag bool

	// Entities validates that character entites (sequences beginning with &)
	// in attribute values and character data are well-formed.
	// If both Entites and Strict are false, unknown or malformed
	// character entities are left alone.
	Entities bool

	// Chars validates that characters within comments, processing instructions,
	// and directives are properly within the XML CharData range.
	Chars bool

	...
}
We will also define a new GODEBUG=xmlcheck setting that is a comma-separated list of field names from the Check struct to set to true by default in NewDecoder.
In the future, additional fields may be added to the Check struct. #68293 contains a long list of ways the xml package currently accepts ill-formed XML that could be addressed by adding check fields.
Comment From: aclements
Based on the discussion above, this proposal seems like a likely accept. — aclements for the proposal review group
The proposal is to add a Check field to xml.Decoder and redefine the Strict field:
// Strict enforces some validation requirements of the XML
// specification. Setting Strict is equivalent to setting
// Check.EndTag and Check.Entites to true.
//
// Strict defaults to true.
// If set to false, the parser allows input containing common
// mistakes:
// * If an element is missing an end tag, the parser invents
// end tags as necessary to keep the return values from Token
// properly balanced.
// * In attribute values and character data, unknown or malformed
// character entities (sequences beginning with &) are left alone.
//
// ...
Strict bool

// Check enables enforcement of specific validation requirements
// of the XML specification.
//
// If the Strict field is true, then EndTag and Entites are validated
// even if they are set to false here.
Check struct {
	// EndTag validates that all tags have matching end tags.
	// If both EndTag and Strict are false and an element is missing
	// an end tag, the parser invents end tags as necessary to keep
	// the return values from Token properly balanced.
	EndTag bool

	// Entities validates that character entites (sequences beginning with &)
	// in attribute values and character data are well-formed.
	// If both Entites and Strict are false, unknown or malformed
	// character entities are left alone.
	Entities bool

	// Chars validates that characters within comments, processing instructions,
	// and directives are properly within the XML CharData range.
	Chars bool

	...
}
We will also define a new GODEBUG=xmlcheck setting that is a comma-separated list of field names from the Check struct to set to true by default in NewDecoder.
In the future, additional fields may be added to the Check struct. #68293 contains a long list of ways the xml package currently accepts ill-formed XML that could be addressed by adding check fields.
Comment From: rogpeppe
Drive-by comment: perhaps it would be nicer to define a new type for the value of the Check field so it's possible to declare and use it independently from the Decoder type. Inline struct types can be frustrating.
Comment From: aclements
Good point. We can certainly give it a name.
Comment From: aclements
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — aclements for the proposal review group
The proposal is to add a Check field to xml.Decoder and redefine the Strict field:
// Strict enforces some validation requirements of the XML
// specification. Setting Strict is equivalent to setting
// Check.EndTag and Check.Entities to true.
//
// Strict defaults to true.
// If set to false, the parser allows input containing common
// mistakes:
// * If an element is missing an end tag, the parser invents
// end tags as necessary to keep the return values from Token
// properly balanced.
// * In attribute values and character data, unknown or malformed
// character entities (sequences beginning with &) are left alone.
//
// ...
Strict bool
// Check enables enforcement of specific validation requirements
// of the XML specification.
//
// If the Strict field is true, then EndTag and Entities are validated
// even if they are set to false here.
Check Checks
The Checks type is defined as:
type Checks struct {
	// EndTag validates that all tags have matching end tags.
	// If both EndTag and Strict are false and an element is missing
	// an end tag, the parser invents end tags as necessary to keep
	// the return values from Token properly balanced.
	EndTag bool

	// Entities validates that character entities (sequences beginning with &)
	// in attribute values and character data are well-formed.
	// If both Entities and Strict are false, unknown or malformed
	// character entities are left alone.
	Entities bool

	// Chars validates that characters within comments, processing instructions,
	// and directives are properly within the XML CharData range.
	Chars bool

	...
}
We will also define a new GODEBUG=xmlcheck setting that is a comma-separated list of field names from the Check struct to set to true by default in NewDecoder.
In the future, additional fields may be added to the Checks struct. #68293 contains a long list of ways the xml package currently accepts ill-formed XML that could be addressed by adding check fields.
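Assuming the API lands as described, usage might look like this sketch (the Check field does not exist in released Go at the time of this discussion; the input is ill-formed because U+0001 is outside the XML Char range):

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

func main() {
	doc := "<root><!-- \x01 --></root>"
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Check.Chars = true // per the accepted proposal; hypothetical today
	for {
		if _, err := d.Token(); err != nil {
			fmt.Println(err) // expect an invalid-character error, not io.EOF
			return
		}
	}
}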
Comment From: iwdgo
In the approved comments, I suppose that "Entities" should be read instead of "Entites".
While looking at the implementation, I noticed that defining xmlcheck as a comma-separated list does not totally respect the existing definition (https://pkg.go.dev/runtime#hdr-Environment_Variables), which says:
The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:
For instance, GODEBUG=xmlcheck=endtags,entities,installgoroot=all is cumbersome to read using the usual strings.Split(os.Getenv("GODEBUG"), ","). In case revision is possible, using a colon : to separate checks seems simpler to analyze. The string would be GODEBUG=xmlcheck=endtags:entities,installgoroot=all
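A quick sketch of the ambiguity being described: the conventional split on commas and then on = cannot tell where the xmlcheck list ends.

package main

import (
	"fmt"
	"strings"
)

func main() {
	godebug := "xmlcheck=endtags,entities,installgoroot=all"
	for _, kv := range strings.Split(godebug, ",") {
		name, value, _ := strings.Cut(kv, "=")
		fmt.Printf("name=%q value=%q\n", name, value)
	}
	// Output:
	// name="xmlcheck" value="endtags"
	// name="entities" value=""   <- misparsed as a separate setting
	// name="installgoroot" value="all"
}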
Comment From: seankhliao
wouldn't + be more in line with combining checks? It has precedent in netdns=go+2.
Comment From: aclements
In the approved comments, I suppose that "Entities" should be read instead of "Entites".
At least I was consistent in the comments 😅. Thanks. Fixed in place.
For instance, GODEBUG=xmlcheck=endtags,entities,installgoroot=all is cumbersome to read using the usual strings.Split(os.Getenv("GODEBUG"), ",").
Oh, that's a good point. I'll bring this detail back to proposal review.
Comment From: aclements
Using +, or really any list separator, has a few downsides:
- It's easy to forget and use commas by accident, in which case we'll silently ignore the checks after the first comma because we'll parse them as other (unknown) GODEBUG values.
- You can't bisect the individual flags.
- + is weird. 😅
Another option would be to have a separate GODEBUG flag for each check. In a way, this fits better with the format of GODEBUG, it's harder to mess up, and it makes each flag bisectable. To make the grouping of these settings clear, we can use a common prefix. I propose xmlcheck:<flag>, so the example from above would look like:
GODEBUG=xmlcheck:endtags=1,xmlcheck:entities=1,installgoroot=all
I think separate flags rather than a single list also makes it more clear that this adds to any flags set by the application, rather than overriding all of them.
Comment From: aclements
Let's go with GODEBUG=xmlcheck:flag, where flag is the name of a field in the Checks struct, converted to lower case. This will cause a little bit of trouble with the internal/godebugs package, which currently assumes all GODEBUGs are also valid Go identifiers, but I think it will be trivial to tweak that to, say, replace : with _ for the identifier.
Comment From: iwdgo
Currently on tip, an entity such as &#12; in character data, decoded with Strict set to false, is rejected with the error illegal character code U+000C (complete test: https://go.dev/play/p/L62pVsNaoZ4?v=gotip ), in line with https://www.w3.org/TR/xml/#charsets
The wording of Entities hints at removing this error when the Check.Entities flag is false, as the complete sequence of the entity should be left alone. Is this change of behaviour expected?
Comment From: apparentlymart
From the discussion above it seems like the intention is for encoding/xml to implement an XML-like grammar that doesn't exactly conform to the XML 1.0 specifications, and to allow gradually opting in to stronger conformance to the relevant specifications.
I understand the argument for backward-compatibility and don't wish to argue for any change to the accepted compromises, but I would like to advocate for changing the docstring of the package so that it explicitly states that this is not a fully-conforming XML 1.0 parser, describes the various ways it diverges from the specification -- at least, the ones we already know about -- and mentions that the level of conformance is partially configurable but that 100% conformance is not currently available.
My motivation here is related to the "principle of least astonishment" discussed earlier: if this package claims that it's a conforming XML 1.0 parser then it's likely to be used in ways that rely on that being true, potentially causing general functionality or security problems for depending applications. Hopefully an explicit note in the docs will cause developers to look a little more closely to make sure that the current implementation is suitable for their needs.
Comment From: DemiMarie
That is my thought too. Given that making encoding/xml strictly conforming to the XML spec is not feasible, I think it would be best to deprecate the package, and provide an encoding/xml/v2 that has tests (including differential fuzzing against libxml2) to make sure it does strictly conform to the standards.
Comment From: gopherbot
Change https://go.dev/cl/662235 mentions this issue: encoding/xml: adds Checks struct to allow selection of checks.
Comment From: prattmic
Do I understand correctly that Strict and Checks.EndTag/Checks.Entities combine via logical AND? The doc comments are not clear about the precedence of Checks when Strict is false (only when it is true).
Strict | Checks.EndTag | End tag validation
---|---|---
false | false | No
false | true | No
true | false | No
true | true | Yes
My initial thought was they should be logical OR, making the middle cases both "yes". However, if we do that then we are changing the behavior of existing code like:
d := xml.NewDecoder(r) // r is an io.Reader
d.Strict = false
That can't be correct because it makes Strict completely useless on its own, breaking existing uses (which have explicitly opted out of strict behavior!).
Comment From: prattmic
The more I read the doc comments, the more my brain gets stuck in a twisty maze of confusion.
For example, on Checks: "If the Strict field is true, then EndTag and Entities are validated even if they are set to false here."
This is in conflict with my logical AND conclusion, but it also doesn't seem to make much sense?
NewDecoder defaults to:
d.Strict = true
d.Checks.EndTags = true
d.Checks.Entities = true
With the above behavior, disabling end tag validation would require setting both EndTags and Strict to false. Why? That seems needlessly complicated. It also makes it impossible to disable only one of EndTags or Entities (because setting Strict to false disables both).
In https://go.dev/cl/662235, I have proposed documenting Strict strictly in terms of Checks:
// Strict allows overriding the behavior of Check.EndTags and Check.Entities.
//
// Strict defaults to true. When set to true, validation behavior is controlled
// by the settings in Check.
//
// When set to false, Check.EndTags and Check.Entities validation are disabled,
// regardless of the values of Check.EndTags and Check.Entities.
//
// For clarity, new code should prefer to set Check.EndTags = false and
// Check.Entities = false instead of setting Strict = false.
Strict bool
Personally I think this makes it easier to wrap my head around the behavior, though it does seem to differ slightly from the accepted behavior.
Comment From: prattmic
One more (minor) finding from the implementation: GODEBUGs get a runtime/metrics metric like /godebug/non-default-behavior/<name>:events. In metrics, : is the separator for the metric unit. Thus the implementation replaces xmlcheck:endtags with xmlcheck_endtags in the metric name.
It's not the end of the world, but annoyingly inconsistent. Perhaps we should pick yet another separator.
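For reference, a sketch of reading such a counter with the runtime/metrics package; the specific metric name follows the naming scheme discussed here and is illustrative:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The ":events" suffix is the metric's unit, which is why a ":"
	// inside <name> would collide with the unit separator.
	const name = "/godebug/non-default-behavior/xmlcheck_endtags:events"
	sample := []metrics.Sample{{Name: name}}
	metrics.Read(sample)
	if sample[0].Value.Kind() == metrics.KindUint64 {
		fmt.Println("events:", sample[0].Value.Uint64())
	} else {
		fmt.Println("metric not supported by this Go version")
	}
}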