When dealing with incomplete JSON, the `(*Decoder).Token` method returns `io.EOF` at the end of the input instead of providing an appropriate error. For example, the following code does not produce an error:
https://go.dev/play/p/HHEwVkRCs7a
```go
dec := json.NewDecoder(strings.NewReader(`[123,`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}
```
According to the documentation, Token guarantees that the delimiters [ ] { } it returns are properly nested and matched. However, in this example, `[` is not properly nested and matched.
Comment From: timothy-king
CC @rsc, @dsnet, @bradfitz, @mvdan.
Comment From: adonovan
It's not obvious to me that this is a bug. The doc comments state that Token returns (nil, EOF) at the end of the stream, which it does; and that Token guarantees that the delimiters it returns are properly nested and matched, which it does too, to the extent that you'll never see a mismatched or premature close bracket. The actual error in this case is that we ran out of input (for which EOF is appropriate), not that there was something wrong with the input we received, so it would be strange for it to report a syntax error, or to synthesize close bracket tokens that weren't present.
What would you have it do?
Comment From: dsnet
> doc comments state that Token returns (nil, EOF) at the end of the stream
The question comes down to: what is a stream?
The docs for json.Decoder.Token define a "stream" as:

> The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters [ ] { } of type Delim to mark the start and end of arrays and objects.
This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".
Essentially:
* If Decoder is reading a stream of JSON values, then io.ErrUnexpectedEOF is correct.
* If Decoder is reading a stream of JSON tokens, then io.EOF is correct.
The "github.com/go-json-experiment/json" module reports io.ErrUnexpectedEOF since it defines jsontext.Decoder as decoding a stream of top-level "values". In all my own usages of jsontext.Decoder, this is the semantic I would have wanted. While I'm parsing a JSON value token-by-token, I'm often still operating under the assumption that the entire value is valid.
Comment From: gazerro
The same example with XML returns the error `XML syntax error on line 1: unexpected EOF`:
https://go.dev/play/p/wCvkSOlqeH0
```go
dec := xml.NewDecoder(strings.NewReader(`<values>123`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}
```
Therefore, I notice an inconsistency in the behavior of the Decoder.Token methods between encoding/json and encoding/xml. This inconsistency doesn't seem justified by the differences between JSON and XML. I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.
Comment From: adonovan
> I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.
ErrUnexpectedEOF is appropriate for the case when the EOF occurs in the middle of a token, such as an unclosed string literal, or in the middle of "null", "true", or "false"; and indeed that's what the decoder does. But Token promises to deliver tokens, and a missing close bracket is not a token-decoding error.
> The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters...

> @dsnet: This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".
The quoted sentence says "basic values", and immediately defines them as the elementary, single-token values; I don't think one can argue that value is meant here in its general sense. So Token returns a stream of "basic values and delimiters".
Comment From: dsnet
> immediately defines them as the elementary, single-token values
If this were true, then I would expect the parser to be handling the tokens as just a plain sequence of JSON tokens (i.e., only validating for the grammar for a JSON "token", but not the grammar for a JSON "value"). However, that's not quite how it behaves. Consider the following:
```go
dec := json.NewDecoder(strings.NewReader(`{ "hello" }`))
for {
	tok, err := dec.Token()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(tok)
}
```
which prints:

```
{
hello
invalid character '}' after object key
```
The fact that we see `invalid character '}' after object key` implies that it's using a push-down state machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF and io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.
Comment From: gazerro
In Go 1.5 and earlier, the behavior of Decoder.Token in the encoding/xml package was identical to the current behavior in the encoding/json package. Proposal #11405 recommended changing this behavior, and the change was introduced via CL https://golang.org/cl/14315 in Go 1.6.

Given that the Decoder.Token method was added to the encoding/json package around the same time as the modification to encoding/xml, it seems like an unfortunate mistake that the behavior in encoding/json was not updated to reflect this change.
Comment From: adonovan
> The fact that we see invalid character '}' after object key implies that it's using a push-down state machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF or io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.
Hmm, good point. Both the (documented) no-mismatched-paren rule and the (undocumented) object grammar enforcement you just mentioned are evidence that Token actually intends to return only valid, complete sequences of tokens, and so returning ErrUnexpectedEOF if the sequence is incomplete seems reasonable.
Comment From: gopherbot
Change https://go.dev/cl/689516 mentions this issue: encoding/json: fix truncated Token error regression in goexperiment.jsonv2