When dealing with incomplete JSON, the `(*Decoder).Token` method returns `io.EOF` at the end of the input instead of providing an appropriate error. For example, the following code does not produce an error:
https://go.dev/play/p/HHEwVkRCs7a
```go
dec := json.NewDecoder(strings.NewReader(`[123,`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}
```
According to the documentation, Token guarantees that the delimiters [ ] { } it returns are properly nested and matched. However, in this example, `[` is not properly nested and matched.
Comment From: timothy-king
CC @rsc, @dsnet, @bradfitz, @mvdan.
Comment From: adonovan
It's not obvious to me that this is a bug. The doc comments state that Token returns (nil, EOF) at the end of the stream, which it does; and that Token guarantees that the delimiters it returns are properly nested and matched, which it does too, to the extent that you'll never see a mismatched or premature close bracket. The actual error in this case is that we ran out of input (for which EOF is appropriate), not that there was something wrong with the input we received, so it would be strange for it to report a syntax error, or to synthesize close bracket tokens that weren't present.
What would you have it do?
Comment From: dsnet
> doc comments state that Token returns (nil, EOF) at the end of the stream
The question comes down to: what is a stream?
The docs for json.Decoder.Token define a "stream" as:

> The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters [ ] { } of type Delim to mark the start and end of arrays and objects.
This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".
Essentially:
* If Decoder is reading a stream of JSON values, then io.ErrUnexpectedEOF is correct.
* If Decoder is reading a stream of JSON tokens, then io.EOF is correct.
The "github.com/go-json-experiment/json" module reports io.ErrUnexpectedEOF since it defines jsontext.Decoder as decoding a stream of top-level "values". In all my own usages of jsontext.Decoder, this is the semantic I would have wanted. While I'm parsing a JSON value token-by-token, I'm often still operating under the assumption that the entire value is valid.
Comment From: gazerro
The same example with XML returns the error `XML syntax error on line 1: unexpected EOF`:
https://go.dev/play/p/wCvkSOlqeH0
```go
dec := xml.NewDecoder(strings.NewReader(`<values>123`))
for {
	_, err := dec.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}
```
Therefore, I notice an inconsistency in the behavior of the Decoder.Token methods between encoding/json and encoding/xml. This inconsistency doesn't seem justified by the differences between JSON and XML. I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.
Comment From: adonovan
> I believe the correct behavior would be to return an io.ErrUnexpectedEOF error when encountering incomplete input.
ErrUnexpectedEOF is appropriate for the case when the EOF occurs in the middle of a token, such as an unclosed string literal, or in the middle of "null", "true", or "false"; and indeed that's what the decoder does. But Token promises to deliver tokens, and a missing close bracket is not a token-decoding error.
> The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters...

> @dsnet: This is unfortunately problematic. The term "value" according to RFC 8259 implies a complete object or array (i.e., including the paired terminating ']' or '}'), but the prose in the latter half of the sentence is describing what RFC 8259 would actually call a "token".
The quoted sentence says "basic values", and immediately defines them as the elementary, single-token values; I don't think one can argue that value is meant here in its general sense. So Token returns a stream of "basic values and delimiters".
Comment From: dsnet
> immediately defines them as the elementary, single-token values
If this were true, then I would expect the parser to be handling the tokens as just a plain sequence of JSON tokens (i.e., only validating for the grammar for a JSON "token", but not the grammar for a JSON "value"). However, that's not quite how it behaves. Consider the following:
```go
dec := json.NewDecoder(strings.NewReader(`{ "hello" }`))
for {
	tok, err := dec.Token()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(tok)
}
```
which prints:

```
{
hello
invalid character '}' after object key
```
The fact that we see `invalid character '}' after object key` implies that it's using a push-down state machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF and io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.
Comment From: gazerro
In Go 1.5 and earlier, the behavior of Decoder.Token in the encoding/xml package was identical to the current behavior in the encoding/json package. Proposal #11405 recommended changing this behavior, and the change was introduced via CL https://golang.org/cl/14315 in Go 1.6.

Given that the Decoder.Token method was added to the encoding/json package around the same time as the modification to encoding/xml, it seems like an unfortunate mistake that the behavior in encoding/json was not updated to reflect this change.
Comment From: adonovan
> The fact that we see invalid character '}' after object key implies that it's using a push-down state machine to validate whether the sequence of tokens is valid according to the grammar for a JSON "value". The difference between io.EOF or io.ErrUnexpectedEOF at the very end comes down to whether the push-down state machine has an empty stack or not.
Hmm, good point. Both the (documented) no-mismatched-paren rule and the (undocumented) object grammar enforcement you just mentioned are evidence that Token actually intends to return only valid, complete sequences of tokens, and so returning ErrUnexpectedEOF if the sequence is incomplete seems reasonable.
Comment From: gopherbot
Change https://go.dev/cl/689516 mentions this issue: encoding/json: fix truncated Token error regression in goexperiment.jsonv2