Proposal Details

Currently there is no easy and reliable way to access the End position of a token through go/scanner.

It can be done to some extent with:

```go
pos, tok, lit := s.Scan()
tokLength := len(lit)
if !tok.IsLiteral() && tok != token.COMMENT {
	tokLength = len(tok.String())
}
tokEnd := pos + token.Pos(tokLength)
```

It looks correct, but actually it is not. There are a few issues:

  • len(lit) (line 2) is wrong for comments and raw string literals, since carriage returns ('\r') are not included in the literal, so in those cases the tokEnd is already wrong.
  • When a file ends (just before an EOF token) and impliedSemi==true, an artificial SEMICOLON token is emitted. Say you want to inspect the whitespace between tokens:

```go
func TestScanner(t *testing.T) {
	const src = "package a; var a int"
	file := token.NewFileSet().AddFile("", -1, len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), func(pos token.Position, msg string) {
		panic("unreachable: " + msg)
	}, scanner.ScanComments)

	prevEndOff := 0
	for {
		pos, tok, lit := s.Scan()
		t.Logf("%v %v %v", pos, tok, lit)
		off := file.Offset(pos)

		white := src[prevEndOff:off] // panics when tok == EOF
		for _, c := range white {
			switch c {
			case ' ', '\t', '\n', '\r', '\ufeff':
			default:
				panic("unreachable: " + strconv.QuoteRune(c))
			}
		}
		t.Logf("%q", white)

		tokLength := len(lit)
		if !tok.IsLiteral() && tok != token.COMMENT {
			tokLength = len(tok.String())
		}
		prevEndOff = off + tokLength

		if tok == token.EOF {
			break
		}
	}
}
```

This code panics because of the artificial SEMICOLON token.

To solve such problems, and to simplify the logic, I propose adding the following new API to go/scanner:

```go
package scanner // go/scanner

// Pos returns the current position in the source where the next Scan call
// will begin tokenizing.
// It also represents the end position of the previous token.
func (s *Scanner) Pos() token.Pos {
	return s.file.Pos(s.offset)
}
```
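For illustration, the whitespace-inspection loop from the example above could then look like the following sketch (hypothetical: it will not compile until the proposed Scanner.Pos method actually exists):

```go
prevEndOff := 0
for {
	pos, tok, _ := s.Scan()
	off := file.Offset(pos)
	white := src[prevEndOff:off] // whitespace between the previous and the current token
	_ = white

	// s.Pos() would be the end of the token just returned by Scan, so no
	// per-token-kind length computation is needed, and the artificial
	// SEMICOLON before EOF no longer pushes prevEndOff past the file end.
	prevEndOff = file.Offset(s.Pos())

	if tok == token.EOF {
		break
	}
}
```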

CC @adonovan @findleyr

Comment From: gopherbot

Change https://go.dev/cl/694615 mentions this issue: go/parser: properly calculate the end position of comments

Comment From: mateusz834

Just to note that the same issue with \r exists in the go/parser positions returned by End():

https://github.com/golang/go/blob/fbac94a79998d4730a58592f0634fa8a39d8b9fb/src/go/ast/ast.go#L64-L67

https://github.com/golang/go/blob/fbac94a79998d4730a58592f0634fa8a39d8b9fb/src/go/ast/ast.go#L313-L317

This does not try to solve that, but it makes the situation better: with such an API you could feed the source starting at node.Pos() to go/scanner, scan a single token, and look at Pos(). This is still a workaround, but at least it would then be possible to get the correct end position (if needed).

Comment From: mateusz834

Uhhh, actually it would not solve the second issue/point 100%; it would still require some workarounds.

https://github.com/golang/go/blob/fbac94a79998d4730a58592f0634fa8a39d8b9fb/src/go/scanner/scanner.go#L803-L804

I am curious why the scanner does such a thing. EDIT: #54941

EDIT2:

Actually this behaviour conflicts somewhat with the proposed API, since the documented claim:

// Pos returns the current position in the source where the next Scan call
// will begin tokenizing.

would then not be 100% true.
