Golang x/tools/gopls: SEGV accessing map[string]...

#!stacks
"runtime.sigpanic" &&
   "runtime.mapaccess2_faststr:+84" &&
   "imports.(*DirInfoCache).Load:+3"

Issue created by stacks.

imports.(*DirInfoCache).Load:

    info, ok := d.dirs[dir]

In mapaccess2_faststr below, it looks like one of k, key, or b is an invalid (nil or corrupt) pointer. We know that b and key are non-nil; and k is derived from b.keys(), which is an offset of b; so the only possibility is data corruption.

    top := tophash(hash)
    for ; b != nil; b = b.overflow(t) {
        for i, kptr := uintptr(0), b.keys(); i < abi.MapBucketCount; i, kptr = i+1, add(kptr, 2*goarch.PtrSize) {
            k := (*stringStruct)(kptr)
            if k.len != key.len || b.tophash[i] != top {  <------- SEGV
                continue
            }
            if k.str == key.str || memequal(k.str, key.str, uintptr(key.len)) {
                return add(unsafe.Pointer(b), dataOffset+abi.MapBucketCount*2*goarch.PtrSize+i*uintptr(t.ValueSize)), true
            }
        }
    }

This stack sAEkLA was reported by telemetry:

golang.org/x/tools/gopls@v0.18.1 go1.23.8 linux/amd64 vscode (1)

Comment From: gabyhelp

Related Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

Comment From: gengliqi

Hi, we've seen a very similar panic in TiDB (see pingcap/tidb#61962). We don't believe it’s caused by concurrent access. This string map is only ever used by a single worker goroutine, so there are no simultaneous reads/writes. We've also reviewed our use of unsafe pointers and haven't found any suspicious unsafe memory writes so far.

One intriguing clue is the invalid address (0x1000000000010) in our stack trace. According to my analysis (https://github.com/pingcap/tidb/issues/61962#issuecomment-2999798116), it looks like the bucket pointer b was somehow set to 0x1000000000000 (1<<48), which lies just outside the canonical address range on typical systems with 48-bit virtual addresses. Does the address in this issue's panic match the same 0x1000000000010 value?
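
The arithmetic behind that guess can be checked with a small sketch. It assumes the pre-1.24 bucket layout on a 64-bit platform (8 tophash bytes followed by the keys, with a string key stored as a pointer followed by a length) and the x86-64 rule that bits 63..47 of a canonical address are all equal. The names isCanonical48 and keyLenAddr are hypothetical helpers written for illustration, not runtime functions.

```go
package main

import "fmt"

// Assumed layout constants for a Go 1.21-1.23 map bucket on amd64.
const (
	ptrSize      = 8 // goarch.PtrSize on a 64-bit platform
	dataOffset   = 8 // keys begin right after the 8 tophash bytes
	strLenOffset = 8 // in stringStruct, len follows the str pointer
)

// isCanonical48 reports whether addr is canonical under 48-bit
// virtual addressing, i.e. bits 63..47 are all zeros or all ones.
func isCanonical48(addr uint64) bool {
	top := addr >> 47 // the 17 high bits
	return top == 0 || top == 0x1ffff
}

// keyLenAddr returns the address of k.len for key slot i in the
// bucket at address b, following the loop in mapaccess2_faststr.
func keyLenAddr(b, i uint64) uint64 {
	return b + dataOffset + i*2*ptrSize + strLenOffset
}

func main() {
	b := uint64(1) << 48 // the suspected corrupt bucket pointer
	fmt.Printf("b = %#x, canonical: %v\n", b, isCanonical48(b))
	fmt.Printf("first k.len read at %#x\n", keyLenAddr(b, 0))
}
```

Under these assumptions the first k.len load for a bucket at 1<<48 touches 0x1000000000010, which is consistent with the faulting address in our trace; it does not by itself explain how b came to hold that value.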

Comment From: adonovan

Intriguing indeed. Unfortunately our telemetry system is quite fastidious about not lifting arbitrary dynamic data (such as pointer values) out of the user's machine, so all we have is the backtrace.

Of course it is possible that misuse of concurrency in one goroutine tramples on memory belonging to another, even if the hapless latter goroutine is entirely well behaved. However, one would expect to see a degree of locality in the behavior, and that crashes should show up most often near to where they are caused, or in highly stereotypical patterns. For example, racing writes on an interface variable can coerce values of arbitrary types (e.g. X ↔ Y, or *X ↔ int), but if this happens one would expect that a small number of places that commonly access fields of X or Y would often panic because they hold invalid pointers. We haven't yet identified any such pattern in gopls; the corruption seems to be spread all over the address space.

Comment From: adonovan

cc @prattmic

Comment From: prattmic

It looks like this crash was on 1.23, and the tidb one on 1.21 (I think?). Note that the map implementation is completely rewritten in 1.24. If this is corruption unrelated to the map, there will probably still be crashes, but they will move elsewhere.