Go version

1.23.8

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/kris/.cache/go-build'
GOENV='/home/kris/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/kris/mygo/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/kris/mygo'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/go/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/go/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.8'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/kris/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2339619411=/tmp/go-build -gno-record-gcc-switches'

What did you do?

We have been seeing this SIGSEGV happen sporadically across a large number of high-load services. It's pretty rare, so I have very low confidence that I will be able to zero in on a good repro.

We have 40+ instances of our backend service running across 10+ large AMD EPYC servers with plenty of ECC RAM. The services run in KVM on top of a fully updated Proxmox environment. The crashing backend service instances range from 4 vCPUs with 16GB of RAM up to 16 vCPUs with 128GB of RAM.

The crash started happening when we transitioned to the 1.23 runtime. It has happened on most of our physical hosts; none of the hosts are reporting memory problems, and prior to deployment we ran a memtest and stress test on all of them. The service typically runs for over a month before we see this SIGSEGV, and we never see it on our low-load instances, only high-load ones (which means more GC activity, so that isn't unexpected).

The backend service makes extensive use of mmap for large disk-backed data structures.
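To give a sense of the access pattern (not our actual code), here is a minimal sketch of the kind of mmap usage involved, using the standard syscall package; the path and helper name are made up:

```go
// Hypothetical sketch of mapping a large disk-backed data file read-only.
// Not the actual service code; the path and names are illustrative only.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapFile maps the whole file into memory and returns the mapping plus a
// cleanup func. The returned []byte is backed by the page cache, not the
// Go heap, so the GC never scans or moves it.
func mapFile(path string) ([]byte, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	// The mapping stays valid after the fd is closed.
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return nil, nil, err
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, nil, err
	}
	return data, func() error { return syscall.Munmap(data) }, nil
}

func main() {
	data, cleanup, err := mapFile("/data/large-index") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer cleanup()
	fmt.Println("mapped", len(data), "bytes")
}
```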

We just moved our builds over to the 1.24.x runtime, but they haven't run in production long enough to know whether the crash goes away. I can see that the mheap code is wildly different in 1.24 vs 1.23.

What did you see happen?

Crash with the following backtrace (happens in exactly the same spot every time):

SIGSEGV: segmentation violation
PC=0x42a5bc m=16 sigcode=1 addr=0x64

goroutine 0 gp=0xc000704540 m=16 mp=0xc000081c08 [idle]:
runtime.(*mheap).freeManual(0x29a1940, 0x0, 0x2)
        runtime/mheap.go:1605 +0xbc fp=0xc000c8ffa0 sp=0xc000c8ff70 pc=0x42a5bc
runtime.(*sweepLocked).sweep.func2()
        runtime/mgcsweep.go:826 +0x70 fp=0xc000c8ffc8 sp=0xc000c8ffa0 pc=0x4273b0
runtime.systemstack(0x3240c903240c903)
        runtime/asm_amd64.s:514 +0x4a fp=0xc000c8ffd8 sp=0xc000c8ffc8 pc=0x4778ea

Digging in, it's pretty clear we are getting a nil pointer back from runtime.spanOf, which then causes the crash when runtime.(*mheap).freeManual attempts to assign s.needzero = 1.

Across all our crashes the arguments to runtime.(*mheap).freeManual are always (0x0, 0x2) according to the backtraces.
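To illustrate why a nil *mspan lines up with the small fault address, here is a minimal, self-contained sketch (the struct and offset are assumed for illustration only and are not the real runtime mspan layout): a store through a nil struct pointer faults at an address equal to the field's offset, which is consistent with addr=0x64 and a first argument of 0x0.

```go
// Self-contained illustration of the failure mode: writing a field through
// a nil struct pointer faults at the field's offset. The layout below is
// hypothetical, chosen so the field sits at offset 0x64 like the reported
// fault address; it is not the real runtime mspan layout.
package main

import (
	"fmt"
	"unsafe"
)

type fakeSpan struct {
	_        [0x64]byte // padding so needzero lands at offset 0x64 (assumed)
	needzero uint8
}

func freeManual(s *fakeSpan) {
	s.needzero = 1 // with s == nil, this is a write to address 0x64
}

func main() {
	fmt.Printf("needzero offset: %#x\n", unsafe.Offsetof(fakeSpan{}.needzero))
	var s *fakeSpan
	freeManual(s) // nil pointer dereference (SIGSEGV, surfaced as a panic in user code)
}
```

In the runtime, the equivalent store happens on the system stack during sweeping, so instead of a recoverable panic it takes the process down with the SIGSEGV shown above.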

What did you expect to see?

No SIGSEGV

Comment From: randall77

That value 2 for the spanAllocType should no longer happen at all in 1.24, so upgrading might very well fix things for you.

I don't see how spanOf could return nil for a non-nil s.largeType. Very strange. I think without a reproducer it will be very hard to track this down.

Comment From: kris-watts-gravwell

That's what I was worried about. Across our 40-ish workers, which have processed literally hundreds of petabytes, we have seen it about 12 times.

Knowing that 1.24 shouldn't see this at all is good enough for us, but I figured I should post it in case anyone had ideas and because 1.23 is still maintained.

Comment From: gopherbot

Change https://go.dev/cl/671096 mentions this issue: runtime: remove ptr/scalar bitmap metric

Comment From: gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)