Go version

go1.24.3 linux/amd64

Also: this was run with GOEXPERIMENT=synctest under go1.24.3, but given the location of the crash I doubt that is a factor. It could be, though.

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE='auto'
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/jaten/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/jaten/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2503454780=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/jaten/zdb/go.mod'
GOMODCACHE='/home/jaten/go/pkg/mod'
GONOPROXY='github.com/glycerine/*'
GONOSUMDB='github.com/glycerine/*'
GOOS='linux'
GOPATH='/home/jaten/go'
GOPRIVATE='github.com/glycerine/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/mnt/oldrog/usr/local/go1.24.3'
GOSUMDB='sum.golang.org'
GOTELEMETRY='off'
GOTELEMETRYDIR='/home/jaten/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/mnt/oldrog/usr/local/go1.24.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.3'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

Detailed context for my Linux development host:

Under go1.24.3 amd64
on Ubuntu 24.04.2 LTS with a 6.11.0-26-generic kernel, 
booted with nospec_store_bypass_disable
(per https://github.com/rr-debugger/rr/wiki/Zen#ssb-mitigation )
kernel.apparmor_restrict_unprivileged_userns=1
kernel.yama.ptrace_scope = 0
kernel.perf_event_paranoid=-1
$ uname -a
Linux rog 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"
From /proc/cpuinfo:
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD Ryzen Threadripper 3960X 24-Core Processor
stepping        : 0
microcode       : 0x830107c

rr was compiled from source at release version 5.9 using clang++
https://github.com/rr-debugger/rr
[edit: actually, it was at rr tip: daaab24f10c81394f209cf05ef1191af3365ccd8 ]

$ clang++ --version
Ubuntu clang version 18.1.3 (1ubuntu1)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Since my CPU is AMD, the rr Zen workaround Python script must be (and was) deployed, as described at https://github.com/rr-debugger/rr/wiki/Zen

Running a small Go test for my Raft code under rr record -h (chaos mode) for the first time may have turned up a bug in the Go runtime's garbage collection code.

Or, of course, I may just have made a goof somewhere.

rr is a fairly sophisticated bit of kit, and I'm still getting oriented to it.

However, I don't want to dismiss its findings out of inexperience when they may help improve the Go garbage collector code.

And since I hope to do more rr work in the future to harden my Raft code, I need to know whether this is a false positive to expect, or whether we are seeing a real issue in the runtime GC code.

I wrote a brief introduction to rr and its chaos mode, which helps locate hard-to-find bugs (those that depend on CPU starvation or on thread interleavings that only arise under small scheduling quanta), here:

https://groups.google.com/g/golang-nuts/c/ouBKO-Q0Mtw

In short, in record -h chaos mode the rr debugger uses ptrace to take control of threading in the target program (here, a Go test binary), runs all threads on a single CPU, and forces the threads onto unusual schedules that are empirically and heuristically chosen to uncover hard-to-find concurrency bugs.

https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html and https://robert.ocallahan.org/2016/02/deeper-into-chaos.html

provide readable introductions to some of the early approaches, which have almost certainly been refined in the intervening 9 years.
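For anyone wanting to try the same thing, the record/replay workflow reduces to roughly the following; the binary name rpc.test and the -test.run pattern are specific to my project, and rr must already be installed with the Zen workaround applied:

```shell
# Build the race-instrumented test binary, then record it under rr chaos mode.
GOTRACEBACK=all GOEXPERIMENT=synctest go test -race -c -o rpc.test
rr record -h ./rpc.test -test.v -test.run 016   # -h enables chaos mode

# Deterministically replay the most recent recording under gdb.
rr replay
```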

The recording run:

-*- mode: compilation; default-directory: "~/zdb/tube/" -*-
Compilation started at Thu Jun  5 13:35:25

make rr
GOTRACEBACK=all GOEXPERIMENT=synctest go test -race -c -o rpc.test
rr record -h ./rpc.test -test.v -test.run 016
rr: Saving execution to trace directory `/mnt/oldrog/home/jaten/.local/share/rr/rpc.test-0'.
faketime = true
=== RUN   Test016_tube_parallel_linz

tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x471843]

goroutine 3 gp=0xc000003340 m=7 mp=0xc000480008 [running]:
runtime.throw({0x10298a0?, 0x446968?})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/panic.go:1101 +0x48 fp=0xc000076e80 sp=0xc000076e50 pc=0x4b1c08
runtime.sigpanic()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:909 +0x3c9 fp=0xc000076ee0 sp=0xc000076e80 pc=0x4b4029
runtime.(*spanSet).pop(0x1?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189 +0xa3 fp=0xc000076f08 sp=0xc000076ee0 pc=0x471843
runtime.(*mheap).nextSpanForSweep(0x1631540)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105 +0xc8 fp=0xc000076f30 sp=0xc000076f08 pc=0x462588
runtime.sweepone()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:370 +0x8d fp=0xc000076f80 sp=0xc000076f30 pc=0x462b8d
runtime.bgsweep(0xc000030100)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:297 +0xff fp=0xc000076fc8 sp=0xc000076f80 pc=0x4629bf
runtime.gcenable.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204 +0x25 fp=0xc000076fe0 sp=0xc000076fc8 pc=0x456e25
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000076fe8 sp=0xc000076fe0 pc=0x4ba141
created by runtime.gcenable in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204 +0x66

goroutine 1 gp=0xc000002380 m=nil [chan receive]:
runtime.gopark(0x18?, 0x16277e0?, 0x48?, 0xc3?, 0xc00019f718?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004e36b0 sp=0xc0004e3690 pc=0x4b1d4e
runtime.chanrecv(0xc0001024d0, 0xc00019f796, 0x1)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004e3728 sp=0xc0004e36b0 pc=0x447545
runtime.chanrecv1(0x1625780?, 0xf2e660?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:506 +0x12 fp=0xc0004e3750 sp=0xc0004e3728 pc=0x447132
testing.(*T).Run(0xc000106e00, {0x101944e, 0x1a}, 0x104c6a8)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1859 +0x91e fp=0xc0004e3888 sp=0xc0004e3750 pc=0x5b0c9e
testing.runTests.func1(0xc000106e00)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2279 +0x86 fp=0xc0004e38d8 sp=0xc0004e3888 pc=0x5b51a6
testing.tRunner(0xc000106e00, 0xc00019fae0)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1792 +0x226 fp=0xc0004e39a8 sp=0xc0004e38d8 pc=0x5af1e6
testing.runTests(0xc00027c2e8, {0x161dba0, 0x47, 0x47}, {0x1?, 0x1?, 0x0?})
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2277 +0x96d fp=0xc0004e3b10 sp=0xc0004e39a8 pc=0x5b500d
testing.(*M).Run(0xc0001488c0)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2142 +0xeeb fp=0xc0004e3e88 sp=0xc0004e3b10 pc=0x5b25eb
main.main()
    _testmain.go:387 +0x165 fp=0xc0004e3f50 sp=0xc0004e3e88 pc=0xe370a5
runtime.main()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:283 +0x28b fp=0xc0004e3fe0 sp=0xc0004e3f50 pc=0x47a7cb
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e3fe8 sp=0xc0004e3fe0 pc=0x4ba141

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x15bb610?, 0x16277e0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0000607a8 sp=0xc000060788 pc=0x4b1d4e
runtime.goparkunlock(...)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:441
runtime.forcegchelper()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:348 +0xb3 fp=0xc0000607e0 sp=0xc0000607a8 pc=0x47ab13
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000607e8 sp=0xc0000607e0 pc=0x4ba141
created by runtime.init.7 in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:336 +0x1a

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x1173788?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000070f78 sp=0xc000070f58 pc=0x4b1d4e
runtime.goparkunlock(...)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:441
runtime.(*scavengerState).park(0x16260a0)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000070fa8 sp=0xc000070f78 pc=0x460409
runtime.bgscavenge(0xc000030100)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000070fc8 sp=0xc000070fa8 pc=0x460999
runtime.gcenable.gowrap2()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:205 +0x25 fp=0xc000070fe0 sp=0xc000070fc8 pc=0x456dc5
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x4ba141
created by runtime.gcenable in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000106380 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000077e30 sp=0xc000077e10 pc=0x4b1d4e
runtime.runfinq()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mfinal.go:196 +0x145 fp=0xc000077fe0 sp=0xc000077e30 pc=0x455de5
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000077fe8 sp=0xc000077fe0 pc=0x4ba141
created by runtime.createfing in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mfinal.go:166 +0x3d

goroutine 19 gp=0xc000106540 m=nil [chan receive]:
runtime.gopark(0x1648530?, 0xffffffff?, 0x29?, 0xd1?, 0xc000394f50?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000394f18 sp=0xc000394ef8 pc=0x4b1d4e
runtime.chanrecv(0xc0001020e0, 0x0, 0x1)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc000394f90 sp=0xc000394f18 pc=0x447545
runtime.chanrecv1(0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:506 +0x12 fp=0xc000394fb8 sp=0xc000394f90 pc=0x447132
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1799 +0x2f fp=0xc000394fe0 sp=0xc000394fb8 pc=0x459f6f
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000394fe8 sp=0xc000394fe0 pc=0x4ba141
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1794 +0x79

goroutine 20 gp=0xc000106700 m=nil [select, locked to thread]:
runtime.gopark(0xc000075fa8?, 0x2?, 0x60?, 0x0?, 0xc000075f94?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000075de8 sp=0xc000075dc8 pc=0x4b1d4e
runtime.selectgo(0xc000075fa8, 0xc000075f90, 0x0?, 0x0, 0x2?, 0x1)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/select.go:351 +0x985 fp=0xc000075f58 sp=0xc000075de8 pc=0x48e385
runtime.ensureSigM.func1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:1085 +0x192 fp=0xc000075fe0 sp=0xc000075f58 pc=0x4abfb2
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x4ba141
created by runtime.ensureSigM in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:1068 +0xc8

goroutine 21 gp=0xc0001068c0 m=0 mp=0x16277e0 [syscall]:
runtime.notetsleepg(0x1648d60, 0xffffffffffffffff)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/lock_futex.go:123 +0x29 fp=0xc00005cfa0 sp=0xc00005cf78 pc=0x44d889
os/signal.signal_recv()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/sigqueue.go:152 +0x29 fp=0xc00005cfc0 sp=0xc00005cfa0 pc=0x4b4069
os/signal.loop()
    /mnt/oldrog/usr/local/go1.24.3/src/os/signal/signal_unix.go:23 +0x1d fp=0xc00005cfe0 sp=0xc00005cfc0 pc=0x5ea5dd
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00005cfe8 sp=0xc00005cfe0 pc=0x4ba141
created by os/signal.Notify.func1.1 in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/os/signal/signal.go:152 +0x47

goroutine 22 gp=0xc000106a80 m=nil [chan receive]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004b9e18 sp=0xc0004b9df8 pc=0x4b1d4e
runtime.chanrecv(0xc00027bb90, 0xc0004b9f98, 0x1)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004b9e90 sp=0xc0004b9e18 pc=0x447545
runtime.chanrecv2(0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:511 +0x12 fp=0xc0004b9eb8 sp=0xc0004b9e90 pc=0x447152
github.com/glycerine/rpc25519.init.1.func1()
    /home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/filterstacks.go:22 +0x28c fp=0xc0004b9fe0 sp=0xc0004b9eb8 pc=0xd1856c
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b9fe8 sp=0xc0004b9fe0 pc=0x4ba141
created by github.com/glycerine/rpc25519.init.1 in goroutine 1
    /home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/filterstacks.go:21 +0xca

goroutine 23 gp=0xc000106c40 m=nil [chan receive]:
runtime.gopark(0xc00027be20?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004b8e80 sp=0xc0004b8e60 pc=0x4b1d4e
runtime.chanrecv(0xc00027bdc0, 0xc0004b8fa0, 0x1)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004b8ef8 sp=0xc0004b8e80 pc=0x447545
runtime.chanrecv2(0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:511 +0x12 fp=0xc0004b8f20 sp=0xc0004b8ef8 pc=0x447152
github.com/glycerine/zdb/tube.init.0.func1()
    /home/jaten/zdb/tube/filterstacks.go:21 +0x1c5 fp=0xc0004b8fe0 sp=0xc0004b8f20 pc=0xe31345
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b8fe8 sp=0xc0004b8fe0 pc=0x4ba141
created by github.com/glycerine/zdb/tube.init.0 in goroutine 1
    /home/jaten/zdb/tube/filterstacks.go:20 +0xca

goroutine 24 gp=0xc000107180 m=nil [synctest.Run, synctest group 24]:
runtime.gopark(0xc000390e90?, 0x0?, 0x0?, 0x0?, 0xc000494000?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000390e40 sp=0xc000390e20 pc=0x4b1d4e
internal/synctest.Run(0xc000494000)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/synctest.go:191 +0x1a5 fp=0xc000390eb8 sp=0xc000390e40 pc=0x4b5f05
testing/synctest.Run(...)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/synctest/synctest.go:38
github.com/glycerine/zdb/tube.bubbleOrNot(...)
    /home/jaten/zdb/tube/synctest.go:22
github.com/glycerine/zdb/tube.Test016_tube_parallel_linz(0xc000106fc0)
    /home/jaten/zdb/tube/porc_test.go:47 +0x8c fp=0xc000390ee0 sp=0xc000390eb8 pc=0xdf968c
testing.tRunner(0xc000106fc0, 0x104c6a8)
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1792 +0x226 fp=0xc000390fb0 sp=0xc000390ee0 pc=0x5af1e6
testing.(*T).Run.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1851 +0x45 fp=0xc000390fe0 sp=0xc000390fb0 pc=0x5b0f45
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000390fe8 sp=0xc000390fe0 pc=0x4ba141
created by testing.(*T).Run in goroutine 1
    /mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1851 +0x8f3

goroutine 34 gp=0xc000482380 m=4 mp=0xc000067808 [running, synctest group 24]:
    goroutine running on other thread; stack unavailable
created by internal/synctest.Run in goroutine 24
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/synctest.go:178 +0x10d

goroutine 35 gp=0xc000482540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bff38 sp=0xc0004bff18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bffc8 sp=0xc0004bff38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bffe0 sp=0xc0004bffc8 pc=0x459165
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bffe8 sp=0xc0004bffe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105

goroutine 36 gp=0xc000482700 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bef38 sp=0xc0004bef18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004befc8 sp=0xc0004bef38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004befe0 sp=0xc0004befc8 pc=0x459165
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004befe8 sp=0xc0004befe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105

goroutine 37 gp=0xc0004828c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bdf38 sp=0xc0004bdf18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bdfc8 sp=0xc0004bdf38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bdfe0 sp=0xc0004bdfc8 pc=0x459165
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bdfe8 sp=0xc0004bdfe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105

goroutine 38 gp=0xc000482a80 m=nil [GC worker (idle)]:
runtime.gopark(0x53b739deb6ac?, 0x0?, 0x0?, 0x0?, 0x0?)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bcf38 sp=0xc0004bcf18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bcfc8 sp=0xc0004bcf38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bcfe0 sp=0xc0004bcfc8 pc=0x459165
runtime.goexit({})
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bcfe8 sp=0xc0004bcfe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
    /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105
make: *** [Makefile:11: rr] Error 2

Compilation exited abnormally with code 2 at Thu Jun  5 13:35:26

Under replay [the (rr) prompt indicates we are in a gdb session attached to an rr replay]:

jaten@rog ~/zdb/tube (master) $ make replay
rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/oldrog/home/jaten/.local/share/rr/rpc.test-0/mmap_pack_0_rpc.test...
Loading Go Runtime support.
Remote debugging using 127.0.0.1:31997
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
Downloading separate debug info for system-supplied DSO at 0x6fffd000
0x000076133c408540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr) c

Continuing.
faketime = true
=== RUN   Test016_tube_parallel_linz

tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true
[New Thread 678585.678591]
[New Thread 678585.678586]
[New Thread 678585.678587]
[New Thread 678585.678588]
[New Thread 678585.678589]
[New Thread 678585.678590]

Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 678585.678591]
runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
189     s := block.spans[bottom].Load()

(rr) list

184 
185     // Given that the spine length is correct, we know we will never
186     // see a nil block here, since the length is always updated after
187     // the block is set.
188     block := blockp.Load()
189     s := block.spans[bottom].Load()
190     for s == nil {
191         // We raced with the span actually being set, but given that we
192         // know a block for this span exists, the race window here is
193         // extremely small. Try again.
(rr) p block

$1 = <optimized out>
(rr) p bottom

No symbol "bottom" in current context.
(rr) up

#1  0x0000000000462588 in runtime.(*mheap).nextSpanForSweep (
    h=0x1631540 <runtime.mheap_>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105
105             s = c.partialUnswept(sg).pop()
(rr) down

#0  runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
189     s := block.spans[bottom].Load()
(rr) bt

#0  runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
#1  0x0000000000462588 in runtime.(*mheap).nextSpanForSweep (
    h=0x1631540 <runtime.mheap_>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105
#2  0x0000000000462b8d in runtime.sweepone (~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:370
#3  0x00000000004629bf in runtime.bgsweep (c=chan int)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:297
#4  0x0000000000456e25 in runtime.gcenable.gowrap1 ()
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204
#5  0x00000000004ba141 in runtime.goexit ()
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700
#6  0x0000000000000000 in ?? ()
(rr) i reg

rax            0x1                 1
rbx            0x0                 0
rcx            0x218008001880      36833773754496
rdx            0x0                 0
rsi            0x218008001880      36833773754496
rdi            0x100000001         4294967297
rbp            0xc000076ef8        0xc000076ef8
rsp            0xc000076ee0        0xc000076ee0
r8             0x1                 1
r9             0x8                 8
r10            0x41                65
r11            0x400               1024
r12            0x1                 1
r13            0xc000480008        824638439432
r14            0xc000003340        824633733952
r15            0x4                 4
rip            0x471843            0x471843 <runtime.(*spanSet).pop+163>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x28a86ad266c0      0x28a86ad266c0
gs_base        0x0                 0x0
(rr) down

Bottom (innermost) frame selected; you cannot go down.
(rr) i thr

  Id   Target Id                       Frame 
  1    Thread 678585.678585 (rpc.test) 0x0000000070000002 in syscall_traced ()
* 2    Thread 678585.678591 (rpc.test) runtime.(*spanSet).pop (b=<optimized out>, 
    ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
  3    Thread 678585.678586 (rpc.test) runtime.usleep ()
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/sys_linux_amd64.s:135
  4    Thread 678585.678587 (rpc.test) 0x0000000070000002 in syscall_traced ()
  5    Thread 678585.678588 (rpc.test) 0x000000000047185b in runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
    at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:190
  6    Thread 678585.678589 (rpc.test) 0x0000000070000002 in syscall_traced ()
  7    Thread 678585.678590 (rpc.test) 0x0000000070000002 in syscall_traced ()
(rr) i goro

  1 waiting  runtime.gopark
  2 waiting  runtime.gopark
* 3 running  runtime.systemstack_switch
  4 waiting  runtime.gopark
  18 waiting  runtime.gopark
  19 waiting  runtime.gopark
  20 waiting  runtime.gopark
* 21 syscall  runtime.notetsleepg
  22 waiting  runtime.gopark
  23 waiting  runtime.gopark
  24 waiting  runtime.gopark
* 34 running  runtime.systemstack_switch
  35 waiting  runtime.gopark
  36 waiting  runtime.gopark
  37 waiting  runtime.gopark
  38 waiting  runtime.gopark
(rr) i sig

Signal        Stop  Print   Pass to program Description

SIGHUP        Yes   Yes Yes     Hangup
SIGINT        Yes   Yes No      Interrupt
SIGQUIT       Yes   Yes Yes     Quit
SIGILL        Yes   Yes Yes     Illegal instruction
SIGTRAP       Yes   Yes No      Trace/breakpoint trap
SIGABRT       Yes   Yes Yes     Aborted
SIGEMT        Yes   Yes Yes     Emulation trap
SIGFPE        Yes   Yes Yes     Arithmetic exception
SIGKILL       Yes   Yes Yes     Killed
SIGBUS        Yes   Yes Yes     Bus error
SIGSEGV       Yes   Yes Yes     Segmentation fault
SIGSYS        Yes   Yes Yes     Bad system call
SIGPIPE       Yes   Yes Yes     Broken pipe
SIGALRM       No    No  Yes     Alarm clock
SIGTERM       Yes   Yes Yes     Terminated
SIGURG        No    No  Yes     Urgent I/O condition
SIGSTOP       Yes   Yes Yes     Stopped (signal)
SIGTSTP       Yes   Yes Yes     Stopped (user)
SIGCONT       Yes   Yes Yes     Continued
SIGCHLD       No    No  Yes     Child status changed
SIGTTIN       Yes   Yes Yes     Stopped (tty input)
SIGTTOU       Yes   Yes Yes     Stopped (tty output)
SIGIO         No    No  Yes     I/O possible
SIGXCPU       Yes   Yes Yes     CPU time limit exceeded
SIGXFSZ       Yes   Yes Yes     File size limit exceeded
SIGVTALRM     No    No  Yes     Virtual timer expired
SIGPROF       No    No  Yes     Profiling timer expired
SIGWINCH      No    No  Yes     Window size changed
SIGLOST       Yes   Yes Yes     Resource lost
SIGUSR1       Yes   Yes Yes     User defined signal 1
SIGUSR2       Yes   Yes Yes     User defined signal 2
SIGPWR        Yes   Yes Yes     Power fail/restart
SIGPOLL       No    No  Yes     Pollable event occurred
SIGWIND       Yes   Yes Yes     SIGWIND
SIGPHONE      Yes   Yes Yes     SIGPHONE
SIGWAITING    No    No  Yes     Process's LWPs are blocked
SIGLWP        No    No  Yes     Signal LWP
SIGDANGER     Yes   Yes Yes     Swap space dangerously low
SIGGRANT      Yes   Yes Yes     Monitor mode granted
SIGRETRACT    Yes   Yes Yes     Need to relinquish monitor mode
SIGMSG        Yes   Yes Yes     Monitor mode data available
SIGSOUND      Yes   Yes Yes     Sound completed
SIGSAK        Yes   Yes Yes     Secure attention
SIGPRIO       No    No  Yes     SIGPRIO
SIG33         Yes   Yes Yes     Real-time event 33
SIG34         Yes   Yes Yes     Real-time event 34
SIG35         Yes   Yes Yes     Real-time event 35
SIG36         Yes   Yes Yes     Real-time event 36
SIG37         Yes   Yes Yes     Real-time event 37
SIG38         Yes   Yes Yes     Real-time event 38
SIG39         Yes   Yes Yes     Real-time event 39
SIG40         Yes   Yes Yes     Real-time event 40
SIG41         Yes   Yes Yes     Real-time event 41
SIG42         Yes   Yes Yes     Real-time event 42
SIG43         Yes   Yes Yes     Real-time event 43
SIG44         Yes   Yes Yes     Real-time event 44
SIG45         Yes   Yes Yes     Real-time event 45
SIG46         Yes   Yes Yes     Real-time event 46
SIG47         Yes   Yes Yes     Real-time event 47
SIG48         Yes   Yes Yes     Real-time event 48
SIG49         Yes   Yes Yes     Real-time event 49
SIG50         Yes   Yes Yes     Real-time event 50
SIG51         Yes   Yes Yes     Real-time event 51
SIG52         Yes   Yes Yes     Real-time event 52
SIG53         Yes   Yes Yes     Real-time event 53
SIG54         Yes   Yes Yes     Real-time event 54
SIG55         Yes   Yes Yes     Real-time event 55
SIG56         Yes   Yes Yes     Real-time event 56
SIG57         Yes   Yes Yes     Real-time event 57
SIG58         Yes   Yes Yes     Real-time event 58
SIG59         Yes   Yes Yes     Real-time event 59
SIG60         Yes   Yes Yes     Real-time event 60
SIG61         Yes   Yes Yes     Real-time event 61
SIG62         Yes   Yes Yes     Real-time event 62
SIG63         Yes   Yes Yes     Real-time event 63
SIGCANCEL     No    No  Yes     LWP internal signal
SIG32         Yes   Yes Yes     Real-time event 32
SIG64         Yes   Yes Yes     Real-time event 64
SIG65         Yes   Yes Yes     Real-time event 65
SIG66         Yes   Yes Yes     Real-time event 66
SIG67         Yes   Yes Yes     Real-time event 67
SIG68         Yes   Yes Yes     Real-time event 68
SIG69         Yes   Yes Yes     Real-time event 69
SIG70         Yes   Yes Yes     Real-time event 70
SIG71         Yes   Yes Yes     Real-time event 71
SIG72         Yes   Yes Yes     Real-time event 72
SIG73         Yes   Yes Yes     Real-time event 73
SIG74         Yes   Yes Yes     Real-time event 74
SIG75         Yes   Yes Yes     Real-time event 75
SIG76         Yes   Yes Yes     Real-time event 76
SIG77         Yes   Yes Yes     Real-time event 77
SIG78         Yes   Yes Yes     Real-time event 78
SIG79         Yes   Yes Yes     Real-time event 79
SIG80         Yes   Yes Yes     Real-time event 80
SIG81         Yes   Yes Yes     Real-time event 81
SIG82         Yes   Yes Yes     Real-time event 82
SIG83         Yes   Yes Yes     Real-time event 83
SIG84         Yes   Yes Yes     Real-time event 84
SIG85         Yes   Yes Yes     Real-time event 85
SIG86         Yes   Yes Yes     Real-time event 86
SIG87         Yes   Yes Yes     Real-time event 87
SIG88         Yes   Yes Yes     Real-time event 88
SIG89         Yes   Yes Yes     Real-time event 89
SIG90         Yes   Yes Yes     Real-time event 90
SIG91         Yes   Yes Yes     Real-time event 91
SIG92         Yes   Yes Yes     Real-time event 92
SIG93         Yes   Yes Yes     Real-time event 93
SIG94         Yes   Yes Yes     Real-time event 94
SIG95         Yes   Yes Yes     Real-time event 95
SIG96         Yes   Yes Yes     Real-time event 96
SIG97         Yes   Yes Yes     Real-time event 97
SIG98         Yes   Yes Yes     Real-time event 98
SIG99         Yes   Yes Yes     Real-time event 99
SIG100        Yes   Yes Yes     Real-time event 100
SIG101        Yes   Yes Yes     Real-time event 101
SIG102        Yes   Yes Yes     Real-time event 102
SIG103        Yes   Yes Yes     Real-time event 103
SIG104        Yes   Yes Yes     Real-time event 104
SIG105        Yes   Yes Yes     Real-time event 105
SIG106        Yes   Yes Yes     Real-time event 106
SIG107        Yes   Yes Yes     Real-time event 107
SIG108        Yes   Yes Yes     Real-time event 108
SIG109        Yes   Yes Yes     Real-time event 109
SIG110        Yes   Yes Yes     Real-time event 110
SIG111        Yes   Yes Yes     Real-time event 111
SIG112        Yes   Yes Yes     Real-time event 112
SIG113        Yes   Yes Yes     Real-time event 113
SIG114        Yes   Yes Yes     Real-time event 114
SIG115        Yes   Yes Yes     Real-time event 115
SIG116        Yes   Yes Yes     Real-time event 116
SIG117        Yes   Yes Yes     Real-time event 117
SIG118        Yes   Yes Yes     Real-time event 118
SIG119        Yes   Yes Yes     Real-time event 119
SIG120        Yes   Yes Yes     Real-time event 120
SIG121        Yes   Yes Yes     Real-time event 121
SIG122        Yes   Yes Yes     Real-time event 122
SIG123        Yes   Yes Yes     Real-time event 123
SIG124        Yes   Yes Yes     Real-time event 124
SIG125        Yes   Yes Yes     Real-time event 125
SIG126        Yes   Yes Yes     Real-time event 126
SIG127        Yes   Yes Yes     Real-time event 127
SIGINFO       Yes   Yes Yes     Information request
EXC_BAD_ACCESS Yes  Yes Yes     Could not access memory
EXC_BAD_INSTRUCTION Yes Yes Yes     Illegal instruction/operand
EXC_ARITHMETIC Yes  Yes Yes     Arithmetic exception
EXC_EMULATION Yes   Yes Yes     Emulation instruction
EXC_SOFTWARE  Yes   Yes Yes     Software generated exception
EXC_BREAKPOINT Yes  Yes Yes     Breakpoint
SIGLIBRT      No    No  Yes     librt internal signal

Use the "handle" command to change these tables.
(rr)

Attaching the (packed for portability) recording and my test binary in a tarball.

edit: see next comment for the full tarball; this tarball mistakenly contains only the recording; the test binary was accidentally left out. rr.panic.in.gc.recording.tar.gz

What did you see happen?

See above.

What did you expect to see?

No segfault in the runtime GC code.

Comment From: glycerine

Argh. By mistake, the previous upload did not include my Go test binary in the tarball. Trying again:

bk.rr.go.crash.unexpected.signal.packed.tar.gz

Comment From: gabyhelp


Comment From: glycerine

I tried Go tip at 4d1c255f159d90557b43ede07f8b9a209e1fb49c in the hope-springs-eternal, wishful-thinking spirit that maybe this had already been fixed somehow.

No luck.

The crash is different, so the latest runtime still seems not to be immune to the charms of rr chaos mode.

I'll summarize the log.txt below. The full log.txt transcript is in the tarball attached to this comment. In it, I've included the new full binary (rebuilt with Go at tip 4d1c), and the rr recorded-and-packed traces as well.

(summary; see log.txt in tarball for full transcript)
$ GOTRACEBACK=all GOEXPERIMENT=synctest /usr/local/dev-go/go/bin/go  test -race -c -o tube.go1.25.pre.test

$ rr record -h ./tube.go1.25.pre.test  -test.v -test.run 016

rr: Saving execution to trace directory `/mnt/oldrog/home/jaten/.local/share/rr/tube.go1.25.pre.test-0'.
faketime = true
=== RUN   Test016_tube_parallel_linz

tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true

simnet.go:2189 [goID 8] 2000-01-01 00:00:00.000000000 +0000 UTC scheduler is running on goro = 8
unexpected fault address 0x218600000000
fatal error: fault
unexpected fault address 0xbda43364da8
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0xbda43364da8 pc=0xbda43364da8]

goroutine 48 gp=0xc100104c40 m=9 mp=0xc100099008 [running, synctest bubble 21, durably blocked]:
runtime.throw({0x101d271?, 0x1?})
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1094 +0x48 fp=0xbda43364d20 sp=0xbda43364cf0 pc=0x4b9268
runtime.goroutineheader(0xc100104c40)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/traceback.go:1264 +0x3ea fp=0xbda43364d78 sp=0xbda43364d20 pc=0x4a8d8a
runtime.traceback(...)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/traceback.go:803
runtime.dopanic_m(0xc100104c40, 0x4b9268, 0xbda43364cf0, 0x0)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1427 +0x2b9 fp=0xbda43364df8 sp=0xbda43364d78 pc=0x47def9
runtime.fatalthrow.func1()
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1280 +0x6d fp=0xbda43364e40 sp=0xbda43364df8 pc=0x47d8ed
runtime.systemstack(0x98f000)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/asm_amd64.s:513 +0x4a fp=0xbda43364e50 sp=0xbda43364e40 pc=0x4bfaea
created by github.com/glycerine/rpc25519.(*Client).Start in goroutine 22
        /home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/cli.go:1665 +0x3f6

goroutine 1 gp=0xc100002380 m=nil [chan receive]:
runtime.gopark(0x18?, 0x164fee0?, 0x60?, 0x0?, 0xc10019f710?)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/proc.go:460 +0xce fp=0xc1001476a8 sp=0xc100147688 pc=0x4b93ae
runtime.chanrecv(0xc1003d6080, 0xc10019f78e, 0x1)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/chan.go:667 +0x3ee fp=0xc100147720 sp=0xc1001476a8 pc=0x44a62e
runtime.chanrecv1(0x164dc60?, 0xf3dfa0?)
        /mnt/oldrog/usr/local/dev-go/go/src/runtime/chan.go:509 +0x12 fp=0xc100147748 sp=0xc100147720 pc=0x44a212
testing.(*T).Run(0xc1003828c0, {0x102ddf9, 0x1a}, 0x1061af0)
        /mnt/oldrog/usr/local/dev-go/go/src/testing/testing.go:2002 +0x9fe fp=0xc100147888 sp=0xc100147748 pc=0x5bf27e
testing.runTests.func1(0xc1003828c0)
...

(and in the rr replay):

Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 978270.978278]
0x0000000000402650 in __tsan::ShadowSet(__tsan::RawShadow*, __tsan::RawShadow*, __tsan::RawShadow) ()

(rr) bt

#0  0x0000000000402650 in __tsan::ShadowSet(__tsan::RawShadow*, __tsan::RawShadow*, __tsan::RawShadow) ()
#1  0x0000000000415d38 in __tsan::MemoryRangeSet(unsigned long, unsigned long, __tsan::RawShadow) [clone .part.0] ()
#2  0x0000000002202000 in __tsan::ctx_placeholder ()
#3  0x000000c903fef0d0 in ?? ()
#4  0x000000c903fef088 in ?? ()
#5  0x000000c100099008 in ?? ()
#6  0x000000c100104c40 in ?? ()
#7  0x00000000004c3263 in racecall ()
    at /mnt/oldrog/usr/local/dev-go/go/src/runtime/race_amd64.s:454
#8  0x0000000000000000 in ?? ()

add.to.issue_74019_latest_dev_go_segfaults_too.tar.gz

Comment From: glycerine

To make it easier to reproduce (to avoid having to build rr, if possible), the specific rr binary I'm running is at github.com/rr-debugger/obj2.clang/bin/rr in the following (see the github.com/rr-debugger/ldd.dependencies subdir).

Hmm. GitHub rejects attaching an 87MB tarball to an issue, so I've put it here instead:

https://github.com/glycerine/rr_binary_for_issue74019/raw/refs/heads/master/rr-debuger.tar.gz

It's not terribly hard to build from source, but the other point is to document precisely what I'm running. All major Linux distributions seem to carry rr too, so building from source may be completely unnecessary if you just want to replay the recording.

Note that many cloud providers' virtual machines disable the necessary conditional-branches-retired hardware counter, and many ARM chips don't provide it at all, but the rr README says "Certain AArch64 microarchitectures (e.g. ARM Neoverse N1 or the Apple Silicon M-series)" should work; see https://github.com/rr-debugger/rr/blob/daaab24f10c81394f209cf05ef1191af3365ccd8/README.md?plain=1#L16. Regarding clouds, they note there: "Running in a VM guest is supported, as long as the VM supports virtualization of hardware performance counters. (VMware and KVM are known to work; Xen does not.)"

To check whether your hardware is suitable, see also the discussions at https://github.com/rr-debugger/rr/wiki/Zen and https://github.com/rr-debugger/rr/wiki/Will-rr-work-on-my-system, and the build guide at https://github.com/rr-debugger/rr/wiki/Building-And-Installing

Comment From: mknyszek

Thank you very much for the thorough and detailed bug report!

I spent some time reasoning through the span set code again, and I really don't see how a nil block could end up there because of a bug in the span set code. Given that you see other failures non-deterministically, assuming rr's chaos mode can be trusted, I am inclined to say that you did stumble upon a real bug, but a worse one: possibly some kind of memory corruption bug instead.

If rr cannot be fully trusted (and I only bring this up because of my experience with gdb -- don't get me wrong, it's an incredibly valuable tool I use regularly, but it is also surprisingly easy to get it to segfault), my first suspicion leads me to something about how chaos mode handles signal delivery (if at all). Go programs tend to send way more weird signals to threads than C/C++/Java/etc. programs due to asynchronous preemption. For reference, this feature has uncovered subtle signal handling memory corruption bugs in both Linux and Oxide's OS in the past.

As a shot in the dark, what happens if you run under chaos mode with GODEBUG=asyncpreemptoff=1, which disables asynchronous preemption? Does that change anything? If not, then I'm not sure I have more leads in this direction, and we might just want to assume some kind of bug in the Go runtime. If so, then it could still be a bug in the runtime, but I think at that point I'd like to know a bit more about how rr handles signal delivery.

Thank you again for taking the time to write this up. I'll take a closer look at the rest of your messages next week. I suspect this is going to uncover something interesting. :)

Comment From: glycerine

Hi @mknyszek

I'll see what I can do. The rr recording provides a complete history of the run: every instruction executed, every signal delivered to the application (which kind of makes up for the Go runtime not being deterministic :), all external IO, the result of each and every system call, and even any/all forked child processes, so all of it can be replayed... nice! The classic way to debug with it, I'm told, is to run the replay to the crash, set a hardware watchpoint on the variable of interest (that which should not be nil, perhaps), and then do a reverse-continue in rr replay mode to replay the execution in reverse from the segfault until the watchpoint is hit.

I'd be happy to try this myself; it's just that I don't really know which variable or variables to set watchpoints on. The line that the segfault is on is

s := block.spans[bottom].Load() // runtime/mspanset.go:189

but gdb reported block as "optimized out" when I tried to print it, so I was kind of stumped at that point.

If you can help me figure out what, if anything, relevant is in memory (or in a register, though I'm not sure hardware watchpoints can watch those...), then I'd be better able to try to make some deductions. I did run rr pack on the trace, so theoretically (fingers crossed) anyone should be able to replay the trace in reverse-execution mode and figure out where things went wrong and which signals were delivered.

I'll try some runs with GODEBUG=asyncpreemptoff=1 and report back. If you're willing to try it, I'd be curious to know whether rr lives up to its promises and you can yourself replay the execution in reverse from the crash point. It would be a bummer if not, but it would also tell us how much hardware alignment etc. is needed to give anyone the ability to reproduce a recorded bug--and that's the really huge potential here (and why I'd like to get the runtime happy under chaos mode, so I can use it to find actual interleaving-related bugs in my own Go code).

How much can rr be trusted? Well, my impression is that rr itself is very reliable, but since I'm so new to it, the user (me) "using it wrong" would not surprise me.

About the possible memory-corruption alternative... I don't have a good mental picture of its possible causes or consequences, but if any of them suggest hypotheses that can be checked with hardware watchpoints, that's still the easy path. If you can describe a possible scenario that could lead to the observed segfault, I might be able to do more. As it is, I just don't know enough about the Go runtime to do much more debugging on my own. I certainly wish I did, but I've never had time to look into it.

Comment From: glycerine

Re how signals are handled, I'll try to quote the relevant explanations from the rr extended tech report https://arxiv.org/pdf/1705.05937

RR performs preemptive context switching by periodically interrupting the running thread with an asynchronous signal (see Section 2.4). The normal scheduler honors Linux thread priorities strictly: higher priority threads always run in preference to lower-priority threads. Threads with equal priority run in a round-robin manner.

(My comment: presumably rr does not deliver that signal to the program under test, but it does raise a concern about cross talk between signals Go wants and signals rr uses, although most of my concern seems to be allayed in the following.)

2.3.9 Signal Handlers

When a synchronous or asynchronous signal triggers a signal handler during recording, the kernel sets up a userspace stack frame for the handler and transfers control to the address of the handler. ptrace reports the delivery of a signal via a ptrace-stop notification, then RR triggers the transition to the signal handler by issuing a PTRACE SINGLESTEP request. (RR must track internally which signals have handlers so it knows in advance which signals will trigger a handler. It also needs to be aware of quirky edge cases that suppress entry into a signal handler; for example, when code triggers a SIGILL or SIGSEGV, but the tracee has used sigprocmask to block the signal, the signal is delivered but handlers do not run (usually with fatal results).)

When RR has transitioned into the signal handler, it records the contents of the stack frame and registers set up by the kernel. During replay, no signal handlers are set up and no real signals are delivered. To replay the entry to a signal handler, we just write the recorded signal handler frame into memory and set the registers to the recorded values. By not delivering real signals during replay, we avoid having to set up signal-handler state in the replaying process. ...

2.3.10 Signal Handling And Interrupted System Calls

During normal non-recorded execution, when a system call is interrupted by a signal, the kernel adjusts the user-space registers with the program counter immediately after the system-call instruction and a special -ERESTARTSYS (or similar) value in the system-call result register. If a signal handler needs to run, the signal frame is built and the signal handler runs normally. When execution returns to the pre-handler state via sigreturn, or a signal handler did not run but the thread is resumed by some other mechanism, then just before returning to user-space the kernel’s system-call restart mechanism kicks in: when the kernel returns to userspace, if the system-call result register contains -ERESTARTSYS and certain other conditions are met, the kernel automatically backs up the program counter to before the system call instruction and restores the system-call-number register. This causes the system call to reexecute.

RR avoids interfering with this restart mechanism and relies on the kernel restarting interrupted system calls as it normally would. However, because we set up state on system-call entry that expects to be processed on system-call exit, we have to detect when a ptrace system-call notification corresponds to a restarted system call instead of a new system call. There is no foolproof way to do this, but we have a heuristic that works well. We maintain a stack of pending interrupted system calls and their saved parameter registers. On every system call entry, if the parameter registers match the parameters of the most recent pending interrupted system call, we assume that system call is being restarted and pop it off the stack. If a signal handler interrupted a system call and then returns, but the interrupted system call is not immediately restarted, we assume it has been abandoned and pop it off the stack.

2.4 Asynchronous Events

We need to support two kinds of asynchronous events: preemptive context switches and signals. We treat the former as a special case of the latter, forcing a context switch by sending a signal to a running tracee thread. We need to ensure that during replay, a signal is delivered when the program is in exactly the same state as it was when the signal was delivered during recording. ...

2.4.1 on Nondeterministic Performance Counters ... Fortunately, modern Intel CPUs have exactly one deterministic performance counter: “retired conditional branches” (“RCB”), so we use that. We cannot just count the number of RCBs during recording and deliver the signal after we have executed that number of RCBs during replay, because the RCB count does not uniquely determine the execution point to deliver the signal at. Therefore we pair the RCB count with the complete state of general-purpose registers (including the program counter) to identify an execution point.

In general that still does not uniquely identify an execution point (e.g. consider the infinite loop label: inc [global var]; jmp label;). However, in practice we have found it works reliably; code that returns to the same instruction with no intervening conditional branch must be very rare, and it only matters to RR for regular replay if an asynchronous event occurs at such an instruction — in which case replay would probably diverge and fail.

Comment From: glycerine

@mknyszek I did a bunch of work to track down one issue:

I found a simpler reproducer, recorded some rr -h chaos mode traces, and put my analysis here.

https://github.com/glycerine/rr_binary_for_issue74019?tab=readme-ov-file#analysis

The high-level takeaway is that there appear to be a lot of races between the thread sanitizer (tsan, the race detector) and the Go runtime's basic thread-startup path, which rr's chaos mode can (relatively) quickly provoke, record, debug, and dispatch.

There are a lot of other issues that rr discovers, but one at a time. I depended pretty heavily on Gemini to learn my way around, and it was very good at teaching me how to debug better with gdb and rr.

Gemini seems pretty convinced about the nature of that trace's bug. Unfortunately, I don't myself have a mental model of the runtime to validate its conviction against, so I'm hoping you can tell me if we are at least on the right track.