Go version
go1.24.3 linux/amd64
Also, this run used GOEXPERIMENT=synctest with go1.24.3; given the location of the crash I doubt that is a factor, but it could be.
Output of go env in your module/workspace:
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE='auto'
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/jaten/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/jaten/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2503454780=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/jaten/zdb/go.mod'
GOMODCACHE='/home/jaten/go/pkg/mod'
GONOPROXY='github.com/glycerine/*'
GONOSUMDB='github.com/glycerine/*'
GOOS='linux'
GOPATH='/home/jaten/go'
GOPRIVATE='github.com/glycerine/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/mnt/oldrog/usr/local/go1.24.3'
GOSUMDB='sum.golang.org'
GOTELEMETRY='off'
GOTELEMETRYDIR='/home/jaten/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/mnt/oldrog/usr/local/go1.24.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.3'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
Detailed context for my Linux development host:
Under go1.24.3 amd64
on Ubuntu 24.04.2 LTS with a 6.11.0-26-generic kernel,
booted with nospec_store_bypass_disable
(per https://github.com/rr-debugger/rr/wiki/Zen#ssb-mitigation )
kernel.apparmor_restrict_unprivileged_userns=1
kernel.yama.ptrace_scope = 0
kernel.perf_event_paranoid=-1
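(The kernel.* settings above were applied via sysctl before recording; an illustrative sketch:)
$ sudo sysctl kernel.apparmor_restrict_unprivileged_userns=1
$ sudo sysctl kernel.yama.ptrace_scope=0
$ sudo sysctl kernel.perf_event_paranoid=-1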
$ uname -a
Linux rog 6.11.0-26-generic #26~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 17 19:20:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"
$
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD Ryzen Threadripper 3960X 24-Core Processor
stepping : 0
microcode : 0x830107c
rr was compiled from source at release version 5.9 using clang++
https://github.com/rr-debugger/rr
[edit: actually, it was at rr tip: daaab24f10c81394f209cf05ef1191af3365ccd8 ]
$ clang++ --version
Ubuntu clang version 18.1.3 (1ubuntu1)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Since my CPU is AMD, the rr Zen workaround Python script must be (and was) deployed, as described at https://github.com/rr-debugger/rr/wiki/Zen .
Running a small Go test for my Raft code under rr record -h (chaos mode) for the first time may have found a bug in the Go runtime garbage collection code.
Or, of course, I may just have made a goof somewhere.
rr is a fairly sophisticated bit of kit, and I'm still getting oriented to it.
However, I don't want to dismiss its findings out of inexperience when they may help improve the Go garbage collector code.
And since I hope to be doing more rr work in the future to help harden my Raft code, I need to know whether this is a false positive to expect, or whether we are seeing a real issue in the runtime GC code.
I wrote a brief introduction to rr and its chaos mode, for helping to locate hard-to-find bugs (those that depend on CPU starvation or on hard-to-obtain small-scheduling-quantum thread interleavings), here:
https://groups.google.com/g/golang-nuts/c/ouBKO-Q0Mtw
In short, in record -h chaos mode the rr debugger uses ptrace to take control of the threading in the target (here Go) program, runs all threads on a single CPU, and forces the thread scheduling into unusual schedules, empirically and heuristically chosen to uncover hard-to-find concurrency bugs.
https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html and https://robert.ocallahan.org/2016/02/deeper-into-chaos.html
provide readable introductions to some of the early approaches, which have almost certainly been refined in the intervening 9 years.
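In concrete terms, a session looks like this (a minimal sketch; ./my.test is a placeholder binary, and -h is the chaos-mode flag):
$ rr record -h ./my.test   # record one run under chaos-mode scheduling
$ rr replay                # deterministically replay the latest recording under gdb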
The recording run:
-*- mode: compilation; default-directory: "~/zdb/tube/" -*-
Compilation started at Thu Jun 5 13:35:25
make rr
GOTRACEBACK=all GOEXPERIMENT=synctest go test -race -c -o rpc.test
rr record -h ./rpc.test -test.v -test.run 016
rr: Saving execution to trace directory `/mnt/oldrog/home/jaten/.local/share/rr/rpc.test-0'.
faketime = true
=== RUN Test016_tube_parallel_linz
tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x471843]
goroutine 3 gp=0xc000003340 m=7 mp=0xc000480008 [running]:
runtime.throw({0x10298a0?, 0x446968?})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/panic.go:1101 +0x48 fp=0xc000076e80 sp=0xc000076e50 pc=0x4b1c08
runtime.sigpanic()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:909 +0x3c9 fp=0xc000076ee0 sp=0xc000076e80 pc=0x4b4029
runtime.(*spanSet).pop(0x1?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189 +0xa3 fp=0xc000076f08 sp=0xc000076ee0 pc=0x471843
runtime.(*mheap).nextSpanForSweep(0x1631540)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105 +0xc8 fp=0xc000076f30 sp=0xc000076f08 pc=0x462588
runtime.sweepone()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:370 +0x8d fp=0xc000076f80 sp=0xc000076f30 pc=0x462b8d
runtime.bgsweep(0xc000030100)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:297 +0xff fp=0xc000076fc8 sp=0xc000076f80 pc=0x4629bf
runtime.gcenable.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204 +0x25 fp=0xc000076fe0 sp=0xc000076fc8 pc=0x456e25
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000076fe8 sp=0xc000076fe0 pc=0x4ba141
created by runtime.gcenable in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204 +0x66
goroutine 1 gp=0xc000002380 m=nil [chan receive]:
runtime.gopark(0x18?, 0x16277e0?, 0x48?, 0xc3?, 0xc00019f718?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004e36b0 sp=0xc0004e3690 pc=0x4b1d4e
runtime.chanrecv(0xc0001024d0, 0xc00019f796, 0x1)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004e3728 sp=0xc0004e36b0 pc=0x447545
runtime.chanrecv1(0x1625780?, 0xf2e660?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:506 +0x12 fp=0xc0004e3750 sp=0xc0004e3728 pc=0x447132
testing.(*T).Run(0xc000106e00, {0x101944e, 0x1a}, 0x104c6a8)
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1859 +0x91e fp=0xc0004e3888 sp=0xc0004e3750 pc=0x5b0c9e
testing.runTests.func1(0xc000106e00)
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2279 +0x86 fp=0xc0004e38d8 sp=0xc0004e3888 pc=0x5b51a6
testing.tRunner(0xc000106e00, 0xc00019fae0)
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1792 +0x226 fp=0xc0004e39a8 sp=0xc0004e38d8 pc=0x5af1e6
testing.runTests(0xc00027c2e8, {0x161dba0, 0x47, 0x47}, {0x1?, 0x1?, 0x0?})
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2277 +0x96d fp=0xc0004e3b10 sp=0xc0004e39a8 pc=0x5b500d
testing.(*M).Run(0xc0001488c0)
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:2142 +0xeeb fp=0xc0004e3e88 sp=0xc0004e3b10 pc=0x5b25eb
main.main()
_testmain.go:387 +0x165 fp=0xc0004e3f50 sp=0xc0004e3e88 pc=0xe370a5
runtime.main()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:283 +0x28b fp=0xc0004e3fe0 sp=0xc0004e3f50 pc=0x47a7cb
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004e3fe8 sp=0xc0004e3fe0 pc=0x4ba141
goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x15bb610?, 0x16277e0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0000607a8 sp=0xc000060788 pc=0x4b1d4e
runtime.goparkunlock(...)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:441
runtime.forcegchelper()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:348 +0xb3 fp=0xc0000607e0 sp=0xc0000607a8 pc=0x47ab13
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000607e8 sp=0xc0000607e0 pc=0x4ba141
created by runtime.init.7 in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:336 +0x1a
goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x1173788?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000070f78 sp=0xc000070f58 pc=0x4b1d4e
runtime.goparkunlock(...)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:441
runtime.(*scavengerState).park(0x16260a0)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000070fa8 sp=0xc000070f78 pc=0x460409
runtime.bgscavenge(0xc000030100)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000070fc8 sp=0xc000070fa8 pc=0x460999
runtime.gcenable.gowrap2()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:205 +0x25 fp=0xc000070fe0 sp=0xc000070fc8 pc=0x456dc5
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x4ba141
created by runtime.gcenable in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:205 +0xa5
goroutine 18 gp=0xc000106380 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000077e30 sp=0xc000077e10 pc=0x4b1d4e
runtime.runfinq()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mfinal.go:196 +0x145 fp=0xc000077fe0 sp=0xc000077e30 pc=0x455de5
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000077fe8 sp=0xc000077fe0 pc=0x4ba141
created by runtime.createfing in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mfinal.go:166 +0x3d
goroutine 19 gp=0xc000106540 m=nil [chan receive]:
runtime.gopark(0x1648530?, 0xffffffff?, 0x29?, 0xd1?, 0xc000394f50?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000394f18 sp=0xc000394ef8 pc=0x4b1d4e
runtime.chanrecv(0xc0001020e0, 0x0, 0x1)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc000394f90 sp=0xc000394f18 pc=0x447545
runtime.chanrecv1(0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:506 +0x12 fp=0xc000394fb8 sp=0xc000394f90 pc=0x447132
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1799 +0x2f fp=0xc000394fe0 sp=0xc000394fb8 pc=0x459f6f
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000394fe8 sp=0xc000394fe0 pc=0x4ba141
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1794 +0x79
goroutine 20 gp=0xc000106700 m=nil [select, locked to thread]:
runtime.gopark(0xc000075fa8?, 0x2?, 0x60?, 0x0?, 0xc000075f94?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000075de8 sp=0xc000075dc8 pc=0x4b1d4e
runtime.selectgo(0xc000075fa8, 0xc000075f90, 0x0?, 0x0, 0x2?, 0x1)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/select.go:351 +0x985 fp=0xc000075f58 sp=0xc000075de8 pc=0x48e385
runtime.ensureSigM.func1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:1085 +0x192 fp=0xc000075fe0 sp=0xc000075f58 pc=0x4abfb2
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x4ba141
created by runtime.ensureSigM in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/runtime/signal_unix.go:1068 +0xc8
goroutine 21 gp=0xc0001068c0 m=0 mp=0x16277e0 [syscall]:
runtime.notetsleepg(0x1648d60, 0xffffffffffffffff)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/lock_futex.go:123 +0x29 fp=0xc00005cfa0 sp=0xc00005cf78 pc=0x44d889
os/signal.signal_recv()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/sigqueue.go:152 +0x29 fp=0xc00005cfc0 sp=0xc00005cfa0 pc=0x4b4069
os/signal.loop()
/mnt/oldrog/usr/local/go1.24.3/src/os/signal/signal_unix.go:23 +0x1d fp=0xc00005cfe0 sp=0xc00005cfc0 pc=0x5ea5dd
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00005cfe8 sp=0xc00005cfe0 pc=0x4ba141
created by os/signal.Notify.func1.1 in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/os/signal/signal.go:152 +0x47
goroutine 22 gp=0xc000106a80 m=nil [chan receive]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004b9e18 sp=0xc0004b9df8 pc=0x4b1d4e
runtime.chanrecv(0xc00027bb90, 0xc0004b9f98, 0x1)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004b9e90 sp=0xc0004b9e18 pc=0x447545
runtime.chanrecv2(0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:511 +0x12 fp=0xc0004b9eb8 sp=0xc0004b9e90 pc=0x447152
github.com/glycerine/rpc25519.init.1.func1()
/home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/filterstacks.go:22 +0x28c fp=0xc0004b9fe0 sp=0xc0004b9eb8 pc=0xd1856c
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b9fe8 sp=0xc0004b9fe0 pc=0x4ba141
created by github.com/glycerine/rpc25519.init.1 in goroutine 1
/home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/filterstacks.go:21 +0xca
goroutine 23 gp=0xc000106c40 m=nil [chan receive]:
runtime.gopark(0xc00027be20?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004b8e80 sp=0xc0004b8e60 pc=0x4b1d4e
runtime.chanrecv(0xc00027bdc0, 0xc0004b8fa0, 0x1)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:664 +0x3e5 fp=0xc0004b8ef8 sp=0xc0004b8e80 pc=0x447545
runtime.chanrecv2(0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/chan.go:511 +0x12 fp=0xc0004b8f20 sp=0xc0004b8ef8 pc=0x447152
github.com/glycerine/zdb/tube.init.0.func1()
/home/jaten/zdb/tube/filterstacks.go:21 +0x1c5 fp=0xc0004b8fe0 sp=0xc0004b8f20 pc=0xe31345
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b8fe8 sp=0xc0004b8fe0 pc=0x4ba141
created by github.com/glycerine/zdb/tube.init.0 in goroutine 1
/home/jaten/zdb/tube/filterstacks.go:20 +0xca
goroutine 24 gp=0xc000107180 m=nil [synctest.Run, synctest group 24]:
runtime.gopark(0xc000390e90?, 0x0?, 0x0?, 0x0?, 0xc000494000?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc000390e40 sp=0xc000390e20 pc=0x4b1d4e
internal/synctest.Run(0xc000494000)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/synctest.go:191 +0x1a5 fp=0xc000390eb8 sp=0xc000390e40 pc=0x4b5f05
testing/synctest.Run(...)
/mnt/oldrog/usr/local/go1.24.3/src/testing/synctest/synctest.go:38
github.com/glycerine/zdb/tube.bubbleOrNot(...)
/home/jaten/zdb/tube/synctest.go:22
github.com/glycerine/zdb/tube.Test016_tube_parallel_linz(0xc000106fc0)
/home/jaten/zdb/tube/porc_test.go:47 +0x8c fp=0xc000390ee0 sp=0xc000390eb8 pc=0xdf968c
testing.tRunner(0xc000106fc0, 0x104c6a8)
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1792 +0x226 fp=0xc000390fb0 sp=0xc000390ee0 pc=0x5af1e6
testing.(*T).Run.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1851 +0x45 fp=0xc000390fe0 sp=0xc000390fb0 pc=0x5b0f45
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000390fe8 sp=0xc000390fe0 pc=0x4ba141
created by testing.(*T).Run in goroutine 1
/mnt/oldrog/usr/local/go1.24.3/src/testing/testing.go:1851 +0x8f3
goroutine 34 gp=0xc000482380 m=4 mp=0xc000067808 [running, synctest group 24]:
goroutine running on other thread; stack unavailable
created by internal/synctest.Run in goroutine 24
/mnt/oldrog/usr/local/go1.24.3/src/runtime/synctest.go:178 +0x10d
goroutine 35 gp=0xc000482540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bff38 sp=0xc0004bff18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bffc8 sp=0xc0004bff38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bffe0 sp=0xc0004bffc8 pc=0x459165
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bffe8 sp=0xc0004bffe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105
goroutine 36 gp=0xc000482700 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bef38 sp=0xc0004bef18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004befc8 sp=0xc0004bef38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004befe0 sp=0xc0004befc8 pc=0x459165
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004befe8 sp=0xc0004befe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105
goroutine 37 gp=0xc0004828c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bdf38 sp=0xc0004bdf18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bdfc8 sp=0xc0004bdf38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bdfe0 sp=0xc0004bdfc8 pc=0x459165
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bdfe8 sp=0xc0004bdfe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105
goroutine 38 gp=0xc000482a80 m=nil [GC worker (idle)]:
runtime.gopark(0x53b739deb6ac?, 0x0?, 0x0?, 0x0?, 0x0?)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/proc.go:435 +0xce fp=0xc0004bcf38 sp=0xc0004bcf18 pc=0x4b1d4e
runtime.gcBgMarkWorker(0xc0004ae0e0)
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1423 +0xe9 fp=0xc0004bcfc8 sp=0xc0004bcf38 pc=0x459289
runtime.gcBgMarkStartWorkers.gowrap1()
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x25 fp=0xc0004bcfe0 sp=0xc0004bcfc8 pc=0x459165
runtime.goexit({})
/mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0004bcfe8 sp=0xc0004bcfe0 pc=0x4ba141
created by runtime.gcBgMarkStartWorkers in goroutine 34
/mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:1339 +0x105
make: *** [Makefile:11: rr] Error 2
Compilation exited abnormally with code 2 at Thu Jun 5 13:35:26
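(For reference, a minimal sketch of the Makefile targets driving the above, reconstructed from the transcript; the actual Makefile is not shown here:)
rr:
	GOTRACEBACK=all GOEXPERIMENT=synctest go test -race -c -o rpc.test
	rr record -h ./rpc.test -test.v -test.run 016
replay:
	rr replay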
Under replay [the (rr) prompt indicates we are in a gdb session replaying the rr recording]:
jaten@rog ~/zdb/tube (master) $ make replay
rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/oldrog/home/jaten/.local/share/rr/rpc.test-0/mmap_pack_0_rpc.test...
Loading Go Runtime support.
Remote debugging using 127.0.0.1:31997
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
Downloading separate debug info for system-supplied DSO at 0x6fffd000
0x000076133c408540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
faketime = true
=== RUN Test016_tube_parallel_linz
tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true
[New Thread 678585.678591]
[New Thread 678585.678586]
[New Thread 678585.678587]
[New Thread 678585.678588]
[New Thread 678585.678589]
[New Thread 678585.678590]
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 678585.678591]
runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
189 s := block.spans[bottom].Load()
(rr) list
184
185 // Given that the spine length is correct, we know we will never
186 // see a nil block here, since the length is always updated after
187 // the block is set.
188 block := blockp.Load()
189 s := block.spans[bottom].Load()
190 for s == nil {
191 // We raced with the span actually being set, but given that we
192 // know a block for this span exists, the race window here is
193 // extremely small. Try again.
(rr) p block
$1 = <optimized out>
(rr) p bottom
No symbol "bottom" in current context.
(rr) up
#1 0x0000000000462588 in runtime.(*mheap).nextSpanForSweep (
h=0x1631540 <runtime.mheap_>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105
105 s = c.partialUnswept(sg).pop()
(rr) down
#0 runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
189 s := block.spans[bottom].Load()
(rr) bt
#0 runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
#1 0x0000000000462588 in runtime.(*mheap).nextSpanForSweep (
h=0x1631540 <runtime.mheap_>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:105
#2 0x0000000000462b8d in runtime.sweepone (~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:370
#3 0x00000000004629bf in runtime.bgsweep (c=chan int)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgcsweep.go:297
#4 0x0000000000456e25 in runtime.gcenable.gowrap1 ()
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mgc.go:204
#5 0x00000000004ba141 in runtime.goexit ()
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/asm_amd64.s:1700
#6 0x0000000000000000 in ?? ()
(rr) i reg
rax 0x1 1
rbx 0x0 0
rcx 0x218008001880 36833773754496
rdx 0x0 0
rsi 0x218008001880 36833773754496
rdi 0x100000001 4294967297
rbp 0xc000076ef8 0xc000076ef8
rsp 0xc000076ee0 0xc000076ee0
r8 0x1 1
r9 0x8 8
r10 0x41 65
r11 0x400 1024
r12 0x1 1
r13 0xc000480008 824638439432
r14 0xc000003340 824633733952
r15 0x4 4
rip 0x471843 0x471843 <runtime.(*spanSet).pop+163>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
fs_base 0x28a86ad266c0 0x28a86ad266c0
gs_base 0x0 0x0
(rr) down
Bottom (innermost) frame selected; you cannot go down.
(rr) i thr
Id Target Id Frame
1 Thread 678585.678585 (rpc.test) 0x0000000070000002 in syscall_traced ()
* 2 Thread 678585.678591 (rpc.test) runtime.(*spanSet).pop (b=<optimized out>,
~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:189
3 Thread 678585.678586 (rpc.test) runtime.usleep ()
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/sys_linux_amd64.s:135
4 Thread 678585.678587 (rpc.test) 0x0000000070000002 in syscall_traced ()
5 Thread 678585.678588 (rpc.test) 0x000000000047185b in runtime.(*spanSet).pop (b=<optimized out>, ~r0=<optimized out>)
at /mnt/oldrog/usr/local/go1.24.3/src/runtime/mspanset.go:190
6 Thread 678585.678589 (rpc.test) 0x0000000070000002 in syscall_traced ()
7 Thread 678585.678590 (rpc.test) 0x0000000070000002 in syscall_traced ()
(rr) i goro
1 waiting runtime.gopark
2 waiting runtime.gopark
* 3 running runtime.systemstack_switch
4 waiting runtime.gopark
18 waiting runtime.gopark
19 waiting runtime.gopark
20 waiting runtime.gopark
* 21 syscall runtime.notetsleepg
22 waiting runtime.gopark
23 waiting runtime.gopark
24 waiting runtime.gopark
* 34 running runtime.systemstack_switch
35 waiting runtime.gopark
36 waiting runtime.gopark
37 waiting runtime.gopark
38 waiting runtime.gopark
(rr) i sig
Signal Stop Print Pass to program Description
SIGHUP Yes Yes Yes Hangup
SIGINT Yes Yes No Interrupt
SIGQUIT Yes Yes Yes Quit
SIGILL Yes Yes Yes Illegal instruction
SIGTRAP Yes Yes No Trace/breakpoint trap
SIGABRT Yes Yes Yes Aborted
SIGEMT Yes Yes Yes Emulation trap
SIGFPE Yes Yes Yes Arithmetic exception
SIGKILL Yes Yes Yes Killed
SIGBUS Yes Yes Yes Bus error
SIGSEGV Yes Yes Yes Segmentation fault
SIGSYS Yes Yes Yes Bad system call
SIGPIPE Yes Yes Yes Broken pipe
SIGALRM No No Yes Alarm clock
SIGTERM Yes Yes Yes Terminated
SIGURG No No Yes Urgent I/O condition
SIGSTOP Yes Yes Yes Stopped (signal)
SIGTSTP Yes Yes Yes Stopped (user)
SIGCONT Yes Yes Yes Continued
SIGCHLD No No Yes Child status changed
SIGTTIN Yes Yes Yes Stopped (tty input)
SIGTTOU Yes Yes Yes Stopped (tty output)
SIGIO No No Yes I/O possible
SIGXCPU Yes Yes Yes CPU time limit exceeded
SIGXFSZ Yes Yes Yes File size limit exceeded
SIGVTALRM No No Yes Virtual timer expired
SIGPROF No No Yes Profiling timer expired
SIGWINCH No No Yes Window size changed
SIGLOST Yes Yes Yes Resource lost
SIGUSR1 Yes Yes Yes User defined signal 1
SIGUSR2 Yes Yes Yes User defined signal 2
SIGPWR Yes Yes Yes Power fail/restart
SIGPOLL No No Yes Pollable event occurred
SIGWIND Yes Yes Yes SIGWIND
SIGPHONE Yes Yes Yes SIGPHONE
SIGWAITING No No Yes Process's LWPs are blocked
SIGLWP No No Yes Signal LWP
SIGDANGER Yes Yes Yes Swap space dangerously low
SIGGRANT Yes Yes Yes Monitor mode granted
SIGRETRACT Yes Yes Yes Need to relinquish monitor mode
SIGMSG Yes Yes Yes Monitor mode data available
SIGSOUND Yes Yes Yes Sound completed
SIGSAK Yes Yes Yes Secure attention
SIGPRIO No No Yes SIGPRIO
SIG33 Yes Yes Yes Real-time event 33
SIG34 Yes Yes Yes Real-time event 34
SIG35 Yes Yes Yes Real-time event 35
SIG36 Yes Yes Yes Real-time event 36
SIG37 Yes Yes Yes Real-time event 37
SIG38 Yes Yes Yes Real-time event 38
SIG39 Yes Yes Yes Real-time event 39
SIG40 Yes Yes Yes Real-time event 40
SIG41 Yes Yes Yes Real-time event 41
SIG42 Yes Yes Yes Real-time event 42
SIG43 Yes Yes Yes Real-time event 43
SIG44 Yes Yes Yes Real-time event 44
SIG45 Yes Yes Yes Real-time event 45
SIG46 Yes Yes Yes Real-time event 46
SIG47 Yes Yes Yes Real-time event 47
SIG48 Yes Yes Yes Real-time event 48
SIG49 Yes Yes Yes Real-time event 49
SIG50 Yes Yes Yes Real-time event 50
SIG51 Yes Yes Yes Real-time event 51
SIG52 Yes Yes Yes Real-time event 52
SIG53 Yes Yes Yes Real-time event 53
SIG54 Yes Yes Yes Real-time event 54
SIG55 Yes Yes Yes Real-time event 55
SIG56 Yes Yes Yes Real-time event 56
SIG57 Yes Yes Yes Real-time event 57
SIG58 Yes Yes Yes Real-time event 58
SIG59 Yes Yes Yes Real-time event 59
SIG60 Yes Yes Yes Real-time event 60
SIG61 Yes Yes Yes Real-time event 61
SIG62 Yes Yes Yes Real-time event 62
SIG63 Yes Yes Yes Real-time event 63
SIGCANCEL No No Yes LWP internal signal
SIG32 Yes Yes Yes Real-time event 32
SIG64 Yes Yes Yes Real-time event 64
SIG65 Yes Yes Yes Real-time event 65
SIG66 Yes Yes Yes Real-time event 66
SIG67 Yes Yes Yes Real-time event 67
SIG68 Yes Yes Yes Real-time event 68
SIG69 Yes Yes Yes Real-time event 69
SIG70 Yes Yes Yes Real-time event 70
SIG71 Yes Yes Yes Real-time event 71
SIG72 Yes Yes Yes Real-time event 72
SIG73 Yes Yes Yes Real-time event 73
SIG74 Yes Yes Yes Real-time event 74
SIG75 Yes Yes Yes Real-time event 75
SIG76 Yes Yes Yes Real-time event 76
SIG77 Yes Yes Yes Real-time event 77
SIG78 Yes Yes Yes Real-time event 78
SIG79 Yes Yes Yes Real-time event 79
SIG80 Yes Yes Yes Real-time event 80
SIG81 Yes Yes Yes Real-time event 81
SIG82 Yes Yes Yes Real-time event 82
SIG83 Yes Yes Yes Real-time event 83
SIG84 Yes Yes Yes Real-time event 84
SIG85 Yes Yes Yes Real-time event 85
SIG86 Yes Yes Yes Real-time event 86
SIG87 Yes Yes Yes Real-time event 87
SIG88 Yes Yes Yes Real-time event 88
SIG89 Yes Yes Yes Real-time event 89
SIG90 Yes Yes Yes Real-time event 90
SIG91 Yes Yes Yes Real-time event 91
SIG92 Yes Yes Yes Real-time event 92
SIG93 Yes Yes Yes Real-time event 93
SIG94 Yes Yes Yes Real-time event 94
SIG95 Yes Yes Yes Real-time event 95
SIG96 Yes Yes Yes Real-time event 96
SIG97 Yes Yes Yes Real-time event 97
SIG98 Yes Yes Yes Real-time event 98
SIG99 Yes Yes Yes Real-time event 99
SIG100 Yes Yes Yes Real-time event 100
SIG101 Yes Yes Yes Real-time event 101
SIG102 Yes Yes Yes Real-time event 102
SIG103 Yes Yes Yes Real-time event 103
SIG104 Yes Yes Yes Real-time event 104
SIG105 Yes Yes Yes Real-time event 105
SIG106 Yes Yes Yes Real-time event 106
SIG107 Yes Yes Yes Real-time event 107
SIG108 Yes Yes Yes Real-time event 108
SIG109 Yes Yes Yes Real-time event 109
SIG110 Yes Yes Yes Real-time event 110
SIG111 Yes Yes Yes Real-time event 111
SIG112 Yes Yes Yes Real-time event 112
SIG113 Yes Yes Yes Real-time event 113
SIG114 Yes Yes Yes Real-time event 114
SIG115 Yes Yes Yes Real-time event 115
SIG116 Yes Yes Yes Real-time event 116
SIG117 Yes Yes Yes Real-time event 117
SIG118 Yes Yes Yes Real-time event 118
SIG119 Yes Yes Yes Real-time event 119
SIG120 Yes Yes Yes Real-time event 120
SIG121 Yes Yes Yes Real-time event 121
SIG122 Yes Yes Yes Real-time event 122
SIG123 Yes Yes Yes Real-time event 123
SIG124 Yes Yes Yes Real-time event 124
SIG125 Yes Yes Yes Real-time event 125
SIG126 Yes Yes Yes Real-time event 126
SIG127 Yes Yes Yes Real-time event 127
SIGINFO Yes Yes Yes Information request
EXC_BAD_ACCESS Yes Yes Yes Could not access memory
EXC_BAD_INSTRUCTION Yes Yes Yes Illegal instruction/operand
EXC_ARITHMETIC Yes Yes Yes Arithmetic exception
EXC_EMULATION Yes Yes Yes Emulation instruction
EXC_SOFTWARE Yes Yes Yes Software generated exception
EXC_BREAKPOINT Yes Yes Yes Breakpoint
SIGLIBRT No No Yes librt internal signal
Use the "handle" command to change these tables.
(rr)
Attaching the (packed for portability) recording and my test binary in a tarball.
edit: see the next comment for the full tarball; this tarball mistakenly contains only the recording, and the test binary was accidentally left out. rr.panic.in.gc.recording.tar.gz
What did you see happen?
See above.
What did you expect to see?
No segfault in the runtime GC code.
Comment From: glycerine
Argh. By mistake, the previous upload did not include my test binary (the Go program) in the tarball. Trying again:
bk.rr.go.crash.unexpected.signal.packed.tar.gz
Comment From: gabyhelp
Related Issues
- runtime: fatal error: unknown caller pc #48921
- runtime: crash in race detector when execution tracer reads from CPU profile buffer #65607 (closed)
- runtime: runtime crashes, reproducer same as #20427 #36371 (closed)
- runtime #60196 (closed)
- runtime: spurious SIGSEGV in systemstack() #68632
- runtime: fatal error: stack barrier lost #11256 (closed)
- runtime: panic: non-empty mark queue after concurrent mark (Go1.14, Go1.15) #41303 (closed)
- runtime: memory corruption crashes with os/exec on Linux kernel 4.4 #20427 (closed)
- runtime: panic: unexpected return pc for runtime.heapBitsSetType #38851 (closed)
- runtime: crash in sbrk mode #53887
Comment From: glycerine
I tried Go tip at 4d1c255f159d90557b43ede07f8b9a209e1fb49c in the hope-springs-eternal wishful thinking that maybe this had already been fixed somehow.
No luck.
The crash is different, so the latest runtime still seems not to be immune to the charms of rr chaos mode.
I'll summarize log.txt below. The full log.txt transcript is in the tarball attached to this comment; in it I've included the new test binary (rebuilt with Go at tip 4d1c) and the rr recorded-and-packed traces as well.
(summary; see log.txt in tarball for full transcript)
$ GOTRACEBACK=all GOEXPERIMENT=synctest /usr/local/dev-go/go/bin/go test -race -c -o tube.go1.25.pre.test
$ rr record -h ./tube.go1.25.pre.test -test.v -test.run 016
rr: Saving execution to trace directory `/mnt/oldrog/home/jaten/.local/share/rr/tube.go1.25.pre.test-0'.
faketime = true
=== RUN Test016_tube_parallel_linz
tube.go:482 2000-01-01T00:00:00.000000000+00:00 faketime=true; cfg.UseSimNet=true; cfg.Barrier=true
simnet.go:2189 [goID 8] 2000-01-01 00:00:00.000000000 +0000 UTC scheduler is running on goro = 8
unexpected fault address 0x218600000000
fatal error: fault
unexpected fault address 0xbda43364da8
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0xbda43364da8 pc=0xbda43364da8]
goroutine 48 gp=0xc100104c40 m=9 mp=0xc100099008 [running, synctest bubble 21, durably blocked]:
runtime.throw({0x101d271?, 0x1?})
/mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1094 +0x48 fp=0xbda43364d20 sp=0xbda43364cf0 pc=0x4b9268
runtime.goroutineheader(0xc100104c40)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/traceback.go:1264 +0x3ea fp=0xbda43364d78 sp=0xbda43364d20 pc=0x4a8d8a
runtime.traceback(...)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/traceback.go:803
runtime.dopanic_m(0xc100104c40, 0x4b9268, 0xbda43364cf0, 0x0)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1427 +0x2b9 fp=0xbda43364df8 sp=0xbda43364d78 pc=0x47def9
runtime.fatalthrow.func1()
/mnt/oldrog/usr/local/dev-go/go/src/runtime/panic.go:1280 +0x6d fp=0xbda43364e40 sp=0xbda43364df8 pc=0x47d8ed
runtime.systemstack(0x98f000)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/asm_amd64.s:513 +0x4a fp=0xbda43364e50 sp=0xbda43364e40 pc=0x4bfaea
created by github.com/glycerine/rpc25519.(*Client).Start in goroutine 22
/home/jaten/go/pkg/mod/github.com/glycerine/rpc25519@v1.21.58/cli.go:1665 +0x3f6
goroutine 1 gp=0xc100002380 m=nil [chan receive]:
runtime.gopark(0x18?, 0x164fee0?, 0x60?, 0x0?, 0xc10019f710?)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/proc.go:460 +0xce fp=0xc1001476a8 sp=0xc100147688 pc=0x4b93ae
runtime.chanrecv(0xc1003d6080, 0xc10019f78e, 0x1)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/chan.go:667 +0x3ee fp=0xc100147720 sp=0xc1001476a8 pc=0x44a62e
runtime.chanrecv1(0x164dc60?, 0xf3dfa0?)
/mnt/oldrog/usr/local/dev-go/go/src/runtime/chan.go:509 +0x12 fp=0xc100147748 sp=0xc100147720 pc=0x44a212
testing.(*T).Run(0xc1003828c0, {0x102ddf9, 0x1a}, 0x1061af0)
/mnt/oldrog/usr/local/dev-go/go/src/testing/testing.go:2002 +0x9fe fp=0xc100147888 sp=0xc100147748 pc=0x5bf27e
testing.runTests.func1(0xc1003828c0)
...
(and in the rr replay):
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 978270.978278]
0x0000000000402650 in __tsan::ShadowSet(__tsan::RawShadow*, __tsan::RawShadow*, __tsan::RawShadow) ()
(rr) bt
#0 0x0000000000402650 in __tsan::ShadowSet(__tsan::RawShadow*, __tsan::RawShadow*, __tsan::RawShadow) ()
#1 0x0000000000415d38 in __tsan::MemoryRangeSet(unsigned long, unsigned long, __tsan::RawShadow) [clone .part.0] ()
#2 0x0000000002202000 in __tsan::ctx_placeholder ()
#3 0x000000c903fef0d0 in ?? ()
#4 0x000000c903fef088 in ?? ()
#5 0x000000c100099008 in ?? ()
#6 0x000000c100104c40 in ?? ()
#7 0x00000000004c3263 in racecall ()
at /mnt/oldrog/usr/local/dev-go/go/src/runtime/race_amd64.s:454
#8 0x0000000000000000 in ?? ()
add.to.issue_74019_latest_dev_go_segfaults_too.tar.gz
Comment From: glycerine
To make it easier to reproduce (and to avoid having to build rr, if possible), the specific rr binary I'm running is at github.com/rr-debugger/obj2.clang/bin/rr in the following tarball (see the github.com/rr-debugger/ldd.dependencies subdir for its shared-library dependencies).
Hmm. GitHub issues reject attaching an 87MB tarball, so I've put it here instead:
https://github.com/glycerine/rr_binary_for_issue74019/raw/refs/heads/master/rr-debuger.tar.gz
It's not terribly hard to build from source, but the other point was to make precise exactly what I'm running. All major Linux distributions seem to carry rr too, so building from source may be completely unnecessary if you just want to replay the recording.
Note that many cloud providers' virtual machines disable the necessary conditional-branches-retired hardware counter, and many ARM chips don't provide it at all, though the rr README says "Certain AArch64 microarchitectures (e.g. ARM Neoverse N1 or the Apple Silicon M-series)" should work; see https://github.com/rr-debugger/rr/blob/daaab24f10c81394f209cf05ef1191af3365ccd8/README.md?plain=1#L16 . Re clouds, they note there: "Running in a VM guest is supported, as long as the VM supports virtualization of hardware performance counters. (VMware and KVM are known to work; Xen does not.)"
See also whether your hardware is suitable in the discussions at https://github.com/rr-debugger/rr/wiki/Zen and https://github.com/rr-debugger/rr/wiki/Will-rr-work-on-my-system , and the build guide at https://github.com/rr-debugger/rr/wiki/Building-And-Installing .
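(A quick smoke test of whether your machine can record and replay at all, assuming rr is installed; rr prints a diagnostic at record time if the required performance counter is unavailable:)
$ rr record /bin/ls    # record a trivial program
$ rr replay -a         # replay it non-interactively to completion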
Comment From: mknyszek
Thank you very much for the thorough and detailed bug report!
I spent some time reasoning through the span set code again, and I really don't see how a nil block could end up there because of a bug in the span set code. Given that you see other failures non-deterministically, and assuming rr's chaos mode can be trusted, I am inclined to say that you did stumble upon a real bug, but that it's worse: possibly some kind of memory corruption bug instead.
If rr cannot be fully trusted (and I only bring this up because of my experience with gdb; don't get me wrong, it's an incredibly valuable tool I use regularly, but it is also surprisingly easy to get it to segfault), my first suspicion leads me to something about how chaos mode handles signal delivery (if at all). Go programs tend to send way more weird signals to threads than C/C++/Java/etc. programs due to asynchronous preemption. For reference, this feature has uncovered subtle signal handling memory corruption bugs in both Linux and Oxide's OS in the past.
As a shot in the dark, what happens if you run under chaos mode with GODEBUG=asyncpreemptoff=1, which disables asynchronous preemption? Does that change anything? If not, then I'm not sure I have more leads in this direction, and we might just want to assume some kind of bug in the Go runtime. If so, then it could still be a bug in the runtime, but I think at that point I'd like to know a bit more about how rr handles signal delivery.
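(For concreteness, a sketch of that experiment, mirroring the invocation earlier in this issue; GODEBUG takes effect at run time, so it goes on the recorded run rather than on the compile:)
$ GOTRACEBACK=all GOEXPERIMENT=synctest go test -race -c -o rpc.test
$ GODEBUG=asyncpreemptoff=1 rr record -h ./rpc.test -test.v -test.run 016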
Thank you again for taking the time to write this up. I'll take a closer look at the rest of your messages next week. I suspect this is going to uncover something interesting. :)
Comment From: glycerine
Hi @mknyszek
I'll see what I can do. The rr recorded trace provides a complete history of every instruction executed and every signal delivered to the application during the run (which kind of makes up for the Go runtime not being deterministic :). It also includes all external I/O, so that can be replayed too; any and all forked child processes can be replayed as well, along with the results of each and every system call. Nice! So the classic way to debug with the trace, I'm told, is to run the replay forward to the crash, set a hardware watchpoint on the variable of interest (that which should not be nil, perhaps), and then do a reverse-continue in rr replay mode to execute in reverse from the segfault until the watchpoint is hit.
I'd be happy to try this myself; it's just that I don't really know which variables to set the watch (or watches) on. The line the segfault is on is
s := block.spans[bottom].Load() // runtime/mspanset.go:189
but a gdb print block reported <optimized out>, so I was kind of stumped at that point.
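(A minimal sketch of that reverse-execution workflow, assuming the address of the suspect pointer slot can be recovered somehow, e.g. from the register dump or a less-optimized frame; 0xADDR is a stand-in, not a value from this trace:)
(rr) watch -l *(long *)0xADDR   # hardware watchpoint on the slot holding the block pointer
(rr) reverse-continue           # run backwards until that slot was last written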
If you can help me figure out what, if anything, relevant is in memory (or in a register, though I'm not sure hardware watchpoints can watch those), then I'd be better able to try to make some deductions. I did run rr pack on the trace, so theoretically (fingers crossed) anyone should be able to replay the trace in reverse-execution mode and figure out where things went wrong, and what signals were delivered.
I'll try some runs with GODEBUG=asyncpreemptoff=1 and report back. I would be curious to know, if you're willing to try it, whether rr lives up to its promises and you can yourself replay the execution in reverse from the crash point. It would kind of be a bummer if not, but it would also tell us how much hardware alignment etc. is needed to give anyone the ability to reproduce any recorded bug. That is the really huge potential here (and why I'd like to get the runtime happy under chaos mode, so I can use it to find actual interleaving-related bugs in my own Go code).
How much can rr be trusted? Well, my impression is that rr itself is very reliable, but since I'm so new to it, it would not surprise me if the user (me) were "using it wrong".
About the possible memory-corruption alternative: I don't have a good mental picture of its possible causes or consequences, but if any of them suggest hypotheses that can be checked with hardware watchpoints, that's still the easy path. If you can describe a possible scenario that could lead to the observed segfault, I might be able to do more. As it is, I just don't know enough about the Go runtime to do much more debugging on my own. I certainly wish I did, but I've never had time to look into it.
Comment From: glycerine
Re how signals are handled: I'll quote the relevant explanations from the rr extended tech report, https://arxiv.org/pdf/1705.05937 .
RR performs preemptive context switching by periodically interrupting the running thread with an asynchronous signal (see Section 2.4). The normal scheduler honors Linux thread priorities strictly: higher-priority threads always run in preference to lower-priority threads. Threads with equal priority run in a round-robin manner.
(My comment: presumably rr does not deliver that signal to the program under test, but it does raise a concern about cross talk between signals Go wants and signals rr uses, although most of my concern seems to be allayed in the following.)
2.3.9 Signal Handlers
When a synchronous or asynchronous signal triggers a signal handler during recording, the kernel sets up a userspace stack frame for the handler and transfers control to the address of the handler. ptrace reports the delivery of a signal via a ptrace-stop notification, then RR triggers the transition to the signal handler by issuing a PTRACE SINGLESTEP request. (RR must track internally which signals have handlers so it knows in advance which signals will trigger a handler. It also needs to be aware of quirky edge cases that suppress entry into a signal handler; for example, when code triggers a SIGILL or SIGSEGV but the tracee has used sigprocmask to block the signal, the signal is delivered but handlers do not run (usually with fatal results).)
When RR has transitioned into the signal handler, it records the contents of the stack frame and registers set up by the kernel. During replay, no signal handlers are set up and no real signals are delivered. To replay the entry to a signal handler, we just write the recorded signal handler frame into memory and set the registers to the recorded values. By not delivering real signals during replay, we avoid having to set up signal-handler state in the replaying process. ...
2.3.10 Signal Handling And Interrupted System Calls
During normal non-recorded execution, when a system call is interrupted by a signal, the kernel adjusts the user-space registers with the program counter immediately after the system-call instruction and a special -ERESTARTSYS (or similar) value in the system-call result register. If a signal handler needs to run, the signal frame is built and the signal handler runs normally. When execution returns to the pre-handler state via sigreturn, or a signal handler did not run but the thread is resumed by some other mechanism, then just before returning to user-space the kernel's system-call restart mechanism kicks in: when the kernel returns to user-space, if the system-call result register contains -ERESTARTSYS and certain other conditions are met, the kernel automatically backs up the program counter to before the system-call instruction and restores the system-call-number register. This causes the system call to re-execute.
RR avoids interfering with this restart mechanism and relies on the kernel restarting interrupted system calls as it normally would. However, because we set up state on system-call entry that expects to be processed on system-call exit, we have to detect when a ptrace system-call notification corresponds to a restarted system call instead of a new system call. There is no foolproof way to do this, but we have a heuristic that works well. We maintain a stack of pending interrupted system calls and their saved parameter registers. On every system-call entry, if the parameter registers match the parameters of the most recent pending interrupted system call, we assume that system call is being restarted and pop it off the stack. If a signal handler interrupted a system call and then returns, but the interrupted system call is not immediately restarted, we assume it has been abandoned and pop it off the stack.
2.4 Asynchronous Events
We need to support two kinds of asynchronous events: preemptive context switches and signals. We treat the former as a special case of the latter, forcing a context switch by sending a signal to a running tracee thread. We need to ensure that during replay, a signal is delivered when the program is in exactly the same state as it was when the signal was delivered during recording. ...
2.4.1 on Nondeterministic Performance Counters ... Fortunately, modern Intel CPUs have exactly one deterministic performance counter: “retired conditional branches” (“RCB”), so we use that. We cannot just count the number of RCBs during recording and deliver the signal after we have executed that number of RCBs during replay, because the RCB count does not uniquely determine the execution point to deliver the signal at. Therefore we pair the RCB count with the complete state of general-purpose registers (including the program counter) to identify an execution point.
In general that still does not uniquely identify an execution point (e.g. consider the infinite loop label: inc [global var]; jmp label;). However, in practice we have found it works reliably; code that returns to the same instruction with no intervening conditional branch must be very rare, and it only matters to RR for regular replay if an asynchronous event occurs at such an instruction — in which case replay would probably diverge and fail.
Comment From: glycerine
@mknyszek I did a bunch of work to track down one issue:
I found a simpler reproducer, recorded some rr -h chaos-mode traces, and put my analysis here:
https://github.com/glycerine/rr_binary_for_issue74019?tab=readme-ov-file#analysis
The high-level takeaway is that there appear to be a lot of races between the thread sanitizer (tsan, the race detector) and the Go runtime's basic thread-start process, races that rr's chaos mode allows to be (relatively) quickly provoked, recorded, debugged, and dispatched.
There are a lot of other issues that rr discovers, but one by one. I depended pretty heavily on Gemini to learn my way around, and it was very good at teaching me how to debug better with gdb and rr.
Gemini seems pretty convinced about the nature of that trace's bug. Unfortunately I don't myself have a mental model of the runtime to validate its conviction against, so I'm hoping you can tell me whether we are at least on the right track.