At Datadog, we've seen segfaults in runtime.(*unwinder).next. The programs run on Linux arm64 (in all the examples I've seen), on Go 1.24.1 and Go 1.23.6.
Here is the first example, on Go 1.24.1:
SIGSEGV: segmentation violation
PC=0x468da4 m=13 sigcode=1 addr=0x118
goroutine 0 [idle]:
runtime.(*unwinder).next(0xfc510200e438)
/usr/local/go/src/runtime/traceback.go:458 +0x184
runtime.traceback2(0xfc510200e438, 0x1, 0x0, 0x2e)
/usr/local/go/src/runtime/traceback.go:967 +0xcc
runtime.traceback1.func1(0x1)
/usr/local/go/src/runtime/traceback.go:903 +0x54
runtime.traceback1(0x400a702540?, 0x417800?, 0x3?, 0x400a702540, 0x68?)
/usr/local/go/src/runtime/traceback.go:927 +0x19c
runtime.traceback(...)
/usr/local/go/src/runtime/traceback.go:803
runtime.tracebackothers.func1(0x400a702540)
/usr/local/go/src/runtime/traceback.go:1279 +0x104
runtime.forEachGRace(0xfc510200e6c8)
/usr/local/go/src/runtime/proc.go:720 +0x68
runtime.tracebackothers(0x40074efdc0?)
/usr/local/go/src/runtime/traceback.go:1265 +0xcc
runtime.Stack.func1()
/usr/local/go/src/runtime/mprof.go:1717 +0xb4
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_arm64.s:244 +0x6c
goroutine 989290 gp=0x40074efdc0 m=13 mp=0x400078e008 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_arm64.s:201 +0x8 fp=0x4007d22960 sp=0x4007d22950 pc=0x481048
runtime.Stack({0x40107d6000?, 0x100000?, 0x100000?}, 0x1)
/usr/local/go/src/runtime/mprof.go:1707 +0xe0 fp=0x4007d22a00 sp=0x4007d22960 pc=0x43ab40
runtime/pprof.writeGoroutineStacks({0x27ac520, 0x4009abfd40})
/usr/local/go/src/runtime/pprof/pprof.go:764 +0x6c fp=0x4007d22a40 sp=0x4007d22a00 pc=0x8f890c
runtime/pprof.writeGoroutine({0x27ac520?, 0x4009abfd40?}, 0x0?)
/usr/local/go/src/runtime/pprof/pprof.go:753 +0x2c fp=0x4007d22a80 sp=0x4007d22a40 pc=0x8f884c
runtime/pprof.(*Profile).WriteTo(0x23cae63?, {0x27ac520?, 0x4009abfd40?}, 0x206e4c0?)
/usr/local/go/src/runtime/pprof/pprof.go:377 +0x14c fp=0x4007d22b90 sp=0x4007d22a80 pc=0x8f5f5c
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).lookupProfile(0x40000cac08?, {0x23cae63?, 0x780a278d9052?}, {0x27ac520, 0x4009abfd40}, 0x2)
/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:136 +0x58 fp=0x4007d22bd0 sp=0x4007d22b90 pc=0x1681538
gopkg.in/DataDog/dd-trace-go.v1/profiler.init.func2(0x4008cfc0a0)
/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profile.go:168 +0xf4 fp=0x4007d22c60 sp=0x4007d22bd0 pc=0x167c414
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0x4008cfc0a0, 0x5)
/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profile.go:348 +0x17c fp=0x4007d22e50 sp=0x4007d22c60 pc=0x167fabc
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.func2(0x5)
/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:355 +0xb8 fp=0x4007d22fb0 sp=0x4007d22e50 pc=0x1682da8
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.gowrap2()
/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:367 +0x30 fp=0x4007d22fd0 sp=0x4007d22fb0 pc=0x1682cb0
runtime.goexit({})
[ ... elided ... ]
r0 0xfc510200e438
r1 0x0
r2 0x1
r3 0x1
r4 0x400a702540
r5 0x0
r6 0x1
r7 0x0
r8 0x3627cb0
r9 0x1
r10 0x279e7e8
r11 0x6372732f6f672f6c
r12 0x656d69746e75722f
r13 0x6f672e636f72702f
r14 0x30372e3176406370
r15 0x7265746e692f302e
r16 0xfc510180ef10
r17 0xfc510200e0c0
r18 0x0
r19 0x0
r20 0xfc510200e0b4
r21 0xfc510200e498
r22 0x1
r23 0x400ba15108
r24 0x202a2e0
r25 0x0
r26 0xffffffffffffffff
r27 0x4058000
r28 0x40000fe380
r29 0xfc510200e038
lr 0x468c68
sp 0xfc510200e040
pc 0x468da4
fault 0x118
The crash happens on this line, during a call to runtime.Stack triggered by calling pprof.Lookup("goroutine").WriteTo(w, 2). Unfortunately there are no goroutine addresses in this output (not sure why), so it's hard to tell which goroutine's stack was being unwound in this case.
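For reference, the goroutine dump above comes from a debug=2 goroutine profile; a minimal sketch of that trigger follows (the writer is illustrative, but the WriteTo call is the one named above):
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 emits unformatted stacks for all goroutines; under the hood
	// this calls runtime.Stack(buf, true), which walks each goroutine's
	// stack with runtime.(*unwinder).next (the frame that faults above).
	if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 2); err != nil {
		panic(err)
	}
}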
The other occurrence is in a different program, built with Go 1.23.6. It's segfaulting on the same line in runtime.(*unwinder).next, during garbage collection:
SIGSEGV: segmentation violation
PC=0x488148 m=21 sigcode=1 addr=0x118
goroutine 0 gp=0x4003a88380 m=21 mp=0x40085fa008 [idle]:
runtime.(*unwinder).next(0xe5b01f40e280)
/usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0xe5b01f40e230 sp=0xe5b01f40e1a0 pc=0x488148
runtime.scanstack(0x4002a8ea80, 0x400007b250)
/usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0xe5b01f40e370 sp=0xe5b01f40e230 pc=0x4460a0
runtime.markroot.func1()
/usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0xe5b01f40e3c0 sp=0xe5b01f40e370 pc=0x444b78
runtime.markroot(0x400007b250, 0x234, 0x1)
/usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0xe5b01f40e470 sp=0xe5b01f40e3c0 pc=0x444848
runtime.gcDrain(0x400007b250, 0xb)
/usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0xe5b01f40e4e0 sp=0xe5b01f40e470 pc=0x446b14
runtime.gcDrainMarkWorkerFractional(...)
/usr/local/go/src/runtime/mgcmark.go:1118
runtime.gcBgMarkWorker.func2()
/usr/local/go/src/runtime/mgc.go:1506 +0x7c fp=0xe5b01f40e530 sp=0xe5b01f40e4e0 pc=0x442a1c
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0xe5b01f40e540 sp=0xe5b01f40e530 pc=0x4a3a3c
goroutine 9 gp=0x4000254a80 m=21 mp=0x40085fa008 [GC worker (active)]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_arm64.s:201 +0x8 fp=0x4000cc4f10 sp=0x4000cc4f00 pc=0x4a39b8
runtime.gcBgMarkWorker(0x4000066690)
/usr/local/go/src/runtime/mgc.go:1472 +0x200 fp=0x4000cc4fb0 sp=0x4000cc4f10 pc=0x4426d0
runtime.gcBgMarkStartWorkers.gowrap1()
/usr/local/go/src/runtime/mgc.go:1328 +0x28 fp=0x4000cc4fd0 sp=0x4000cc4fb0 pc=0x442498
runtime.goexit({})
/usr/local/go/src/runtime/asm_arm64.s:1223 +0x4 fp=0x4000cc4fd0 sp=0x4000cc4fd0 pc=0x4a5ee4
[ ... ]
goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
/usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
/usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
/go/pkg/mod/github.com/!data!dog/netlink@v1.0.1-0.20240223195320-c7a4f832a3d1/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0
The last goroutine appears to be the goroutine that was being scanned. The crash output ends there.
Comment From: adonovan
See also https://github.com/golang/go/issues/73043#issuecomment-2775964745, whose attached hypothesis is Windows-only.
Comment From: dmitshur
CC @golang/runtime.
Comment From: prattmic
Looks like gp.m is nil, as 0x118 is the offset of incgo in the M. Though it would be good to disassemble a binary and double check.
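As a quick sketch of why that offset implicates a nil gp.m: loading a field through a nil pointer faults at exactly the field's offset. The struct below is a hypothetical stand-in for the runtime's m (the padding merely reproduces the 0x118 offset), not the real definition:
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical stand-in for runtime.m; only the field offset matters here.
type m struct {
	_     [0x118]byte // padding standing in for the fields before incgo
	incgo bool
}

func main() {
	fmt.Printf("%#x\n", unsafe.Offsetof(m{}.incgo)) // prints 0x118 (280)
	var mp *m
	fmt.Println(mp.incgo) // faults at addr=0x118, matching the reports above
}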
Comment From: prattmic
gp.m could become nil if we didn't suspend the G before doing traceback. But that seems unlikely to me, as I would expect that to cause much more spectacular failures (especially from the GC).
I'd say it's more likely that the stack somehow contains a function that isn't valid (triggering the !flr.valid() path) even though the G is parked (and thus has no M). I think the idea in this code path is that only a running G could possibly contain an invalid function, so it is safe to dereference gp.m.
I assume that this happens rarely and you can't reliably reproduce yet?
Comment From: gabyhelp
Related Issues
- runtime/trace: crash during traceAdvance when collecting call stack for cgo-calling goroutine #69085 (closed)
- runtime: segmentation violation runtime.(*unwinder).next #66151 (closed)
- runtime: frame pointer unwinding can fail on system goroutines #63630
- runtime: crashes with "runtime: traceback stuck" #62086
- runtime: crash in race detector when execution tracer reads from CPU profile buffer #65607 (closed)
- runtime/traceback: segmentation violation failures from unwinding crash #64030
- runtime/trace: segfault in runtime.fpTracebackPCs during deferred call after recovering from panic #61766 (closed)
- runtime: fatal error: unexpected signal during runtime execution #60783 (closed)
- runtime: SIGSEGV crash #22324 (closed)
- runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932 (closed)
Comment From: nsrip-dd
I assume that this happens rarely and you can't reliably reproduce yet?
Yeah, it doesn't seem to happen frequently. I've seen low double-digits of occurrences over a week across all of our production services. I just started investigating this morning and haven't yet reproduced the crash.
Comment From: nsrip-dd
Looks like gp.m is nil, as 0x118 is the offset of incgo in the M. Though it would be good to disassemble a binary and double check.
Here's the disassembly of runtime.(*unwinder).next from the first example, around where the fault is happening:
traceback.go:457 0x468d94 9a9f17e3 CSET EQ, R3
traceback.go:458 0x468d98 370801a2 TBNZ $1, R2, 13(PC)
traceback.go:458 0x468d9c f94043e4 MOVD 128(RSP), R4
traceback.go:458 0x468da0 f9401885 MOVD 48(R4), R5
traceback.go:458 0x468da4 394460a5 MOVBU 280(R5), R5 // <<<<< HERE
traceback.go:458 0x468da8 360000e5 TBZ $0, R5, 7(PC)
traceback.go:458 0x468dac f94037e5 MOVD 104(RSP), R5
traceback.go:458 0x468db0 3940a0a6 MOVBU 40(R5), R6
traceback.go:458 0x468db4 71004cdf CMPW $19, R6
Full output, in case it helps:
TEXT runtime.(*unwinder).next(SB) /usr/local/go/src/runtime/traceback.go
traceback.go:440 0x468c20 f9400b90 MOVD 16(R28), R16
traceback.go:440 0x468c24 d10043f1 SUB $16, RSP, R17
traceback.go:440 0x468c28 eb10023f CMP R16, R17
traceback.go:440 0x468c2c 54001949 BLS 202(PC)
traceback.go:440 0x468c30 f8170ffe MOVD.W R30, -144(RSP)
traceback.go:440 0x468c34 f81f83fd MOVD R29, -8(RSP)
traceback.go:440 0x468c38 d10023fd SUB $8, RSP, R29
traceback.go:441 0x468c3c f9400401 MOVD 8(R0), R1
traceback.go:446 0x468c40 f9401002 MOVD 32(R0), R2
traceback.go:446 0x468c44 b40013a2 CBZ R2, 157(PC)
traceback.go:446 0x468c48 f9004fe0 MOVD R0, 152(RSP)
traceback.go:441 0x468c4c f9003be1 MOVD R1, 112(RSP)
traceback.go:443 0x468c50 f9402401 MOVD 72(R0), R1
traceback.go:442 0x468c54 f9400003 MOVD (R0), R3
traceback.go:442 0x468c58 f90037e3 MOVD R3, 104(RSP)
runtime2.go:240 0x468c5c f90043e1 MOVD R1, 128(RSP)
traceback.go:450 0x468c60 aa0203e0 MOVD R2, R0
traceback.go:450 0x468c64 940057bf CALL runtime.findfunc(SB)
traceback.go:451 0x468c68 b4000900 CBZ R0, 72(PC)
traceback.go:477 0x468c6c f9404fe3 MOVD 152(RSP), R3
traceback.go:477 0x468c70 f9400864 MOVD 16(R3), R4
traceback.go:477 0x468c74 f9401065 MOVD 32(R3), R5
traceback.go:477 0x468c78 eb0400bf CMP R4, R5
traceback.go:477 0x468c7c 540000a1 BNE 5(PC)
traceback.go:477 0x468c80 f9401465 MOVD 40(R3), R5
traceback.go:477 0x468c84 f9401866 MOVD 48(R3), R6
traceback.go:477 0x468c88 eb0500df CMP R5, R6
traceback.go:477 0x468c8c 54001260 BEQ 147(PC)
traceback.go:484 0x468c90 f94037e4 MOVD 104(RSP), R4
traceback.go:484 0x468c94 3940a085 MOVBU 40(R4), R5
traceback.go:484 0x468c98 71004cbf CMPW $19, R5
traceback.go:484 0x468c9c 54000060 BEQ 3(PC)
traceback.go:484 0x468ca0 71000cbf CMPW $3, R5
traceback.go:484 0x468ca4 54000061 BNE 3(PC)
traceback.go:484 0x468ca8 b24003e5 ORR $1, ZR, R5
traceback.go:484 0x468cac 14000003 JMP 3(PC)
traceback.go:484 0x468cb0 710018bf CMPW $6, R5
traceback.go:484 0x468cb4 9a9f17e5 CSET EQ, R5
traceback.go:485 0x468cb8 360000a5 TBZ $0, R5, 5(PC)
traceback.go:486 0x468cbc 39416466 MOVBU 89(R3), R6
traceback.go:486 0x468cc0 b27e00c6 ORR $4, R6, R6
traceback.go:486 0x468cc4 39016466 MOVB R6, 89(R3)
traceback.go:486 0x468cc8 14000004 JMP 4(PC)
traceback.go:488 0x468ccc 39416466 MOVBU 89(R3), R6
traceback.go:488 0x468cd0 927df8c6 AND $-5, R6, R6
traceback.go:488 0x468cd4 39016466 MOVB R6, 89(R3)
traceback.go:492 0x468cd8 3940a084 MOVBU 40(R4), R4
traceback.go:492 0x468cdc 39016064 MOVB R4, 88(R3)
traceback.go:493 0x468ce0 f9000060 MOVD R0, (R3)
traceback.go:493 0x468ce4 f9000461 MOVD R1, 8(R3)
traceback.go:494 0x468ce8 f9401064 MOVD 32(R3), R4
traceback.go:494 0x468cec f9000864 MOVD R4, 16(R3)
traceback.go:495 0x468cf0 f900107f MOVD ZR, 32(R3)
traceback.go:496 0x468cf4 f9401864 MOVD 48(R3), R4
traceback.go:496 0x468cf8 f9001464 MOVD R4, 40(R3)
traceback.go:497 0x468cfc f900187f MOVD ZR, 48(R3)
traceback.go:501 0x468d00 36000365 TBZ $0, R5, 27(PC)
traceback.go:502 0x468d04 aa0403e1 MOVD R4, R1
traceback.go:503 0x468d08 91004021 ADD $16, R1, R1
traceback.go:502 0x468d0c f9400082 MOVD (R4), R2
traceback.go:502 0x468d10 f90027e2 MOVD R2, 72(RSP)
traceback.go:503 0x468d14 f9001461 MOVD R1, 40(R3)
traceback.go:504 0x468d18 f9400860 MOVD 16(R3), R0
traceback.go:504 0x468d1c 94005791 CALL runtime.findfunc(SB)
traceback.go:505 0x468d20 f9404fe2 MOVD 152(RSP), R2
traceback.go:505 0x468d24 f9000040 MOVD R0, (R2)
traceback.go:505 0x468d28 f9000441 MOVD R1, 8(R2)
traceback.go:506 0x468d2c b5000080 CBNZ R0, 4(PC)
traceback.go:507 0x468d30 f94027e3 MOVD 72(RSP), R3
traceback.go:507 0x468d34 f9000843 MOVD R3, 16(R2)
traceback.go:507 0x468d38 1400000c JMP 12(PC)
traceback.go:508 0x468d3c f9400843 MOVD 16(R2), R3
symtab.go:1174 0x468d40 b9401002 MOVWU 16(R0), R2
symtab.go:1174 0x468d44 b24003e4 ORR $1, ZR, R4
symtab.go:1174 0x468d48 97ffe856 CALL runtime.pcvalue(SB)
traceback.go:508 0x468d4c 350000a0 CBNZW R0, 5(PC)
traceback.go:509 0x468d50 f94027e3 MOVD 72(RSP), R3
traceback.go:509 0x468d54 f9404fe0 MOVD 152(RSP), R0
traceback.go:509 0x468d58 f9001003 MOVD R3, 32(R0)
traceback.go:509 0x468d5c 14000002 JMP 2(PC)
traceback.go:513 0x468d60 f9404fe0 MOVD 152(RSP), R0
traceback.go:513 0x468d64 aa0003e2 MOVD R0, R2
traceback.go:513 0x468d68 aa0203e3 MOVD R2, R3
traceback.go:513 0x468d6c aa0303e0 MOVD R3, R0
traceback.go:513 0x468d70 aa1f03e1 MOVD ZR, R1
traceback.go:513 0x468d74 aa0103e2 MOVD R1, R2
traceback.go:513 0x468d78 97fffed2 CALL runtime.(*unwinder).resolveInternal(SB)
traceback.go:514 0x468d7c f85f83fd MOVD -8(RSP), R29
traceback.go:514 0x468d80 f84907fe MOVD.P 144(RSP), R30
traceback.go:514 0x468d84 d65f03c0 RET
traceback.go:456 0x468d88 f9404fe0 MOVD 152(RSP), R0
traceback.go:456 0x468d8c 39416402 MOVBU 89(R0), R2
traceback.go:457 0x468d90 721f005f TSTW $2, R2
traceback.go:457 0x468d94 9a9f17e3 CSET EQ, R3
traceback.go:458 0x468d98 370801a2 TBNZ $1, R2, 13(PC)
traceback.go:458 0x468d9c f94043e4 MOVD 128(RSP), R4
traceback.go:458 0x468da0 f9401885 MOVD 48(R4), R5
traceback.go:458 0x468da4 394460a5 MOVBU 280(R5), R5
traceback.go:458 0x468da8 360000e5 TBZ $0, R5, 7(PC)
traceback.go:458 0x468dac f94037e5 MOVD 104(RSP), R5
traceback.go:458 0x468db0 3940a0a6 MOVBU 40(R5), R6
traceback.go:458 0x468db4 71004cdf CMPW $19, R6
traceback.go:458 0x468db8 540000e1 BNE 7(PC)
traceback.go:458 0x468dbc aa1f03e3 MOVD ZR, R3
traceback.go:458 0x468dc0 14000005 JMP 5(PC)
symtab.go:1110 0x468dc4 f94037e5 MOVD 104(RSP), R5
traceback.go:458 0x468dc8 14000003 JMP 3(PC)
traceback.go:466 0x468dcc f94043e4 MOVD 128(RSP), R4
symtab.go:1110 0x468dd0 f94037e5 MOVD 104(RSP), R5
traceback.go:456 0x468dd4 7200045f TSTW $3, R2
traceback.go:465 0x468dd8 54000040 BEQ 2(PC)
traceback.go:465 0x468ddc 36000623 TBZ $0, R3, 49(PC)
traceback.go:456 0x468de0 3900ffe2 MOVB R2, 63(RSP)
symtab.go:1110 0x468de4 b5000085 CBNZ R5, 4(PC)
symtab.go:1110 0x468de8 aa1f03e1 MOVD ZR, R1
symtab.go:1110 0x468dec aa1f03e3 MOVD ZR, R3
traceback.go:466 0x468df0 14000007 JMP 7(PC)
symtab.go:1113 0x468df4 b98004a1 MOVW 4(R5), R1
symtab.go:1113 0x468df8 f9403be0 MOVD 112(RSP), R0
symtab.go:1113 0x468dfc 97ffe761 CALL runtime.(*moduledata).funcName(SB)
traceback.go:466 0x468e00 f94043e4 MOVD 128(RSP), R4
traceback.go:466 0x468e04 aa0003e3 MOVD R0, R3
traceback.go:466 0x468e08 f9404fe0 MOVD 152(RSP), R0
traceback.go:466 0x468e0c f90023e1 MOVD R1, 64(RSP)
traceback.go:466 0x468e10 f9003fe3 MOVD R3, 120(RSP)
traceback.go:466 0x468e14 f9405081 MOVD 160(R4), R1
traceback.go:466 0x468e18 f90033e1 MOVD R1, 96(RSP)
traceback.go:466 0x468e1c f9401000 MOVD 32(R0), R0
traceback.go:466 0x468e20 f9002fe0 MOVD R0, 88(RSP)
traceback.go:466 0x468e24 97ff6c43 CALL runtime.printlock(SB)
traceback.go:466 0x468e28 f000fb20 ADRP 32927744(PC), R0
traceback.go:466 0x468e2c 9114cc00 ADD $1331, R0, R0
traceback.go:466 0x468e30 d2800161 MOVD $11, R1
traceback.go:466 0x468e34 97ff6e63 CALL runtime.printstring(SB)
traceback.go:466 0x468e38 f94033e0 MOVD 96(RSP), R0
traceback.go:466 0x468e3c 97ff6da9 CALL runtime.printuint(SB)
traceback.go:466 0x468e40 b000fc80 ADRP 33099776(PC), R0
traceback.go:466 0x468e44 9120e000 ADD $2104, R0, R0
traceback.go:466 0x468e48 d2800361 MOVD $27, R1
traceback.go:466 0x468e4c 97ff6e5d CALL runtime.printstring(SB)
traceback.go:466 0x468e50 f9403fe0 MOVD 120(RSP), R0
traceback.go:466 0x468e54 f94023e1 MOVD 64(RSP), R1
traceback.go:466 0x468e58 97ff6e5a CALL runtime.printstring(SB)
traceback.go:466 0x468e5c f000fb40 ADRP 32944128(PC), R0
traceback.go:466 0x468e60 912c1c00 ADD $2823, R0, R0
traceback.go:466 0x468e64 d28001a1 MOVD $13, R1
traceback.go:466 0x468e68 97ff6e56 CALL runtime.printstring(SB)
traceback.go:466 0x468e6c f9402fe0 MOVD 88(RSP), R0
traceback.go:466 0x468e70 97ff6dec CALL runtime.printhex(SB)
traceback.go:466 0x468e74 97ff6cbb CALL runtime.printnl(SB)
traceback.go:466 0x468e78 97ff6c4e CALL runtime.printunlock(SB)
traceback.go:467 0x468e7c f94043e0 MOVD 128(RSP), R0
traceback.go:467 0x468e80 f9400401 MOVD 8(R0), R1
traceback.go:467 0x468e84 f9400000 MOVD (R0), R0
traceback.go:467 0x468e88 f9404fe2 MOVD 152(RSP), R2
traceback.go:467 0x468e8c aa1f03e3 MOVD ZR, R3
traceback.go:467 0x468e90 94000a54 CALL runtime.tracebackHexdump(SB)
traceback.go:456 0x468e94 3940ffe0 MOVBU 63(RSP), R0
traceback.go:456 0x468e98 7200041f TSTW $3, R0
traceback.go:472 0x468e9c f9404fe0 MOVD 152(RSP), R0
traceback.go:469 0x468ea0 54000140 BEQ 10(PC)
traceback.go:472 0x468ea4 f900101f MOVD ZR, 32(R0)
traceback.go:473 0x468ea8 94000032 CALL runtime.(*unwinder).finishInternal(SB)
traceback.go:474 0x468eac f85f83fd MOVD -8(RSP), R29
traceback.go:474 0x468eb0 f84907fe MOVD.P 144(RSP), R30
traceback.go:474 0x468eb4 d65f03c0 RET
traceback.go:447 0x468eb8 9400002e CALL runtime.(*unwinder).finishInternal(SB)
traceback.go:448 0x468ebc f85f83fd MOVD -8(RSP), R29
traceback.go:448 0x468ec0 f84907fe MOVD.P 144(RSP), R30
traceback.go:448 0x468ec4 d65f03c0 RET
traceback.go:470 0x468ec8 b000fba0 ADRP 32985088(PC), R0
traceback.go:470 0x468ecc 91213c00 ADD $2127, R0, R0
traceback.go:470 0x468ed0 d2800221 MOVD $17, R1
traceback.go:470 0x468ed4 940047c7 CALL runtime.throw(SB)
traceback.go:477 0x468ed8 f9002fe4 MOVD R4, 88(RSP)
traceback.go:477 0x468edc f9002be5 MOVD R5, 80(RSP)
traceback.go:479 0x468ee0 97ff6c14 CALL runtime.printlock(SB)
traceback.go:479 0x468ee4 f000fca0 ADRP 33124352(PC), R0
traceback.go:479 0x468ee8 9134a000 ADD $3368, R0, R0
traceback.go:479 0x468eec d28003a1 MOVD $29, R1
traceback.go:479 0x468ef0 97ff6e34 CALL runtime.printstring(SB)
traceback.go:479 0x468ef4 f9402fe0 MOVD 88(RSP), R0
traceback.go:479 0x468ef8 97ff6dca CALL runtime.printhex(SB)
traceback.go:479 0x468efc f000fa80 ADRP 32845824(PC), R0
traceback.go:479 0x468f00 911c8c00 ADD $1827, R0, R0
traceback.go:479 0x468f04 b27e03e1 ORR $4, ZR, R1
traceback.go:479 0x468f08 97ff6e2e CALL runtime.printstring(SB)
traceback.go:479 0x468f0c f9402be0 MOVD 80(RSP), R0
traceback.go:479 0x468f10 97ff6dc4 CALL runtime.printhex(SB)
traceback.go:479 0x468f14 97ff6c93 CALL runtime.printnl(SB)
traceback.go:479 0x468f18 97ff6c26 CALL runtime.printunlock(SB)
traceback.go:480 0x468f1c f94043e0 MOVD 128(RSP), R0
traceback.go:480 0x468f20 f9400001 MOVD (R0), R1
traceback.go:480 0x468f24 f9400400 MOVD 8(R0), R0
traceback.go:480 0x468f28 f9404fe2 MOVD 152(RSP), R2
traceback.go:480 0x468f2c f9401443 MOVD 40(R2), R3
traceback.go:480 0x468f30 aa0003e4 MOVD R0, R4
traceback.go:480 0x468f34 aa0103e0 MOVD R1, R0
traceback.go:480 0x468f38 aa0403e1 MOVD R4, R1
traceback.go:480 0x468f3c 94000a29 CALL runtime.tracebackHexdump(SB)
traceback.go:481 0x468f40 9000fb80 ADRP 32964608(PC), R0
traceback.go:481 0x468f44 911ba400 ADD $1769, R0, R0
traceback.go:481 0x468f48 b2400fe1 ORR $15, ZR, R1
traceback.go:481 0x468f4c 940047a9 CALL runtime.throw(SB)
traceback.go:481 0x468f50 d503201f NOOP
traceback.go:440 0x468f54 f90007e0 MOVD R0, 8(RSP)
traceback.go:440 0x468f58 aa1e03e3 MOVD R30, R3
traceback.go:440 0x468f5c 940060a1 CALL runtime.morestack_noctxt.abi0(SB)
traceback.go:440 0x468f60 f94007e0 MOVD 8(RSP), R0
traceback.go:440 0x468f64 17ffff2f JMP runtime.(*unwinder).next(SB)
traceback.go:440 0x468f68 00000000 ?
traceback.go:440 0x468f6c 00000000 ?
Comment From: prattmic
Thanks, that definitely looks like the dereference for m.incgo.
Comment From: prattmic
@adonovan Your crash has a different fault address (0xe8), but I checked and that is the offset of incgo on windows-amd64, so I think that is the same bug.
Comment From: mknyszek
I marked #73413 as a dupe, but it had a potential reproducer: https://github.com/golang/go/issues/73413#issuecomment-2812496241.
Comment From: sirzooro
Hi, have you made any progress with this bug? Can we expect the fix to be part of the 1.25 release (there is no milestone assigned to this bug)? We can stay on 1.23 for a while, as it is more stable than 1.24, but with the release of version 1.25 we would have to upgrade.
BTW, yesterday I saw another crash from 1.23. This call stack is a bit different from the previous ones:
Thread 3.13 "fuzztests" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff917fa640 (LWP 4309)]
runtime.raise () at runtime/sys_linux_amd64.s:154
154 runtime/sys_linux_amd64.s: No such file or directory.
#0 runtime.raise () at runtime/sys_linux_amd64.s:154
No locals.
#1 0x0000555555655da5 in runtime.dieFromSignal (sig=6)
at runtime/signal_unix.go:967
No locals.
#2 0x000055555563e6fc in runtime.crash () at runtime/signal_unix.go:1056
No locals.
#3 runtime.fatalthrow.func1 () at runtime/panic.go:1287
gp = 0xc00015afc0
pc = 93824993397898
sp = 140735634446816
#4 0x000055555563e658 in runtime.fatalthrow (t=<optimized out>)
at runtime/panic.go:1276
pc = <optimized out>
sp = 6
gp = 0xc00015afc0
#5 0x0000555555670c8a in runtime.throw (s=...) at runtime/panic.go:1101
No locals.
#6 0x000055555566082a in runtime.(*unwinder).finishInternal (
u=<optimized out>) at runtime/traceback.go:566
gp = 0xc00015a000
#7 0x0000555555660632 in runtime.(*unwinder).next (u=0x0)
at runtime/traceback.go:447
gp = <optimized out>
~r0.ptr = <optimized out>
~r0.len = <optimized out>
#8 0x0000555555626fa9 in runtime.scanstack (gp=0xc00015a000,
gcw=0xc00003b750, ~r0=<optimized out>) at runtime/mgcmark.go:904
sp = <optimized out>
scannedSize = <optimized out>
p = <optimized out>
state = <optimized out>
u = <optimized out>
#9 0x00005555556259f1 in runtime.markroot.func1 () at runtime/mgcmark.go:240
gp = 0xc00015a000
&workDone = 0x7fff917f9c68
gcw = 0xc00003b750
userG = 0xc000092540
selfScan = false
#10 0x0000555555625699 in runtime.markroot (gcw=0xc00003b750, i=27,
flushBgCredit=true, ~r0=<optimized out>) at runtime/mgcmark.go:214
status = <optimized out>
gp = 0x6
workCounter = <optimized out>
~r0.ptr = <optimized out>
~r0.ptr = <optimized out>
workDone = <optimized out>
~r0.len = <optimized out>
~r0.len = <optimized out>
#11 0x0000555555627a34 in runtime.gcDrain (gcw=0xc00003b750, flags=3)
at runtime/mgcmark.go:1186
gp = 0xc000092540
pp = 0xc00003a508
flushBgCredit = true
initScanWork = 0
checkWork = 9223372036854775807
check = {void (bool)} 0x7fff917f9d20
#12 0x0000555555623d1a in runtime.gcDrainMarkWorkerDedicated (gcw=0x0,
untilPreempt=<optimized out>) at runtime/mgcmark.go:1110
flags = <optimized out>
#13 runtime.gcBgMarkWorker.func2 () at runtime/mgc.go:1500
gp = 0xc000092540
pp = 0xc00003a508
#14 0x0000555555675b27 in runtime.systemstack () at runtime/asm_amd64.s:514
No locals.
#15 0x01ffffffffffff28 in ?? ()
No symbol table info available.
#16 0x0000000000800000 in ?? ()
No symbol table info available.
#17 0x000000c00015afc0 in ?? ()
No symbol table info available.
#18 0x0000555555675a20 in ?? ()
No locals.
#19 0x0000555555675a25 in runtime.mstart () at runtime/asm_amd64.s:395
No locals.
#20 0x000055555576a1a8 in crosscall1 () at gcc_amd64.S:42
No locals.
#21 0x00007fffaaffc910 in ?? ()
No symbol table info available.
#22 0x00007ffff7a677d0 in ?? () at ./nptl/pthread_create.c:321
from /lib/x86_64-linux-gnu/libc.so.6
#23 0x0000000000000000 in ?? ()
Comment From: prattmic
That latest crash is an explicit throw: https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=566;drc=3fd729b2a14a7efcf08465cbea60a74da5457f06?q=traceback.go:447&ss=go
Those print lines should appear in your stderr; do you have them?
Comment From: prattmic
For the original issue, if you can reproduce with https://go.dev/cl/676635 in your toolchain, you should get more context in the crash output.
You can use https://pkg.go.dev/golang.org/dl/gotip to easily build a toolchain at that CL. This will basically make a Go 1.24.3 toolchain plus my CL.
$ gotip download 676635
$ gotip build your.program/exe
Comment From: gopherbot
Change https://go.dev/cl/676635 mentions this issue: [release-branch.go1.24] DO NOT SUBMIT: runtime: add traceback nil m context
Comment From: sirzooro
That latest crash is an explicit throw: https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=566;drc=3fd729b2a14a7efcf08465cbea60a74da5457f06?q=traceback.go:447&ss=go
Those print lines should appear in your stderr; do you have them?
Unfortunately, the output from the test binary was not captured. I am looking into how to capture it in the future.
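A minimal sketch of one way to capture it, assuming the tests can be run under a small Go wrapper (the binary name and log path here are illustrative):
package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	// The Go runtime writes crash output (throws, SIGSEGV tracebacks) to
	// stderr, so tee stderr to a file while still echoing it live.
	f, err := os.Create("fuzztests.stderr.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	cmd := exec.Command("./fuzztests")
	cmd.Stdout = os.Stdout
	cmd.Stderr = io.MultiWriter(os.Stderr, f)
	if err := cmd.Run(); err != nil {
		log.Printf("test binary exited: %v", err)
	}
}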
For the original issue, if you can reproduce with https://go.dev/cl/676635 in your toolchain, you should get more context in the crash output.
You can use https://pkg.go.dev/golang.org/dl/gotip to easily build a toolchain at that CL. This will basically make a Go 1.24.3 toolchain plus my CL.
$ gotip download 676635
$ gotip build your.program/exe
Thanks, I will try it. Is there a way to run gotip download in non-interactive mode? Right now it asks for confirmation before downloading.
Comment From: prattmic
Is there a way to run gotip download in non-interactive mode?
Hm, I don't think so, though I believe yes | gotip download 676635 will work.
Otherwise, you can do the steps manually, it's not too difficult:
$ git clone https://go.googlesource.com/go
$ cd go/src
$ git fetch https://go.googlesource.com/go refs/changes/35/676635/1 && git checkout -b change-676635 FETCH_HEAD
$ ./make.bash
# Now use ../bin/go as your go binary.
Note you will want to get output figured out. My CL will print more useful output, but the crash stack trace won't be very different.
Comment From: tsheinen
I have a limited reproducer for what seems to be this same bug.
package main

import (
	"runtime"
)

const RECEIVE_BUFFER_SIZE = 65536

//go:noinline
func big_stack(val int) int {
	var big_buffer = make([]byte, RECEIVE_BUFFER_SIZE)
	sum := 0
	// this was added by vibes in the middle of the night to confound the optimizer
	for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}

//go:noinline
func calls_big_stack() {
	for {
		_ = big_stack(1000)
	}
}

func main() {
	go func() {
		for {
			runtime.GC()
		}
	}()
	calls_big_stack()
}
It reliably crashes in a few minutes on my test machine (a linux aarch64 128-core server):
SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118
goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
/home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
/home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 fp=0x400030fe00 sp=0x400030fcc0 pc=0x274f0
runtime.markroot.func1()
/home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:238 +0xa8 fp=0x400030fe50 sp=0x400030fe00 pc=0x25fc8
runtime.markroot(0x400002f750, 0x14, 0x1)
/home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:212 +0x1c8 fp=0x400030ff00 sp=0x400030fe50 pc=0x25c98
runtime.gcDrain(0x400002f750, 0x3)
/home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:1188 +0x434 fp=0x400030ff70 sp=0x400030ff00 pc=0x27f64
runtime.gcDrainMarkWorkerDedicated(...)
/home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
/home/thea/sdk/go1.23.4/src/runtime/mgc.go:1489 +0x94 fp=0x400030ffc0 sp=0x400030ff70 pc=0x23f04
runtime.systemstack(0x400030c000)
/home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:244 +0x6c fp=0x400030ffd0 sp=0x400030ffc0 pc=0x72c8c
goroutine 117 gp=0x40002f3180 m=8 mp=0x4000198708 [GC worker (active)]:
runtime.systemstack_switch()
/home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:201 +0x8 fp=0x4000681f10 sp=0x4000681f00 pc=0x72c08
runtime.gcBgMarkWorker(0x40002a0000)
/home/thea/sdk/go1.23.4/src/runtime/mgc.go:1472 +0x200 fp=0x4000681fb0 sp=0x4000681f10 pc=0x23ba0
runtime.gcBgMarkStartWorkers.gowrap1()
/home/thea/sdk/go1.23.4/src/runtime/mgc.go:1328 +0x28 fp=0x4000681fd0 sp=0x4000681fb0 pc=0x23968
runtime.goexit({})
/home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:1223 +0x4 fp=0x4000681fd0 sp=0x4000681fd0 pc=0x74f74
created by runtime.gcBgMarkStartWorkers in goroutine 5
/home/thea/sdk/go1.23.4/src/runtime/mgc.go:1328 +0x140
goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
/home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
/home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
/home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)
real 1m29.165s
user 4m4.987s
sys 0m43.212s
I've tested on 1.23.4 and see crashes consistently, but not on 1.23.9. I am seeing production crashes on 1.23.9, so my assumption is that this is just a tuning issue with the reproducer, not a matter of the bug being fixed. I have only observed this crash on linux/aarch64 machines, never amd64.
As far as I can tell the bug occurs when async preemption happens in the middle of the function epilogue of a function with a big stack.
main.go:29 0x77bf8 910023fd ADD $8, RSP, R29
main.go:29 0x77bfc 914043bd ADD $(16<<12), R29, R29
main.go:29 0x77c00 910043ff ADD $16, RSP, RSP
main.go:29 0x77c04 914043ff ADD $(16<<12), RSP, RSP
main.go:29 0x77c08 d65f03c0 RET
The goroutine is asynchronously preempted between ADD $16, RSP, RSP and ADD $(16<<12), RSP, RSP in both the reproducer and the production crashes I've seen. I still see crashes with gcshrinkstackoff=1, so my working theory is that this has nothing to do with stack shrinking and instead occurs when the compiler generates code that splits the stack pointer adjustment into two instructions. It makes sense to me that async preemption would be unsound in the middle of the function epilogue, but I don't have a good enough understanding of the runtime to do more than speculate.
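One way to check a suspect binary for this pattern (assuming a standard Go toolchain; the symbol regexp matches the reproducer above) is to disassemble the function and look for two consecutive ADD ..., RSP, RSP instructions just before RET:
$ go tool objdump -s 'main\.big_stack' ./reproducer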
Comment From: tsheinen
Hi! I've dug into this some more and have a workaround and a possible runtime fix. The cause is that if the goroutine is asynchronously preempted between two add x, rsp instructions, it leaves the stack frame in a weird state. This isn't memory corruption exactly, and it will fix itself once the goroutine is scheduled again and the stack pointer is adjusted the rest of the way. However, if the stack is unwound while in this state (garbage collection, panic/recover, etc.), the unwinder will dereference the partially adjusted stack pointer and get bad data. We worked around this bug in production by making the problematic stack buffer spill to the heap, so the stack frame wasn't big enough to require two instructions; a sketch of that idea is below.
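A minimal sketch of that workaround, reusing the names from the reproducer above (the package-level sink is one illustrative way to force the buffer onto the heap; it is not the exact production change):
const RECEIVE_BUFFER_SIZE = 65536

var sink []byte // assigning the buffer here makes it escape to the heap

//go:noinline
func big_stack(val int) int {
	big_buffer := make([]byte, RECEIVE_BUFFER_SIZE)
	sink = big_buffer // buffer escapes: the frame stays small enough that
	// the epilogue can restore SP with a single ADD instruction
	sum := 0
	for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}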
The reproducer I posted earlier is a bit finicky, and its reliability depends on the Go compiler version, but the bug is still present on the most recent version. It can be reliably triggered by using a debugger, breaking between the two instructions, and sending SIGURG to trigger async preemption.
I fixed it in a test fork by emitting preempt unsafe points around the function epilogue when the constants require more than one add instruction. It works in my testing, but before I put up a PR for review I'd like to think further about alternatives and confirm whether this same bug occurs on other fixed-instruction-length architectures.
Comment From: prattmic
@tsheinen do you have a concrete example you can share of one of these crashes, including the crashing PC/assembly? Function prologues should be marked by the compiler as "unsafe points", which means we won't perform async preemption if a signal lands there. So what you describe sounds like a bug in where we mark unsafe points.
Comment From: tsheinen
@prattmic I should note that I'm seeing this in the epilogue, not the prologue. It's my understanding from skimming internal/obj/arm64/obj7.go that unsafe points are sprinkled around the prologue, but I didn't see any for the epilogue. One of my earlier comments included a stack trace and the instruction that was preempted when it crashed, but I can grab a core dump if you'd like more detail.
Comment From: prattmic
@tsheinen Apologies, I should have looked right above. :) The epilogue (for arm64) does seem problematic to me when the add is split. @cherrymui or @randall77 may be more familiar with this code. Do you agree?
Your https://github.com/tsheinen/go/commit/cf4bfec05b67c326ea8d2dc7e2cc7440a61cfcab looks reasonable to me.
Comment From: randall77
It's not immediately clear to me that async preempting mid-epilog would be a problem. As long as the pcdata for stack frame size is correct, it should work? I'm not 100% sure about that though.
Marking the epilog as non-preemptible sounds fine though.
Comment From: randall77
We can also change the epilog code to always use a single add (building the frame size value in a different register and adding it in all in one go).
Comment From: randall77
Yeah, the frame size info looks wrong. For this code:
package main

//go:noinline
func f(x, y int) byte {
	var a [100000]byte
	a[x] = 1
	return a[y]
}

func main() {
	f(3, 4)
}
The epilog does:
100069630: a97ffbfd ldp x29, x30, [sp, #-8]
100069634: 911b03ff add sp, sp, #1728
100069638: 914063ff add sp, sp, #24, lsl #12 ; =98304
10006963c: d65f03c0 ret
But the frame size data doesn't have that intermediate step.
frame size: 100032
[1000695c0:1000695eb]: 0
[1000695ec:10006963b]: 100032
[10006963c:10006963f]: 0
[100069640:10006965f]: 100032
[100069660:10006967f]: 0
The prolog looks ok: it computes the new SP in a different register and moves it into the SP register with a single instruction.
Comment From: prattmic
I have trouble following the assembler code w.r.t. splitting the instructions, but it seems plausible that when a Prog is split, Spadj is not split accordingly.
Comment From: gopherbot
Change https://go.dev/cl/689235 mentions this issue: cmd/compile: for arm64 epilog, do SP increment with a single instruction
Comment From: nsrip-dd
Nice work @tsheinen! Nothing to add on the fix/root cause side, but I will say that for the second example in my original report, we're crashing while scanning the stack of a goroutine preempted at the end of this function: https://github.com/vishvananda/netlink/blob/c7a4f832a3d1a5328cef0a565404e4507eb2bb69/nl/nl_linux.go#L803. That function indeed has a big stack frame due to this buffer. Here's the disassembly where the function is getting preempted:
f95dd8: a97ffbfd ldp x29, x30, [sp, #-0x8]
f95ddc: 910143ff add sp, sp, #0x50
VVV---- HERE ----VVV
f95de0: 914043ff add sp, sp, #0x10, lsl #12 // =0x10000
f95de4: d65f03c0 ret
So I think what you're describing is the same issue that we're seeing 👍
Comment From: prattmic
@gopherbot Please backport to 1.23 and 1.24. This issue causes random crashes when preempting functions with large stack frames on arm64.
Comment From: gopherbot
Backport issue(s) opened: #74693 (for 1.23), #74694 (for 1.24).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Comment From: sirzooro
Good to see that the fix for arm64 is ready. I had a similar crash in tests running on amd64 (x86_64); please check the code for it too.
Comment From: randall77
@sirzooro The code for amd64 looks ok. Whatever you are seeing is not the same bug as was fixed in CL 689235. Could you open a separate bug with details of your problem?