At Datadog, we've seen segfaults in runtime.(*unwinder).next. The programs run on Linux, on arm64 in all the examples I've seen, and were built with Go 1.24.1 and Go 1.23.6.

Here is the first example, on Go 1.24.1:

SIGSEGV: segmentation violation
PC=0x468da4 m=13 sigcode=1 addr=0x118
goroutine 0 [idle]:
runtime.(*unwinder).next(0xfc510200e438)
    /usr/local/go/src/runtime/traceback.go:458 +0x184
runtime.traceback2(0xfc510200e438, 0x1, 0x0, 0x2e)
    /usr/local/go/src/runtime/traceback.go:967 +0xcc
runtime.traceback1.func1(0x1)
    /usr/local/go/src/runtime/traceback.go:903 +0x54
runtime.traceback1(0x400a702540?, 0x417800?, 0x3?, 0x400a702540, 0x68?)
    /usr/local/go/src/runtime/traceback.go:927 +0x19c
runtime.traceback(...)
    /usr/local/go/src/runtime/traceback.go:803
runtime.tracebackothers.func1(0x400a702540)
    /usr/local/go/src/runtime/traceback.go:1279 +0x104
runtime.forEachGRace(0xfc510200e6c8)
    /usr/local/go/src/runtime/proc.go:720 +0x68
runtime.tracebackothers(0x40074efdc0?)
    /usr/local/go/src/runtime/traceback.go:1265 +0xcc
runtime.Stack.func1()
    /usr/local/go/src/runtime/mprof.go:1717 +0xb4
runtime.systemstack(0x0)
    /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c

goroutine 989290 gp=0x40074efdc0 m=13 mp=0x400078e008 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_arm64.s:201 +0x8 fp=0x4007d22960 sp=0x4007d22950 pc=0x481048
runtime.Stack({0x40107d6000?, 0x100000?, 0x100000?}, 0x1)
    /usr/local/go/src/runtime/mprof.go:1707 +0xe0 fp=0x4007d22a00 sp=0x4007d22960 pc=0x43ab40
runtime/pprof.writeGoroutineStacks({0x27ac520, 0x4009abfd40})
    /usr/local/go/src/runtime/pprof/pprof.go:764 +0x6c fp=0x4007d22a40 sp=0x4007d22a00 pc=0x8f890c
runtime/pprof.writeGoroutine({0x27ac520?, 0x4009abfd40?}, 0x0?)
    /usr/local/go/src/runtime/pprof/pprof.go:753 +0x2c fp=0x4007d22a80 sp=0x4007d22a40 pc=0x8f884c
runtime/pprof.(*Profile).WriteTo(0x23cae63?, {0x27ac520?, 0x4009abfd40?}, 0x206e4c0?)
    /usr/local/go/src/runtime/pprof/pprof.go:377 +0x14c fp=0x4007d22b90 sp=0x4007d22a80 pc=0x8f5f5c
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).lookupProfile(0x40000cac08?, {0x23cae63?, 0x780a278d9052?}, {0x27ac520, 0x4009abfd40}, 0x2)
    /go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:136 +0x58 fp=0x4007d22bd0 sp=0x4007d22b90 pc=0x1681538
gopkg.in/DataDog/dd-trace-go.v1/profiler.init.func2(0x4008cfc0a0)
    /go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profile.go:168 +0xf4 fp=0x4007d22c60 sp=0x4007d22bd0 pc=0x167c414
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0x4008cfc0a0, 0x5)
    /go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profile.go:348 +0x17c fp=0x4007d22e50 sp=0x4007d22c60 pc=0x167fabc
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.func2(0x5)
    /go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:355 +0xb8 fp=0x4007d22fb0 sp=0x4007d22e50 pc=0x1682da8
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.gowrap2()
    /go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.72.2/profiler/profiler.go:367 +0x30 fp=0x4007d22fd0 sp=0x4007d22fb0 pc=0x1682cb0
runtime.goexit({})

[ ... elided ... ]

r0      0xfc510200e438
r1      0x0
r2      0x1
r3      0x1
r4      0x400a702540
r5      0x0
r6      0x1
r7      0x0
r8      0x3627cb0
r9      0x1
r10     0x279e7e8
r11     0x6372732f6f672f6c
r12     0x656d69746e75722f
r13     0x6f672e636f72702f
r14     0x30372e3176406370
r15     0x7265746e692f302e
r16     0xfc510180ef10
r17     0xfc510200e0c0
r18     0x0
r19     0x0
r20     0xfc510200e0b4
r21     0xfc510200e498
r22     0x1
r23     0x400ba15108
r24     0x202a2e0
r25     0x0
r26     0xffffffffffffffff
r27     0x4058000
r28     0x40000fe380
r29     0xfc510200e038
lr      0x468c68
sp      0xfc510200e040
pc      0x468da4
fault   0x118

The crash happens at traceback.go:458, during a call to runtime.Stack triggered by calling pprof.Lookup("goroutine").WriteTo(w, 2). Unfortunately, there are no goroutine addresses in this output (I'm not sure why), so it's hard to tell which goroutine's stack was being unwound in this case.

The other occurrence is in a different program, built with Go 1.23.6. It segfaults on the same line in runtime.(*unwinder).next, this time during garbage collection:

SIGSEGV: segmentation violation
PC=0x488148 m=21 sigcode=1 addr=0x118
goroutine 0 gp=0x4003a88380 m=21 mp=0x40085fa008 [idle]:
runtime.(*unwinder).next(0xe5b01f40e280)
        /usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0xe5b01f40e230 sp=0xe5b01f40e1a0 pc=0x488148
runtime.scanstack(0x4002a8ea80, 0x400007b250)
        /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0xe5b01f40e370 sp=0xe5b01f40e230 pc=0x4460a0
runtime.markroot.func1()
        /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0xe5b01f40e3c0 sp=0xe5b01f40e370 pc=0x444b78
runtime.markroot(0x400007b250, 0x234, 0x1)
        /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0xe5b01f40e470 sp=0xe5b01f40e3c0 pc=0x444848
runtime.gcDrain(0x400007b250, 0xb)
        /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0xe5b01f40e4e0 sp=0xe5b01f40e470 pc=0x446b14
runtime.gcDrainMarkWorkerFractional(...)
        /usr/local/go/src/runtime/mgcmark.go:1118
runtime.gcBgMarkWorker.func2()
        /usr/local/go/src/runtime/mgc.go:1506 +0x7c fp=0xe5b01f40e530 sp=0xe5b01f40e4e0 pc=0x442a1c
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0xe5b01f40e540 sp=0xe5b01f40e530 pc=0x4a3a3c

goroutine 9 gp=0x4000254a80 m=21 mp=0x40085fa008 [GC worker (active)]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_arm64.s:201 +0x8 fp=0x4000cc4f10 sp=0x4000cc4f00 pc=0x4a39b8
runtime.gcBgMarkWorker(0x4000066690)
        /usr/local/go/src/runtime/mgc.go:1472 +0x200 fp=0x4000cc4fb0 sp=0x4000cc4f10 pc=0x4426d0
runtime.gcBgMarkStartWorkers.gowrap1()
        /usr/local/go/src/runtime/mgc.go:1328 +0x28 fp=0x4000cc4fd0 sp=0x4000cc4fb0 pc=0x442498
runtime.goexit({})
        /usr/local/go/src/runtime/asm_arm64.s:1223 +0x4 fp=0x4000cc4fd0 sp=0x4000cc4fd0 pc=0x4a5ee4

[ ... ]

goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
        /usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
        /go/pkg/mod/github.com/!data!dog/netlink@v1.0.1-0.20240223195320-c7a4f832a3d1/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0

The last goroutine, gp=0x4002a8ea80, matches the first argument to runtime.scanstack above, so it appears to be the goroutine that was being scanned. The crash output ends there.

Comment From: adonovan

See also https://github.com/golang/go/issues/73043#issuecomment-2775964745, whose attached hypothesis is Windows-only.

Comment From: dmitshur

CC @golang/runtime.

Comment From: prattmic

Looks like gp.m is nil, as 0x118 is the offset of incgo in the M. It would be good to disassemble a binary to double-check, though.

Comment From: prattmic

gp.m could become nil if we didn't suspend the G before doing traceback. But that seems unlikely to me, as I would expect that to cause much more spectacular failures (especially from the GC).

I'd say it's more likely that the stack somehow contains a function that isn't valid (triggering the !flr.valid() path) even though the G is parked (and thus has no M). I think the idea in this code path is that only a running G could possibly contain an invalid function, so it is safe to dereference gp.m.
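A heavily simplified model of this hypothesis, assuming the check at traceback.go:458 has roughly the shape doPrint && gp.m.incgo && funcID == sigpanic (consistent with the disassembly posted later in this thread). The G and M types, the funcIDSigpanic value, and both helper functions are hypothetical; this is a sketch of the reasoning, not the runtime's code or an actual fix. The first variant faults for a parked goroutine whose m is nil; the second adds a nil check.

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-ins for runtime.m and runtime.g.
// Only the fields needed to illustrate the check are modeled.
type M struct{ incgo bool }
type G struct{ m *M } // m is nil when the goroutine is parked

// funcIDSigpanic is a placeholder value, not the runtime's real constant.
const funcIDSigpanic = 1

// keepPrinting models the assumed shape of the check: it dereferences gp.m
// unconditionally, so it faults if gp.m is nil.
func keepPrinting(gp *G, doPrint bool, funcID int) bool {
	if doPrint && gp.m.incgo && funcID == funcIDSigpanic {
		return false // models the cgo/sigpanic special case: suppress the diagnostic
	}
	return doPrint
}

// keepPrintingNilSafe is the same check with an explicit nil test, so a
// parked goroutine skips the cgo special case instead of faulting.
func keepPrintingNilSafe(gp *G, doPrint bool, funcID int) bool {
	if doPrint && gp.m != nil && gp.m.incgo && funcID == funcIDSigpanic {
		return false
	}
	return doPrint
}

func main() {
	parked := &G{m: nil}
	fmt.Println(keepPrintingNilSafe(parked, true, funcIDSigpanic)) // true
	inCgo := &G{m: &M{incgo: true}}
	fmt.Println(keepPrintingNilSafe(inCgo, true, funcIDSigpanic)) // false
}
```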

I assume that this happens rarely and you can't reliably reproduce yet?

Comment From: nsrip-dd

> I assume that this happens rarely and you can't reliably reproduce yet?

Yeah, it doesn't seem to happen frequently. I've seen low double-digits of occurrences over a week across all of our production services. I just started investigating this morning and haven't yet reproduced the crash.

Comment From: nsrip-dd

> Looks like gp.m is nil, as 0x118 is the offset of incgo in the M. Though it would be good to disassemble a binary and double check.

Here's the disassembly of runtime.(*unwinder).next from the first example, around where the fault is happening:

  traceback.go:457  0x468d94        9a9f17e3        CSET EQ, R3
  traceback.go:458  0x468d98        370801a2        TBNZ $1, R2, 13(PC)
  traceback.go:458  0x468d9c        f94043e4        MOVD 128(RSP), R4
  traceback.go:458  0x468da0        f9401885        MOVD 48(R4), R5
  traceback.go:458  0x468da4        394460a5        MOVBU 280(R5), R5 // <<<<< HERE
  traceback.go:458  0x468da8        360000e5        TBZ $0, R5, 7(PC)
  traceback.go:458  0x468dac        f94037e5        MOVD 104(RSP), R5
  traceback.go:458  0x468db0        3940a0a6        MOVBU 40(R5), R6
  traceback.go:458  0x468db4        71004cdf        CMPW $19, R6

Full output, in case it helps:
TEXT runtime.(*unwinder).next(SB) /usr/local/go/src/runtime/traceback.go
  traceback.go:440  0x468c20        f9400b90        MOVD 16(R28), R16               
  traceback.go:440  0x468c24        d10043f1        SUB $16, RSP, R17               
  traceback.go:440  0x468c28        eb10023f        CMP R16, R17                    
  traceback.go:440  0x468c2c        54001949        BLS 202(PC)                 
  traceback.go:440  0x468c30        f8170ffe        MOVD.W R30, -144(RSP)               
  traceback.go:440  0x468c34        f81f83fd        MOVD R29, -8(RSP)               
  traceback.go:440  0x468c38        d10023fd        SUB $8, RSP, R29                
  traceback.go:441  0x468c3c        f9400401        MOVD 8(R0), R1                  
  traceback.go:446  0x468c40        f9401002        MOVD 32(R0), R2                 
  traceback.go:446  0x468c44        b40013a2        CBZ R2, 157(PC)                 
  traceback.go:446  0x468c48        f9004fe0        MOVD R0, 152(RSP)               
  traceback.go:441  0x468c4c        f9003be1        MOVD R1, 112(RSP)               
  traceback.go:443  0x468c50        f9402401        MOVD 72(R0), R1                 
  traceback.go:442  0x468c54        f9400003        MOVD (R0), R3                   
  traceback.go:442  0x468c58        f90037e3        MOVD R3, 104(RSP)               
  runtime2.go:240   0x468c5c        f90043e1        MOVD R1, 128(RSP)               
  traceback.go:450  0x468c60        aa0203e0        MOVD R2, R0                 
  traceback.go:450  0x468c64        940057bf        CALL runtime.findfunc(SB)           
  traceback.go:451  0x468c68        b4000900        CBZ R0, 72(PC)                  
  traceback.go:477  0x468c6c        f9404fe3        MOVD 152(RSP), R3               
  traceback.go:477  0x468c70        f9400864        MOVD 16(R3), R4                 
  traceback.go:477  0x468c74        f9401065        MOVD 32(R3), R5                 
  traceback.go:477  0x468c78        eb0400bf        CMP R4, R5                  
  traceback.go:477  0x468c7c        540000a1        BNE 5(PC)                   
  traceback.go:477  0x468c80        f9401465        MOVD 40(R3), R5                 
  traceback.go:477  0x468c84        f9401866        MOVD 48(R3), R6                 
  traceback.go:477  0x468c88        eb0500df        CMP R5, R6                  
  traceback.go:477  0x468c8c        54001260        BEQ 147(PC)                 
  traceback.go:484  0x468c90        f94037e4        MOVD 104(RSP), R4               
  traceback.go:484  0x468c94        3940a085        MOVBU 40(R4), R5                
  traceback.go:484  0x468c98        71004cbf        CMPW $19, R5                    
  traceback.go:484  0x468c9c        54000060        BEQ 3(PC)                   
  traceback.go:484  0x468ca0        71000cbf        CMPW $3, R5                 
  traceback.go:484  0x468ca4        54000061        BNE 3(PC)                   
  traceback.go:484  0x468ca8        b24003e5        ORR $1, ZR, R5                  
  traceback.go:484  0x468cac        14000003        JMP 3(PC)                   
  traceback.go:484  0x468cb0        710018bf        CMPW $6, R5                 
  traceback.go:484  0x468cb4        9a9f17e5        CSET EQ, R5                 
  traceback.go:485  0x468cb8        360000a5        TBZ $0, R5, 5(PC)               
  traceback.go:486  0x468cbc        39416466        MOVBU 89(R3), R6                
  traceback.go:486  0x468cc0        b27e00c6        ORR $4, R6, R6                  
  traceback.go:486  0x468cc4        39016466        MOVB R6, 89(R3)                 
  traceback.go:486  0x468cc8        14000004        JMP 4(PC)                   
  traceback.go:488  0x468ccc        39416466        MOVBU 89(R3), R6                
  traceback.go:488  0x468cd0        927df8c6        AND $-5, R6, R6                 
  traceback.go:488  0x468cd4        39016466        MOVB R6, 89(R3)                 
  traceback.go:492  0x468cd8        3940a084        MOVBU 40(R4), R4                
  traceback.go:492  0x468cdc        39016064        MOVB R4, 88(R3)                 
  traceback.go:493  0x468ce0        f9000060        MOVD R0, (R3)                   
  traceback.go:493  0x468ce4        f9000461        MOVD R1, 8(R3)                  
  traceback.go:494  0x468ce8        f9401064        MOVD 32(R3), R4                 
  traceback.go:494  0x468cec        f9000864        MOVD R4, 16(R3)                 
  traceback.go:495  0x468cf0        f900107f        MOVD ZR, 32(R3)                 
  traceback.go:496  0x468cf4        f9401864        MOVD 48(R3), R4                 
  traceback.go:496  0x468cf8        f9001464        MOVD R4, 40(R3)                 
  traceback.go:497  0x468cfc        f900187f        MOVD ZR, 48(R3)                 
  traceback.go:501  0x468d00        36000365        TBZ $0, R5, 27(PC)              
  traceback.go:502  0x468d04        aa0403e1        MOVD R4, R1                 
  traceback.go:503  0x468d08        91004021        ADD $16, R1, R1                 
  traceback.go:502  0x468d0c        f9400082        MOVD (R4), R2                   
  traceback.go:502  0x468d10        f90027e2        MOVD R2, 72(RSP)                
  traceback.go:503  0x468d14        f9001461        MOVD R1, 40(R3)                 
  traceback.go:504  0x468d18        f9400860        MOVD 16(R3), R0                 
  traceback.go:504  0x468d1c        94005791        CALL runtime.findfunc(SB)           
  traceback.go:505  0x468d20        f9404fe2        MOVD 152(RSP), R2               
  traceback.go:505  0x468d24        f9000040        MOVD R0, (R2)                   
  traceback.go:505  0x468d28        f9000441        MOVD R1, 8(R2)                  
  traceback.go:506  0x468d2c        b5000080        CBNZ R0, 4(PC)                  
  traceback.go:507  0x468d30        f94027e3        MOVD 72(RSP), R3                
  traceback.go:507  0x468d34        f9000843        MOVD R3, 16(R2)                 
  traceback.go:507  0x468d38        1400000c        JMP 12(PC)                  
  traceback.go:508  0x468d3c        f9400843        MOVD 16(R2), R3                 
  symtab.go:1174    0x468d40        b9401002        MOVWU 16(R0), R2                
  symtab.go:1174    0x468d44        b24003e4        ORR $1, ZR, R4                  
  symtab.go:1174    0x468d48        97ffe856        CALL runtime.pcvalue(SB)            
  traceback.go:508  0x468d4c        350000a0        CBNZW R0, 5(PC)                 
  traceback.go:509  0x468d50        f94027e3        MOVD 72(RSP), R3                
  traceback.go:509  0x468d54        f9404fe0        MOVD 152(RSP), R0               
  traceback.go:509  0x468d58        f9001003        MOVD R3, 32(R0)                 
  traceback.go:509  0x468d5c        14000002        JMP 2(PC)                   
  traceback.go:513  0x468d60        f9404fe0        MOVD 152(RSP), R0               
  traceback.go:513  0x468d64        aa0003e2        MOVD R0, R2                 
  traceback.go:513  0x468d68        aa0203e3        MOVD R2, R3                 
  traceback.go:513  0x468d6c        aa0303e0        MOVD R3, R0                 
  traceback.go:513  0x468d70        aa1f03e1        MOVD ZR, R1                 
  traceback.go:513  0x468d74        aa0103e2        MOVD R1, R2                 
  traceback.go:513  0x468d78        97fffed2        CALL runtime.(*unwinder).resolveInternal(SB)    
  traceback.go:514  0x468d7c        f85f83fd        MOVD -8(RSP), R29               
  traceback.go:514  0x468d80        f84907fe        MOVD.P 144(RSP), R30                
  traceback.go:514  0x468d84        d65f03c0        RET                     
  traceback.go:456  0x468d88        f9404fe0        MOVD 152(RSP), R0               
  traceback.go:456  0x468d8c        39416402        MOVBU 89(R0), R2                
  traceback.go:457  0x468d90        721f005f        TSTW $2, R2                 
  traceback.go:457  0x468d94        9a9f17e3        CSET EQ, R3                 
  traceback.go:458  0x468d98        370801a2        TBNZ $1, R2, 13(PC)             
  traceback.go:458  0x468d9c        f94043e4        MOVD 128(RSP), R4               
  traceback.go:458  0x468da0        f9401885        MOVD 48(R4), R5                 
  traceback.go:458  0x468da4        394460a5        MOVBU 280(R5), R5               
  traceback.go:458  0x468da8        360000e5        TBZ $0, R5, 7(PC)               
  traceback.go:458  0x468dac        f94037e5        MOVD 104(RSP), R5               
  traceback.go:458  0x468db0        3940a0a6        MOVBU 40(R5), R6                
  traceback.go:458  0x468db4        71004cdf        CMPW $19, R6                    
  traceback.go:458  0x468db8        540000e1        BNE 7(PC)                   
  traceback.go:458  0x468dbc        aa1f03e3        MOVD ZR, R3                 
  traceback.go:458  0x468dc0        14000005        JMP 5(PC)                   
  symtab.go:1110    0x468dc4        f94037e5        MOVD 104(RSP), R5               
  traceback.go:458  0x468dc8        14000003        JMP 3(PC)                   
  traceback.go:466  0x468dcc        f94043e4        MOVD 128(RSP), R4               
  symtab.go:1110    0x468dd0        f94037e5        MOVD 104(RSP), R5               
  traceback.go:456  0x468dd4        7200045f        TSTW $3, R2                 
  traceback.go:465  0x468dd8        54000040        BEQ 2(PC)                   
  traceback.go:465  0x468ddc        36000623        TBZ $0, R3, 49(PC)              
  traceback.go:456  0x468de0        3900ffe2        MOVB R2, 63(RSP)                
  symtab.go:1110    0x468de4        b5000085        CBNZ R5, 4(PC)                  
  symtab.go:1110    0x468de8        aa1f03e1        MOVD ZR, R1                 
  symtab.go:1110    0x468dec        aa1f03e3        MOVD ZR, R3                 
  traceback.go:466  0x468df0        14000007        JMP 7(PC)                   
  symtab.go:1113    0x468df4        b98004a1        MOVW 4(R5), R1                  
  symtab.go:1113    0x468df8        f9403be0        MOVD 112(RSP), R0               
  symtab.go:1113    0x468dfc        97ffe761        CALL runtime.(*moduledata).funcName(SB)     
  traceback.go:466  0x468e00        f94043e4        MOVD 128(RSP), R4               
  traceback.go:466  0x468e04        aa0003e3        MOVD R0, R3                 
  traceback.go:466  0x468e08        f9404fe0        MOVD 152(RSP), R0               
  traceback.go:466  0x468e0c        f90023e1        MOVD R1, 64(RSP)                
  traceback.go:466  0x468e10        f9003fe3        MOVD R3, 120(RSP)               
  traceback.go:466  0x468e14        f9405081        MOVD 160(R4), R1                
  traceback.go:466  0x468e18        f90033e1        MOVD R1, 96(RSP)                
  traceback.go:466  0x468e1c        f9401000        MOVD 32(R0), R0                 
  traceback.go:466  0x468e20        f9002fe0        MOVD R0, 88(RSP)                
  traceback.go:466  0x468e24        97ff6c43        CALL runtime.printlock(SB)          
  traceback.go:466  0x468e28        f000fb20        ADRP 32927744(PC), R0               
  traceback.go:466  0x468e2c        9114cc00        ADD $1331, R0, R0               
  traceback.go:466  0x468e30        d2800161        MOVD $11, R1                    
  traceback.go:466  0x468e34        97ff6e63        CALL runtime.printstring(SB)            
  traceback.go:466  0x468e38        f94033e0        MOVD 96(RSP), R0                
  traceback.go:466  0x468e3c        97ff6da9        CALL runtime.printuint(SB)          
  traceback.go:466  0x468e40        b000fc80        ADRP 33099776(PC), R0               
  traceback.go:466  0x468e44        9120e000        ADD $2104, R0, R0               
  traceback.go:466  0x468e48        d2800361        MOVD $27, R1                    
  traceback.go:466  0x468e4c        97ff6e5d        CALL runtime.printstring(SB)            
  traceback.go:466  0x468e50        f9403fe0        MOVD 120(RSP), R0               
  traceback.go:466  0x468e54        f94023e1        MOVD 64(RSP), R1                
  traceback.go:466  0x468e58        97ff6e5a        CALL runtime.printstring(SB)            
  traceback.go:466  0x468e5c        f000fb40        ADRP 32944128(PC), R0               
  traceback.go:466  0x468e60        912c1c00        ADD $2823, R0, R0               
  traceback.go:466  0x468e64        d28001a1        MOVD $13, R1                    
  traceback.go:466  0x468e68        97ff6e56        CALL runtime.printstring(SB)            
  traceback.go:466  0x468e6c        f9402fe0        MOVD 88(RSP), R0                
  traceback.go:466  0x468e70        97ff6dec        CALL runtime.printhex(SB)           
  traceback.go:466  0x468e74        97ff6cbb        CALL runtime.printnl(SB)            
  traceback.go:466  0x468e78        97ff6c4e        CALL runtime.printunlock(SB)            
  traceback.go:467  0x468e7c        f94043e0        MOVD 128(RSP), R0               
  traceback.go:467  0x468e80        f9400401        MOVD 8(R0), R1                  
  traceback.go:467  0x468e84        f9400000        MOVD (R0), R0                   
  traceback.go:467  0x468e88        f9404fe2        MOVD 152(RSP), R2               
  traceback.go:467  0x468e8c        aa1f03e3        MOVD ZR, R3                 
  traceback.go:467  0x468e90        94000a54        CALL runtime.tracebackHexdump(SB)       
  traceback.go:456  0x468e94        3940ffe0        MOVBU 63(RSP), R0               
  traceback.go:456  0x468e98        7200041f        TSTW $3, R0                 
  traceback.go:472  0x468e9c        f9404fe0        MOVD 152(RSP), R0               
  traceback.go:469  0x468ea0        54000140        BEQ 10(PC)                  
  traceback.go:472  0x468ea4        f900101f        MOVD ZR, 32(R0)                 
  traceback.go:473  0x468ea8        94000032        CALL runtime.(*unwinder).finishInternal(SB) 
  traceback.go:474  0x468eac        f85f83fd        MOVD -8(RSP), R29               
  traceback.go:474  0x468eb0        f84907fe        MOVD.P 144(RSP), R30                
  traceback.go:474  0x468eb4        d65f03c0        RET                     
  traceback.go:447  0x468eb8        9400002e        CALL runtime.(*unwinder).finishInternal(SB) 
  traceback.go:448  0x468ebc        f85f83fd        MOVD -8(RSP), R29               
  traceback.go:448  0x468ec0        f84907fe        MOVD.P 144(RSP), R30                
  traceback.go:448  0x468ec4        d65f03c0        RET                     
  traceback.go:470  0x468ec8        b000fba0        ADRP 32985088(PC), R0               
  traceback.go:470  0x468ecc        91213c00        ADD $2127, R0, R0               
  traceback.go:470  0x468ed0        d2800221        MOVD $17, R1                    
  traceback.go:470  0x468ed4        940047c7        CALL runtime.throw(SB)              
  traceback.go:477  0x468ed8        f9002fe4        MOVD R4, 88(RSP)                
  traceback.go:477  0x468edc        f9002be5        MOVD R5, 80(RSP)                
  traceback.go:479  0x468ee0        97ff6c14        CALL runtime.printlock(SB)          
  traceback.go:479  0x468ee4        f000fca0        ADRP 33124352(PC), R0               
  traceback.go:479  0x468ee8        9134a000        ADD $3368, R0, R0               
  traceback.go:479  0x468eec        d28003a1        MOVD $29, R1                    
  traceback.go:479  0x468ef0        97ff6e34        CALL runtime.printstring(SB)            
  traceback.go:479  0x468ef4        f9402fe0        MOVD 88(RSP), R0                
  traceback.go:479  0x468ef8        97ff6dca        CALL runtime.printhex(SB)           
  traceback.go:479  0x468efc        f000fa80        ADRP 32845824(PC), R0               
  traceback.go:479  0x468f00        911c8c00        ADD $1827, R0, R0               
  traceback.go:479  0x468f04        b27e03e1        ORR $4, ZR, R1                  
  traceback.go:479  0x468f08        97ff6e2e        CALL runtime.printstring(SB)            
  traceback.go:479  0x468f0c        f9402be0        MOVD 80(RSP), R0                
  traceback.go:479  0x468f10        97ff6dc4        CALL runtime.printhex(SB)           
  traceback.go:479  0x468f14        97ff6c93        CALL runtime.printnl(SB)            
  traceback.go:479  0x468f18        97ff6c26        CALL runtime.printunlock(SB)            
  traceback.go:480  0x468f1c        f94043e0        MOVD 128(RSP), R0               
  traceback.go:480  0x468f20        f9400001        MOVD (R0), R1                   
  traceback.go:480  0x468f24        f9400400        MOVD 8(R0), R0                  
  traceback.go:480  0x468f28        f9404fe2        MOVD 152(RSP), R2               
  traceback.go:480  0x468f2c        f9401443        MOVD 40(R2), R3                 
  traceback.go:480  0x468f30        aa0003e4        MOVD R0, R4                 
  traceback.go:480  0x468f34        aa0103e0        MOVD R1, R0                 
  traceback.go:480  0x468f38        aa0403e1        MOVD R4, R1                 
  traceback.go:480  0x468f3c        94000a29        CALL runtime.tracebackHexdump(SB)       
  traceback.go:481  0x468f40        9000fb80        ADRP 32964608(PC), R0               
  traceback.go:481  0x468f44        911ba400        ADD $1769, R0, R0               
  traceback.go:481  0x468f48        b2400fe1        ORR $15, ZR, R1                 
  traceback.go:481  0x468f4c        940047a9        CALL runtime.throw(SB)              
  traceback.go:481  0x468f50        d503201f        NOOP                        
  traceback.go:440  0x468f54        f90007e0        MOVD R0, 8(RSP)                 
  traceback.go:440  0x468f58        aa1e03e3        MOVD R30, R3                    
  traceback.go:440  0x468f5c        940060a1        CALL runtime.morestack_noctxt.abi0(SB)      
  traceback.go:440  0x468f60        f94007e0        MOVD 8(RSP), R0                 
  traceback.go:440  0x468f64        17ffff2f        JMP runtime.(*unwinder).next(SB)        
  traceback.go:440  0x468f68        00000000        ?                       
  traceback.go:440  0x468f6c        00000000        ?                       

Comment From: prattmic

Thanks, that definitely looks like the dereference for m.incgo.

Comment From: prattmic

@adonovan Your crash has a different fault address (0xe8), but I checked and that is the offset of incgo on windows-amd64, so I think that is the same bug.

Comment From: mknyszek

I marked #73413 as a dupe, but it had a potential reproducer: https://github.com/golang/go/issues/73413#issuecomment-2812496241.

Comment From: sirzooro

Hi, have you made any progress on this bug? Can we expect the fix to be part of the 1.25 release (there is no milestone assigned to this bug)? We can stay on 1.23 for a while, as it is more stable than 1.24, but once 1.25 is released we will have to upgrade.

BTW, yesterday I saw another crash from 1.23. This call stack is a bit different from the previous ones:

Thread 3.13 "fuzztests" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff917fa640 (LWP 4309)]
runtime.raise () at runtime/sys_linux_amd64.s:154
154 runtime/sys_linux_amd64.s: No such file or directory.
#0  runtime.raise () at runtime/sys_linux_amd64.s:154
No locals.
#1  0x0000555555655da5 in runtime.dieFromSignal (sig=6)
    at runtime/signal_unix.go:967
No locals.
#2  0x000055555563e6fc in runtime.crash () at runtime/signal_unix.go:1056
No locals.
#3  runtime.fatalthrow.func1 () at runtime/panic.go:1287
        gp = 0xc00015afc0
        pc = 93824993397898
        sp = 140735634446816
#4  0x000055555563e658 in runtime.fatalthrow (t=<optimized out>)
    at runtime/panic.go:1276
        pc = <optimized out>
        sp = 6
        gp = 0xc00015afc0
#5  0x0000555555670c8a in runtime.throw (s=...) at runtime/panic.go:1101
No locals.
#6  0x000055555566082a in runtime.(*unwinder).finishInternal (
    u=<optimized out>) at runtime/traceback.go:566
        gp = 0xc00015a000
#7  0x0000555555660632 in runtime.(*unwinder).next (u=0x0)
    at runtime/traceback.go:447
        gp = <optimized out>
        ~r0.ptr = <optimized out>
        ~r0.len = <optimized out>
#8  0x0000555555626fa9 in runtime.scanstack (gp=0xc00015a000,
    gcw=0xc00003b750, ~r0=<optimized out>) at runtime/mgcmark.go:904
        sp = <optimized out>
        scannedSize = <optimized out>
        p = <optimized out>
        state = <optimized out>
        u = <optimized out>
#9  0x00005555556259f1 in runtime.markroot.func1 () at runtime/mgcmark.go:240
        gp = 0xc00015a000
        &workDone = 0x7fff917f9c68
        gcw = 0xc00003b750
        userG = 0xc000092540
        selfScan = false
#10 0x0000555555625699 in runtime.markroot (gcw=0xc00003b750, i=27,
    flushBgCredit=true, ~r0=<optimized out>) at runtime/mgcmark.go:214
        status = <optimized out>
        gp = 0x6
        workCounter = <optimized out>
        ~r0.ptr = <optimized out>
        ~r0.ptr = <optimized out>
        workDone = <optimized out>
        ~r0.len = <optimized out>
        ~r0.len = <optimized out>
#11 0x0000555555627a34 in runtime.gcDrain (gcw=0xc00003b750, flags=3)
    at runtime/mgcmark.go:1186
        gp = 0xc000092540
        pp = 0xc00003a508
        flushBgCredit = true
        initScanWork = 0
        checkWork = 9223372036854775807
        check = {void (bool)} 0x7fff917f9d20
#12 0x0000555555623d1a in runtime.gcDrainMarkWorkerDedicated (gcw=0x0,
    untilPreempt=<optimized out>) at runtime/mgcmark.go:1110
        flags = <optimized out>
#13 runtime.gcBgMarkWorker.func2 () at runtime/mgc.go:1500
        gp = 0xc000092540
        pp = 0xc00003a508
#14 0x0000555555675b27 in runtime.systemstack () at runtime/asm_amd64.s:514
No locals.
#15 0x01ffffffffffff28 in ?? ()
No symbol table info available.
#16 0x0000000000800000 in ?? ()
No symbol table info available.
#17 0x000000c00015afc0 in ?? ()
No symbol table info available.
#18 0x0000555555675a20 in ?? ()
No locals.
#19 0x0000555555675a25 in runtime.mstart () at runtime/asm_amd64.s:395
No locals.
#20 0x000055555576a1a8 in crosscall1 () at gcc_amd64.S:42
No locals.
#21 0x00007fffaaffc910 in ?? ()
No symbol table info available.
#22 0x00007ffff7a677d0 in ?? () at ./nptl/pthread_create.c:321
   from /lib/x86_64-linux-gnu/libc.so.6
#23 0x0000000000000000 in ?? ()

Comment From: prattmic

That latest crash is an explicit throw: https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=566;drc=3fd729b2a14a7efcf08465cbea60a74da5457f06?q=traceback.go:447&ss=go

Those print lines should appear in your stderr, do you have them?

Comment From: prattmic

For the original issue, if you can reproduce with https://go.dev/cl/676635 in your toolchain, you should get more context in the crash output.

You can use https://pkg.go.dev/golang.org/dl/gotip to easily build a toolchain at that CL. This will basically make a Go 1.24.3 toolchain plus my CL.

$ gotip download 676635
$ gotip build your.program/exe

Comment From: gopherbot

Change https://go.dev/cl/676635 mentions this issue: [release-branch.go1.24] DO NOT SUBMIT: runtime: add traceback nil m context

Comment From: sirzooro

That latest crash is an explicit throw: https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=566;drc=3fd729b2a14a7efcf08465cbea60a74da5457f06?q=traceback.go:447&ss=go

Those print lines should appear in your stderr, do you have them?

Unfortunately output from test binary was not captured. I am looking how to capture it in the future.

For the original issue, if you can reproduce with https://go.dev/cl/676635 in your toolchain, you should get more context in the crash output.

You can use https://pkg.go.dev/golang.org/dl/gotip to easily build a toolchain at that CL. This will basically make a Go 1.24.3 toolchain plus my CL.

$ gotip download 676635
$ gotip build your.program/exe

Thanks, I will try it. Is there a way to run gotip download in non-interactive mode? Currently it asks for confirmation before downloading.

Comment From: prattmic

Is there a way to run gotip download in non-interactive mode?

Hm, I don't think so, though I believe yes | gotip download 676635 will work.

Otherwise, you can do the steps manually, it's not too difficult:

$ git clone https://go.googlesource.com/go
$ cd go/src
$ git fetch https://go.googlesource.com/go refs/changes/35/676635/1 && git checkout -b change-676635 FETCH_HEAD
$ ./make.bash
# Now use ../bin/go as your go binary.

Note that you will also want to get the stderr output captured. My CL prints more useful output, but the crash stack trace itself won't be very different.

Comment From: tsheinen

I have a limited reproducer for what seems to be this same bug.

package main

import (
    "runtime"
)

const RECEIVE_BUFFER_SIZE = 65536

//go:noinline
func big_stack(val int) int {
    var big_buffer = make([]byte, RECEIVE_BUFFER_SIZE)

    sum := 0
    // this was added by vibes in the middle of the night to confound the optimizer
    for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
        big_buffer[i] = byte(val)
    }
    for i := 0; i < RECEIVE_BUFFER_SIZE; i++ {
        sum ^= int(big_buffer[i])
    }
    return sum
}

//go:noinline
func calls_big_stack() {
    for {
        _ = big_stack(1000)
    }
}

func main() {
    go func() {
        for {
            runtime.GC()
        }
    }()
    calls_big_stack()
}

It reliably crashes within a few minutes on my test machine (a Linux aarch64 server with 128 cores):

SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118

goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
        /home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 fp=0x400030fe00 sp=0x400030fcc0 pc=0x274f0
runtime.markroot.func1()
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:238 +0xa8 fp=0x400030fe50 sp=0x400030fe00 pc=0x25fc8
runtime.markroot(0x400002f750, 0x14, 0x1)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:212 +0x1c8 fp=0x400030ff00 sp=0x400030fe50 pc=0x25c98
runtime.gcDrain(0x400002f750, 0x3)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:1188 +0x434 fp=0x400030ff70 sp=0x400030ff00 pc=0x27f64
runtime.gcDrainMarkWorkerDedicated(...)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
        /home/thea/sdk/go1.23.4/src/runtime/mgc.go:1489 +0x94 fp=0x400030ffc0 sp=0x400030ff70 pc=0x23f04
runtime.systemstack(0x400030c000)
        /home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:244 +0x6c fp=0x400030ffd0 sp=0x400030ffc0 pc=0x72c8c

goroutine 117 gp=0x40002f3180 m=8 mp=0x4000198708 [GC worker (active)]:
runtime.systemstack_switch()
        /home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:201 +0x8 fp=0x4000681f10 sp=0x4000681f00 pc=0x72c08
runtime.gcBgMarkWorker(0x40002a0000)
        /home/thea/sdk/go1.23.4/src/runtime/mgc.go:1472 +0x200 fp=0x4000681fb0 sp=0x4000681f10 pc=0x23ba0
runtime.gcBgMarkStartWorkers.gowrap1()
        /home/thea/sdk/go1.23.4/src/runtime/mgc.go:1328 +0x28 fp=0x4000681fd0 sp=0x4000681fb0 pc=0x23968
runtime.goexit({})
        /home/thea/sdk/go1.23.4/src/runtime/asm_arm64.s:1223 +0x4 fp=0x4000681fd0 sp=0x4000681fd0 pc=0x74f74
created by runtime.gcBgMarkStartWorkers in goroutine 5
        /home/thea/sdk/go1.23.4/src/runtime/mgc.go:1328 +0x140

goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
        /home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
        /home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)

real    1m29.165s
user    4m4.987s
sys     0m43.212s

I see crashes consistently on 1.23.4 but not on 1.23.9. However, I am also seeing production crashes on 1.23.9, so I assume the reproducer just needs tuning for that version rather than the bug having been fixed. I have only observed this crash on linux/aarch64 machines, never on amd64.

As far as I can tell, the bug occurs when async preemption happens in the middle of the epilogue of a function with a large stack frame:

  main.go:29            0x77bf8                 910023fd                ADD $8, RSP, R29
  main.go:29            0x77bfc                 914043bd                ADD $(16<<12), R29, R29
  main.go:29            0x77c00                 910043ff                ADD $16, RSP, RSP
  main.go:29            0x77c04                 914043ff                ADD $(16<<12), RSP, RSP
  main.go:29            0x77c08                 d65f03c0                RET

In both the reproducer and the production crashes I've seen, the goroutine is asynchronously preempted between ADD $16, RSP, RSP and ADD $(16<<12), RSP, RSP. I still see crashes with GODEBUG=gcshrinkstackoff=1, so my working theory is that this has nothing to do with stack shrinking and instead happens whenever the compiler splits the stack pointer addition into two instructions. It makes sense to me that async preemption would be unsound in the middle of the function epilogue, but I don't understand the runtime well enough to do more than speculate.
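Why the addition gets split at all: an arm64 ADD (immediate) instruction encodes a 12-bit unsigned immediate, optionally shifted left by 12, so an SP increment with nonzero bits both below and above bit 12 cannot be done in one instruction. The sketch below only illustrates that arithmetic; splitAddImm and needsTwoAdds are hypothetical helpers, not functions from the Go toolchain.

```go
package main

import "fmt"

// splitAddImm splits an SP increment v into the two immediates used by an
// arm64 "ADD (immediate)" pair: lo is a plain 12-bit value, hi is a 12-bit
// value that the instruction shifts left by 12. ok is false when v needs
// more than 24 bits and cannot be expressed as such a pair.
func splitAddImm(v uint32) (lo, hi uint32, ok bool) {
	if v >= 1<<24 {
		return 0, 0, false
	}
	return v & 0xfff, v >> 12, true
}

// needsTwoAdds reports whether v needs exactly two ADD (immediate)
// instructions, i.e. it has nonzero bits both below and above bit 12.
func needsTwoAdds(v uint32) bool {
	lo, hi, ok := splitAddImm(v)
	return ok && lo != 0 && hi != 0
}

func main() {
	for _, v := range []uint32{16, 4096, 65552, 100032, 0x10050} {
		lo, hi, _ := splitAddImm(v)
		fmt.Printf("%6d: lo=%d hi=%d<<12 two ADDs=%v\n", v, lo, hi, needsTwoAdds(v))
	}
}
```

For 65552 this gives ADD $16 followed by ADD $(16<<12), matching the disassembly above; 100032 and 0x10050 split the same way into 1728 + 24<<12 and 0x50 + 0x10<<12.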

Comment From: tsheinen

Hi! I've dug into this some more and have a workaround and a possible runtime fix. If the goroutine is asynchronously preempted between the two add x, rsp instructions, the stack frame is left partially popped. This isn't memory corruption exactly, and it fixes itself once the goroutine is scheduled again and the stack pointer is adjusted the rest of the way. However, if the stack is unwound while it is in this state (during garbage collection, panic/recover, etc.), the unwinder dereferences the partially adjusted stack pointer and reads bad data. We worked around the bug in production by making the problematic stack buffer escape to the heap, so the frame was no longer large enough to require two instructions.
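To make the failure mode concrete, here is a minimal arithmetic sketch. It assumes, as a simplification of the real runtime logic, that the unwinder finds the caller's SP by adding the frame size recorded for the current PC to the current SP. The frame size comes from the reproducer's epilog (ADD $16 then ADD $(16<<12)); the body SP value is made up.

```go
package main

import "fmt"

// Frame of main.big_stack in the reproducer: the epilog pops it with
// ADD $16 followed by ADD $(16<<12), i.e. 16 + 65536 bytes.
const (
	loPart    = 16
	hiPart    = 16 << 12
	frameSize = loPart + hiPart // 65552
)

// computedCallerSP models the unwinder: current SP plus the frame size that
// the PC-indexed metadata reports for the current PC.
func computedCallerSP(sp, reportedFrameSize uint64) uint64 {
	return sp + reportedFrameSize
}

func main() {
	const bodySP uint64 = 0x1000 // hypothetical SP while the function body runs
	trueCallerSP := bodySP + frameSize

	// Preempted between the two ADDs: SP has already moved by loPart, but the
	// metadata still reports the full frame size.
	spAtPreempt := bodySP + loPart
	got := computedCallerSP(spAtPreempt, frameSize)

	fmt.Printf("true caller SP %#x, computed %#x, off by %d bytes\n",
		trueCallerSP, got, got-trueCallerSP)
}
```

The computed caller SP overshoots by exactly the part of the frame already popped (16 bytes here), so anything the unwinder reads relative to it, such as the caller's saved state, comes from the wrong address.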

The reproducer I posted earlier is a bit finicky, and its reliability depends on the Go compiler version, but the bug is still present in the most recent version. It can be triggered reliably with a debugger: break between the two instructions and send SIGURG to trigger async preemption.

I fixed it in a test fork by emitting preempt-unsafe points around the function epilogue when the frame size requires more than one add instruction. It works in my testing, but before I put up a PR for review I'd like to think more about alternatives and check whether the same bug occurs on other fixed-instruction-length architectures.

Comment From: prattmic

@tsheinen do you have a concrete example you can share of one of these crashes including the crashing PC/assembly? Function preambles should be marked by the compiler as "unsafe points", which means we won't perform async preemption if a signal lands there. So what you describe sounds like a bug in where we mark unsafe points.

Comment From: tsheinen

@prattmic I should note that I'm seeing this in the epilogue, not the prologue. My understanding from skimming cmd/internal/obj/arm64/obj7.go is that unsafe points are sprinkled around the prologue, but I didn't see any for the epilogue. One of my earlier comments included a stack trace and the instruction that was preempted when it crashed, but I can grab a core dump if you'd like more detail.

Comment From: prattmic

@tsheinen Apologies, I should have looked right above. :) The epilogue (for arm64) does seem problematic to me when the add is split. @cherrymui or @randall77 may be more familiar with this code. Do you agree?

Your https://github.com/tsheinen/go/commit/cf4bfec05b67c326ea8d2dc7e2cc7440a61cfcab looks reasonable to me.

Comment From: randall77

It's not immediately clear to me that async preempting mid-epilog would be a problem. As long as the pcdata for stack frame size is correct, it should work? I'm not 100% sure about that though.

Marking the epilog as non-preemptible sounds fine though.

Comment From: randall77

We can also change the epilog code to always use a single add (building the frame size value in a different register and adding it in all in one go).
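The design difference can be sketched by comparing how many instructions write SP in each epilog shape. This is a minimal illustration only: the strings are Go-assembler-style pseudo-syntax, not output from the real assembler, and the use of R27 as the scratch register is an assumption, not necessarily what the eventual fix emits.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEpilog mirrors the shape seen in this thread: two ADD (immediate)
// instructions that each write SP, so SP passes through an intermediate value.
func splitEpilog(frame uint32) []string {
	return []string{
		fmt.Sprintf("ADD $%d, RSP, RSP", frame&0xfff),
		fmt.Sprintf("ADD $(%d<<12), RSP, RSP", frame>>12),
		"RET",
	}
}

// singleAddEpilog sketches the suggested alternative: materialize the whole
// frame size in a scratch register (R27 is an illustrative choice), then add
// it to SP with one instruction.
func singleAddEpilog(frame uint32) []string {
	return []string{
		fmt.Sprintf("MOVD $%d, R27", frame),
		"ADD R27, RSP, RSP",
		"RET",
	}
}

// spWrites counts instructions whose destination operand is RSP.
func spWrites(seq []string) int {
	n := 0
	for _, s := range seq {
		if strings.HasSuffix(s, ", RSP") {
			n++
		}
	}
	return n
}

func main() {
	for _, seq := range [][]string{splitEpilog(100032), singleAddEpilog(100032)} {
		fmt.Println(strings.Join(seq, "; "), "=> SP writes:", spWrites(seq))
	}
}
```

With a single SP write there is no PC at which SP holds a half-adjusted value, so the frame-size metadata only needs the before and after states.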

Comment From: randall77

Yeah, the frame size info looks wrong. For this code:

package main

//go:noinline
func f(x, y int) byte {
    var a [100000]byte
    a[x] = 1
    return a[y]
}

func main() {
    f(3, 4)
}

The epilog does:

100069630: a97ffbfd     ldp     x29, x30, [sp, #-8]
100069634: 911b03ff     add     sp, sp, #1728
100069638: 914063ff     add     sp, sp, #24, lsl #12    ; =98304
10006963c: d65f03c0     ret

But the frame size data doesn't have that intermediate step.

    frame size: 100032
        [1000695c0:1000695eb]: 0
        [1000695ec:10006963b]: 100032
        [10006963c:10006963f]: 0
        [100069640:10006965f]: 100032
        [100069660:10006967f]: 0

The prolog looks OK: it computes the new SP in a different register and moves it into the SP register with a single instruction.
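The mismatch can be checked mechanically by walking the epilog above and comparing the bytes of frame actually still allocated at each PC with what the posted pcdata reports. This is a minimal sketch: insn is a hypothetical model, and reportedFrameSize encodes only the posted ranges that cover this epilog, returning 0 elsewhere.

```go
package main

import "fmt"

// insn is a hypothetical, simplified model of one epilog instruction: its
// address and how many bytes it adds to SP.
type insn struct {
	pc    uint64
	spInc uint64
}

// epilog is the sequence from the disassembly above (frame size 100032).
var epilog = []insn{
	{0x100069630, 0},     // ldp x29, x30, [sp, #-8]
	{0x100069634, 1728},  // add sp, sp, #1728
	{0x100069638, 98304}, // add sp, sp, #24, lsl #12
	{0x10006963c, 0},     // ret
}

// actualFrameSize returns, for each instruction, how many bytes of the frame
// are still allocated when a signal lands just before that instruction.
func actualFrameSize(frame uint64, seq []insn) map[uint64]uint64 {
	out := make(map[uint64]uint64, len(seq))
	remaining := frame
	for _, in := range seq {
		out[in.pc] = remaining
		remaining -= in.spInc
	}
	return out
}

// reportedFrameSize encodes only the posted pcdata ranges that cover this
// epilog: 100032 through 0x10006963b, then 0 at the RET.
func reportedFrameSize(pc uint64) uint64 {
	if pc >= 0x1000695ec && pc <= 0x10006963b {
		return 100032
	}
	return 0
}

func main() {
	actual := actualFrameSize(100032, epilog)
	for _, in := range epilog {
		a, r := actual[in.pc], reportedFrameSize(in.pc)
		status := "ok"
		if a != r {
			status = fmt.Sprintf("MISMATCH (off by %d)", int64(r)-int64(a))
		}
		fmt.Printf("%#x actual=%6d reported=%6d %s\n", in.pc, a, r, status)
	}
}
```

Only the PC between the two adds disagrees, and by exactly the 1728 bytes popped by the first add, which is the window in which an async preemption leaves the unwinder with the wrong frame size.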

Comment From: prattmic

I have trouble following the assembler code w.r.t. splitting the instructions, but it seems plausible that when a Prog is split, Spadj is not split accordingly.
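The hypothesis can be illustrated with a minimal sketch of what splitting the adjustment together with the instruction would mean. The prog type and its fields are hypothetical stand-ins and do not reproduce the real assembler data structures; the sign convention (negative means the frame shrinks) is an assumption of this sketch. This is only one possible approach; the alternative suggested above is to avoid splitting the SP increment at all.

```go
package main

import "fmt"

// prog is a hypothetical stand-in for an assembler instruction, keeping only
// what matters here. spadj uses the convention (assumed for this sketch) that
// a negative value means the frame shrinks by that many bytes after it runs.
type prog struct {
	addImm uint32 // bytes this instruction adds to SP
	spadj  int32  // frame-size change recorded for the PC-value table
}

// splitSPAdd splits one epilog "ADD $v, RSP, RSP" into the low-12-bit and
// shifted-high parts, giving each part its own share of the SP adjustment so
// the recorded frame size follows SP through the intermediate state.
// It assumes v < 1<<24 so that two ADD (immediate) instructions suffice.
func splitSPAdd(p prog) []prog {
	lo, hi := p.addImm&0xfff, p.addImm&^0xfff
	if lo == 0 || hi == 0 {
		return []prog{p} // already encodable as a single ADD
	}
	return []prog{
		{addImm: lo, spadj: -int32(lo)},
		{addImm: hi, spadj: p.spadj + int32(lo)},
	}
}

func main() {
	for _, q := range splitSPAdd(prog{addImm: 100032, spadj: -100032}) {
		fmt.Printf("ADD $%d, RSP, RSP  spadj=%d\n", q.addImm, q.spadj)
	}
}
```

The two adjustments still sum to the original -100032, but the table would now record 98304 bytes remaining after the first add instead of the full frame.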

Comment From: gopherbot

Change https://go.dev/cl/689235 mentions this issue: cmd/compile: for arm64 epilog, do SP increment with a single instruction

Comment From: nsrip-dd

Nice work @tsheinen! Nothing to add on the fix/root cause side, but I will say that for the second example in my original report, we're crashing while scanning the stack of a goroutine preempted at the end of this function: https://github.com/vishvananda/netlink/blob/c7a4f832a3d1a5328cef0a565404e4507eb2bb69/nl/nl_linux.go#L803. That function indeed has a big stack frame due to this buffer. Here's the disassembly where the function is getting preempted:

f95dd8: a97ffbfd      ldp     x29, x30, [sp, #-0x8]
f95ddc: 910143ff      add     sp, sp, #0x50
VVV---- HERE ----VVV
f95de0: 914043ff      add     sp, sp, #0x10, lsl #12  // =0x10000
f95de4: d65f03c0      ret

So I think what you're describing is the same issue that we're seeing 👍

Comment From: prattmic

@gopherbot Please backport to 1.23 and 1.24. This issue causes random crashes when preempting functions with large stack frames on arm64.

Comment From: gopherbot

Backport issue(s) opened: #74693 (for 1.23), #74694 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

Comment From: sirzooro

Good to see that the fix for arm64 is ready. I had a similar crash in tests running on amd64 (x86_64); please check the code for it too.

Comment From: randall77

@sirzooro The code for amd64 looks ok. Whatever you are seeing is not the same bug as was fixed in CL 689235. Could you open a separate bug with details of your problem?