runtime/pprof.appendLocsForStack asserts that an inline-expanded PC always expands to the same number of logical PCs (inlined frames). This is a static property of a given PC, so it should always be true.

When handling SIGPROF, if we are on an extra M running in C, we don't attempt a traceback, since this is C code anyway. We just add a sample with stack {pc, runtime._ExternalCode}.

"We are on an extra M running in C" is defined as gp.m.isExtraInC. In cgocallbackg, we clear this field after exitsyscall returns. This leaves a fairly long window when we are in fact running Go code, but the SIGPROF handler will think it is in C.

A lot of this code (particularly in exitsyscall) is reachable from normal Go code as well. If any of this code has more than 2 inlined frames at a single PC, then a SIGPROF at that PC from a normal Go context followed by a SIGPROF at the same PC in this cgocallback context could trip the appendLocsForStack assertion and panic.

I do not know whether any code reachable in this window actually has more than 2 inlined frames at a single PC. Exactly 2 frames is not enough to trigger the panic, because appendLocsForStack wouldn't notice that the second frame is runtime._ExternalCode rather than the proper frame.
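To make the failure mode concrete, here is a toy model of the cache-and-skip step, not the actual runtime/pprof code. The locCache type, the skipLeading method, and all PC values are hypothetical names made up for illustration. The only assumption carried over from the description above is that a cached PC is expected to occupy the same number of stack entries every time it is seen.

```go
package main

import "fmt"

// Hypothetical PC values, chosen only for illustration.
const (
	pcLeaf       uintptr = 0x1000 // PC whose inline expansion spans three entries
	pcInl1       uintptr = 0x1001 // logical PC for the first enclosing inlined call
	pcInl2       uintptr = 0x1002 // logical PC for the outermost function
	pcCaller     uintptr = 0x2000 // ordinary caller frame
	externalCode uintptr = 0x3000 // stands in for runtime._ExternalCode
)

// locCache maps a leading PC to the number of stack entries its inline
// expansion occupied the first time that PC was seen.
type locCache map[uintptr]int

// skipLeading consumes the leading location of stk and returns the rest.
// On a cache miss it records expanded, the entry count for this sample.
// On a hit it skips the cached count, which fails if the stack is shorter.
func (c locCache) skipLeading(stk []uintptr, expanded int) ([]uintptr, error) {
	if len(stk) == 0 {
		return nil, nil
	}
	n, ok := c[stk[0]]
	if !ok {
		n = expanded
		c[stk[0]] = n
	}
	if n > len(stk) {
		return nil, fmt.Errorf("slice bounds out of range [%d:%d]", n, len(stk))
	}
	return stk[n:], nil
}

func main() {
	cache := locCache{}

	// Sample 1: normal Go context, traceback with inline expansion.
	goStack := []uintptr{pcLeaf, pcInl1, pcInl2, pcCaller}
	rest, err := cache.skipLeading(goStack, 3)
	fmt.Println(len(rest), err)

	// Sample 2: same PC, recorded as {pc, _ExternalCode} without expansion.
	cStack := []uintptr{pcLeaf, externalCode}
	_, err = cache.skipLeading(cStack, 1)
	fmt.Println(err)
}
```

In this model the second call fails because the cached count (3) exceeds the length of the two-entry stack. With a cached count of 2, the same two-entry stack would be skipped without complaint, which is why exactly 2 inlined frames would go unnoticed.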

One potential fix is to attempt to do inline expansion in sigprofNonGoPC in case it actually is a Go PC.

Comment From: nsrip-dd

I believe we're seeing this problem in one of our programs. We're using the Apache Arrow Go client library, which has C (and also maybe C++?) code that calls back into Go. I think it also calls Go from C on "extra" Ms, based on seeing samples in a CPU profile where cgocallback is the root frame. We've seen panics like this, on Go 1.23.6:

panic: runtime error: slice bounds out of range [3:2]

goroutine 216489 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x401d2c2000, {0x4026ff9e00, 0x0, 0x40}, {0x402cf99df8?, 0x7?, 0x40?})
        /usr/local/go/src/runtime/pprof/proto.go:443 +0x8bc
runtime/pprof.(*profileBuilder).build(0x401d2c2000)
        /usr/local/go/src/runtime/pprof/proto.go:376 +0x314
runtime/pprof.profileWriter({0x634c1a0?, 0x4025122630?})
        /usr/local/go/src/runtime/pprof/pprof.go:882 +0xc4
created by runtime/pprof.StartCPUProfile in goroutine 216360
        /usr/local/go/src/runtime/pprof/pprof.go:853 +0x184

I unfortunately don't have the exact PC it's failing to handle. But I do see in CPU profiles for the program that this call to casgstatus in exitsyscall is inlined, and there's a CompareAndSwap inlined into that call:

2157: 0x598fcc M=1 internal/runtime/atomic.(*Uint32).CompareAndSwap /usr/local/go/src/internal/runtime/atomic/types.go:236:0 s=235
             runtime.casgstatus /usr/local/go/src/runtime/proc.go:1193:0 s=1175
             runtime.exitsyscall /usr/local/go/src/runtime/proc.go:4661:0 s=4618

So I think we meet the conditions described in this issue.

Comment From: prattmic

I filed this issue when investigating one of these appendLocsForStack panics. I came up with this theoretical problem, though it ended up not being the cause of my crash.

For that crash, I modified sigprofNonGoPC to throw if the PC actually is a Go PC.

Something like

func sigprofNonGoPC(pc uintptr, info *siginfo, ctx unsafe.Pointer) {
    if prof.hz.Load() != 0 {
        stk := []uintptr{
            pc,
            abi.FuncPCABIInternal(_ExternalCode) + sys.PCQuantum,
        }
        cpuprof.addNonGo(stk)

        fi := findfunc(pc)
        if fi.valid() {
            name := funcname(fi)
            if name == "runtime.futex" {
                // cgocallbackg -> exitsyscall -> stoplockedm -> futex actually occurs fairly often
                return
            }

            println("runtime: SIGPROF on Go PC", hex(pc), "name", name)
            println("runtime: siginfo", info, "ctx", ctx)
            c := &sigctxt{info, ctx}
            dumpregs(c)

            // Extra debugging dumps as desired.

            throw("SIGPROF on Go PC without G/M")
        }
    }
}

If you can reproduce the panic, something like this might be useful: it catches the problem at the moment it occurs rather than much later, when the profile is built and the panic fires.