runtime/pprof.appendLocsForStack asserts that an inline-expanded PC always expands to the same number of logical PCs (inlined frames). This is a static property of a given PC, so it should always hold.
When handling SIGPROF, if we are on an extra M running in C, we don't attempt a traceback, since this is C code anyway, and just add a sample with stack {pc, runtime._ExternalCode}.
"We are on an extra M running in C" is defined as gp.m.isExtraInC. In cgocallbackg, we clear this field only after exitsyscall returns. This leaves a fairly long window in which we are in fact running Go code, but the SIGPROF handler thinks we are in C.
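The effect of that window can be illustrated with a small model. This is a minimal sketch, not runtime code: the m type, sampleStack, externalCode and the PC values are hypothetical. It only shows that the same Go PC produces a two-entry sample while the flag is still set, and a full traceback once it has been cleared.

```go
package main

import "fmt"

// m is a hypothetical, stripped-down stand-in for the runtime's M struct;
// only the isExtraInC flag discussed above is modelled.
type m struct {
	isExtraInC bool
}

// externalCode is a hypothetical value standing in for runtime._ExternalCode.
const externalCode uintptr = 0x900

// sampleStack models the handler's choice: while the M is flagged as being
// in C, no traceback is attempted and the sample is {pc, _ExternalCode};
// otherwise the full Go traceback (passed in here) is recorded.
func sampleStack(mp *m, pc uintptr, goTraceback []uintptr) []uintptr {
	if mp.isExtraInC {
		return []uintptr{pc, externalCode}
	}
	return goTraceback
}

func main() {
	// 0x100 is a hypothetical Go PC in exitsyscall with two inlined callers,
	// so its full traceback has three entries.
	const pc uintptr = 0x100
	full := []uintptr{pc, 0x101, 0x102}

	mp := &m{isExtraInC: true} // cgocallbackg: exitsyscall is running, flag still set
	fmt.Println(len(sampleStack(mp, pc, full))) // 2: Go code sampled as if it were C

	mp.isExtraInC = false // cleared only after exitsyscall returns
	fmt.Println(len(sampleStack(mp, pc, full))) // 3
}
```

The model deliberately takes the Go traceback as an argument instead of computing one. The only point it makes is that the sample's shape depends on the flag rather than on the PC.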
A lot of this code (particularly in exitsyscall) is reachable from normal Go code as well. If any of it has more than 2 inlined frames at a single PC, then a SIGPROF at that PC from a normal Go context, followed by a SIGPROF at the same PC in this cgocallback context, could trigger the appendLocsForStack panic.
I do not know whether any code reachable in this window actually has more than 2 inlined frames. Exactly 2 frames is not enough to trigger the panic, because appendLocsForStack wouldn't notice that the second frame is runtime._ExternalCode instead of the proper inlined frame.
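A toy model makes this failure mode concrete. It is not the real profileBuilder: loc, builder, appendLocs, replay and all PC values are hypothetical. The model assumes, as described above, that a cache hit on a PC skips as many stack entries as the cached location consumed when it was first built. It records a normal Go sample first and then a {pc, _ExternalCode} sample for the same leaf PC.

```go
package main

import "fmt"

// loc is a hypothetical, simplified model of a cached profile Location: the
// run of stack PCs (one per inlined logical frame) folded into it when first built.
type loc struct {
	id  uint64
	pcs []uintptr
}

// builder models only the PC -> location cache of runtime/pprof's profileBuilder.
type builder struct {
	locs map[uintptr]loc
}

// appendLocs walks stk. On a cache miss it builds a location from the first
// depth[stk[0]] PCs; on a hit it assumes the same number of PCs follow and
// skips them. A stack shorter than the cached run causes a slice-bounds
// panic, which is recovered into err here.
func (b *builder) appendLocs(stk []uintptr, depth map[uintptr]int) (ids []uint64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	for len(stk) > 0 {
		if l, ok := b.locs[stk[0]]; ok {
			ids = append(ids, l.id)
			stk = stk[len(l.pcs):] // relies on "same expansion every time"
			continue
		}
		n := depth[stk[0]]
		if n == 0 {
			n = 1 // not inlined: one PC, one frame
		}
		l := loc{id: uint64(len(b.locs) + 1), pcs: append([]uintptr(nil), stk[:n]...)}
		b.locs[stk[0]] = l
		ids = append(ids, l.id)
		stk = stk[n:]
	}
	return ids, nil
}

// replay records a normal Go sample and then a {pc, _ExternalCode} sample for
// the same leaf PC, returning the error (if any) from the second one.
func replay(inlineDepth int) error {
	const leaf, externalCode uintptr = 0x100, 0x900 // hypothetical PCs
	goStack := []uintptr{leaf}
	for i := 1; i < inlineDepth; i++ {
		goStack = append(goStack, leaf+uintptr(i)) // fake PCs for inlined callers
	}
	goStack = append(goStack, 0x200) // a non-inlined caller
	b := &builder{locs: map[uintptr]loc{}}
	depth := map[uintptr]int{leaf: inlineDepth}
	if _, err := b.appendLocs(goStack, depth); err != nil {
		return err
	}
	_, err := b.appendLocs([]uintptr{leaf, externalCode}, depth)
	return err
}

func main() {
	fmt.Println(replay(3)) // runtime error: slice bounds out of range [3:2]
	fmt.Println(replay(2)) // <nil>: _ExternalCode is silently swallowed
}
```

With three frames, the second sample is shorter than the cached run. This produces the same error message as the panic reported later in this thread. With two frames, nothing fails, but the _ExternalCode entry is consumed as if it were the inlined caller's frame.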
One potential fix is to attempt inline expansion in sigprofNonGoPC, in case the PC actually is a Go PC.
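The shape of that fix can be sketched in a simplified model. Here goTracebackPCs is a hypothetical lookup table standing in for the runtime's findfunc plus inline unwinding. It ignores details such as the return-address adjustment applied to traceback PCs, and externalCode and the PC values are placeholders. The idea is that a Go PC gets the same per-inlined-frame entries a Go traceback would record, followed by the _ExternalCode marker. A genuine C PC keeps today's two-entry form.

```go
package main

import "fmt"

// goTracebackPCs is a hypothetical stand-in for the runtime's inline
// unwinding: for a Go PC it lists the PCs a normal Go traceback would record
// for that physical frame (the PC plus one per inlined caller).
var goTracebackPCs = map[uintptr][]uintptr{
	0x100: {0x100, 0x101, 0x102}, // e.g. CompareAndSwap in casgstatus in exitsyscall
}

// externalCode is a hypothetical value standing in for runtime._ExternalCode.
const externalCode uintptr = 0x900

// nonGoSampleStack sketches the suggested fix: if the "non-Go" PC is actually
// a Go PC, expand it like a Go traceback would before appending the
// _ExternalCode marker, so every sample for this PC has the same expansion.
func nonGoSampleStack(pc uintptr) []uintptr {
	if pcs, ok := goTracebackPCs[pc]; ok {
		stk := append([]uintptr(nil), pcs...) // copy; don't alias the table
		return append(stk, externalCode)
	}
	return []uintptr{pc, externalCode} // genuine C PC: unchanged behavior
}

func main() {
	fmt.Printf("%#x\n", nonGoSampleStack(0x100))  // Go PC: expanded
	fmt.Printf("%#x\n", nonGoSampleStack(0x7f00)) // C PC: {pc, _ExternalCode}
}
```

With the expanded form, a cached three-frame location for 0x100 consumes exactly the three Go entries. The _ExternalCode entry then becomes its own location, so the sample agrees with samples taken from normal Go context.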
Comment From: gabyhelp
Related Issues
- runtime: "invalid pc-encoded table" throw caused by bad cgo traceback #44971 (closed)
- runtime: sigprof handler crashes backtracing runtime.nanotime #24925 (closed)
- gccgo: seg fault if profiling signal arrives when collecting backtrace #29448 (closed)
- runtime/pprof: inline frames may not use combined location #37446 (closed)
Related Code Changes
- runtime: record current PC for SIGPROF on non-Go thread
- runtime/pprof: increment fake overflow record PC
- runtime/pprof: expand final stack frame to avoid truncation
- runtime/pprof: correctly encode inlined functions in CPU profile
- runtime: collect stack trace if SIGPROF arrives on non-Go thread
- runtime: fail silently if we unwind over sigpanic into C code
Comment From: nsrip-dd
I believe we're seeing this problem in one of our programs. We're using the Apache Arrow Go client library, which has C (and also maybe C++?) code that calls back into Go. I think it also calls Go from C on "extra" Ms, based on seeing samples in a CPU profile where cgocallback is the root frame. We've seen panics like this, on Go 1.23.6:
```
panic: runtime error: slice bounds out of range [3:2]

goroutine 216489 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x401d2c2000, {0x4026ff9e00, 0x0, 0x40}, {0x402cf99df8?, 0x7?, 0x40?})
	/usr/local/go/src/runtime/pprof/proto.go:443 +0x8bc
runtime/pprof.(*profileBuilder).build(0x401d2c2000)
	/usr/local/go/src/runtime/pprof/proto.go:376 +0x314
runtime/pprof.profileWriter({0x634c1a0?, 0x4025122630?})
	/usr/local/go/src/runtime/pprof/pprof.go:882 +0xc4
created by runtime/pprof.StartCPUProfile in goroutine 216360
	/usr/local/go/src/runtime/pprof/pprof.go:853 +0x184
```
Unfortunately, I don't have the exact PC it's failing to handle. But I do see in CPU profiles for the program that the call to casgstatus in exitsyscall is inlined, and that a CompareAndSwap is inlined into that call:
```
2157: 0x598fcc M=1 internal/runtime/atomic.(*Uint32).CompareAndSwap /usr/local/go/src/internal/runtime/atomic/types.go:236:0 s=235
             runtime.casgstatus /usr/local/go/src/runtime/proc.go:1193:0 s=1175
             runtime.exitsyscall /usr/local/go/src/runtime/proc.go:4661:0 s=4618
```
So I think we meet the conditions described in this issue.
Comment From: prattmic
I filed this issue while investigating one of these appendLocsForStack panics. I came up with this theoretical problem, though it ended up not being the cause of my crash.
For that crash, I modified sigprofNonGoPC to throw if the PC actually is a Go PC, with something like:
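```go
// Debugging variant: the stock sigprofNonGoPC takes only pc, so its call
// site in the signal handler must also be changed to pass info and ctx.
func sigprofNonGoPC(pc uintptr, info *siginfo, ctx unsafe.Pointer) {
	if prof.hz.Load() != 0 {
		// Unchanged behavior: record the sample as {pc, _ExternalCode}.
		stk := []uintptr{
			pc,
			abi.FuncPCABIInternal(_ExternalCode) + sys.PCQuantum,
		}
		cpuprof.addNonGo(stk)

		// Debugging addition: if pc belongs to a Go function, this thread
		// is not really running C code.
		fi := findfunc(pc)
		if fi.valid() {
			name := funcname(fi)
			if name == "runtime.futex" {
				// cgocallbackg -> exitsyscall -> stoplockedm -> futex actually occurs fairly often
				return
			}
			println("runtime: SIGPROF on Go PC", hex(pc), "name", name)
			println("runtime: siginfo", info, "ctx", ctx)
			c := &sigctxt{info, ctx}
			dumpregs(c)
			// Extra debugging dumps as desired.
			throw("SIGPROF on Go PC without G/M")
		}
	}
}
```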
If you can reproduce the problem, something like this might be useful: it catches the problem at the moment the inconsistent sample is recorded, rather than much later when the profile builder panics.