Reproducer:
-- dep/dep.go --
package dep

var sink int

//go:noinline
func spin() {
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

func b() {
	// Two calls, too hot to inline, darn...
	spin()
	spin()
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

func a() {
	// Two calls, too hot to inline, darn...
	b()
	b()
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

//go:noinline
func Foo() {
	a()
}
-- dep/dep_test.go --
package dep

import "testing"

func BenchmarkMain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Foo()
	}
}
-- main.go --
package main

import "example.com/pgo/dep"

func main() {
	dep.Foo()
}
-- go.mod --
module example.com/pgo

go 1.23
https://go.dev/play/p/QxBt13wV3Uh
Run the benchmark to collect a profile, and then use that as a PGO profile:
$ go test -bench . -cpuprofile=cpu.pprof ./dep
$ go build "-gcflags=example.com/pgo/dep=-m=2 -d=pgodebug=1" -pgo=cpu.pprof
# example.com/pgo/dep
hot-callsite-thres-from-CDF=0.6493506493506493
dep/dep.go:6:6: cannot inline spin: marked go:noinline
hot-node enabled increased budget=2000 for func=example.com/pgo/dep.b
dep/dep.go:12:6: can inline b with cost 133 as: func() { spin(); spin(); for loop }
hot-node enabled increased budget=2000 for func=example.com/pgo/dep.a
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:23:3 in function example.com/pgo/dep.a
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:24:3 in function example.com/pgo/dep.a
dep/dep.go:21:6: can inline a with cost 285 as: func() { b(); b(); for loop }
dep/dep.go:31:6: cannot inline Foo: marked go:noinline
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:23:3 in function example.com/pgo/dep.a
dep/dep.go:23:3: inlining call to b
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:24:3 in function example.com/pgo/dep.a
dep/dep.go:24:3: inlining call to b
hot-budget check allows inlining for call example.com/pgo/dep.a (cost 285) at dep/dep.go:32:3 in function example.com/pgo/dep.Foo
dep/dep.go:32:3: inlining call to a
The hot-budget lines note that PGO allows inlining `b` into `a` and `a` into `Foo`. `dep/dep.go:32:3: inlining call to a` indicates that `a` was inlined into `Foo`, but there should then be a subsequent inlining of `b` into `Foo`. The output makes this a bit confusing, but objdump makes it clear that `b` was not inlined:
0000000000467a80 <example.com/pgo/dep.Foo>:
...
467a8b: | | e8 b0 ff ff ff call 467a40 <example.com/pgo/dep.b>
467a90: | | e8 ab ff ff ff call 467a40 <example.com/pgo/dep.b>
...
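For reference, the disassembly above can be reproduced with go tool objdump against the binary built earlier; assuming the default output name pgo for this module (adjust to your actual binary name), something like:
$ go tool objdump -s 'dep.Foo' pgo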
A bit of custom logging in the compiler makes it more clear what is happening:
dep/dep.go:31:6: function Foo: DevirtualizeAndInlineFunc
dep/dep.go:31:6: function Foo: TryInlineCall to a
hot-budget check allows inlining for call example.com/pgo/dep.a (cost 285) at dep/dep.go:32:3 in function example.com/pgo/dep.Foo
dep/dep.go:32:3: inlining call to a
dep/dep.go:31:6: function Foo: TryInlineCall to b
dep/dep.go:31:6: canInlineCallExpr from Foo to b: cost too high: maxCost 80, callSiteScore 133, hot false
dep/dep.go:31:6: mkinlcall b: !canInlineCallExpr
It looks like what is happening here is:
- `a` is inlined into `Foo`.
- We then attempt to inline `b` into `Foo` (from the inlined body of `a`).
- `canInlineCallExpr` checks `inlineCostOK` of `Foo -> b`. This consults the PGO edge map, which does not contain a hot edge from `Foo -> b` because PGO includes inline calls in edges, i.e., it has hot edges `Foo -> a` and `a -> b`.
- Inlining does not occur because we didn't increase the budget and the base cost is too high.
I believe what should happen is: if inlining a call inside of an `ir.InlinedCallExpr`, we should use the original name of the `InlinedCallExpr` to consult the PGO edge map.
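To make the mismatch concrete, here is a small illustrative model of the lookup that goes wrong. This is not compiler code; the edge type, the map, and callIsHot are made up for this example (the real compiler keys edges by IR nodes and call-site offsets):

type edge struct{ caller, callee string }

// What the profile effectively records for this reproducer.
var hotEdges = map[edge]bool{
	{"Foo", "a"}: true, // Foo -> a is hot
	{"a", "b"}:   true, // a -> b is hot
	// There is no Foo -> b edge: the profile attributes the b call sites
	// to a, even if a later gets inlined into Foo.
}

func callIsHot(caller, callee string) bool { return hotEdges[edge{caller, callee}] }

After `a` is inlined into `Foo`, the compiler effectively asks callIsHot("Foo", "b"), which returns false, so the `b` call sites never get the increased budget.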
cc @cherrymui
Comment From: gabyhelp
Related Issues and Documentation
- cmd/compile: functions evaluated in defer and go statements are not checked for inlining #42194 (closed)
- cmd/compile/internal/test: TestPGOIntendedInliningShiftedLines failures #59395 (closed)
- cmd/compile/internal/inline: hairyVisitor doesn't properly account for conditionally-inlined PGO calls #59484 (closed)
- cmd/compile: PGO does not devirtualize methods in another package (log: "no function body") #60561 (closed)
- cmd/compile: missed obvious inlining #58905 (closed)
- cmd/compile: output cost while inlining function with `Debug['m'] > 1` #36780 (closed)
- cmd/compile: inlining budget excludes cross-package calls #22194 (closed)
- cmd/compile/internal/devirtualize: PGO devirtualization lost when inlining function containing devirtualized call #62245
Comment From: prattmic
I believe what should happen is if inlining a call inside of an ir.InlinedCallExpr, we should use the original name of the InlinedCallExpr to consult the PGO edge map.
Getting the original name looks somewhat painful: it seems the only reference we keep to it is a reference into the inline tree via an inline mark statement in the expression init: https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/noder/reader.go;drc=4e1cc09f8b9bcef2b6d0839a7d0026b50c21998b;bpv=1;bpt=1;l=3513
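A rough sketch of the idea, assuming the inline mark found in the expression init can be mapped back through the global inline tree (this is only an illustration, not a tested patch; the helper name is made up, and it presumes ir.InlineMarkStmt, its Index field, and base.Ctxt.InlTree.InlinedFunction work as described):

// inlinedCallee is a hypothetical helper (it would live somewhere in
// cmd/compile/internal/inline): given an InlinedCallExpr, look through its
// init statements for the inline mark and return the symbol of the function
// that was inlined there, so the PGO edge map could be consulted with the
// original caller (a -> b) rather than the surrounding function (Foo -> b).
func inlinedCallee(call *ir.InlinedCallExpr) *obj.LSym {
	for _, init := range call.Init() {
		if mark, ok := init.(*ir.InlineMarkStmt); ok {
			return base.Ctxt.InlTree.InlinedFunction(int(mark.Index))
		}
	}
	return nil // no inline mark found; caller would fall back to the enclosing function
}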
Comment From: gopherbot
Change https://go.dev/cl/626295 mentions this issue: cmd/compile: support multiple levels of inlining using PGO
Comment From: ArsenySamoylov
Hello! While working on fixing this problem, we encountered a situation where inlining in the runtime causes a deadlock. The problem was discovered when running the go1 BinaryTree17 benchmark with PGO using https://go.dev/cl/626295 (see the TODO in src/cmd/compile/internal/inline/inl.go:892).
Reproducing the bug
Remove the TODO part from https://go.dev/cl/626295. Compile go1 with PGO and check that lock() was inlined.
Before PGO (there is a call to lock2()):
TEXT runtime.bgsweep(SB) /home/asamoylov/CoreGo/src/runtime/mgcsweep.go
mgcsweep.go:270 0x33a00 f9400b90 MOVD 16(R28), R16
mgcsweep.go:270 0x33a04 eb3063ff CMP R16, RSP
mgcsweep.go:270 0x33a08 54000cc9 BLS 102(PC)
mgcsweep.go:270 0x33a0c f81b0ffe MOVD.W R30, -80(RSP)
mgcsweep.go:270 0x33a10 f81f83fd MOVD R29, -8(RSP)
mgcsweep.go:270 0x33a14 d10023fd SUB $8, RSP, R29
mgcsweep.go:298 0x33a18 f9002fe0 MOVD R0, 88(RSP)
mgcsweep.go:271 0x33a1c f90023fc MOVD R28, 64(RSP)
mgcsweep.go:271 0x33a20 d00043db ADRP 8888320(PC), R27
mgcsweep.go:271 0x33a24 b9411361 MOVWU 272(R27), R1
mgcsweep.go:271 0x33a28 340000c1 CBZW R1, 6(PC)
mgcsweep.go:271 0x33a2c 94015205 CALL runtime.gcWriteBarrier2(SB)
mgcsweep.go:271 0x33a30 f900033c MOVD R28, (R25)
mgcsweep.go:271 0x33a34 b000429b ADRP 8720384(PC), R27
mgcsweep.go:271 0x33a38 f9436761 MOVD 1736(R27), R1
mgcsweep.go:271 0x33a3c f9000721 MOVD R1, 8(R25)
mgcsweep.go:271 0x33a40 f94023e1 MOVD 64(RSP), R1
mgcsweep.go:271 0x33a44 b000429b ADRP 8720384(PC), R27
mgcsweep.go:271 0x33a48 f9036761 MOVD R1, 1736(R27)
mgcsweep.go:274 0x33a4c d503201f NOOP
lock_futex.go:52 0x33a50 d503201f NOOP
lockrank_off.go:24 0x33a54 b0004280 ADRP 8720384(PC), R0
lockrank_off.go:24 0x33a58 911b0000 ADD $1728, R0, R0
lockrank_off.go:24 0x33a5c 97ffaac5 CALL runtime.lock2(SB) // <- call here
...
After PGO:
TEXT runtime.bgsweep(SB) /home/asamoylov/CoreGo/src/runtime/mgcsweep.go
mgcsweep.go:270 0x45f30 f9400b90 MOVD 16(R28), R16
mgcsweep.go:270 0x45f34 d105c3f1 SUB $368, RSP, R17
mgcsweep.go:270 0x45f38 eb10023f CMP R16, R17
mgcsweep.go:270 0x45f3c 5400b929 BLS 1481(PC)
mgcsweep.go:270 0x45f40 d107c3f4 SUB $496, RSP, R20
mgcsweep.go:270 0x45f44 a93ffa9d STP (R29, R30), -8(R20)
mgcsweep.go:270 0x45f48 9100029f MOVD R20, RSP
mgcsweep.go:270 0x45f4c d10023fd SUB $8, RSP, R29
mgcsweep.go:271 0x45f50 f900b7fc MOVD R28, 360(RSP)
mgcsweep.go:271 0x45f54 90004adb ADRP 9797632(PC), R27
mgcsweep.go:271 0x45f58 b9411362 MOVWU 272(R27), R2
mgcsweep.go:271 0x45f5c 340000c2 CBZW R2, 6(PC)
mgcsweep.go:271 0x45f60 94028494 CALL runtime.gcWriteBarrier2(SB)
mgcsweep.go:271 0x45f64 f900033c MOVD R28, (R25)
mgcsweep.go:271 0x45f68 f000497b ADRP 9629696(PC), R27
mgcsweep.go:271 0x45f6c f9436762 MOVD 1736(R27), R2
mgcsweep.go:271 0x45f70 f9000722 MOVD R2, 8(R25)
mgcsweep.go:271 0x45f74 f940b7e2 MOVD 360(RSP), R2
mgcsweep.go:271 0x45f78 f000497b ADRP 9629696(PC), R27
mgcsweep.go:271 0x45f7c f9036762 MOVD R2, 1736(R27)
lock_futex.go:58 0x45f80 f9401b82 MOVD 48(R28), R2
mgcsweep.go:274 0x45f84 d503201f NOOP
lock_futex.go:52 0x45f88 d503201f NOOP
lock_futex.go:58 0x45f8c b9810843 MOVW 264(R2), R3
lock_futex.go:58 0x45f90 37f8b5e3 TBNZ $31, R3, 1455(PC) // <- no call to lock2()
lock_futex.go:61 0x45f94 91000463 ADD $1, R3, R3
lock_futex.go:61 0x45f98 b9010843 MOVW R3, 264(R2)
...
When running the benchmark, you should see BinaryTree17 freeze around the 3rd iteration when running on multiple cores (e.g., 4 cores). (The freeze behavior may vary with different core configurations.)
Root cause of the bug
The deadlock occurs because another goroutine is constantly trying to stop the mark phase (while holding the STW lock). Here is part of its backtrace:
1 0x0000000000083d58 in runtime.preemptall
at /home/asamoylov/CoreGo/src/runtime/proc.go:6303
...
6 0x000000000002e3e4 in runtime.forEachP
at /home/asamoylov/CoreGo/src/runtime/proc.go:2013
7 0x000000000002e3e4 in runtime.gcMarkTermination
at /home/asamoylov/CoreGo/src/runtime/mgc.go:1153
The function gcMarkTermination() ensures that all caches are flushed before continuing the GC cycle. It calls forEachP() to flush the caches for each processor (P).
// World must be stopped and mark assists and background workers must be
// disabled.
func gcMarkTermination(stw worldStop) {
	...
	// Ensure all mcaches are flushed. Each P will flush its own
	// mcache before allocating, but idle Ps may not. Since this
	// is necessary to sweep all spans, we need to ensure all
	// mcaches are flushed before we start the next GC cycle.
	//
	// While we're here, flush the page cache for idle Ps to avoid
	// having pages get stuck on them. These pages are hidden from
	// the scavenger, so in small idle heaps a significant amount
	// of additional memory might be held onto.
	//
	// Also, flush the pinner cache, to avoid leaking that memory
	// indefinitely.
	forEachP(waitReasonFlushProcCaches, func(pp *p) {
	...
However, the main issue is that forEachP() calls fn(p) for each P only when that P is at a GC safe point. In order to do that, it preempts all Ps (hence the call to preemptall() in the backtrace).
// forEachP calls fn(p) for every P p when p reaches a GC safe point.
// If a P is currently executing code, this will bring the P to a GC
// safe point and execute fn on that P. If the P is not executing code
// (it is idle or in a syscall), this will call fn(p) directly while
// preventing the P from exiting its state. This does not ensure that
// fn will run on every CPU executing Go code, but it acts as a global
// memory barrier. GC uses this as a "ragged barrier."
//
// The caller must hold worldsema. fn must not refer to any
// part of the current goroutine's stack, since the GC may move it.
func forEachP(reason waitReason, fn func(*p)) {
Goroutines are not preemptible while they hold any locks, and sweepone() and freeSomeWbufs() take locks. So when they are inlined, the bgsweep goroutine becomes non-preemptible most of the time, leading to a livelock.
Possible solutions
So, in order to solve this problem, I came up with two solutions.
First solution
- Add goschedIfBusy() after unlock() to ensure preemption, which will help avoid a livelock.
const sweepBatchSize = 10
nSwept := 0
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
...
lock(&sweep.lock)
if !isSweepDone() {
	// This can happen if a GC runs between
	// gosweepone returning ^0 above
	// and the lock being acquired.
	unlock(&sweep.lock)
	goschedIfBusy() // <- add yielding here
	continue
}
We were already yielding after sweeping 10 spans so as to stay a low-priority task, so one more yield shouldn't be a big change, I guess.
Second solution
The second solution focuses on why isSweepDone() returns false.
isSweepDone() == true when:
* the sweep drained flag is set
* no sweepers exist
Running with dlv showed that the flag is set, but there is 1 sweeper. It turns out that this sweeper seems to be the same goroutine that executes gcMarkTermination():
func gcMarkTermination(stw worldStop) {
	// There may be stale spans in mcaches that need to be swept.
	// Those aren't tracked in any sweep lists, so we need to
	// count them against sweep completion until we ensure all
	// those spans have been forced out.
	//
	// If gcSweep fully swept the heap (for example if the sweep
	// is not concurrent due to a GODEBUG setting), then we expect
	// the sweepLocker to be invalid, since sweeping is done.
	//
	// N.B. Below we might duplicate some work from gcSweep; this is
	// fine as all that work is idempotent within a GC cycle, and
	// we're still holding worldsema so a new cycle can't start.
	sl := sweep.active.begin() // <- !!! starts sweeper !!!
	...
	forEachP(waitReasonFlushProcCaches, func(pp *p) { // <- !!! livelock here !!!
	...
	if sl.valid {
		// Now that we've swept stale spans in mcaches, they don't
		// count against unswept spans.
		//
		// Note: this sweepLocker may not be valid if sweeping had
		// already completed during the STW. See the corresponding
		// begin() call that produced sl.
		sweep.active.end(sl) // <- !!! ends sweeper !!!
	}
So the second solution is to replace the isSweepDone() check with a check of the drained flag alone (ignoring the number of sweepers). The bgsweep goroutine will then be parked, and the P will become idle.
...
lock(&sweep.lock)
if sweep.active.state.Load()&sweepDrainedMask == 0 { // <- check that the flag is set, ignoring the number of sweepers
	// This can happen if a GC runs between
	// gosweepone returning ^0 above
	// and the lock being acquired.
	unlock(&sweep.lock)
	continue
}
sweep.parked = true
goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceBlockGCSweep, 1) // <- parking here, so we don't occupy the P
}
However, I am not sure that parking the bgsweep goroutine is safe when there are still running sweepers.
PS
To conclude, I would like to know what you think about this issue and my solutions for it. Thank you in advance =).
Comment From: ArsenySamoylov
cc @prattmic
Comment From: prattmic
Nice find! Thanks for the investigation. It sounds like bgsweep was depending on the preempt check in the preamble of lock to preempt promptly.
Adding more explicit preemption checks seems reasonable to me, though I don't understand why the check every 10 sweepone calls isn't sufficient?
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
cc @mknyszek
Comment From: ArsenySamoylov
Adding more explicit preemption checks seems reasonable to me, though I don't understand why the check every 10 sweepone calls isn't sufficient?
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
The problem occurs when there are no spans to sweep, so we don't execute the body of the for loop. In my second solution, I mentioned that sweepDrainedMask is set; however, we have a registered sweeper, and this stops the bgsweep goroutine from parking (later on in the code).
So, here is what bgsweep's execution looks like:
for {
	const sweepBatchSize = 10
	nSwept := 0
	for sweepone() != ^uintptr(0) {
		// no work to do, so skipping this
	}
	for freeSomeWbufs(true) {
		// no work to do, so skipping this
	}
	lock(&sweep.lock)
	if !isSweepDone() {
		unlock(&sweep.lock)
		continue // repeat
	}
	// we don't reach parking here because we have an active sweeper in gcMarkTermination()
	sweep.parked = true
	goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceBlockGCSweep, 1)
}
Comment From: prattmic
Ok, I see, thanks.
Comment From: ArsenySamoylov
So, what do you think? Maybe I should open a new issue regarding this problem?
Comment From: JunyangShao
@prattmic