Reproducer:
-- dep/dep.go --
package dep

var sink int

//go:noinline
func spin() {
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

func b() {
	// Two calls, too hot to inline, darn...
	spin()
	spin()
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

func a() {
	// Two calls, too hot to inline, darn...
	b()
	b()
	for i := 0; i < 1000000; i++ {
		sink = i
	}
}

//go:noinline
func Foo() {
	a()
}
-- dep/dep_test.go --
package dep

import "testing"

func BenchmarkMain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Foo()
	}
}
-- main.go --
package main

import "example.com/pgo/dep"

func main() {
	dep.Foo()
}
-- go.mod --
module example.com/pgo

go 1.23
https://go.dev/play/p/QxBt13wV3Uh
Run the benchmark to collect a profile, and then use that as a PGO profile:
$ go test -bench . -cpuprofile=cpu.pprof ./dep
$ go build "-gcflags=example.com/pgo/dep=-m=2 -d=pgodebug=1" -pgo=cpu.pprof
# example.com/pgo/dep
hot-callsite-thres-from-CDF=0.6493506493506493
dep/dep.go:6:6: cannot inline spin: marked go:noinline
hot-node enabled increased budget=2000 for func=example.com/pgo/dep.b
dep/dep.go:12:6: can inline b with cost 133 as: func() { spin(); spin(); for loop }
hot-node enabled increased budget=2000 for func=example.com/pgo/dep.a
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:23:3 in function example.com/pgo/dep.a
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:24:3 in function example.com/pgo/dep.a
dep/dep.go:21:6: can inline a with cost 285 as: func() { b(); b(); for loop }
dep/dep.go:31:6: cannot inline Foo: marked go:noinline
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:23:3 in function example.com/pgo/dep.a
dep/dep.go:23:3: inlining call to b
hot-budget check allows inlining for call example.com/pgo/dep.b (cost 133) at dep/dep.go:24:3 in function example.com/pgo/dep.a
dep/dep.go:24:3: inlining call to b
hot-budget check allows inlining for call example.com/pgo/dep.a (cost 285) at dep/dep.go:32:3 in function example.com/pgo/dep.Foo
dep/dep.go:32:3: inlining call to a
The hot-budget lines note that PGO allows inlining `b` into `a` and `a` into `Foo`. `dep/dep.go:32:3: inlining call to a` indicates that `a` was inlined into `Foo`, but there should then be a subsequent inlining of `b` into `Foo`. The output makes this a bit confusing, but objdump makes it clear that `b` was not inlined:
0000000000467a80 <example.com/pgo/dep.Foo>:
...
467a8b: | | e8 b0 ff ff ff call 467a40 <example.com/pgo/dep.b>
467a90: | | e8 ab ff ff ff call 467a40 <example.com/pgo/dep.b>
...
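For reference, the disassembly above can be reproduced with go tool objdump against the binary built earlier; assuming the default output name pgo for this module (adjust to your actual binary name), something like:
$ go tool objdump -s 'dep.Foo' pgo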
A bit of custom logging in the compiler makes it more clear what is happening:
dep/dep.go:31:6: function Foo: DevirtualizeAndInlineFunc
dep/dep.go:31:6: function Foo: TryInlineCall to a
hot-budget check allows inlining for call example.com/pgo/dep.a (cost 285) at dep/dep.go:32:3 in function example.com/pgo/dep.Foo
dep/dep.go:32:3: inlining call to a
dep/dep.go:31:6: function Foo: TryInlineCall to b
dep/dep.go:31:6: canInlineCallExpr from Foo to b: cost too high: maxCost 80, callSiteScore 133, hot false
dep/dep.go:31:6: mkinlcall b: !canInlineCallExpr
It looks like what is happening here is:
- `a` is inlined into `Foo`.
- We then attempt to inline `b` into `Foo` (from the inlined body of `a`).
- `canInlineCallExpr` checks `inlineCostOK` of `Foo -> b`. This consults the PGO edge map, which does not contain a hot edge from `Foo -> b` because PGO includes inline calls in edges, i.e., it has hot edges `Foo -> a` and `a -> b`.
- Inlining does not occur because we didn't increase the budget and the base cost is too high.
I believe what should happen is: if inlining a call inside of an `ir.InlinedCallExpr`, we should use the original name of the `InlinedCallExpr` to consult the PGO edge map.
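To make the mismatch concrete, here is a small illustrative model of the lookup that goes wrong. This is not compiler code; the edge type, the map, and callIsHot are made up for this example (the real compiler keys edges by IR nodes and call-site offsets):

type edge struct{ caller, callee string }

// What the profile effectively records for this reproducer.
var hotEdges = map[edge]bool{
	{"Foo", "a"}: true, // Foo -> a is hot
	{"a", "b"}:   true, // a -> b is hot
	// There is no Foo -> b edge: the profile attributes the b call sites
	// to a, even if a later gets inlined into Foo.
}

func callIsHot(caller, callee string) bool { return hotEdges[edge{caller, callee}] }

After `a` is inlined into `Foo`, the compiler effectively asks callIsHot("Foo", "b"), which returns false, so the `b` call sites never get the increased budget.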
cc @cherrymui
Comment From: gabyhelp
Related Issues and Documentation
- cmd/compile: functions evaluated in defer and go statements are not checked for inlining #42194 (closed)
- cmd/compile/internal/test: TestPGOIntendedInliningShiftedLines failures #59395 (closed)
- cmd/compile/internal/inline: hairyVisitor doesn't properly account for conditionally-inlined PGO calls #59484 (closed)
- cmd/compile: PGO does not devirtualize methods in another package (log: "no function body") #60561 (closed)
- cmd/compile: missed obvious inlining #58905 (closed)
- cmd/compile: output cost while inlining function with `Debug['m'] > 1` #36780 (closed)
- cmd/compile: inlining budget excludes cross-package calls #22194 (closed)
- cmd/compile/internal/devirtualize: PGO devirtualization lost when inlining function containing devirtualized call #62245
Comment From: prattmic
I believe what should happen is if inlining a call inside of an ir.InlinedCallExpr, we should use the original name of the InlinedCallExpr to consult the PGO edge map.
Getting the original name looks somewhat painful: it seems the only reference we keep to it is a reference into the inline tree via an inline mark statement in the expression init: https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/noder/reader.go;drc=4e1cc09f8b9bcef2b6d0839a7d0026b50c21998b;bpv=1;bpt=1;l=3513
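A rough sketch of the idea, assuming the inline mark found in the expression init can be mapped back through the global inline tree (this is only an illustration, not a tested patch; the helper name is made up, and it presumes ir.InlineMarkStmt, its Index field, and base.Ctxt.InlTree.InlinedFunction work as described):

// inlinedCallee is a hypothetical helper (it would live somewhere in
// cmd/compile/internal/inline): given an InlinedCallExpr, look through its
// init statements for the inline mark and return the symbol of the function
// that was inlined there, so the PGO edge map could be consulted with the
// original caller (a -> b) rather than the surrounding function (Foo -> b).
func inlinedCallee(call *ir.InlinedCallExpr) *obj.LSym {
	for _, init := range call.Init() {
		if mark, ok := init.(*ir.InlineMarkStmt); ok {
			return base.Ctxt.InlTree.InlinedFunction(int(mark.Index))
		}
	}
	return nil // no inline mark found; caller would fall back to the enclosing function
}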
Comment From: gopherbot
Change https://go.dev/cl/626295 mentions this issue: cmd/compile: support multiple levels of inlining using PGO
Comment From: ArsenySamoylov
Hello! While working on fixing this problem, we encountered a situation where inlining in the runtime causes a deadlock. The problem was discovered when running the go1 BinaryTree17 benchmark with PGO using https://go.dev/cl/626295 (see the TODO in src/cmd/compile/internal/inline/inl.go:892).
Reproducing the bug
Remove the TODO part from https://go.dev/cl/626295. Compile go1 with PGO and check that lock() was inlined.
Before PGO (there is a call to lock2()):
TEXT runtime.bgsweep(SB) /home/asamoylov/CoreGo/src/runtime/mgcsweep.go
mgcsweep.go:270 0x33a00 f9400b90 MOVD 16(R28), R16
mgcsweep.go:270 0x33a04 eb3063ff CMP R16, RSP
mgcsweep.go:270 0x33a08 54000cc9 BLS 102(PC)
mgcsweep.go:270 0x33a0c f81b0ffe MOVD.W R30, -80(RSP)
mgcsweep.go:270 0x33a10 f81f83fd MOVD R29, -8(RSP)
mgcsweep.go:270 0x33a14 d10023fd SUB $8, RSP, R29
mgcsweep.go:298 0x33a18 f9002fe0 MOVD R0, 88(RSP)
mgcsweep.go:271 0x33a1c f90023fc MOVD R28, 64(RSP)
mgcsweep.go:271 0x33a20 d00043db ADRP 8888320(PC), R27
mgcsweep.go:271 0x33a24 b9411361 MOVWU 272(R27), R1
mgcsweep.go:271 0x33a28 340000c1 CBZW R1, 6(PC)
mgcsweep.go:271 0x33a2c 94015205 CALL runtime.gcWriteBarrier2(SB)
mgcsweep.go:271 0x33a30 f900033c MOVD R28, (R25)
mgcsweep.go:271 0x33a34 b000429b ADRP 8720384(PC), R27
mgcsweep.go:271 0x33a38 f9436761 MOVD 1736(R27), R1
mgcsweep.go:271 0x33a3c f9000721 MOVD R1, 8(R25)
mgcsweep.go:271 0x33a40 f94023e1 MOVD 64(RSP), R1
mgcsweep.go:271 0x33a44 b000429b ADRP 8720384(PC), R27
mgcsweep.go:271 0x33a48 f9036761 MOVD R1, 1736(R27)
mgcsweep.go:274 0x33a4c d503201f NOOP
lock_futex.go:52 0x33a50 d503201f NOOP
lockrank_off.go:24 0x33a54 b0004280 ADRP 8720384(PC), R0
lockrank_off.go:24 0x33a58 911b0000 ADD $1728, R0, R0
lockrank_off.go:24 0x33a5c 97ffaac5 CALL runtime.lock2(SB) // <- call here
...
After PGO:
TEXT runtime.bgsweep(SB) /home/asamoylov/CoreGo/src/runtime/mgcsweep.go
mgcsweep.go:270 0x45f30 f9400b90 MOVD 16(R28), R16
mgcsweep.go:270 0x45f34 d105c3f1 SUB $368, RSP, R17
mgcsweep.go:270 0x45f38 eb10023f CMP R16, R17
mgcsweep.go:270 0x45f3c 5400b929 BLS 1481(PC)
mgcsweep.go:270 0x45f40 d107c3f4 SUB $496, RSP, R20
mgcsweep.go:270 0x45f44 a93ffa9d STP (R29, R30), -8(R20)
mgcsweep.go:270 0x45f48 9100029f MOVD R20, RSP
mgcsweep.go:270 0x45f4c d10023fd SUB $8, RSP, R29
mgcsweep.go:271 0x45f50 f900b7fc MOVD R28, 360(RSP)
mgcsweep.go:271 0x45f54 90004adb ADRP 9797632(PC), R27
mgcsweep.go:271 0x45f58 b9411362 MOVWU 272(R27), R2
mgcsweep.go:271 0x45f5c 340000c2 CBZW R2, 6(PC)
mgcsweep.go:271 0x45f60 94028494 CALL runtime.gcWriteBarrier2(SB)
mgcsweep.go:271 0x45f64 f900033c MOVD R28, (R25)
mgcsweep.go:271 0x45f68 f000497b ADRP 9629696(PC), R27
mgcsweep.go:271 0x45f6c f9436762 MOVD 1736(R27), R2
mgcsweep.go:271 0x45f70 f9000722 MOVD R2, 8(R25)
mgcsweep.go:271 0x45f74 f940b7e2 MOVD 360(RSP), R2
mgcsweep.go:271 0x45f78 f000497b ADRP 9629696(PC), R27
mgcsweep.go:271 0x45f7c f9036762 MOVD R2, 1736(R27)
lock_futex.go:58 0x45f80 f9401b82 MOVD 48(R28), R2
mgcsweep.go:274 0x45f84 d503201f NOOP
lock_futex.go:52 0x45f88 d503201f NOOP
lock_futex.go:58 0x45f8c b9810843 MOVW 264(R2), R3
lock_futex.go:58 0x45f90 37f8b5e3 TBNZ $31, R3, 1455(PC) // <- no call to lock2()
lock_futex.go:61 0x45f94 91000463 ADD $1, R3, R3
lock_futex.go:61 0x45f98 b9010843 MOVW R3, 264(R2)
...
When running the benchmark, you should see BinaryTree17 freeze around the 3rd iteration when running on multiple cores (e.g., 4 cores). (The freeze behavior may vary with different core configurations.)
Root cause of the bug
The deadlock occurs because another goroutine is constantly trying to stop the mark phase (while holding the STW lock). Here is part of its backtrace:
1 0x0000000000083d58 in runtime.preemptall
at /home/asamoylov/CoreGo/src/runtime/proc.go:6303
...
6 0x000000000002e3e4 in runtime.forEachP
at /home/asamoylov/CoreGo/src/runtime/proc.go:2013
7 0x000000000002e3e4 in runtime.gcMarkTermination
at /home/asamoylov/CoreGo/src/runtime/mgc.go:1153
The function gcMarkTermination() ensures that all caches are flushed before continuing the GC cycle. It calls forEachP() to flush the caches for each processor (P).
// World must be stopped and mark assists and background workers must be
// disabled.
func gcMarkTermination(stw worldStop) {
	...
	// Ensure all mcaches are flushed. Each P will flush its own
	// mcache before allocating, but idle Ps may not. Since this
	// is necessary to sweep all spans, we need to ensure all
	// mcaches are flushed before we start the next GC cycle.
	//
	// While we're here, flush the page cache for idle Ps to avoid
	// having pages get stuck on them. These pages are hidden from
	// the scavenger, so in small idle heaps a significant amount
	// of additional memory might be held onto.
	//
	// Also, flush the pinner cache, to avoid leaking that memory
	// indefinitely.
	forEachP(waitReasonFlushProcCaches, func(pp *p) {
	...
However, the main issue is that forEachP() calls fn(p) for each P only when that P is at a GC safe point. In order to do that, it preempts all Ps (hence the call to preemptall() in the backtrace).
// forEachP calls fn(p) for every P p when p reaches a GC safe point.
// If a P is currently executing code, this will bring the P to a GC
// safe point and execute fn on that P. If the P is not executing code
// (it is idle or in a syscall), this will call fn(p) directly while
// preventing the P from exiting its state. This does not ensure that
// fn will run on every CPU executing Go code, but it acts as a global
// memory barrier. GC uses this as a "ragged barrier."
//
// The caller must hold worldsema. fn must not refer to any
// part of the current goroutine's stack, since the GC may move it.
func forEachP(reason waitReason, fn func(*p)) {
Goroutines are not preemptible while they hold any locks, and sweepone() and freeSomeWbufs() take locks. So when they are inlined, the bgsweep goroutine becomes non-preemptible most of the time, leading to a livelock.
Possible solutions
So, in order to solve this problem, I came up with two solutions.
First solution
- Add goschedIfBusy() after unlock() to ensure preemption, which will help avoid a livelock.
const sweepBatchSize = 10
nSwept := 0
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
...
lock(&sweep.lock)
if !isSweepDone() {
	// This can happen if a GC runs between
	// gosweepone returning ^0 above
	// and the lock being acquired.
	unlock(&sweep.lock)
	goschedIfBusy() // <- add yielding here
	continue
}
We were already yielding after sweeping 10 spans so as to stay a low-priority task, so one more yield shouldn't be a big change, I guess.
Second solution
The second solution focuses on why isSweepDone() returns false.
isSweepDone() == true when:
* the sweep drained flag is set
* no sweepers exist
Running with dlv showed that the flag is set, but there is 1 sweeper. It turns out that this sweeper seems to be the same goroutine that executes gcMarkTermination():
func gcMarkTermination(stw worldStop) {
	// There may be stale spans in mcaches that need to be swept.
	// Those aren't tracked in any sweep lists, so we need to
	// count them against sweep completion until we ensure all
	// those spans have been forced out.
	//
	// If gcSweep fully swept the heap (for example if the sweep
	// is not concurrent due to a GODEBUG setting), then we expect
	// the sweepLocker to be invalid, since sweeping is done.
	//
	// N.B. Below we might duplicate some work from gcSweep; this is
	// fine as all that work is idempotent within a GC cycle, and
	// we're still holding worldsema so a new cycle can't start.
	sl := sweep.active.begin() // <- !!! starts sweeper !!!
	...
	forEachP(waitReasonFlushProcCaches, func(pp *p) { // <- !!! livelock here !!!
	...
	if sl.valid {
		// Now that we've swept stale spans in mcaches, they don't
		// count against unswept spans.
		//
		// Note: this sweepLocker may not be valid if sweeping had
		// already completed during the STW. See the corresponding
		// begin() call that produced sl.
		sweep.active.end(sl) // <- !!! ends sweeper !!!
	}
So the second solution is to replace the isSweepDone() check with a check of the drained flag alone (ignoring the number of sweepers). The bgsweep goroutine will then be parked, and the P will become idle.
...
lock(&sweep.lock)
if sweep.active.state.Load()&sweepDrainedMask == 0 { // <- check that the flag is set, ignoring the number of sweepers
	// This can happen if a GC runs between
	// gosweepone returning ^0 above
	// and the lock being acquired.
	unlock(&sweep.lock)
	continue
}
sweep.parked = true
goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceBlockGCSweep, 1) // <- parking here, so we don't occupy the P
}
However, I am not sure that parking the bgsweep goroutine is safe when there are still running sweepers.
PS
To conclude, I would like to know what you think about this issue and my solutions for it. Thank you in advance =).
Comment From: ArsenySamoylov
cc @prattmic
Comment From: prattmic
Nice find! Thanks for the investigation. It sounds like bgsweep was depending on the preempt check in the preamble of lock to preempt promptly.
Adding more explicit preemption checks seems reasonable to me, though I don't understand why the check every 10 sweepone calls isn't sufficient?
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
cc @mknyszek
Comment From: ArsenySamoylov
Adding more explicit preemption checks seems reasonable to me, though I don't understand why the check every 10 sweepone calls isn't sufficient?
for sweepone() != ^uintptr(0) {
	nSwept++
	if nSwept%sweepBatchSize == 0 {
		goschedIfBusy()
	}
}
The problem occurs when there are no spans to sweep, so we don't execute the body of the for loop. In my second solution, I mentioned that sweepDrainedMask is set; however, we have a registered sweeper, and this stops the bgsweep goroutine from parking (later on in the code).
So, here is what bgsweep's execution looks like:
for {
	const sweepBatchSize = 10
	nSwept := 0
	for sweepone() != ^uintptr(0) {
		// no work to do, so skipping this
	}
	for freeSomeWbufs(true) {
		// no work to do, so skipping this
	}
	lock(&sweep.lock)
	if !isSweepDone() {
		unlock(&sweep.lock)
		continue // repeat
	}
	// we don't reach parking here because we have an active sweeper in gcMarkTermination()
	sweep.parked = true
	goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceBlockGCSweep, 1)
}
Comment From: prattmic
Ok, I see, thanks.
Comment From: ArsenySamoylov
So, what do you think? Maybe I should open a new issue regarding this problem?
Comment From: JunyangShao
@prattmic