#!watchflakes
default <- pkg == "internal/trace" && test ~ `TestTraceGCStress/.*` && log ~ `out of memory`

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestTraceGCStress/AllocFree
    trace_test.go:614: stderr: runtime: out of memory: cannot allocate 4194304-byte block (4209803264 in use)
        fatal error: out of memory

        goroutine 55 gp=0xa489d48 m=11 mp=0x117d4808 [running]:
        runtime.throw({0x8113430, 0xd})
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1094 +0x35 fp=0xbc50f18 sp=0xbc50f04 pc=0x80bc8a5
        runtime.(*mcache).refill(0xf7f62848, 0x59)
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mcache.go:186 +0x234 fp=0xbc50f44 sp=0xbc50f18 pc=0x8061894
        runtime.(*mcache).nextFree(0xf7f62848, 0x59)
...
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace.go:781 +0x54 fp=0xa82bf9c sp=0xa82bf78 pc=0x80c0004
        runtime/trace.(*traceMultiplexer).startLocked.func1()
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace/subscribe.go:167 +0xb2 fp=0xa82bff0 sp=0xa82bf9c pc=0x80f1f72
        runtime.goexit({})
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/asm_386.s:1386 +0x1 fp=0xa82bff4 sp=0xa82bff0 pc=0x80c1ba1
        created by runtime/trace.(*traceMultiplexer).startLocked in goroutine 1
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace/subscribe.go:157 +0x10d
        exit status 2
    trace_test.go:616: exit status 1
--- FAIL: TestTraceGCStress/AllocFree (3.26s)

watchflakes

Comment From: gopherbot

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace" && test == "TestTraceGCStress/AllocFree"
2025-06-06 19:45 go1.25-linux-386-longtest release-branch.go1.25@5abb1d84 internal/trace.TestTraceGCStress/AllocFree (log)

=== RUN   TestTraceGCStress/AllocFree
    trace_test.go:614: stderr: runtime: out of memory: cannot allocate 4194304-byte block (4209803264 in use)
        fatal error: out of memory

        goroutine 55 gp=0xa489d48 m=11 mp=0x117d4808 [running]:
        runtime.throw({0x8113430, 0xd})
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1094 +0x35 fp=0xbc50f18 sp=0xbc50f04 pc=0x80bc8a5
        runtime.(*mcache).refill(0xf7f62848, 0x59)
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mcache.go:186 +0x234 fp=0xbc50f44 sp=0xbc50f18 pc=0x8061894
        runtime.(*mcache).nextFree(0xf7f62848, 0x59)
...
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace.go:781 +0x54 fp=0xa82bf9c sp=0xa82bf78 pc=0x80c0004
        runtime/trace.(*traceMultiplexer).startLocked.func1()
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace/subscribe.go:167 +0xb2 fp=0xa82bff0 sp=0xa82bf9c pc=0x80f1f72
        runtime.goexit({})
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/asm_386.s:1386 +0x1 fp=0xa82bff4 sp=0xa82bff0 pc=0x80c1ba1
        created by runtime/trace.(*traceMultiplexer).startLocked in goroutine 1
            /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/trace/subscribe.go:157 +0x10d
        exit status 2
    trace_test.go:616: exit status 1
--- FAIL: TestTraceGCStress/AllocFree (3.26s)

watchflakes

Comment From: gopherbot

Change https://go.dev/cl/683515 mentions this issue: internal/trace: improve gc-stress test

Comment From: mknyszek

As part of investigating this, I did a deep dive into the root cause of the OOMs. This issue is now closed because the test allocates less aggressively, but that's not a complete fix.

This problem with GC pacing (e.g. #37331) has existed since long before the last pacer overhaul (in Go 1.18). It doesn't really ever seem to affect real programs, but it does make it really easy to write flaky tests. (I don't think I've ever seen a bug report about this for a real program, in all this time.)

The root of the problem appears to be this: at the beginning of the mark phase, things are good and we have a tiny scannable heap. From the pacer's perspective, we're confident that the dedicated workers will handle it. GC credit is abundant, because we've got so much runway compared to the size of the heap we expect to scan. GC credit is so abundant, in fact, that when a goroutine goes to assist, it can steal some background work credit, and the conversion rate is very favorable (60:1, heap:scan), netting it a ton of allocation credit. Later in the GC cycle, if for some reason the background workers lag behind, all the goroutines that are just allocating in a loop have an extraordinary amount of combined runway before they end up assisting.

In this way, any test that just allocates in a loop can end up completely blowing past the heap goal.
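To make that concrete, here's a minimal sketch of the allocate-in-a-loop pattern being described. This is illustrative only, not the actual gc-stress test: the goroutine count, allocation size, and run duration are made-up numbers, and the real test is driven through the internal/trace test harness rather than a standalone main.

```go
// Illustrative sketch of the allocate-in-a-loop stress pattern discussed
// above. Not the real test; all sizes, counts, and durations are made up.
package main

import (
	"io"
	"runtime"
	"runtime/trace"
	"sync"
	"time"
)

func main() {
	// The real test runs under the execution tracer; start it here too,
	// discarding the output.
	if err := trace.Start(io.Discard); err != nil {
		panic(err)
	}
	defer trace.Stop()

	const (
		workers   = 8       // hypothetical number of allocating goroutines
		blockSize = 1 << 20 // hypothetical allocation size (1 MiB)
	)

	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var buf []byte
			for {
				select {
				case <-stop:
					runtime.KeepAlive(buf)
					return
				default:
				}
				// Each allocation draws down this goroutine's assist credit.
				// Early in the mark phase the pacer hands out allocation
				// credit very cheaply (the 60:1 heap:scan conversion
				// described above), so many iterations can run before any
				// assist kicks in. If the background mark workers lag in
				// the meantime, the live heap can sail past the heap goal.
				buf = make([]byte, blockSize)
			}
		}()
	}

	time.Sleep(100 * time.Millisecond) // let the loops span a few GC cycles
	close(stop)
	wg.Wait()
}
```

With the favorable conversion rate above, stealing even a modest amount of background scan credit (say, 64 KiB of scan work at 60:1, again just illustrative numbers) nets a goroutine a few megabytes of allocation credit, so loops like this can run a long way before any of them assists.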

#73136 appears to be another symptom, but only on the new darwin/arm64 longtest builders. It's probably just something slightly different about scheduling that's causing this to happen there. If the mark worker threads can get a foothold with the OS, then everything is usually fine: it's hard for allocating goroutines to straight-up outpace mark workers blasting through a tiny heap.