Go version
go version go1.24.4 linux/amd64
Output of go env in your module/workspace:
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/usr/local/google/home/mpratt/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/usr/local/google/home/mpratt/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4227360042=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/tmp/memprofile_leak/go.mod'
GOMODCACHE='/usr/local/google/home/mpratt/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/usr/local/google/home/mpratt/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.24'
GOSUMDB='sum.golang.org'
GOTELEMETRY='on'
GOTELEMETRYDIR='/usr/local/google/home/mpratt/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.24/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.4'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
Reproducer: https://go.dev/play/p/uJ6XPZzNcEF (must be run locally; it requires real time)
Run a program with a very large memory profile and call runtime.MemProfile to collect the profile occasionally.
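The linked reproducer is the authoritative program; as a rough illustration of its shape only (the recursion scheme, constants, and output format below are my own, not taken from the playground link), a program like this grows the profile by allocating behind many distinct call stacks and periodically collects it:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func alloc() []byte { return make([]byte, 64) }

// a and b recurse into each other based on the bits of n, so each
// distinct n (up to 1<<depth) reaches the allocation through a
// distinct call stack, growing the memory profile's bucket count.
func a(n, depth int) []byte {
	if depth == 0 {
		return alloc()
	}
	if n&1 == 0 {
		return a(n>>1, depth-1)
	}
	return b(n>>1, depth-1)
}

func b(n, depth int) []byte {
	if depth == 0 {
		return alloc()
	}
	if n&1 == 0 {
		return a(n>>1, depth-1)
	}
	return b(n>>1, depth-1)
}

var sink [][]byte

func main() {
	runtime.MemProfileRate = 1 // sample every allocation (illustrative, not the default)

	for i := 0; ; i++ {
		sink = append(sink, a(i, 20))

		if i%100000 == 0 {
			// Two-pass collection as documented for runtime.MemProfile:
			// size the record slice first, then fill it.
			n, _ := runtime.MemProfile(nil, true)
			records := make([]runtime.MemProfileRecord, n+50)
			start := time.Now()
			n, ok := runtime.MemProfile(records, true)
			fmt.Printf("MemProfile: n=%d ok=%v took %v\n", n, ok, time.Since(start))
		}
	}
}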
What did you see happen?
As the profile grows in size, calls to runtime.MemProfile take longer and longer. This isn't surprising, as the profile is growing. What is surprising is that STW times go up a lot.
e.g., after ~60s, I see output like:
MemProfile: took 711.959904ms
GC STW stop time (ns):
| 0 [ -Inf, 0.0)
| 0 [ 0.0, 64.0)
| 0 [ 64.0, 128.0)
| ...
| 0 [ 384.0, 448.0)
| 0 [ 448.0, 512.0)
*| 2 [ 512.0, 640.0)
*****| 6 [ 640.0, 768.0)
********************| 24 [ 768.0, 896.0)
**************| 17 [ 896.0, 1024.0)
*******| 9 [1024.0, 1280.0)
***| 4 [1280.0, 1536.0)
*| 2 [1536.0, 1792.0)
| 0 [1792.0, 2048.0)
| 0 [2048.0, 2560.0)
| ...
| 0 [12288.0, 14336.0)
| 0 [14336.0, 16384.0)
*| 2 [16384.0, 20480.0)
| 0 [20480.0, 24576.0)
***| 4 [24576.0, 28672.0)
***| 4 [28672.0, 32768.0)
*****| 7 [32768.0, 40960.0)
******| 8 [40960.0, 49152.0)
| 1 [49152.0, 57344.0)
*| 2 [57344.0, 65536.0)
*| 2 [65536.0, 81920.0)
| 0 [81920.0, 98304.0)
| 0 [98304.0, 114688.0)
| ...
| 0 [117440512.0, 134217728.0)
| 0 [134217728.0, 167772160.0)
| 1 [167772160.0, 201326592.0)
| 0 [201326592.0, 234881024.0)
| 0 [234881024.0, 268435456.0)
| 0 [268435456.0, 335544320.0)
| 1 [335544320.0, 402653184.0)
| 0 [402653184.0, 469762048.0)
| 0 [469762048.0, 536870912.0)
| 1 [536870912.0, 671088640.0)
| 0 [671088640.0, 805306368.0)
| 0 [805306368.0, 939524096.0)
| ...
| 0 [105553116266496.0, 123145302310912.0)
| 0 [123145302310912.0, 140737488355328.0)
| 0 [140737488355328.0, +Inf)
This shows 1 sample of a STW stop time of ~530-670ms.
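I believe the histogram above corresponds to the /sched/pauses/stopping/gc:seconds runtime/metrics histogram (available since Go 1.22); a minimal sketch of reading it directly, independent of the reproducer's own output formatting:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Sample the distribution of GC stop-the-world "stopping" latencies.
	const name = "/sched/pauses/stopping/gc:seconds"
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)

	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("metric unavailable")
		return
	}
	h := samples[0].Value.Float64Histogram()
	// Buckets has one more element than Counts; bucket i covers
	// [Buckets[i], Buckets[i+1]).
	for i, c := range h.Counts {
		if c == 0 {
			continue
		}
		fmt.Printf("%d\t[%g, %g) s\n", c, h.Buckets[i], h.Buckets[i+1])
	}
}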
I believe the problem is that runtime.memProfileInternal holds profMemActiveLock for its entire duration, which prevents preemption from stopping the goroutine.
What did you expect to see?
Calling runtime.MemProfile should not impact STW times.
My reproducer intentionally creates tons of unique stacks to make the profile very large. That has its own issues (#69590) with memory usage, but I don't believe it should impact STW times.
Comment From: prattmic
cc @golang/runtime @mknyszek
Comment From: prattmic
Note that runtime/pprof heap profile collection also uses runtime.memProfileInternal, so that can also trigger this problem.
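For example, a heap profile written like this goes through the same internal path, so it can delay STWs in the same way once the profile is large (a sketch, not taken from the reproducer):

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// The pprof "heap" profile is built from runtime.MemProfile data,
	// so writing it also runs through runtime.memProfileInternal.
	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		panic(err)
	}
}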
Comment From: rhysh
I believe the problem is that runtime.memProfileInternal holds profMemActiveLock for its entire duration, which prevents preemption from stopping the goroutine.
Yes.
With go version go1.26-devel_4ab1aec007 Fri Aug 1 11:56:27 2025 -0700 darwin/arm64, I modified the reproducer to collect an execution trace plus a CPU profile, filtered to StackSample Events that occur between STW RangeBegin and RangeEnd Events, and noted whether each sample fell before/during/after the window when only one P is running. The result was many samples in the final loop of memProfileInternal, including samples tens of milliseconds after the STW request, all before the count of running Ps was able to decrease to 1. The goroutine holds profMemActiveLock at that time. This is the code that is running and delaying the STW requests:
https://github.com/golang/go/blob/4ab1aec00799f91e96182cbbffd1de405cd52e93/src/runtime/mprof.go#L995-L997
Removing use of that lock (by converting it into a semaphore and unsafely removing its use from mProf_Flush, which holds locks and so cannot enter the scheduler) results in much faster STWs: it lasted 48 seconds before crashing, during which the slowest STW took 1.4 ms. (The median STW was 34µs.)
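The collection step described above (execution trace plus CPU profile, so CPU samples appear as events inside the trace) can be approximated with the standard library; a sketch wrapping a workload, with arbitrary file names:

package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
)

// withTraceAndProfile runs fn while recording both an execution trace
// and a CPU profile. With both active, CPU samples are also emitted as
// trace events and can later be correlated with STW ranges.
func withTraceAndProfile(tracePath, profPath string, fn func()) error {
	tf, err := os.Create(tracePath)
	if err != nil {
		return err
	}
	defer tf.Close()
	pf, err := os.Create(profPath)
	if err != nil {
		return err
	}
	defer pf.Close()

	if err := trace.Start(tf); err != nil {
		return err
	}
	defer trace.Stop()
	if err := pprof.StartCPUProfile(pf); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()

	fn()
	return nil
}

func main() {
	_ = withTraceAndProfile("app.trace", "app.cpu.pprof", func() {
		// ... run the reproducer workload here ...
	})
}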
Comment From: rhysh
I updated the STW-describing tool to write out pprof-format profiles of the CPU profile samples it finds during STW ranges. The code is at https://github.com/rhysh/go-tracing-toolbox/tree/main/cmd/unstoppable but not yet on pkg.go.dev (I'm not sure how soon to expect updates to show up there). I hope it'll be helpful when debugging (and discovering!) slow STWs.
Here it shows 55 of the 59 CPU profile samples that arrived during STW ranges, before the number of running Ps decreased to 1 ("pre-stop"), appearing within runtime.memProfileInternal.
$ go run github.com/rhysh/go-tracing-toolbox/cmd/unstoppable -input /tmp/74872.trace -profile /tmp/74872.prof -print-stw=false
$ go tool pprof -tagfocus=stw-phase=pre-stop -list=memProfileInternal /tmp/74872.prof
Total: 59
ROUTINE ======================== runtime.memProfileInternal in /usr/local/go/src/runtime/mprof.go
27 55 (flat, cum) 93.22% of Total
. . 956: // of flushing to the active profile so we only have to look
. . 957: // at the active profile below.
. . 958: index := cycle % uint32(len(memRecord{}.future))
. . 959: lock(&profMemActiveLock)
. . 960: lock(&profMemFutureLock[index])
. 3 961: mProf_FlushLocked(index)
. . 962: unlock(&profMemFutureLock[index])
. . 963: clear := true
. . 964: head := (*bucket)(mbuckets.Load())
. . 965: for b := head; b != nil; b = b.allnext {
. 5 966: mp := b.mp()
2 2 967: if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
. . 968: n++
. . 969: }
1 1 970: if mp.active.allocs != 0 || mp.active.frees != 0 {
. . 971: clear = false
. . 972: }
. . 973: }
. . 974: if clear {
. . 975: // Absolutely no data, suggesting that a garbage collection
. . 976: // has not yet happened. In order to allow profiling when
. . 977: // garbage collection is disabled from the beginning of execution,
. . 978: // accumulate all of the cycles, and recount buckets.
. . 979: n = 0
1 1 980: for b := head; b != nil; b = b.allnext {
. . 981: mp := b.mp()
. . 982: for c := range mp.future {
. . 983: lock(&profMemFutureLock[c])
. . 984: mp.active.add(&mp.future[c])
. . 985: mp.future[c] = memRecordCycle{}
. . 986: unlock(&profMemFutureLock[c])
. . 987: }
. . 988: if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
. . 989: n++
. . 990: }
. . 991: }
. . 992: }
. . 993: if n <= size {
. . 994: ok = true
20 20 995: for b := head; b != nil; b = b.allnext {
. 20 996: mp := b.mp()
3 3 997: if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
. . 998: r := profilerecord.MemProfileRecord{
. . 999: AllocBytes: int64(mp.active.alloc_bytes),
. . 1000: FreeBytes: int64(mp.active.free_bytes),
. . 1001: AllocObjects: int64(mp.active.allocs),
. . 1002: FreeObjects: int64(mp.active.frees),
Comment From: prattmic
Thanks for looking into this! I love the name "unstoppable" for that tool!
Replacing the lock with a semaphore was my first thought as well until I realized the uses outside of memProfileInternal aren't safe for semaphores :(.
I think we'll just need to step back and rethink the locking here to make this work. I haven't spent enough time looking at this closely to have concrete ideas yet. I know @mknyszek wants to completely rip out and replace the entire profile bucket implementation, though I imagine we can do something more targeted for this specific issue.
Comment From: aktau
As another mitigation: I recall, @prattmic, that you mentioned some of this profiling code calls CallersFrames and spends quite some time calculating information that isn't required. Could we already stop doing that, to at least reduce the magnitude of these pauses?
I updated the STW-describing tool to write out pprof-format profiles of the CPU profile samples it finds during STW ranges. The code is at https://github.com/rhysh/go-tracing-toolbox/tree/main/cmd/unstoppable but not yet on pkg.go.dev (I'm not sure how soon to expect updates to show up there). I hope it'll be helpful when debugging (and discovering!) slow STWs.
That's a really cool idea. Though for some reason, the runtime trace where we noticed this did not have any CPU samples (even though I think it should have).
Related to your tool: something we also noticed is that, for this process, runtime.ReadMemStats provoked STW times of 1.5ms-5ms, which seems like way too much for what it actually does. If this can be reproduced [1], it may be good to run your tool on it.
This bug causes STW pauses on the order of 200ms, so those are the main focus, but the runtime.ReadMemStats one is also too much (IMO).
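The STW caused by a single runtime.ReadMemStats call can be observed from inside the process by diffing the non-GC "stopping" pause histogram around the call; a sketch, assuming the /sched/pauses/stopping/other:seconds metric (Go 1.22+) covers this STW and that no other non-GC STW lands in the same window:

package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// stwOther returns the current histogram of non-GC stop-the-world
// "stopping" latencies, so a before/after diff can isolate the STW
// caused by a single call such as runtime.ReadMemStats.
func stwOther() *metrics.Float64Histogram {
	s := []metrics.Sample{{Name: "/sched/pauses/stopping/other:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil
	}
	return s[0].Value.Float64Histogram()
}

func main() {
	before := stwOther()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // stops the world internally
	after := stwOther()

	if before == nil || after == nil {
		fmt.Println("metric unavailable")
		return
	}
	// Print the bucket(s) where new samples landed during the call.
	for i := range after.Counts {
		if d := after.Counts[i] - before.Counts[i]; d > 0 {
			fmt.Printf("%d new STW(s) in [%g, %g) s\n", d, after.Buckets[i], after.Buckets[i+1])
		}
	}
}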
Comment From: rhysh
Though for some reason, the runtime trace where we noticed this did not have any CPU samples (even though I think it should have).
You have to get lucky, for the OS's profiling timer to line up with a STW, and for the thread's 10ms-valued sample timer to roll over during a sub-millisecond span of STW-related work. Did you confirm that the execution trace has any CPUSample events at all, before filtering with unstoppable? Such as with go tool trace -d=footprint ./trace | grep CPUSample.
If this can be reproduced
There's less luck necessary if you have a giant pile of execution traces; I'll take this opportunity to plug the idea of gathering them continuously from the production environment, and https://pkg.go.dev/github.com/rhysh/autoprof as a way to do that. A profile describing the exact bad behavior can go a long way towards writing a minimal reproducer.
the runtime.ReadMemStats one is also too much (IMO).
I don't doubt it! Probably worth a new issue ... #63423 (which you've seen) is related, but doesn't quite describe that.
Comment From: aktau
You have to get lucky,
Such as with go tool trace -d=footprint ./trace | grep CPUSample.
I had done exactly that. There are no CPU samples in this entire trace. Normally, CPU sampling is always enabled for us while tracing. I'll just have to look into what went wrong at some point.
The process was consuming on average 25 CPU s/s when this happened.
There's less luck necessary if you have a giant pile of execution traces;
We collect profiling data automatically, but not traces (yet, at least; regrettably, we won't get time for that soon).
I don't doubt it! Probably worth a new issue ... https://github.com/golang/go/issues/63423 (which you've seen) is related, but doesn't quite describe that.
Indeed, I now remember that issue (we don't use expvar, so it hasn't affected us in this way, but we use ReadMemStats for other purposes).
At any rate, a "fear" I have is that this is a STW where most of the time is spent waiting rather than processing, meaning CPU samples won't be forthcoming and we'll need different techniques to debug this.
[1] The only thing I can tell is dynamic in computation cost is the "flush Ms" call. The process in production has GOMAXPROCS=30, #goroutines=30000, #Ms=150 (OS threads), and 10000 cgo calls/second. Maybe runtime.ReadMemStats could be provoked into a slow STW by ratcheting up the number of Ms, or the number of Ms in a cgo call. But I haven't tried to reproduce.
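A hypothetical, untested sketch of the experiment footnote [1] suggests: drive up the M count by pinning goroutines to their own OS threads, then time runtime.ReadMemStats (the thread count and timings below are made up, not from the production process):

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Create many Ms: each goroutine that calls LockOSThread and then
	// blocks keeps its OS thread parked and dedicated to it.
	block := make(chan struct{})
	for i := 0; i < 150; i++ {
		go func() {
			runtime.LockOSThread()
			<-block
		}()
	}
	time.Sleep(time.Second) // let the threads come up

	var ms runtime.MemStats
	for i := 0; i < 10; i++ {
		start := time.Now()
		runtime.ReadMemStats(&ms) // stops the world internally
		fmt.Printf("ReadMemStats took %v\n", time.Since(start))
		time.Sleep(time.Second)
	}
	close(block)
}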