Go version

go version go1.24.4 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/usr/local/google/home/mpratt/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/usr/local/google/home/mpratt/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4227360042=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/tmp/memprofile_leak/go.mod'
GOMODCACHE='/usr/local/google/home/mpratt/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/usr/local/google/home/mpratt/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.24'
GOSUMDB='sum.golang.org'
GOTELEMETRY='on'
GOTELEMETRYDIR='/usr/local/google/home/mpratt/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.24/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.4'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

Reproducer: https://go.dev/play/p/uJ6XPZzNcEF (must be run locally; it requires real time)

Run a program with a very large memory profile and call runtime.MemProfile to collect the profile occasionally.
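
The linked reproducer has the details; as a rough illustration, a minimal sketch of the same shape (the stack-generation scheme and sizes here are assumptions, not the real reproducer):

// Not the linked reproducer; a minimal sketch of the same idea: allocate under
// many distinct call stacks so the memory profile grows, while periodically
// calling runtime.MemProfile and timing it.
package main

import (
    "fmt"
    "runtime"
    "time"
)

var sink [][]byte

// alloc picks one of two callees per level based on the bits of n, so each
// value of n allocates under a different call stack and therefore lands in a
// different profile bucket. The wrappers are noinline to keep the stacks distinct.
func alloc(n, depth int) {
    if depth == 0 {
        sink = append(sink, make([]byte, 16))
        return
    }
    if n&1 == 0 {
        allocEven(n>>1, depth-1)
    } else {
        allocOdd(n>>1, depth-1)
    }
}

//go:noinline
func allocEven(n, depth int) { alloc(n, depth) }

//go:noinline
func allocOdd(n, depth int) { alloc(n, depth) }

func main() {
    runtime.MemProfileRate = 1 // sample every allocation so the profile grows quickly

    go func() {
        for i := 0; ; i++ {
            alloc(i, 14) // up to 2^14 distinct stacks
            if len(sink) > 1<<20 {
                sink = nil // drop live memory; the profile buckets remain
            }
        }
    }()

    for {
        time.Sleep(time.Second)

        start := time.Now()
        // Documented two-call pattern: ask for the size, then retry until the
        // slice is large enough.
        n, _ := runtime.MemProfile(nil, true)
        p := make([]runtime.MemProfileRecord, n+50)
        for {
            var ok bool
            n, ok = runtime.MemProfile(p, true)
            if ok {
                break
            }
            p = make([]runtime.MemProfileRecord, n+50)
        }
        fmt.Printf("MemProfile: %d records, took %v\n", n, time.Since(start))
    }
}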

What did you see happen?

As the profile grows, calls to runtime.MemProfile take longer and longer. That isn't surprising. What is surprising is that STW times also go up a lot.

e.g., after ~60s, I see output like:

MemProfile: took 711.959904ms
GC STW stop time (ns):
                    |      0 [  -Inf,    0.0)
                    |      0 [   0.0,   64.0)
                    |      0 [  64.0,  128.0)
                    | ...
                    |      0 [ 384.0,  448.0)
                    |      0 [ 448.0,  512.0)
                   *|      2 [ 512.0,  640.0)
               *****|      6 [ 640.0,  768.0)
********************|     24 [ 768.0,  896.0)
      **************|     17 [ 896.0, 1024.0)
             *******|      9 [1024.0, 1280.0)
                 ***|      4 [1280.0, 1536.0)
                   *|      2 [1536.0, 1792.0)
                    |      0 [1792.0, 2048.0)
                    |      0 [2048.0, 2560.0)
                    | ...
                    |      0 [12288.0, 14336.0)
                    |      0 [14336.0, 16384.0)
                   *|      2 [16384.0, 20480.0)
                    |      0 [20480.0, 24576.0)
                 ***|      4 [24576.0, 28672.0)
                 ***|      4 [28672.0, 32768.0)
               *****|      7 [32768.0, 40960.0)
              ******|      8 [40960.0, 49152.0)
                    |      1 [49152.0, 57344.0)
                   *|      2 [57344.0, 65536.0)
                   *|      2 [65536.0, 81920.0)
                    |      0 [81920.0, 98304.0)
                    |      0 [98304.0, 114688.0)
                    | ...
                    |      0 [117440512.0, 134217728.0)
                    |      0 [134217728.0, 167772160.0)
                    |      1 [167772160.0, 201326592.0)
                    |      0 [201326592.0, 234881024.0)
                    |      0 [234881024.0, 268435456.0)
                    |      0 [268435456.0, 335544320.0)
                    |      1 [335544320.0, 402653184.0)
                    |      0 [402653184.0, 469762048.0)
                    |      0 [469762048.0, 536870912.0)
                    |      1 [536870912.0, 671088640.0)
                    |      0 [671088640.0, 805306368.0)
                    |      0 [805306368.0, 939524096.0)
                    | ...
                    |      0 [105553116266496.0, 123145302310912.0)
                    |      0 [123145302310912.0, 140737488355328.0)
                    |      0 [140737488355328.0,   +Inf)

This shows 1 sample with an STW stop time of ~537-671ms (the [536870912.0, 671088640.0) ns bucket).
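
For reference, the histogram above looks like a runtime/metrics time histogram; a minimal sketch of reading the GC stop-the-world stopping-latency distribution, assuming the /sched/pauses/stopping/gc:seconds metric (reported in seconds; scaled to ns here to match the output above):

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    samples := []metrics.Sample{{Name: "/sched/pauses/stopping/gc:seconds"}}
    metrics.Read(samples)

    if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
        panic("metric not supported by this Go version")
    }
    h := samples[0].Value.Float64Histogram()
    for i, count := range h.Counts {
        if count == 0 {
            continue
        }
        lo, hi := h.Buckets[i]*1e9, h.Buckets[i+1]*1e9
        fmt.Printf("%6d [%.1f, %.1f) ns\n", count, lo, hi)
    }
}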

I believe the problem is that runtime.memProfileInternal holds profMemActiveLock for its entire duration; holding a runtime-internal lock makes the goroutine non-preemptible, so a stop-the-world cannot stop it until the whole operation finishes.

What did you expect to see?

Calling runtime.MemProfile should not impact STW times.

My reproducer intentionally creates tons of unique stacks to make the profile very large. That has its own issues (#69590) with memory usage, but I don't believe it should impact STW times.

Comment From: prattmic

cc @golang/runtime @mknyszek

Comment From: prattmic

Note that runtime/pprof heap profile collection also uses runtime.memProfileInternal, so that can also trigger this problem.
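
For example, any of the usual heap-profile entry points end up on this path; a minimal sketch (the file name is illustrative). WriteHeapProfile is documented shorthand for pprof.Lookup("heap").WriteTo(f, 0), and the net/http/pprof /debug/pprof/heap endpoint goes through the same lookup, so all of these walk the memory profile the same way:

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("heap.pprof")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // With a huge memory profile this call becomes slow and, per this issue,
    // delays stop-the-world while it runs.
    if err := pprof.WriteHeapProfile(f); err != nil {
        panic(err)
    }
}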

Comment From: rhysh

I believe the problem is that runtime.memProfileInternal holds profMemActiveLock for its entire duration, which prevents preemption from stopping the goroutine.

Yes.

With go version go1.26-devel_4ab1aec007 Fri Aug 1 11:56:27 2025 -0700 darwin/arm64, I modified the reproducer to collect an execution trace and a CPU profile together, filtered to the StackSample Events that occur between STW RangeBegin and RangeEnd Events, and noted whether each falls before/during/after the time when only one P is running. The result was many samples in the final loop of memProfileInternal, including samples tens of milliseconds after the STW request, all before the count of running Ps was able to drop to 1. That loop holds profMemActiveLock, and it is the code running while the STW request is delayed:

https://github.com/golang/go/blob/4ab1aec00799f91e96182cbbffd1de405cd52e93/src/runtime/mprof.go#L995-L997

Removing use of that lock (by converting it into a semaphore and, unsafely, dropping its use from mProf_Flush, which holds other locks and so cannot enter the scheduler to acquire a semaphore) results in much faster STWs: the modified program ran for 48 seconds before crashing, during which the slowest STW took 1.4 ms and the median was 34µs.
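
For reference, the collection side of that methodology is just the standard APIs; a minimal sketch (file names are illustrative), relying on the fact that CPU profile samples taken while tracing is active are recorded in the trace as StackSample Events:

package main

import (
    "os"
    "runtime/pprof"
    "runtime/trace"
)

func main() {
    tf, err := os.Create("trace.out")
    if err != nil {
        panic(err)
    }
    defer tf.Close()

    cf, err := os.Create("cpu.pprof")
    if err != nil {
        panic(err)
    }
    defer cf.Close()

    // STW windows appear in the trace as RangeBegin/RangeEnd events; because
    // the CPU profile is running at the same time, its samples are also
    // recorded in the trace as StackSample events.
    if err := trace.Start(tf); err != nil {
        panic(err)
    }
    defer trace.Stop()
    if err := pprof.StartCPUProfile(cf); err != nil {
        panic(err)
    }
    defer pprof.StopCPUProfile()

    // ... run the reproducer workload here ...
}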