Golang runtime: MutexProfile has multiple records with equal stack

Go version

go1.23-f4494522dc

Output of `go env` in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/korniltsev/.cache/go-build'
GOENV='/home/korniltsev/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/korniltsev/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/korniltsev/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/korniltsev/github/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/korniltsev/github/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-f4494522dc Mon May 13 16:42:35 2024 +0000'
GODEBUG=''
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/korniltsev/github/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1551311124=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I run this program on gotip https://go.dev/play/p/YD8N3UVACna?v=gotip and I get different results on gotip and 1.22

I think the reason is before the change this line of code https://github.com/golang/go/blob/22344e11f27d9667e7bbb6209df59e9a9e976d91/src/runtime/mprof.go#L776 used stack from callers / gcallers which does skip logical frames and performs inline unwinding and looks like this

[
   0x48f931, //  sync.(*Mutex).Unlock
   0x48f918,  // main.main.func1
   0x46ba81,  // runtime.goexit
]

And after the change the same line uses stack from fpTracebackPCs which looks like this:

[
   0x4,      // skip value
   0x467405, // sync.event 
   0x44ad65, // runtime.semrelease1
   0x468533, // sync.runtime_Semrelease 
   0x476b5b, // sync.(*Mutex).unlockSlow << this function has a branch, so value here depends on Mutex state
   0x48f931, // main.main.func1
   0x46ba81, // runtime.goexit
]

Because of that, runtime.stkbucket allocates two bucket s for the same logical stack and therefore later we get two BlockProfileRecord with equal stack.

I am not exactly sure this is a bug and how severe it is. It looks like it may lead to memory usage increase.

What did you see happen?

I see multiple BlockProfileRecord with equal stack.

=========================
4905d1  sync.(*Mutex).Unlock
4905b8  main.main.func1
46b241  runtime.goexit
=========================
4905d1  sync.(*Mutex).Unlock
4905b8  main.main.func1
46b241  runtime.goexit

What did you expect to see?

If I run on 1.22 I get only one BlockProfileRecord

=========================
482751  sync.(*Mutex).Unlock
482738  main.main.func1
4622c1  runtime.goexit

Comment From: korniltsev

cc @nsrip-dd

Comment From: nsrip-dd

Thanks for raising this, I hadn't considered that when sending the CL. Technically these are different stacks, as you observe, it's just that the choice of how much to skip makes them look the same. There is a similar issue with the heap profile, where map internals are elided from the end of call stacks and we can get multiple entries with the same stack in the pprof.

A few things come to mind for how to address this: * We could deduplicate/merge the records before returning them. * We could handle skipping/inlining at sample time again. This would somewhat defeat the purpose of switching to frame pointer unwinding, where we ideally would keep the recording overhead low and defer the more expensive stack handling logic to reporting time, after we've already aggregated the data. * We could skip less, maybe even just up to mutexevent where the sample is recorded, and these would no longer be duplicates. That might warrant a proposal, though.

Comment From: korniltsev

I'll just braindump some more ideas: - We can change (m *Mutex) unlockSlow implementation to call runtime_Semrelease only once, unconditionally as the last expression before exit. This way return address on the stack will be the same for both starving and non starving mutexes. Not sure about other cases (rwmutex, waitgroup) - We can mark (m *Mutex) unlockSlow as noinline and teach saveblockevent to skip until specific function instead of number frames

Comment From: rsc

The discussion in https://github.com/grafana/pyroscope-go/issues/103 makes this look like a Go 1.23 regression to me.

Comment From: felixge

@rsc ack. We'll try to submit a patch that restores the old behavior by deduplicating the returned records. This will make runtime.MutexProfile() a little more expensive, but not cause any user code to potentially break. @nsrip-dd will work on this.

Comment From: gopherbot

Change https://go.dev/cl/595966 mentions this issue: runtime: aggregate records with identical stacks in MutexProfile

Comment From: thanm

Any updates here?

Comment From: nsrip-dd

Yes, I've sent https://go-review.googlesource.com/c/go/+/595966 implementing my first suggested fix, to merge records with the same call stack before returning them from MutexProfile. The CL needs review, though it looks like it hasn't been automatically assigned reviewers as of this writing.

Comment From: cherrymui

Thanks @nsrip-dd ! I can review the CL.

Comment From: rhysh

I gave it a non-maintainer pass; I'll leave it to @cherrymui .