Go version
go version go1.24.4 linux/arm64
Output of go env
in your module/workspace:
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='arm64'
GOARM64='v8.0'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1130418166=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go1.24'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go1.24/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.24.4'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
I conducted benchmarks with the code below on Go 1.24, comparing the default build against a build with GOEXPERIMENT=nospinbitmutex, on an AWS EC2 c7g.4xlarge instance. Note that although the instance has 16 vCPUs, I tested with -test.cpu values ranging from 1 to 32.
package main

import (
	"sync"
	"testing"
)

// Run with: go test -test.run='^$' -test.bench=Part2LockContentionUnpredictable -test.cpu="$(seq 1 32 | tr '\n' ',')" -test.count=10
func BenchmarkPart2LockContentionUnpredictable(b *testing.B) {
	const requestsPerGoroutine = 10
	var globalMutex sync.Mutex
	var sharedCounter = 0
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < requestsPerGoroutine; i++ {
				respChan := make(chan int, 1)
				go processPart2LockContentionUnpredictable(i, respChan, &globalMutex, &sharedCounter)
				<-respChan
			}
		}
	})
}

func processPart2LockContentionUnpredictable(reqID int, respChan chan int, globalMutex *sync.Mutex, sharedCounter *int) {
	results := make(chan int, 16)
	for i := 0; i < 16; i++ {
		go func(id int) {
			globalMutex.Lock()
			*sharedCounter++
			// Some workload, sized unpredictably per goroutine.
			baseWork := 50
			variance := (id*13 + reqID*7) % 199951
			work := baseWork + variance
			sum := id + reqID
			for j := 0; j < work; j++ {
				sum += j
			}
			globalMutex.Unlock()
			results <- sum
		}(i)
	}
	total := 0
	for i := 0; i < 16; i++ {
		total += <-results
	}
	respChan <- total
}
What did you see happen?
Benchstat 1.24 Run 1: Performance comparison between Go 1.24 default configuration and Go 1.24 with GOEXPERIMENT=nospinbitmutex
/root/go/bin/benchstat BenchmarkPart2LockContentionUnpredictable_go1.24_124_nospinbit.txt BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkPart2LockContentionUnpredictable_go1.24_124_nospinbit.txt │ BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt │
│ sec/op │ sec/op vs base │
Part2LockContentionUnpredictable 115.5µ ± 0% 115.8µ ± 0% ~ (p=0.218 n=10)
Part2LockContentionUnpredictable-2 92.59µ ± 2% 85.16µ ± 1% -8.03% (p=0.000 n=10)
Part2LockContentionUnpredictable-3 70.09µ ± 1% 65.15µ ± 1% -7.04% (p=0.000 n=10)
Part2LockContentionUnpredictable-4 60.07µ ± 1% 58.49µ ± 1% -2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-5 55.98µ ± 1% 53.59µ ± 3% -4.27% (p=0.000 n=10)
Part2LockContentionUnpredictable-6 52.18µ ± 0% 55.11µ ± 1% +5.61% (p=0.002 n=10)
Part2LockContentionUnpredictable-7 52.25µ ± 1% 53.62µ ± 11% +2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-8 54.72µ ± 0% 63.78µ ± 11% +16.56% (p=0.000 n=10)
Part2LockContentionUnpredictable-9 58.87µ ± 1% 64.30µ ± 6% +9.22% (p=0.000 n=10)
Part2LockContentionUnpredictable-10 63.40µ ± 1% 67.22µ ± 5% +6.03% (p=0.000 n=10)
Part2LockContentionUnpredictable-11 65.92µ ± 1% 72.38µ ± 8% +9.81% (p=0.001 n=10)
Part2LockContentionUnpredictable-12 67.27µ ± 1% 73.67µ ± 8% +9.51% (p=0.003 n=10)
Part2LockContentionUnpredictable-13 67.89µ ± 0% 74.62µ ± 8% +9.91% (p=0.000 n=10)
Part2LockContentionUnpredictable-14 68.32µ ± 0% 75.33µ ± 9% +10.26% (p=0.000 n=10)
Part2LockContentionUnpredictable-15 68.50µ ± 0% 75.92µ ± 9% +10.84% (p=0.000 n=10)
Part2LockContentionUnpredictable-16 68.90µ ± 1% 70.17µ ± 9% +1.84% (p=0.000 n=10)
Part2LockContentionUnpredictable-17 68.64µ ± 0% 70.44µ ± 9% +2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-18 68.80µ ± 0% 70.34µ ± 9% +2.24% (p=0.000 n=10)
Part2LockContentionUnpredictable-19 68.69µ ± 0% 77.28µ ± 9% +12.51% (p=0.000 n=10)
Part2LockContentionUnpredictable-20 68.88µ ± 0% 76.92µ ± 8% +11.67% (p=0.000 n=10)
Part2LockContentionUnpredictable-21 68.89µ ± 0% 71.08µ ± 9% +3.18% (p=0.000 n=10)
Part2LockContentionUnpredictable-22 68.89µ ± 0% 71.78µ ± 9% +4.19% (p=0.000 n=10)
Part2LockContentionUnpredictable-23 69.11µ ± 1% 78.10µ ± 8% +13.01% (p=0.000 n=10)
Part2LockContentionUnpredictable-24 69.27µ ± 1% 71.80µ ± 10% +3.66% (p=0.000 n=10)
Part2LockContentionUnpredictable-25 69.06µ ± 1% 78.62µ ± 9% +13.83% (p=0.000 n=10)
Part2LockContentionUnpredictable-26 69.33µ ± 1% 79.37µ ± 9% +14.47% (p=0.000 n=10)
Part2LockContentionUnpredictable-27 69.48µ ± 0% 72.49µ ± 10% +4.32% (p=0.000 n=10)
Part2LockContentionUnpredictable-28 69.51µ ± 0% 73.00µ ± 9% +5.02% (p=0.000 n=10)
Part2LockContentionUnpredictable-29 69.52µ ± 0% 72.85µ ± 10% +4.78% (p=0.000 n=10)
Part2LockContentionUnpredictable-30 69.74µ ± 0% 72.81µ ± 11% +4.40% (p=0.000 n=10)
Part2LockContentionUnpredictable-31 69.90µ ± 0% 73.13µ ± 11% +4.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-32 69.88µ ± 0% 80.91µ ± 10% +15.79% (p=0.000 n=10)
geomean 67.70µ 71.61µ +5.78%
Benchstat Run 2: Performance comparison between Go 1.23 and the Go 1.24 default configuration
/root/go/bin/benchstat BenchmarkPart2LockContentionUnpredictable_go1.23_123.txt BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkPart2LockContentionUnpredictable_go1.23_123.txt │ BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt │
│ sec/op │ sec/op vs base │
Part2LockContentionUnpredictable 117.9µ ± 0% 115.8µ ± 0% -1.70% (p=0.000 n=10)
Part2LockContentionUnpredictable-2 89.26µ ± 1% 89.86µ ± 2% ~ (p=0.280 n=10)
Part2LockContentionUnpredictable-3 67.46µ ± 2% 67.01µ ± 1% ~ (p=0.579 n=10)
Part2LockContentionUnpredictable-4 60.64µ ± 1% 59.93µ ± 1% -1.18% (p=0.000 n=10)
Part2LockContentionUnpredictable-5 55.17µ ± 1% 54.48µ ± 1% -1.24% (p=0.001 n=10)
Part2LockContentionUnpredictable-6 52.51µ ± 0% 52.49µ ± 1% ~ (p=0.739 n=10)
Part2LockContentionUnpredictable-7 52.78µ ± 0% 54.53µ ± 3% +3.33% (p=0.001 n=10)
Part2LockContentionUnpredictable-8 55.29µ ± 0% 58.52µ ± 5% +5.85% (p=0.000 n=10)
Part2LockContentionUnpredictable-9 59.62µ ± 1% 61.38µ ± 4% ~ (p=0.700 n=10)
Part2LockContentionUnpredictable-10 64.47µ ± 1% 66.37µ ± 6% ~ (p=0.481 n=10)
Part2LockContentionUnpredictable-11 67.47µ ± 1% 64.39µ ± 1% -4.56% (p=0.002 n=10)
Part2LockContentionUnpredictable-12 68.70µ ± 1% 69.77µ ± 6% ~ (p=0.481 n=10)
Part2LockContentionUnpredictable-13 69.62µ ± 0% 70.46µ ± 5% ~ (p=0.137 n=10)
Part2LockContentionUnpredictable-14 69.55µ ± 0% 67.20µ ± 6% ~ (p=0.143 n=10)
Part2LockContentionUnpredictable-15 70.09µ ± 0% 69.79µ ± 4% ~ (p=1.000 n=10)
Part2LockContentionUnpredictable-16 70.23µ ± 1% 72.21µ ± 5% +2.82% (p=0.023 n=10)
Part2LockContentionUnpredictable-17 69.91µ ± 1% 68.81µ ± 5% ~ (p=0.143 n=10)
Part2LockContentionUnpredictable-18 69.78µ ± 0% 72.72µ ± 5% +4.21% (p=0.023 n=10)
Part2LockContentionUnpredictable-19 69.87µ ± 1% 71.34µ ± 3% ~ (p=0.436 n=10)
Part2LockContentionUnpredictable-20 70.07µ ± 1% 73.25µ ± 5% ~ (p=0.105 n=10)
Part2LockContentionUnpredictable-21 69.81µ ± 1% 73.55µ ± 5% +5.36% (p=0.001 n=10)
Part2LockContentionUnpredictable-22 70.32µ ± 0% 70.29µ ± 1% ~ (p=0.684 n=10)
Part2LockContentionUnpredictable-23 69.80µ ± 0% 74.03µ ± 5% +6.06% (p=0.000 n=10)
Part2LockContentionUnpredictable-24 70.10µ ± 0% 74.16µ ± 5% +5.79% (p=0.001 n=10)
Part2LockContentionUnpredictable-25 70.08µ ± 0% 72.49µ ± 3% +3.44% (p=0.000 n=10)
Part2LockContentionUnpredictable-26 70.22µ ± 1% 72.77µ ± 3% +3.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-27 70.28µ ± 1% 73.28µ ± 3% +4.27% (p=0.000 n=10)
Part2LockContentionUnpredictable-28 70.50µ ± 0% 71.38µ ± 5% +1.25% (p=0.002 n=10)
Part2LockContentionUnpredictable-29 70.59µ ± 1% 75.34µ ± 5% +6.73% (p=0.000 n=10)
Part2LockContentionUnpredictable-30 70.93µ ± 1% 71.79µ ± 5% +1.22% (p=0.000 n=10)
Part2LockContentionUnpredictable-31 70.78µ ± 0% 73.85µ ± 3% +4.35% (p=0.000 n=10)
Part2LockContentionUnpredictable-32 70.85µ ± 0% 75.69µ ± 4% +6.82% (p=0.000 n=10)
geomean 68.46µ 69.85µ +2.03%
The spinbit results show high variance across runs, indicating unpredictable performance.
What did you expect to see?
I expected similar performance between spinbitmutex (new in Go 1.24) and nospinbitmutex (the behavior before Go 1.24, available in Go 1.24 via GOEXPERIMENT=nospinbitmutex, and removed in Go 1.25). However, spinbit exhibits higher variance and unpredictable latency patterns.
I also reproduced BenchmarkChanContended on my arm64 test environment, which yielded results comparable to the amd64 numbers reported previously. This validates the improvement shown in that particular benchmark case.
/root/go/bin/benchstat BenchmarkChanContended_go1.24_124_nospinbit.txt BenchmarkChanContended_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkChanContended_go1.24_124_nospinbit.txt │ BenchmarkChanContended_go1.24_124.txt │
│ sec/op │ sec/op vs base │
ChanContended 6.317µ ± 0% 6.481µ ± 0% +2.60% (p=0.000 n=10)
ChanContended-2 13.44µ ± 5% 18.34µ ± 2% +36.43% (p=0.000 n=10)
ChanContended-3 16.99µ ± 3% 14.15µ ± 4% -16.70% (p=0.000 n=10)
ChanContended-4 22.11µ ± 2% 17.51µ ± 1% -20.78% (p=0.000 n=10)
ChanContended-5 26.79µ ± 11% 17.03µ ± 1% -36.43% (p=0.000 n=10)
ChanContended-6 31.70µ ± 4% 17.85µ ± 1% -43.70% (p=0.000 n=10)
ChanContended-7 34.40µ ± 11% 17.98µ ± 1% -47.72% (p=0.000 n=10)
ChanContended-8 41.35µ ± 2% 18.41µ ± 1% -55.47% (p=0.000 n=10)
ChanContended-9 38.51µ ± 18% 18.21µ ± 1% -52.71% (p=0.000 n=10)
ChanContended-10 43.31µ ± 22% 18.14µ ± 1% -58.10% (p=0.000 n=10)
ChanContended-11 43.41µ ± 2% 17.97µ ± 1% -58.61% (p=0.000 n=10)
ChanContended-12 44.10µ ± 16% 17.80µ ± 2% -59.63% (p=0.000 n=10)
ChanContended-13 45.99µ ± 24% 18.00µ ± 0% -60.85% (p=0.000 n=10)
ChanContended-14 46.22µ ± 15% 18.07µ ± 1% -60.91% (p=0.000 n=10)
ChanContended-15 48.10µ ± 11% 18.07µ ± 0% -62.43% (p=0.000 n=10)
ChanContended-16 45.97µ ± 5% 17.87µ ± 1% -61.12% (p=0.000 n=10)
ChanContended-17 46.88µ ± 7% 17.60µ ± 1% -62.45% (p=0.000 n=10)
ChanContended-18 48.94µ ± 14% 17.14µ ± 2% -64.99% (p=0.000 n=10)
ChanContended-19 44.91µ ± 12% 17.06µ ± 1% -62.01% (p=0.000 n=10)
ChanContended-20 44.43µ ± 3% 16.96µ ± 2% -61.83% (p=0.000 n=10)
ChanContended-21 43.35µ ± 0% 17.01µ ± 2% -60.76% (p=0.000 n=10)
ChanContended-22 43.42µ ± 9% 16.53µ ± 2% -61.94% (p=0.000 n=10)
ChanContended-23 43.26µ ± 19% 16.78µ ± 1% -61.21% (p=0.000 n=10)
ChanContended-24 42.91µ ± 3% 16.64µ ± 2% -61.23% (p=0.000 n=10)
ChanContended-25 42.84µ ± 8% 16.61µ ± 2% -61.23% (p=0.000 n=10)
ChanContended-26 38.40µ ± 32% 16.84µ ± 2% -56.16% (p=0.000 n=10)
ChanContended-27 37.61µ ± 27% 16.89µ ± 2% -55.10% (p=0.000 n=10)
ChanContended-28 51.70µ ± 8% 16.79µ ± 2% -67.52% (p=0.000 n=10)
ChanContended-29 48.91µ ± 11% 16.66µ ± 2% -65.94% (p=0.000 n=10)
ChanContended-30 45.40µ ± 10% 16.65µ ± 1% -63.32% (p=0.000 n=10)
ChanContended-31 45.36µ ± 7% 16.89µ ± 3% -62.77% (p=0.000 n=10)
ChanContended-32 45.12µ ± 10% 16.71µ ± 3% -62.96% (p=0.000 n=10)
geomean 36.86µ 16.72µ -54.63%
However, under high-contention scenarios that better reflect production workloads, no performance benefit was observed (see the benchmark attached in the "What did you do?" section). Our service experienced degradation in P99/max latency, particularly affecting goroutines performing network operations such as database queries, cache requests, and external service calls.
Beyond the performance degradation, this is also related to the issue about the removal of the nospinbitmutex GOEXPERIMENT, which is blocking our transition to Go 1.25 because we rely on that experiment as a workaround.
Comment From: gabyhelp
Related Issues
- runtime: spinbitmutex performance differs between 1.24.0 and 1.24.1, what changed? #72117 (closed)
- cmd/compile: sync.Mutex is not fair on ARMv8.1 systems #39304 (closed)
- runtime: improve scaling of lock2 #68578 (closed)
- sync: Mutex performance collapses with high concurrency #33747 (closed)
- runtime: with -test.blockprofile program run even faster. #57319 (closed)
- runtime: async preemption causes short sleep in tickspersecond #63103 (closed)
- sync: mutex profiling information is confusing (wrong?) for mutexes with >2 contenders #24877
- runtime: unknown GOEXPERIMENT spinbitmutex #75094 (closed)
Comment From: rhysh
Thank you for the report.
CC @golang/runtime
The behavior in the benchmark looks related to contention within the (251-element) table of semaphore roots. I've run it on a linux/amd64 machine with go1.23.12, go1.24.7, go1.24.7 with nospinbitmutex, and go1.25.1. The mutex contention profile of course shows a lot of demand for the benchmark's global sync.Mutex value, but when ignoring those samples (and using Go 1.25) the profile also shows contention in the slow path of sync.Mutex.Lock, inside runtime.semacquire1.
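To make the semaphore-root contention concrete: this is not the runtime's literal code, but a sketch of the address-to-bucket mapping the runtime uses for its semaphore table (the 251-bucket size matches runtime/sema.go; the addresses below are made up for illustration). Two semaphore addresses that hash to the same bucket contend on the same semaRoot lock even though they belong to unrelated mutexes.

```go
package main

import "fmt"

const semTabSize = 251 // size of the runtime's semaphore-root table

// rootIndex sketches how a semaphore address is mapped to one of the
// semTabSize buckets (shift off low bits, then reduce modulo the table size).
func rootIndex(addr uintptr) uintptr {
	return (addr >> 3) % semTabSize
}

func main() {
	// Two hypothetical semaphore addresses 251*8 bytes apart land in the
	// same bucket, so their waiters serialize on the same semaRoot.
	a := uintptr(0x1000)
	b := uintptr(0x1000 + 251*8)
	fmt.Println(rootIndex(a), rootIndex(b), rootIndex(a) == rootIndex(b)) // 10 10 true
}
```

With many goroutines parked on one global sync.Mutex, all of that traffic funnels through a single bucket of this table, which is consistent with the contention seen in runtime.semacquire1.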
The execution trace I collected with Go 1.25 shows about 3 million goroutines per second that run processPart2LockContentionUnpredictable.func1. In a sample of 300,000 of those, I see hundreds with "Execution time" of 200µs or more. In similar data from Go 1.24 with nospinbitmutex, the sample includes only 4 with "Execution time" of 200µs or more. I expect that corresponds to delay within runtime.semacquire1, where the thread (M) is blocked in runtime.lock2 while still associated with the goroutine (G).
The execution traces show that most goroutines only have a few hundred nanoseconds of work to do (while holding the global sync.Mutex value). It looks like work maxes out at 50 + 15*13 + 9*7.
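That bound can be checked by brute force over the benchmark's parameter ranges (a quick verification sketch, not part of the original report; the loop bounds come from the benchmark's 16 inner goroutines and requestsPerGoroutine = 10):

```go
package main

import "fmt"

func main() {
	maxWork := 0
	for reqID := 0; reqID < 10; reqID++ { // requestsPerGoroutine = 10
		for id := 0; id < 16; id++ { // 16 inner goroutines per request
			// Same formula as the benchmark's workload sizing.
			variance := (id*13 + reqID*7) % 199951
			if w := 50 + variance; w > maxWork {
				maxWork = w
			}
		}
	}
	fmt.Println(maxWork) // 308 = 50 + 15*13 + 9*7
}
```

The modulus 199951 never comes into play because id*13 + reqID*7 tops out at 258, so the critical section is always a loop of at most 308 additions.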
I'm not sure at the moment why the benchmark's use of the semaphore table is more sensitive to latency than throughput. I'd expect the benchmark to nearly always have goroutines that need the global sync.Mutex, and for some M to be able to acquire the semaRoot and make progress.
You mentioned that your production workload's tail latency regressions are associated with network operations, but suggested that they were resolved by building with nospinbitmutex. Can you share more about how you reduced the problem you saw into this benchmark?