Go version
go version go1.24.4 linux/arm64
Output of go env
in your module/workspace:
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='arm64'
GOARM64='v8.0'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1130418166=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go1.24'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go1.24/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.24.4'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
I conducted benchmarks with the code below on Go 1.24, comparing the default build against a build with GOEXPERIMENT=nospinbitmutex, on an AWS EC2 c7g.4xlarge instance. Note that although the instance has 16 vCPUs, I tested with -test.cpu values ranging from 1 to 32.
package main

import (
	"sync"
	"testing"
)

// Run with: go test -test.run='^$' -test.bench=Part2LockContentionUnpredictable -test.cpu="$(seq 1 32 | tr '\n' ',')" -test.count=10
func BenchmarkPart2LockContentionUnpredictable(b *testing.B) {
	const requestsPerGoroutine = 10
	var globalMutex sync.Mutex
	var sharedCounter = 0
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < requestsPerGoroutine; i++ {
				respChan := make(chan int, 1)
				go processPart2LockContentionUnpredictable(i, respChan, &globalMutex, &sharedCounter)
				<-respChan
			}
		}
	})
}

func processPart2LockContentionUnpredictable(reqID int, respChan chan int, globalMutex *sync.Mutex, sharedCounter *int) {
	results := make(chan int, 16)
	for i := 0; i < 16; i++ {
		go func(id int) {
			globalMutex.Lock()
			*sharedCounter++
			// Some workload, sized unpredictably per goroutine.
			baseWork := 50
			variance := (id*13 + reqID*7) % 199951
			work := baseWork + variance
			sum := id + reqID
			for j := 0; j < work; j++ {
				sum += j
			}
			globalMutex.Unlock()
			results <- sum
		}(i)
	}
	total := 0
	for i := 0; i < 16; i++ {
		total += <-results
	}
	respChan <- total
}
What did you see happen?
Benchstat 1.24 Run 1: Performance comparison between Go 1.24 default configuration and Go 1.24 with GOEXPERIMENT=nospinbitmutex
/root/go/bin/benchstat BenchmarkPart2LockContentionUnpredictable_go1.24_124_nospinbit.txt BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkPart2LockContentionUnpredictable_go1.24_124_nospinbit.txt │ BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt │
│ sec/op │ sec/op vs base │
Part2LockContentionUnpredictable 115.5µ ± 0% 115.8µ ± 0% ~ (p=0.218 n=10)
Part2LockContentionUnpredictable-2 92.59µ ± 2% 85.16µ ± 1% -8.03% (p=0.000 n=10)
Part2LockContentionUnpredictable-3 70.09µ ± 1% 65.15µ ± 1% -7.04% (p=0.000 n=10)
Part2LockContentionUnpredictable-4 60.07µ ± 1% 58.49µ ± 1% -2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-5 55.98µ ± 1% 53.59µ ± 3% -4.27% (p=0.000 n=10)
Part2LockContentionUnpredictable-6 52.18µ ± 0% 55.11µ ± 1% +5.61% (p=0.002 n=10)
Part2LockContentionUnpredictable-7 52.25µ ± 1% 53.62µ ± 11% +2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-8 54.72µ ± 0% 63.78µ ± 11% +16.56% (p=0.000 n=10)
Part2LockContentionUnpredictable-9 58.87µ ± 1% 64.30µ ± 6% +9.22% (p=0.000 n=10)
Part2LockContentionUnpredictable-10 63.40µ ± 1% 67.22µ ± 5% +6.03% (p=0.000 n=10)
Part2LockContentionUnpredictable-11 65.92µ ± 1% 72.38µ ± 8% +9.81% (p=0.001 n=10)
Part2LockContentionUnpredictable-12 67.27µ ± 1% 73.67µ ± 8% +9.51% (p=0.003 n=10)
Part2LockContentionUnpredictable-13 67.89µ ± 0% 74.62µ ± 8% +9.91% (p=0.000 n=10)
Part2LockContentionUnpredictable-14 68.32µ ± 0% 75.33µ ± 9% +10.26% (p=0.000 n=10)
Part2LockContentionUnpredictable-15 68.50µ ± 0% 75.92µ ± 9% +10.84% (p=0.000 n=10)
Part2LockContentionUnpredictable-16 68.90µ ± 1% 70.17µ ± 9% +1.84% (p=0.000 n=10)
Part2LockContentionUnpredictable-17 68.64µ ± 0% 70.44µ ± 9% +2.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-18 68.80µ ± 0% 70.34µ ± 9% +2.24% (p=0.000 n=10)
Part2LockContentionUnpredictable-19 68.69µ ± 0% 77.28µ ± 9% +12.51% (p=0.000 n=10)
Part2LockContentionUnpredictable-20 68.88µ ± 0% 76.92µ ± 8% +11.67% (p=0.000 n=10)
Part2LockContentionUnpredictable-21 68.89µ ± 0% 71.08µ ± 9% +3.18% (p=0.000 n=10)
Part2LockContentionUnpredictable-22 68.89µ ± 0% 71.78µ ± 9% +4.19% (p=0.000 n=10)
Part2LockContentionUnpredictable-23 69.11µ ± 1% 78.10µ ± 8% +13.01% (p=0.000 n=10)
Part2LockContentionUnpredictable-24 69.27µ ± 1% 71.80µ ± 10% +3.66% (p=0.000 n=10)
Part2LockContentionUnpredictable-25 69.06µ ± 1% 78.62µ ± 9% +13.83% (p=0.000 n=10)
Part2LockContentionUnpredictable-26 69.33µ ± 1% 79.37µ ± 9% +14.47% (p=0.000 n=10)
Part2LockContentionUnpredictable-27 69.48µ ± 0% 72.49µ ± 10% +4.32% (p=0.000 n=10)
Part2LockContentionUnpredictable-28 69.51µ ± 0% 73.00µ ± 9% +5.02% (p=0.000 n=10)
Part2LockContentionUnpredictable-29 69.52µ ± 0% 72.85µ ± 10% +4.78% (p=0.000 n=10)
Part2LockContentionUnpredictable-30 69.74µ ± 0% 72.81µ ± 11% +4.40% (p=0.000 n=10)
Part2LockContentionUnpredictable-31 69.90µ ± 0% 73.13µ ± 11% +4.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-32 69.88µ ± 0% 80.91µ ± 10% +15.79% (p=0.000 n=10)
geomean 67.70µ 71.61µ +5.78%
Benchstat Run 2: Performance comparison between Go 1.23 and the Go 1.24 default configuration
/root/go/bin/benchstat BenchmarkPart2LockContentionUnpredictable_go1.23_123.txt BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkPart2LockContentionUnpredictable_go1.23_123.txt │ BenchmarkPart2LockContentionUnpredictable_go1.24_124.txt │
│ sec/op │ sec/op vs base │
Part2LockContentionUnpredictable 117.9µ ± 0% 115.8µ ± 0% -1.70% (p=0.000 n=10)
Part2LockContentionUnpredictable-2 89.26µ ± 1% 89.86µ ± 2% ~ (p=0.280 n=10)
Part2LockContentionUnpredictable-3 67.46µ ± 2% 67.01µ ± 1% ~ (p=0.579 n=10)
Part2LockContentionUnpredictable-4 60.64µ ± 1% 59.93µ ± 1% -1.18% (p=0.000 n=10)
Part2LockContentionUnpredictable-5 55.17µ ± 1% 54.48µ ± 1% -1.24% (p=0.001 n=10)
Part2LockContentionUnpredictable-6 52.51µ ± 0% 52.49µ ± 1% ~ (p=0.739 n=10)
Part2LockContentionUnpredictable-7 52.78µ ± 0% 54.53µ ± 3% +3.33% (p=0.001 n=10)
Part2LockContentionUnpredictable-8 55.29µ ± 0% 58.52µ ± 5% +5.85% (p=0.000 n=10)
Part2LockContentionUnpredictable-9 59.62µ ± 1% 61.38µ ± 4% ~ (p=0.700 n=10)
Part2LockContentionUnpredictable-10 64.47µ ± 1% 66.37µ ± 6% ~ (p=0.481 n=10)
Part2LockContentionUnpredictable-11 67.47µ ± 1% 64.39µ ± 1% -4.56% (p=0.002 n=10)
Part2LockContentionUnpredictable-12 68.70µ ± 1% 69.77µ ± 6% ~ (p=0.481 n=10)
Part2LockContentionUnpredictable-13 69.62µ ± 0% 70.46µ ± 5% ~ (p=0.137 n=10)
Part2LockContentionUnpredictable-14 69.55µ ± 0% 67.20µ ± 6% ~ (p=0.143 n=10)
Part2LockContentionUnpredictable-15 70.09µ ± 0% 69.79µ ± 4% ~ (p=1.000 n=10)
Part2LockContentionUnpredictable-16 70.23µ ± 1% 72.21µ ± 5% +2.82% (p=0.023 n=10)
Part2LockContentionUnpredictable-17 69.91µ ± 1% 68.81µ ± 5% ~ (p=0.143 n=10)
Part2LockContentionUnpredictable-18 69.78µ ± 0% 72.72µ ± 5% +4.21% (p=0.023 n=10)
Part2LockContentionUnpredictable-19 69.87µ ± 1% 71.34µ ± 3% ~ (p=0.436 n=10)
Part2LockContentionUnpredictable-20 70.07µ ± 1% 73.25µ ± 5% ~ (p=0.105 n=10)
Part2LockContentionUnpredictable-21 69.81µ ± 1% 73.55µ ± 5% +5.36% (p=0.001 n=10)
Part2LockContentionUnpredictable-22 70.32µ ± 0% 70.29µ ± 1% ~ (p=0.684 n=10)
Part2LockContentionUnpredictable-23 69.80µ ± 0% 74.03µ ± 5% +6.06% (p=0.000 n=10)
Part2LockContentionUnpredictable-24 70.10µ ± 0% 74.16µ ± 5% +5.79% (p=0.001 n=10)
Part2LockContentionUnpredictable-25 70.08µ ± 0% 72.49µ ± 3% +3.44% (p=0.000 n=10)
Part2LockContentionUnpredictable-26 70.22µ ± 1% 72.77µ ± 3% +3.63% (p=0.000 n=10)
Part2LockContentionUnpredictable-27 70.28µ ± 1% 73.28µ ± 3% +4.27% (p=0.000 n=10)
Part2LockContentionUnpredictable-28 70.50µ ± 0% 71.38µ ± 5% +1.25% (p=0.002 n=10)
Part2LockContentionUnpredictable-29 70.59µ ± 1% 75.34µ ± 5% +6.73% (p=0.000 n=10)
Part2LockContentionUnpredictable-30 70.93µ ± 1% 71.79µ ± 5% +1.22% (p=0.000 n=10)
Part2LockContentionUnpredictable-31 70.78µ ± 0% 73.85µ ± 3% +4.35% (p=0.000 n=10)
Part2LockContentionUnpredictable-32 70.85µ ± 0% 75.69µ ± 4% +6.82% (p=0.000 n=10)
geomean 68.46µ 69.85µ +2.03%
The spinbit results show high variance across runs, indicating unpredictable performance.
What did you expect to see?
I expected similar performance between spinbitmutex (new in Go 1.24) and nospinbitmutex (the behavior before Go 1.24, available in Go 1.24 via GOEXPERIMENT=nospinbitmutex, and removed in Go 1.25). However, spinbit exhibits higher variance and unpredictable latency patterns.
I also reproduced BenchmarkChanContended on my arm64 test environment, which yielded results comparable to the amd64 numbers reported previously. This validates the improvement shown in that particular benchmark case.
/root/go/bin/benchstat BenchmarkChanContended_go1.24_124_nospinbit.txt BenchmarkChanContended_go1.24_124.txt
goos: linux
goarch: arm64
pkg: chan-contended-1.24
│ BenchmarkChanContended_go1.24_124_nospinbit.txt │ BenchmarkChanContended_go1.24_124.txt │
│ sec/op │ sec/op vs base │
ChanContended 6.317µ ± 0% 6.481µ ± 0% +2.60% (p=0.000 n=10)
ChanContended-2 13.44µ ± 5% 18.34µ ± 2% +36.43% (p=0.000 n=10)
ChanContended-3 16.99µ ± 3% 14.15µ ± 4% -16.70% (p=0.000 n=10)
ChanContended-4 22.11µ ± 2% 17.51µ ± 1% -20.78% (p=0.000 n=10)
ChanContended-5 26.79µ ± 11% 17.03µ ± 1% -36.43% (p=0.000 n=10)
ChanContended-6 31.70µ ± 4% 17.85µ ± 1% -43.70% (p=0.000 n=10)
ChanContended-7 34.40µ ± 11% 17.98µ ± 1% -47.72% (p=0.000 n=10)
ChanContended-8 41.35µ ± 2% 18.41µ ± 1% -55.47% (p=0.000 n=10)
ChanContended-9 38.51µ ± 18% 18.21µ ± 1% -52.71% (p=0.000 n=10)
ChanContended-10 43.31µ ± 22% 18.14µ ± 1% -58.10% (p=0.000 n=10)
ChanContended-11 43.41µ ± 2% 17.97µ ± 1% -58.61% (p=0.000 n=10)
ChanContended-12 44.10µ ± 16% 17.80µ ± 2% -59.63% (p=0.000 n=10)
ChanContended-13 45.99µ ± 24% 18.00µ ± 0% -60.85% (p=0.000 n=10)
ChanContended-14 46.22µ ± 15% 18.07µ ± 1% -60.91% (p=0.000 n=10)
ChanContended-15 48.10µ ± 11% 18.07µ ± 0% -62.43% (p=0.000 n=10)
ChanContended-16 45.97µ ± 5% 17.87µ ± 1% -61.12% (p=0.000 n=10)
ChanContended-17 46.88µ ± 7% 17.60µ ± 1% -62.45% (p=0.000 n=10)
ChanContended-18 48.94µ ± 14% 17.14µ ± 2% -64.99% (p=0.000 n=10)
ChanContended-19 44.91µ ± 12% 17.06µ ± 1% -62.01% (p=0.000 n=10)
ChanContended-20 44.43µ ± 3% 16.96µ ± 2% -61.83% (p=0.000 n=10)
ChanContended-21 43.35µ ± 0% 17.01µ ± 2% -60.76% (p=0.000 n=10)
ChanContended-22 43.42µ ± 9% 16.53µ ± 2% -61.94% (p=0.000 n=10)
ChanContended-23 43.26µ ± 19% 16.78µ ± 1% -61.21% (p=0.000 n=10)
ChanContended-24 42.91µ ± 3% 16.64µ ± 2% -61.23% (p=0.000 n=10)
ChanContended-25 42.84µ ± 8% 16.61µ ± 2% -61.23% (p=0.000 n=10)
ChanContended-26 38.40µ ± 32% 16.84µ ± 2% -56.16% (p=0.000 n=10)
ChanContended-27 37.61µ ± 27% 16.89µ ± 2% -55.10% (p=0.000 n=10)
ChanContended-28 51.70µ ± 8% 16.79µ ± 2% -67.52% (p=0.000 n=10)
ChanContended-29 48.91µ ± 11% 16.66µ ± 2% -65.94% (p=0.000 n=10)
ChanContended-30 45.40µ ± 10% 16.65µ ± 1% -63.32% (p=0.000 n=10)
ChanContended-31 45.36µ ± 7% 16.89µ ± 3% -62.77% (p=0.000 n=10)
ChanContended-32 45.12µ ± 10% 16.71µ ± 3% -62.96% (p=0.000 n=10)
geomean 36.86µ 16.72µ -54.63%
However, under high-contention scenarios that better reflect production workloads, no performance benefit was observed (see the benchmark attached in the "What did you do?" section). Our service experienced degradation in P99/max latency, particularly affecting goroutines performing network operations such as database queries, cache requests, and external service calls.
Beyond the performance degradation, this is also related to the issue about the removal of the nospinbitmutex GOEXPERIMENT, which is blocking our transition to Go 1.25 because we rely on that experiment as a workaround.
Comment From: gabyhelp
Related Issues
- runtime: spinbitmutex performance differs between 1.24.0 and 1.24.1, what changed? #72117 (closed)
- cmd/compile: sync.Mutex is not fair on ARMv8.1 systems #39304 (closed)
- runtime: improve scaling of lock2 #68578 (closed)
- sync: Mutex performance collapses with high concurrency #33747 (closed)
- runtime: with -test.blockprofile program run even faster. #57319 (closed)
- runtime: async preemption causes short sleep in tickspersecond #63103 (closed)
- sync: mutex profiling information is confusing (wrong?) for mutexes with >2 contenders #24877
- runtime: unknown GOEXPERIMENT spinbitmutex #75094 (closed)
Comment From: rhysh
Thank you for the report.
CC @golang/runtime
The behavior in the benchmark looks related to contention within the (251-element) table of semaphore roots. I've run it on a linux/amd64 machine with go1.23.12, go1.24.7, go1.24.7 with nospinbitmutex, and go1.25.1. The mutex contention profile of course shows a lot of demand for the benchmark's global sync.Mutex value, but when ignoring those samples (and using Go 1.25) the profile also shows contention in the slow path of sync.Mutex.Lock, inside runtime.semacquire1.
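To make the semaphore-root contention concrete: this is not the runtime's literal code, but a sketch of the address-to-bucket mapping the runtime uses for its semaphore table (the 251-bucket size matches runtime/sema.go; the addresses below are made up for illustration). Two semaphore addresses that hash to the same bucket contend on the same semaRoot lock even though they belong to unrelated mutexes.

```go
package main

import "fmt"

const semTabSize = 251 // size of the runtime's semaphore-root table

// rootIndex sketches how a semaphore address is mapped to one of the
// semTabSize buckets (shift off low bits, then reduce modulo the table size).
func rootIndex(addr uintptr) uintptr {
	return (addr >> 3) % semTabSize
}

func main() {
	// Two hypothetical semaphore addresses 251*8 bytes apart land in the
	// same bucket, so their waiters serialize on the same semaRoot.
	a := uintptr(0x1000)
	b := uintptr(0x1000 + 251*8)
	fmt.Println(rootIndex(a), rootIndex(b), rootIndex(a) == rootIndex(b)) // 10 10 true
}
```

With many goroutines parked on one global sync.Mutex, all of that traffic funnels through a single bucket of this table, which is consistent with the contention seen in runtime.semacquire1.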
The execution trace I collected with Go 1.25 shows about 3 million goroutines per second that run processPart2LockContentionUnpredictable.func1. In a sample of 300,000 of those, I see hundreds with "Execution time" of 200µs or more. In similar data from Go 1.24 with nospinbitmutex, the sample includes only 4 with "Execution time" of 200µs or more. I expect that corresponds to delay within runtime.semacquire1, where the thread (M) is blocked in runtime.lock2 while still associated with the goroutine (G).
The execution traces show that most goroutines only have a few hundred nanoseconds of work to do (while holding the global sync.Mutex value). It looks like work maxes out at 50 + 15*13 + 9*7.
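That bound can be checked by brute force over the benchmark's parameter ranges (a quick verification sketch, not part of the original report; the loop bounds come from the benchmark's 16 inner goroutines and requestsPerGoroutine = 10):

```go
package main

import "fmt"

func main() {
	maxWork := 0
	for reqID := 0; reqID < 10; reqID++ { // requestsPerGoroutine = 10
		for id := 0; id < 16; id++ { // 16 inner goroutines per request
			// Same formula as the benchmark's workload sizing.
			variance := (id*13 + reqID*7) % 199951
			if w := 50 + variance; w > maxWork {
				maxWork = w
			}
		}
	}
	fmt.Println(maxWork) // 308 = 50 + 15*13 + 9*7
}
```

The modulus 199951 never comes into play because id*13 + reqID*7 tops out at 258, so the critical section is always a loop of at most 308 additions.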
I'm not sure at the moment why the benchmark's use of the semaphore table is more sensitive to latency than throughput. I'd expect the benchmark to nearly always have goroutines that need the global sync.Mutex, and for some M to be able to acquire the semaRoot and make progress.
You mentioned that your production workload's tail latency regressions are associated with network operations, but suggested that they were resolved by building with nospinbitmutex. Can you share more about how you reduced the problem you saw into this benchmark?