Go version

go version go1.24.2 linux/riscv64

Output of go env in your module/workspace:

AR='ar'
CC='riscv64-unknown-linux-gnu-gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='riscv64-unknown-linux-gnu-g++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='riscv64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -pthread -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1454745784=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='riscv64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/tmp/benchmark_demo/go.mod'
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GORISCV64='rva20u64'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_riscv64'
GOVCS=''
GOVERSION='go1.24.2'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I was studying branch prediction behavior on real RISC-V hardware (StarFive VisionFive 2, JH7110 SoC with 4x SiFive U74-MC cores) by creating two nearly identical benchmark functions that differ only in the data preparation performed before b.ResetTimer():

  1. Created a minimal test case with two benchmark functions:
     • BenchmarkSortedData: calls sort.Ints(data) before b.ResetTimer()
     • BenchmarkUnsortedData: the same function without the sort call
  2. Both functions have identical benchmark loops after b.ResetTimer().
  3. Ran the benchmarks with default optimizations: go test -bench=. riscv_bug_test.go

Complete Go source file and generated RISC-V assembly have been attached for full analysis:

riscv_bug_test.go.txt

riscv_bug_test.S.txt

What did you see happen?

1. Performance Results (Problematic)

BenchmarkSortedData-4       6843    175703 ns/op  (SLOW - 4x slower!)
BenchmarkUnsortedData-4    27356     43874 ns/op  (FAST)

2. Assembly Generation Issues

The compiler generated drastically different assembly for the two functions:

BenchmarkSortedData: 152 bytes, 48-byte stack frame ($40-8)
BenchmarkUnsortedData: 124 bytes, 24-byte stack frame ($16-8)

The benchmark loops after b.ResetTimer() are identical in Go source code, but the compiler:

  1. Uses different stack layouts (56(SP) vs 32(SP) offsets)
  2. Applies different inlining strategies
  3. Generates different register allocation patterns

Verification with disabled optimizations

When running with -gcflags="-N -l", the performance difference becomes logical:

BenchmarkSortedData-4        975   1229971 ns/op  (Predictable branches - faster)
BenchmarkUnsortedData-4      858   1397883 ns/op  (Unpredictable branches - slower)

This shows the 4x artificial difference disappears when optimizations are disabled, revealing the real ~14% CPU behavior difference.

Expected Assembly Behavior

Both functions should generate similar assembly code since they have identical Go source code after b.ResetTimer(). The compiler should:

  1. Use similar stack layouts (same frame size and local variable allocation)
  2. Apply consistent optimization strategies for the identical benchmark loops
  3. Generate comparable code size (within a few bytes)
  4. Respect the b.ResetTimer() optimization boundary - code before the timer reset should not influence code generation after it

Expected Performance Results

The performance difference should reflect real CPU behavior (branch prediction effects), approximately:

BenchmarkSortedData-4        ~1000   ~1200000 ns/op  (Predictable branches)
BenchmarkUnsortedData-4       ~900   ~1400000 ns/op  (Unpredictable branches)

This would show a realistic ~15% difference due to branch misprediction penalties on the SiFive U74-MC, not an artificial 4x compiler-generated difference.

Expected consistency

The same optimization level should produce the same code structure for logically equivalent functions, regardless of data preparation steps that occur before the measurement boundary.

Root cause analysis (updated)

After further investigation, with input from other AI systems, the root cause appears to be related to the inlining budget heuristics in the RISC-V backend.

The presence of sort.Ints(data) before b.ResetTimer() causes the compiler to:

  1. Perceive higher function complexity due to the sort operation's internal loops and calls
  2. Consume inlining budget during the analysis phase
  3. Adopt different optimization strategies for subsequent code, including the benchmark loop
  4. Allocate larger stack frames as a preventive measure (48 vs 24 bytes)
  5. Generate different register allocation patterns due to the altered stack layout

This creates a cascade effect where identical Go code after b.ResetTimer() produces different assembly due to compiler state changes from code that shouldn't influence the measured performance.

Technical Impact

  • Different stack frame sizes: $40-8 vs $16-8
  • Different register allocation strategies
  • Different code generation for identical source code
  • 4x artificial performance difference masking real CPU behavior

Related Issues

This appears related to #50821 (AMD64 register allocation inconsistency), suggesting a broader compiler optimization pipeline issue affecting multiple architectures.

Edits: Typos. No results change.

Comment From: gabyhelp

Related Issues

Related Code Changes


Comment From: seankhliao

I believe you should use b.Loop for consistency in optimizations

Comment From: admnd

Interesting finding here. I tried your suggestion to use b.Loop: not only does it not fix the bug, it inverts it! So we have a deeper issue here :(

Performance Matrix - RISC-V Go Compiler Bug

Updated test file: riscv_bug_test_v2.go

# Complete comparison - all 4 scenarios
$ go test -bench="BenchmarkDirectComparison" -benchtime=3s riscv_bug_test_v2.go

# Original bug
$ go test -bench="BenchmarkSortedData$|BenchmarkUnsortedData$" -benchtime=3s riscv_bug_test_v2.go

# Inverted bug with b.Loop()
$ go test -bench="BenchmarkSortedDataBLoop|BenchmarkUnsortedDataBLoop" -benchtime=3s riscv_bug_test_v2.go

Test Results on StarFive VisionFive 2 (SiFive U74-MC)

Loop Method                 With sort.Ints()   Without sort.Ints()   Performance Ratio
for i := 0; i < b.N; i++    175,372 ns/op      43,888 ns/op          4.0x slower
for b.Loop()                43,919 ns/op       175,423 ns/op         4.0x faster

Key Findings

  • Original Bug: sort.Ints() in same function causes 4x performance degradation with traditional for-loop
  • Inverted Bug: b.Loop() creates opposite behavior - now the function WITHOUT sort.Ints() is 4x slower
  • Same Magnitude: Both bugs produce identical 4.0x performance ratios
  • Perfect Reproducibility: Results are consistent across multiple test runs

Comment From: randall77

  _ = sum // Prevent optimization

This does not prevent any optimization. Sum is never calculated (in either of your benchmark versions).

The inner loop, for _, v := range data compiles to the following 3 instructions in both cases:

issue74606_test.go:42 0x10010ca94 91000463 ADD $1, R3, R3
issue74606_test.go:42 0x10010ca98 eb03003f CMP R3, R1
issue74606_test.go:42 0x10010ca9c 54ffffcc BGT -2(PC)

issue74606_test.go:25 0x10010ca08 91000463 ADD $1, R3, R3
issue74606_test.go:25 0x10010ca0c eb03003f CMP R3, R1
issue74606_test.go:25 0x10010ca10 54ffffcc BGT -2(PC)

I don't understand why those inner loops would result in vastly different timings. It could be alignment, could be something else. That would also explain why the roles reverse when using b.Loop: a different one of the two benchmarks may happen to trip on some alignment slowdown.

In any case, I don't see any problems with the generated code. A 48 vs 24 byte stack frame size is ~immaterial. Regalloc is identical in the inner loop. It is mysterious why they are 4x different. Please reopen if you find something interesting.

(Side note: tip compiles to inner loops with only 2 instructions:

issue74606_test.go:25 0x127a14 fff30313 ADDI $-1, X6, X6
issue74606_test.go:25 0x127a18 fe604ee3 BLT X0, X6, -1(PC)

issue74606_test.go:42 0x127a90 fff30313 ADDI $-1, X6, X6
issue74606_test.go:42 0x127a94 fe604ee3 BLT X0, X6, -1(PC)
)