Go version
go1.23.4 linux/arm64
Output of go env in your module/workspace:
GOARCH='arm64'
What did you do?
While investigating the performance of json package, I found unquoteBytes() spends extra time with the if c == '\\' || c == '"' || c < ' ' condition (src/encoding/json/decode.go#L1209), which could be improved by condidional comparisons (such as ARM64 CCMP instruction).
What did you see happen?
Currently, Go compiler can generate conditonal assignments (e.g., CSET, CSEL, CSINC), but it can not generate conditional comparisons (i.e., CCMP), which let you combine the results of multiple comparisons so you can perform a single test at the end. (see more details: The AArch64 processor (aka arm64), part 16: Conditional execution - The Old New Thing)
E.g., following are Go tests (https://godbolt.org/z/q6YMT7Msa) and corresponding C tests (https://godbolt.org/z/4qshde78o, compiled by GCC -O1). Go compiler generates CMP;BEQ insteald of CPM;CCMP for test2():
The Go tests:
func test1(c int) (r int) {
if c > 0 {
r = 1
}
return
}
// CMP $0, R0
// CSET GT, R0
// RET
func test2(a, b, c int) (r int) {
if c == '\\' || c == '"' || c < ' ' {
r = a
} else {
r = b
}
return
}
// CMP $92, R2
// BEQ pc28
// CMP $34, R2
// BEQ pc28
// CMP $32, R2
// BLT pc28
// MOVD R1, R0
// pc28:
// RET (R30)
The C tests:
int test1(int c) {
if (c > 0) {
return 1;
}
return 0;
}
// cmp w0, 0
// cset w0, gt
// ret
int test2(int a, int b, int c) {
if (c == '\\' || c == '"' || c < ' ') {
return a;
} else {
return b;
}
}
// cmp w2, 92
// mov w3, 34
// ccmp w2, w3, 4, ne
// ccmp w2, 31, 4, ne
// csel w0, w1, w0, gt
// ret
Measure performance
Conditional comparisons should generally improve the performance of conjunction/disjunction of conditions by &&/|| operators on ARM64 machines.
Following cases are simplified from unquoteBytes. I tested on ARM64 Neoverse-N1 (AmpereComputing Altra and AWS Graviton2 is similar), the C case (GCC -O3 generates CCMP) is much faster than the Go case (go1.23.4 generates CMP): 5.69s vs. 9.23s.
The Go Test:
package main
//go:nosplit
//go:noinline
func unquoteBytes(s []byte, len int) int {
s = s[1:]
r := 0
for r < len {
c := s[r]
if c == '\\' || c == '"' || c < ' ' {
break
}
r++
}
return r
}
func main() {
data := []byte(`"hello, world"`)
len := len(data)
for i := 0; i < 1000*1000*500; i++ {
unquoteBytes(data, len)
}
}
The C Test
#include <string.h>
__attribute__((noinline, noipa))
int unquoteBytes(const char *data, int len) {
data = data + 1;
int r = 0;
while (r < len) {
char c = data[r];
if (c == '\\' || c == '"' || c < ' ') {
break;
}
r++;
}
return r;
}
int main() {
const char *data = "\"hello, world\"";
unsigned len = strlen(data);
for (int i = 0; i < 1000 * 1000 * 500; i++) {
unquoteBytes(data, len);
}
}
As C may have less overhead than Go in function call and main, let's just compare the linux-perf samples of the loop:
Go results:
; 95.89% 35560 mytest mytest [.] main.unquoteBytes
; 3.97% 1473 mytest mytest [.] main.main
7381 : 73260: add x0, x0, #0x1 ; loop header
699 : 73264: cmp x3, x0
0 : 73268: b.le 73288 ; loop exit
9698 : 7326c: ldrb w2, [x1, x0]
0 : 73270: cmp w2, #0x5c
0 : 73274: b.eq 73288 ; loop exit
8688 : 73278: cmp w2, #0x22
0 : 7327c: b.eq 73288 ; loop exit
7908 : 73280: cmp w2, #0x20
0 : 73284: b.cs 73260
661 : 73288: ret
C results:
; 99.76% 22829 simp simp [.] unquoteBytes
; 0.20% 45 simp simp [.] main
12051 : 4006f8: cmp x6, x2
0 : 4006fc: b.eq 400724 ; loop exit
121 : 400700: mov x2, x4
431 : 400704: ldrb w3, [x0, x2]
547 : 400708: add x4, x2, #0x1
0 : 40070c: cmp w3, #0x5c
221 : 400710: ccmp w3, w5, #0x4, ne
1233 : 400714: ccmp w3, #0x1f, #0x0, ne
6381 : 400718: b.hi 4006f8 ; loop header
991 : 40071c: sub w0, w2, #0x1
0 : 400720: ret
0 : 400724: mov w0, w1
0 : 400728: ret
The assembly instructions are much similar except the CMP and CCMP. If we just count samples related to the loop, the C case (CCMP) vs the Go case (CCMP): 21976 vs 35035 (+47%).
Since the input data may affect performance, I also tested data like "\hello, world", so the loop could break at the 1st comparison against \, then CCMP is still faster than CMP (1.00s vs. 1.29s).
What did you expect to see?
Could we enhance Go compiler to generate CCMP?
BTW. I searched and didn't find any issue about conditional instructions (there is just an old issue #6011 about failing to generate conditional move).
Comment From: gabyhelp
Related Issues
Related Code Changes
- cmd/compile: ARM comparisons with 0 incorrect on overflow
- cmd/compile/ssa: optimize the derivable known branch of If block
- cmd/compile: add rewrite rules for conditional instructions on arm64
- cmd/compile: fix incorrect rewriting to if condition
- cmd/compile/internal/ssa: optimize ARM64 code with TST
- cmd/compile: optimize ARM's comparision
- cmd/compile: combine RORW with logical ops on arm64
- cmd/compile: optimize arm's comparison
- cmd/compile: optimize the Phi values
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Comment From: mknyszek
CC @golang/compiler
Comment From: gopherbot
Change https://go.dev/cl/698037 mentions this issue: cmd/compile: introduce CCMP generation
Comment From: gopherbot
Change https://go.dev/cl/698099 mentions this issue: cmd/compile: replace conditions to CCMP instructions on ARM64