Go version
go1.23.4 linux/arm64
Output of go env in your module/workspace:
GOARCH='arm64'
What did you do?
While investigating the performance of the encoding/json package, I found that unquoteBytes() spends extra time on the condition if c == '\\' || c == '"' || c < ' ' (src/encoding/json/decode.go#L1209). This could be improved with conditional comparisons, such as the ARM64 CCMP instruction.
What did you see happen?
Currently, the Go compiler can generate conditional assignments (e.g., CSET, CSEL, CSINC), but it cannot generate conditional comparisons (i.e., CCMP), which let you combine the results of multiple comparisons so that you can perform a single test at the end. (For more details, see: The AArch64 processor (aka arm64), part 16: Conditional execution - The Old New Thing.)
E.g., the following are Go tests (https://godbolt.org/z/q6YMT7Msa) and the corresponding C tests (https://godbolt.org/z/4qshde78o, compiled by GCC -O1). The Go compiler generates CMP;BEQ instead of CMP;CCMP for test2():
The Go tests:
func test1(c int) (r int) {
	if c > 0 {
		r = 1
	}
	return
}
// CMP $0, R0
// CSET GT, R0
// RET

func test2(a, b, c int) (r int) {
	if c == '\\' || c == '"' || c < ' ' {
		r = a
	} else {
		r = b
	}
	return
}
// CMP $92, R2
// BEQ pc28
// CMP $34, R2
// BEQ pc28
// CMP $32, R2
// BLT pc28
// MOVD R1, R0
// pc28:
// RET (R30)
The C tests:
int test1(int c) {
    if (c > 0) {
        return 1;
    }
    return 0;
}
// cmp w0, 0
// cset w0, gt
// ret

int test2(int a, int b, int c) {
    if (c == '\\' || c == '"' || c < ' ') {
        return a;
    } else {
        return b;
    }
}
// cmp w2, 92
// mov w3, 34
// ccmp w2, w3, 4, ne
// ccmp w2, 31, 4, ne
// csel w0, w1, w0, gt
// ret
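To make the cmp/ccmp/ccmp/csel sequence above easier to follow, here is a minimal Go sketch that models the NZCV flags and replays the GCC output for test2 step by step. The helper names (flags, cmp, ccmp, viaCCMP, viaBranches) are made up for illustration and are not part of the compiler or any library. It models only the semantics needed here: CCMP performs the comparison when its condition holds on the incoming flags, and otherwise loads the flags from the 4-bit immediate.

```go
package main

import "fmt"

// flags models the AArch64 NZCV condition flags.
type flags struct{ n, z, c, v bool }

// cmp models CMP x, y: the flags come from the 32-bit subtraction x - y.
func cmp(x, y int32) flags {
	d := x - y // wraps on overflow, like the hardware
	return flags{
		n: d < 0,
		z: d == 0,
		// Carry set means the unsigned subtraction did not borrow.
		c: uint32(x) >= uint32(y),
		// Signed overflow: operands differ in sign and the result's
		// sign differs from x.
		v: (x^y)&(x^d) < 0,
	}
}

// fromImm builds flags from a 4-bit NZCV immediate (N=8, Z=4, C=2, V=1).
func fromImm(nzcv uint8) flags {
	return flags{n: nzcv&8 != 0, z: nzcv&4 != 0, c: nzcv&2 != 0, v: nzcv&1 != 0}
}

// ccmp models CCMP x, y, #nzcv, cond. condHolds is the condition evaluated
// on the incoming flags: if it holds, compare x with y; otherwise load the
// flags from the immediate.
func ccmp(condHolds bool, x, y int32, nzcv uint8) flags {
	if condHolds {
		return cmp(x, y)
	}
	return fromImm(nzcv)
}

func (f flags) ne() bool { return !f.z }
func (f flags) gt() bool { return !f.z && f.n == f.v }

// viaCCMP replays the GCC sequence for test2.
func viaCCMP(a, b, c int32) int32 {
	f := cmp(c, 92)            // cmp  w2, 92
	f = ccmp(f.ne(), c, 34, 4) // ccmp w2, w3, 4, ne  (w3 = 34)
	f = ccmp(f.ne(), c, 31, 4) // ccmp w2, 31, 4, ne
	if f.gt() {                // csel w0, w1, w0, gt
		return b
	}
	return a
}

// viaBranches is the original source-level condition.
func viaBranches(a, b, c int32) int32 {
	if c == '\\' || c == '"' || c < ' ' {
		return a
	}
	return b
}

func main() {
	for c := int32(-1000); c <= 1000; c++ {
		if viaCCMP(1, 2, c) != viaBranches(1, 2, c) {
			fmt.Println("mismatch at", c)
			return
		}
	}
	fmt.Println("CCMP sequence matches the || chain for all tested inputs")
}
```

The immediate 4 sets only Z, so once any equality test has matched, every later CCMP keeps "equal" and the final GT test fails, which selects a.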
Measure performance
Conditional comparisons should generally improve the performance of conjunctions and disjunctions of conditions (the && and || operators) on ARM64 machines.
The following cases are simplified from unquoteBytes. I tested on ARM64 Neoverse-N1 (Ampere Computing Altra and AWS Graviton2 are similar). The C case (GCC -O3 generates CCMP) is much faster than the Go case (go1.23.4 generates CMP): 5.69s vs. 9.23s.
The Go Test:
package main

//go:nosplit
//go:noinline
func unquoteBytes(s []byte, len int) int {
	s = s[1:]
	r := 0
	for r < len {
		c := s[r]
		if c == '\\' || c == '"' || c < ' ' {
			break
		}
		r++
	}
	return r
}

func main() {
	data := []byte(`"hello, world"`)
	len := len(data)
	for i := 0; i < 1000*1000*500; i++ {
		unquoteBytes(data, len)
	}
}
The C Test:
#include <string.h>

__attribute__((noinline, noipa))
int unquoteBytes(const char *data, int len) {
    data = data + 1;
    int r = 0;
    while (r < len) {
        char c = data[r];
        if (c == '\\' || c == '"' || c < ' ') {
            break;
        }
        r++;
    }
    return r;
}

int main() {
    const char *data = "\"hello, world\"";
    unsigned len = strlen(data);
    for (int i = 0; i < 1000 * 1000 * 500; i++) {
        unquoteBytes(data, len);
    }
}
Since C may have less overhead than Go in function calls and in main, let's compare only the linux-perf samples of the loop:
Go results:
; 95.89% 35560 mytest mytest [.] main.unquoteBytes
; 3.97% 1473 mytest mytest [.] main.main
7381 : 73260: add x0, x0, #0x1 ; loop header
699 : 73264: cmp x3, x0
0 : 73268: b.le 73288 ; loop exit
9698 : 7326c: ldrb w2, [x1, x0]
0 : 73270: cmp w2, #0x5c
0 : 73274: b.eq 73288 ; loop exit
8688 : 73278: cmp w2, #0x22
0 : 7327c: b.eq 73288 ; loop exit
7908 : 73280: cmp w2, #0x20
0 : 73284: b.cs 73260
661 : 73288: ret
C results:
; 99.76% 22829 simp simp [.] unquoteBytes
; 0.20% 45 simp simp [.] main
12051 : 4006f8: cmp x6, x2
0 : 4006fc: b.eq 400724 ; loop exit
121 : 400700: mov x2, x4
431 : 400704: ldrb w3, [x0, x2]
547 : 400708: add x4, x2, #0x1
0 : 40070c: cmp w3, #0x5c
221 : 400710: ccmp w3, w5, #0x4, ne
1233 : 400714: ccmp w3, #0x1f, #0x0, ne
6381 : 400718: b.hi 4006f8 ; loop header
991 : 40071c: sub w0, w2, #0x1
0 : 400720: ret
0 : 400724: mov w0, w1
0 : 400728: ret
The assembly instructions are very similar except for CMP vs. CCMP. If we count only the samples related to the loop, the C case (CCMP) vs. the Go case (CMP) is 21976 vs. 35035 samples (about +59%).
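For anyone who wants to re-check the loop totals, here is a small Go sketch that sums the per-instruction sample counts copied from the two perf listings above (zero entries omitted) and prints the ratio. The variable and function names are only for illustration. Note that the totals quoted above are reached only when the trailing ret (661) is included on the Go side and the sub (991) on the C side.

```go
package main

import "fmt"

// sum adds up per-instruction sample counts.
func sum(samples []int) int {
	total := 0
	for _, s := range samples {
		total += s
	}
	return total
}

// Non-zero sample counts copied from the perf annotations in this issue.
var (
	goSamples = []int{7381, 699, 9698, 8688, 7908, 661}
	cSamples  = []int{12051, 121, 431, 547, 221, 1233, 6381, 991}
)

func main() {
	goTotal, cTotal := sum(goSamples), sum(cSamples)
	fmt.Printf("Go (CMP):   %d samples\n", goTotal)
	fmt.Printf("C (CCMP):   %d samples\n", cTotal)
	fmt.Printf("Go/C ratio: %.2f\n", float64(goTotal)/float64(cTotal))
}
```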
Since the input data may affect performance, I also tested data like "\hello, world", where the loop breaks at the first comparison (against \). CCMP is still faster than CMP in that case (1.00s vs. 1.29s).
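As a possible way to rerun the Go side on both inputs without hand-written timing loops, here is a sketch using testing.Benchmark from a plain main package. It is not the harness used for the numbers above. The names scan, sink and benchInput are hypothetical, and scan is just a copy of the simplified loop with the length parameter renamed to n.

```go
package main

import (
	"fmt"
	"testing"
)

// scan mirrors the simplified unquoteBytes loop from the Go test in this
// issue. n is the length of the original slice, as in that test, so the
// loop relies on a terminating byte (quote, backslash or control
// character) to stop before running past the end of s[1:].
//
//go:noinline
func scan(s []byte, n int) int {
	s = s[1:]
	r := 0
	for r < n {
		c := s[r]
		if c == '\\' || c == '"' || c < ' ' {
			break
		}
		r++
	}
	return r
}

// sink keeps the result live so the call is not optimized away.
var sink int

// benchInput measures repeated scans of one input.
func benchInput(data []byte) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = scan(data, len(data))
		}
	})
}

func main() {
	inputs := []struct {
		name string
		data []byte
	}{
		{"long scan", []byte(`"hello, world"`)},
		{"early break", []byte(`"\hello, world"`)},
	}
	for _, in := range inputs {
		fmt.Printf("%-12s %s\n", in.name, benchInput(in.data))
	}
}
```

Running the same binary built with a toolchain before and after a CCMP change should give a per-call comparison for both the long-scan and early-break inputs.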
What did you expect to see?
Could we enhance the Go compiler to generate CCMP?
BTW, I searched and didn't find any issue about conditional instructions (there is just an old issue, #6011, about failing to generate conditional moves).
Comment From: gabyhelp
Related Issues
Related Code Changes
- cmd/compile: ARM comparisons with 0 incorrect on overflow
- cmd/compile/ssa: optimize the derivable known branch of If block
- cmd/compile: add rewrite rules for conditional instructions on arm64
- cmd/compile: fix incorrect rewriting to if condition
- cmd/compile/internal/ssa: optimize ARM64 code with TST
- cmd/compile: optimize ARM's comparision
- cmd/compile: combine RORW with logical ops on arm64
- cmd/compile: optimize arm's comparison
- cmd/compile: optimize the Phi values
Comment From: mknyszek
CC @golang/compiler
Comment From: gopherbot
Change https://go.dev/cl/698037 mentions this issue: cmd/compile: introduce CCMP generation
Comment From: gopherbot
Change https://go.dev/cl/698099 mentions this issue: cmd/compile: replace conditions to CCMP instructions on ARM64