Go version

go1.23.4 linux/arm64

Output of go env in your module/workspace:

GOARCH='arm64'

What did you do?

While investigating the performance of json package, I found unquoteBytes() spends extra time with the if c == '\\' || c == '"' || c < ' ' condition (src/encoding/json/decode.go#L1209), which could be improved by condidional comparisons (such as ARM64 CCMP instruction).

What did you see happen?

Currently, Go compiler can generate conditonal assignments (e.g., CSET, CSEL, CSINC), but it can not generate conditional comparisons (i.e., CCMP), which let you combine the results of multiple comparisons so you can perform a single test at the end. (see more details: The AArch64 processor (aka arm64), part 16: Conditional execution - The Old New Thing)

E.g., following are Go tests (https://godbolt.org/z/q6YMT7Msa) and corresponding C tests (https://godbolt.org/z/4qshde78o, compiled by GCC -O1). Go compiler generates CMP;BEQ insteald of CPM;CCMP for test2():

The Go tests:

func test1(c int) (r int) {
    if c > 0 {
        r = 1
    }
    return
}
// CMP     $0, R0
// CSET    GT, R0
// RET

func test2(a, b, c int) (r int) {
    if c == '\\' || c == '"' || c < ' ' {
        r = a
    } else {
        r = b
    }
    return
}
//   CMP     $92, R2
//   BEQ     pc28
//   CMP     $34, R2
//   BEQ     pc28
//   CMP     $32, R2
//   BLT     pc28
//   MOVD    R1, R0
// pc28:
//   RET     (R30)

The C tests:

int test1(int c) {
    if (c > 0) {
        return 1;
    }
    return 0;
}
// cmp     w0, 0
// cset    w0, gt
// ret

int test2(int a, int b, int c) {
    if (c == '\\' || c == '"' || c < ' ') {
        return a;
    } else {
        return b;
    }
}
// cmp     w2, 92
// mov     w3, 34
// ccmp    w2, w3, 4, ne
// ccmp    w2, 31, 4, ne
// csel    w0, w1, w0, gt
// ret

Measure performance

Conditional comparisons should generally improve the performance of conjunction/disjunction of conditions by &&/|| operators on ARM64 machines.

Following cases are simplified from unquoteBytes. I tested on ARM64 Neoverse-N1 (AmpereComputing Altra and AWS Graviton2 is similar), the C case (GCC -O3 generates CCMP) is much faster than the Go case (go1.23.4 generates CMP): 5.69s vs. 9.23s.

The Go Test:

package main

//go:nosplit
//go:noinline
func unquoteBytes(s []byte, len int) int {
    s = s[1:]
    r := 0

    for r < len {
        c := s[r]
        if c == '\\' || c == '"' || c < ' ' {
            break
        }
        r++
    }

    return r
}

func main() {
    data := []byte(`"hello, world"`)
    len := len(data)

    for i := 0; i < 1000*1000*500; i++ {
        unquoteBytes(data, len)
    }
}

The C Test

#include <string.h>

__attribute__((noinline, noipa))
int unquoteBytes(const char *data, int len) {
  data = data + 1;
  int r = 0;

  while (r < len) {
    char c = data[r];
    if (c == '\\' || c == '"' || c < ' ') {
      break;
    }
    r++;
  }

  return r;
}

int main() {
  const char *data = "\"hello, world\"";
  unsigned len = strlen(data);

  for (int i = 0; i < 1000 * 1000 * 500; i++) {
    unquoteBytes(data, len);
  }
}

As C may have less overhead than Go in function call and main, let's just compare the linux-perf samples of the loop:

Go results:

; 95.89% 35560 mytest mytest [.] main.unquoteBytes
;  3.97%  1473 mytest mytest [.] main.main

7381 :   73260:  add     x0, x0, #0x1  ; loop header
 699 :   73264:  cmp     x3, x0
   0 :   73268:  b.le    73288         ; loop exit
9698 :   7326c:  ldrb    w2, [x1, x0]
   0 :   73270:  cmp     w2, #0x5c
   0 :   73274:  b.eq    73288         ; loop exit
8688 :   73278:  cmp     w2, #0x22
   0 :   7327c:  b.eq    73288         ; loop exit
7908 :   73280:  cmp     w2, #0x20
   0 :   73284:  b.cs    73260
 661 :   73288:  ret

C results:

; 99.76% 22829 simp simp [.] unquoteBytes
;  0.20%    45 simp simp [.] main

12051 :   4006f8: cmp     x6, x2
    0 :   4006fc: b.eq    400724       ; loop exit
  121 :   400700: mov     x2, x4
  431 :   400704: ldrb    w3, [x0, x2]
  547 :   400708: add     x4, x2, #0x1
    0 :   40070c: cmp     w3, #0x5c
  221 :   400710: ccmp    w3, w5, #0x4, ne
 1233 :   400714: ccmp    w3, #0x1f, #0x0, ne
 6381 :   400718: b.hi    4006f8       ; loop header
  991 :   40071c: sub     w0, w2, #0x1
    0 :   400720: ret
    0 :   400724: mov     w0, w1
    0 :   400728: ret

The assembly instructions are much similar except the CMP and CCMP. If we just count samples related to the loop, the C case (CCMP) vs the Go case (CCMP): 21976 vs 35035 (+47%).

Since the input data may affect performance, I also tested data like "\hello, world", so the loop could break at the 1st comparison against \, then CCMP is still faster than CMP (1.00s vs. 1.29s).

What did you expect to see?

Could we enhance Go compiler to generate CCMP?

BTW. I searched and didn't find any issue about conditional instructions (there is just an old issue #6011 about failing to generate conditional move).

Comment From: gabyhelp

Related Issues

Related Code Changes

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

Comment From: mknyszek

CC @golang/compiler

Comment From: gopherbot

Change https://go.dev/cl/698037 mentions this issue: cmd/compile: introduce CCMP generation

Comment From: gopherbot

Change https://go.dev/cl/698099 mentions this issue: cmd/compile: replace conditions to CCMP instructions on ARM64