At commit 266b0cff18 from earlier today (the behavior also reproduces with some older toolchains; I'm not claiming it is new), suppose you run:

GOOS=linux GOARCH=amd64 go test -c math/big
go tool objdump -s big.nat.scan big.test

I've attached big.zip, which contains the test executable (big.test.old) and the objdump output, lightly annotated (old.asm).

Now consider this line-number-preserving diff, which introduces a new variable (bZ) that must be saved on the stack across a function call.

diff --git a/src/math/big/natconv.go b/src/math/big/natconv.go
index ce94f2cf72..a92343befe 100644
--- a/src/math/big/natconv.go
+++ b/src/math/big/natconv.go
@@ -121,9 +121,9 @@ func (z nat) scan(r io.ByteScanner, base int, fracOk bool) (res nat, b, count in
    prev := '.'
    invalSep := false

-   // one char look-ahead
-   ch, err := r.ReadByte()
-
+   bZ := base + 1
+   ch, err := r.ReadByte()
+   base = bZ - 1
    // determine actual base
    b, prefix := base, 0
    if base == 0 {

The big.zip attachment also contains the resulting test executable (big.test.new) and the objdump output (new.asm).

On x86-64, this diff makes the 'Scan/1000/Base10' benchmark run 14% faster! Quite an optimization!

goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor            
                    │     old     │                 new                 │
                    │   sec/op    │   sec/op     vs base                │
Scan/1000/Base10-32   4.694µ ± 0%   4.032µ ± 0%  -14.11% (p=0.000 n=20)
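
(For what it's worth, the tables in this report are benchstat output. Assuming the attached test binaries and the standard test-binary flags, something like the following reproduces the comparison; the exact invocation is mine, not from the attachment.)

./big.test.old -test.run='^$' -test.bench=Scan/1000/Base10 -test.count=20 >old.txt
./big.test.new -test.run='^$' -test.bench=Scan/1000/Base10 -test.count=20 >new.txt
benchstat old.txt new.txt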

How can this be? It turns out that the compiler is making poor choices when spilling to the stack, and for whatever reason, that extra temporary causes different, better choices. The nat.scan method has a loop that looks like:

for err == nil {
    if ch == '.' && fracOk {
        ...
    } else if ch == '_' && base == 0 {
        ...
    } else {
        ...
    }
    ch, err = r.ReadByte()
}

That last ReadByte call is a basic block of its own, with three possible predecessor blocks. There is a slice variable z in the code that is live across that call.
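
For context, here is a minimal example with the same shape (my own sketch, not the math/big code). Because the register-based Go ABI has no callee-saved registers, everything live across the call, including the len and cap words of z, has to be spilled to the stack before the call and reloaded afterwards; the only question is which stack slots get used.

package main

import (
	"bytes"
	"fmt"
	"io"
)

// scanLike mimics the shape of nat.scan's loop: z is built up inside the loop
// and is still needed after each ReadByte call, so its pointer, len, and cap
// are all live across the call and must be saved somewhere.
func scanLike(r io.ByteScanner) []byte {
	var z []byte
	ch, err := r.ReadByte()
	for err == nil {
		z = append(z, ch)
		ch, err = r.ReadByte() // z is live across this call
	}
	return z
}

func main() {
	// *bytes.Reader implements io.ByteScanner.
	fmt.Printf("%q\n", scanLike(bytes.NewReader([]byte("12345"))))
}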

In the slow version of the code (without bZ), the compiler chooses to store len(z) in 0xb0(SP) and cap(z) in 0xb8(SP) pretty consistently throughout the function, except in this one basic block, where it swaps their locations. On entry to this basic block, len(z) is in R8, and cap(z) is in SI. The assembly, lightly annotated, is:

> natconv.go:243    0x5fc72f        4c898424b8000000    MOVQ R8, 0xb8(SP)   // len(z) (NOT cap!)
  natconv.go:243    0x5fc737        4c89ac24d0000000    MOVQ R13, 0xd0(SP)
> natconv.go:243    0x5fc73f        4889b424b0000000    MOVQ SI, 0xb0(SP) // cap(z) (NOT len!)
  natconv.go:243    0x5fc747        4c897c2470      MOVQ R15, 0x70(SP)
  natconv.go:242    0x5fc74c        4889542460      MOVQ DX, 0x60(SP)
  natconv.go:231    0x5fc751        4889bc24a0000000    MOVQ DI, 0xa0(SP)
  natconv.go:171    0x5fc759        4488642443      MOVB R12, 0x43(SP)
  natconv.go:219    0x5fc75e        498b4b18        MOVQ 0x18(R11), CX
  natconv.go:219    0x5fc762        ffd1            CALL CX
  natconv.go:170    0x5fc764        488b9424c0000000    MOVQ 0xc0(SP), DX
  natconv.go:170    0x5fc76c        488bb42410010000    MOVQ 0x110(SP), SI // could be just R11
  natconv.go:170    0x5fc774        488b7c2458      MOVQ 0x58(SP), DI // could be just R10
  natconv.go:170    0x5fc779        4c8b842420010000    MOVQ 0x120(SP), R8 // could be just R9
  natconv.go:170    0x5fc781        4c8b8c2418010000    MOVQ 0x118(SP), R9 // dead
  natconv.go:170    0x5fc789        4c8b542468      MOVQ 0x68(SP), R10 // dead
  natconv.go:170    0x5fc78e        4c8b5c2470      MOVQ 0x70(SP), R11 // dead
  natconv.go:170    0x5fc793        4c8ba424b8000000    MOVQ 0xb8(SP), R12 // len(z) (NOT cap!); dead
  natconv.go:170    0x5fc79b        4c8bac24d0000000    MOVQ 0xd0(SP), R13
  natconv.go:170    0x5fc7a3        4c8bbc24b0000000    MOVQ 0xb0(SP), R15 // cap(z) (NOT len!); dead
  natconv.go:110    0x5fc7ab        4d89c1          MOVQ R8, R9
  natconv.go:212    0x5fc7ae        4989fa          MOVQ DI, R10
  natconv.go:219    0x5fc7b1        4989f3          MOVQ SI, R11
  natconv.go:233    0x5fc7b4        4c8b642448      MOVQ 0x48(SP), R12
  natconv.go:213    0x5fc7b9        4c8b7c2450      MOVQ 0x50(SP), R15
  natconv.go:171    0x5fc7be        4189c0          MOVL AX, R8
  natconv.go:227    0x5fc7c1        8b742444        MOVL 0x44(SP), SI
  natconv.go:231    0x5fc7c5        488bbc24a0000000    MOVQ 0xa0(SP), DI
  natconv.go:219    0x5fc7cd        488b842418010000    MOVQ 0x118(SP), AX // dead
> natconv.go:243    0x5fc7d5        488b8424b8000000    MOVQ 0xb8(SP), AX // len(z) (NOT cap!)
> natconv.go:171    0x5fc7dd        4488842480000000    MOVB R8, 0x80(SP) // temporary
> natconv.go:243    0x5fc7e5        4c8b8424b0000000    MOVQ 0xb0(SP), R8 // cap(z) (NOT len!)
> natconv.go:243    0x5fc7ed        4c898424b8000000    MOVQ R8, 0xb8(SP)   // cap(z)
> natconv.go:243    0x5fc7f5        48898424b0000000    MOVQ AX, 0xb0(SP) // len(z)
> natconv.go:219    0x5fc7fd        488b842418010000    MOVQ 0x118(SP), AX
> natconv.go:171    0x5fc805        440fb6842480000000  MOVZX 0x80(SP), R8

The important part is the lines marked with a leading >. At the start of the block, the compiler stores len(z) and cap(z) in the slots usually used for cap(z) and len(z). At the end of the block, the compiler must now correct the stores to match where other parts of the function will expect to load from. The marked sequence is a bit confusing, but what it does is:

  • Reload len(z) from 0xb8(SP) into AX.
  • Spill a byte from R8, which holds the ch result of the call, to 0x80(SP).
  • Reload cap(z) from 0xb0(SP) into R8, and then store it into 0xb8(SP).
  • Store len(z) from AX into 0xb0(SP).
  • Reload AX with the value it held before the marked sequence (0x118(SP)).
  • Reload R8 from its spilled stack location, zero-extending the byte into a full word.

If the marked stores at the top of the basic block wrote to 0xb0(SP) and 0xb8(SP) instead of 0xb8(SP) and 0xb0(SP), then all the marked lines at the bottom could be deleted entirely. In the bZ version of the code, something about the extra temporary leads the compiler to make better choices and do exactly that.

We can also make that change directly. The big.zip attachment contains fixbig.go, which reads big.test.old, makes exactly those edits (swap the two store locations and delete the marked lines at the end), and writes big.test.old.fix. Sure enough, big.test.old.fix runs about as fast as big.test.new does.
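
For the curious, here is a sketch of what such a binary patch can look like; it is keyed off the addresses and encodings in the listing above, and I'm not claiming it matches the attached fixbig.go. It rewrites the disp32 bytes of the two marked stores (0xb8 <-> 0xb0) and overwrites the marked fix-up sequence at the end of the block with NOPs, since bytes can't simply be deleted from the middle of the text segment without relinking.

package main

import (
	"bytes"
	"debug/elf"
	"log"
	"os"
)

func main() {
	data, err := os.ReadFile("big.test.old")
	if err != nil {
		log.Fatal(err)
	}
	f, err := elf.NewFile(bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	// Map a virtual address from the objdump listing to a file offset.
	off := func(vaddr uint64) int {
		for _, p := range f.Progs {
			if p.Type == elf.PT_LOAD && vaddr >= p.Vaddr && vaddr < p.Vaddr+p.Filesz {
				return int(vaddr - p.Vaddr + p.Off)
			}
		}
		log.Fatalf("no loadable segment contains %#x", vaddr)
		return 0
	}
	// Sanity-check the patch sites against the listing before touching anything.
	if data[off(0x5fc72f)+4] != 0xb8 || data[off(0x5fc73f)+4] != 0xb0 {
		log.Fatal("bytes at patch sites do not match the listing; wrong binary?")
	}
	// MOVQ R8, 0xb8(SP) -> MOVQ R8, 0xb0(SP): the disp32 starts 4 bytes into
	// the instruction (REX, opcode, ModRM, SIB, then displacement).
	data[off(0x5fc72f)+4] = 0xb0
	// MOVQ SI, 0xb0(SP) -> MOVQ SI, 0xb8(SP).
	data[off(0x5fc73f)+4] = 0xb8
	// NOP out the marked fix-up sequence: 0x5fc7d5 through the 9-byte MOVZX
	// at 0x5fc805, 57 bytes in total.
	for i, n := off(0x5fc7d5), 0; n < 57; i, n = i+1, n+1 {
		data[i] = 0x90
	}
	if err := os.WriteFile("big.test.old.fix", data, 0o755); err != nil {
		log.Fatal(err)
	}
}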

Aside 1. Note that the line numbers are bogus. The ReadByte call is line 219. The rest is spilling in preparation for the call and reloading in preparation for falling through to the next basic block. I would argue that the initial spills should be given the first line number from the basic block's actual code, and the cleanup should be given the last line number from the actual code, so that every instruction in this basic block would say line 219. It looks like maybe the spill/load code is assigned the line number of the final mention of that variable in the function. As it stands now, the resulting scattershot line numbering makes source-level profiling almost worthless.

Aside 2. Note the number of loads in the disassembly that I've marked // dead. The registers these instructions load are overwritten later in the basic block without ever being read. It seems like a basic peephole optimizer could remove them, although it would be better not to generate them in the first place. There are also a few loads marked // could be just ..., which load into a register whose only use is to be moved into a final destination register later. These suboptimal load sequences persist in the bZ version that is 14% faster. I wrote a version of fixbig.go that removes the lines I marked // dead (except the final one before the > section at the bottom, which stops being dead when that section is deleted), and it runs 17% faster than the base:

goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor            
                    │     old     │                 fix                 │
                    │   sec/op    │   sec/op     vs base                │
Scan/1000/Base10-32   4.694µ ± 0%   3.859µ ± 1%  -17.80% (p=0.000 n=20)

This is not the tightest loop in the world (it contains an indirect call!), and yet these basic cleanups give a very significant speedup. I wonder whether this is an isolated case or whether this happens in basically all the code we generate.
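
To make Aside 2 concrete, here is a toy sketch of the check behind the // dead annotations; the instruction model is entirely my own and has nothing to do with cmd/compile's internals. Within a single basic block, a plain reload from the stack is removable if its destination register is overwritten before any later instruction reads it; a real pass would also have to consult liveness out of the block and know that the load has no side effects.

package main

import "fmt"

// instr is a toy model of one instruction in a basic block.
type instr struct {
	text string   // disassembly, for printing only
	dst  string   // register written, if any
	srcs []string // registers read
	load bool     // a plain reload from a stack slot
}

// deadLoads returns the indexes of loads whose destination register is
// overwritten before any later instruction in the block reads it.
func deadLoads(block []instr) []int {
	var dead []int
Loads:
	for i, in := range block {
		if !in.load {
			continue
		}
		for _, later := range block[i+1:] {
			for _, s := range later.srcs {
				if s == in.dst {
					continue Loads // read before overwrite: keep it
				}
			}
			if later.dst == in.dst {
				dead = append(dead, i) // overwritten before any read
				continue Loads
			}
		}
	}
	return dead
}

func main() {
	block := []instr{
		{text: "MOVQ 0x68(SP), R10", dst: "R10", load: true}, // like the // dead loads above
		{text: "MOVQ DI, R10", dst: "R10", srcs: []string{"DI"}},
	}
	fmt.Println(deadLoads(block)) // [0]
}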

Comment From: gopherbot

Change https://go.dev/cl/650640 mentions this issue: math/big: optimize atoi of base 2, 4, 16

Comment From: mknyszek

CC @golang/compiler

Comment From: mcy

I've done a lot of reading of gc's output and this is very much par for the course. gc is very bad at making many regalloc decisions, some of which show up in the assembly listings here: https://mcyoung.xyz/2024/12/12/go-abi/ (although I don't comment on them there, as I explain below). Admittedly it's hard to tell whether a given spill is necessary so that the GC can scan the stack, but I have seen a lot of this behavior both for non-pointers and for pointers that should not be marked as such in a GC shape, such as the itab pointer of an interface fat pointer.

I often find that gc is very bad at things I would not expect to be an issue coming out of SSA, where you run cleanups between passes the way LLVM does (DCE, DSE, and mem2reg-style cleanups). This includes unnecessary spills, but also lots of dead stores, both to registers and to memory. The lack of callee-saved registers also hurts a lot, because it means every call forces spills.

I would be filing bugs, but I have always had the impression that Go doesn't care about codegen micro-optimizations the way that I (an LLVM-shaped person) do. The most I do is email @dr2chase once a quarter whenever I encounter something obviously wrong (wrong in the "this code was written by GCC in the 90s" sense -- surprisingly, I have not hit a miscompile in official releases).

If this expectation is based on stale information... well, let me know, and I'll have a lot of missed optimizations to share. One that irked me recently is that Go cannot optimize if a || b {} into if a | b {} to avoid an extra branch when both a and b are pure, such as when doing integer comparisons.
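
For example (my illustration of the pattern, not code from this issue): both comparisons below are pure, so evaluating them both and OR-ing the results would be a valid lowering with a single branch, but as far as I can tell you get the short-circuit branches today.

package main

import "fmt"

// inRange joins two pure integer comparisons with ||. The short-circuit is
// unobservable here, so computing both comparisons and combining them
// branchlessly (one branch instead of two) would be a valid lowering.
func inRange(x, lo, hi int) bool {
	return !(x < lo || x > hi)
}

func main() {
	fmt.Println(inRange(5, 1, 10), inRange(42, 1, 10))
}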

(If you're wondering where I get them. Well, I kinda just torture compilers to find missed optimizations for fun, and the current job has me writing high performance Go, so there's been a lot of opportunities...)

Comment From: randall77

Ok, we've got a couple issues here.

First, the computation of whether there is a call-free path through the loop is incorrect. The compiler thinks there is a call-free path through the loop, but there isn't. This matters because the compiler is more aggressive about keeping things in registers when such a path exists, but that extra effort is only harmful when no such path exists. This will be an easy fix.
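
To illustrate what "call-free path" means here (toy examples of mine, not the code the fix touches): in the first loop below every trip through the body ends in a ReadByte call, so there is no call-free path and aggressively keeping values in registers across the body only creates spill/reload traffic; in the second loop the call sits on a conditional path, so being aggressive can pay off for the iterations that skip it.

package main

import (
	"bytes"
	"fmt"
	"io"
)

// noCallFreePath: the ReadByte call is on every path through the loop body.
func noCallFreePath(r io.ByteScanner) int {
	n := 0
	ch, err := r.ReadByte()
	for err == nil {
		n += int(ch)
		ch, err = r.ReadByte()
	}
	return n
}

// hasCallFreePath: iterations that take the else branch never make a call.
func hasCallFreePath(r io.ByteScanner, buf []byte) int {
	n := 0
	for _, b := range buf {
		if b == 0 {
			c, err := r.ReadByte()
			if err != nil {
				break
			}
			n += int(c)
		} else {
			n += int(b)
		}
	}
	return n
}

func main() {
	fmt.Println(
		noCallFreePath(bytes.NewReader([]byte{1, 2, 3})),
		hasCallFreePath(bytes.NewReader([]byte{9}), []byte{7, 0, 4}),
	)
}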

Second, the shuffle part of regalloc is sub-optimal. For instance, it decides that we want to move some value that is in a stack slot to a register. So it issues an unspill (a load) to put it in a register. Then we have a subsequent shuffle that needs to move a value from one stack slot to another. It needs a register to do that. If there is no free register, it will steal one from a value that's already in 2 locations. The value from the previous load is a candidate, because it is in both a stack slot and a register. Once that register is stolen to do the stack slot -> stack slot move, the original unspill is now dead. It will have to be issued again (and we don't cancel the previous one). This one may be a bit harder to fix. I think some reordering of how we do the shuffle might help.

The spill-to-the-wrong-slot thing I'm not sure about yet. I suspect it is just that there is no lookahead when deciding where to spill. It also isn't spilling to the argument slot, which is odd.

Comment From: gopherbot

Change https://go.dev/cl/680775 mentions this issue: cmd/compile: fix containsUnavoidableCall computation