At commit 266b0cff18 from earlier today (the behavior also reproduces with some older toolchains; I'm not claiming it is new), suppose you run:
GOOS=linux GOARCH=amd64 go test -c math/big
go tool objdump -s big.nat.scan big.test
I've attached big.zip, which contains the test executable (big.test.old) and the objdump output, lightly annotated (old.asm).
Now consider this line-number-preserving diff, which introduces a new variable (bZ) that must be saved on the stack across a function call.
diff --git a/src/math/big/natconv.go b/src/math/big/natconv.go
index ce94f2cf72..a92343befe 100644
--- a/src/math/big/natconv.go
+++ b/src/math/big/natconv.go
@@ -121,9 +121,9 @@ func (z nat) scan(r io.ByteScanner, base int, fracOk bool) (res nat, b, count in
 	prev := '.'
 	invalSep := false
 
-	// one char look-ahead
-	ch, err := r.ReadByte()
-
+	bZ := base + 1
+	ch, err := r.ReadByte()
+	base = bZ - 1
 	// determine actual base
 	b, prefix := base, 0
 	if base == 0 {
The big.zip attachment also contains the resulting test executable (big.test.new) and the objdump output (new.asm).
On x86-64, this diff makes the 'Scan/1000/Base10' benchmark run 14% faster! Quite an optimization!
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/1000/Base10-32 4.694µ ± 0% 4.032µ ± 0% -14.11% (p=0.000 n=20)
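(For reference, a benchstat table like this can be produced from the attached test binaries with commands along these lines; the exact flags are illustrative:)

./big.test.old -test.run='^$' -test.bench='Scan/1000/Base10' -test.count=20 > old.txt
./big.test.new -test.run='^$' -test.bench='Scan/1000/Base10' -test.count=20 > new.txt
benchstat old.txt new.txt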
How can this be? It turns out that the compiler is making poor choices when spilling to the stack, and for whatever reason, that extra temporary causes different, better choices. The nat.scan method has a loop that looks like:
for err == nil {
	if ch == '.' && fracOk {
		...
	} else if ch == '_' && base == 0 {
		...
	} else {
		...
	}
	ch, err = r.ReadByte()
}
That last ReadByte call is a basic block of its own, with three possible predecessor blocks. There is a slice variable z in the code that is live across that call.
In the slow version of the code (without bZ), the compiler chooses to store len(z) in 0xb0(SP) and cap(z) in 0xb8(SP) pretty consistently throughout the function, except in this one basic block, where it swaps their locations. On entry to this basic block, len(z) is in R8, and cap(z) is in SI. The assembly, lightly annotated, is:
> natconv.go:243 0x5fc72f 4c898424b8000000 MOVQ R8, 0xb8(SP) // len(z) (NOT cap!)
natconv.go:243 0x5fc737 4c89ac24d0000000 MOVQ R13, 0xd0(SP)
> natconv.go:243 0x5fc73f 4889b424b0000000 MOVQ SI, 0xb0(SP) // cap(z) (NOT len!)
natconv.go:243 0x5fc747 4c897c2470 MOVQ R15, 0x70(SP)
natconv.go:242 0x5fc74c 4889542460 MOVQ DX, 0x60(SP)
natconv.go:231 0x5fc751 4889bc24a0000000 MOVQ DI, 0xa0(SP)
natconv.go:171 0x5fc759 4488642443 MOVB R12, 0x43(SP)
natconv.go:219 0x5fc75e 498b4b18 MOVQ 0x18(R11), CX
natconv.go:219 0x5fc762 ffd1 CALL CX
natconv.go:170 0x5fc764 488b9424c0000000 MOVQ 0xc0(SP), DX
natconv.go:170 0x5fc76c 488bb42410010000 MOVQ 0x110(SP), SI // could be just R11
natconv.go:170 0x5fc774 488b7c2458 MOVQ 0x58(SP), DI // could be just R10
natconv.go:170 0x5fc779 4c8b842420010000 MOVQ 0x120(SP), R8 // could be just R9
natconv.go:170 0x5fc781 4c8b8c2418010000 MOVQ 0x118(SP), R9 // dead
natconv.go:170 0x5fc789 4c8b542468 MOVQ 0x68(SP), R10 // dead
natconv.go:170 0x5fc78e 4c8b5c2470 MOVQ 0x70(SP), R11 // dead
natconv.go:170 0x5fc793 4c8ba424b8000000 MOVQ 0xb8(SP), R12 // len(z) (NOT cap!); dead
natconv.go:170 0x5fc79b 4c8bac24d0000000 MOVQ 0xd0(SP), R13
natconv.go:170 0x5fc7a3 4c8bbc24b0000000 MOVQ 0xb0(SP), R15 // cap(z) (NOT len!); dead
natconv.go:110 0x5fc7ab 4d89c1 MOVQ R8, R9
natconv.go:212 0x5fc7ae 4989fa MOVQ DI, R10
natconv.go:219 0x5fc7b1 4989f3 MOVQ SI, R11
natconv.go:233 0x5fc7b4 4c8b642448 MOVQ 0x48(SP), R12
natconv.go:213 0x5fc7b9 4c8b7c2450 MOVQ 0x50(SP), R15
natconv.go:171 0x5fc7be 4189c0 MOVL AX, R8
natconv.go:227 0x5fc7c1 8b742444 MOVL 0x44(SP), SI
natconv.go:231 0x5fc7c5 488bbc24a0000000 MOVQ 0xa0(SP), DI
natconv.go:219 0x5fc7cd 488b842418010000 MOVQ 0x118(SP), AX // dead
> natconv.go:243 0x5fc7d5 488b8424b8000000 MOVQ 0xb8(SP), AX // len(z) (NOT cap!)
> natconv.go:171 0x5fc7dd 4488842480000000 MOVB R8, 0x80(SP) // temporary
> natconv.go:243 0x5fc7e5 4c8b8424b0000000 MOVQ 0xb0(SP), R8 // cap(z) (NOT len!)
> natconv.go:243 0x5fc7ed 4c898424b8000000 MOVQ R8, 0xb8(SP) // cap(z)
> natconv.go:243 0x5fc7f5 48898424b0000000 MOVQ AX, 0xb0(SP) // len(z)
> natconv.go:219 0x5fc7fd 488b842418010000 MOVQ 0x118(SP), AX
> natconv.go:171 0x5fc805 440fb6842480000000 MOVZX 0x80(SP), R8
The important part is the lines marked with a leading >. At the start of the block, the compiler stores len(z) and cap(z) in the slots usually used for cap(z) and len(z). At the end of the block, the compiler must correct those stores to match where the rest of the function expects to load from. The marked sequence is a bit confusing, but what it does is:
- Reload len(z) from 0xb8(SP) into AX.
- Spill a byte from R8, which holds the ch result of the call, to 0x80(SP).
- Reload cap(z) from 0xb0(SP) into R8, and then store it into 0xb8(SP).
- Store len(z) from AX into 0xb0(SP).
- Reload AX with the value it held before the marked sequence (0x118(SP)).
- Reload R8 from its spilled stack location, zero-extending the byte into a full word.
If the marked stores at the top of the basic block wrote to 0xb0(SP) and 0xb8(SP) instead of 0xb8(SP) and 0xb0(SP), then all the marked lines at the bottom could be deleted entirely. In the bZ version of the code, something about the extra temporary leads the compiler to make better choices and do exactly that.
We can also make that change directly. The big.zip attachment contains fixbig.go, which reads big.test.old, makes exactly those edits (swap the two store locations and delete the marked lines at the end), and writes big.test.old.fix. Sure enough, big.test.old.fix runs about as fast as big.test.new does.
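For concreteness, here is a minimal sketch (mine, not the attached fixbig.go) of the kind of byte-level patch involved. It assumes the instruction encodings from the listing above are unique in the binary, and it pads with one-byte NOPs rather than deleting instructions, since the code cannot shrink without relinking:

package main

import (
	"bytes"
	"encoding/hex"
	"log"
	"os"
)

func hx(s string) []byte {
	b, err := hex.DecodeString(s)
	if err != nil {
		log.Fatal(err)
	}
	return b
}

func main() {
	data, err := os.ReadFile("big.test.old")
	if err != nil {
		log.Fatal(err)
	}

	// Find the top of the basic block by its first four instruction encodings
	// (taken from the listing above), then swap the two store displacements:
	// MOVQ R8, 0xb8(SP) becomes MOVQ R8, 0xb0(SP), and
	// MOVQ SI, 0xb0(SP) becomes MOVQ SI, 0xb8(SP).
	top := hx("4c898424b8000000" + "4c89ac24d0000000" + "4889b424b0000000" + "4c897c2470")
	i := bytes.Index(data, top)
	if i < 0 {
		log.Fatal("block prologue not found")
	}
	data[i+4] = 0xb0  // displacement byte of MOVQ R8, 0xb8(SP)
	data[i+20] = 0xb8 // displacement byte of MOVQ SI, 0xb0(SP)

	// Overwrite the corrective sequence at the bottom of the block
	// (the > lines from 0x5fc7d5 through 0x5fc805) with one-byte NOPs.
	tail := hx("488b8424b8000000" + "4488842480000000" + "4c8b8424b0000000" +
		"4c898424b8000000" + "48898424b0000000" + "488b842418010000" + "440fb6842480000000")
	j := bytes.Index(data, tail)
	if j < 0 {
		log.Fatal("block epilogue not found")
	}
	for k := range tail {
		data[j+k] = 0x90 // NOP
	}

	if err := os.WriteFile("big.test.old.fix", data, 0o755); err != nil {
		log.Fatal(err)
	}
}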
Aside 1. Note that the line numbers are bogus. The ReadByte call is line 219. The rest is spilling in preparation for the call and reloading in preparation for falling through to the next basic block. I would argue that the initial spills should be given the first line number from the basic block's actual code, and the cleanup should be given the last line number from the actual code, so that every instruction in this basic block would say line 219. It looks like maybe the spill/load code is assigned the line number of the final mention of that variable in the function. As it stands now, the resulting scattershot line numbering makes source-level profiling almost worthless.
Aside 2. Note the number of loads in the disassembly that I've marked // dead. The registers these instructions load are overwritten later in the basic block without being used. It seems like a basic peephole optimizer could remove them, although better would be not to generate them in the first place. There are also a few, marked // could be just, that load into a register whose only use is to move into a final destination register later. These suboptimal load sequences persist in the bZ version that is 14% faster. I wrote a version of fixbig.go that removes the lines I marked // dead (except the final one before the > section at the bottom, which stops being dead when that section is deleted), and it runs 17% faster than the base:
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ old │ fix │
│ sec/op │ sec/op vs base │
Scan/1000/Base10-32 4.694µ ± 0% 3.859µ ± 1% -17.80% (p=0.000 n=20)
This is not the tightest loop in the world (it contains an indirect call!) and yet these basic cleanups give a very significant speedup. I wonder whether this is an isolated case or whether it happens in basically all code we generate.
Comment From: gabyhelp
Related Issues
- cmd/compile: slice hint seems to perform better than loop bound check #32492
- Performance regression from 98cb76799c "cmd/compile: insert complicated x86 addressing modes as a separate pass" #37955 (closed)
- cmd/compile: increased line number churn #20367 (closed)
- cmd/compile: regression: unnecessary spilling to the stack #61356 (closed)
- cmd/compile: loads combining regression #18946 (closed)
- cmd/compile: registerization of large functions worse than in Go 1.2 #8214 (closed)
Related Code Changes
- cmd/compile: avoid a spill in append fast path
- cmd/internal/obj/arm64: make function prologue more predictable
- cmd/compile/internal/obj/x86: eliminate some function prologues
- cmd/compile: move value around before kick it out of register
Comment From: gopherbot
Change https://go.dev/cl/650640 mentions this issue: math/big: optmize atoi of base 2, 4, 16
Comment From: mknyszek
CC @golang/compiler
Comment From: mcy
I've done a lot of reading of gc's output and this is par for the course. gc is very bad at making many regalloc decisions, some of which show up in the assembly listings here: https://mcyoung.xyz/2024/12/12/go-abi/ (although I don't comment on them, as I explain below). Admittedly it's hard to tell when a spill is necessary for the GC to scan the stack or not, but I have seen a lot of this behavior both for non-pointers and for pointers that should not be marked as such in a GC shape, such as the itab pointer on an interface fat pointer.
I often find that gc is very bad at things that I would expect not to be an issue coming out of SSA, where you do cleanups between passes like LLVM does (such as DCE, DSE, and mem2reg cleanups). This includes unnecessary spills, but also lots of dead stores, both to registers and to memory. Also, the lack of callee-saved registers hurts a lot, because it means every call forces spills.
I would be filing bugs, but I have always had the expectation that Go doesn't care about codegen micro-optimizations in the way that I (an LLVM-shaped person) do. The most I do is email @dr2chase once a quarter whenever I encounter something obviously wrong (wrong in the "this code was written by GCC in the 90s" sense; surprisingly, I have not hit a miscompile in official releases).
If this expectation is based on stale information... well, let me know, and I'll have a lot of missed optimizations to share. One that irked me recently is that Go cannot optimize if a || b {} into if a | b {} to avoid an extra branch when both a and b are pure, such as when doing integer comparisons.
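A minimal example of that pattern (the function is mine, purely illustrative): both comparisons are pure integer compares, so in principle they could be evaluated unconditionally and folded into one test and one branch, but today each operand of || gets its own conditional branch.

package example

// outOfRange reports whether x lies outside [lo, hi].
func outOfRange(x, lo, hi int) bool {
	if x < lo || x > hi { // two pure comparisons, currently two branches
		return true
	}
	return false
}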
(If you're wondering where I get them. Well, I kinda just torture compilers to find missed optimizations for fun, and the current job has me writing high performance Go, so there's been a lot of opportunities...)
Comment From: randall77
Ok, we've got a couple issues here.
First, the computation of whether there is a call-free path through the loop is incorrect. The compiler thinks there is a call-free path through the loop, but there isn't. This matters because the compiler is more aggressive about keeping things in registers when such a path exists, but that extra effort is only harmful when no such path exists. This will be an easy fix.
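To illustrate what a call-free path means here (my example, not the compiler's actual analysis): in the nat.scan loop every path through the body reaches the r.ReadByte call, whereas in a loop like the one below the continue path skips the call, so keeping hot values in registers across the body can pay off.

package example

// sum adds up the results of a hypothetical function f over the nonzero
// elements of xs. The continue branch is a call-free path through the loop
// body, so the register allocator can usefully keep total and the loop
// variables in registers across iterations.
func sum(xs []int, f func(int) int) int {
	total := 0
	for _, x := range xs {
		if x == 0 {
			continue // call-free path: no call on this iteration
		}
		total += f(x) // the only call in the body
	}
	return total
}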
Second, the shuffle part of regalloc is sub-optimal. For instance, it decides that we want to move some value that is in a stack slot to a register. So it issues an unspill (a load) to put it in a register. Then we have a subsequent shuffle that needs to move a value from one stack slot to another. It needs a register to do that. If there is no free register, it will steal one from a value that's already in 2 locations. The value from the previous load is a candidate, because it is in both a stack slot and a register. Once that register is stolen to do the stack slot -> stack slot move, the original unspill is now dead. It will have to be issued again (and we don't cancel the previous one).
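Schematically, the resulting code looks like this (registers and stack offsets are illustrative, not taken from the attached binaries):

	MOVQ 0xb0(SP), R8 // unspill v1 into R8
	MOVQ 0xc8(SP), R8 // R8 stolen to shuttle v2 from one stack slot...
	MOVQ R8, 0xd0(SP) // ...to another, which kills the copy of v1 in R8
	MOVQ 0xb0(SP), R8 // so v1's unspill is issued again; the first load is now dead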
This one may be a bit harder to fix. I think some reordering of how we do the shuffle might help.
The spill-to-the-wrong-slot thing I'm not sure about yet. I suspect it is just that there is no lookahead when deciding where to spill. It also isn't spilling to the argument slot, which is odd.
Comment From: gopherbot
Change https://go.dev/cl/680775 mentions this issue: cmd/compile: fix containsUnavoidableCall computation