I'm working on a patch for inlining known-size memclrNoHeapPointers calls and replacing them with an OpZero.

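For concreteness, here is a minimal sketch (not code from the CL) of the kind of code affected: a range-clear over a buffer whose length the compiler can prove at compile time. Today this idiom is lowered to a call to runtime.memclrNoHeapPointers with a constant size; the patch rewrites such known-size calls into an OpZero during SSA. The package and function names below are made up for illustration.

```go
package memclrexample

// clearBuf clears a 1 KiB buffer. The slice s has a compile-time-known
// length, so the range-clear below becomes a memclrNoHeapPointers call with
// a constant size argument -- exactly the case the patch turns into OpZero.
func clearBuf(buf *[1024]byte) {
	s := buf[:]
	for i := range s {
		s[i] = 0
	}
}
```
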
The inlining significantly speeds up clearing, but for some clear sizes, AMD64's OpZero is actually slower than calling memclrNoHeapPointers; see:

Intel Alder Lake 12600k:

name                      old time/op    new time/op    delta
MemclrKnownSize1-16       0.59ns ± 3%    0.38ns ± 6%    -36.00%  (p=0.000 n=19+18)
MemclrKnownSize2-16       0.57ns ± 1%    0.19ns ± 5%    -66.27%  (p=0.000 n=19+19)
MemclrKnownSize4-16       0.66ns ± 2%    0.36ns ±21%    -45.12%  (p=0.000 n=19+20)
MemclrKnownSize8-16       0.74ns ± 1%    0.30ns ±26%    -59.81%  (p=0.000 n=18+20)
MemclrKnownSize16-16      1.00ns ± 7%    0.21ns ± 8%    -79.51%  (p=0.000 n=20+19)
MemclrKnownSize32-16      0.95ns ± 1%    0.40ns ± 1%    -57.61%  (p=0.000 n=20+18)
MemclrKnownSize64-16      1.20ns ± 2%    0.41ns ± 0%    -65.82%  (p=0.000 n=20+18)
MemclrKnownSize112-16     1.27ns ± 2%    1.03ns ± 0%    -19.35%  (p=0.000 n=20+18)
MemclrKnownSize128-16     1.34ns ± 2%    1.03ns ± 0%    -23.02%  (p=0.000 n=20+20)
MemclrKnownSize192-16     1.92ns ± 2%    1.44ns ± 0%    -24.89%  (p=0.000 n=20+16)
MemclrKnownSize248-16     2.77ns ± 1%    3.29ns ± 0%    +18.81%  (p=0.000 n=20+16)
MemclrKnownSize256-16     1.92ns ± 1%    1.86ns ± 0%     -3.49%  (p=0.000 n=19+15)
MemclrKnownSize512-16     2.81ns ± 2%    3.49ns ± 0%    +24.15%  (p=0.000 n=20+17)
MemclrKnownSize1024-16    4.02ns ± 1%    6.78ns ± 0%    +68.44%  (p=0.000 n=20+18)
MemclrKnownSize4096-16    17.2ns ± 2%    14.4ns ± 0%    -16.73%  (p=0.000 n=20+17)
MemclrKnownSize512KiB-16  6.71µs ± 1%    6.52µs ± 0%     -2.85%  (p=0.000 n=20+18)
[Geo mean]                2.60ns         1.71ns         -34.06%

name                      old speed       new speed       delta
MemclrKnownSize1-16       1.71GB/s ± 3%   2.67GB/s ± 6%    +56.39%  (p=0.000 n=19+18)
MemclrKnownSize2-16       3.52GB/s ± 2%   10.43GB/s ± 6%  +196.04%  (p=0.000 n=20+20)
MemclrKnownSize4-16       6.06GB/s ± 1%   10.83GB/s ±11%   +78.63%  (p=0.000 n=19+18)
MemclrKnownSize8-16       10.7GB/s ± 1%   27.0GB/s ±21%   +151.49%  (p=0.000 n=18+20)
MemclrKnownSize16-16      16.0GB/s ± 8%   78.1GB/s ± 7%   +387.24%  (p=0.000 n=20+19)
MemclrKnownSize32-16      33.6GB/s ± 1%   79.4GB/s ± 1%   +135.89%  (p=0.000 n=20+18)
MemclrKnownSize64-16      53.3GB/s ± 2%   155.9GB/s ± 0%  +192.58%  (p=0.000 n=20+18)
MemclrKnownSize112-16     88.0GB/s ± 2%   109.1GB/s ± 0%   +23.97%  (p=0.000 n=20+18)
MemclrKnownSize128-16     95.3GB/s ± 2%   123.8GB/s ± 0%   +29.88%  (p=0.000 n=20+20)
MemclrKnownSize192-16     100GB/s ± 2%    133GB/s ± 0%     +33.12%  (p=0.000 n=20+17)
MemclrKnownSize248-16     89.7GB/s ± 1%   75.5GB/s ± 0%    -15.84%  (p=0.000 n=20+19)
MemclrKnownSize256-16     133GB/s ± 1%    138GB/s ± 0%      +3.61%  (p=0.000 n=19+14)
MemclrKnownSize512-16     182GB/s ± 2%    147GB/s ± 0%     -19.46%  (p=0.000 n=20+17)
MemclrKnownSize1024-16    254GB/s ± 1%    151GB/s ± 0%     -40.64%  (p=0.000 n=20+18)
MemclrKnownSize4096-16    237GB/s ± 2%    285GB/s ± 0%     +20.09%  (p=0.000 n=20+17)
MemclrKnownSize512KiB-16  78.2GB/s ± 1%   80.4GB/s ± 0%     +2.93%  (p=0.000 n=20+18)
[Geo mean]                42.1GB/s        63.8GB/s         +51.53%

This problem doesn't appear on arm64 (M1 Pro):

name                      old time/op    new time/op     delta
MemclrKnownSize1-10       2.03ns ± 0%    0.31ns ± 0%     -84.69%  (p=0.000 n=18+19)
MemclrKnownSize2-10       1.97ns ± 0%    0.31ns ± 0%     -84.19%  (p=0.000 n=12+19)
MemclrKnownSize4-10       2.02ns ± 0%    0.31ns ± 0%     -84.56%  (p=0.000 n=12+20)
MemclrKnownSize8-10       2.02ns ± 0%    0.31ns ± 0%     -84.59%  (p=0.000 n=14+19)
MemclrKnownSize16-10      2.15ns ± 0%    0.31ns ± 0%     -85.50%  (p=0.000 n=18+19)
MemclrKnownSize32-10      2.48ns ± 0%    0.31ns ± 0%     -87.48%  (p=0.000 n=20+19)
MemclrKnownSize64-10      1.93ns ± 0%    0.62ns ± 0%     -67.88%  (p=0.000 n=20+19)
MemclrKnownSize112-10     2.48ns ± 0%    1.80ns ± 0%     -27.74%  (p=0.000 n=19+20)
MemclrKnownSize128-10     10.0ns ±112%   2.0ns ± 0%      -79.76%  (p=0.000 n=18+17)
MemclrKnownSize192-10     27.4ns ±103%   2.6ns ± 0%      -90.38%  (p=0.000 n=16+19)
MemclrKnownSize248-10     9.67ns ±43%    3.26ns ± 0%     -66.29%  (p=0.000 n=19+19)
MemclrKnownSize256-10     85.4ns ±148%   3.3ns ± 0%      -96.18%  (p=0.000 n=20+20)
MemclrKnownSize512-10     223ns ±54%     6ns ± 0%        -97.42%  (p=0.000 n=18+20)
MemclrKnownSize1024-10    216ns ±26%     11ns ± 0%       -95.00%  (p=0.000 n=18+15)
MemclrKnownSize4096-10    265ns ± 2%     88ns ± 0%       -66.84%  (p=0.000 n=19+17)
MemclrKnownSize512KiB-10  9.91µs ± 1%    10.23µs ± 2%     +3.14%  (p=0.000 n=19+19)
[Geo mean]                15.6ns         2.5ns           -83.62%

name                      old speed        new speed        delta
MemclrKnownSize1-10       493MB/s ± 0%     3216MB/s ± 0%     +553.04%  (p=0.000 n=18+19)
MemclrKnownSize2-10       1.02GB/s ± 0%    6.43GB/s ± 0%     +532.33%  (p=0.000 n=16+19)
MemclrKnownSize4-10       1.99GB/s ± 0%    12.86GB/s ± 0%    +547.67%  (p=0.000 n=18+20)
MemclrKnownSize8-10       3.96GB/s ± 0%    25.72GB/s ± 0%    +548.81%  (p=0.000 n=19+19)
MemclrKnownSize16-10      7.46GB/s ± 0%    51.43GB/s ± 0%    +589.72%  (p=0.000 n=20+19)
MemclrKnownSize32-10      12.9GB/s ± 0%    102.9GB/s ± 0%    +698.60%  (p=0.000 n=20+18)
MemclrKnownSize64-10      33.1GB/s ± 0%    103.0GB/s ± 0%    +211.34%  (p=0.000 n=19+19)
MemclrKnownSize112-10     45.1GB/s ± 0%    62.4GB/s ± 0%      +38.38%  (p=0.000 n=19+20)
MemclrKnownSize128-10     13.3GB/s ±107%   63.5GB/s ± 0%     +378.03%  (p=0.000 n=19+18)
MemclrKnownSize192-10     6.97GB/s ±139%   72.72GB/s ± 0%    +943.44%  (p=0.000 n=19+19)
MemclrKnownSize248-10     25.9GB/s ±46%    76.1GB/s ± 0%     +194.16%  (p=0.000 n=20+17)
MemclrKnownSize256-10     8.64GB/s ±196%   78.51GB/s ± 0%    +808.19%  (p=0.000 n=20+20)
MemclrKnownSize512-10     2.33GB/s ±86%    89.13GB/s ± 0%   +3719.50%  (p=0.000 n=17+20)
MemclrKnownSize1024-10    4.85GB/s ±32%    94.93GB/s ± 0%   +1856.74%  (p=0.000 n=18+19)
MemclrKnownSize4096-10    15.4GB/s ± 2%    46.6GB/s ± 0%     +201.55%  (p=0.000 n=19+18)
MemclrKnownSize512KiB-10  52.9GB/s ± 1%    51.3GB/s ± 2%       -3.04%  (p=0.000 n=19+19)
[Geo mean]                7.54GB/s         42.86GB/s         +468.76%

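For reference, here is a hedged guess at the shape of the MemclrKnownSize benchmarks above (the real ones accompany the CL; the package name, sink variable, and exact form here are assumptions):

```go
package memclr_test

import "testing"

// memclrSink128 is a package-level buffer so the clear cannot be optimized away.
var memclrSink128 [128]byte

func BenchmarkMemclrKnownSize128(b *testing.B) {
	b.SetBytes(int64(len(memclrSink128)))
	for i := 0; i < b.N; i++ {
		// Range-clear of a fixed-size buffer: the compiler sees the constant
		// size, so the old binary calls memclrNoHeapPointers and the patched
		// one uses OpZero.
		for j := range memclrSink128 {
			memclrSink128[j] = 0
		}
	}
}
```
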
It seems that either OpZero could be improved on AMD64, or that for some clear sizes we could keep calling memclrNoHeapPointers instead.

I'll send the patch soon. Creating this issue so that it can be linked from the change.

Comment From: gopherbot

Change https://go.dev/cl/454255 mentions this issue: cmd/compile: inline known-size memclrNoHeapPointers calls

Comment From: randall77

The problem I see here is that OpZero is sometimes used in situations where we can't issue a call. Typically this is because it is being used to set up call arguments, so setting up the arguments for a memclrNoHeapPointers call would overwrite the arguments we've already set for some other call. Initializing stack frames is also a tricky time, as we can't do an interruptible call while the stack frame still has junk in it.

Maybe this problem only applies to OpCopy, though? I'm not sure that OpZero is used for setting up arguments, and since memclrNoHeapPointers is not interruptible, maybe it is OK.

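A hypothetical illustration of the constraint described above (whether the compiler actually emits an OpZero straight into the argument area here is exactly the open question, so treat this purely as a sketch; all names are made up):

```go
package argzero

// big is large enough that it is passed on the stack rather than in registers.
type big struct {
	pad [256]byte
}

//go:noinline
func f(a int, b big) {}

func caller() {
	// The zero value of b has to be materialized in f's outgoing argument
	// area. If that zeroing were an OpZero, replacing it with a call to
	// memclrNoHeapPointers could clobber arguments already stored for f.
	f(1, big{})
}
```
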
Comment From: jake-volvo

I think this is best solved by improving the DUFFZERO implementation. If you look at the benchmarks, it seems the problem areas all fall within the range handled by DUFFZERO.

We know the size of the clear, so we can skip the initial branching in memclrNoHeapPointers. The trick is to teach DUFFZERO to check for AVX support and to figure out whether the increased code size (would it increase?) pays off. We could also elide the check at higher GOAMD64 arch levels.

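To make that trade-off concrete, here is a toy sketch; the cutoffs and strategy names are placeholders and not the compiler's real lowering logic. The point is only that both the known clear size and AVX availability (statically known at higher GOAMD64 levels) would feed into the choice.

```go
package zerostrategy

// zeroStrategy picks a zeroing approach from a known size and AVX support.
// The thresholds are illustrative, not the amd64 backend's actual values.
func zeroStrategy(size int64, haveAVX bool) string {
	switch {
	case size <= 64:
		return "unrolled 16-byte stores"
	case size <= 1024 && haveAVX:
		return "duff-style 32-byte AVX stores"
	case size <= 1024:
		return "DUFFZERO"
	default:
		return "REP STOSQ or a runtime call"
	}
}
```
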
Comment From: laboger

On ppc64le, we have an instruction to clear blocks of 128 bytes, which is used in the memclr function and is more efficient than doing it this way. It sounds like these cases were recognized, since they are mentioned in this issue as well as in the commit message of the CL, but not addressed?

Comment From: jake-volvo

@laboger Is it possible to use that instruction in ppc64’s Zero implementation? Perhaps we can skip the inlining for ppc64 if it’s a problem.

Maybe you could check out this change and try running the memory-clearing benchmarks to see how it impacts performance. Thanks.

Comment From: laboger

@jake-volvo Not really, because there are different conditions to verify before you can use that instruction, such as the alignment of the target, which makes the code more complicated. It is better to do that within the memclr assembler function.

Why don't you do this the same way as the generic rules for runtime.memmove, which use the function isInlinableMemmove to determine when the rewrite should be done, with the condition depending on the target? You could define a similar function, isInlinableMemclr.

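A rough sketch of what such a helper could look like, modeled on isInlinableMemmove in cmd/compile/internal/ssa/rewrite.go (so *Config is the ssa package's existing config type); the architecture list and size thresholds below are placeholders, not values anyone has benchmarked:

```go
// isInlinableMemclr reports whether a clear of sz bytes should be inlined as
// an OpZero on this target instead of calling memclrNoHeapPointers. The
// per-architecture cutoffs are illustrative only.
func isInlinableMemclr(c *Config, sz int64) bool {
	if sz < 0 {
		return false
	}
	switch c.arch {
	case "amd64", "arm64":
		return true // hypothetical: always inline known-size clears
	case "ppc64", "ppc64le":
		// Hypothetical cutoff: keep large clears in the block-clearing runtime memclr.
		return sz < 512
	}
	return false
}
```
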
Here is the comparison with your change:

name                      old time/op    new time/op     delta
MemclrKnownSize1-64       2.07ns ± 0%    0.57ns ± 1%     -72.44%  (p=0.029 n=4+4)
MemclrKnownSize2-64       2.55ns ± 1%    0.57ns ± 0%     -77.66%  (p=0.029 n=4+4)
MemclrKnownSize4-64       5.14ns ± 0%    0.50ns ± 0%     -90.36%  (p=0.029 n=4+4)
MemclrKnownSize8-64       2.24ns ± 0%    0.57ns ± 1%     -74.58%  (p=0.029 n=4+4)
MemclrKnownSize16-64      2.23ns ± 0%    0.50ns ± 1%     -77.48%  (p=0.029 n=4+4)
MemclrKnownSize32-64      2.28ns ± 1%    0.57ns ± 1%     -74.95%  (p=0.029 n=4+4)
MemclrKnownSize64-64      2.27ns ± 0%    0.72ns ± 0%     -68.13%  (p=0.029 n=4+4)
MemclrKnownSize112-64     2.95ns ± 0%    1.15ns ± 0%     -60.89%  (p=0.029 n=4+4)
MemclrKnownSize128-64     4.74ns ± 1%    2.68ns ± 0%     -43.52%  (p=0.029 n=4+4)
MemclrKnownSize192-64     5.08ns ± 1%    3.15ns ± 0%     -38.09%  (p=0.029 n=4+4)
MemclrKnownSize248-64     4.68ns ± 4%    3.00ns ± 0%     -36.00%  (p=0.029 n=4+4)
MemclrKnownSize256-64     6.81ns ± 1%    3.56ns ± 1%     -47.70%  (p=0.029 n=4+4)
MemclrKnownSize512-64     3.65ns ± 0%    8.97ns ±31%    +145.78%  (p=0.029 n=4+4)
MemclrKnownSize1024-64    4.74ns ± 0%    12.02ns ± 0%   +153.75%  (p=0.029 n=4+4)
MemclrKnownSize4096-64    17.1ns ± 0%    51.9ns ± 0%    +203.34%  (p=0.029 n=4+4)
MemclrKnownSize512KiB-64  2.12µs ± 0%    5.30µs ± 0%    +149.61%  (p=0.029 n=4+4)

Comment From: jake-ciolek

Thanks, I'll gate the inlining so it only happens for AMD64 and ARM64 for now.

Comment From: mknyszek

Doing a sweep of the Go 1.21 milestone. Is there anything left to do here?

Also, given that this is about performance, it probably shouldn't be in the Go 1.21 milestone even if there is something left. Moving it to Backlog for now. Please comment if you plan to work on this in Go 1.22 and we can push it there. Thanks!

Comment From: randall77

Duffzero is now gone, so this probably needs a re-evaluation to see if there is still anything to do.