Go version
go version go1.22.2 darwin/arm64
Output of go env
in your module/workspace:
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/filippo/Library/Caches/go-build'
GOENV='/Users/filippo/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/filippo/pkg/mod'
GONOPROXY='github.com/FiloSottile/*,filippo.io/*'
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/filippo'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org'
GOROOT='/Users/filippo/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.darwin-arm64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/filippo/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.darwin-arm64/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.2'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/filippo/src/filippo.io/mlkem768/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_j/hq4ytn1n4b94fhrpvvb9tktr0000gn/T/go-build3556564162=/tmp/go-build -gno-record-gcc-switches -fno-common'
What did you do?
I have this function in a fairly hot loop.
const n = 256
type fieldElement uint16
type nttElement [n]fieldElement
func nttMul(f, g nttElement) nttElement {
var h nttElement
for i := 0; i < 128; i++ {
a0, a1 := f[2*i], f[2*i+1]
b0, b1 := g[2*i], g[2*i+1]
h[2*i] = fieldAdd(fieldMul(a0, b0), fieldMul(fieldMul(a1, b1), gammas[i]))
h[2*i+1] = fieldAdd(fieldMul(a0, b1), fieldMul(a1, b0))
}
return h
}
What did you see happen?
Both f[2*i]
and f[2*i+1]
get a isInBounds()
. g[2*i]
and g[2*i+1]
don't.
What did you expect to see?
One or zero bounds checks.
One if the compiler realized that if f[2*i+1]
is in bounds with i
positive and small (0 to 127), then f[2*i]
is clearly in bounds.
Zero if the compiler figured out that i is 0 to 127, so 2*i+1
is 0 to 255, and the size of f is a constant 256.
Comment From: zigo101
It looks there is a workaround now:
func nttMul(f, g nttElement) nttElement {
var h nttElement
for i := 0; i < 256; i += 2 {
a0, a1 := f[i],
f[i+1]
b0, b1 := g[i],
g[i+1]
_, _, _, _ = a0, a1, b0, b1
}
return h
}
Not benchmark it yet.
[edit] BTW, why not use slices as arguments:
func nttMul(f, g []fieldElement) nttElement {
var h nttElement
f, g = f[:256], g[:256]
for i := 0; i < 256; i += 2 {
a0, a1 := f[i],
f[i+1]
b0, b1 := g[i],
g[i+1]
_, _, _, _ = a0, a1, b0, b1
}
return h
}
Comment From: FiloSottile
That does indeed remove the bounds checks on f
, but we get one on gammas[i/2]
, which also feels avoidable.
Slices should in theory add overhead compared to const size arrays and remove information from the type system
Comment From: zigo101
Yes, there are still several cases BCE fails to handle.
Now, a workaround is to double the length of gammas
and only uses its elements at 2N
indexes.
Comment From: FiloSottile
Now, a workaround is to double the length of
gammas
and only uses its elements at2N
indexes.
Interestingly, that benchmarks decidedly worse, eating between 10% and 20% of the speedup of dropping the f
bounds checks. 🤷
Comment From: zigo101
It is possible. BCE will not always get positive effects.
Another way to remove all bound checks is declare fieldElement
as [2]uint16
. I'm not sure about the final effect.
Comment From: zigo101
The test shows that double length of gammas
is the most performant: https://gotipplay.golang.org/p/VKJDqUPuhKV (maybe the implementation of fieldMul
and fieldAdd
will affect the results).
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_f1-4 1762314 680.7 ns/op
Benchmark_f2-4 3191974 374.8 ns/op
Benchmark_f3-4 3488221 343.2 ns/op
Benchmark_f4-4 1893159 630.0 ns/op
Comment From: randall77
Simple reproducer:
func f(a [256]byte) {
for i := 0; i < 128; i++ {
_ = a[2*i]
}
}
We should be able to get rid of the bounds check. Currently I think it fails because the prove pass can't tell that the math 2*i
doesn't overflow and become negative. It does seem to grok that 2*i < 256
.
Should be fixable. That code is delicate though, it is not immediately obvious to me where that fix would go.
Comment From: gopherbot
Change https://go.dev/cl/599096 mentions this issue: cmd/compile: rewrite the constant parts of the prove pass
Comment From: gopherbot
Change https://go.dev/cl/599256 mentions this issue: cmd/compile: propagate constant ranges through multiplies and shifts