Go version

go version go1.22.2 darwin/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/filippo/Library/Caches/go-build'
GOENV='/Users/filippo/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/filippo/pkg/mod'
GONOPROXY='github.com/FiloSottile/*,filippo.io/*'
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/filippo'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org'
GOROOT='/Users/filippo/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.darwin-arm64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/filippo/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.darwin-arm64/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.2'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/filippo/src/filippo.io/mlkem768/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_j/hq4ytn1n4b94fhrpvvb9tktr0000gn/T/go-build3556564162=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I have this function in a fairly hot loop.

const n = 256
type fieldElement uint16
type nttElement [n]fieldElement

func nttMul(f, g nttElement) nttElement {
    var h nttElement
    for i := 0; i < 128; i++ {
        a0, a1 := f[2*i], f[2*i+1]
        b0, b1 := g[2*i], g[2*i+1]
        h[2*i] = fieldAdd(fieldMul(a0, b0), fieldMul(fieldMul(a1, b1), gammas[i]))
        h[2*i+1] = fieldAdd(fieldMul(a0, b1), fieldMul(a1, b0))
    }
    return h
}

What did you see happen?

Both f[2*i] and f[2*i+1] get a isInBounds(). g[2*i] and g[2*i+1] don't.

What did you expect to see?

One or zero bounds checks.

One if the compiler realized that if f[2*i+1] is in bounds with i positive and small (0 to 127), then f[2*i] is clearly in bounds.

Zero if the compiler figured out that i is 0 to 127, so 2*i+1 is 0 to 255, and the size of f is a constant 256.

Comment From: zigo101

It looks there is a workaround now:

func nttMul(f, g nttElement) nttElement {
    var h nttElement
    for i := 0; i < 256; i += 2 {
        a0, a1 := f[i], 
        f[i+1]
        b0, b1 := g[i], 
        g[i+1]
        _, _, _, _ = a0, a1, b0, b1
    }
    return h
}

Not benchmark it yet.

[edit] BTW, why not use slices as arguments:

func nttMul(f, g []fieldElement) nttElement {
    var h nttElement
    f, g = f[:256], g[:256]
    for i := 0; i < 256; i += 2 {
        a0, a1 := f[i], 
        f[i+1]
        b0, b1 := g[i], 
        g[i+1]
        _, _, _, _ = a0, a1, b0, b1
    }
    return h
}

Comment From: FiloSottile

That does indeed remove the bounds checks on f, but we get one on gammas[i/2], which also feels avoidable.

Slices should in theory add overhead compared to const size arrays and remove information from the type system

Comment From: zigo101

Yes, there are still several cases BCE fails to handle.

Now, a workaround is to double the length of gammas and only uses its elements at 2N indexes.

Comment From: FiloSottile

Now, a workaround is to double the length of gammas and only uses its elements at 2N indexes.

Interestingly, that benchmarks decidedly worse, eating between 10% and 20% of the speedup of dropping the f bounds checks. 🤷

Comment From: zigo101

It is possible. BCE will not always get positive effects.

Another way to remove all bound checks is declare fieldElement as [2]uint16. I'm not sure about the final effect.

Comment From: zigo101

The test shows that double length of gammas is the most performant: https://gotipplay.golang.org/p/VKJDqUPuhKV (maybe the implementation of fieldMul and fieldAdd will affect the results).

goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_f1-4       1762314           680.7 ns/op
Benchmark_f2-4       3191974           374.8 ns/op
Benchmark_f3-4       3488221           343.2 ns/op
Benchmark_f4-4       1893159           630.0 ns/op

Comment From: randall77

Simple reproducer:

func f(a [256]byte) {
    for i := 0; i < 128; i++ {
        _ = a[2*i]
    }
}

We should be able to get rid of the bounds check. Currently I think it fails because the prove pass can't tell that the math 2*i doesn't overflow and become negative. It does seem to grok that 2*i < 256. Should be fixable. That code is delicate though, it is not immediately obvious to me where that fix would go.

Comment From: gopherbot

Change https://go.dev/cl/599096 mentions this issue: cmd/compile: rewrite the constant parts of the prove pass

Comment From: gopherbot

Change https://go.dev/cl/599256 mentions this issue: cmd/compile: propagate constant ranges through multiplies and shifts