Golang cmd/compile: performance regression in 1.20 - inlining regression due to missed export

What version of Go are you using (`go version`)?

$ go version
go version go1.19.4 linux/amd64
$ gotip version
go version devel go1.20-e870de9 Tue Dec 27 21:10:04 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes, also in 1.20rc1

What did you do?

$ cat go.mod
module mymodule/math

go 1.20

$ cat math.go
package math

import (
        "math"
        "math/rand"
)

const Epsilon = 1e-7

type Float interface {
        ~float32 | ~float64
}

type Mat4[T Float] struct {
        X00, X01, X02, X03 T
        X10, X11, X12, X13 T
        X20, X21, X22, X23 T
        X30, X31, X32, X33 T
}

func (m Mat4[T]) Eq(n Mat4[T]) bool {
        return ApproxEq(m.X00, n.X00, Epsilon) &&
                ApproxEq(m.X10, n.X10, Epsilon) &&
                ApproxEq(m.X20, n.X20, Epsilon) &&
                ApproxEq(m.X30, n.X30, Epsilon) &&
                ApproxEq(m.X01, n.X01, Epsilon) &&
                ApproxEq(m.X11, n.X11, Epsilon) &&
                ApproxEq(m.X21, n.X21, Epsilon) &&
                ApproxEq(m.X31, n.X31, Epsilon) &&
                ApproxEq(m.X02, n.X02, Epsilon) &&
                ApproxEq(m.X12, n.X12, Epsilon) &&
                ApproxEq(m.X22, n.X22, Epsilon) &&
                ApproxEq(m.X32, n.X32, Epsilon) &&
                ApproxEq(m.X03, n.X03, Epsilon) &&
                ApproxEq(m.X13, n.X13, Epsilon) &&
                ApproxEq(m.X23, n.X23, Epsilon) &&
                ApproxEq(m.X33, n.X33, Epsilon)
}

func Abs[T Float](x T) T {
        return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
        return Abs(v1-v2) <= epsilon
}

type Vec4[T Float] struct {
        X, Y, Z, W T
}

func NewRandVec4[T Float]() Vec4[T] {
        return Vec4[T]{
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
        }
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
        return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

func FMA[T Float](x, y, z T) T {
        return T(math.FMA(float64(x), float64(y), float64(z)))
}

$ cat bench_test.go 
package math_test

import (
        "testing"

        "mymodule/math"
)

func BenchmarkMat4_Eq(b *testing.B) {
        m1 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }
        m2 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }

        b.ResetTimer()
        b.ReportAllocs()
        var m bool
        for i := 0; i < b.N; i++ {
                m = m1.Eq(m2)
        }
        _ = m
}

var v float32

func BenchmarkVec_Dot(b *testing.B) {
        b.Run("Vec4", func(b *testing.B) {
                v1 := math.NewRandVec4[float32]()
                v2 := math.NewRandVec4[float32]()

                b.ReportAllocs()
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        v = v1.Dot(v2)
                }
        })
}

$ perflock go test -run=none -bench=. -count=10 | tee bench119.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      64214283                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64270538                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64261249                18.66 ns/op            0 B/op          0 allocs/op
...

$ perflock gotip test -run=none -bench=. -count=10 | tee bench120.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      35130938                35.00 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      35127861                34.20 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      34658744                34.21 ns/op            0 B/op          0 allocs/op
...

What did you expect to see?

Same performance.

What did you see instead?

$ benchstat bench119.txt bench120.txt
name            old time/op    new time/op    delta
Mat4_Eq-8         18.7ns ± 0%    34.2ns ± 0%   +83.24%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8    4.44ns ± 0%    9.30ns ± 0%  +109.24%  (p=0.000 n=8+8)

name            old alloc/op   new alloc/op   delta
Mat4_Eq-8          0.00B          0.00B           ~     (all equal)
Vec_Dot/Vec4-8     0.00B          0.00B           ~     (all equal)

name            old allocs/op  new allocs/op  delta
Mat4_Eq-8           0.00           0.00           ~     (all equal)
Vec_Dot/Vec4-8      0.00           0.00           ~     (all equal)

Comment From: changkun

cc @golang/runtime

Comment From: cherrymui

I can reproduce the regression on Mat4_Eq, but not on Vec_Dot/Vec4. Setting GOEXPERIMENT=nounified brings the performance back. So it looks like it is due to unified IR.

It looks to me that with Go 1.19 or non-unified, it inlines math.Abs, math.Float64bits, and math.Float64frombits (from the standard library math package, not mymodule/math) into the caller. At tip with unified IR, they are not inlined. Maybe there is some issue with inlining a non-generic callee into a generic caller?

cc @mdempsky

Comment From: cherrymui

Yeah, if I remove the type parameters (hard code float32), I get the same performance as Go 1.19.
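
For reference, a minimal sketch of what "hard code float32" could look like for the two helpers involved. The names abs32 and approxEq32 are hypothetical and do not appear in the report; the bodies mirror Abs and ApproxEq from math.go with T replaced by float32. This only illustrates the shape of the change and makes no claim about the resulting benchmark numbers.

```go
package main

import (
	"fmt"
	"math"
)

// Epsilon mirrors the tolerance constant from the report's math.go.
const Epsilon = 1e-7

// abs32 is a hypothetical non-generic counterpart of Abs[T], fixed to float32.
func abs32(x float32) float32 {
	return float32(math.Abs(float64(x)))
}

// approxEq32 is a hypothetical non-generic counterpart of ApproxEq[T].
func approxEq32(v1, v2, epsilon float32) bool {
	return abs32(v1-v2) <= epsilon
}

func main() {
	fmt.Println(approxEq32(1.0, 1.0, Epsilon)) // true
	fmt.Println(approxEq32(1.0, 1.5, Epsilon)) // false
}
```

Because no type parameters are left, the callee (math.Abs) and its callers are all ordinary functions, which is the case that still inlines as in Go 1.19 according to the observation above.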

Comment From: changkun

It is weird that Vec_Dot/Vec4 is not reproducible. Nevertheless, in https://github.com/polyred/polyred/tree/develop/math, there are more regression examples:

name                        old time/op    new time/op    delta
Mat_Mul-8                     5.80ms ± 0%    5.92ms ± 0%    +2.11%  (p=0.000 n=9+8)
Mat4_Eq-8                     19.0ns ± 0%    34.2ns ± 0%   +79.90%  (p=0.000 n=10+9)
Vec_Eq/Vec2-8                 1.82ns ± 1%    2.52ns ± 0%   +38.40%  (p=0.000 n=8+9)
Vec_Eq/Vec3-8                 1.90ns ± 2%    2.74ns ± 0%   +44.39%  (p=0.000 n=9+10)
Vec_Eq/Vec4-8                 2.10ns ± 1%    3.08ns ± 0%   +47.06%  (p=0.000 n=10+10)
Vec_IsZero/Vec2-8             1.82ns ± 1%    2.51ns ± 0%   +37.99%  (p=0.000 n=9+8)
Vec_IsZero/Vec3-8             1.71ns ± 3%    2.51ns ± 0%   +46.40%  (p=0.000 n=9+8)
Vec_IsZero/Vec4-8             1.82ns ± 0%    2.51ns ± 1%   +38.45%  (p=0.000 n=9+9)
Vec_Dot/Vec2-8                2.06ns ± 2%    2.95ns ± 0%   +43.22%  (p=0.000 n=10+9)
Vec_Dot/Vec3-8                3.13ns ± 0%    6.15ns ± 1%   +96.09%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8                4.21ns ± 0%    9.34ns ± 1%  +121.83%  (p=0.000 n=8+10)
Vec_Len/Vec2-8                2.12ns ± 0%    3.20ns ± 0%   +50.73%  (p=0.000 n=9+9)
Vec_Len/Vec3-8                2.98ns ± 1%    3.80ns ± 0%   +27.45%  (p=0.000 n=10+8)
Vec_Len/Vec4-8                4.34ns ± 1%    5.14ns ± 1%   +18.35%  (p=0.000 n=9+10)
Vec_Unit/Vec3-8               8.90ns ± 0%   13.35ns ± 0%   +50.06%  (p=0.000 n=9+8)
Vec_Unit/Vec4-8               15.7ns ± 0%    19.3ns ± 1%   +23.15%  (p=0.000 n=8+10)
Vec_Apply/Vec3-8              8.90ns ± 0%   13.40ns ± 1%   +50.53%  (p=0.000 n=9+10)
Vec_Apply/Vec4-8              15.7ns ± 0%    19.3ns ± 0%   +22.86%  (p=0.000 n=8+8)
Vec_Cross/Vec4-8              4.77ns ± 0%    4.90ns ± 1%    +2.81%  (p=0.000 n=8+10)

Comment From: cherrymui

This is also multi-level inlining. E.g. standard math.Abs inlined into user-defined, instantiated Abs[go.shape.float32_0], then inlined into ApproxEq[go.shape.float32_0]. #56280 may be related.

Comment From: mdempsky

I'm on vacation (and currently on a plane), but briefly looking at the compiler's -m and -S output, it looks like everything is inlining the same. I don't see anything obviously wrong. (Caveat: I had to retype stuff from my phone onto my laptop and I simplified things slightly because of that.)

I'll take a look once I'm back in the office on Monday.

Comment From: cherrymui

Hmmm, I got different results with tip vs. 1.19 or non-unified.

With Go 1.19,

$ go1.19 test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline "mymodule/math".Abs[go.shape.float32_0]
./math.go:41:26: inlining call to "math".Abs
./math.go:41:26: inlining call to "math".Float64bits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:44:6: can inline "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:45:19: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:45:19: inlining call to "math".Abs
./math.go:45:19: inlining call to "math".Float64bits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:22:24: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:22:24: inlining call to "math".Abs
./math.go:22:24: inlining call to "math".Float64bits
./math.go:22:24: inlining call to "math".Float64frombits
...

With tip,

$ go test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline math.Abs[go.shape.float32]
./math.go:44:6: can inline math.ApproxEq[go.shape.float32]
./math.go:45:19: inlining call to math.Abs[go.shape.float32]
./math.go:22:24: inlining call to math.ApproxEq[go.shape.float32]
./math.go:23:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:24:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:25:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:26:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:27:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:28:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:29:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:30:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:31:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:32:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:33:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:34:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:35:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:36:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:37:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:22:24: inlining call to math.Abs[go.shape.float32]
./math.go:23:25: inlining call to math.Abs[go.shape.float32]
./math.go:24:25: inlining call to math.Abs[go.shape.float32]
./math.go:25:25: inlining call to math.Abs[go.shape.float32]
./math.go:26:25: inlining call to math.Abs[go.shape.float32]
./math.go:27:25: inlining call to math.Abs[go.shape.float32]
./math.go:28:25: inlining call to math.Abs[go.shape.float32]
./math.go:29:25: inlining call to math.Abs[go.shape.float32]
./math.go:30:25: inlining call to math.Abs[go.shape.float32]
./math.go:31:25: inlining call to math.Abs[go.shape.float32]
./math.go:32:25: inlining call to math.Abs[go.shape.float32]
./math.go:33:25: inlining call to math.Abs[go.shape.float32]
./math.go:34:25: inlining call to math.Abs[go.shape.float32]
./math.go:35:25: inlining call to math.Abs[go.shape.float32]
./math.go:36:25: inlining call to math.Abs[go.shape.float32]
./math.go:37:25: inlining call to math.Abs[go.shape.float32]
./math_test.go:24:23: inlining call to testing.(*B).ReportAllocs
...

but no Float64bits and Float64frombits.

In particular,

$ go1.19 test -c -gcflags=-m 2>&1 | grep Float64frombits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "math".Float64frombits
./math.go:23:25: inlining call to "math".Float64frombits
./math.go:24:25: inlining call to "math".Float64frombits
./math.go:25:25: inlining call to "math".Float64frombits
./math.go:26:25: inlining call to "math".Float64frombits
./math.go:27:25: inlining call to "math".Float64frombits
./math.go:28:25: inlining call to "math".Float64frombits
./math.go:29:25: inlining call to "math".Float64frombits
./math.go:30:25: inlining call to "math".Float64frombits
./math.go:31:25: inlining call to "math".Float64frombits
./math.go:32:25: inlining call to "math".Float64frombits
./math.go:33:25: inlining call to "math".Float64frombits
./math.go:34:25: inlining call to "math".Float64frombits
./math.go:35:25: inlining call to "math".Float64frombits
./math.go:36:25: inlining call to "math".Float64frombits
./math.go:37:25: inlining call to "math".Float64frombits
$ go test -c -gcflags=-m 2>&1 | grep Float64frombits
$ # no output

Comment From: mdempsky

@cherrymui Thanks, I'm able to repro the issue now. Not sure what went wrong with my earlier attempt.

Comment From: mdempsky

The issue here is that unified IR has a simpler heuristic for deciding which function bodies to re-export. It simply re-exports functions that were inlined into the current compilation unit. (It also always exports its own inlinable functions.)

The problem manifests here because mymodule/math never actually instantiates its generic types/functions, so math.{Abs,Float64bits,Float64frombits} never get inlined within that package, and so they're never re-exported by that package either. Then, when compiling mymodule/math_test, the inline bodies aren't available, so the calls don't get inlined.

Two possible workarounds:

1. Instantiate the generic functions/types within mymodule/math. For example, add two statements: var _ Mat4[float64]; var _ Vec4[float64].
2. Within mymodule/math_test, add an import _ "math" directive. This will make sure the math.{Abs,Float64bits,Float64frombits} inline bodies are available from the origin package, regardless of re-exporting.

There's supposed to be a compiler diagnostic to warn when this happens. I'm not sure at the moment why it's not firing.

Comment From: mknyszek

Hey @mdempsky, doing a sweep of the Go 1.21 milestone. Any updates here? Should this go into Backlog? Thanks.
