What version of Go are you using (go version)?

$ go version
go version go1.19.4 linux/amd64
$ gotip version
go version devel go1.20-e870de9 Tue Dec 27 21:10:04 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes, also in 1.20rc1

What did you do?

$ cat go.mod
module mymodule/math

go 1.20
$ cat math.go
package math

import (
        "math"
        "math/rand"
)

const Epsilon = 1e-7

type Float interface {
        ~float32 | ~float64
}

type Mat4[T Float] struct {
        X00, X01, X02, X03 T
        X10, X11, X12, X13 T
        X20, X21, X22, X23 T
        X30, X31, X32, X33 T
}

func (m Mat4[T]) Eq(n Mat4[T]) bool {
        return ApproxEq(m.X00, n.X00, Epsilon) &&
                ApproxEq(m.X10, n.X10, Epsilon) &&
                ApproxEq(m.X20, n.X20, Epsilon) &&
                ApproxEq(m.X30, n.X30, Epsilon) &&
                ApproxEq(m.X01, n.X01, Epsilon) &&
                ApproxEq(m.X11, n.X11, Epsilon) &&
                ApproxEq(m.X21, n.X21, Epsilon) &&
                ApproxEq(m.X31, n.X31, Epsilon) &&
                ApproxEq(m.X02, n.X02, Epsilon) &&
                ApproxEq(m.X12, n.X12, Epsilon) &&
                ApproxEq(m.X22, n.X22, Epsilon) &&
                ApproxEq(m.X32, n.X32, Epsilon) &&
                ApproxEq(m.X03, n.X03, Epsilon) &&
                ApproxEq(m.X13, n.X13, Epsilon) &&
                ApproxEq(m.X23, n.X23, Epsilon) &&
                ApproxEq(m.X33, n.X33, Epsilon)
}

func Abs[T Float](x T) T {
        return T(math.Abs(float64(x)))
}

func ApproxEq[T Float](v1, v2, epsilon T) bool {
        return Abs(v1-v2) <= epsilon
}

type Vec4[T Float] struct {
        X, Y, Z, W T
}

func NewRandVec4[T Float]() Vec4[T] {
        return Vec4[T]{
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
                T(rand.Float64()),
        }
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
        return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

func FMA[T Float](x, y, z T) T {
        return T(math.FMA(float64(x), float64(y), float64(z)))
}
$ cat bench_test.go 
package math_test

import (
        "testing"

        "mymodule/math"
)

func BenchmarkMat4_Eq(b *testing.B) {
        m1 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }
        m2 := math.Mat4[float32]{
                5, 1, 5, 6,
                8, 71, 2, 47,
                5, 1, 582, 4,
                2, 1, 7, 25,
        }

        b.ResetTimer()
        b.ReportAllocs()
        var m bool
        for i := 0; i < b.N; i++ {
                m = m1.Eq(m2)
        }
        _ = m
}

var v float32

func BenchmarkVec_Dot(b *testing.B) {
        b.Run("Vec4", func(b *testing.B) {
                v1 := math.NewRandVec4[float32]()
                v2 := math.NewRandVec4[float32]()

                b.ReportAllocs()
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        v = v1.Dot(v2)
                }
        })
}
$ perflock go test -run=none -bench=. -count=10 | tee bench119.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      64214283                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64270538                18.66 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      64261249                18.66 ns/op            0 B/op          0 allocs/op
...
$ perflock gotip test -run=none -bench=. -count=10 | tee bench120.txt
goos: linux
goarch: amd64
pkg: mymodule/math
cpu: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
BenchmarkMat4_Eq-8      35130938                35.00 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      35127861                34.20 ns/op            0 B/op          0 allocs/op
BenchmarkMat4_Eq-8      34658744                34.21 ns/op            0 B/op          0 allocs/op
...

What did you expect to see?

Same performance.

What did you see instead?

$ benchstat bench119.txt bench120.txt
name            old time/op    new time/op    delta
Mat4_Eq-8         18.7ns ± 0%    34.2ns ± 0%   +83.24%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8    4.44ns ± 0%    9.30ns ± 0%  +109.24%  (p=0.000 n=8+8)

name            old alloc/op   new alloc/op   delta
Mat4_Eq-8          0.00B          0.00B           ~     (all equal)
Vec_Dot/Vec4-8     0.00B          0.00B           ~     (all equal)

name            old allocs/op  new allocs/op  delta
Mat4_Eq-8           0.00           0.00           ~     (all equal)
Vec_Dot/Vec4-8      0.00           0.00           ~     (all equal)

Comment From: changkun

cc @golang/runtime

Comment From: cherrymui

I can reproduce the regression on Mat4_Eq, but not on Vec_Dot/Vec4. Setting GOEXPERIMENT=nounified brings the performance back. So it looks like it is due to unified IR.

It looks to me that with Go 1.19 or non-unified, it inlines math.Abs, math.Float64bits, and math.Float64frombits (from the standard library math package, not mymodule/math) into the caller. At tip with unified IR, they are not inlined. Maybe there is an issue with inlining a non-generic callee into a generic caller?

cc @mdempsky

Comment From: cherrymui

Yeah, if I remove the type parameters (hard code float32), I get the same performance as Go 1.19.
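For reference, a minimal sketch of what "hard code float32" could look like for the Abs/ApproxEq pair from the report. It is written as a standalone package main so it can be run directly; the main function is only an illustration and is not part of the original module.

```go
package main

import (
	"fmt"
	"math"
)

// Epsilon matches the constant in the original report.
const Epsilon = 1e-7

// Abs is the report's Abs with the type parameter replaced by float32.
func Abs(x float32) float32 {
	return float32(math.Abs(float64(x)))
}

// ApproxEq is the report's ApproxEq with the type parameter replaced by float32.
func ApproxEq(v1, v2, epsilon float32) bool {
	return Abs(v1-v2) <= epsilon
}

func main() {
	// Illustrative calls only; the values are arbitrary.
	fmt.Println(ApproxEq(1.0, 1.0, Epsilon))
	fmt.Println(ApproxEq(1.0, 1.5, Epsilon))
}
```

With no generic caller involved, the calls to the standard math.Abs go through the ordinary non-generic inlining path, which is the case that performs the same as Go 1.19 here.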

Comment From: changkun

It is weird that Vec_Dot/Vec4 is not reproducible. Nevertheless, in https://github.com/polyred/polyred/tree/develop/math, there are more regression examples:

name                        old time/op    new time/op    delta
Mat_Mul-8                     5.80ms ± 0%    5.92ms ± 0%    +2.11%  (p=0.000 n=9+8)
Mat4_Eq-8                     19.0ns ± 0%    34.2ns ± 0%   +79.90%  (p=0.000 n=10+9)
Vec_Eq/Vec2-8                 1.82ns ± 1%    2.52ns ± 0%   +38.40%  (p=0.000 n=8+9)
Vec_Eq/Vec3-8                 1.90ns ± 2%    2.74ns ± 0%   +44.39%  (p=0.000 n=9+10)
Vec_Eq/Vec4-8                 2.10ns ± 1%    3.08ns ± 0%   +47.06%  (p=0.000 n=10+10)
Vec_IsZero/Vec2-8             1.82ns ± 1%    2.51ns ± 0%   +37.99%  (p=0.000 n=9+8)
Vec_IsZero/Vec3-8             1.71ns ± 3%    2.51ns ± 0%   +46.40%  (p=0.000 n=9+8)
Vec_IsZero/Vec4-8             1.82ns ± 0%    2.51ns ± 1%   +38.45%  (p=0.000 n=9+9)
Vec_Dot/Vec2-8                2.06ns ± 2%    2.95ns ± 0%   +43.22%  (p=0.000 n=10+9)
Vec_Dot/Vec3-8                3.13ns ± 0%    6.15ns ± 1%   +96.09%  (p=0.000 n=8+9)
Vec_Dot/Vec4-8                4.21ns ± 0%    9.34ns ± 1%  +121.83%  (p=0.000 n=8+10)
Vec_Len/Vec2-8                2.12ns ± 0%    3.20ns ± 0%   +50.73%  (p=0.000 n=9+9)
Vec_Len/Vec3-8                2.98ns ± 1%    3.80ns ± 0%   +27.45%  (p=0.000 n=10+8)
Vec_Len/Vec4-8                4.34ns ± 1%    5.14ns ± 1%   +18.35%  (p=0.000 n=9+10)
Vec_Unit/Vec3-8               8.90ns ± 0%   13.35ns ± 0%   +50.06%  (p=0.000 n=9+8)
Vec_Unit/Vec4-8               15.7ns ± 0%    19.3ns ± 1%   +23.15%  (p=0.000 n=8+10)
Vec_Apply/Vec3-8              8.90ns ± 0%   13.40ns ± 1%   +50.53%  (p=0.000 n=9+10)
Vec_Apply/Vec4-8              15.7ns ± 0%    19.3ns ± 0%   +22.86%  (p=0.000 n=8+8)
Vec_Cross/Vec4-8              4.77ns ± 0%    4.90ns ± 1%    +2.81%  (p=0.000 n=8+10)

Comment From: cherrymui

This is also multi-level inlining. E.g. standard math.Abs inlined into user-defined, instantiated Abs[go.shape.float32_0], then inlined into ApproxEq[go.shape.float32_0]. #56280 may be related.

Comment From: mdempsky

I'm on vacation (and currently on a plane), but briefly looking at the compiler's -m and -S output, it looks like everything is inlining the same. I don't see anything obviously wrong. (Caveat: I had to retype stuff from my phone onto my laptop and I simplified things slightly because of that.)

I'll take a look once I'm back in the office on Monday.

Comment From: cherrymui

Hmmm, I got different results with tip vs. 1.19 or non-unified.

With Go 1.19,

$ go1.19 test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline "mymodule/math".Abs[go.shape.float32_0]
./math.go:41:26: inlining call to "math".Abs
./math.go:41:26: inlining call to "math".Float64bits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:44:6: can inline "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:45:19: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:45:19: inlining call to "math".Abs
./math.go:45:19: inlining call to "math".Float64bits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "mymodule/math".ApproxEq[go.shape.float32_0]
./math.go:22:24: inlining call to "mymodule/math".Abs[go.shape.float32_0]
./math.go:22:24: inlining call to "math".Abs
./math.go:22:24: inlining call to "math".Float64bits
./math.go:22:24: inlining call to "math".Float64frombits
...

With tip,

$ go test -c -gcflags=-m 
# mymodule/math_test [mymodule/math.test]
./math.go:40:6: can inline math.Abs[go.shape.float32]
./math.go:44:6: can inline math.ApproxEq[go.shape.float32]
./math.go:45:19: inlining call to math.Abs[go.shape.float32]
./math.go:22:24: inlining call to math.ApproxEq[go.shape.float32]
./math.go:23:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:24:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:25:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:26:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:27:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:28:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:29:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:30:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:31:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:32:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:33:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:34:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:35:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:36:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:37:25: inlining call to math.ApproxEq[go.shape.float32]
./math.go:22:24: inlining call to math.Abs[go.shape.float32]
./math.go:23:25: inlining call to math.Abs[go.shape.float32]
./math.go:24:25: inlining call to math.Abs[go.shape.float32]
./math.go:25:25: inlining call to math.Abs[go.shape.float32]
./math.go:26:25: inlining call to math.Abs[go.shape.float32]
./math.go:27:25: inlining call to math.Abs[go.shape.float32]
./math.go:28:25: inlining call to math.Abs[go.shape.float32]
./math.go:29:25: inlining call to math.Abs[go.shape.float32]
./math.go:30:25: inlining call to math.Abs[go.shape.float32]
./math.go:31:25: inlining call to math.Abs[go.shape.float32]
./math.go:32:25: inlining call to math.Abs[go.shape.float32]
./math.go:33:25: inlining call to math.Abs[go.shape.float32]
./math.go:34:25: inlining call to math.Abs[go.shape.float32]
./math.go:35:25: inlining call to math.Abs[go.shape.float32]
./math.go:36:25: inlining call to math.Abs[go.shape.float32]
./math.go:37:25: inlining call to math.Abs[go.shape.float32]
./math_test.go:24:23: inlining call to testing.(*B).ReportAllocs
...

but no Float64bits and Float64frombits.

In particular,

$ go1.19 test -c -gcflags=-m 2>&1 | grep Float64frombits
./math.go:41:26: inlining call to "math".Float64frombits
./math.go:45:19: inlining call to "math".Float64frombits
./math.go:22:24: inlining call to "math".Float64frombits
./math.go:23:25: inlining call to "math".Float64frombits
./math.go:24:25: inlining call to "math".Float64frombits
./math.go:25:25: inlining call to "math".Float64frombits
./math.go:26:25: inlining call to "math".Float64frombits
./math.go:27:25: inlining call to "math".Float64frombits
./math.go:28:25: inlining call to "math".Float64frombits
./math.go:29:25: inlining call to "math".Float64frombits
./math.go:30:25: inlining call to "math".Float64frombits
./math.go:31:25: inlining call to "math".Float64frombits
./math.go:32:25: inlining call to "math".Float64frombits
./math.go:33:25: inlining call to "math".Float64frombits
./math.go:34:25: inlining call to "math".Float64frombits
./math.go:35:25: inlining call to "math".Float64frombits
./math.go:36:25: inlining call to "math".Float64frombits
./math.go:37:25: inlining call to "math".Float64frombits
$ go test -c -gcflags=-m 2>&1 | grep Float64frombits
$ # no output

Comment From: mdempsky

@cherrymui Thanks, I'm able to repro the issue now. Not sure what went wrong with my earlier attempt.

Comment From: mdempsky

The issue here is that unified IR has a simpler heuristic for deciding which function bodies to re-export. It simply re-exports functions that were inlined into the current compilation unit. (It also always exports its own inlinable functions.)

The problem manifests here because mymodule/math doesn't actually instantiate its generic types/functions, so math.{Abs,Float64bits,Float64frombits} never get inlined within that package and are therefore never re-exported by it. Then, when compiling mymodule/math_test, the inline bodies aren't available, so the calls don't get inlined.

Two possible workarounds:

  1. Instantiate the generic functions/types within mymodule/math. For example, add two declarations: var _ Mat4[float64]; var _ Vec4[float64].
  2. Within mymodule/math_test, add an import _ "math" directive. This will make sure the math.{Abs,Float64bits,Float64frombits} inline bodies are available from the origin package, regardless of re-exporting.
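A rough sketch of the first workaround, reduced to Vec4 and collapsed into a single package main so it runs on its own. In the layout from the report, the var _ declaration would sit in math.go inside mymodule/math, with an analogous one for Mat4. The main function and its values are illustrative only, and whether the declaration restores the inlining has not been verified here.

```go
package main

import (
	"fmt"
	"math"
)

// Float, FMA, Vec4 and Dot mirror the definitions from the report.
type Float interface {
	~float32 | ~float64
}

func FMA[T Float](x, y, z T) T {
	return T(math.FMA(float64(x), float64(y), float64(z)))
}

type Vec4[T Float] struct {
	X, Y, Z, W T
}

func (v Vec4[T]) Dot(u Vec4[T]) T {
	return FMA(v.X, u.X, FMA(v.Y, u.Y, FMA(v.Z, u.Z, v.W*u.W)))
}

// Workaround 1: instantiate the generic type inside the defining package.
// The blank identifier keeps the declaration from introducing a usable name.
var _ Vec4[float64]

func main() {
	// Illustrative values only: 1*1 + 2*1 + 3*1 + 4*1.
	v := Vec4[float32]{1, 2, 3, 4}
	u := Vec4[float32]{1, 1, 1, 1}
	fmt.Println(v.Dot(u)) // prints 10
}
```

The second workaround needs no new code beyond the blank import itself: bench_test.go would gain import _ "math" next to its existing imports.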

There's supposed to be a compiler diagnostic to warn when this happens. I'm not sure at the moment why it's not firing.

Comment From: mknyszek

Hey @mdempsky, doing a sweep of the Go 1.21 milestone. Any updates here? Should this go into Backlog? Thanks.