Proposal Details

SIMD is crucial for achieving high performance in many modern workloads. While Go currently allows access to SIMD via hand-written assembly, this approach has significant drawbacks: it is difficult to write, prevents asynchronous preemption, and hinders inlining for small kernels.

Adding SIMD support in Go (without requiring writing assembly code) has been requested, see e.g. #35307, #53171, #64634, #67520. Here we propose a SIMD API and intrinsics without a language change. It has some similarities with #53171, #64634, and #67520, with differences in the details.

Two-level approach

Go APIs are generally simple and portable. However, the SIMD operations that the hardware supports are inherently nonportable and complex. Different CPU architectures support different vector sizes and operations, and sometimes have different representations.

A portable SIMD API would be nice. As mentioned on #67520, the work on Highway (a portable SIMD implementation for C++) demonstrated that it is possible to achieve a portable and performant SIMD implementation. For C++, it is built on top of architecture-specific compiler intrinsics, which we don't yet have for Go.

Our plan is to take a two-level approach: Low-level architecture-specific API and intrinsics, and a high-level portable vector API. The low-level intrinsics will closely resemble the machine instructions (most intrinsics will compile to a single instruction), and will serve as building blocks for the high-level API.

It is expected that most data-processing code can just use the high-level portable API and achieve good performance. When some uncommon architecture-specific operations are needed, the low-level API is there for power users to use and perform such operations.

Another way to look at them, as mentioned in https://github.com/golang/go/issues/67520#issuecomment-2122776918, is that the low-level API is analogous to the syscall package, whereas the high-level one is analogous to the os package. Most code that interacts with the system will use the os package, which works on nearly all platforms, and occasionally some code will reach into the syscall package for some uncommon system-specific operations.

In this proposal, we focus on the low-level architecture-specific fixed-size vectors for now, using AMD64 as a concrete example. Variable-size vectors (scalable vectors, e.g. ARM64 SVE) support and a portable high-level API will be addressed later. We propose adding this under GOEXPERIMENT=simd for a preview.

Design goals

Here are some design goals for the low-level architecture-specific API.

  • Expressive: Being an architecture-specific API, the lower-level package can cover most of the common and useful operations that the hardware supports.
  • Easy to use: Despite being a low-level API intended for power users, we expect it to be relatively easy to use. General users should be able to read and understand code using this API without digging into the hardware details.
  • Best-effort portability: When an operation is supported on multiple platforms, we intend to have a portable API for it. However, strict or maximal portability is not the goal for this level. In most cases, we don't plan to emulate operations that are not supported by the hardware.
  • Foundational: It will serve as a building block for the high-level portable API.

Portable(-ish) vector types with possibly architecture-specific methods

The SIMD vector types will be defined as opaque structs. The internal representation contains the elements of the proper type and count (we may include other zero-width unexported tagging fields if necessary).

package simd

type Uint32x4 struct { a0, a1, a2, a3 uint32 }
type Uint32x8 struct { a0, a1, a2, a3, a4, a5, a6, a7 uint32 }
type Float64x4 struct { a0, a1, a2, a3 float64 }
type Float64x8 struct { a0, a1, a2, a3, a4, a5, a6, a7 float64 }

etc.

The vector types will be defined on the architectures that support them. The compiler will recognize them as special types, using the vector registers to represent and pass them.

We do not define them as arrays, because the hardware often does not support element access with a dynamic index.

Operations will be defined as methods on the vector types, e.g.

// Add each element of two vectors.
//
// Equivalent to x86 instruction VPADDD.
func (Uint32x4) Add(Uint32x4) Uint32x4

which performs an element-wise add operation. The compiler will recognize it as an intrinsic, and compile it to the corresponding machine instruction. E.g. on AMD64, it will be compiled to the VPADDD instruction.

The operations will be defined in an architecture-specific way (controlled by build tags). This allows different architectures to have different operations. Common operations (like Add) will be defined on almost all architectures.

Naming

We choose the names of the operations to be easy to understand, and not tied to a specific architecture. The goal is that general readers will understand the code without digging into the CPU details.

Common operations (like Add) will have the same names and signatures across architectures. If an operation is not supported by the hardware on some architectures, it will not be defined there. We will avoid methods with the same name and signature but different semantics on different architectures.

The machine instruction name (like VPADDD) will be included in the comment, for ease of lookup.

Load and store

The machine instructions will load/store from/to a pointer. For type safety it should be a pointer to a properly sized array type.

func LoadUint32x4(*[4]uint32) Uint32x4
func (Uint32x4) Store(*[4]uint32)

Loading/storing from/to a slice is expected to be a common and useful operation, which can be defined in terms of the above, e.g.

func LoadUint32x4FromSlice(s []uint32) Uint32x4 {
    return LoadUint32x4((*[4]uint32)(s))
}

(LoadUint32x4FromSlice is long. Maybe we name it LoadUint32x4, and name the array pointer form LoadUint32x4Ptr?)

Another option is for the load operation to take a placeholder receiver, like

func (Uint32x4) Load(*[4]uint32) Uint32x4
func (Uint32x4) LoadFromSlice([]uint32) Uint32x4

The receiver's value will not be used; it only provides the type. This form keeps the name short. But this form is confusing, in that x.Load may appear to be a load into x (updating x) when in fact it returns the loaded value as a result, and it is not idiomatic as a Go API.
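
Putting the pieces so far together, here is a sketch of a small kernel written against the proposed API (hypothetical user code, assuming GOEXPERIMENT=simd on AMD64 and len(dst) == len(a) == len(b)):

func addSlices(dst, a, b []uint32) {
    i := 0
    // Process 4 elements per iteration with 128-bit vectors.
    for ; i+4 <= len(a); i += 4 {
        va := simd.LoadUint32x4((*[4]uint32)(a[i:]))
        vb := simd.LoadUint32x4((*[4]uint32)(b[i:]))
        va.Add(vb).Store((*[4]uint32)(dst[i:]))
    }
    // Scalar tail for the remaining elements.
    for ; i < len(a); i++ {
        dst[i] = a[i] + b[i]
    }
}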

Mask types

As pointed out in #67520, the internal representation of masks is different from architecture to architecture, and even different at different levels of CPU features. E.g. on AVX512, a mask is one bit per element of the vector and stored in a mask register (K register); on AVX2 it is stored in a regular vector register with one element per vector element; and on ARM64 SVE, the mask is one bit per byte of the vector.

To handle this situation, we represent masks as opaque types. The compiler will choose the representation that is appropriate for the program. A mask can be used in an operation that supports masking out some elements, and in logical operations between masks. It can also be explicitly converted to a vector. E.g.

func (Uint32x4) AddMasked(Uint32x4, Mask32x4) Uint32x4 // VPADDD.Z
func (Uint32x4) Equal(Uint32x4) Mask32x4 // VPCMPEQD or VPCMPD
func (Mask32x4) And(Mask32x4) Mask32x4 // VPANDD or KANDB
func (Mask32x4) AsVector() Int32x4 // VPMOVM2D or no-op

Besides operations that produce a mask (like comparison), masks can also be created from a vector, or from a bit pattern (on AMD64, it will be AVX512 style, 1 bit per element; on ARM64 SVE, it will be a different function getting a mask from a 1-bit-per-byte pattern.)

func (Uint32x4) AsMask() Mask32x4 // no-op or VPMOVD2M
func Mask32x4FromBits(uint8) Mask32x4

The compiler will choose the representation depending on how the mask is consumed, and will remove conversions between different representations where possible. E.g. for an Equal operation followed by an AsVector, the compiler will choose VPCMPEQD, which produces the result directly in a vector register. For an Equal operation followed by AddMasked, it will choose VPCMPD, which produces the result in a K register that can be consumed by a masked VPADDD instruction.
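
In code, that looks like this (a sketch; both uses of Equal are the same API call, and the compiler picks the representation from the consumer):

m := a.Equal(b)                 // Mask32x4
v := m.AsVector()               // consumed as a vector: compiles to VPCMPEQD
r := x.AddMasked(y, a.Equal(b)) // consumed by a masked op: VPCMPD feeding VPADDD.Z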

It is not recommended to use the masks directly in other ways.

Alternatively, we could have separate, explicit types and methods for different representations of the masks, e.g.

func (Uint32x4) EqualToVec(Uint32x4) Uint32x4 // VPCMPEQD
func (Uint32x4) EqualToMask(Uint32x4) Mask32x4 // VPCMPD

This way, the methods correspond more directly to machine instructions.

Note that the opaque mask representation doesn't preclude the addition of the more explicit representation and methods. If later it turns out that the explicit methods are needed, we can add them under different names.

Some masked operations (e.g. in AVX512) support two modes of masking: zero mask, where the unselected elements are set to 0, and merging mask, where the unselected elements remain unchanged in the destination register. The merging mask is more complicated, in that the destination register is actually both an input and an output. Zero mask has a simpler API. Therefore we choose the zero mask version (thus VPADDD.Z above). To use the merging mask operation, one can use the zero mask version of the operation followed by a blend operation with the same mask, and the compiler can optimize to a merging mask operation. For example, x.AddMasked(y, m).Blend(z, m) can be optimized to a merging masked instruction VPADDD x, y, m, z.
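
A sketch of that pattern as a helper function (assuming, as in the example above, that Blend keeps the receiver's lanes where the mask is set and takes the argument's lanes elsewhere):

func addWhere(x, y, z simd.Uint32x4, m simd.Mask32x4) simd.Uint32x4 {
    // Zero-masked add: selected lanes get x+y, unselected lanes get 0.
    sum := x.AddMasked(y, m)
    // Blending z into the unselected lanes lets the compiler fuse the
    // pair into a single merging-masked VPADDD.
    return sum.Blend(z, m)
}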

Conversions

Vector elements can be extended, truncated, or converted between integer and floating point types. E.g.

func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) ExtendToUint64() Uint64x4 // VPMOVZXDQ
func (Uint32x4) ConvertToFloat32() Float32x4 // VCVTUDQ2PS

In some cases it may be useful to truncate to fewer or expand to more elements of the same type, or to "reinterpret" a vector as another vector with the same bits but different element type and arrangement. Some such conversions don't need to generate a machine instruction.

func (Uint32x8) AsUint32x4() Uint32x4 // truncate
func (Uint32x4) AsUint32x8() Uint32x8 // expand, zero high elements
func (Uint32x4) AsInt32x4() Int32x4 // unsigned to signed
func (Uint32x4) AsFloat32x4() Float32x4 // interpret the bits as float32
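
As an illustration, reinterpretation makes integer bit tricks on floating point data cheap. A sketch (assuming a symmetric Float32x4.AsUint32x4 conversion also exists):

// Compute |f| per element by clearing each float32 sign bit on the
// integer view; the As* conversions generate no machine instruction.
signMask := simd.LoadUint32x4(&[4]uint32{0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff})
abs := f.AsUint32x4().And(signMask).AsFloat32x4()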

Constant operands

Some machine instructions require a constant operand, or have a form with a constant operand. For example, on AMD64, getting or setting a specific element of a vector (VPEXTRD/VPINSRD instruction) requires a constant index. For shifts, one form is shifting all the elements by the same amount, which has to be a constant (the other form being shifting a variable amount for each element). Namely,

func (Uint32x4) GetElem(int) uint32 // VPEXTRD
func (Uint32x4) SetElem(int, uint32) Uint32x4 // VPINSRD
func (Uint32x4) ShiftLeftConst(uint8) Uint32x4 // VPSLLD

It is recommended (and will be documented) that these methods be called with constant arguments, so that the compiler can generate efficient code. What happens if an argument is not constant? The corresponding C intrinsics may cause a compilation failure. We could choose to do an emulation, or generate a table switch (as the range of the constant operand is usually small), as a fallback path.
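
For example, in the intended usage each call below compiles to a single instruction because its argument is a constant (a sketch):

v = v.ShiftLeftConst(3) // VPSLLD $3
x := v.GetElem(0)       // VPEXTRD $0
v = v.SetElem(3, x)     // VPINSRD $3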

An (incomplete) API list

To give a concrete example, here is the proposed API for Uint32x4 on AMD64.

type Uint32x4 struct { a0, a1, a2, a3 uint32 }
func LoadUint32x4(*[4]uint32) Uint32x4 // VMOVDQU
func (Uint32x4) Store(*[4]uint32) // VMOVDQU
func (Uint32x4) Add(Uint32x4) Uint32x4 // VPADDD
func (Uint32x4) Sub(Uint32x4) Uint32x4 // VPSUBD
func (Uint32x4) Mul(Uint32x4) Uint32x4 // VPMULLD
func (Uint32x4) Min(Uint32x4) Uint32x4 // VPMINUD
func (Uint32x4) Max(Uint32x4) Uint32x4 // VPMAXUD
func (Uint32x4) And(Uint32x4) Uint32x4 // VPAND
func (Uint32x4) Or(Uint32x4) Uint32x4 // VPOR
func (Uint32x4) Xor(Uint32x4) Uint32x4 // VPXOR
func (Uint32x4) AndNot(Uint32x4) Uint32x4 // VPANDN
func (Uint32x4) ShiftLeft(Uint32x4) Uint32x4 // VPSLLVD
func (Uint32x4) ShiftLeftConst(uint8) Uint32x4 // VPSLLD
func (Uint32x4) ShiftRight(Uint32x4) Uint32x4 // VPSRLVD
func (Uint32x4) ShiftRightConst(uint8) Uint32x4 // VPSRLD
func (Uint32x4) RotateLeft(Uint32x4) Uint32x4 // VPROLVD
func (Uint32x4) RotateLeftConst(uint8) Uint32x4 // VPROLD
func (Uint32x4) RotateRight(Uint32x4) Uint32x4 // VPRORVD
func (Uint32x4) RotateRightConst(uint8) Uint32x4 // VPRORD
func (Uint32x4) PairwiseAdd(Uint32x4) Uint32x4 // VPHADDD
func (Uint32x4) MulEvenWiden(Uint32x4) Uint64x2 // VPMULUDQ
func (Uint32x4) OnesCount() Uint32x4 // VPOPCNTD
func (Uint32x4) LeadingZeros() Uint32x4 // VPLZCNTD
func (Uint32x4) Equal(Uint32x4) Mask32x4 // VPCMPEQD or VPCMPD $0
func (Uint32x4) GreaterThan(Uint32x4) Mask32x4 // VPCMPGTD or VPCMPD $6
func (Uint32x4) Blend(Uint32x4, Mask32x4) Uint32x4 // VPBLENDMD
func (Uint32x4) Compress(Mask32x4) Uint32x4 // VPCOMPRESSD
func (Uint32x4) Expand(Mask32x4) Uint32x4 // VPEXPANDD
func (Uint32x4) Permute(Uint32x4) Uint32x4 // VPERMD
func (Uint32x4) Broadcast() Uint32x4 // VPBROADCASTD
func (Uint32x4) GetElem(int) uint32 // VPEXTRD
func (Uint32x4) SetElem(int, uint32) Uint32x4 // VPINSRD
func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) TruncateToUint8() Uint8x16 // VPMOVDB
func (Uint32x4) ExtendToUint64() Uint64x4 // VPMOVZXDQ
func (Uint32x4) ConvertToFloat32() Float32x4 // VCVTUDQ2PS
func (Uint32x4) ConvertToFloat64() Float64x4 // VCVTUDQ2PD
func (Uint32x4) AsInt32x4() Int32x4 // no-op
func (Uint32x4) AsFloat32x4() Float32x4 // no-op

For the operations that support masking, we will also have masked versions, like AddMasked above.

Some operations are supported only on some types. For example, a vector with floating point elements (e.g. Float64x4) supports Div, Reciprocal, and Sqrt, but does not support shifts or OnesCount.

Note that this list is not comprehensive; e.g. it doesn't yet include operations like Gather, Scatter, Intersect, or Galois Field Affine Transformation. We do plan to support these operations in the API. This proposal is just proposing the direction we're heading in. We can add more operations later.

Note also that we intentionally don't want to include some forms of the machine instructions in the API, and instead leave them to the optimizer. For example, on AMD64, many arithmetic operations support a memory operand. Just like the language has a + operator and a dereference (unary *) operator but not an operator for a + *b, we don't provide an API for the same operation on vectors. Instead, if a Load operation is followed by an Add, the compiler can optimize it to the memory form of the ADD instruction.
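
For example (a sketch, where acc is a Uint32x4 accumulator and p is a *[4]uint32):

// The compiler can fuse the load into the add, emitting the
// memory-operand form of VPADDD instead of a separate VMOVDQU.
acc = acc.Add(simd.LoadUint32x4(p))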

CPU features

SIMD operations often require certain CPU features. One may want to check whether specific CPU features are available on the target hardware, and guard the SIMD code accordingly. CPU feature check functions will be provided, e.g.

func HasAVX512() bool
func HasAVX512VL() bool

The compiler will treat them like pure functions, as they never change after runtime initialization. That said, it is still recommended to check CPU features before doing SIMD operations, instead of doing it in the middle of a sequence of SIMD operations.
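
A sketch of the recommended pattern (sumAVX512 and sumScalar are hypothetical helpers):

func sum(s []uint32) uint32 {
    if simd.HasAVX512() {
        return sumAVX512(s) // wide-vector path
    }
    return sumScalar(s) // portable fallback
}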

It is an open question whether we want to enforce that a CPU feature check must be performed before using a vector intrinsic, through static or dynamic analysis. Required checks would encourage portable code (across machines of the same architecture with different CPU features).

AVX vs. SSE

On AMD64, there are two "flavors" of SIMD instructions, SSE and AVX. Almost all the SSE operations have equivalent AVX instructions. It is not recommended to mix the two forms, which may sometimes result in a performance penalty. Therefore, we want to stick to one form. The initial version of this API will always generate the instructions in AVX form. One goal of this API is to support writing performant code on the evolving hardware, making use of advanced CPU features. SSE operations may be added later in an explicit or transparent way, if there is a strong need.

Discussion

Alternatives

We have considered a few alternative APIs.

Instead of methods, the operations could be defined as top-level functions, like AddUint32x4. This name is unnecessarily long, and repetitive if a piece of code operates on the same type of vectors over and over. We also considered defining them as generic functions, like Add[T Vec](T, T) T. This makes the name short, but in user code the package prefix (simd.) will likely still be repeated. The main difficulty for generic functions is that it is hard to express relationships between types, e.g. MulEvenWiden returns a vector with half as many elements of twice the width. Besides, the implementation would be more complex. Overall it doesn't provide much benefit over methods.

We also considered defining the vector types as generic types, like Vec4[T]. This approach has difficulties similar to those of generic functions. It is also hard to express irregularity: for example, on AMD64, the SaturatedAdd operation is supported for 8- and 16-bit integer elements, but not wider ones.

Future work

Scalable vectors and high-level portable API

A number of architectures have chosen to adopt scalable vectors, such as ARM64 SVE and RISC-V Vector Extension. For scalable vectors, the size cannot be determined at compile time, and it theoretically can be quite large. We plan to add support for scalable vectors in Go, although currently we're not ready to propose a concrete design.

On top of that, we plan to add a high-level portable API for vector operations. Existing portable SIMD implementations such as Highway will be a source of inspiration. To support various architectures and CPU features, the API will probably be based on scalable vectors. On platforms like AMD64, it may be lowered to a fixed size vector representation, depending on the hardware features.

It is expected that for the majority of use cases in data processing and AI infrastructure, it will be possible to write just with the high-level API, which achieves portability and performance. We also hope that the low-level and high-level APIs are interoperable. If some code is mostly portable but just needs an operation or two that is very architecture-specific, one can write the code mostly using the high-level API, and drop to the low-level just for these operations.

Comment From: dominikh

How do you plan to handle immediates, like in VPINSRD? Specifically the requirement that their values be known at compile time?

Comment From: randall77

func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) TruncateToUint8() Uint8x16 // VPMOVDB

These should generate Uint16x4 and Uint8x4? Otherwise these would be more of a Reinterpret operation?

func (Uint32x4) PairwiseAdd(Uint32x4) Uint32x4 // VPHADDD

Not a fan of the "pairwise" name, or the weird semantics (receiver goes to low 2 entries, arg goes to high 2 entries). Maybe

func (Uint32x8) ReduceAdd() Uint32x4

Where to get a VPHADDD you would first do two AsUint32x8 and then shift/pack somehow to get a Uint32x8 that you can then ReduceAdd to the desired result.

func (Uint32x4) Permute(Uint32x4) Uint32x4 // VPERMD

Really the arg is Uint2x4, as the permutation indexes are 0-3. Not sure whether it is necessary to enforce that or not. Similarly for the shift/rotate operations. Maybe we just handle the input restrictions in comments.

func (Uint32x4) MulEvenWiden(Uint32x4) Uint64x2 // VPMULUDQ

This should really generate a Uint64x4, and be called MulWiden. Then a subselect (Uint64x4).Evens() Uint64x2 afterwards would together make a VPMULUDQ.

Am I allowed to write Uint32x4{99,0,0,0}.Broadcast()? I presume so. Might be clearer to do BroadcastUint32x4(x uint32) Uint32x4, although maybe we would be starting with simd types a fair amount. Maybe both?

Comment From: apparentlymart

Thanks for writing this up! This seems like a good general direction; I like the distinction between low-level intrinsics and higher-level operations.

With that said, I find the "best-effort portability" angle a little concerning. You made an analogy to the syscall vs. os distinction, where a similarly concerning situation exists:

  • The syscall functions are largely named after POSIX functions even though on some operating systems they do something more complex than just directly wrapping the POSIX function of the same name.
  • Not all functions in syscall are supported on all Go targets. For those that are supported on multiple operating systems, the functionality is nonetheless often subtly different in behavior, in performance, or both.
  • pkg.go.dev lets you choose between different platform-specific views of the docs for that package [1], but it's hard to quickly understand which subset of the functionality is portable and which isn't.

These and other concerns led to splitting syscall into a number of OS-specific packages, each of which is usable only on the platform it corresponds to. This then means that it's explicitly the caller's responsibility to use conditional compilation to call into the appropriate package for the target OS, rather than the low-level package attempting to provide that abstraction itself.

Based on that experience, I would personally have expected a separate package of SIMD intrinsics per architecture, and then have the caller (either end-user code directly, or the higher-level portable wrapper implementation) use conditional compilation to address each implementation they intend to support.

I acknowledge that the proposal described the situation as "best-effort portability", and that this comment is essentially arguing for making no effort whatsoever to achieve portability at this initial level of abstraction. I think it's clearer to have one layer that is explicitly not portable and then a separate layer on top that provides portable wrappers for the subset that all targets have in common, since then it's easier to understand what is portable and what isn't.


I am essentially making an SIMD-architecture-flavored variation of the following from the syscall-splitting proposal:

Inside go.sys, there will be three packages, independent of syscall, called plan9, unix, and windows, and the current syscall package's contents will be broken apart as appropriate and installed in those packages.

(This split expresses the fundamental interface differences between the systems, permitting some source-level portability, but within the packages build tags will still be needed to separate out architectures and variants (darwin, linux)).

These are the packages we expect all external Go packages to migrate to when they need support for system calls. Because they are distinct, they are easier to curate, easier to examine with godoc, and may be easier to keep well documented. This layout also makes it clearer how to write cross-platform code: by separating system-dependent elements into separately imported components.

[1] I notice also that pkg.go.dev's selections are currently focused mainly on OS differences, so if this proposal introduces a new package that differs significantly between GOARCH values then that would probably call for expanding pkgsite to support selecting between multiple CPU architectures, too.

Comment From: ianlancetaylor

We choose the names of the operations to be easy to understand, and not tied to a specific architecture. The goal is that general readers will understand the code without digging into the CPU details.

This is the kind of decision that we made (following Inferno) with the assembler: we have the same instruction names across architectures, rather than following the conventions of other assemblers. I think it's a mistake. It is better for people who are only superficially familiar with the architecture. But it is worse for experts. Experts in effect have to write their SIMD code using the operations with which they are familiar, and then look up the translation into the names used by Go.

I also note that Intel architectures have some 400 SIMD instructions. Even granting that Go doesn't expect to support all of them directly (after all, you can always use the assembler), it is implausible to think that there are good and meaningful names for 200 different vector instructions.

Just as oddball instructions in the assembler wind up using the names that are standard for the architecture, the simd package will wind up doing the same. Then we have a mix of meaningful names and architecture-specific names, so nobody is really clear on what is going on. And Go is living on a separate island, which is absolutely fine when there is a good reason to do so, but what is the good reason in this case?

I think it's better to provide the names that experts expect, either in the form of the instruction names or in the form of the intrinsic functions that architectures defined for C/C++ programmers.

Once we have that it would certainly be fine to define another package that provides readable names for common operations, providing a modicum of architecture portability without attempting to support all operations.

Comment From: cespare

@ianlancetaylor writes:

I think it's a mistake. It is better for people who are only superficially familiar with the architecture. But it is worse for experts. Experts in effect have to write their SIMD code using the operations with which they are familiar, and then look up the translation into the names used by Go.

I agree, but I think it's even worse than this, because I'm not sure it helps the "superficially familiar" newbie either. I had this experience a decade ago doing a bunch of Go assembly work as a non-expert. I was cross-referencing the Intel manual, Agner Fog, objdump output, and so forth, and figuring out the Go assembler naming scheme just added an extra layer of learning that I had to do on top of it all.

(That said, I don't feel too strongly about this because what's proposed here -- where each method has a comment that clearly lists the corresponding amd64 instruction mnemonic -- seems much easier to figure out than the Go assembler.)

Comment From: dr2chase

These should generate Uint16x4 and Uint8x4?

The problem with those types is, what do you do with them? They're not valid operands to any other instruction, there's no operations on those partial SIMD registers (except when a truncated ZMM or YMM happens to fit in an XMM). The obvious next step is to widen them to the zero-extended wider type, which is what the operation produced anyway. So, skip all that.

The exception to this is if their destination is memory, and given the plan to obtain that by optimizing away the stores, we would need obvious-to-the-optimizer sub-vector store operations, if we don't support these weird small types.

Comment From: randall77

I'd rather do

var large Uint32x4 = ...
var p *Uint16x4 = ...
large.TruncateToUint16().Store(p)

Even if the Uint16x4 is stored in the low part of a larger register.

If you really want the results in the low half of a larger thing,

large.TruncateToUint16().DoubleUpWithZeros()

or something.

Comment From: cherrymui

@dominikh

How do you plan to handle immediates, like in VPINSRD? Specifically the requirement that their values be known at compile time?

In the proposal draft I actually had a section about constant operands, and I decided not to include it in the post 😂 Here it is:

Constant operands

Some machine instructions require a constant operand, or have a form with a constant operand. For example, on AMD64, getting or setting a specific element of a vector (VPEXTRD/VPINSRD instruction) requires a constant index. For shifts, one form is shifting all the elements by the same amount, which has to be a constant (the other form being shifting a variable amount for each element). Namely,

func (Uint32x4) GetElem(int) uint32 // VPEXTRD
func (Uint32x4) SetElem(int, uint32) Uint32x4 // VPINSRD
func (Uint32x4) ShiftLeftConst(uint8) Uint32x4 // VPSLLD

It is recommended (and will be documented) that these methods be called with constant arguments, so that the compiler can generate efficient code. What happens if an argument is not constant? The corresponding C intrinsics may cause a compilation failure. We could choose to do an emulation, or generate a table switch (as the range of the constant operand is usually small), as a fallback path.

Comment From: cherrymui

@ianlancetaylor @cespare One reason to choose names that are more descriptive is that I think it makes the code much easier to read: General readers of the code could easily understand (or guess) what Uint32x4.Add does, but not quite so for a function named VPADDD without reading the Intel manual or looking it up on the internet. I think that generally, there will be more users who read a piece of code than who write it. So I'm leaning towards the reader side.

For SIMD code writers, the methods will be clearly documented with the machine instruction names. So code writers can easily look them up. This is where it can be different from the assembler, for which we don't always have good documentation of the mapping. Another difference is that assembly code is hard to read in general, even if we try to make it more unified.

Comment From: cherrymui

For naming, a reference (I should have added a reference to the proposal text) is that C# seems to choose to use common, descriptive names for their vector intrinsics. E.g. the X86 and ARM intrinsics classes both define Add functions to add two vectors (https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx?view=net-8.0, https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.arm.advsimd?view=net-8.0). One difference is that they have different classes for different architectures and CPU features. Nevertheless, the names are common.

Comment From: cherrymui

@randall77

func (Uint32x4) TruncateToUint16() Uint16x8 // VPMOVDW
func (Uint32x4) TruncateToUint8() Uint8x16 // VPMOVDB

These should generate Uint16x4 and Uint8x4

As @dr2chase mentioned, the reason for that is that the smallest vector size is 128-bit. Perhaps we should name them TruncateToUint16PackedLow or so to be clearer.

Uint2x4

I think it would be infeasible to define many types for "small" vectors. We can document the semantics of Permute with Uint32x4. We probably don't want to enforce that high bits are not set. It wouldn't be surprising if one came up with clever tricks using a vector of indices with the right low bits but possibly nonzero high bits (e.g. using -1 to represent the last element).

func (Uint32x4) PairwiseAdd(Uint32x4) Uint32x4 // VPHADDD

Not a fan of the "pairwise" name, or the weird semantics (receiver goes to low 2 entries, arg goes to high 2 entries).

We can choose a different name. I'm not a fan of the semantics either. But in this low-level package I'm leaning towards doing what the machine instruction does. At least AMD64 (the VPHADDx instruction) and ARM64 (the ADDP instruction) both do that...

Comment From: balasanjay

+1 to apparentlymart's comment above about having per-architecture packages (with implementations that just immediately panic on any other architecture).

Given the proposal says this: "It is expected that for the majority of use cases in data processing and AI infrastructure, it will be possible to write just with the high-level API, which achieves portability and performance", it seems a little messy to mix the portable and non-portable APIs in the same package. And if we do intend to separate them, I think we should save the nice package name of "simd" for the portable API.

(All that said, I'm a big fan of this proposal, kudos!)

Comment From: dominikh

I'll also note that

The vector types will be defined on the architectures that support them

and

The operations will be defined in an architecture-specific way (controlled by build tags). This allows different architectures to have different operations

preclude writing low-level SIMD in files that don't have any build tags and using bits of SIMD in otherwise "normal" Go code by guarding uses of SIMD with static checks on architecture and dynamic checks on features. That is, we cannot have a single implementation of a function that uses scalar code, AMD64 assembly or ARM64 assembly, depending on which architecture we're compiling for and which CPU features were detected at runtime. At a minimum, we'll need a SIMD-free implementation and a SIMDful implementation, in two separate files, or our code cannot build for any architecture that isn't supported by the simd package. And once we use any instructions or register sizes that are architecture-specific, we need >2 files, one per architecture.

Given those limitations, we might as well import different packages per architecture, avoiding the need for finding similar-but-different names for functions in the case of

We will avoid methods with the same name and signature but different semantics on different architectures.

Having multiple packages would also make it easier to browse documentation and might aid code completion in gopls--as in, I will want to write ARM64 SIMD code while working on my x86 computer and would like auto completion to work well.

Yes, this would cause code duplication for trivial SIMD code that would be portable, but how often will such trivial code really be written? And couldn't it be written with the planned high-level abstraction instead?

Comment From: dominikh

@ianlancetaylor @cespare One reason to choose names that are more descriptive is that I think it makes the code much easier to read: General readers of the code could easily understand (or guess) what Uint32x4.Add does, but not quite so for a function named VPADDD without reading the Intel manual or looking it up on the internet. I think that generally, there will be more users who read a piece of code than who write it. So I'm leaning towards the reader side.

For SIMD code writers, the methods will be clearly documented with the machine instruction names. So code writers can easily look them up. This is where it can be different from the assembler, for which we don't always have good documentation of the mapping. Another difference is that assembly code is hard to read in general, even if we try to make it more unified.

This seems like a noble (and difficult to achieve) goal for a high level abstraction, but I think that people who read low-level simd code can be expected to be familiar with the instruction set, or to want to easily look up instructions in a manual, which is complicated by first having to map between Go names and official names. I also think that names are the smallest difficulty. On some architectures, they're quite readable, and on others, they follow patterns. What is far more complex is the actual behavior of instructions, the way lanes interact, etc. This will require a lot of manual reading, which again would be easier if names could be copied verbatim between Go code and the manual. I don't think low-level SIMD code will ever be intuitive to the casual reader (and rarely to the knowledgeable reader...)

I also share Ian's doubts about finding good names for 200+ instructions, especially if we need to find different names for identical actions that nevertheless have different semantics. What would you call VP4DPWSSD?

Comment From: dominikh

An assortment of other things I've wondered about:

> The internal representation contains the elements of the proper type and count (we may include other zero-width unexported tagging fields if necessary).

Do you plan to support types that Go doesn't natively support, such as bfloat16 and half-precision floats? These can't be represented natively in Go, but I could imagine having some "untyped", i.e. []byte memory holding such values and wanting to do SIMD math on them. Or is that out of the scope of the simd package?

> CPU feature check functions will be provided

Have you considered the overlap with the golang.org/x/sys/cpu package? Will these functions support something akin to GODEBUG=cpu.avx=off? Will GOAMD64 allow constant folding calls to the functions?

VZEROUPPER

What are your thoughts on VZEROUPPER and how to handle it in simd? With Go assembly files, this isn't much of a problem, as entire functions are written in assembly, they rarely call into other code, and they can trivially end with a manual VZEROUPPER. But with simd, arbitrary Go code and AVX code can be mixed, which might involve calling into runtime code or assembly functions, both of which might use SSE. It is my understanding that this is a solved problem in C compilers(?)

Loop unrolling

A common pattern in assembly code is loop unrolling. The Go compiler doesn't currently have loop unrolling, neither automatic nor explicit. Do you think that work on simd will motivate work on loop unrolling in the compiler?

GOEXPERIMENT

Will GOEXPERIMENT allow us to iterate on the design of the simd package and make breaking changes?

Comment From: cherrymui

Do you plan to support types that Go doesn't natively support, such as bfloat16 and half-precision floats?

Yes, we do plan to support them (later). As you mentioned, we don't need to support them as builtin types in Go. We can support them in vector form in the simd package. A number of CPU architectures support them as vectors but not as scalars.

Have you considered the overlap with the golang.org/x/sys/cpu package? Will these functions support something akin to GODEBUG=cpu.avx=off? Will GOAMD64 allow constant folding calls to the functions?

For the simd package, I think we'd need feature checks in the standard library. My current implementation plan is to build on top of the internal/cpu package and expose that at the API level. So GODEBUG settings would work as expected. If we constant fold based on the build-time GOAMD64 setting, it won't be overridden by GODEBUG. Perhaps that is fine. We'd need to decide.

VZEROUPPER

For functions that use wider vectors, for operations where we currently use SSE (e.g. inlined memmove), we can choose to generate the AVX form of those instructions (we'd need to be careful if the function contains a CPU feature check).

Currently, vector registers are all dead at function call boundaries, unless they are passed as parameters or results. So the compiler can safely insert a VZEROUPPER at function call boundaries if it does not pass vectors as parameters or results. (If it does, we know we're entering another function that uses AVX, so VZEROUPPER is not needed.) Later on we might do some inter-procedural analysis to make it smarter.

Loop unrolling

I don't plan to add loop unrolling in the scope of this proposal, although it doesn't exclude the possibility to add it in the future. For now, one can manually unroll the loop.

Will GOEXPERIMENT allow us to iterate on the design of the simd package and make breaking changes?

Yes.

Comment From: janpfeifer

I didn't see it explicitly mentioned (maybe I missed it, and maybe it's implied or trivial), so let me suggest also including some API to dynamically (at runtime) probe the available SIMD features of the current CPU/core the program is running on.

This way one can compile binaries (e.g.: for distribution) with support for all (or some subset, like Highway seems to do) variations, with some form of dynamic (runtime) dispatch. That is, assuming cross-compilation of SIMD is also a given.

Comment From: mcy

Hey, I got linked this proposal over on bluesky and said they were asking for feedback. For a little background, I've given trial runs to a lot of SIMD libraries at this point so I think I can probably provide some useful insights here. I apologize in advance that I am going to ramble a lot, this is more stream of consciousness than anything else.

Since this is intended to be a low-level library (that, perhaps, something a bit more like Highway or rust's Simd type would be built on top of), I think that having types of the form TxN is a bit of a distraction. Lower-level SIMD code will typically pun registers very hard; I've written code in Rust that is a bit painful because converting from a u8x8 to a u16x4 is not cleanly supported.

For something at this level, I think that providing types like simd.Vec128 and simd.Mask128 will probably simplify a lot of this punning pain, where instead the lane size of the operation is determined by the method, e.g. simd.Vec128.AddInt32 or simd.Vec128.AddFloat64. This is closer to how vendor intrinsics work in C/C++.

This also avoids the duplication of signed/unsigned types, since very few operations actually care about signedness. IMO, when doing comparisons in this type of algorithm, it is much clearer to specify whether a comparison is signed or not at the site of the comparison; the same holds for zero/sign extending lanes.

This also reduces the scope of the API, since the number of types isn't quadratic. In fact, one thing that I think the Rust API got wrong is that their primitive is a generic <N x T> (in LLVM terms), when really, being generic on the bit width of the whole vector is more common (to, say, select between sse, avx2, and avx512 versions of a kernel).

Another thing that is unclear is whether this API is intended to be instruction set specific or not. I get the vibes of "no", which is fine, and generally better, but as Ian observes, this is going to confuse people who are looking for a specific instruction. My recommendation is as follows:

  1. Operations that are obviously provided by everyone can stay as they are. Intel's intrinsic names are stupid and we should not replicate them. I also think that we should be careful to not pick Intel's names for instructions as the name for higher-level operations. Nobody calls vpblendmq "blend"; it's usually called "select". Similarly, nobody calls vpermps "permute"; it's usually "swizzle" (even calling this a "shuffle" is an Intel-specific thing). If the intent is for this to be portable-ish, having to use a different name for manifestly the same operation on different instruction sets is going to be confusing.

  2. Instruction-set-specific operations should probably go in their own packages as free functions. For example, swisstable uses this weird "select all the sign bits" instruction that only Intel provides. I think that pulling in package simd/avx2 and calling avx2.Sign8 (for _mm_sign_epi8) would be fine. (Incidentally, being able to implement the swisstable simd for both x86 and arm would be a good goal for this proposal, to make sure it covers a real-life use-case). I also think that, for explicitness, it may be worthwhile for everything available through the main simd types to also have free function intrinsics.

Speaking of naming, I disagree with Ian we need to stick to exactly the names from the instruction set. I am an x86 expert and I despise the Intel names for things, they are not readable, especially to non-experts reading my code. I would prefer that Go try to pick names that are faithful to Intel's intent without them being completely inscrutable.

I also think that having a plan for people to use instructions that don't have explicit intrinsics should probably be in place. For the Go assembler (which is incomplete), the guidance is to explicitly encode instructions. Pretty bad, but doable. But calling a function, especially a non-ABIInternal function, is extremely slow, so having something approximating inline assembly will become a bit more necessary. I don't think full inline assembly is necessary, just something to the effect of

//go:intrinsic my_instruction %a, %b, %c
func MyIntrinsic(a, b Vec128) Vec128

Not to mention that this will probably be valuable for making the API in the standard library a bit more data-driven, instead of needing to write a million intrinsic hooks for things that are basically just inline assembly (for instructions that do not need complicated instruction selection logic).

Which brings me to "table stakes" instructions. One thing I see missing is a good shuffle intrinsic. I see a way to call vpermps and friends, but shuffle-by-immediate is extremely common. I also don't see a "load constant" operation, i.e., something like Load but taking n by-value arguments. This is a very important operation, and having to write LoadUint32x4(&[...]uint32{...}) all the time is going to get old. Also, Broadcast seems to have the wrong signature. I would expect something like e.g. BroadcastUint32x4(uint32) Uint32x4 or similar.

It may also be worth considering making the vector types magically able to be placed in const. Despite being structs they are immutable, and because Go more-or-less promises not to optimize away loads of global vars (due to linker shenanigans, among other things) being able to hoist constants across functions into consts would be very helpful. This would likely require making the functions for constructing constants be magic in the same way that unsafe.Sizeof and friends are.

Relatedly, I actually kind of doubt that loading from *[n]T and []T is a useful operation. Being able to directly cast to/from [n]T (which is what Rust provides) would probably be the best alternative overall. This composes well with Go's []T <-> *[n]T casts.

Regarding loads and stores, this brings up a question of layout (and in my mind, why the Load/Store operations exist at all). On many architectures, loads and stores of vector registers require stronger alignment than the maximum Go alignment of 8 (and certainly more than the alignment of a [4]uint32). So it's probably going to be necessary to answer:

  1. What is the alignment of a simd vector for the purposes of struct layout.
  2. How are simd vectors passed to ABIInternal functions (on x86 this triggers really painful questions about vzeroupper).

You might think (2) is not a big deal because you really shouldn't be passing these things into functions, but there are cases where you might want to pass a vector to a noinline function. This is not a question you want to put off answering. Also, I would not mind having more than nine integer argument registers on x86... I tried using float64 for this at one point and had a very bad time, lol.

This brings me to the last bit, which is that Go has a very timid inline heuristic, which is fine for 99% of code, but a huge problem for performance-oriented code, such as anything using simd kernels. Much of the draw of writing things in the surface language rather than assembly is inlining, and especially inlining of helper functions. I've written a lot of high-perf Go at this point, and fighting with the inliner has been a huge source of performance footguns.

For example, without PGO, the inliner will never inline a function that contains both a branch and a load, or two separate loads. Hand-inlining stuff leads to mistakes, and I would prefer that Go not encourage code patterns that make it easy to make mistakes in code that is already very hard to debug.

My suggestion would probably be to introduce some kind of //go:mustinline directive that can only be applied to unexported functions, which generates a nosplit-like compile error if the function is part of a non-trivial call graph SCC, or is made into a func(). The intent is that the function is more like a macro and less like a function (compare, macros in assembly). This avoids the classic problem of inline directives getting put everywhere by people who have no business using them, since it is intended for internal helpers only.

Finally, some notes on documentation. I think that for every function, the docs should probably include:

  1. Instruction selection notes, such as under what conditions an argument turns into an immediate. Trying to guess what this is is not fun.
  2. The instruction that it lowers to in supported instruction sets. If we went the route of having very specialized intrinsics in e.g. a simd/avx2 or simd/neon package, it might be better to point to those, and then have those specify which instruction they represent.
     a. Instructions should use their official name, not the one that Go chose for its assembler. Among other things, this means not using the ones GCC made up for AT&T syntax. Or maybe include all of them, everyone seems to know them by different names on x86.
     b. Linking to the corresponding official intrinsic, e.g. Intel or ARM's documentation. This helps close the loop, and will make the Go documentation searchable by ctrl-F.
  3. If the instruction needs to do emulation, it should probably be extremely specific about what that emulation looks like, even if it's not a compatibility promise. In simd/avx2, I would generally eschew emulation and emit a backend error. Way too easy to be surprised by emulation at that level, IMO.

And one final note: don't spend too long working on this only targeting x86. If you don't do at least two instruction sets from day 1, you'll probably have to pay for at least one extra redesign.

Again, sorry if this is all over the place. SIMD is very near and dear to my heart and there's a lot of small details that you won't notice until you're way too deep into an awkward design.

Comment From: cherrymui

@mcy thanks for the detailed comments!

I think that having types of the form TxN is a bit of a distraction. Lower-level SIMD code will typically pun registers very hard; I've written code in Rust that is a bit painful because converting from a u8x8 to a u16x4 is not cleanly supported.

Go is a type safe language. I'd lean towards keeping the type safety for vectors as well. I think in many cases, it is clearer to have the element layout (like Uint32x4) explicit, than an implicit Vec128. As mentioned in the "Conversions" section, we do plan to support the "punning" conversions, without generating a machine instruction. So one can write code with punning without performance overhead. Just like Go requires explicit conversions in places where C allows implicit conversions, I'd think explicit conversions between punned vector types are better than implicit.

Operations that are obviously provided by everyone can stay as they are. Intel's intrinsic names are stupid and we should not replicate them. I also think that we should be careful to not pick Intel's names for instructions as the name for higher-level operations. Nobody calls vpblendmq "blend"; it's usually called "select". Similarly, nobody calls vpermps "permute"; it's usually "swizzle" (even calling this a "shuffle" is an Intel-specific thing). If the intent is for this to be portable-ish, having to use a different name for manifestly the same operation on different instruction sets is going to be confusing.

Sure, we can carefully choose the names so the meaning is clear to most users. I'd think we learn from Highway and C#'s APIs. (For the list above I didn't look them all up, and picked a few arbitrarily.)

Being able to directly cast to/from [n]T (which is what Rust provides) would probably be the best alternative overall.

At least with the current implementation, [n]T is mostly stored and passed in memory. So semantically, load/store with *[n]T is not that different from a cast to/from [n]T. Go's conversion syntax and function calls are similar, so they are not that different even at the syntactic level. Could you explain what specific advantage casting between a vector and an array would have? I'd lean towards explicit function calls, instead of special handling of array types.

I also think that having a plan for people to use instructions that don't have explicit intrinsics should probably be in place.

I'll note that this proposal does not exclude the possibility of adding this in the future. For now, I'd say it is out of the scope of this proposal. I don't think we want to leave the SIMD API to just user-defined intrinsics: 1. for the same instruction, different users could define them differently, which reduces code readability; 2. in some cases, the compiler still needs to understand the instruction in order to do optimizations with it; if so, why not just define the intrinsic.

I see a way to call vpermps and friends, but shuffle-by-immediate is extremely common.

Sure, we can have that, although I haven't thought carefully about what its signature should be. As noted above, the API list is not complete.

What is the alignment of a simd vector for the purposes of struct layout.

This will be architecture dependent. My impression is that alignment is not that important on modern hardware. On platforms where alignment is either required or more performant, we can give vectors a larger alignment than the natural alignment of a struct with the same elements. We'll also introduce a way to allocate memory with large alignment. (To be clear, the Load function taking a *[n]T doesn't mean that we expect the backing store to be allocated with, say, new([n]T).)

How are simd vectors passed to ABIInternal functions

They'll be passed in vector registers. Writing helper functions for small kernels is a reasonable thing to do, so we shouldn't exclude this possibility.

For inlining, we can still improve the inlining heuristics. One heuristic we could consider is that functions passing/returning vectors will get a higher budget. A mustinline directive would be out of scope for this proposal, though (personally I'd actually like macros, but most Go developers probably don't agree with me).

And one final note: don't spend too long working on this only targeting x86. If you don't do at least two instruction sets from day 1, you'll probably have to pay for at least one extra redesign.

Definitely. We're thinking and working on ARM64 support, including SVE. The current proposal may seem somewhat AMD64 targeted, but it is intended to be a concrete example behind a GOEXPERIMENT for early experimentation.

Thanks.

Comment From: cherrymui

@janpfeifer Yes, CPU feature detection functions will be provided, as mentioned in the "CPU features" section.

Comment From: mcy

Could you explain what specific advantage casting between a vector and an array would have? I'd lean towards explicit function calls, instead of special handling of array types.

Of course. I just mean that converting to/from array by-value should be the primary way to construct simd values, and let users manage loading and storing through the ordinary dereference operator. The conversion should be by function. Although I personally don't think there would be harm in making Uint32x4 have core type [4]uint32, and it would enable a more ergonomic API. (Part of this is that I think Go treating arrays and structs differently in the ABI was not the right choice, for reasons not especially relevant to the proposal.)

I'd lean towards keeping the type safety for vectors as well. I think in many cases, it is clearer to have the element layout (like Uint32x4) explicit, than an implicit Vec128.

IME this has not been the case, but I think this is going to depend on how painful things are. I'll definitely provide more concrete feedback once there is an API to play with.

Re conversions, do make sure that in addition to the i2f/f2i conversions there is also a version for doing a bitcast, since that's an important operation.

For inlining, we can still improve the inlining heuristics. One heuristic we could consider is that functions passing/returning vectors will get a higher budget

I think we'll have to see if that's enough or not. This is uncharted territory as far as compiler implementations go, since every other language solves this with something morally equivalent to macros.

What I will probably do is port my base64 codec as-is from Rust and see if the Go inliner makes any mistakes or not.

My impression is that alignment is not that important on modern hardware.

I used to believe this, but I know it's been a problem for Rust, though not for me personally. Go does lack a way to align to a cache line, too... so maybe it doesn't matter that much.

if so, why not just define the intrinsic.

No other reason than that Go releases much less frequently than most other compilers... if the policy is just that if someone asks for an instruction it will get added, I would be less worried. Like Ian noted elsewhere, Intel alone has like 400 intrinsics.

Oh, I remembered one more thing: Go doesn't really have anything like -mcpu right? What's the plan for when someone uses an intrinsic that requires cpu feature checks on hardware that doesn't support it? IIRC you are not guaranteed to get a SIGILL in every case, and even then it's not clear to me if that should be a panic or a throw... if it's not a throw it opens you up to some rather, uh, nasty compatibility situations.

Relatedly, Rust currently passes simd values by pointer rather than by register to work around -mcpu affecting how LLVM chooses to lower a particular vector argument on some architectures, which is a problem for statically linking different objects compiled with different tuning flags, which is something rust unfortunately supports. I don't know as much about exotic build modes in Go, so I don't know if this is a problem for Go or not.

Comment From: cherrymui

Of course. I just mean that converting to/from array by-value should be the primary way to construct simd values, and let users manage loading and storing through the ordinary dereference operator. The conversion should be by function. Although I personally don't think there would be harm in making Uint32x4 have core type [4]uint32, and it would enable a more ergonomic API.

We considered using arrays to represent SIMD vectors in the early design. One difficulty we realized is that arrays permit indexing with a dynamic (non-constant) value, i.e. x[i] where i is a variable, whereas CPUs usually don't provide instructions for dynamically indexing an element in a vector register. If we cannot compile it efficiently, I'd rather not allow such a language-level construct.

I'd envision that a number of use cases for SIMD will operate on data stored in slices. So loading/storing from/to slices looks natural. Suppose we had an API to convert an array to a SIMD vector, say, func MakeUint32x4([4]uint32) Uint32x4; users may often write MakeUint32x4(*(*[4]uint32)(s)). We could just provide an API for that.

With the Load/Store API, one can write MakeUint32x4 oneself, such as

func MakeUint32x4(x [4]uint32) Uint32x4 { return LoadUint32x4(&x) }

If it turns out to be very useful, we can add it.

(Part of this is that I think Go treating arrays and structs differently in the ABI was not the right choice, for reasons not especially relevant to the proposal.)

I assume you mean ABIInternal. ABIInternal is an implementation detail, which we can certainly change, provided there is a good benefit. In fact, the work on SIMD would inevitably require some additions to the internal ABI, and we plan to take this opportunity to make some changes as well.

Re conversions, do make sure that in addition to the i2f/f2i conversions there is also a version for doing a bitcast, since that's an important operation.

Of course. The "Conversions" section in the proposal mentions func (Uint32x4) AsFloat32x4() Float32x4 // interpret the bits as float32, which I think is what you mean by bitcast.

Go doesn't really have anything like -mcpu right?

We have CPU micro-architecture settings through the GOAMD64, GOARM64, etc. environment variables. They can take a micro-architecture version, e.g. GOAMD64=v4, which targets x86-64-v4 machines, possibly with a suffix indicating specific features, e.g. (hypothetical) GOARM64=v8.2,sve. The setting applies to the whole build; linking objects built with different settings is not supported.

What's the plan for when someone uses an intrinsic that requires cpu feature checks on hardware that doesn't support it? IIRC you are not guaranteed to get a SIGILL in every case, and even then it's not clear to me if that should be a panic or a throw... if it's not a throw it opens you up to some rather, uh, nasty compatibility situations.

It will get a SIGILL, which is a runtime fatal error (throw). In general we usually don't document that an API must panic or throw, and code should not depend on it. This also applies to passing arguments: if one calls a function that takes a Uint64x4 on a machine without 256-bit vector support, it is expected to get a SIGILL.

In general, in this proposal we're aiming for neither perfection nor completeness. I think this proposal does not preclude the possibility of adding new APIs and functionalities. We are open to a number of additions mentioned above, and we hope that experience with the GOEXPERIMENT will provide valuable information to help guide the additions and revisions of the API.

Thanks.

Comment From: mcy

Thanks for answering all of my questions Cherry. Sounds like you guys are on the right track. :)

Once I hear this has shipped in tip I'll port one of my libraries and report how badly it breaks.

Comment From: dr2chase

I wrote a trivial example of using "Go SIMD" as if we had implemented it for 5 different architectures, and organized the simd operations three different ways -- one in a single directory (the current proposal here), another with a subdirectory per architecture, and a third with a subdirectory per "largest vector size". All of these also add an "s" vector size for "scalable".

It's my personal GitHub repo, but SIMD comments ought to come here; actual mistakes in the examples can be filed as bugs against my repo. https://github.com/dr2chase/go_simd_examples

Comment From: AndrewHarrisSPU

One statement from @dr2chase's repo, "A note on future plans", caught my eye. I think it's probably related to why I can't really figure out a way to rank the organization of files.

However one feature that we think Highway should have, is the ability to drop down to this SIMD API and be specific about sizes and operations.

(C++) Highway offers an extensible dispatch framework on top of the universal-ish SIMD API. C++ esoterica is ~~suffered through~~ leveraged in pretty significant ways to this end. Stepping back and looking at the roadmap to Go-Midway and Go-Highway, I wonder if this level of extensibility is possible in vanilla Go. Or, if possible, whether it's likely to be pleasant and not-so-tricky.

Flipping things around a bit, I wonder about go generate configurably constructing pimd ISAs. By pimd I mean portable instruction, multiple data. By ISA I probably just mean API, but one that resembles an ISA. Such an ISA would include the more-or-less equivalent instructions existing across architectures, as well as synthetic instructions that may combine a few instructions on specific platforms (or a lot, for emulation). I really think the question of extensible dispatch overarches all of it, though - it is a distinct value proposition from a universal SIMD API, but a significant one that captures a significant amount of what (C++) Highway is doing.

One configuration of pimd could be Go's Highway, published in the standard library, supporting a standardized mapping across all SIMD architectures, for the lazy dev. But maybe the power user would be able to go generate a pimd API that was narrowly tailored, extended, etc., and I think this tends towards platform-specific names, top-level functions at this point.

But, starting with platform-specific intrinsics seems like the right first step in any case.

Comment From: janpfeifer

Flipping things around a bit, I wonder about go generate configurably constructing pimd ISAs.

In line with what you suggest, this was a sketch/"straw man", created during the discussion of the previous SIMD proposal #67520, of what a go-highway with go generate could look like.

