Consider the following Montgomery Reduction implementation:

func MontReduce(v uint64) uint32 {
    const (
        R    = 1 << 32
        qInv = 2164260865
    )

    m := uint32(v) * qInv
    t, borrow := bits.Sub64(v, uint64(m)*q, 0)

    res := uint32(t / R)
    if borrow != 0 {
        res += q
    }
    return res
}
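
For context (these constants are not in the snippet itself; I'm reading them off the assembly below, so treat them as my assumption): q = 32512<<16 + 1 = 2130706433 (that's 127*2^24 + 1) and qInv = q^-1 mod 2^32, so MontReduce should return v * R^-1 mod q for any v < q*R. A self-contained sketch of that check, repeating the function with the assumed constants filled in:

package main

import (
    "fmt"
    "math/big"
    "math/bits"
)

// Assumed from the assembly below: q = 32512<<16 + 1, qInv = q^-1 mod 2^32.
const (
    q    uint32 = 2130706433
    qInv uint32 = 2164260865
)

func MontReduce(v uint64) uint32 {
    const R = 1 << 32
    m := uint32(v) * qInv
    t, borrow := bits.Sub64(v, uint64(m)*uint64(q), 0)
    res := uint32(t / R)
    if borrow != 0 {
        res += q
    }
    return res
}

func main() {
    // MontReduce(v) should equal v * R^-1 mod q.
    rInv := new(big.Int).ModInverse(big.NewInt(1<<32), big.NewInt(int64(q)))
    for _, v := range []uint64{0, 1, 12345, uint64(q-1) * (1 << 32)} {
        want := new(big.Int).Mul(new(big.Int).SetUint64(v), rInv)
        want.Mod(want, big.NewInt(int64(q)))
        fmt.Println(MontReduce(v) == uint32(want.Uint64()))
    }
}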

Using objdump I get the following assembly (the comments are my own):

    MOVD $-65535, R1            // low part of qInv
    MOVK $(33024<<16), R1       // high part of qInv. Now R1 = qInv
    MULW R1, R0, R1             // R0 = v, R1 = m = v * qInv
    MOVD $1, R2
    MOVK $(32512<<16), R2       // R2 = q
    MUL R2, R1, R1              // R1 = m*q
    SUBS R1, R0, R1             // R1 = t = v - m*q
    NGC ZR, R3                  // R3 = -borrow
    NEG R3, R3                  // R3 = borrow
    LSR $32, R1, R1             // R1 = res = t / R
    ADD R2, R1, R2              // R2 = res + q
    CMP $0, R3
    CSEL NE, R2, R1, R0         // R0 = (borrow != 0) ? R2 : R1
    RET

However, the following simpler code also works:

TEXT ·MontReduce(SB), NOSPLIT, $16-8

    MOVD v+0(FP), R0            // R0 = v

    MOVD $1, R2                 // low part of qInv
    MOVK $(33024<<16), R2       // high part of qInv. Now R2 = qInv
    MULW R2, R0, R1             // R0 = v, R1 = m = v * qInv
    MOVK $(32512<<16), R2       // R2 = q
    MUL R2, R1, R1              // R1 = m*q
    SUBS R1, R0, R1             // R1 = t = v - m*q
    LSR $32, R1, R1             // R1 = res = t / R
    ADD R2, R1, R2              // R2 = res + q

    CSEL CS, R1, R2, R0         // carry set (no borrow): R0 = res, else R0 = res + q

    MOVD R0, ret+8(FP)
    RET

In the generated code, the compiler spends an extra instruction (the NEG) making sure a nonzero borrow value is 1 rather than -1, even though the condition only checks whether it is zero and is therefore unaffected by the NEG. Furthermore, since none of the instructions after the SUBS modify the carry flag, the borrow doesn't have to live in a register at all: CSEL can consume the flag directly, which also removes the CMP (saving the equivalent of an addition).
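
In Go terms, the selection the hand-written CSEL performs is roughly the following branchless form (a sketch only, reusing the q/qInv constants and math/bits import from the sketch above; bits.Sub64 already hands back the borrow as 0 or 1, which mirrors how CSEL consumes the carry flag left by SUBS without materializing it in a register):

// Same result as MontReduce, with the conditional add written as
// res + borrow*q instead of an if, mirroring what the CSEL selects.
func montReduceBranchless(v uint64) uint32 {
    m := uint32(v) * qInv
    t, borrow := bits.Sub64(v, uint64(m)*uint64(q), 0)
    return uint32(t>>32) + uint32(borrow)*q
}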

I don't know if something like this is easy enough to patch to be worth the modest perf gains. Just thought it was interesting. :)

Comment From: Jorropo

There are many ways we could solve this.

If you wanted to solve all of these kinds of cases, the "issue" is that there is no granularity in how we handle flags. If you GOSSAFUNC that function and take a look at the flagalloc and regalloc passes, you can see that there is an opaque flags type containing them.

This limits how we can express these optimizations, because you can't just match all instructions that set the zero flag according to their result and CSE together different flag-generating instructions that produce the same flags.

With our current facilities you need to manually express this exact CSEL pattern matching this exact compare matching a SUBS, which leads to many slightly different rules if you wanted to cover other cases.

A more complete solution would have a database mapping how each flag is used and set by each instruction. Then it could see that .NE only uses XYZ flags, that these are set identically by CMP($0, X) and by X := SUBS(Y, Z), and CSE them together. We could also regalloc flags (see adc and adox on amd64).
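
As an illustration only (nothing like this exists in the compiler today, and the op/condition names below are made up, not real SSA ops), such a database could start as a per-op table of which NZCV bits each op defines and which bits each condition reads; the real thing would also need to record how each bit is derived from the result:

type nzcv uint8

const (
    flagN nzcv = 1 << iota // negative
    flagZ                  // zero
    flagC                  // carry / not-borrow
    flagV                  // signed overflow
)

// Which flag bits an op defines (hypothetical op names).
var flagsSetBy = map[string]nzcv{
    "CMPconst0": flagN | flagZ | flagC | flagV, // CMP $0, x
    "SUBSflags": flagN | flagZ | flagC | flagV, // x := SUBS(y, z)
}

// Which flag bits a condition actually reads (hypothetical names).
var flagsUsedBy = map[string]nzcv{
    "NE": flagZ,
    "CS": flagC,
}

// A pass could then see that .NE only reads Z, that CMP $0, x and
// x := SUBS(y, z) define Z (and N) the same way (both test the value x),
// and CSE the two flag producers.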

I don't know how often this comes up in real code though. Writing shorter, clever flags code can result in slower code, since CPUs enjoy fusing cmps, tests, ... with cmovs and conditional jumps. Here, given that one of the arguments to the CSEL is the SUBS result, there already is a strong dependency, so it should make things faster (or do nothing).
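
For the "does it show up in practice" part, a rough microbenchmark sketch (this would live in a _test.go file next to MontReduce; sink is a package-level variable I'm adding so the result can't be dead-code eliminated):

import "testing"

var sink uint32

func BenchmarkMontReduce(b *testing.B) {
    var acc uint32
    for i := 0; i < b.N; i++ {
        acc ^= MontReduce(uint64(i) * 0x7f000001) // vary the input each iteration
    }
    sink = acc
}

Running it before and after a compiler change (or against the hand-written assembly wired in as a second function) would show whether the saved instructions are actually visible on a given core.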