Consider the following program:

package x

import "unsafe"

//go:nosplit
func x(v *int, b bool) unsafe.Pointer {
    if b {
        return unsafe.Pointer(v)
    }
    return unsafe.Pointer(&v)
}

Naively, one expects this to produce assembly like this:

  TEXT .x(SB)
  TBZ $0, R1, 1f
  MOVD R0, 8(RSP)
  MOVD type:*int, R0
  CALL runtime.newobject
  MOVD 8(RSP), R1
  MOVD R1, (R0)
1:
  RET

However, Go does something very naughty: it moves the call to the allocator before the branch.

        TEXT    x.x(SB)
        MOVD.W  R30, -48(RSP)
        MOVD    R29, -8(RSP)
        SUB     $8, RSP, R29
        MOVB    R1, 8(FP)
        MOVD    R0, (FP)
        MOVD    $type:*int(SB), R0
        CALL    runtime.newobject(SB)
        MOVWU   runtime.writeBarrier(SB), R1
        CBZW    R1, 1f
        MOVD    R0, -8(SP)
        MOVD    R0, R1
        MOVD    $v(FP), R2
        MOVD    $type:*int(SB), R0
        CALL    runtime.wbMove(SB)
        MOVD    -8(SP), R0
1:
        MOVD    (FP), R1
        MOVD    R1, (R0)
        MOVBU   8(FP), R1
        TBZ     $0, R1, 2f
        MOVD    (R0), R0
        MOVD    -8(RSP), R29
        MOVD.P  48(RSP), R30
        RET     (R30)
2:
        MOVD    -8(RSP), R29
        MOVD.P  48(RSP), R30
        RET     (R30)

This seems unfortunate, but also very fixable: observe that taking the stack slot's address is the last time v is referenced in the function. In an SSA compiler, you would simply check that the point where v's address is taken does not dominate any uses of v. I'm not sure whether that's easy for Go to do, though, since I'm not sure how it represents stack slots.

Alternatively, you could look for returns that are not dominated by taking the address; in that case, you have a path through the function that can avoid the allocation. Then, sink the allocation to blocks which dominate taking the address, but which do not dominate those allocation-free returns.

This can be worked around with something like `v2 := v; &v2` or `p := new(*int); *p = v`. The fact that a small perturbation removes an unnecessary allocation suggests there are other opportunities elsewhere to make early returns cheaper.

Comment From: mcy

Oh, and perhaps it's worth noting that the write barrier seems unnecessary? v is already directly reachable from a stack root, so as long as the live region encompasses the store after newobject, it should be fine...? I don't know enough about the current GC's concurrency semantics to tell; I just know that write barriers are sometimes avoidable for stack roots.

Comment From: randall77

Unfortunately, v escapes or it doesn't. Escape analysis is not control-flow sensitive; it doesn't currently have the precision to solve this issue.

I'm curious as to why this code uses wbMove. We should be using a cheaper write barrier option.

Comment From: adonovan

> Unfortunately, v escapes or it doesn't. Escape analysis is not control-flow sensitive; it doesn't currently have the precision to solve this issue.

That's a true statement about the current escape analysis, but @mcy is right that in principle the compiler could transform the program to improve its allocation efficiency when the identity of the variable does not matter. For example, a global dataflow analysis could observe that (a) the escapeness of v differs in the two arms of the if-statement and (b) only the value of the variable, not its identity, matters in the true branch; therefore it is safe to decouple the never-address-taken parameter value v0 from the synthesized local variable v that actually escapes in one of the branches.

func x(v0 *int, b bool) unsafe.Pointer {
    if b {
        v := v0 // does not escape
        return unsafe.Pointer(v)
    } else {
        v := v0  // escapes
        return unsafe.Pointer(&v)
    }
}

So I think this is a request for a more flow-sensitive escape analysis.

Comment From: mcy

I don't know where in the compiler escape analysis actually happens, but the transformation I describe in terms of SSA dominators does not require any global dataflow analysis beyond computing dominance frontiers, which you kind of have to do to do anything useful in SSA (like finding natural loops).

The second analysis I suggest in my original post is precisely the algorithm you would use to determine where to sink the variable to, as @adonovan's transform shows.

@randall77 that's what I feared. I don't think escape analysis can be made flow-sensitive in general without greatly increasing its asymptotic complexity (closer to what e.g. LLVM's aggressive-aa does). If we wanted to make this possible, I think the best bet would be to split variables as described above, by sinking them to their first use before doing escape analysis. That avoids changing escape analysis itself while getting the benefit of what appears to be flow-sensitive escape analysis.