The `clear` and `append` built-ins can result in the need to zero an arbitrary amount of memory. For byte slices, the compiler appears to use a call to `runtime.memclrNoHeapPointers`. That function cannot be preempted, which can lead to arbitrary delays when another goroutine wants to stop the world (such as to start or end a GC cycle).
Applications that use `bytes.Buffer` can experience this when a call to `bytes.(*Buffer).Write` leads to a call to `bytes.growSlice`, which uses `append`, as seen in one of the execution traces from #68399.
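For example, a minimal program exercising that path (the 64 MiB write size here is arbitrary, just large enough to make the growth expensive) looks like:

```go
package main

import "bytes"

func main() {
	var buf bytes.Buffer
	// The 64 MiB size is arbitrary; it only needs to be large enough that
	// growing the buffer allocates (and zeroes) a large backing array via
	// bytes.growSlice, which is the non-preemptible clear described above.
	payload := make([]byte, 64<<20)
	buf.Write(payload)
}
```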
The runtime and compiler should collaborate to allow opportunities for preemption when zeroing large amounts of memory.
CC @golang/runtime @mknyszek
Reproducer, using the `clear` built-in plus `runtime.ReadMemStats` to provide STWs:
```go
package memclr

import (
	"context"
	"fmt"
	"math"
	"runtime"
	"runtime/metrics"
	"sync"
	"testing"
	"time"
)

func BenchmarkMemclr(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseMemclr(size))
	}
}

func testcaseMemclr(l int) func(b *testing.B) {
	return func(b *testing.B) {
		b.SetBytes(int64(l))
		v := make([]byte, l)
		for range b.N {
			clear(v)
		}
	}
}

func BenchmarkSTW(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseSTW(size))
	}
}

func testcaseSTW(size int) func(*testing.B) {
	const name = "/sched/pauses/stopping/other:seconds"
	return func(b *testing.B) {
		ctx, cancel := context.WithCancel(context.Background())
		clears := 0
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := make([]byte, size)
			for ctx.Err() == nil {
				clear(v)
				clears++
			}
		}()

		before := readMetric(name)
		var memstats runtime.MemStats
		for range b.N {
			runtime.ReadMemStats(&memstats)
			time.Sleep(10 * time.Microsecond) // allow others to make progress
		}
		after := readMetric(name)

		cancel()
		wg.Wait()

		ns := float64(time.Second.Nanoseconds())
		diff := delta(before.Float64Histogram(), after.Float64Histogram())
		b.ReportMetric(worst(diff)*ns, "worst-ns")
		b.ReportMetric(avg(diff)*ns, "avg-ns")
		b.ReportMetric(float64(clears), "clears")
	}
}

func readMetric(name string) metrics.Value {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	return samples[0].Value
}

func delta(a, b *metrics.Float64Histogram) *metrics.Float64Histogram {
	v := &metrics.Float64Histogram{
		Buckets: a.Buckets,
		Counts:  append([]uint64(nil), b.Counts...),
	}
	for i := range a.Counts {
		v.Counts[i] -= a.Counts[i]
	}
	return v
}

func worst(h *metrics.Float64Histogram) float64 {
	var v float64
	for i, n := range h.Counts {
		if n > 0 {
			v = h.Buckets[i]
		}
	}
	return v
}

func avg(h *metrics.Float64Histogram) float64 {
	var v float64
	var nn uint64
	for i, n := range h.Counts {
		if bv := h.Buckets[i]; !math.IsInf(bv, 0) && !math.IsNaN(bv) {
			v += float64(n) * h.Buckets[i]
			nn += n
		}
	}
	return v / float64(nn)
}
```
Reproducer results, showing average time to stop the world is more than 1 ms (instead of less than 10 µs) when another part of the app is clearing a 100 MB byte slice
```
$ GOGC=off go test -cpu=2 -bench=. ./memclr
goos: darwin
goarch: arm64
pkg: issues/memclr
cpu: Apple M1
BenchmarkMemclr/bytes=10^4-2   9421521       122.2 ns/op   81857.90 MB/s
BenchmarkMemclr/bytes=10^5-2   1000000        1433 ns/op   69779.61 MB/s
BenchmarkMemclr/bytes=10^6-2     99464       15148 ns/op   66016.50 MB/s
BenchmarkMemclr/bytes=10^7-2      8704      153918 ns/op   64969.54 MB/s
BenchmarkMemclr/bytes=10^8-2       758     1632702 ns/op   61248.17 MB/s
BenchmarkMemclr/bytes=10^9-2        67    16443990 ns/op   60812.49 MB/s
BenchmarkSTW/bytes=10^4-2        29718       40598 ns/op      5473 avg-ns   2912452 clears      98304 worst-ns
BenchmarkSTW/bytes=10^5-2        29895       38866 ns/op      4920 avg-ns    560027 clears      81920 worst-ns
BenchmarkSTW/bytes=10^6-2        26226       44481 ns/op      8116 avg-ns     70132 clears      16384 worst-ns
BenchmarkSTW/bytes=10^7-2         8925      164844 ns/op    120482 avg-ns      8919 clears     655360 worst-ns
BenchmarkSTW/bytes=10^8-2         2184     1571734 ns/op   1376487 avg-ns      2102 clears    4194304 worst-ns
BenchmarkSTW/bytes=10^9-2         1209     7075640 ns/op   6506152 avg-ns     529.0 clears   16777216 worst-ns
PASS
ok      issues/memclr   29.098s
```
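Reading the two benchmarks together (back-of-the-envelope, from the numbers above): at roughly 61 GB/s of clear throughput, a single `clear` of 10^8 bytes takes about 1.6 ms and a clear of 10^9 bytes about 16 ms. A stop-the-world request that arrives while such a clear is in progress has to wait for it to finish, which is why the avg-ns and worst-ns pauses in BenchmarkSTW grow with the slice size instead of staying below roughly 10 µs.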
`bytes.growSlice` calling `runtime.memclrNoHeapPointers`
```
$ go version
go version go1.23.0 darwin/arm64

$ go tool objdump -s 'bytes.growSlice$' `which go` | grep CALL
  buffer.go:249  0x10012a3cc  97fd23cd  CALL runtime.growslice(SB)
  buffer.go:249  0x10012a3f0  97fd3db8  CALL runtime.memclrNoHeapPointers(SB)
  buffer.go:250  0x10012a43c  97fd3e0d  CALL runtime.memmove(SB)
  buffer.go:251  0x10012a464  94000ef3  CALL bytes.growSlice.func1(SB)
  buffer.go:251  0x10012a484  97fd3ca3  CALL runtime.panicSliceAcap(SB)
  buffer.go:249  0x10012a488  97fc9d82  CALL runtime.panicmakeslicelen(SB)
  buffer.go:249  0x10012a490  97fc2cb8  CALL runtime.deferreturn(SB)
  buffer.go:229  0x10012a4c0  97fd332c  CALL runtime.morestack_noctxt.abi0(SB)
```
Comment From: gabyhelp
Related Issues and Documentation
- runtime: mark assist blocks GC microbenchmark for 7ms #27732
- runtime: latency in sweep assists when there's no garbage #18155
- runtime: long GC STW after large allocation (100+ msec waiting for runtime.memclrNoHeapPointers) #35825 (closed)
- runtime: pointer assignment in tight loop cause long STW #59902
- runtime: reducing preemption in suspendG when G is running large nosplit functions #40229
- GC errors when time benchmarks are run with GOMAXPROCS > 1 #1504 (closed)
- runtime: frequent ReadMemStats will cause heap_live to exceed next_gc #50592
- runtime: repeated syscalls inhibit periodic preemption #28701
- runtime: contention in runtime.newMarkBits on gcBitsArenas.lock #61428
- runtime: multi-ms sweep termination pauses (second edition) #42642 (closed)
Comment From: prattmic
`mallocgc` uses `memclrNoHeapPointersChunked` to address this. It seems like the builtins could do the same?
cc @golang/runtime @mknyszek @dr2chase
Comment From: dr2chase
@prattmic that seems plausible to me.
Comment From: mknyszek
The simplest way to fix this would be to have `memclrNoHeapPointers` just always do the `Chunked` part, right? That doesn't even need a compiler change: rename `memclrNoHeapPointers` to `memclrNoHeapPointers0` and `memclrNoHeapPointersChunked` to `memclrNoHeapPointers`.

I think the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases, and we can still call `memclrNoHeapPointers0` there.
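To make that concrete, here is a rough standalone sketch of the proposed shape. It is not the actual runtime code: the 256 KiB chunk size and the stand-in `memclrNoHeapPointers0` below are assumptions for illustration only.

```go
package main

import "unsafe"

// Assumed chunk size; the real value lives in the runtime.
const chunkBytes = 256 << 10

// memclrNoHeapPointers0 stands in for the raw, non-preemptible assembly
// clear that would keep the old behavior under the new name.
func memclrNoHeapPointers0(ptr unsafe.Pointer, n uintptr) {
	clear(unsafe.Slice((*byte)(ptr), n))
}

// memclrNoHeapPointers becomes the chunking wrapper: each chunk is cleared
// without preemption, but preemption can occur between chunks, bounding the
// stop-the-world latency contributed by a single large clear.
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr) {
	for n > chunkBytes {
		memclrNoHeapPointers0(ptr, chunkBytes)
		ptr = unsafe.Add(ptr, chunkBytes)
		n -= chunkBytes
	}
	memclrNoHeapPointers0(ptr, n)
}

func main() {
	buf := make([]byte, 1<<20)
	memclrNoHeapPointers(unsafe.Pointer(&buf[0]), uintptr(len(buf)))
}
```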
Comment From: rhysh
From `go doc -u runtime.memclrNoHeapPointers`:
```
memclrNoHeapPointers ensures that if ptr is pointer-aligned, and n is a
multiple of the pointer size, then any pointer-aligned, pointer-sized
portion is cleared atomically. Despite the function name, this is necessary
because this function is the underlying implementation of typedmemclr and
memclrHasPointers. See the doc of memmove for more details.

The (CPU-specific) implementations of this function are in memclr_*.s.

memclrNoHeapPointers should be an internal detail, but widely used packages
access it using linkname. Notable members of the hall of shame include:
  - github.com/bytedance/sonic
  - github.com/chenzhuoyu/iasm
  - github.com/cloudwego/frugal
  - github.com/dgraph-io/ristretto
  - github.com/outcaste-io/ristretto

Do not remove or change the type signature. See go.dev/issue/67401.
```
First, @mknyszek, you said:

> the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases

Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that `memclrNoHeapPointers` is marked `NOSPLIT`; that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit `acquirem`, or an explicit `getg().m.preemptoff`, or other explicit mechanism of disabling preemption?

Second, given that the current function definition includes a promise of atomicity: is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)

Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
Comment From: randall77
I think the distinguishing use case for `memclrNoHeapPointers` is clearing junk from something that is about to contain pointers. We want to ensure that the GC can't see a partially-cleared buffer and treat the uncleared portion as valid pointers.

I don't think any user code linknaming into `memclrNoHeapPointers` would run into this problem. Only the runtime can see Go-heap-allocated, not-yet-initialized memory.
I think there are a few ways to solve this special issue. One is to ensure that an object mid-initialization is never scanned by the GC. Another is to ensure that an object mid-initialization has no ptr bits set yet. A third is to ensure that it is scanned conservatively. With any of those, I think interrupting the memclr itself should be fine.
Comment From: randall77
> Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
Correct.
Comment From: mknyszek
> First, @mknyszek, you said:
>
> > the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases
>
> Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that `memclrNoHeapPointers` is marked `NOSPLIT`; that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit `acquirem`, or an explicit `getg().m.preemptoff`, or other explicit mechanism of disabling preemption?
Yeah, that's right. I think a lot of places this is used don't actually care about true atomicity, but we need to be very careful.

I was looking a bit more deeply into this and this is going to be harder than I thought. Take for example `typedmemclr`, which assumes it cannot be interrupted. The concern is a situation like the following:

(1) `typedmemclr` begins outside of a GC cycle. We skip the check for GC being enabled and running `bulkBarrierPreWrite`.
(2) `typedmemclr` is preempted after `bulkBarrierPreWrite`.
(3) The GC mark phase begins.
(4) We delete a bunch of pointers without barriers during the mark phase.

Now whether or not this is a real problem is subtle. I think (but am not 100% certain) that you could make the argument that this isn't actually a problem in this case. The purpose of the deletion barrier is to catch pointers which are hidden in already-scanned stacks. If this is just a regular clear, then there's no chance of that. But if we read pointers from what we're about to clear and stored them on the stack before the mark phase, we can be certain they will be visible during the mark phase, if they're still relevant.
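As a rough standalone sketch of that ordering (all names below are stand-ins, not the runtime source), the hypothetical preemption point sits between the barrier decision and the clear:

```go
package main

import "unsafe"

// Stand-ins for runtime state and helpers; only the ordering matters here.
var writeBarrierEnabled bool

func bulkBarrierPreWrite(ptr unsafe.Pointer, n uintptr) {
	// Stand-in: shade the pointers that are about to be cleared.
}

func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr) {
	clear(unsafe.Slice((*byte)(ptr), n))
}

// typedmemclrSketch mirrors the shape discussed above: the pre-write
// barrier (or the decision to skip it) happens first, the clear second.
func typedmemclrSketch(ptr unsafe.Pointer, n uintptr) {
	if writeBarrierEnabled {
		bulkBarrierPreWrite(ptr, n) // skipped in step (1), outside a GC cycle
	}
	// If preemption were allowed HERE and a mark phase began (steps (2)
	// and (3)), the clear below would delete pointers without barriers
	// during marking (step (4)).
	memclrNoHeapPointers(ptr, n)
}

func main() {
	buf := make([]byte, 4096)
	typedmemclrSketch(unsafe.Pointer(&buf[0]), uintptr(len(buf)))
}
```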
Unfortunately there may be more bad or possibly-bad-but-actually-fine scenarios, but reasoning about all the situations will be subtle.
I don't know why these functions do not explicitly disable preemption, but I suspect it's just performance. However, I think explicitly disabling preemption would probably not meaningfully impact performance either.
> Second, given that the current function definition includes a promise of atomicity: is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)
>
> Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
I agree with @randall77 that it's unlikely linknamed Go code will run into this.
... With maybe one exception. IIRC `sonic` JITs code, and it may be making calls to `mallocgc` with `needzero==false`. Though if they're doing that with anything containing pointers, that might already be broken. There are preemption points in `mallocgc`.
Comment From: rhysh
With CL 31763 (merged as 8f81dfe8b47e975b90bb4a2f8dd314d32c633176), the runtime executes the write barrier in advance of its pointer writes rather than after. See for example `typedmemmove`.

The comment on `bulkBarrierPreWrite` in mbitmap.go is an outdated hybrid, with portions preceding that 2016 switch:
```go
// Callers should call bulkBarrierPreWrite immediately before
// calling memmove(dst, src, size). This function is marked nosplit
// to avoid being preempted; the GC must not stop the goroutine
// between the memmove and the execution of the barriers.
// The caller is also responsible for cgo pointer checks if this
// may be writing Go pointers into non-Go memory.
```
Yes, the call to `bulkBarrierPreWrite` should precede the `memmove`. However, no instant exists "between the memmove and the execution of the barriers", since the barriers come first.
In late 2022, an accidental preemption point in `typedmemclr` led to #55156 (and others). While fixing it, CL 431919 (merged as f1b7b2fc52947711b8e78f7078c9e0bda35320d3) suggested that `typedmemclr` would benefit from a `go:nosplitrec` annotation.

But the accidental preemption point was within the implementation of the write barrier itself. It looks like #55156 is evidence that the write barrier must not be preempted, without providing any data on whether a preemption point between the write barrier and the copy would be allowed.
Out of the many uses of `memclrNoHeapPointers`, @randall77 called out the case of "clearing junk from something that is about to contain pointers". That would come up when the type of the memory is changing, which would be limited to within the memory allocator.

We already use `memclrNoHeapPointersChunked` in `mallocgcLarge`, with @randall77's second option, "to ensure that an object mid-initialization has no ptr bits set yet." After obtaining the memory, it sets `span.largeType = nil`. Near the end, it does the chunked/preemptible memory clear and then immediately sets the type info via `heapSetTypeLarge`.
That works since the new (reused) allocation is only reachable from the goroutine's stack. The GC can't find it without scanning that goroutine's stack, and it can't do that until the goroutine is once again preemptible.
If an allocation is already reachable from other roots, the GC could find it at any moment. It's already nearly impossible to change its type from one scannable ("non-noscan") type to another; we'd need to claim that it doesn't have any pointers, then zero it, then apply the new type / pointer pattern. And on top of that, we'd need to ensure that no GC worker that started during the first phase is still running during the third. That sounds very much like the sweeper and the allocator; it's an arduous process, so I don't think we'd have another version of those code paths hiding somewhere else.
For the cases where the type does not change, it seems that a preemption point between the write barrier and the zeroing is only a small increase in how observable the state is. For allocations that are reachable from other roots, the GC (and user code, via data races) can already observe the partially-zeroed state. The change is to allow the caller to switch threads (may need a publication barrier?), or to allow the GC to observe the allocation -- and for anything but new allocations, we have to assume the GC could already do that via other roots.
It allows a second (typed) copy of the memory to exist for a bit longer than before, and for cases where we don't change the allocation's type this seems like a non-issue. The execution of the write barrier -- or the observation that it's inactive -- is the commit point.
The only way that avoiding preemption grants any atomicity is for pointers that are not yet reachable from other roots. That's the memory allocator, which already decides whether to use the chunked version and correctly defends against its caveats.
The current chunk size for `memclrNoHeapPointersChunked` is 256 KiB. We should take care to not quietly corrupt the code for allocating small objects, at or below 32 KiB.
Does that sound like the start of a proof, @mknyszek?