The `clear` and `append` built-ins can result in the need to zero an arbitrary amount of memory. For byte slices, the compiler appears to use a call to `runtime.memclrNoHeapPointers`. That function cannot be preempted, which can lead to arbitrary delays when another goroutine wants to stop the world (such as to start or end a GC cycle).
Applications that use `bytes.Buffer` can experience this when a call to `bytes.(*Buffer).Write` leads to a call to `bytes.growSlice`, which uses `append`, as seen in one of the execution traces from #68399.
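For example, a minimal program exercising that path (the 64 MiB write size here is arbitrary, just large enough to make the growth expensive) looks like:

```go
package main

import "bytes"

func main() {
	var buf bytes.Buffer
	// The 64 MiB size is arbitrary; it only needs to be large enough that
	// growing the buffer allocates (and zeroes) a large backing array via
	// bytes.growSlice, which is the non-preemptible clear described above.
	payload := make([]byte, 64<<20)
	buf.Write(payload)
}
```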
The runtime and compiler should collaborate to allow opportunities for preemption when zeroing large amounts of memory.
CC @golang/runtime @mknyszek
Reproducer, using the `clear` built-in plus `runtime.ReadMemStats` to provide STWs:
```go
package memclr

import (
	"context"
	"fmt"
	"math"
	"runtime"
	"runtime/metrics"
	"sync"
	"testing"
	"time"
)

func BenchmarkMemclr(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseMemclr(size))
	}
}

func testcaseMemclr(l int) func(b *testing.B) {
	return func(b *testing.B) {
		b.SetBytes(int64(l))
		v := make([]byte, l)
		for range b.N {
			clear(v)
		}
	}
}

func BenchmarkSTW(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseSTW(size))
	}
}

func testcaseSTW(size int) func(*testing.B) {
	const name = "/sched/pauses/stopping/other:seconds"
	return func(b *testing.B) {
		ctx, cancel := context.WithCancel(context.Background())
		clears := 0
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := make([]byte, size)
			for ctx.Err() == nil {
				clear(v)
				clears++
			}
		}()

		before := readMetric(name)
		var memstats runtime.MemStats
		for range b.N {
			runtime.ReadMemStats(&memstats)
			time.Sleep(10 * time.Microsecond) // allow others to make progress
		}
		after := readMetric(name)

		cancel()
		wg.Wait()

		ns := float64(time.Second.Nanoseconds())
		diff := delta(before.Float64Histogram(), after.Float64Histogram())
		b.ReportMetric(worst(diff)*ns, "worst-ns")
		b.ReportMetric(avg(diff)*ns, "avg-ns")
		b.ReportMetric(float64(clears), "clears")
	}
}

func readMetric(name string) metrics.Value {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	return samples[0].Value
}

func delta(a, b *metrics.Float64Histogram) *metrics.Float64Histogram {
	v := &metrics.Float64Histogram{
		Buckets: a.Buckets,
		Counts:  append([]uint64(nil), b.Counts...),
	}
	for i := range a.Counts {
		v.Counts[i] -= a.Counts[i]
	}
	return v
}

func worst(h *metrics.Float64Histogram) float64 {
	var v float64
	for i, n := range h.Counts {
		if n > 0 {
			v = h.Buckets[i]
		}
	}
	return v
}

func avg(h *metrics.Float64Histogram) float64 {
	var v float64
	var nn uint64
	for i, n := range h.Counts {
		if bv := h.Buckets[i]; !math.IsInf(bv, 0) && !math.IsNaN(bv) {
			v += float64(n) * h.Buckets[i]
			nn += n
		}
	}
	return v / float64(nn)
}
```
Reproducer results, showing average time to stop the world is more than 1 ms (instead of less than 10 µs) when another part of the app is clearing a 100 MB byte slice
```
$ GOGC=off go test -cpu=2 -bench=. ./memclr
goos: darwin
goarch: arm64
pkg: issues/memclr
cpu: Apple M1
BenchmarkMemclr/bytes=10^4-2   9421521       122.2 ns/op   81857.90 MB/s
BenchmarkMemclr/bytes=10^5-2   1000000        1433 ns/op   69779.61 MB/s
BenchmarkMemclr/bytes=10^6-2     99464       15148 ns/op   66016.50 MB/s
BenchmarkMemclr/bytes=10^7-2      8704      153918 ns/op   64969.54 MB/s
BenchmarkMemclr/bytes=10^8-2       758     1632702 ns/op   61248.17 MB/s
BenchmarkMemclr/bytes=10^9-2        67    16443990 ns/op   60812.49 MB/s
BenchmarkSTW/bytes=10^4-2        29718       40598 ns/op      5473 avg-ns   2912452 clears      98304 worst-ns
BenchmarkSTW/bytes=10^5-2        29895       38866 ns/op      4920 avg-ns    560027 clears      81920 worst-ns
BenchmarkSTW/bytes=10^6-2        26226       44481 ns/op      8116 avg-ns     70132 clears      16384 worst-ns
BenchmarkSTW/bytes=10^7-2         8925      164844 ns/op    120482 avg-ns      8919 clears     655360 worst-ns
BenchmarkSTW/bytes=10^8-2         2184     1571734 ns/op   1376487 avg-ns      2102 clears    4194304 worst-ns
BenchmarkSTW/bytes=10^9-2         1209     7075640 ns/op   6506152 avg-ns     529.0 clears   16777216 worst-ns
PASS
ok      issues/memclr   29.098s
```
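Reading the two benchmarks together (back-of-the-envelope, from the numbers above): at roughly 61 GB/s of clear throughput, a single `clear` of 10^8 bytes takes about 1.6 ms and a clear of 10^9 bytes about 16 ms. A stop-the-world request that arrives while such a clear is in progress has to wait for it to finish, which is why the avg-ns and worst-ns pauses in BenchmarkSTW grow with the slice size instead of staying below roughly 10 µs.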
`bytes.growSlice` calling `runtime.memclrNoHeapPointers`
```
$ go version
go version go1.23.0 darwin/arm64

$ go tool objdump -s 'bytes.growSlice$' `which go` | grep CALL
  buffer.go:249  0x10012a3cc  97fd23cd  CALL runtime.growslice(SB)
  buffer.go:249  0x10012a3f0  97fd3db8  CALL runtime.memclrNoHeapPointers(SB)
  buffer.go:250  0x10012a43c  97fd3e0d  CALL runtime.memmove(SB)
  buffer.go:251  0x10012a464  94000ef3  CALL bytes.growSlice.func1(SB)
  buffer.go:251  0x10012a484  97fd3ca3  CALL runtime.panicSliceAcap(SB)
  buffer.go:249  0x10012a488  97fc9d82  CALL runtime.panicmakeslicelen(SB)
  buffer.go:249  0x10012a490  97fc2cb8  CALL runtime.deferreturn(SB)
  buffer.go:229  0x10012a4c0  97fd332c  CALL runtime.morestack_noctxt.abi0(SB)
```
Comment From: gabyhelp
Related Issues and Documentation
- runtime: mark assist blocks GC microbenchmark for 7ms #27732
- runtime: latency in sweep assists when there's no garbage #18155
- runtime: long GC STW after large allocation (100+ msec waiting for runtime.memclrNoHeapPointers) #35825 (closed)
- runtime: pointer assignment in tight loop cause long STW #59902
- runtime: reducing preemption in suspendG when G is running large nosplit functions #40229
- GC errors when time benchmarks are run with GOMAXPROCS > 1 #1504 (closed)
- runtime: frequent ReadMemStats will cause heap_live to exceed next_gc #50592
- runtime: repeated syscalls inhibit periodic preemption #28701
- runtime: contention in runtime.newMarkBits on gcBitsArenas.lock #61428
- runtime: multi-ms sweep termination pauses (second edition) #42642 (closed)
Comment From: prattmic
`mallocgc` uses `memclrNoHeapPointersChunked` to address this. It seems like the builtins could do the same?
cc @golang/runtime @mknyszek @dr2chase
Comment From: dr2chase
@prattmic that seems plausible to me.
Comment From: mknyszek
The simplest way to fix this would be to have `memclrNoHeapPointers` just always do the `Chunked` part, right? That doesn't even need a compiler change: rename `memclrNoHeapPointers` to `memclrNoHeapPointers0` and `memclrNoHeapPointersChunked` to `memclrNoHeapPointers`.

I think the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases, and we can still call `memclrNoHeapPointers0` there.
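To make that concrete, here is a rough standalone sketch of the proposed shape. It is not the actual runtime code: the 256 KiB chunk size and the stand-in `memclrNoHeapPointers0` below are assumptions for illustration only.

```go
package main

import "unsafe"

// Assumed chunk size; the real value lives in the runtime.
const chunkBytes = 256 << 10

// memclrNoHeapPointers0 stands in for the raw, non-preemptible assembly
// clear that would keep the old behavior under the new name.
func memclrNoHeapPointers0(ptr unsafe.Pointer, n uintptr) {
	clear(unsafe.Slice((*byte)(ptr), n))
}

// memclrNoHeapPointers becomes the chunking wrapper: each chunk is cleared
// without preemption, but preemption can occur between chunks, bounding the
// stop-the-world latency contributed by a single large clear.
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr) {
	for n > chunkBytes {
		memclrNoHeapPointers0(ptr, chunkBytes)
		ptr = unsafe.Add(ptr, chunkBytes)
		n -= chunkBytes
	}
	memclrNoHeapPointers0(ptr, n)
}

func main() {
	buf := make([]byte, 1<<20)
	memclrNoHeapPointers(unsafe.Pointer(&buf[0]), uintptr(len(buf)))
}
```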
Comment From: rhysh
From `go doc -u runtime.memclrNoHeapPointers`:
```
memclrNoHeapPointers ensures that if ptr is pointer-aligned, and n is a
multiple of the pointer size, then any pointer-aligned, pointer-sized
portion is cleared atomically. Despite the function name, this is necessary
because this function is the underlying implementation of typedmemclr and
memclrHasPointers. See the doc of memmove for more details.

The (CPU-specific) implementations of this function are in memclr_*.s.

memclrNoHeapPointers should be an internal detail, but widely used packages
access it using linkname. Notable members of the hall of shame include:
  - github.com/bytedance/sonic
  - github.com/chenzhuoyu/iasm
  - github.com/cloudwego/frugal
  - github.com/dgraph-io/ristretto
  - github.com/outcaste-io/ristretto

Do not remove or change the type signature. See go.dev/issue/67401.
```
First, @mknyszek, you said:

> the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases

Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that `memclrNoHeapPointers` is marked `NOSPLIT`; that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit `acquirem`, or an explicit `getg().m.preemptoff`, or other explicit mechanism of disabling preemption?

Second, given that the current function definition includes a promise of atomicity: is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)

Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
Comment From: randall77
I think the distinguishing use case for `memclrNoHeapPointers` is clearing junk from something that is about to contain pointers. We want to ensure that the GC can't see a partially-cleared buffer and treat the uncleared portion as valid pointers.

I don't think any user code linknaming into `memclrNoHeapPointers` would run into this problem. Only the runtime can see Go-heap-allocated, not-yet-initialized memory.
I think there are a few ways to solve this special issue. One is to ensure that an object mid-initialization is never scanned by the GC. Another is to ensure that an object mid-initialization has no ptr bits set yet. A third is to ensure that it is scanned conservatively. With any of those, I think interrupting the memclr itself should be fine.
Comment From: randall77
> Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
Correct.
Comment From: mknyszek
> First, @mknyszek, you said:
>
> > the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit `acquirem`), but there are very few of these cases
>
> Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that `memclrNoHeapPointers` is marked `NOSPLIT`; that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit `acquirem`, or an explicit `getg().m.preemptoff`, or other explicit mechanism of disabling preemption?
Yeah, that's right. I think a lot of places this is used don't actually care about true atomicity, but we need to be very careful.

I was looking a bit more deeply into this and this is going to be harder than I thought. Take for example `typedmemclr`, which assumes it cannot be interrupted. The concern is a situation like the following:

(1) `typedmemclr` begins outside of a GC cycle. We skip the check for GC being enabled and running `bulkBarrierPreWrite`.
(2) `typedmemclr` is preempted after `bulkBarrierPreWrite`.
(3) The GC mark phase begins.
(4) We delete a bunch of pointers without barriers during the mark phase.

Now whether or not this is a real problem is subtle. I think (but am not 100% certain) that you could make the argument that this isn't actually a problem in this case. The purpose of the deletion barrier is to catch pointers which are hidden in already-scanned stacks. If this is just a regular clear, then there's no chance of that. But if we read pointers from what we're about to clear and stored them on the stack before the mark phase, we can be certain they will be visible during the mark phase, if they're still relevant.
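As a rough standalone sketch of that ordering (all names below are stand-ins, not the runtime source), the hypothetical preemption point sits between the barrier decision and the clear:

```go
package main

import "unsafe"

// Stand-ins for runtime state and helpers; only the ordering matters here.
var writeBarrierEnabled bool

func bulkBarrierPreWrite(ptr unsafe.Pointer, n uintptr) {
	// Stand-in: shade the pointers that are about to be cleared.
}

func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr) {
	clear(unsafe.Slice((*byte)(ptr), n))
}

// typedmemclrSketch mirrors the shape discussed above: the pre-write
// barrier (or the decision to skip it) happens first, the clear second.
func typedmemclrSketch(ptr unsafe.Pointer, n uintptr) {
	if writeBarrierEnabled {
		bulkBarrierPreWrite(ptr, n) // skipped in step (1), outside a GC cycle
	}
	// If preemption were allowed HERE and a mark phase began (steps (2)
	// and (3)), the clear below would delete pointers without barriers
	// during marking (step (4)).
	memclrNoHeapPointers(ptr, n)
}

func main() {
	buf := make([]byte, 4096)
	typedmemclrSketch(unsafe.Pointer(&buf[0]), uintptr(len(buf)))
}
```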
Unfortunately there may be more bad or possibly-bad-but-actually-fine scenarios, but reasoning about all the situations will be subtle.
I don't know why these functions do not explicitly disable preemption, but I suspect it's just performance. However, I think explicitly disabling preemption would probably not meaningfully impact performance either.
> Second, given that the current function definition includes a promise of atomicity: is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)
>
> Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions (the size of a single pointer) are each cleared atomically?
I agree with @randall77 that it's unlikely linknamed Go code will run into this.
... With maybe one exception. IIRC `sonic` JITs code, and it may be making calls to `mallocgc` with `needzero==false`. Though if they're doing that with anything containing pointers, that might already be broken. There are preemption points in `mallocgc`.
Comment From: rhysh
With CL 31763 (merged as 8f81dfe8b47e975b90bb4a2f8dd314d32c633176), the runtime executes the write barrier in advance of its pointer writes rather than after. See for example `typedmemmove`.

The comment on `bulkBarrierPreWrite` in mbitmap.go is an outdated hybrid, with portions preceding that 2016 switch:
```go
// Callers should call bulkBarrierPreWrite immediately before
// calling memmove(dst, src, size). This function is marked nosplit
// to avoid being preempted; the GC must not stop the goroutine
// between the memmove and the execution of the barriers.
// The caller is also responsible for cgo pointer checks if this
// may be writing Go pointers into non-Go memory.
```
Yes, the call to `bulkBarrierPreWrite` should precede the `memmove`. However, no instant exists "between the memmove and the execution of the barriers", since the barriers come first.
In late 2022, an accidental preemption point in `typedmemclr` led to #55156 (and others). While fixing it, CL 431919 (merged as f1b7b2fc52947711b8e78f7078c9e0bda35320d3) suggested that `typedmemclr` would benefit from a `go:nosplitrec` annotation.

But the accidental preemption point was within the implementation of the write barrier itself. It looks like #55156 is evidence that the write barrier must not be preempted, without providing any data on whether a preemption point between the write barrier and the copy would be allowed.
Out of the many uses of `memclrNoHeapPointers`, @randall77 called out the case of "clearing junk from something that is about to contain pointers". That would come up when the type of the memory is changing, which would be limited to within the memory allocator.

We already use `memclrNoHeapPointersChunked` in `mallocgcLarge`, with @randall77's second option, "to ensure that an object mid-initialization has no ptr bits set yet." After obtaining the memory, it sets `span.largeType = nil`. Near the end, it does the chunked/preemptible memory clear and then immediately sets the type info via `heapSetTypeLarge`.
That works since the new (reused) allocation is only reachable from the goroutine's stack. The GC can't find it without scanning that goroutine's stack, and it can't do that until the goroutine is once again preemptible.
If an allocation is already reachable from other roots, the GC could find it at any moment. It's already nearly impossible to change its type from one scannable ("non-noscan") type to another; we'd need to claim that it doesn't have any pointers, then zero it, then apply the new type / pointer pattern. And on top of that, we'd need to ensure that no GC worker that started during the first phase is still running during the third. That sounds very much like the sweeper and the allocator; it's an arduous process, so I don't think we'd have another version of those code paths hiding somewhere else.
For the cases where the type does not change, it seems that a preemption point between the write barrier and the zeroing is only a small increase in how observable the state is. For allocations that are reachable from other roots, the GC (and user code, via data races) can already observe the partially-zeroed state. The change is to allow the caller to switch threads (may need a publication barrier?), or to allow the GC to observe the allocation -- and for anything but new allocations, we have to assume the GC could already do that via other roots.
It allows a second (typed) copy of the memory to exist for a bit longer than before, and for cases where we don't change the allocation's type this seems like a non-issue. The execution of the write barrier -- or the observation that it's inactive -- is the commit point.
The only way that avoiding preemption grants any atomicity is for pointers that are not yet reachable from other roots. That's the memory allocator, which already decides whether to use the chunked version and correctly defends against its caveats.
The current chunk size for `memclrNoHeapPointersChunked` is 256 KiB. We should take care to not quietly corrupt the code for allocating small objects, at or below 32 KiB.
Does that sound like the start of a proof, @mknyszek?