findRunnable makes a copy of the slice header of allp because once findRunnable drops its P a STW can change allp without synchronization (changing length and possibly allocating a new backing array) via procresize (GOMAXPROCS change).
"Possibly allocating a new backing array" is the problem, since allp is simply a standard heap allocation. allpSnapshot is on the system stack of an M without a P. This means that (a) STW can proceed without stopping the M in findRunnable and (b) the GC will not scan allpSnapshot. Thus, we could have this sequence:
- M1 copies the
allpslice header toallpSnapshot. - M1 drops its P.
- M2 calls runtime.GOMAXPROCS.
- STW need not stop M1.
- procresize reallocates the
allpbacking array. The old array is now only referenced by M1'sallpSnapshot. - World restarts.
- M2 triggers a GC.
- STW need not stop M1.
- GC does not scan M1's system stack, so it does not find a reference to the old
allparray. - Old
allparray is freed. - Word restarts.
- M2 allocates something which happens to reuse the same memory as the old
allparray, which zeroes it (and then maybe writes to it). - M1 reads from
allpSnapshot, reading the now-clobbered array.
This is only possible if the GOMAXPROCS increases beyond the initial startup value (to trigger reallocation).
This also requires M1 to run really slowly to lose the race. M2 needs to do multiple stop-the-worlds and run an entire GC all before M1 manages to finish using allpSnapshot. That seems pretty far-fetched, but could be possible if the kernel deschedules M1.
cc @golang/runtime
Comment From: prattmic
There are a few options here:
- Use
persistentallocforallp. It can only grow, which shouldn't happen too many times, especially if we use power-of-two sizes. - Put
allpSnapshotin the M struct. The Ms are reachable viaallm, so that would keep the backing array alive as long as we need it.
(2) seems simple and straightforward, so that is my preference.
Comment From: prattmic
@gopherbot Please backport to 1.23 and 1.24. This is a long-standing bug, but can cause random crashes when increasing GOMAXPROCS.
Comment From: gopherbot
Backport issue(s) opened: #74415 (for 1.23), #74416 (for 1.24).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Comment From: gabyhelp
Related Code Changes
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Comment From: gopherbot
Change https://go.dev/cl/684460 mentions this issue: runtime: stash allpSnapshot on the M
Comment From: gopherbot
Change https://go.dev/cl/685075 mentions this issue: [release-branch.go1.23] runtime: stash allpSnapshot on the M
Comment From: gopherbot
Change https://go.dev/cl/685056 mentions this issue: [release-branch.go1.24] runtime: stash allpSnapshot on the M
Comment From: dr2chase
Critical because of a chance of garbage values / memory corruption.