findRunnable
makes a copy of the slice header of allp
because once findRunnable
drops its P a STW can change allp
without synchronization (changing length and possibly allocating a new backing array) via procresize
(GOMAXPROCS change).
"Possibly allocating a new backing array" is the problem, since allp
is simply a standard heap allocation. allpSnapshot
is on the system stack of an M without a P. This means that (a) STW can proceed without stopping the M in findRunnable
and (b) the GC will not scan allpSnapshot
. Thus, we could have this sequence:
- M1 copies the
allp
slice header toallpSnapshot
. - M1 drops its P.
- M2 calls runtime.GOMAXPROCS.
- STW need not stop M1.
- procresize reallocates the
allp
backing array. The old array is now only referenced by M1'sallpSnapshot
. - World restarts.
- M2 triggers a GC.
- STW need not stop M1.
- GC does not scan M1's system stack, so it does not find a reference to the old
allp
array. - Old
allp
array is freed. - Word restarts.
- M2 allocates something which happens to reuse the same memory as the old
allp
array, which zeroes it (and then maybe writes to it). - M1 reads from
allpSnapshot
, reading the now-clobbered array.
This is only possible if the GOMAXPROCS
increases beyond the initial startup value (to trigger reallocation).
This also requires M1 to run really slowly to lose the race. M2 needs to do multiple stop-the-worlds and run an entire GC all before M1 manages to finish using allpSnapshot
. That seems pretty far-fetched, but could be possible if the kernel deschedules M1.
cc @golang/runtime
Comment From: prattmic
There are a few options here:
- Use
persistentalloc
forallp
. It can only grow, which shouldn't happen too many times, especially if we use power-of-two sizes. - Put
allpSnapshot
in the M struct. The Ms are reachable viaallm
, so that would keep the backing array alive as long as we need it.
(2) seems simple and straightforward, so that is my preference.
Comment From: prattmic
@gopherbot Please backport to 1.23 and 1.24. This is a long-standing bug, but can cause random crashes when increasing GOMAXPROCS.
Comment From: gopherbot
Backport issue(s) opened: #74415 (for 1.23), #74416 (for 1.24).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Comment From: gabyhelp
Related Code Changes
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Comment From: gopherbot
Change https://go.dev/cl/684460 mentions this issue: runtime: stash allpSnapshot on the M