findRunnable makes a copy of the slice header of allp because once findRunnable drops its P a STW can change allp without synchronization (changing length and possibly allocating a new backing array) via procresize (GOMAXPROCS change).

"Possibly allocating a new backing array" is the problem, since allp is simply a standard heap allocation. allpSnapshot is on the system stack of an M without a P. This means that (a) STW can proceed without stopping the M in findRunnable and (b) the GC will not scan allpSnapshot. Thus, we could have this sequence:

  1. M1 copies the allp slice header to allpSnapshot.
  2. M1 drops its P.
  3. M2 calls runtime.GOMAXPROCS.
  4. STW need not stop M1.
  5. procresize reallocates the allp backing array. The old array is now only referenced by M1's allpSnapshot.
  6. World restarts.
  7. M2 triggers a GC.
  8. STW need not stop M1.
  9. GC does not scan M1's system stack, so it does not find a reference to the old allp array.
  10. Old allp array is freed.
  11. Word restarts.
  12. M2 allocates something which happens to reuse the same memory as the old allp array, which zeroes it (and then maybe writes to it).
  13. M1 reads from allpSnapshot, reading the now-clobbered array.

This is only possible if the GOMAXPROCS increases beyond the initial startup value (to trigger reallocation).

This also requires M1 to run really slowly to lose the race. M2 needs to do multiple stop-the-worlds and run an entire GC all before M1 manages to finish using allpSnapshot. That seems pretty far-fetched, but could be possible if the kernel deschedules M1.

cc @golang/runtime

Comment From: prattmic

There are a few options here:

  1. Use persistentalloc for allp. It can only grow, which shouldn't happen too many times, especially if we use power-of-two sizes.
  2. Put allpSnapshot in the M struct. The Ms are reachable via allm, so that would keep the backing array alive as long as we need it.

(2) seems simple and straightforward, so that is my preference.

Comment From: prattmic

@gopherbot Please backport to 1.23 and 1.24. This is a long-standing bug, but can cause random crashes when increasing GOMAXPROCS.

Comment From: gopherbot

Backport issue(s) opened: #74415 (for 1.23), #74416 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

Comment From: gabyhelp

Related Code Changes

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

Comment From: gopherbot

Change https://go.dev/cl/684460 mentions this issue: runtime: stash allpSnapshot on the M