Overview
This proposal addresses a scheduling issue in our test execution system (golang/go#61233) where a run task may be scheduled while an earlier run task is still waiting on its compile dependencies. The issue was introduced by a change that enforces that tests start in package order. When it occurs, a worker goroutine blocks on a channel, which in turn can starve the worker pool. My proposed solution is to introduce chained barrier tasks that ensure run tasks are scheduled in order. With barriers in place, any channel wait resolves immediately, while the start order of package tests is preserved.
Context
In our current DAG structure:
- We have run tasks per test package.
- Each run task has a corresponding compile task as its child.
- All tasks are given a priority, based on their insertion order into the DAG.
- Run tasks use an internal channel-based locking mechanism to enforce that they start in insertion order.
When scheduling work from this DAG structure, we do the following (a rough Go sketch follows the list):
- Initialize a worker pool of goroutines (the -p flag)
- Create a set of all leaf nodes (tasks with no dependencies)
- Pick the leaf node with the highest priority (as defined in the last section)
- Upon a task's completion, signal its completion to all of its parents
- Add to the set of leaf nodes any nodes whose children have all completed
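To make this loop concrete, here is a minimal Go sketch. All names (task, taskHeap, schedule) are hypothetical and do not correspond to cmd/go's actual Action graph or builder types, and the channel-based start-in-order lock from the Context list is intentionally omitted so that only the pool/leaf/priority mechanics are shown.

```go
// Hypothetical sketch of the scheduling loop; not cmd/go's real types.
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// task is one node in the DAG (e.g. run_n or compile_n).
type task struct {
	name     string
	priority int     // insertion order; lower value = inserted earlier = picked first
	parents  []*task // tasks that depend on this one
	children int     // number of uncompleted dependencies
	work     func()
}

// taskHeap is the set of ready leaf tasks, ordered by priority.
type taskHeap []*task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].priority < h[j].priority }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// schedule executes the DAG with p worker goroutines (the -p flag), always
// picking the highest-priority ready leaf and promoting parents as their
// children complete. total is the number of tasks in the DAG.
func schedule(leaves []*task, total, p int) {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready taskHeap
		done  int
	)
	for _, t := range leaves {
		heap.Push(&ready, t)
	}

	var wg sync.WaitGroup
	for i := 0; i < p; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				mu.Lock()
				for len(ready) == 0 && done < total {
					cond.Wait() // no ready leaf: this worker idles
				}
				if done == total {
					mu.Unlock()
					return
				}
				t := heap.Pop(&ready).(*task)
				mu.Unlock()

				if t.work != nil {
					t.work()
				}

				mu.Lock()
				done++
				// Signal completion to all parents; any parent whose
				// children have all completed becomes a ready leaf.
				for _, par := range t.parents {
					par.children--
					if par.children == 0 {
						heap.Push(&ready, par)
					}
				}
				cond.Broadcast()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Tiny example: run_0 depends on compile_0.
	runTask := &task{name: "run_0", priority: 0, children: 1, work: func() { fmt.Println("run_0") }}
	compileTask := &task{name: "compile_0", priority: 1, parents: []*task{runTask}, work: func() { fmt.Println("compile_0") }}
	schedule([]*task{compileTask}, 2, 2)
}
```

The property that matters for this proposal is in the completion step: a parent is pushed into the ready set only once all of its children have completed.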
Issue
Due to varying compile task dependency chains (which may have different depths and breadths), it's possible for a later compile task (say, compile_8) to finish before an earlier one (such as compile_0). Without additional coordination, run_8 could be scheduled while run_0 is still waiting on its compile dependencies. The work scheduler will hand run_8 to a worker goroutine, but that goroutine will lock up waiting for all preceding run tasks to start.
This can cause situations where the entire worker pool of goroutines, save for one, gets locked in channel receives, effectively reducing execution to a single thread of work.
At Samsara we saw some of our larger test suites go from 15m to 90m runtimes when running with a pool of 16 (-p 16), with task concurrency collapsing until only one task was running at a time.
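A toy program makes the lock-up concrete. It models only the start-in-order channel lock, with one goroutine standing in for each pool worker that has been handed a run task; the names are made up for illustration and this is not the actual cmd/go code.

```go
// Toy model of the start-in-order channel lock; not the actual cmd/go code.
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 4
	// started[i] is closed once run_i has started.
	started := make([]chan struct{}, n)
	for i := range started {
		started[i] = make(chan struct{})
	}

	runTask := func(i int) {
		if i > 0 {
			// If the scheduler hands run_i to a worker before run_{i-1} is
			// even schedulable, this receive parks that worker goroutine.
			<-started[i-1]
		}
		close(started[i]) // unblock run_{i+1}
		fmt.Printf("run_%d started\n", i)
	}

	// Dispatch run_3, run_2, run_1 first, mimicking their compile tasks
	// finishing before compile_0: every one of these goroutines parks
	// immediately on the channel receive above.
	for i := n - 1; i > 0; i-- {
		go runTask(i)
	}
	time.Sleep(100 * time.Millisecond) // run_1..run_3 are all parked now

	// Only once run_0 finally becomes runnable does the chain unwind.
	runTask(0)
	time.Sleep(100 * time.Millisecond) // give the unblocked goroutines time to print
}
```

In a fixed-size pool, each parked goroutine is a worker that can no longer pick up compile work, which is how concurrency collapses to one task at a time.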
Proposed Solution
I propose adding a barrier task for each run task to enforce scheduling order. Each barrier task ensures that:
- Its associated compile task is complete.
- It does not clear until the previous barrier task has cleared.
By making each run task depend on its corresponding barrier task, we ensure that run tasks are scheduled strictly in insertion order—even if compile tasks finish out of order. This extra layer of coordination prevents a run task from being scheduled prematurely and thus avoids worker lock-ups.
Design Details
Barrier Task per Run Task
For every run task run_n, add a barrier task barrier_n.
Barrier-to-Compile:
Each barrier task (barrier_n) depends on its corresponding compile task (compile_n), ensuring that the compile phase is complete before proceeding.
Chained Barrier Dependency:
For every barrier task with n > 0, add a dependency on the previous barrier task (barrier_n depends on barrier_(n-1)). This horizontal dependency guarantees that a run task is not scheduled until all earlier barriers have cleared, and therefore all earlier compile tasks have completed.
Run Task Dependency
Each run task is modified to depend on its corresponding barrier task. This ensures that even if a compile task completes early, its run task won't be scheduled until all prior barriers have cleared, i.e. until all prior run tasks are themselves schedulable.
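A rough sketch of this wiring is below. The task type and dep helper are hypothetical stand-ins for cmd/go's real dependency mechanism; the point is only the shape of the resulting graph.

```go
// Hypothetical wiring of the chained barrier tasks; task and dep are
// illustrative stand-ins, not cmd/go's real dependency API.
package main

import "fmt"

// task is a minimal DAG node; deps are the children that must complete first.
type task struct {
	name string
	deps []*task
}

func dep(parent, child *task) { parent.deps = append(parent.deps, child) }

// buildGraph builds run_i -> barrier_i -> {compile_i, barrier_{i-1}} for n packages.
func buildGraph(n int) []*task {
	runs := make([]*task, 0, n)
	var prevBarrier *task
	for i := 0; i < n; i++ {
		compile := &task{name: fmt.Sprintf("compile_%d", i)}
		barrier := &task{name: fmt.Sprintf("barrier_%d", i)}
		run := &task{name: fmt.Sprintf("run_%d", i)}

		dep(barrier, compile) // barrier_i waits for its own compile task
		if prevBarrier != nil {
			dep(barrier, prevBarrier) // ...and for the previous barrier
		}
		dep(run, barrier) // run_i depends only on barrier_i

		runs = append(runs, run)
		prevBarrier = barrier
	}
	return runs
}

func main() {
	for _, r := range buildGraph(3) {
		fmt.Printf("%s depends on %s\n", r.name, r.deps[0].name)
	}
}
```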
Text Diagrams
Before (Original Graph)
root
│
┌────────┼────────┐
│ │ │
run_0 run_1 run_2 ... run_n
│ │ │
compile_0 compile_1 compile_2 ... compile_n
Explanation
- The root node spawns run tasks (run_0, run_1, …).
- Each run_n has a single child, compile_n.
- Run tasks use internal channel-based locking to enforce start order.
After (With Chained Barrier Tasks)
root
│
┌────────┼────────┐
│ │ │
run_0 run_1 run_2 ... run_n
│ │ │
barrier_0 barrier_1 ... barrier_n
│ │ │
│ <───── │ <───── │ (each barrier_n depends on barrier_(n-1))
│ │ │
compile_0 compile_1 ... compile_n
Explanation
- The root node spawns run tasks (run_0, run_1, …).
- Each run_n now depends on a corresponding barrier_n.
- Each barrier_n depends on its associated compile_n task.
- Additionally, each barrier (starting with barrier_1) has a horizontal dependency on the previous barrier (e.g., barrier_1 depends on barrier_0).
- This chain ensures that a run task (e.g., run_1) is scheduled only after its compile task is complete and after the previous run task (via barrier_0) has been fully processed.
Proof: Barriers Enforce Scheduling Order
Goal
Prove that run_n cannot be scheduled until all run_{<n} are also schedulable.
Base Case (n = 0)
- run_0 depends on barrier_0
- barrier_0 depends on compile_0
- So run_0 becomes a DAG leaf once compile_0 is done
Inductive Step (n > 0)
- run_n depends on barrier_n
- barrier_n depends on:
- compile_n
- barrier_{n-1}
- Therefore:
- run_n is not schedulable until barrier_{n-1} is complete
- By induction, all barrier_{<n} must be complete first
- So all run_{<n} must have already been schedulable
Conclusion
The chain of barriers ensures that run tasks are only scheduled once all earlier run tasks are ready. This avoids scheduling tasks that will block on internal locks and restores proper utilization of the worker pool.
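As a small sanity check of this argument, the following illustrative program (plain channels standing in for the real scheduler) completes the compile tasks in reverse order, yet the barriers clear, and therefore the run tasks become schedulable, strictly in index order.

```go
// Illustrative check: compiles finish in reverse order, barriers still clear in order.
package main

import "fmt"

func main() {
	const n = 4
	compileDone := make([]chan struct{}, n)
	barrierDone := make([]chan struct{}, n)
	for i := 0; i < n; i++ {
		compileDone[i] = make(chan struct{})
		barrierDone[i] = make(chan struct{})
	}

	// barrier_i clears only after compile_i is done and barrier_{i-1} has cleared.
	for i := 0; i < n; i++ {
		go func(i int) {
			<-compileDone[i]
			if i > 0 {
				<-barrierDone[i-1]
			}
			fmt.Printf("barrier_%d cleared: run_%d is now schedulable\n", i, i)
			close(barrierDone[i])
		}(i)
	}

	// Finish the compile tasks in reverse order (compile_3 first, compile_0 last)...
	for i := n - 1; i >= 0; i-- {
		close(compileDone[i])
	}

	// ...yet the output is strictly barrier_0, barrier_1, barrier_2, barrier_3,
	// because each barrier waits on its predecessor before clearing.
	<-barrierDone[n-1]
}
```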
Comment From: jsm
I'll add that I do think this approach is pretty heavy handed. I think the better solution would be to remove the start-in-order guarantee, so we can just remove the channel locking completely.
Comment From: jsm
cc: @rsc as the person who introduced start-in-order behavior with https://go-review.googlesource.com/c/go/+/448357
Comment From: jsm
I've pushed a change here: https://go-review.googlesource.com/c/go/+/668035?tab=comments
I verified this works using the testing method suggested in https://github.com/golang/go/issues/61233
for i in a b c d; do
  mkdir -p $i
  cat <<EOF > "$i/${i}_test.go"
package $i

import (
    "testing"
    "time"
)

func TestSleep(t *testing.T) {
    if "$i" == "b" {
        time.Sleep(time.Second)
    }
}
EOF
done

for i in e f g; do
  mkdir -p $i
  cat <<EOF > "$i/${i}.go"
package $i
EOF
done

GOMAXPROCS=2 go test -a -count=1 -p 3 -debug-trace=trace.out ./...
Before: (execution trace omitted)
After: (execution trace omitted)
Comment From: gopherbot
Change https://go.dev/cl/668035 mentions this issue: cmd/go: test barrier actions
Comment From: matloob
Sounds good to me. (It would be nice to not have the ordering imposed at all when we're not in json mode, but this sounds good to get in now.)
Comment From: jsm
Just in case the process needs it (I see there are still "NeedsFix" and "FixPending" tags), I have pushed changes that should resolve any outstanding concerns.
Comment From: thepudds
(It would be nice to not have the ordering imposed at all when we're not in json mode, but this sounds good to get in now.)
Hi @jsm and @matloob, quick question -- does https://go.dev/cl/668035 fully resolve the issue @howardjohn reported in #61233?
If so, I don't fully understand the comment I just quoted, which might be possible to (mis-)read as implying that CL might be a partial solution or initial improvement?
The CL currently includes Fixes #61233.
In any event, very exciting to see the progress here!
Comment From: jsm
@thepudds yes it does resolve it, at the cost of some limitations in the ability to start running tests at full concurrency. I think @matloob is suggesting removing the ordered start and ordered scheduling altogether when we don't need them (such as outside of json mode), so we don't limit the concurrency at all.
Comment From: matloob
Yes, as @jsm says, the CL solves the concurrency problem in #61233, which was that we had actions sitting there doing nothing and preventing other actions from running.