Overview
This proposal addresses a scheduling issue in our test execution system (golang/go#61233) where a run task may be scheduled while an earlier run task is still waiting on its compile dependencies. The issue was introduced by a change that enforces that tests start in package order. When it occurs, a worker goroutine blocks on a channel, which in turn can starve the worker pool. My proposed solution is to introduce chained barrier tasks that ensure run tasks are scheduled in order. With barriers in place, any channel wait resolves immediately, while the start order of package tests is preserved.
Context
In our current DAG structure:
- We have run tasks per test package.
- Each run task has a corresponding compile task as its child.
- All tasks are given a priority, based on their insertion order into the DAG.
- Run tasks use an internal channel-based locking mechanism to enforce that they start in insertion order.
When scheduling work from this DAG structure, we do the following (a rough Go sketch follows the list):
- Initialize a worker pool of goroutines (the -p flag)
- Create a set of all leaf nodes (tasks with no dependencies)
- Pick the leaf node with the highest priority (as defined in the last section)
- Upon a task's completion, signal its completion to all of its parents
- Add to the set of leaf nodes any nodes whose children have all completed
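To make this loop concrete, here is a minimal Go sketch. All names (task, taskHeap, schedule) are hypothetical and do not correspond to cmd/go's actual Action graph or builder types, and the channel-based start-in-order lock from the Context list is intentionally omitted so that only the pool/leaf/priority mechanics are shown.

```go
// Hypothetical sketch of the scheduling loop; not cmd/go's real types.
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// task is one node in the DAG (e.g. run_n or compile_n).
type task struct {
	name     string
	priority int     // insertion order; lower value = inserted earlier = picked first
	parents  []*task // tasks that depend on this one
	children int     // number of uncompleted dependencies
	work     func()
}

// taskHeap is the set of ready leaf tasks, ordered by priority.
type taskHeap []*task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].priority < h[j].priority }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// schedule executes the DAG with p worker goroutines (the -p flag), always
// picking the highest-priority ready leaf and promoting parents as their
// children complete. total is the number of tasks in the DAG.
func schedule(leaves []*task, total, p int) {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready taskHeap
		done  int
	)
	for _, t := range leaves {
		heap.Push(&ready, t)
	}

	var wg sync.WaitGroup
	for i := 0; i < p; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				mu.Lock()
				for len(ready) == 0 && done < total {
					cond.Wait() // no ready leaf: this worker idles
				}
				if done == total {
					mu.Unlock()
					return
				}
				t := heap.Pop(&ready).(*task)
				mu.Unlock()

				if t.work != nil {
					t.work()
				}

				mu.Lock()
				done++
				// Signal completion to all parents; any parent whose
				// children have all completed becomes a ready leaf.
				for _, par := range t.parents {
					par.children--
					if par.children == 0 {
						heap.Push(&ready, par)
					}
				}
				cond.Broadcast()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Tiny example: run_0 depends on compile_0.
	runTask := &task{name: "run_0", priority: 0, children: 1, work: func() { fmt.Println("run_0") }}
	compileTask := &task{name: "compile_0", priority: 1, parents: []*task{runTask}, work: func() { fmt.Println("compile_0") }}
	schedule([]*task{compileTask}, 2, 2)
}
```

The property that matters for this proposal is in the completion step: a parent is pushed into the ready set only once all of its children have completed.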
Issue
Due to varying compile task dependency chains (which may have different depths and breadths), it's possible for a later compile task (say, compile_8) to finish before an earlier one (such as compile_0). Without additional coordination, run_8 could be scheduled while run_0 is still waiting on its compile dependencies. The work scheduler will hand run_8 to a worker goroutine, but that goroutine will lock up waiting for all preceding run tasks to start.
This can cause situations where the entire worker pool of goroutines, save for one, gets locked in channel receives, effectively reducing execution to a single thread of work.
At Samsara we saw some of our larger test suites go from 15m to 90m runtimes when running with a pool of 16 (-p 16), with task concurrency collapsing until only one task was running at a time.
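A toy program makes the lock-up concrete. It models only the start-in-order channel lock, with one goroutine standing in for each pool worker that has been handed a run task; the names are made up for illustration and this is not the actual cmd/go code.

```go
// Toy model of the start-in-order channel lock; not the actual cmd/go code.
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 4
	// started[i] is closed once run_i has started.
	started := make([]chan struct{}, n)
	for i := range started {
		started[i] = make(chan struct{})
	}

	runTask := func(i int) {
		if i > 0 {
			// If the scheduler hands run_i to a worker before run_{i-1} is
			// even schedulable, this receive parks that worker goroutine.
			<-started[i-1]
		}
		close(started[i]) // unblock run_{i+1}
		fmt.Printf("run_%d started\n", i)
	}

	// Dispatch run_3, run_2, run_1 first, mimicking their compile tasks
	// finishing before compile_0: every one of these goroutines parks
	// immediately on the channel receive above.
	for i := n - 1; i > 0; i-- {
		go runTask(i)
	}
	time.Sleep(100 * time.Millisecond) // run_1..run_3 are all parked now

	// Only once run_0 finally becomes runnable does the chain unwind.
	runTask(0)
	time.Sleep(100 * time.Millisecond) // give the unblocked goroutines time to print
}
```

In a fixed-size pool, each parked goroutine is a worker that can no longer pick up compile work, which is how concurrency collapses to one task at a time.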
Proposed Solution
I propose adding a barrier task for each run task to enforce scheduling order. Each barrier task ensures that:
- Its associated compile task is complete.
- It does not clear until the previous barrier task has cleared.
By making each run task depend on its corresponding barrier task, we ensure that run tasks are scheduled strictly in insertion order—even if compile tasks finish out of order. This extra layer of coordination prevents a run task from being scheduled prematurely and thus avoids worker lock-ups.
Design Details
Barrier Task per Run Task
For every run task run_n, add a barrier task barrier_n.
Barrier-to-Compile:
Each barrier task (barrier_n) depends on its corresponding compile task (compile_n), ensuring that the compile phase is complete before proceeding.
Chained Barrier Dependency:
For every barrier task with n > 0, add a dependency on the previous barrier task (barrier_n depends on barrier_(n-1)). This horizontal dependency guarantees that a run task is not scheduled until all earlier barriers have cleared, and therefore all earlier compile tasks have completed.
Run Task Dependency
Each run task is modified to depend on its corresponding barrier task. This ensures that even if a compile task completes early, its run task won't be scheduled until all prior barriers have cleared, i.e. until all prior run tasks are themselves schedulable.
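A rough sketch of this wiring is below. The task type and dep helper are hypothetical stand-ins for cmd/go's real dependency mechanism; the point is only the shape of the resulting graph.

```go
// Hypothetical wiring of the chained barrier tasks; task and dep are
// illustrative stand-ins, not cmd/go's real dependency API.
package main

import "fmt"

// task is a minimal DAG node; deps are the children that must complete first.
type task struct {
	name string
	deps []*task
}

func dep(parent, child *task) { parent.deps = append(parent.deps, child) }

// buildGraph builds run_i -> barrier_i -> {compile_i, barrier_{i-1}} for n packages.
func buildGraph(n int) []*task {
	runs := make([]*task, 0, n)
	var prevBarrier *task
	for i := 0; i < n; i++ {
		compile := &task{name: fmt.Sprintf("compile_%d", i)}
		barrier := &task{name: fmt.Sprintf("barrier_%d", i)}
		run := &task{name: fmt.Sprintf("run_%d", i)}

		dep(barrier, compile) // barrier_i waits for its own compile task
		if prevBarrier != nil {
			dep(barrier, prevBarrier) // ...and for the previous barrier
		}
		dep(run, barrier) // run_i depends only on barrier_i

		runs = append(runs, run)
		prevBarrier = barrier
	}
	return runs
}

func main() {
	for _, r := range buildGraph(3) {
		fmt.Printf("%s depends on %s\n", r.name, r.deps[0].name)
	}
}
```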
Text Diagrams
Before (Original Graph)
root
│
┌────────┼────────┐
│ │ │
run_0 run_1 run_2 ... run_n
│ │ │
compile_0 compile_1 compile_2 ... compile_n
Explanation
- The root node spawns run tasks (run_0, run_1, …).
- Each run_n has a single child, compile_n.
- Run tasks use internal channel-based locking to enforce start order.
After (With Chained Barrier Tasks)
root
│
┌────────┼────────┐
│ │ │
run_0 run_1 run_2 ... run_n
│ │ │
barrier_0 barrier_1 ... barrier_n
│ │ │
│ <───── │ <───── │ (each barrier_n depends on barrier_(n-1))
│ │ │
compile_0 compile_1 ... compile_n
Explanation
- The root node spawns run tasks (run_0, run_1, …).
- Each run_n now depends on a corresponding barrier_n.
- Each barrier_n depends on its associated compile_n task.
- Additionally, each barrier (starting with barrier_1) has a horizontal dependency on the previous barrier (e.g., barrier_1 depends on barrier_0).
- This chain ensures that a run task (e.g., run_1) is scheduled only after its compile task is complete and after the previous run task (via barrier_0) has been fully processed.
Proof: Barriers Enforce Scheduling Order
Goal
Prove that run_n cannot be scheduled until all run_{<n} are also schedulable.
Base Case (n = 0)
- run_0 depends on barrier_0
- barrier_0 depends on compile_0
- So run_0 becomes a DAG leaf once compile_0 is done
Inductive Step (n > 0)
- run_n depends on barrier_n
- barrier_n depends on:
- compile_n
- barrier_{n-1}
- Therefore:
- run_n is not schedulable until barrier_{n-1} is complete
- By induction, all barrier_{<n} must be complete first
- So all run_{<n} must have already been schedulable
Conclusion
The chain of barriers ensures that run tasks are only scheduled once all earlier run tasks are ready. This avoids scheduling tasks that will block on internal locks and restores proper utilization of the worker pool.
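As a small sanity check of this argument, the following illustrative program (plain channels standing in for the real scheduler) completes the compile tasks in reverse order, yet the barriers clear, and therefore the run tasks become schedulable, strictly in index order.

```go
// Illustrative check: compiles finish in reverse order, barriers still clear in order.
package main

import "fmt"

func main() {
	const n = 4
	compileDone := make([]chan struct{}, n)
	barrierDone := make([]chan struct{}, n)
	for i := 0; i < n; i++ {
		compileDone[i] = make(chan struct{})
		barrierDone[i] = make(chan struct{})
	}

	// barrier_i clears only after compile_i is done and barrier_{i-1} has cleared.
	for i := 0; i < n; i++ {
		go func(i int) {
			<-compileDone[i]
			if i > 0 {
				<-barrierDone[i-1]
			}
			fmt.Printf("barrier_%d cleared: run_%d is now schedulable\n", i, i)
			close(barrierDone[i])
		}(i)
	}

	// Finish the compile tasks in reverse order (compile_3 first, compile_0 last)...
	for i := n - 1; i >= 0; i-- {
		close(compileDone[i])
	}

	// ...yet the output is strictly barrier_0, barrier_1, barrier_2, barrier_3,
	// because each barrier waits on its predecessor before clearing.
	<-barrierDone[n-1]
}
```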
Comment From: jsm
I'll add that I do think this approach is pretty heavy handed. I think the better solution would be to remove the start-in-order guarantee, so we can just remove the channel locking completely.
Comment From: jsm
cc: @rsc as the person who introduced start-in-order behavior with https://go-review.googlesource.com/c/go/+/448357
Comment From: jsm
I've pushed a change here: https://go-review.googlesource.com/c/go/+/668035?tab=comments
I verified this works using the testing method suggested in https://github.com/golang/go/issues/61233
for i in a b c d; do
  mkdir -p $i
  cat <<EOF > "$i/${i}_test.go"
package $i

import (
    "testing"
    "time"
)

func TestSleep(t *testing.T) {
    if "$i" == "b" {
        time.Sleep(time.Second)
    }
}
EOF
done

for i in e f g; do
  mkdir -p $i
  cat <<EOF > "$i/${i}.go"
package $i
EOF
done

GOMAXPROCS=2 go test -a -count=1 -p 3 -debug-trace=trace.out ./...
Before: (execution trace omitted)
After: (execution trace omitted)
Comment From: gopherbot
Change https://go.dev/cl/668035 mentions this issue: cmd/go: test barrier actions
Comment From: matloob
Sounds good to me. (It would be nice to not have the ordering imposed at all when we're not in json mode, but this sounds good to get in now.)
Comment From: jsm
Just in case the process needs it (I see there are still "NeedsFix" and "FixPending" tags), I have pushed changes that should resolve any outstanding concerns.
Comment From: thepudds
(It would be nice to not have the ordering imposed at all when we're not in json mode, but this sounds good to get in now.)
Hi @jsm and @matloob, quick question -- does https://go.dev/cl/668035 fully resolve the issue @howardjohn reported in #61233?
If so, I don't fully understand the comment I just quoted, which might be possible to (mis-)read as implying that CL might be a partial solution or initial improvement?
The CL currently includes Fixes #61233.
In any event, very exciting to see the progress here!
Comment From: jsm
@thepudds yes it does resolve it, at the cost of some limitations in the ability to start running tests at full concurrency. I think @matloob is suggesting removing the ordered start and ordered scheduling altogether when we don't need them (such as outside of json mode), so we don't limit the concurrency at all.
Comment From: matloob
Yes, as @jsm says, the CL solves the concurrency problem in #61233, which was that we had actions sitting there doing nothing and preventing other actions from running.