Currently a lot of short tests are being executed individually in separate invocations of go tool dist test, instead of being grouped together. This matters for overall runtime, because on the raspberry pi builders each invocation of go tool dist test adds an overhead of >30 seconds, even for a test which takes a small fraction of a second.
See for example this excerpt from an execution log on plan9-arm:
2021-10-25T17:15:53Z run_tests_multi pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T17:16:23Z still_waiting_on_test go_test:internal/abi
2021-10-25T17:16:53Z still_waiting_on_test go_test:internal/abi
2021-10-25T17:17:18Z finish_run_tests_multi after 1m24.6s; pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T17:17:18Z run_test:go_test:internal/execabs pi4g
2021-10-25T17:17:48Z still_waiting_on_test go_test:internal/execabs
2021-10-25T17:17:53Z finish_run_test:go_test:internal/execabs after 35.3s; pi4g
2021-10-25T17:17:53Z run_test:go_test:internal/fmtsort pi4g
2021-10-25T17:18:23Z still_waiting_on_test go_test:internal/fmtsort
2021-10-25T17:18:37Z finish_run_test:go_test:internal/fmtsort after 43.5s; pi4g
2021-10-25T17:18:37Z run_test:go_test:internal/fuzz pi4g
2021-10-25T17:19:07Z still_waiting_on_test go_test:internal/fuzz
2021-10-25T17:19:13Z finish_run_test:go_test:internal/fuzz after 36.3s; pi4g
2021-10-25T17:19:13Z run_test:go_test:internal/itoa pi4g
2021-10-25T17:19:43Z still_waiting_on_test go_test:internal/itoa
2021-10-25T17:19:48Z finish_run_test:go_test:internal/itoa after 34.6s; pi4g
2021-10-25T17:19:48Z run_test:go_test:internal/poll pi4g
2021-10-25T17:20:18Z still_waiting_on_test go_test:internal/poll
2021-10-25T17:20:35Z finish_run_test:go_test:internal/poll after 47.1s; pi4g
2021-10-25T17:20:35Z run_test:go_test:internal/profile pi4g
2021-10-25T17:21:05Z still_waiting_on_test go_test:internal/profile
2021-10-25T17:21:11Z finish_run_test:go_test:internal/profile after 36.1s; pi4g
2021-10-25T17:21:11Z run_test:go_test:internal/reflectlite pi4g
2021-10-25T17:21:41Z still_waiting_on_test go_test:internal/reflectlite
2021-10-25T17:22:04Z finish_run_test:go_test:internal/reflectlite after 53s; pi4g
2021-10-25T17:22:04Z run_test:go_test:internal/singleflight pi4g
2021-10-25T17:22:34Z still_waiting_on_test go_test:internal/singleflight
2021-10-25T17:22:38Z finish_run_test:go_test:internal/singleflight after 34s; pi4g
2021-10-25T17:22:38Z run_test:go_test:internal/trace pi4g
2021-10-25T17:23:08Z still_waiting_on_test go_test:internal/trace
2021-10-25T17:23:14Z finish_run_test:go_test:internal/trace after 35.5s; pi4g
2021-10-25T17:23:14Z run_test:go_test:internal/unsafeheader pi4g
2021-10-25T17:23:44Z still_waiting_on_test go_test:internal/unsafeheader
2021-10-25T17:23:48Z finish_run_test:go_test:internal/unsafeheader after 34.8s; pi4g
2021-10-25T17:23:48Z run_test:go_test:internal/xcoff pi4g
2021-10-25T17:24:18Z still_waiting_on_test go_test:internal/xcoff
2021-10-25T17:24:24Z finish_run_test:go_test:internal/xcoff after 35.4s; pi4g
2021-10-25T17:24:24Z run_test:go_test:io pi4g
2021-10-25T17:24:54Z still_waiting_on_test go_test:io
2021-10-25T17:25:11Z finish_run_test:go_test:io after 47.4s; pi4g
2021-10-25T17:25:11Z run_test:go_test:io/fs pi4g
2021-10-25T17:25:41Z still_waiting_on_test go_test:io/fs
2021-10-25T17:25:46Z finish_run_test:go_test:io/fs after 34.9s; pi4g
2021-10-25T17:25:46Z run_test:go_test:io/ioutil pi4g
2021-10-25T17:26:16Z still_waiting_on_test go_test:io/ioutil
2021-10-25T17:26:39Z finish_run_test:go_test:io/ioutil after 52.4s; pi4g
2021-10-25T17:26:39Z run_test:go_test:log pi4g
2021-10-25T17:27:09Z still_waiting_on_test go_test:log
2021-10-25T17:27:15Z finish_run_test:go_test:log after 35.9s; pi4g
2021-10-25T17:27:15Z run_test:go_test:math pi4g
2021-10-25T17:27:45Z still_waiting_on_test go_test:math
2021-10-25T17:28:04Z finish_run_test:go_test:math after 49.6s; pi4g
2021-10-25T17:28:04Z run_test:go_test:math/big pi4g
2021-10-25T17:28:34Z still_waiting_on_test go_test:math/big
2021-10-25T17:28:51Z finish_run_test:go_test:math/big after 46.5s; pi4g
2021-10-25T17:28:51Z run_test:go_test:math/bits pi4g
2021-10-25T17:29:21Z still_waiting_on_test go_test:math/bits
2021-10-25T17:29:42Z finish_run_test:go_test:math/bits after 50.9s; pi4g
2021-10-25T17:29:42Z run_test:go_test:math/cmplx pi4g
2021-10-25T17:30:12Z still_waiting_on_test go_test:math/cmplx
2021-10-25T17:30:19Z finish_run_test:go_test:math/cmplx after 36.7s; pi4g
2021-10-25T17:30:19Z run_test:go_test:math/rand pi4g
2021-10-25T17:30:49Z still_waiting_on_test go_test:math/rand
2021-10-25T17:30:58Z finish_run_test:go_test:math/rand after 39.4s; pi4g
2021-10-25T17:30:58Z run_test:go_test:mime pi4g
2021-10-25T17:31:28Z still_waiting_on_test go_test:mime
2021-10-25T17:31:35Z finish_run_test:go_test:mime after 36.9s; pi4g
2021-10-25T17:31:35Z run_test:go_test:mime/multipart pi4g
2021-10-25T17:32:05Z still_waiting_on_test go_test:mime/multipart
2021-10-25T17:32:13Z finish_run_test:go_test:mime/multipart after 37.8s; pi4g
2021-10-25T17:32:13Z run_test:go_test:mime/quotedprintable pi4g
2021-10-25T17:32:43Z still_waiting_on_test go_test:mime/quotedprintable
2021-10-25T17:32:49Z finish_run_test:go_test:mime/quotedprintable after 36.1s; pi4g
2021-10-25T17:32:49Z run_test:go_test:net pi4g
2021-10-25T17:33:19Z still_waiting_on_test go_test:net
2021-10-25T17:33:49Z still_waiting_on_test go_test:net
2021-10-25T17:34:12Z finish_run_test:go_test:net after 1m23.2s; pi4g
2021-10-25T17:34:12Z run_test:go_test:net/http pi4g
2021-10-25T17:34:42Z still_waiting_on_test go_test:net/http
2021-10-25T17:35:12Z still_waiting_on_test go_test:net/http
2021-10-25T17:35:42Z still_waiting_on_test go_test:net/http
2021-10-25T17:36:12Z still_waiting_on_test go_test:net/http
2021-10-25T17:36:20Z finish_run_test:go_test:net/http after 2m8.1s; pi4g
2021-10-25T17:36:20Z run_tests_multi pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]
2021-10-25T17:36:50Z still_waiting_on_test go_test:net/http/cgi
2021-10-25T17:37:02Z finish_run_tests_multi after 41s; pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]
compared to the same subsequence of tests run on another slow platform, freebsd-arm64:
2021-10-25T19:14:16Z run_tests_multi aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T19:14:46Z still_waiting_on_test go_test:internal/abi
2021-10-25T19:14:51Z finish_run_tests_multi after 34.4s; aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T19:14:51Z run_tests_multi aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
2021-10-25T19:14:59Z finish_run_tests_multi after 8.31s; aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
2021-10-25T19:14:59Z run_tests_multi aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
2021-10-25T19:15:11Z finish_run_tests_multi after 12.3s; aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
2021-10-25T19:15:11Z run_tests_multi aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
2021-10-25T19:15:27Z finish_run_tests_multi after 15.4s; aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
2021-10-25T19:15:27Z run_tests_multi aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
2021-10-25T19:15:38Z finish_run_tests_multi after 11.7s; aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
2021-10-25T19:15:38Z run_tests_multi aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
2021-10-25T19:15:42Z finish_run_tests_multi after 3.83s; aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
2021-10-25T19:15:42Z run_tests_multi aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
2021-10-25T19:15:59Z finish_run_tests_multi after 16.6s; aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
2021-10-25T19:15:59Z run_tests_multi aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
2021-10-25T19:16:12Z finish_run_tests_multi after 13.3s; aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
2021-10-25T19:16:12Z run_tests_multi aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
2021-10-25T19:16:17Z finish_run_tests_multi after 4.66s; aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
2021-10-25T19:16:17Z run_tests_multi aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]
2021-10-25T19:16:47Z still_waiting_on_test go_test:net
2021-10-25T19:16:56Z finish_run_tests_multi after 39.2s; aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]
Naive question: on single builders without helpers, why are the tests partitioned at all? Running the whole set of tests under a single invocation of go tool dist test would reduce the overhead considerably.
Comment From: bcmills
The logic that partitions the tests is here: https://cs.opensource.google/go/x/build/+/master:cmd/coordinator/coordinator.go;l=2655-2692;drc=db77efa219bbf2dff0b44c84d5e7ea957f55a5db
It appears to be using a hard-coded 10s cutoff, and I suspect that it is not accounting for per-test setup overhead — so the freebsd-arm64 builder probably gets better clustering because it doesn't have that overhead. (Bit of a catch-22 there. 😅)
Comment From: bcmills
As for why the tests are partitioned at all — I don't know? I would guess that the coordinator is written assuming that per-test overhead is minimal, and if we have sharded builders at all the coordinator implementation is certainly simpler using the same sharding logic for all builders.
Comment From: millerresearch
The logic that partitions the tests is here: ...
Yes, I did look at that code before posting the issue. But that doesn't show where getTestStats is getting its historical durations from. I suspect they are not very accurate in the case of plan9-arm. Can I look at them somehow?
My feeling is that the bad partitioning is a recent problem, but I don't have an archive of test log output so I can't check. How often does the database of durations change?
I would guess that the coordinator is written assuming that per-test overhead is minimal
Clearly not the case for plan 9 since go1.10, as I reported in #24300. The redundant (imho) staleness tests at the start of dist test repeatedly thrash the entire source tree with lookup + stat calls, not minimal on operating systems which don't have linux's aggressive directory and inode cacheing, even worse if the builder uses a remote file server.
Comment From: bcmills
Can I look at them somehow?
I'm not sure! (That seems like a question for @golang/release...)
Comment From: millerresearch
This issue may as well be closed, since the coordinator is no longer in use.