Currently a lot of short tests are being executed individually in separate invocations of go tool dist test
, instead of being grouped together. This matters for overall runtime, because on the raspberry pi builders each invocation of go tool dist test
adds an overhead of >30 seconds, even for a test which takes a small fraction of a second.
See for example this excerpt from an execution log on plan9-arm:
2021-10-25T17:15:53Z run_tests_multi pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T17:16:23Z still_waiting_on_test go_test:internal/abi
2021-10-25T17:16:53Z still_waiting_on_test go_test:internal/abi
2021-10-25T17:17:18Z finish_run_tests_multi after 1m24.6s; pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T17:17:18Z run_test:go_test:internal/execabs pi4g
2021-10-25T17:17:48Z still_waiting_on_test go_test:internal/execabs
2021-10-25T17:17:53Z finish_run_test:go_test:internal/execabs after 35.3s; pi4g
2021-10-25T17:17:53Z run_test:go_test:internal/fmtsort pi4g
2021-10-25T17:18:23Z still_waiting_on_test go_test:internal/fmtsort
2021-10-25T17:18:37Z finish_run_test:go_test:internal/fmtsort after 43.5s; pi4g
2021-10-25T17:18:37Z run_test:go_test:internal/fuzz pi4g
2021-10-25T17:19:07Z still_waiting_on_test go_test:internal/fuzz
2021-10-25T17:19:13Z finish_run_test:go_test:internal/fuzz after 36.3s; pi4g
2021-10-25T17:19:13Z run_test:go_test:internal/itoa pi4g
2021-10-25T17:19:43Z still_waiting_on_test go_test:internal/itoa
2021-10-25T17:19:48Z finish_run_test:go_test:internal/itoa after 34.6s; pi4g
2021-10-25T17:19:48Z run_test:go_test:internal/poll pi4g
2021-10-25T17:20:18Z still_waiting_on_test go_test:internal/poll
2021-10-25T17:20:35Z finish_run_test:go_test:internal/poll after 47.1s; pi4g
2021-10-25T17:20:35Z run_test:go_test:internal/profile pi4g
2021-10-25T17:21:05Z still_waiting_on_test go_test:internal/profile
2021-10-25T17:21:11Z finish_run_test:go_test:internal/profile after 36.1s; pi4g
2021-10-25T17:21:11Z run_test:go_test:internal/reflectlite pi4g
2021-10-25T17:21:41Z still_waiting_on_test go_test:internal/reflectlite
2021-10-25T17:22:04Z finish_run_test:go_test:internal/reflectlite after 53s; pi4g
2021-10-25T17:22:04Z run_test:go_test:internal/singleflight pi4g
2021-10-25T17:22:34Z still_waiting_on_test go_test:internal/singleflight
2021-10-25T17:22:38Z finish_run_test:go_test:internal/singleflight after 34s; pi4g
2021-10-25T17:22:38Z run_test:go_test:internal/trace pi4g
2021-10-25T17:23:08Z still_waiting_on_test go_test:internal/trace
2021-10-25T17:23:14Z finish_run_test:go_test:internal/trace after 35.5s; pi4g
2021-10-25T17:23:14Z run_test:go_test:internal/unsafeheader pi4g
2021-10-25T17:23:44Z still_waiting_on_test go_test:internal/unsafeheader
2021-10-25T17:23:48Z finish_run_test:go_test:internal/unsafeheader after 34.8s; pi4g
2021-10-25T17:23:48Z run_test:go_test:internal/xcoff pi4g
2021-10-25T17:24:18Z still_waiting_on_test go_test:internal/xcoff
2021-10-25T17:24:24Z finish_run_test:go_test:internal/xcoff after 35.4s; pi4g
2021-10-25T17:24:24Z run_test:go_test:io pi4g
2021-10-25T17:24:54Z still_waiting_on_test go_test:io
2021-10-25T17:25:11Z finish_run_test:go_test:io after 47.4s; pi4g
2021-10-25T17:25:11Z run_test:go_test:io/fs pi4g
2021-10-25T17:25:41Z still_waiting_on_test go_test:io/fs
2021-10-25T17:25:46Z finish_run_test:go_test:io/fs after 34.9s; pi4g
2021-10-25T17:25:46Z run_test:go_test:io/ioutil pi4g
2021-10-25T17:26:16Z still_waiting_on_test go_test:io/ioutil
2021-10-25T17:26:39Z finish_run_test:go_test:io/ioutil after 52.4s; pi4g
2021-10-25T17:26:39Z run_test:go_test:log pi4g
2021-10-25T17:27:09Z still_waiting_on_test go_test:log
2021-10-25T17:27:15Z finish_run_test:go_test:log after 35.9s; pi4g
2021-10-25T17:27:15Z run_test:go_test:math pi4g
2021-10-25T17:27:45Z still_waiting_on_test go_test:math
2021-10-25T17:28:04Z finish_run_test:go_test:math after 49.6s; pi4g
2021-10-25T17:28:04Z run_test:go_test:math/big pi4g
2021-10-25T17:28:34Z still_waiting_on_test go_test:math/big
2021-10-25T17:28:51Z finish_run_test:go_test:math/big after 46.5s; pi4g
2021-10-25T17:28:51Z run_test:go_test:math/bits pi4g
2021-10-25T17:29:21Z still_waiting_on_test go_test:math/bits
2021-10-25T17:29:42Z finish_run_test:go_test:math/bits after 50.9s; pi4g
2021-10-25T17:29:42Z run_test:go_test:math/cmplx pi4g
2021-10-25T17:30:12Z still_waiting_on_test go_test:math/cmplx
2021-10-25T17:30:19Z finish_run_test:go_test:math/cmplx after 36.7s; pi4g
2021-10-25T17:30:19Z run_test:go_test:math/rand pi4g
2021-10-25T17:30:49Z still_waiting_on_test go_test:math/rand
2021-10-25T17:30:58Z finish_run_test:go_test:math/rand after 39.4s; pi4g
2021-10-25T17:30:58Z run_test:go_test:mime pi4g
2021-10-25T17:31:28Z still_waiting_on_test go_test:mime
2021-10-25T17:31:35Z finish_run_test:go_test:mime after 36.9s; pi4g
2021-10-25T17:31:35Z run_test:go_test:mime/multipart pi4g
2021-10-25T17:32:05Z still_waiting_on_test go_test:mime/multipart
2021-10-25T17:32:13Z finish_run_test:go_test:mime/multipart after 37.8s; pi4g
2021-10-25T17:32:13Z run_test:go_test:mime/quotedprintable pi4g
2021-10-25T17:32:43Z still_waiting_on_test go_test:mime/quotedprintable
2021-10-25T17:32:49Z finish_run_test:go_test:mime/quotedprintable after 36.1s; pi4g
2021-10-25T17:32:49Z run_test:go_test:net pi4g
2021-10-25T17:33:19Z still_waiting_on_test go_test:net
2021-10-25T17:33:49Z still_waiting_on_test go_test:net
2021-10-25T17:34:12Z finish_run_test:go_test:net after 1m23.2s; pi4g
2021-10-25T17:34:12Z run_test:go_test:net/http pi4g
2021-10-25T17:34:42Z still_waiting_on_test go_test:net/http
2021-10-25T17:35:12Z still_waiting_on_test go_test:net/http
2021-10-25T17:35:42Z still_waiting_on_test go_test:net/http
2021-10-25T17:36:12Z still_waiting_on_test go_test:net/http
2021-10-25T17:36:20Z finish_run_test:go_test:net/http after 2m8.1s; pi4g
2021-10-25T17:36:20Z run_tests_multi pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]
2021-10-25T17:36:50Z still_waiting_on_test go_test:net/http/cgi
2021-10-25T17:37:02Z finish_run_tests_multi after 41s; pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]
compared to the same subsequence of tests run on another slow platform, freebsd-arm64:
2021-10-25T19:14:16Z run_tests_multi aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T19:14:46Z still_waiting_on_test go_test:internal/abi
2021-10-25T19:14:51Z finish_run_tests_multi after 34.4s; aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
2021-10-25T19:14:51Z run_tests_multi aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
2021-10-25T19:14:59Z finish_run_tests_multi after 8.31s; aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
2021-10-25T19:14:59Z run_tests_multi aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
2021-10-25T19:15:11Z finish_run_tests_multi after 12.3s; aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
2021-10-25T19:15:11Z run_tests_multi aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
2021-10-25T19:15:27Z finish_run_tests_multi after 15.4s; aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
2021-10-25T19:15:27Z run_tests_multi aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
2021-10-25T19:15:38Z finish_run_tests_multi after 11.7s; aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
2021-10-25T19:15:38Z run_tests_multi aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
2021-10-25T19:15:42Z finish_run_tests_multi after 3.83s; aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
2021-10-25T19:15:42Z run_tests_multi aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
2021-10-25T19:15:59Z finish_run_tests_multi after 16.6s; aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
2021-10-25T19:15:59Z run_tests_multi aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
2021-10-25T19:16:12Z finish_run_tests_multi after 13.3s; aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
2021-10-25T19:16:12Z run_tests_multi aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
2021-10-25T19:16:17Z finish_run_tests_multi after 4.66s; aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
2021-10-25T19:16:17Z run_tests_multi aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]
2021-10-25T19:16:47Z still_waiting_on_test go_test:net
2021-10-25T19:16:56Z finish_run_tests_multi after 39.2s; aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]
Naive question: on single builders without helpers, why are the tests partitioned at all? Running the whole set of tests under a single invocation of go tool dist test
would reduce the overhead considerably.
Comment From: bcmills
The logic that partitions the tests is here: https://cs.opensource.google/go/x/build/+/master:cmd/coordinator/coordinator.go;l=2655-2692;drc=db77efa219bbf2dff0b44c84d5e7ea957f55a5db
It appears to be using a hard-coded 10s cutoff, and I suspect that it is not accounting for per-test setup overhead — so the freebsd-arm64 builder probably gets better clustering because it doesn't have that overhead. (Bit of a catch-22 there. 😅)
Comment From: bcmills
As for why the tests are partitioned at all — I don't know? I would guess that the coordinator is written assuming that per-test overhead is minimal, and if we have sharded builders at all the coordinator implementation is certainly simpler using the same sharding logic for all builders.
Comment From: millerresearch
The logic that partitions the tests is here: ...
Yes, I did look at that code before posting the issue. But that doesn't show where getTestStats
is getting its historical durations from. I suspect they are not very accurate in the case of plan9-arm. Can I look at them somehow?
My feeling is that the bad partitioning is a recent problem, but I don't have an archive of test log output so I can't check. How often does the database of durations change?
I would guess that the coordinator is written assuming that per-test overhead is minimal
Clearly not the case for plan 9 since go1.10, as I reported in #24300. The redundant (imho) staleness tests at the start of dist test
repeatedly thrash the entire source tree with lookup + stat calls, not minimal on operating systems which don't have linux's aggressive directory and inode cacheing, even worse if the builder uses a remote file server.
Comment From: bcmills
Can I look at them somehow?
I'm not sure! (That seems like a question for @golang/release...)
Comment From: millerresearch
This issue may as well be closed, since the coordinator is no longer in use.