Golang x/build/cmd/coordinator: bad test partitioning for plan9-arm

Currently a lot of short tests are being executed individually in separate invocations of go tool dist test, instead of being grouped together. This matters for overall runtime, because on the raspberry pi builders each invocation of go tool dist test adds an overhead of >30 seconds, even for a test which takes a small fraction of a second. See for example this excerpt from an execution log on plan9-arm: 2021-10-25T17:15:53Z run_tests_multi pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu] 2021-10-25T17:16:23Z still_waiting_on_test go_test:internal/abi 2021-10-25T17:16:53Z still_waiting_on_test go_test:internal/abi 2021-10-25T17:17:18Z finish_run_tests_multi after 1m24.6s; pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu] 2021-10-25T17:17:18Z run_test:go_test:internal/execabs pi4g 2021-10-25T17:17:48Z still_waiting_on_test go_test:internal/execabs 2021-10-25T17:17:53Z finish_run_test:go_test:internal/execabs after 35.3s; pi4g 2021-10-25T17:17:53Z run_test:go_test:internal/fmtsort pi4g 2021-10-25T17:18:23Z still_waiting_on_test go_test:internal/fmtsort 2021-10-25T17:18:37Z finish_run_test:go_test:internal/fmtsort after 43.5s; pi4g 2021-10-25T17:18:37Z run_test:go_test:internal/fuzz pi4g 2021-10-25T17:19:07Z still_waiting_on_test go_test:internal/fuzz 2021-10-25T17:19:13Z finish_run_test:go_test:internal/fuzz after 36.3s; pi4g 2021-10-25T17:19:13Z run_test:go_test:internal/itoa pi4g 2021-10-25T17:19:43Z still_waiting_on_test go_test:internal/itoa 2021-10-25T17:19:48Z finish_run_test:go_test:internal/itoa after 34.6s; pi4g 2021-10-25T17:19:48Z run_test:go_test:internal/poll pi4g 2021-10-25T17:20:18Z still_waiting_on_test go_test:internal/poll 2021-10-25T17:20:35Z finish_run_test:go_test:internal/poll after 47.1s; pi4g 2021-10-25T17:20:35Z run_test:go_test:internal/profile pi4g 2021-10-25T17:21:05Z still_waiting_on_test go_test:internal/profile 2021-10-25T17:21:11Z finish_run_test:go_test:internal/profile after 36.1s; pi4g 2021-10-25T17:21:11Z run_test:go_test:internal/reflectlite pi4g 2021-10-25T17:21:41Z still_waiting_on_test go_test:internal/reflectlite 2021-10-25T17:22:04Z finish_run_test:go_test:internal/reflectlite after 53s; pi4g 2021-10-25T17:22:04Z run_test:go_test:internal/singleflight pi4g 2021-10-25T17:22:34Z still_waiting_on_test go_test:internal/singleflight 2021-10-25T17:22:38Z finish_run_test:go_test:internal/singleflight after 34s; pi4g 2021-10-25T17:22:38Z run_test:go_test:internal/trace pi4g 2021-10-25T17:23:08Z still_waiting_on_test go_test:internal/trace 2021-10-25T17:23:14Z finish_run_test:go_test:internal/trace after 35.5s; pi4g 2021-10-25T17:23:14Z run_test:go_test:internal/unsafeheader pi4g 2021-10-25T17:23:44Z still_waiting_on_test go_test:internal/unsafeheader 2021-10-25T17:23:48Z finish_run_test:go_test:internal/unsafeheader after 34.8s; pi4g 2021-10-25T17:23:48Z run_test:go_test:internal/xcoff pi4g 2021-10-25T17:24:18Z still_waiting_on_test go_test:internal/xcoff 2021-10-25T17:24:24Z finish_run_test:go_test:internal/xcoff after 35.4s; pi4g 2021-10-25T17:24:24Z run_test:go_test:io pi4g 2021-10-25T17:24:54Z still_waiting_on_test go_test:io 2021-10-25T17:25:11Z finish_run_test:go_test:io after 47.4s; pi4g 2021-10-25T17:25:11Z run_test:go_test:io/fs pi4g 2021-10-25T17:25:41Z still_waiting_on_test go_test:io/fs 2021-10-25T17:25:46Z finish_run_test:go_test:io/fs after 34.9s; pi4g 2021-10-25T17:25:46Z run_test:go_test:io/ioutil pi4g 2021-10-25T17:26:16Z still_waiting_on_test go_test:io/ioutil 2021-10-25T17:26:39Z finish_run_test:go_test:io/ioutil after 52.4s; pi4g 2021-10-25T17:26:39Z run_test:go_test:log pi4g 2021-10-25T17:27:09Z still_waiting_on_test go_test:log 2021-10-25T17:27:15Z finish_run_test:go_test:log after 35.9s; pi4g 2021-10-25T17:27:15Z run_test:go_test:math pi4g 2021-10-25T17:27:45Z still_waiting_on_test go_test:math 2021-10-25T17:28:04Z finish_run_test:go_test:math after 49.6s; pi4g 2021-10-25T17:28:04Z run_test:go_test:math/big pi4g 2021-10-25T17:28:34Z still_waiting_on_test go_test:math/big 2021-10-25T17:28:51Z finish_run_test:go_test:math/big after 46.5s; pi4g 2021-10-25T17:28:51Z run_test:go_test:math/bits pi4g 2021-10-25T17:29:21Z still_waiting_on_test go_test:math/bits 2021-10-25T17:29:42Z finish_run_test:go_test:math/bits after 50.9s; pi4g 2021-10-25T17:29:42Z run_test:go_test:math/cmplx pi4g 2021-10-25T17:30:12Z still_waiting_on_test go_test:math/cmplx 2021-10-25T17:30:19Z finish_run_test:go_test:math/cmplx after 36.7s; pi4g 2021-10-25T17:30:19Z run_test:go_test:math/rand pi4g 2021-10-25T17:30:49Z still_waiting_on_test go_test:math/rand 2021-10-25T17:30:58Z finish_run_test:go_test:math/rand after 39.4s; pi4g 2021-10-25T17:30:58Z run_test:go_test:mime pi4g 2021-10-25T17:31:28Z still_waiting_on_test go_test:mime 2021-10-25T17:31:35Z finish_run_test:go_test:mime after 36.9s; pi4g 2021-10-25T17:31:35Z run_test:go_test:mime/multipart pi4g 2021-10-25T17:32:05Z still_waiting_on_test go_test:mime/multipart 2021-10-25T17:32:13Z finish_run_test:go_test:mime/multipart after 37.8s; pi4g 2021-10-25T17:32:13Z run_test:go_test:mime/quotedprintable pi4g 2021-10-25T17:32:43Z still_waiting_on_test go_test:mime/quotedprintable 2021-10-25T17:32:49Z finish_run_test:go_test:mime/quotedprintable after 36.1s; pi4g 2021-10-25T17:32:49Z run_test:go_test:net pi4g 2021-10-25T17:33:19Z still_waiting_on_test go_test:net 2021-10-25T17:33:49Z still_waiting_on_test go_test:net 2021-10-25T17:34:12Z finish_run_test:go_test:net after 1m23.2s; pi4g 2021-10-25T17:34:12Z run_test:go_test:net/http pi4g 2021-10-25T17:34:42Z still_waiting_on_test go_test:net/http 2021-10-25T17:35:12Z still_waiting_on_test go_test:net/http 2021-10-25T17:35:42Z still_waiting_on_test go_test:net/http 2021-10-25T17:36:12Z still_waiting_on_test go_test:net/http 2021-10-25T17:36:20Z finish_run_test:go_test:net/http after 2m8.1s; pi4g 2021-10-25T17:36:20Z run_tests_multi pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi] 2021-10-25T17:36:50Z still_waiting_on_test go_test:net/http/cgi 2021-10-25T17:37:02Z finish_run_tests_multi after 41s; pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi] compared to the same subsequence of tests run on another slow platform, freebsd-arm64: 2021-10-25T19:14:16Z run_tests_multi aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu] 2021-10-25T19:14:46Z still_waiting_on_test go_test:internal/abi 2021-10-25T19:14:51Z finish_run_tests_multi after 34.4s; aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu] 2021-10-25T19:14:51Z run_tests_multi aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz] 2021-10-25T19:14:59Z finish_run_tests_multi after 8.31s; aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz] 2021-10-25T19:14:59Z run_tests_multi aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile] 2021-10-25T19:15:11Z finish_run_tests_multi after 12.3s; aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile] 2021-10-25T19:15:11Z run_tests_multi aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace] 2021-10-25T19:15:27Z finish_run_tests_multi after 15.4s; aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace] 2021-10-25T19:15:27Z run_tests_multi aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io] 2021-10-25T19:15:38Z finish_run_tests_multi after 11.7s; aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io] 2021-10-25T19:15:38Z run_tests_multi aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log] 2021-10-25T19:15:42Z finish_run_tests_multi after 3.83s; aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log] 2021-10-25T19:15:42Z run_tests_multi aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big] 2021-10-25T19:15:59Z finish_run_tests_multi after 16.6s; aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big] 2021-10-25T19:15:59Z run_tests_multi aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand] 2021-10-25T19:16:12Z finish_run_tests_multi after 13.3s; aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand] 2021-10-25T19:16:12Z run_tests_multi aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable] 2021-10-25T19:16:17Z finish_run_tests_multi after 4.66s; aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable] 2021-10-25T19:16:17Z run_tests_multi aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi] 2021-10-25T19:16:47Z still_waiting_on_test go_test:net 2021-10-25T19:16:56Z finish_run_tests_multi after 39.2s; aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi] Naive question: on single builders without helpers, why are the tests partitioned at all? Running the whole set of tests under a single invocation of go tool dist test would reduce the overhead considerably.

Comment From: bcmills

The logic that partitions the tests is here: https://cs.opensource.google/go/x/build/+/master:cmd/coordinator/coordinator.go;l=2655-2692;drc=db77efa219bbf2dff0b44c84d5e7ea957f55a5db

It appears to be using a hard-coded 10s cutoff, and I suspect that it is not accounting for per-test setup overhead — so the freebsd-arm64 builder probably gets better clustering because it doesn't have that overhead. (Bit of a catch-22 there. 😅)

Comment From: bcmills

As for why the tests are partitioned at all — I don't know? I would guess that the coordinator is written assuming that per-test overhead is minimal, and if we have sharded builders at all the coordinator implementation is certainly simpler using the same sharding logic for all builders.

Comment From: millerresearch

The logic that partitions the tests is here: ...

Yes, I did look at that code before posting the issue. But that doesn't show where getTestStats is getting its historical durations from. I suspect they are not very accurate in the case of plan9-arm. Can I look at them somehow?

My feeling is that the bad partitioning is a recent problem, but I don't have an archive of test log output so I can't check. How often does the database of durations change?

I would guess that the coordinator is written assuming that per-test overhead is minimal

Clearly not the case for plan 9 since go1.10, as I reported in #24300. The redundant (imho) staleness tests at the start of dist test repeatedly thrash the entire source tree with lookup + stat calls, not minimal on operating systems which don't have linux's aggressive directory and inode cacheing, even worse if the builder uses a remote file server.