Golang crypto/tls: races detected in BenchmarkCertCache, BenchmarkHandshakeServer, BenchmarkLatency and BenchmarkThroughput

The macOS race builder is reporting data races found by briefly running benchmarks. For example:

 ==================
 WARNING: DATA RACE
 Write at 0x00c000152a49 by goroutine 40:
   testing.(*common).Fail()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:955 +0x104
   testing.(*common).Fail()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:947 +0x79
   testing.(*common).Errorf()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:1076 +0xae
   crypto/tls.benchmarkHandshakeServer.func2()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1346 +0x3ee
 Previous read at 0x00c000152a49 by goroutine 21:
   testing.(*B).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:686 +0xa8d
   crypto/tls.BenchmarkHandshakeServer()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1371 +0x64
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.(*B).run1.func1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:215 +0x7b
 Goroutine 40 (running) created at:
   crypto/tls.benchmarkHandshakeServer()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1336 +0xa36
   crypto/tls.BenchmarkHandshakeServer.func2.2()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_server_test.go:1377 +0x9a
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.(*B).launch()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:316 +0x385
   testing.(*B).doBench.gowrap1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:266 +0x33
 Goroutine 21 (running) created at:
   testing.(*B).run1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:208 +0x194
   testing.(*B).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:682 +0x9d4
   testing.runBenchmarks.func1()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:543 +0x85
   testing.(*B).runN()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:193 +0x19c
   testing.runBenchmarks()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/benchmark.go:552 +0xbb6
   testing.(*M).Run()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/testing/testing.go:2050 +0x152c
   crypto/tls.runMain()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_test.go:400 +0xde4
   crypto/tls.TestMain()
       /Users/swarming/.swarming/w/ir/x/w/goroot/src/crypto/tls/handshake_test.go:349 +0x10e
   main.main()
       _testmain.go:423 +0x171
 ==================
     handshake_server_test.go:1349: #4: mismatch on read: got:00000000000000000000000000000000000000000000000000000000000000000000000000000000000000 want:1403030001011603030020d6984a2d5e5b03e16c2ff4634b31c8975d4e142adba6a63188219c02e4e81183
     testing.go:1399: race detected during execution of test

From https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8745300644568482689/+/u/step/11/log/2#L333_13. It's happening quite frequently:

Since GOOS=darwin is a first class port, we can't release Go 1.23 without addressing this.

CC @golang/security.

Comment From: gabyhelp

Similar Issues

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

Comment From: FiloSottile

To a first investigation, I think this is a crypto/tls benchmark failure flake that however is tickling a bug in testing's sub-sub-benchmarks package causing the data race. (Note how e.g. BenchmarkHandshakeServer has two levels of nesting.)

testing.B.Errorf is documented to be safe for concurrent use

A benchmark ends when its Benchmark function returns or calls any of the methods FailNow, Fatal, Fatalf, SkipNow, Skip, or Skipf. Those methods must be called only from the goroutine running the Benchmark function. The other reporting methods, such as the variations of Log and Error, may be called simultaneously from multiple goroutines.

and indeed the write to b.parent.failed in the race is done by b.parent.Fail ((*common).Fail()) under b.parent.mu. (Here b is the sub-sub-benchmark and b.parent is a sub-benchmark itself.)

However, the read in the race is done by the top-level benchmark's b.parent.parent.Run ((*B).Run()) on sub.failed (aka b.parent.failed) without holding b.parent.mu.

Not sure if the two levels of nesting are necessary to trigger this, to be honest, or if the correct behavior would be for Errorf to panic with Fail in goroutine after $NAME has completed instead, but the data race is not right.

Comment From: gopherbot

Change https://go.dev/cl/594255 mentions this issue: crypto/tls: replay test recordings without network

Comment From: joedian

@rolandshoemaker is any more updates planned for this?

Comment From: rolandshoemaker

I believe @FiloSottile is continuing to work on this.

Comment From: FiloSottile

I am working on the crypto/tls side, but https://github.com/golang/go/issues/67979#issuecomment-2169244912 is a little out of my depth in terms of the testing package. I'll open a separate issue for it.

Comment From: thanm

Any update on this issue? Discussed during the weekly release checkin meeting. Is this a 1.23 regression, or a more longstanding problem?

Comment From: aclements

testing.B.Errorf is supposed to be safe for concurrent use, but it's not safe to call after the benchmark has ended, which I believe is what's happening here.

The race is between a read here, when the subbenchmark ends, and a write here where Errorf is called from a separate goroutine. If I'm reading the test right, nothing is waiting for the goroutine created by benchmarkHandshakeServer to exit before allowing the benchmark to finish. Hence, the benchmark can finish and then the goroutine can call Errorf on the exited benchmark. I think this just needs a WaitGroup wait after the close(feeder).

Comment From: aclements

I see as of CL 594255 the feeder goroutine is no longer there. And the Darwin builder looks happy since that landed. Thus, I think this is fixed and we can close the issue. Is there anything I'm missing?

Comment From: FiloSottile

testing.B.Errorf is supposed to be safe for concurrent use, but it's not safe to call after the benchmark has ended, [...]

Fair enough. I feel like a Fail in goroutine after $NAME has completed panic is still friendlier than a data race, but maybe it's not worth the complexity.

I see as of CL 594255 the feeder goroutine is no longer there. And the Darwin builder looks happy since that landed.

Huh, I thought the other benchmarks were flaking too but now that I went to rerun them it doesn't look like they are. I wonder what I had wrong. Great, closing.