#!watchflakes
default <- pkg == "internal/trace" && test ~ `TestTrace.*/Stress` && log ~ `signal: killed`
Issue created automatically to collect these failures.
Example (log):
=== RUN TestTraceStressStartStop/Stress
exec.go:213: test timed out while running command: /home/swarming/.swarming/w/ir/x/w/goroot/bin/go run testdata/testprog/stress-start-stop.go
trace_test.go:615: signal: killed
--- FAIL: TestTraceStressStartStop/Stress (806.78s)
Comment From: gopherbot
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "internal/trace" && test == "TestTraceStressStartStop/Stress"
2025-03-07 19:32 gotip-linux-amd64-longtest-noswissmap go@b4a333fe internal/trace.TestTraceStressStartStop/Stress (log)
=== RUN TestTraceStressStartStop/Stress
exec.go:213: test timed out while running command: /home/swarming/.swarming/w/ir/x/w/goroot/bin/go run testdata/testprog/stress-start-stop.go
trace_test.go:615: signal: killed
--- FAIL: TestTraceStressStartStop/Stress (806.78s)
Comment From: gopherbot
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "internal/trace" && test == "TestTraceStressStartStop/Stress"
2025-05-02 17:28 gotip-darwin-arm64-longtest go@1b40dbce internal/trace.TestTraceStressStartStop/Stress (log)
=== RUN TestTraceStressStartStop/Stress
exec.go:213: test timed out while running command: /Users/swarming/.swarming/w/ir/x/w/goroot/bin/go run testdata/testprog/stress-start-stop.go
trace_test.go:615: signal: killed
--- FAIL: TestTraceStressStartStop/Stress (2550.43s)
Comment From: gopherbot
Sorry, but there were parse errors in the watch flakes script. The script I found was:
#!watchflakes
default <- pkg == "internal/trace" && test ~ `Stress` && log ~ "signal: killed"
And the problems were:
script:2.64: ~ requires backquoted regexp
See https://go.dev/wiki/Watchflakes for details.
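For reference, the error points at the double-quoted pattern in the log clause (line 2, column 64): the ~ operator only accepts a backquoted regexp. The same rule written with backquotes, matching the header of this issue, parses cleanly:
#!watchflakes
default <- pkg == "internal/trace" && test ~ `Stress` && log ~ `signal: killed`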
Comment From: mknyszek
Looking more closely, I don't think this is a timeout. I think this is a deadlock. I thought at first it might be that all the stress tests take a long time, but that's not true. When I look at the failures, the hung test is typically the ONLY long-running test.
Comment From: mknyszek
I was able to reproduce it locally!
Comment From: mknyszek
I am very close to an answer. It's clear that traceAdvance is trying to suspend a goroutine, and the other one is running. I'm bending the trace testing framework until it finally gives me a stack trace for that running goroutine.
Comment From: mknyszek
Oh no... I suspected it might be asynchronous preemption preventing me from getting a stack trace from the running goroutine, and when I set GODEBUG=asyncpreemptoff=1, the tests run 50% faster and I haven't gotten them to fail yet.
Comment From: mknyszek
61573 runs of each stress test later and I can't get it to hang with GODEBUG=asyncpreemptoff=1. This is concerning.
Comment From: gopherbot
Change https://go.dev/cl/680977 mentions this issue: internal/trace: end test programs with SIGQUIT
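For context on why SIGQUIT helps here: a Go program that receives SIGQUIT exits with a dump of all goroutine stacks, whereas the SIGKILL behind the "signal: killed" lines above leaves no record of what the child was doing. A hypothetical sketch of a harness doing this on timeout follows; the helper name, binary path, and timeout are assumptions for illustration, not the actual CL:
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// runWithQuitOnTimeout starts cmd and, if it is still running after d,
// sends SIGQUIT so the Go child dumps all goroutine stacks before dying,
// instead of being SIGKILLed silently.
func runWithQuitOnTimeout(cmd *exec.Cmd, d time.Duration) error {
	cmd.Stderr = os.Stderr // the stack dump arrives on the child's stderr
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		_ = cmd.Process.Signal(syscall.SIGQUIT) // child prints goroutine stacks and exits
		return <-done
	}
}

func main() {
	// Hypothetical pre-built test program; not the path the real harness uses.
	err := runWithQuitOnTimeout(exec.Command("./stress-start-stop"), 10*time.Minute)
	fmt.Println("child finished:", err)
}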
Comment From: mknyszek
Actually, I think the GODEBUG=asyncpreemptoff=1 thing might just be an error on my part.
Comment From: mknyszek
Figured it out. Because the tracer can suspendG, the stopTheWorldWithSema calls used for GC stop-the-worlds needed to be updated, and I forgot to do that. This creates the potential for a mutual deadlock between the tracer and a goroutine trying to start or end the mark phase.
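To make the failure shape concrete, here is a minimal sketch of that kind of mutual deadlock: two goroutines each hold one "stop" handshake and wait for the other (the classic AB-BA circular wait). The names and plain mutexes are illustrative assumptions, not the runtime's actual suspendG or stopTheWorldWithSema code:
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	traceState sync.Mutex // stand-in for the tracer's suspend handshake
	stwState   sync.Mutex // stand-in for stop-the-world ownership
)

// tracerAdvance mimics the tracer: it takes its own state, then needs the
// GC/mark goroutine to stop before it can finish.
func tracerAdvance(wg *sync.WaitGroup) {
	defer wg.Done()
	traceState.Lock()
	defer traceState.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	stwState.Lock()                   // blocks forever: the other side holds it
	defer stwState.Unlock()
	fmt.Println("tracer: advanced the trace")
}

// gcStartMark mimics a goroutine starting or ending the mark phase: it takes
// stop-the-world ownership, then needs the tracer to yield.
func gcStartMark(wg *sync.WaitGroup) {
	defer wg.Done()
	stwState.Lock()
	defer stwState.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	traceState.Lock()                 // blocks forever: the other side holds it
	defer traceState.Unlock()
	fmt.Println("gc: started the mark phase")
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go tracerAdvance(&wg)
	go gcStartMark(&wg)
	wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}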
Comment From: gopherbot
Change https://go.dev/cl/681501 mentions this issue: runtime: prevent mutual deadlock between GC stopTheWorld and suspendG
Comment From: mknyszek
@gopherbot Please open backport issues for Go 1.23 and Go 1.24.
When running with tracing enabled, this can cause a random deadlock with no workaround.
Comment From: gopherbot
Backport issue(s) opened: #74293 (for 1.23), #74294 (for 1.24).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Comment From: gopherbot
Change https://go.dev/cl/684078 mentions this issue: [release-branch.go1.24] runtime: prevent mutual deadlock between GC stopTheWorld and suspendG
Comment From: gopherbot
Change https://go.dev/cl/684095 mentions this issue: [release-branch.go1.23] runtime: prevent mutual deadlock between GC stopTheWorld and suspendG