Proposal Details

Design document at golang/proposal/pull/58.

This proposal involves an experimental change to the garbage collector that allows it detect some types goroutine leaks (a.k.a. partial deadlocks in the community).

The approach uses memory reachability as computed by the garbage collector to single out goroutines blocked over concurrency primitives that cannot be accessed by any (transitively) runnable goroutines. It is already described at length in this publication.

The approach has been validated experimentally in several contexts. It was able to discover goroutine leaks: - caused by the misuse of synchronization primitives (channels, sync primitives, and mixed) in 69 microbenchmarks derived from real-world concurrency bugs, - in 3111 test suites within Uber, for libraries using concurrency features, with a rate of 180 to 357 syntactically distinct goroutine leaks when compared to uber-go/goleak (which is only applicable to unit tests) - in a production service within Uber, where it was able to find 3 syntactically distinct goroutine leaks, for a total of 252 goroutine leak reports over 24 hours.

This proposal follows the implementation according to the available artifact, but with the following changes: - The stacks of leaked goroutines (and other heap resources reachable only through the stacks) are not forcefully reclaimed by the GC; instead, once goroutines are flagged as leaked, they are treated as traceable by the GC, as in the runnable GC. This preserves the soundness of goroutine leak detection without changing what the GC perceives as reachable memory. - Goroutine leak detection would be triggered on-demand by extracting a new type of profile, pprof/goroutineleak; requesting it triggers a goroutine leak detection GC cycle, followed by the collecting of a goroutine profile containing only the stacks of leaked goroutines. - In the current implementation, changes are enabled via the goleakprofile flag for GOEXPERIMENT.

Comment From: seankhliao

I think this would benefit from an implementation before a proposal.

Comment From: VladSaiocUber

Implementation provided at #74622. Waiting for CLA resolution.

Comment From: randall77

Not sure where Gaby is. Related: https://github.com/golang/go/issues/18835 https://github.com/golang/go/issues/13759

Comment From: VladSaiocUber

@randall77 Thanks for the reference to #13759! Indeed, the proposed approach is very similar to what is proposed in this comment.

Comment From: gopherbot

Change https://go.dev/cl/688335 mentions this issue: Proposal #74609: goroutine leak detection by using the garbage collector

Comment From: mknyszek

To start with, I want to highlight a few aspects of this proposal that I think are great. - No false positives. If the detector finds a goroutine that will never wake, it can be 100% certain of that. - The implementation is clean and minimally invasive to the garbage collector. I really appreciate @VladSaioc's work on this. - It has real-world use and real-world data backing it. Uber has reported using this internally successfully at scale. - Leaked goroutines stay leaked, so the performance cost of the analysis isn't super important. Users of the detector can essentially choose their own overhead by deciding how often to check.

All of these things together, to me, make this proposal very exciting, and I personally think we should move forward with this in general.


Feedback

With that being said, let's quibble about some details.

For context, we have discussed this proposal to some degree in the performance and diagnostics meeting on July 17th: https://github.com/golang/go/issues/57175#issuecomment-3119533619.

Naming

Firstly, I think most of us agreed this should be called a "goroutine leak detector" or "goroutine leak finder" (I prefer the former slightly since it aligns with "race detector").

API

I think the main thing that came out of there is that we think the clearest way to expose this new functionality is as a new type of profile. That is a very clear expression of intent on the part of a Go developer that they would like to perform this analysis. It's like a subset of a goroutine profile.

(One point that came up in the July 17th discussion was possibly exposing this as part of a goroutine profile by using labels. My counter-point to this is that I think it buries the functionality a little too deep. We could make it a parameter on goroutine profiles, but that's less discoverable, and labels are less obvious in the tooling. A new kind of profile is unambiguous in the tooling and more discoverable.)

The good news is although this still requires a proposal, it doesn't need to change any APIs to start with. We just need to give it a name so that the runtime/pprof package will understand it.

Additional interfaces

That being said, I think for this new functionality to really shine, we should consider two more ways to interface with it (that could be their own proposals, or tacked on to this one). 1. net/http/pprof should have an endpoint for this so that it can be used at scale. 2. go test should possibly have a flag to make it easy to add this sort of checking during regular testing.

For (1): I think this is uncontroversial and unobjectionable. We want to help tooling perform this analysis on running services at scale, and I think the popularity of net/http/pprof suggests we should support this functionality there too as a new endpoint. We just need to pick some names, otherwise it mostly follows the example set by

For (2): Much like the race detector, being able to opt-in to it from go test is very valuable, especially for uses of Go at smaller scales. The idea is that the detector would run just before the test binary exits. The biggest question about exposing the leak detector as a go test flag is what it should do. It could either behave more like the profiling flags and generate a file, or it can behave more like the race detector and force the program to exit with a message and a non-zero exit code. From our discussion on July 17th, I think it's clear to all of us that the leak detector sits in a space in between the race detector and a profile and both might be useful. Unfortunately I do not think we can just enable it by default, because it might break too many programs (it's also not clear to me that crashing is compliant with the Go memory model).

GOEXPERIMENT

I think we should probably leave the GOEXPERIMENT in there, but only as a feature flag to disable the runtime integration in case we find there's something problematic about it.

Alternatives

Collecting dead goroutines and the memory they reference

IIRC, @VladSaioc's previous implementation would actually collect dead goroutines and memory only that goroutine would reference. @VladSaioc has reported that a slightly more invasive and slightly more complex implementation is fast enough to run on every GC cycle for many workloads, so reclaiming dead goroutines is feasible.

This is attractive but has some problems. The main concerns from Go team members came down to the fact that memory is often not the only resource leaked by a leaked goroutine (for example, network connections, files, C memory, etc.). These could be handled by finalizers/cleanups, but it's also unclear exactly whether finalizers/cleanups are really safe to run in these scenarios. As a result, we worry that reclaiming only the memory can just push back the problem, creating new, harder-to-debug problems down the line.

Given these objections, it seems safest to start with a detector rather than a reclamation mechanism. We can always return to these questions in the future if we have new information, or a new way to approach these problems. A goroutine leak detector is still very valuable to the Go ecosystem.

(As a final point here, the faster version of the analysis still requires additional GC STWs that might impact latency and make the GC less able to parallelize its work. This might be fixable, or we could work around it by only executing the analysis occasionally, for example every 100 GC cycles or something.)

Comment From: mknyszek

@VladSaioc in order to move this forward, since the change is circling the drain, could you please:

  1. Update the name of this issue to call it "goroutine leak detection" to reflect the implementation (and what it does; "deadlock" tends to be a global property of a program, and has caused some confusion when I've mentioned this proposal lately),
  2. Link to your md file in the proposal repo once it lands.

This will help get it onto the proposal committee's radar.

Comment From: mknyszek

@VladSaioc Please also add context to the original post in terms of how widely applicable this is. The extensive test suite you added to go.dev/cl/688335 which includes minimized versions of many real-world leaks is strong evidence that this is very useful, and I think should be stated as explicitly as possible to help move this forward.

I also filed #75280 as a mini-proposal to get the functionality landed behind a GOEXPERIMENT.

Comment From: gopherbot

Change https://go.dev/cl/689555 mentions this issue: design: add design/74609-goroutine-leak-detection-gc.md

Comment From: VladSaiocUber

@mknyszek I added more context about the kinds of experiments we performed to validate the efficacy of the approach.