Inspecting individual goroutines and their states is a critical debugging tool in some situations, but the STW pauses of debug=2 / runtime.Stack(all=true) in processes with larger numbers of goroutines can be disruptive.

This proposal recommends reducing the latency impact of this functionality by adapting the low-pause, concurrent snapshot mechanism already used in pprof.Lookup("goroutine") (introduced in CL 387415) to support per-goroutine output — either by modifying debug=2 in place, or, more likely, by introducing a new debug=3 mode with the same output format but reduced STW impact while leaving debug=2 unchanged.

Background

The /debug/pprof/goroutine endpoint supports two main human-readable formats:

  • debug=1: A collapsed textual profile consisting of counts of goroutines with matching top function frame and label values, followed by a representative stack trace. It does not show individual goroutines or attributes such as their state or wait times. Aggregation by top frame means that not all goroutine stack traces are represented.

  • debug=2: A full, panic-style dump of the complete stack of all goroutines individually, with additional metadata useful for debugging, including status, scheduling state, and stack creation information (but not pprof labels).
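For readers who want to compare the two formats outside of the HTTP endpoint, here is a minimal sketch that renders both through the existing pprof.Lookup("goroutine").WriteTo API. The helper name goroutineDump is only illustrative and is not part of any package.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"sync"
)

// goroutineDump (hypothetical helper) renders the goroutine profile at the
// given debug level: 1 for the aggregated text form, 2 for the full
// per-goroutine, panic-style dump.
func goroutineDump(debug int) string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, debug); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	// Start a few goroutines that block so the dumps have something to show.
	block := make(chan struct{})
	var started sync.WaitGroup
	for i := 0; i < 3; i++ {
		started.Add(1)
		go func() {
			started.Done()
			<-block
		}()
	}
	started.Wait()

	fmt.Println("---- debug=1 (aggregated) ----")
	fmt.Println(goroutineDump(1))
	fmt.Println("---- debug=2 (per goroutine) ----")
	fmt.Println(goroutineDump(2))
	close(block)
}
```

The three blocked goroutines should collapse into one entry with a count in the debug=1 output, while debug=2 lists each of them separately along with its state.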

CL 387415 introduced a new mechanism for low-latency goroutine stack profiling: it uses a brief STW only to enable profiling, and the collection itself then runs concurrently once the world has been restarted. This approach is now used internally by pprof.Lookup("goroutine") and is significantly less disruptive than the full stop-the-world scan.

However, debug=2 continues to use the previous STW approach (runtime.Stack(all=true)), which scales poorly in systems with high goroutine counts. In real-world systems with tens or hundreds of thousands of goroutines, this can result in STW pauses of hundreds of milliseconds, making debug=2 disruptive to use in production debugging workflows.
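A rough way to observe this scaling locally is to park a number of goroutines and time a single runtime.Stack(all=true) call. This is only a sketch: the helper name timeFullDump and the buffer size are assumptions, and the wall-clock duration of the call is just a proxy for the actual STW pause, which is not measured directly here.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// timeFullDump (hypothetical helper) parks n goroutines, then measures the
// wall-clock time of one runtime.Stack(all=true) call. It returns the
// elapsed time and the number of bytes written into the buffer.
func timeFullDump(n int) (time.Duration, int) {
	stop := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-stop // block until the measurement is finished
		}()
	}
	started.Wait()

	// Allocate the buffer before timing. runtime.Stack truncates its output
	// if the buffer is too small, so it is sized generously here.
	buf := make([]byte, 64<<20)
	begin := time.Now()
	written := runtime.Stack(buf, true)
	elapsed := time.Since(begin)

	close(stop)
	done.Wait()
	return elapsed, written
}

func main() {
	for _, n := range []int{1000, 10000, 100000} {
		elapsed, written := timeFullDump(n)
		fmt.Printf("goroutines=%d bytes=%d elapsed=%v\n", n, written, elapsed)
	}
}
```

The absolute numbers depend heavily on the machine and on stack depth, so they are best read as a trend across goroutine counts rather than as a benchmark.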

Rationale

  • Improves production safety: Enables incident debugging using /debug/pprof/goroutine?debug=2, with its richer per-goroutine output, without introducing outsized latency spikes that could worsen slow requests or throughput issues.
  • Reuses proven mechanism: Leverages the design already adopted for debug=1, avoiding the need for new runtime infrastructure.
  • Preserves existing format: Avoids breaking any tools that rely on the output style of debug=2.

Implementation Sketch

The existing runtime.goroutineProfileWithLabelsConcurrent function already supports concurrent, low-pause stack trace collection for aggregated profiles.

To support the richer, per-goroutine output required by debug=2, it can be extended to also collect from each g the additional fields used in the current textual format, such as goid and waitsince, alongside the labels and StackRecords it already collects, for use by runtime/pprof.

Depending on the integration approach:

  • For a new mode, pprof.writeGoroutine would be extended to implement debug=3. This new format could additionally include the pprof labels as they are collected already by goroutineProfileWithLabelsConcurrent.
  • For an in-place modification to debug=2, pprof.writeGoroutineStacks would be updated to call the new snapshot-augmented goroutineProfileWithLabelsConcurrent, though how arguments are handled is an open question.

This implementation would reuse the same synchronization and collection infrastructure already proven for debug=1-style profiles, with minimal added runtime complexity.

A proof of concept of this approach seems to produce promising benchmark results.

Compatibility

  • Modifying debug=2 in place to reduce STW time would be appealing if it could maintain the current behavior for existing users. Maintaining an identical output format seems feasible, but maintaining the same content for printed function arguments may be a challenge, as it would require capturing argument values before a goroutine is resumed. Alternatively, it could elide them and just print (...), as debug=2 already does in some cases, but doing so for every frame is likely too significant a behavior change to be made in place. An additional question is whether the relaxed consistency of reading the state and other properties that debug=2 includes after the world has been restarted would be acceptable for an in-place change.

  • Adding a new debug=3 format offers a more conservative path: an opt-in for those who want a lower-latency option, while leaving debug=2 entirely unchanged for existing callers. The new format can be documented as not including arguments, and its relaxed consistency with respect to goroutine states can be noted. Introducing a new format also provides an opportunity to incorporate pprof labels in the output.

Prior Work and Related Proposals

  • Issue #50794: explored alternatives to full stack dumps for large numbers of goroutines, including sampling.
  • CL 387415: introduced the barrier-based, low-pause stack snapshot mechanism now used by pprof.Lookup("goroutine").
  • CL 574795: abandoned CL for stack size profiling that also altered this implementation, related to Issue #66566.

Comment From: dt

I'm not sure if it would make more sense as a separate proposal or incorporation into this one, but I have also found myself wanting a compact, binary goroutine profile format, a la debug=0, where its small size -- due to deferring function name resolution -- could be amenable to (in)frequent background collection and accumulation in some sort of ring buffer.

However, my use cases depend on being able to inspect individual goroutines, precluding debug=0 despite its appealing size characteristics. Using debug=2 instead, to get individual goroutines, has also proven a challenge for this sort of constant, background collection and accumulation, both due to the STW latency impact of its collection -- which this proposal as written would address -- and its size. Thus my desire to add a new, binary format for individual, rather than count, profiles.

Such a format would be easy to incorporate into this proposal, as the collection of the raw information in runtime.goroutineProfileWithLabelsConcurrent needed for debug=2 is the same regardless of how it is rendered, so the only difference would be the output format being textual and including resolved func names vs binary.
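As a rough illustration of why such a format could be small, here is a sketch that stores each goroutine as a few varints (ID, a state code, and raw program counters) and leaves symbolization to whoever reads it later. The rawGoroutine type, the encode, decode and roundTrip helpers, and the wire layout itself are all invented for this example. They are not an existing or proposed format, and a real one would also need to carry the mapping information required to resolve PCs after the fact.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// rawGoroutine is a hypothetical unsymbolized per-goroutine record.
type rawGoroutine struct {
	ID    uint64
	State uint8    // numeric state code instead of a string
	PCs   []uint64 // program counters, resolved to names only on read
}

// encode appends g to w as a sequence of unsigned varints:
// ID, state, number of PCs, then each PC.
func encode(w *bytes.Buffer, g rawGoroutine) {
	var tmp [binary.MaxVarintLen64]byte
	put := func(v uint64) { w.Write(tmp[:binary.PutUvarint(tmp[:], v)]) }
	put(g.ID)
	put(uint64(g.State))
	put(uint64(len(g.PCs)))
	for _, pc := range g.PCs {
		put(pc)
	}
}

// decode reads one record written by encode.
func decode(r io.ByteReader) (rawGoroutine, error) {
	var g rawGoroutine
	id, err := binary.ReadUvarint(r)
	if err != nil {
		return g, err
	}
	state, err := binary.ReadUvarint(r)
	if err != nil {
		return g, err
	}
	n, err := binary.ReadUvarint(r)
	if err != nil {
		return g, err
	}
	g.ID, g.State = id, uint8(state)
	// Append one PC at a time rather than preallocating from n, so a
	// corrupt length cannot trigger a huge allocation.
	for i := uint64(0); i < n; i++ {
		pc, err := binary.ReadUvarint(r)
		if err != nil {
			return g, err
		}
		g.PCs = append(g.PCs, pc)
	}
	return g, nil
}

// roundTrip encodes g and decodes it again, returning the decoded copy.
func roundTrip(g rawGoroutine) (rawGoroutine, error) {
	var buf bytes.Buffer
	encode(&buf, g)
	return decode(&buf)
}

func main() {
	in := rawGoroutine{ID: 42, State: 4, PCs: []uint64{0x401000, 0x402abc, 0x45f010}}
	var buf bytes.Buffer
	encode(&buf, in)
	fmt.Printf("encoded %d PCs in %d bytes\n", len(in.PCs), buf.Len())
	out, err := decode(&buf)
	fmt.Println(out, err)
}
```

Delta-encoding PCs or deduplicating identical stacks across goroutines would likely shrink this further, but that is beyond what this sketch tries to show.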