Golang proposal: runtime/pprof: add detailed, per-goroutine versions of debug=0,1 goroutine profiles

Inspecting individual goroutines, their creators and their states can be a critical debugging tool in some situations, but this level of per-goroutine information is not surfaced in aggregated debug=1 or debug=0 goroutine profiles, while the STW pauses of debug=2 / runtime.Stack(all=true) profiles that do include it can be extremely disruptive in processes with larger numbers of goroutines.

This proposal recommends adding a new debug=3 goroutine profile mode based on debug=0 that provides details on each goroutine, rather than just counts by unique stacks, by adding per-goroutine attributes such as ID, creator, state and wait-duration as additional labels.

This mode would thus make available information previously only accessible via debug=2, while dramatically reducing the latency impact of of collecting it, thanks to the low-pause, concurrent snapshot mechanism introduced in CL 387415 and already used for debug=0 and debug=1.

Background

The /debug/pprof/goroutine endpoint supports two main human-readable formats:

debug=1: A collapsed textual profile consisting of counts of goroutines with matching top function frame and label values, followed by a representative stack trace. It does not show individual goroutines or attributes such as their state or wait times. Aggregation by top frame means that not all goroutine stack traces are represented.
debug=2: A full, panic-style dump of the complete stack of all goroutines individually, with additional metadata useful for debugging, including status, scheduling state, and stack creation information (but not pprof labels).

CL 387415 introduced a new mechanism for low-latency goroutine stack profiling that uses a brief STW to enable profiling that then runs concurrently after it is resumed. This approach is now used internally by pprof.Lookup("goroutine") and is significantly less disruptive than the full stop-the-world scan.

However, debug=2 continues to use the previous STW approach (runtime.Stack(all=true)), which scales poorly in systems with high goroutine counts. In real-world systems with tens or hundreds of thousands of goroutines, this can result in STW pauses of tens or hundreds of milliseconds, making debug=2 disruptive to use in production debugging workflows.

Alternatives Considered

Reduce debug=2 STW in-place: debug=2 already has this per-goroutine information (except labels), so modifying it in-place to just reduce STW pause associated with collecting it would be appealing, however the implementation could be a challenge: debug=2 includes function call argument values, has slightly different length-limiting controls, and including per-goroutine labels (#74964) would likely require a breaking change of the output format.
Add to debug=0 and debug=1 in-place: These profiles already have minimal collection impact latency and their format includes labels which could be added to to add the extra information. However the change in aggregation behavior of reporting at per-goroutine granularity instead of an aggregated count likely makes existing profiles much larger, and is a breaking behavior change. This suggests the addition of labels to these formats should be opt-in instead (e.g. via a new debug value).

Compatibility

Adding a new debug=3 format while leaving existing debug=0,1,2 behavior unchanged avoids breaking changes to existing users of the existing profile modes, presenting the new functionality strictly to those who opt-in by using the new format.

Implementation Sketch

The existing runtime.goroutineProfileWithLabelsConcurrent function already supports concurrent, low-pause stack trace collection for aggregated profiles.

To support the richer, per-goroutine output required by a debug=3 format, it can be extended to also collect the additional fields from each g, such as goid, parentGoid and waitsince, along side the labels and StackRecords it already collects. This implementation would reuse the same synchronization and collection infrastructure already proven for debug=1-style profiles, with minimal added runtime complexity.

The printing/proto encoding of the collected information in runtime/pprof would then, optionally based on the debug mode, add these fields to its output, with debug=3 (and potentially also debug=4, if we want to offer both binary proto and textual options) causing the inclusion of these extra fields in pprof labels when formatting the profile (in countProfile).

A proof of concept of this approach seems to produce promising benchmark results.

Prior Work and Related Proposals

Issue #50794: explored alternatives to full stack dumps for large numbers of goroutines, including sampling.
CL 387415: introduced the barrier-based, low-pause stack snapshot mechanism now used by pprof.Lookup("goroutine").
CL 574795: abandoned CL for stack size profiling that also altered this implementation, related to Issue #66566.

Comment From: gabyhelp

Related Issues

Related Code Changes

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

Comment From: dt

I'm not sure if it would make more sense as a separate proposal or incorporation into this one, but I have also found myself wanting a compact, binary goroutine profile format, a la debug=0, where its small size -- due to deferring function name resolution -- could be amenable to (in)frequent background collection and accumulation in some sort of ring buffer.

However my use-cases depend on being able to inspect individual goroutines, precluding debug=0 despite its appealing size characteristics. Using debug=2 instead, to get individual goroutines, has also proven a challenge for this sort of constant, background collection and accumulation, both due to both due to the STW latency impact of its collection -- which this proposal as written would address -- and its size. Thus my desire to add a new, binary format for individual, rather than count, profiles.

Such a format would be easy to incorporate into this proposal, as the collection of the raw information in runtime.goroutineProfileWithLabelsConcurrent needed for debug=2 is the same regardless of how it is rendered, so the only difference would be the output format being textual and including resolved func names vs binary.

Comment From: dt

I wrote up a standalone proposal focused on just on adding a format that includes labels when inspecting individual goroutines here: https://github.com/golang/go/issues/74964

I figured I'd write it up separately since the primarily focus on and motivated by the additional content of pprof labels, while this is primarily focused on and motivated by the STW behavior of the implementation, but I could also see combining them into a single proposal that just addresses both of these at the same time.

Comment From: rhysh

You're interested in seeing the details of single goroutines, but you're willing to give up access to the arguments. Can you say more about the need? What specific problems are you trying to debug that need more information than debug=0/1 provide? ("Of course lower STWs are nice, but why is debug=2 important to you?")

What if the runtime could add labels to the profile corresponding to the number of minutes the goroutine(s) had spent waiting?
What if the runtime could add labels for the goroutine IDs? (Though that might make the path to goroutine-local storage too easy.)
And for the parent goroutine ID?
Plus information on an arbitrary number of ancestor goroutines, subject to tracebackancestors? (Hmm.)
Does it have to be a text-based output, or could there be a (third-party) package to reformat a label-enhanced version of the debug=0/1 output?

There are more differences with debug=2 which might make mechanism reuse trickier, such as the handling of very deep call stacks. If there are more than 100 frames, it shows 50 from the top and 50 from the bottom. Even if there are many thousands of nested calls. (See https://go.dev/play/p/f-iiUg5zUvW.) The debug=0/1 modes show the leaf-most frames (usually 128, but configurable via GODEBUG=profstackdepth). The debug=2 format also interacts with GODEBUG=tracebackancestors (same play link, but run locally with modified environment).

The safety-by-default could be addressed as part of #74544 (by removing access to /goroutines?debug=2 from the v2 package, or giving it a scarier name).

You mention "relaxed consistency" a couple of times. Today, debug=0, 1, and 2 all give a snapshot describing the instant that the calling goroutine was able to stop the world (and subsequently to inspect its own call stack). I think that's an important property to maintain; debug=0/1 do it by requiring each goroutine to be profiled before they're allowed to resume (and change from their snapshot state).

I wonder if a goal of printing function arguments (for debug=2) would complicate that, through the possibility of printing return values combined with the possibility for one goroutine to write directly to the stack of another goroutine (during unbuffered channel sends). A detail to think through when the time comes, not a show-stopper.

Comment From: dt

Can you say more about ... why is debug=2 important to you?

The two main things that make me reach for debug=2, despite its STW hit, are IDs and states+waitsince:

States and waitsince are often very helpful for quickly categorizing and identifying potentially interesting goroutines: I often quickly skim for runnable goroutines first to get a since of what isn't blocked, then for very long waitsince times e.g. in select for hours to find goroutines that corresponding to something reported as being stuck for hours.

Individual goroutine IDs are useful both for identifying the same goroutine across multiple stack dumps, and for tying parent and child goroutines together in the same stack dump. A common pattern in our codebase is for a task to use WaitGroup to launch many workers and wait for them to finish, so if I can find the goroutine of my specific task, finding all goroutines by it lets me find the long tail worker it is waiting on.

willing to give up access to the arguments

I almost never use the argument values printed in debug=2: the vast majority of interesting information in the arguments is buried in fields of arguments passed by pointer, so an opaque 0xc03cb141e0 isn't going to do me much good. If I really need to inspect the arguments being passed to a running function, I'll need a debugger that can actually dig into what's pointed to.

What if the runtime could add labels to the [debug=1] profile corresponding to {state, waitsince, goroutine ID/creator ID}

Sure; I'm not all that picky about the exact format, so the information I'm after could easily be retrofitted into debug=1's format via labels -- namespace collisions with user-set labels notwithstanding. That said, adding states or IDs to the labels would change the aggregation (obviously) of existing profiles, so this seems like this would probably still need to be packaged as a new -- but syntactically identical -- format rather than an in-place change to debug=1?

Does it have to be a text-based output, or could there be a (third-party) package to reformat a label-enhanced version of the debug=0/1 output?

Yeah, I'm not tied to the specific bytes at all so long as the information is there. I already use a post-processing tool to parse created-by IDs out into a reverse map for faster navigation of large stack dumps.

differences with debug=2 which might make mechanism reuse trickier

True; part of this is why I think I ended up at the conclusion that this should certainly be a new debug=3 mode, and not try to reuse debug=2 exactly. My prototype is ignoring profstackdepth right now in its printing, but it doesn't seem difficult to add.

relaxed consistency

It is possible I was too quick to assert this, and/or that my read of what exactly can happen to the fields of g prior to calling tryRecordGoroutineProfileWB is not correct; I think the line that made me think this is this assignment to waitsince, but I am reading most of this code for the first time.

Comment From: dt

a label-enhanced version of the debug=0/1 output

I played around with a prototype in this direction -- enhancing the existing debug=0/1 output with additional synthetic labels to convey per-goroutine information.

For the textual format, at first I tried printing these attentional attributes as additional key/value pairs the labels {}, but this seems like it risks collisions with user-defined labels. In the binary proto format, in addition to the namespace collisions concern, I didn't love formatting numeric IDs and in particular the creation PC as just untyped strings pairs, even if the actual string values do at least get de-duped.

This made me wonder if it'd be better to print the additional fields on a separate line along side the # labels: {} line in the textual format, and use separate, typed proto fields for the IDs?

In my prototype I ended up defining debug=3 and debug=4: - 3 would be a binary profile in the same format as debug=0, but grouped by individual goroutine and containing additional per-goroutine details (in labels for now, but perhaps in standalone proto fields instead) - 4 would be a textual profile in the same format as debug=1, but grouped by individual goroutine and containing additional per-goroutine details on lines adjacent to the # labels line, similar to this example.

Comment From: rhysh

Thanks for the background. CC @golang/runtime

Check out #74545 as a proposed way to configure profile collection... the GoroutineRecorderConfig struct could end up with a bool field to control whether the resulting profile included the runtime-generated labels for goroutine id and wait duration. That sort of opt-in could resolve the concern about colliding with user-defined labels; the app could delay opting-in until they'd migrated to new label names.

It is possible I was too quick to assert this, and/or that my read of what exactly can happen to the fields of g prior to calling tryRecordGoroutineProfileWB is not correct; I think the line that made me think this is this assignment to waitsince, but I am reading most of this code for the first time.

I see ... say we have a goroutine blocked in a syscall trying to read from a pipe, we collect a goroutine profile, and during that collection another goroutine could add itself to the profile and then close the pipe. The first goroutine would then return from the syscall, and would add itself to the profile with a call stack describing its post-syscall state. That would be inconsistent with the other goroutine's profile record, which reflects a state where the first goroutine must still be blocked in the syscall. So there's a weaker guarantee when we talk about the goroutine profile entries that show non-Go operations. But it gets weirder if debug=0/1 were to provide access to waitsince, which with the current code would already be zeroed out.

Comment From: dt

https://github.com/golang/go/issues/74545 as a proposed way to configure profile collection... the GoroutineRecorderConfig

Interesting, thanks for pointing this out!

I guess an opt-in to clobbering labels is an option, but is there a drawback to just delivering the additional runtime-generated per-goroutine info along side, rather than in, the labels to avoid label clobbering altogether?

It seems like the difference in grouping/rollup semantics compared to 0/1 already mandate adding per-goroutine info be opt-in, via a separate debug= format (or perhaps via a bool in the new proposed recorder api), so why not just put the extra info on its own line(s) in the text / in its own fields in the proto, to leave the user-set labels as-is?

waitsince, which with the current code would already be zeroed out.

Re-reading the code today, I think maybe this specific example -- the timing of the zero out -- is actually easy to fix by just moving the assignment down?

add itself to the profile with a call stack describing its post-syscall state

Huh, maybe I've gotten myself confused yet again: wouldn't the first goroutine, when it exits its syscall, call tryRecordGoroutineProfileWB in exitsyscall to record its pre-exitsyscall stack before it resumes?

Or rather, perhaps putting that another way: assuming debug=0/1 are okay as is right now, if this proposal is just "add option to include goroutine ID, creator, status, and waitsince in debug=0/1", does adding those attributes to the output add any observable inconsistency that isn't there already (the aforementioned timing of waitsince=0 on exitsyscall notwithstanding)?

Comment From: dt

I've edited the text of this proposal, and its title, to shift it from proposing a lower-impact version of debug=2 to instead an augmented, more-granular version of debug=1.

I removed the "relaxed consistency" line from the body as well as I'm no longer sure I'm actually proposing any change wrt to consistency compared to the existing debug=1 behavior.

Comment From: rhysh

is there a drawback to just delivering the additional runtime-generated per-goroutine info along side, rather than in, the labels to avoid label clobbering altogether?

For data to end up in the debug=0 format, it will need to fit the structure of https://github.com/google/pprof/blob/main/proto/profile.proto . That structure is used by multiple languages/runtimes, so can't change quickly to meet Go-specific needs.

It supports per-sample labels, with values that are either integers (with units) or strings.

I think it'll be hard to propose a new profile format that only supports text-based formatting — the feature matrix for Go's profiles (for example, in the 2022-era https://github.com/DataDog/go-profiler-notes/blob/main/guide/README.md#go-profilers) is already pretty complex. I'd start with the binary format, and have the other representations flow from that.

I think maybe this specific example -- the timing of the zero out -- is actually easy to fix by just moving the assignment down?

Seems so, yes.

wouldn't the first goroutine, when it exits its syscall, call tryRecordGoroutineProfileWB in exitsyscall to record its pre-exitsyscall stack before it resumes?

Aha, yes, because doRecordGoroutineProfile's call to saveg results in runtime.(*unwinder).initAt finding gp.syscallsp != 0, and that's not reset until the end of exitsyscall, so the traceback will show the pre-syscall call stack, so there's no inconsistency in the current implementation. Thanks.

does adding those attributes to the output add any observable inconsistency that isn't there already (the aforementioned timing of waitsince=0 on exitsyscall notwithstanding)?

For exitsyscall, looks like it's limited to gp.waitsince=0 (which might be trivially addressed by moving it down a few lines, below the tryRecordGoroutineProfileWB call).

I'd expect a ton of leaks around "status", since there are lots of ways that a goroutine can transition from "waiting" to "runnable". I don't think it would be feasible to add an "are we profiling right now" check to each one, from either a maintenance or a performance standpoint.

Comment From: dt

[upstream proto definition] is used by multiple languages/runtimes.

Ah, that makes sense — I see now why injecting into existing labels makes sense. I’ve updated the proposal to explicitly mention labels as the way per-goroutine information would be included. Thanks for walking me through this!

The upstream proto’s comment suggests some precedent for namespacing with ::(it reserved pprof:: for internal use) so perhaps something like go::goroutine_state or go::goroutine_wait_mins would be unlikely to collide with user labels.

I think it'll be hard to propose a new profile format that only supports text-based formatting ... I'd start with the binary format, and have the other representations flow from that.

Got it. By “flow from,” do you mean “can be post-processed by an external tool,” or that any textual version would be defined as a direct mapping from the binary one? For example, debug=4 is to binary debug=3 as debug=1 is to binary debug=0.

A binary format that’s both low-STW and per-goroutine would fully meet my needs, so I’d happily take that if it’s easier to get consensus. Still, I’d make one last pitch for a textual counterpart for parity with debug=1: just as it’s handy to have a human-readable debug=1 alongside binary debug=0, if debug=3 is “0 plus extra labels,” a debug=4 (“1 plus extra labels”) would be equally convenient for quick inspection.

lots of ways that a goroutine can transition from "waiting" to "runnable" ... [not feasible to] check each one

Yeah, seems like it makes sense to just document that the status of goroutines which become unblocked during profile collection may be reported as runnable. The stack will still show that, and where, they were blocked, and so seems like a more than fair trade for avoiding a length STW.

Comment From: prattmic

I agree with @rhysh about skepticism w.r.t. adding a new text format. Trying to fit into the existing pprof format is a better direction.

I think adding new labels that we are reasonably confident won't collide is OK (maybe that is go::foo), with a GODEBUG to opt out if it is a problem for a specific application. It's too bad that runtime/pprof.WithLabels and Do can't return an error on label collision though.

I'll note that a waitsince label is effectively unique per goroutine, which will prevent any aggregation. Maybe that shouldn't be included by default?

Comment From: nsrip-dd

I'll note that a waitsince label is effectively unique per goroutine, which will prevent any aggregation. Maybe that shouldn't be included by default?

And goid :)

Comment From: dt

fit into the existing pprof format is a better direction

Yep, I think I've updated the proposal to reflect this now: debug=3 uses the existing pprof format, matching debug=0, with the runtime adding additional labels with goroutine details to the user-specified labels.

If we include a textual debug=4 counterpart to this debug=3, I wouldn't necessarily describe it as a "new" text format, since it is just the existing debug=1 format with the runtime adding additional labels to the user's labels the same way 3 is just 0's format with added labels. But again, while I personally like maintaining the party with that 1 has with 0, I'm happy to cede this if this preference isn't shared.

waitsince label is effectively unique per goroutine, which will prevent any aggregation

Indeed, as proposed, this profile mode is specifically defined as one sample per goroutine, to include goroutine-specific attributes like ID and creator ID, previously only available via the STW debug=2, so it is definitionally mutually exclusive with aggregation across goroutines.

and goid

@nsrip-dd Just to clarify, are you suggesting "and goid [should not be included by default] or simply "and goid [prevents aggregation]" ?

If you remove goid then the creator ID isn't useful anymore, as its inclusion is there to allow matching up parents and children. If you remove id, creator id, and waitsince, this is just "add state to debug=0,1", but at that point I'm not sure that's a viable debug=2 alternative that provides per-gorotuine detail without a disruptive STW.

Comment From: nsrip-dd

@nsrip-dd Just to clarify, are you suggesting "and goid [should not be included by default] or simply "and goid [prevents aggregation]" ?

I was suggesting that adding goid prevents aggregation. I agree that creator ID is only useful with goid. I'm not opposed to adding goid, though, just pointing out that it affects aggregation. In practice gzip does a good job of compressing duplicate stack traces in pprof-encoded profiles, FWIW.

Comment From: dt

I suppose one other Possible Alternatives option which I didn't include but could just for completeness would be to do something like add a new public method to runtime/pprof along the lines of pprof.SetGoroutineDetailsInGoroutineProfileLabels(bool) that toggles the synthesis and injection of additional goroutine detail labels when generating profiles, causing runtime label injection to optionally occur in the existing debug=0/1 formats in-place, rather than using a new debug=3 value that fully duplicates one of them (0) just to control label synthesis.

That said, I have a pretty strong distaste for mutating process-wide state when there is any other option instead, so my first choice would still be just adding debug=3 which is 0 but with synthesized additional labels.