proposal: new Recorder types in runtime/pprof

Written primarily by me and @prattmic, but with input from the Go team. Consider this as a first take. We're hoping to shake out the details with the community.

Background

The runtime/pprof package provides foundational Go APIs to collect runtime diagnostic data from Go programs. It is critical to Go's production stack, with 22,000+ public package imports.

For the most part this package serves us well, and most profiles are neatly captured by a global Profile value. However, configuring each profile is messy. The sampling and collection configuration of each profile generally consists of some bespoke functions and/or global variables in the runtime and runtime/pprof packages. Meanwhile the profile format is captured by an opaque and slightly mysterious debug integer. Things get worse because there is other data that overlaps with the profile data. (For the worst case of this complexity, see Felix Geisendörfer's goroutine profile feature matrix.)

On top of all this, CPU profiles have a completely different API, plus yet another new API whose proposal was accepted but stagnated before being implemented.

Goal

The existing API surface has three main problems:

The goal of this proposal is then to create a template for runtime diagnostics APIs that is clear, composable, and extendible.

Proposal

There are 3 kinds of profiles we expose today:

  1. Snapshot profiles (goroutine profile, heap profile (inuse))
  2. Profiles containing data since program start (heap profile (allocs) and most other profiles)
  3. Profiles containing data for a specific window of time (CPU profiles)

The core idea behind this proposal is to create a suite of recorder types that exclusively expose profiles as either snapshot profiles or time-windowed profiles. Profiles containing data since program start are non-intuitive and consistently confusing, especially when coupled with the fact that profile configuration is in a different package. (For example, mutex and block profiles are completely empty by default, and the API exposed takes a snapshot of all data since program start. So, by default, the API confusingly produces exactly nothing.)

Recorder types for snapshot profiles expose a Snapshot method which writes the profile to an io.Writer, while recorder types for time-windowed profiles expose Start and Stop methods to record over a specific time window. We propose having one recorder type for each kind of profile and a separate recorder type for custom profiles. These recorder types will accept a bespoke set of configuration options in their constructors, centralizing the configuration to a single package, and localizing configuration options to a specific recorder instance.
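The mutex profile example above can be reproduced with today's API. The sketch below uses only existing calls (runtime.SetMutexProfileFraction, pprof.Lookup, Profile.Count, Profile.WriteTo); the helper names contend and writeMutexProfile are invented for this illustration. It shows that sampling is configured in package runtime while collection happens in runtime/pprof, and that the written profile always covers everything since program start.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contend makes several goroutines fight over one mutex so that an
// enabled mutex profile has a chance to record something.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

// writeMutexProfile writes the text form of the mutex profile, covering
// everything since program start, and returns the number of bytes written.
func writeMutexProfile() (int, error) {
	var buf bytes.Buffer
	err := pprof.Lookup("mutex").WriteTo(&buf, 1)
	return buf.Len(), err
}

func main() {
	// Sampling is off by default, so this contention is not recorded.
	contend()
	fmt.Println("stacks before enabling:", pprof.Lookup("mutex").Count())

	// Configuration lives in package runtime, not runtime/pprof.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)
	contend()

	n, err := writeMutexProfile()
	fmt.Println("bytes written:", n, "err:", err)
}
```

Nothing in the WriteTo call hints that a separate runtime setting decides whether the profile contains any samples, which is the confusion the recorder types aim to remove.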

This proposal is intended to supersede https://github.com/golang/go/issues/42502.

API

Given the core idea, the actual proposed API changes are relatively simple and straightforward, although the API surface is somewhat large. Below is a full listing of all the different recorders we propose. We omit the thread creation profile because it's not useful. (See https://github.com/golang/go/issues/6104.)


package runtime/pprof

type CPURecorder struct { ... }

type CPURecorderConfig struct {
    // Period sets the duration between profile samples.
    //
    // If no value is set, the sample period for the duration is implementation-defined.
    // This implementation-defined value is independent of [runtime.SetCPUProfileRate],
    // but the rate set here may be visible to consumers of [StartCPUProfile], so callers
    // are discouraged from using both CPURecorder and StartCPUProfile in the same
    // program.
    Period time.Duration
}

func NewCPURecorder(CPURecorderConfig) (*CPURecorder, error)

// Start applies the recorder's configuration and begins global collection of
// CPU samples.
//
// Returns an error if this recorder had already been started.
func (*CPURecorder) Start(io.Writer) error

// Stop completes collection of the CPU profile.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*CPURecorder) Stop() error

type AllocRecorder struct { ... }

type AllocRecorderConfig struct {
    // BytesPerSample sets the maximum number of bytes allocated between samples.
    //
    // If no value is set, the sample period for the duration is implementation-defined.
    // This implementation-defined value is independent of [runtime.MemProfileRate],
    // but the rate set here may be visible to consumers of [Profile.WriteTo].
    BytesPerSample int64
}

func NewAllocRecorder(AllocRecorderConfig) *AllocRecorder

// Start applies the recorder's configuration and takes a snapshot of the profile.
//
// Returns an error if this recorder had already been started.
func (*AllocRecorder) Start(io.Writer) error

// Stop takes a second snapshot and computes the difference with the profile
// taken at start. The resulting profile is written to the io.Writer provided
// to Start. It contains all sampled allocations made in the window of time between
// Start and Stop, which is useful for identifying sources of high allocation volume.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*AllocRecorder) Stop() error

type HeapRecorder struct { ... }

type HeapRecorderConfig struct {
    // None for now.
}

func NewHeapRecorder(HeapRecorderConfig) *HeapRecorder

// Snapshot writes a sampled profile of the live heap to the provided io.Writer.
//
// The sampling rate is controlled by [runtime.MemProfileRate].
func (*HeapRecorder) Snapshot(io.Writer) (int, error)

// Start applies the recorder's configuration and takes a snapshot of the sampled live heap.
//
// Returns an error if this recorder had already been started.
func (*HeapRecorder) Start(io.Writer) error

// Stop takes a second snapshot and computes the difference with the profile
// taken at start. The resulting profile is written to the io.Writer provided
// to Start. This delta is useful for identifying memory leaks, since memory leaks
// will quickly rise out of the noise with a large positive delta over time.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*HeapRecorder) Stop() error

type BlockRecorder struct { ... }

type BlockRecorderConfig struct {
    // EventsPerSample sets the number of goroutine block events between samples.
    //
    // If no value is set, the sample period for the duration is implementation-defined.
    // This implementation-defined value is independent of [runtime.SetBlockProfileRate],
    // but the rate set here may be visible to consumers of [Profile.WriteTo].
    EventsPerSample int
}

func NewBlockRecorder(BlockRecorderConfig) (*BlockRecorder, error)

// Start applies the recorder's configuration and takes a snapshot of the profile.
//
// Returns an error if this recorder had already been started.
func (*BlockRecorder) Start(io.Writer) error

// Stop takes a second snapshot and computes the difference with the profile
// taken at start. The resulting profile is written to the io.Writer provided
// to Start.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*BlockRecorder) Stop() error

type MutexRecorder struct { ... }

type MutexRecorderConfig struct {
    // EventsPerSample sets the maximum number of unlock events between samples.
    //
    // If no value is set, the sample period for the duration is implementation-defined.
    // This implementation-defined value is independent of [runtime.SetMutexProfileRate],
    // but the rate set here may be visible to consumers of [Profile.WriteTo].
    EventsPerSample int
}

func NewMutexRecorder(MutexRecorderConfig) (*MutexRecorder, error)

// Start applies the recorder's configuration and takes a snapshot of the profile.
//
// Returns an error if this recorder had already been started.
func (*MutexRecorder) Start(io.Writer) error

// Stop takes a second snapshot and computes the difference with the profile
// taken at start. The resulting profile is written to the io.Writer provided
// to Start.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*MutexRecorder) Stop() error

type GoroutineRecorder struct { ... }

type GoroutineRecorderConfig struct {
    Format GoroutineProfileFormat
}

// GoroutineProfileFormat is an enumeration of available formats for writing
// out the goroutine profile.
type GoroutineProfileFormat int

const (
    PprofGoroutineProfile GoroutineProfileFormat = iota // Default gzipped protobuf.
    TextGoroutineProfile                                // Legacy text profile.
    TracebackGoroutineProfile                           // Matches default traceback format.
)

func NewGoroutineRecorder(GoroutineRecorderConfig) (*GoroutineRecorder, error)

// Snapshot snapshots the state of all goroutines, assembles it into a profile,
// and writes the result to the provided io.Writer.
func (*GoroutineRecorder) Snapshot(io.Writer) (int, error)

// ProfileRecorder is a generic profile recorder that works for any profile.
//
// It does not provide as much customizability as the more specific types,
// but works with any Profile, including custom Profiles.
type ProfileRecorder struct { ... }

type ProfileRecorderConfig struct {
    // None for now.
}

func NewProfileRecorder(*Profile, ProfileRecorderConfig) (*ProfileRecorder, error)

// Start applies the recorder's configuration and takes a snapshot of the profile.
//
// Returns an error if this recorder had already been started.
func (*ProfileRecorder) Start(io.Writer) error

// Stop takes a second snapshot and computes the difference with the profile
// taken at start. The resulting profile is written to the io.Writer provided
// to Start.
//
// Returns an error on any failure to write to the io.Writer provided to Start,
// or if the recorder had not been started.
func (*ProfileRecorder) Stop() error

// Snapshot emits all profile data collected since program start until this point.
func (*ProfileRecorder) Snapshot(io.Writer) (int, error)
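
As a rough illustration of the Start/Stop contract described in the doc comments above (a snapshot at Start, a second snapshot and a difference at Stop), the following sketch computes an allocation delta with the existing runtime.MemProfile API. The names stackKey, snapshot, and delta are invented here, and keying records on Stack0 is a simplification. This is not necessarily how the proposed recorders would be implemented.

```go
package main

import (
	"fmt"
	"runtime"
)

// stackKey identifies an allocation site by its recorded call stack.
type stackKey [32]uintptr

// snapshot returns the cumulative sampled allocation bytes per call stack
// since program start, as reported by runtime.MemProfile.
func snapshot() map[stackKey]int64 {
	var recs []runtime.MemProfileRecord
	// MemProfile reports the required length when the slice is too small,
	// so grow (with some slack) and retry until the records fit.
	n, ok := runtime.MemProfile(nil, true)
	for !ok {
		recs = make([]runtime.MemProfileRecord, n+50)
		n, ok = runtime.MemProfile(recs, true)
	}
	out := make(map[stackKey]int64, n)
	for _, r := range recs[:n] {
		out[stackKey(r.Stack0)] += r.AllocBytes
	}
	return out
}

// delta returns after minus before, dropping stacks with no change.
func delta(before, after map[stackKey]int64) map[stackKey]int64 {
	d := make(map[stackKey]int64)
	for k, v := range after {
		if diff := v - before[k]; diff != 0 {
			d[k] = diff
		}
	}
	return d
}

// sink keeps the allocations below reachable.
var sink [][]byte

func main() {
	before := snapshot() // what Start would capture
	for i := 0; i < 1000; i++ {
		sink = append(sink, make([]byte, 64<<10))
	}
	// The reported profile can lag behind by garbage collection cycles,
	// so force a collection before the second snapshot.
	runtime.GC()
	after := snapshot() // what Stop would capture
	fmt.Println("stacks with new sampled allocations:", len(delta(before, after)))
}
```

Because the underlying data is sampled, the number of stacks printed varies from run to run.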

Rationale

For the general structure of the API, with bespoke types representing some configuration, the rationale can be found in the lengthy discussion around the FlightRecorder proposal. In short, we want to support multiple consumers with different configurations (hence configurations are values) and we want to give room for configuration options to grow (hence the many recorder types we propose, instead of just one to rule them all). In some ways this proposal is just applying the insights and lessons from the FlightRecorder proposal to the runtime/pprof package.

What is new in this proposal is our focus on delta profiles. We have two reasons for this.

First, delta profiles compose much more cleanly. Long-term, we want to move toward being able to compose multiple profile consumers, but specifically by having configuration options specify a minimum requirement on how much detail is present in the profile. This is much harder to do for profile data that has been collected since program start. Although such profile data is useful, the API that already exists is about as good as we can do anyway.

Second, delta profiles match the models of CPU profiling and runtime execution traces far more closely. This means a more uniform API surface across all our diagnostics, leading to better discoverability of diagnostics, diagnostic configuration options, and a space to grow additional configuration options for each profile type.

As mentioned earlier, we still have to make some exceptions to delta profiles. Certain profiles represent a single instant in time, like goroutine profiles. The heap profile also represents an instant, specifically with the live heap part of the profile. (The heap profile and alloc profile are really the same thing, so this part can get a bit tricky. We choose to continue to have separate types for them, and when taking one of these profiles, the other one simply comes along for the ride.)

Composing consumers

Note that with some of the API above, each consumer may set its own desired sampling rate. For this to compose, the runtime must adjust its internal sampling to the maximum of all requested sampling rates.

This seems simple at first glance, but comes with a significant complication. Namely, the pprof format only supports a single global sampling rate, so there's no way to indicate that some samples have different weights. As a result, all of the tooling makes the same assumption.

Long-term, we believe the fix to this will be random downsampling to the desired rate. For this to work correctly we will need to change the implementation to track the rate that each sample was collected under, as currently all this data is aggregated away. With this information, randomly decimating samples in each sample rate group to the requested sampling rate should be sufficient. Implementing this will likely require some significant restructuring to the sampling bucket infrastructure in the runtime.

However, we need not block the new APIs on this work. For one, we take note that, by and large, diagnostics consumers do not adjust the sampling rate, or only do so infrequently. And it's already the case today that the sampling rate parameters should not (or simply cannot) change while profiling is active.

Therefore, I propose the following near-term compromise: if a consumer requests a sampling rate that is identical to the current sampling rate, new consumers are allowed to subscribe. Otherwise, Start returns a descriptive error explaining the issue and the current sampling rate. This compromise is already a step forward because it would allow multiple concurrent profiling consumers at all, and supports the common case without much additional effort required.
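
The compromise can be sketched as a small gate that tracks the active rate and the number of subscribers. The rateGate type and its methods are hypothetical names for illustration only. A real implementation would live inside the runtime and would be called from each recorder's Start and Stop.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// rateGate is a hypothetical helper that tracks the active sampling rate
// and how many consumers are currently subscribed to it.
type rateGate struct {
	mu   sync.Mutex
	rate int
	subs int
}

// subscribe admits a consumer if nobody is active, or if it requests
// exactly the active rate. Otherwise it reports both rates in the error.
func (g *rateGate) subscribe(rate int) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.subs > 0 && rate != g.rate {
		return fmt.Errorf("requested sampling rate %d conflicts with active rate %d", rate, g.rate)
	}
	g.rate = rate
	g.subs++
	return nil
}

// unsubscribe releases one subscription. Once none remain, the next
// subscriber is free to pick a different rate.
func (g *rateGate) unsubscribe() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.subs == 0 {
		return errors.New("no active subscribers")
	}
	g.subs--
	return nil
}

func main() {
	var g rateGate
	fmt.Println(g.subscribe(100)) // first consumer sets the rate
	fmt.Println(g.subscribe(100)) // same rate: allowed
	fmt.Println(g.subscribe(200)) // different rate: descriptive error
}
```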

Comment From: prattmic

We discussed this proposal briefly at the Go Contributor Summit at GopherCon Europe (https://go.dev/s/gceu25-summit). Working from memory, I believe the discussion included @felixge @kakkoyun @seankhliao @qmuntal (apologies to those I forgot).

We mostly discussed how to handle composing consumers that have different sample rates. One idea was to allow multiple subscribers but only one "owner" of the sample rate. At the time, I imagined this as an explicit API. What you have proposed is actually quite similar, but it happens implicitly in the existing API.

if a consumer requests a sampling rate that is identical to the current sampling rate, new consumers are allowed to subscribe. Otherwise, Start returns a descriptive error explaining the issue and the current sampling rate.

Comment From: prattmic

(The heap profile and alloc profile are really the same thing, so this part can get a bit tricky. We choose to continue to have separate types for them, and when taking one of these profiles, the other one simply comes along for the ride.)

This feels a bit awkward to me. AllocRecorder seems fine. You get a delta allocation profile plus a bonus snapshot of the live heap.

HeapRecorder is the awkward one since it isn't a delta profile at all. You get a snapshot heap profile plus a bonus allocation profile since program start. Why since program start?

I acknowledge that folks probably do want a way to get heap+allocs together since that is how things have always worked, but I think it would be OK to limit that to AllocRecorder (it could also be an option).

Comment From: nsrip-dd

+1 to @prattmic's comment above: HeapRecorder seems redundant given how AllocRecorder is proposed to work here.

Also, still reading through the proposal, but one really small documentation nit-pick:

func AllocRecorderConfig struct {
    // BytesPerSample sets the maximum number of bytes allocated between samples.
    //
    // If no value is set, the sample period for the duration is implementation-defined.
    // This implementation-defined value is independent of [runtime.MemProfileRate],
    // but the rate set here may be visible to consumers of [Profile.WriteTo].
    BytesPerSample int64
}

Assuming that we keep the existing random sampling method, BytesPerSample would be the average number of bytes allocated between samples, not the maximum.

Comment From: mknyszek

@prattmic Just to be clear, are you suggesting we drop HeapRecorder altogether? I'm fine with that if it's the consensus, but I'd like to understand better where you're coming from.

You get a delta allocation profile plus a bonus snapshot of the live heap.

That's not quite right, as the proposal is written. It currently proposes doing a full delta. However, maybe that's wrong, and we should only do a delta of the alloc_space and alloc_objects sample indexes, while retaining only the live heap snapshot at the beginning or end. I think that might just be more intuitive, and it lines up well with #57765. However, as @nsrip-dd notes in #57765, taking a delta of two live heap snapshots is still useful for things like debugging memory leaks. It would be good to find a place to put that. Maybe that's what HeapRecorder should mean?

HeapRecorder is the awkward one since it isn't a delta profile at all.

Neither is the goroutine profile, though, but I still think it's worthwhile to try to cover it here. It's not fundamentally that different in how it should be configured. That's why there's both the Start/Stop and WriteTo templates.

There are 3 kinds of profiles we expose today:

  1. Instantaneous (goroutine profile, heap profile (inuse))
  2. Since program start (heap profile (allocs) and most other profiles)
  3. Time window profile (CPU profiles)

Another way to look at this is that this proposal is trying to move profiles out of (2) and either into (1) or (3), since category (2) is just less intuitive for several of the profiles we expose. We can kind of do this with heap profiles by just changing the default sample_index, like what the different Profile types do. It is awkward that the "since program start" data comes along for the ride, then again I suppose we can just drop it?

Why since program start?

The live heap snapshot is also sampled, so the conclusion that we need to be sampling since program start to have an accurate snapshot is logical. It's certainly not obvious, but it does make sense.

(This makes me think again about mutex and block delta profiles. It is a little bit unfortunate that we'd lose some data with this delta profiles approach. That is, if contention started before the profiled region, we'd miss it. I don't know how much less useful that makes the profile. It's probably not a big deal.)

Lastly, with regard to removing HeapRecorder, it'd be a little unfortunate that these new APIs would no longer line up with the list of Profiles we have. I think part of the reason we have both "heap" and "alloc" profiles, even though they're the same thing, is because many users don't necessarily realize that. The "sample_index" functionality in pprof is really non-obvious, especially if you only look at profiles once in a while.

Comment From: prattmic

No, I was not suggesting dropping HeapRecorder altogether. I was suggesting that HeapRecorder include only the live heap snapshot (inuse) and not allocs since program start.

There are 3 kinds of profiles we expose today:

  1. Instantaneous (goroutine profile, heap profile (inuse))
  2. Since program start (heap profile (allocs) and most other profiles)
  3. Time window profile (CPU profiles)

Another way to look at this is that this proposal is trying to move profiles out of (2) and either into (1) or (3), since category (2) is just less intuitive for several of the profiles we expose.

Thanks for writing this out explicitly, I was actually thinking along the same lines about wanting to move everything to (1) or (3). The primary point of my comment is that HeapRecorder does not match this model. It includes an alloc profile since program start, which is (2).


I also did not realize that the AllocRecorder includes a delta inuse profile. That's interesting, and I think it makes things a bit more consistent. That said, I wonder if it would be more intuitive to make AllocRecorder allocs only, and offer delta inuse profiles by adding Start/Stop to HeapRecorder. i.e., HeapRecorder.Start/Stop yields a delta profile, and HeapRecorder.WriteTo provides a snapshot. (I think this would be more clear if we renamed WriteTo to Snapshot, but then it won't implement io.WriterTo. Does anyone care?)

Comment From: mknyszek

No, I was not suggesting dropping HeapRecorder altogether. I was suggesting that HeapRecorder include only the live heap snapshot (inuse) and not allocs since program start.

Understood, apologies for the misunderstanding.

Thanks for writing this out explicitly, I was actually thinking along the same lines about wanting to move everything to (1) or (3). The primary point of my comment is that HeapRecorder does not match this model. It includes an alloc profile since program start, which is (2).

👍

I also did not realize that the AllocRecorder includes a delta inuse profile. That's interesting, and I think it makes things a bit more consistent. That said, I wonder if it would be more intuitive to make AllocRecorder allocs only, and offer delta inuse profiles by adding Start/Stop to HeapRecorder. i.e., HeapRecorder.Start/Stop yields a delta profile, and HeapRecorder.WriteTo provides a snapshot.

Yeah, I think you're right that it would be a lot more intuitive to just separate them.

That being said, there are use-cases for collecting them together, I think mostly boiling down to the arguments in https://github.com/golang/go/issues/57765. Maybe we make collecting them both configurable? Like AllocRecorder can have an argument like SnapshotHeap bool which will add back in the inuse sample indexes and will take the snapshot at the end of the Start/Stop, or something. This is minor, we can always discuss this later.

(I think this would be more clear if we renamed WriteTo to Snapshot, but then it won't implement io.WriterTo. Does anyone care?)

I am in favor. WriteTo was chosen for flight recording out of convenience, so I wanted to be consistent with that, but flight recording is also doing something different, and I think WriteTo makes a lot more sense for a name there. I doubt anyone cares about implementing io.WriterTo given that Profile technically does not implement it.

I will update the proposal with your points.

Comment From: felixge

Thanks for raising this proposal!

Given that one of the goals is to solve for the compression issue I raised, and we're also having debates about combined profile types (allocs+inuse), I'd like to point to a use case that this new API doesn't solve: Data duplication.

IMO an efficient API would allow a user to select the profiling data they wish to record, and end up with a single file that contains all the data, but no duplication of symbols, stack traces, and other meta data that might be shared between profile types.

The pprof format is a bit limited in this regard because the Sample.location_id list field embeds the stack trace directly into a sample without the ability to reuse a reference to it. Additionally adding many sample types into a single pprof causes a lot of Sample.value entries with a value of 0.

However, the upcoming OpenTelemetry format (the SIG hopes to publish 1.0 in August) overcomes both of these limitations. A Profile has a single sample_type, and stack traces will be dictionary encoded after this PR lands (has SIG consensus).

In theory this proposal is well positioned to support new profiling formats in the future. However, the current API wouldn't be able to take advantage of the efficiency improvements in the new OpenTelemetry format, as each Recorder would have to dump its own dictionary table.

I'll raise it in the diagnostics meeting today, but I'd love to align on whether or not we want to include this use case (ability to bulk-encode profiles so that they can share format specific dictionaries).

Comment From: mknyszek

Being able to request a set of profiles at once in a single file was something @prattmic and I discussed offline. We think it can be built as a layer on top of the Recorders, at the API level I mean. (You package up a whole bunch of Recorders together into a single aggregation type, and then call Start and Stop on that.) This would be useful for things like, for example, producing a profile containing all the things PGO needs.

Comment From: felixge

@mknyszek @prattmic that's a neat idea, I like it. If we implement such a MultiRecorder, we need to think about how format choices will compose. E.g. if GoroutineRecorderConfig.Format conflicts with the MultiRecorderConfig.Format. But I guess the simple answer is: The MultiRecorder format overwrites any format choices from the recorders it is aggregating?

Comment From: mknyszek

E.g. if GoroutineRecorderConfig.Format conflicts with the MultiRecorderConfig.Format. But I guess the simple answer is: The MultiRecorder format overwrites any format choices from the recorders it is aggregating?

Agreed, that is awkward. That is a simple answer, but it occurs to me that the other profiles also have different formats depending on the debug setting (including the fact that the heap profile in text form will also dump MemStats).

It might make sense just to drop these formats (especially the legacy pprof text format) but add back in other functionality. For example, you could imagine we add a "memory breakdown" profile which produces a tiny pprof profile consisting of /memory/classes runtime/metrics. The goroutine profile is a little awkward in particular because you can get just the regular stack trace, which seems useful. Maybe there should just be a separate, explicit API for that. Or maybe the existing API is good enough.

The idea of MultiRecorder is also a little awkward with respect to 'snapshot' profiles, especially when heap profiles (and user profiles) can be both Snapshotted and Start/Stop'd. This could be resolved if we split up the concept of a 'snapshotted' and a 'recorded' profile.