Since 2022, members of the Go team and the Go community have met every two weeks to discuss Go runtime topics. Until 2025 the focus was mainly on diagnostics, but we've also delved into plenty of other topics related to the runtime. To reflect this, I'm renaming this group to the "performance and diagnostics biweekly meeting."
We currently have a virtual sync every 2 weeks, Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.
The reasons for starting this meeting in 2022 are still relevant today. So for the sake of some nostalgia, and also as a reminder of the importance of diagnostics to making Go an effective tool for our users, I leave behind some of the original text from this GitHub issue below.
As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This issue extends to the Go team as well, where we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.
This issue is a tracking issue for improving the state of Go runtime diagnostics and its tooling, focusing primarily on runtime/trace traces and heap analysis tooling.
Comment From: mknyszek
2022-12-07 Sync
Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh
Notes:
- pprof labels in execution traces
  - Michael K: I need to follow up on the CL and issue.
  - Michael P: Have you considered runtime/trace regions?
  - Nick: Yes, but it doesn't quite hit our use-cases.
  - Rhys: Could use logs instead. Is the point to follow goroutines?
  - Felix: Yeah.
  - Rhys: Started parsing the trace format for some internal tooling.
  - Felix: Prefer labels because of inheritance. Mostly like it for profile tools on top, but maybe doesn't matter for tracing tools. The parent goroutine gets recorded so things like request IDs can be tracked in post-processing.
    - Also useful for attributing CPU time.
  - Michael P: Tasks are inherited, but inheritance is processed downstream.
    - Would be nice to have pprof labels just so you don't have to think about which one to use.
  - Michael K: Useful even just to bridge pprof and runtime/trace.
  - Austin: Agreed. We can one day get to the point of deprecating the old APIs as well.
  - Rhys: RE: attributing CPU time, can see at better than 10 ms of granularity already (even if Ps aren't CPU time, it's still time the rest of the app couldn't run).
  - Michael P: There's an issue about measuring CPU time online. (TODO: Find it.)
- Trace parsing API
  - Michael K: How important is this? Priority?
  - Felix: Important for the community, but we can't make use of it in our case. Will the trace format change in between releases?
  - Michael K: I think we can always guarantee a trace format for a release.
  - Michael P: How high level should this API be?
    - cmd/trace has two levels:
      - Low-level that understands format and events.
      - Higher-level that understands relationships between goroutines, etc.
  - Michael K: Page trace splits this into "parser" and "simulator." The latter is more stateful.
  - Felix: Intuitive feeling toward a lower-level API.
  - Rhys: +1 to low level.
  - Austin: Scalability of processing traces.
    - Currently not in a good state in low- or high-level format (currently requires the whole trace).
    - Can fix the trace wire format for low-level parsing scalability issues, but it's much less clear how to do this for the high-level format.
  - Austin: Flight recorder idea.
    - Interacts interestingly with simulation. The current trace snapshots everything.
    - Solved this in debuglog; it reads its own tail and keeps local state updated.
    - Complicated trade-offs in this space.
  - Felix: We use a lot of JFR; one thing that's nice is that it's broken down into self-contained chunks.
- Michael K sent out a very half-baked trace format revamp. (Thanks for the comments! Far from ready to share more widely.)
  - The next step is to measure the actual current overhead.
    - Maybe add a mode to Sweet?
  - Rhys: Have been collecting CPU profiles and execution traces. 20% of CPU time during an execution trace is for the execution trace itself. 95% of the overhead is collecting stack traces.
    - Collect 1 second every 5000 seconds and no one complains. People do complain about goroutine profiles every 2 minutes.
  - Michael K: Shooting for KUTrace overhead, so making stack traces optional/faster is just step 1.
  - Felix: Trace effect on tail latency.
  - Rhys: Traces are my view of tail latency.
  - Felix: Benchmark for pathological cases and worst case.
  - Austin: Linked to the trace overhead issue, Dmitry proposed switching to frame pointer unwinding.
  - Felix: At some point implemented frame pointer unwinding in userland and it was 50x faster (link).
  - Rhys: Not sure what kind of tool you could build without stack traces in an app that doesn't set pprof labels, tasks, regions, trace logs, etc.
- Michael K: Integration with structured logging?
  - Michael P: It does add yet another way to add things to runtime/trace.
  - Rhys: The standard library (e.g. database/sql) doesn't currently use runtime/trace at all; maybe it should.
  - Michael K: This connects to deciding what goes into a trace. I think this is a very good idea.
  - Felix: +1. The Java world and JFR do this.
- Deciding what goes into a trace
  - Disabling stack tracing / reducing stack trace depth
  - Filtering by pprof labels
  - Specific event classes
  - Standard library events
  - Rhys: I've made this decision for my organization. Expected that you do profiling for a long-running service. No opportunity for app owners to express opinions. People who complained forked the package, turned it off, and are now coming back. I kind of want everything.
  - Felix: I would love to be in a place where we can do that, but we get pushback from users when the overhead is too high.
  - Rhys: The question is how close we get within 1% overhead. My choice was to get everything, but less often.
  - Felix: The desire to get more of everything is in conflict with adding more kinds of things to the trace.
  - Michael P: Agreed. Ideally we have tracing that's sufficiently fast that we have it on all the time, but if libraries are allowed to add new traces then it could be a problem. It would be nice to turn that off without forking a library.
- Before next sync:
  - Michael K: Unblock the pprof labels patch and benchmarking trace overhead.
  - Felix: I can contribute a worst-case benchmark.
    - Currently blocked on pprof labels in trace.
  - Felix: Started to work on gentraceback. Might work on it over the holidays.
    - Trying for bug-for-bug compatibility.
    - Michael P: Austin has been working on this too.
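For readers less familiar with the annotation APIs being compared above, here is a minimal sketch using the existing standard-library hooks (runtime/pprof labels plus runtime/trace tasks, regions, and logs). The program and its label, task, and region names are purely illustrative:

```go
package main

import (
	"context"
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	// Collect an execution trace for the duration of main.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// pprof labels: set on the current goroutine for the duration of the
	// callback and inherited by goroutines started inside it. Today they
	// show up in CPU profile samples; attaching them to trace events is
	// what the CL discussed above is about.
	pprof.Do(context.Background(), pprof.Labels("request-id", "1234"), func(ctx context.Context) {
		// Task: groups related work, possibly spanning goroutines.
		ctx, task := trace.NewTask(ctx, "handleRequest")
		defer task.End()

		// Region: brackets a stretch of work on a single goroutine.
		trace.WithRegion(ctx, "decode", func() {
			// Log: a timestamped message attached to the task.
			trace.Log(ctx, "detail", "decoded payload")
		})
	})
}
```

Labels set via pprof.Do are inherited by goroutines started inside the callback, which is the inheritance property discussed above; tasks and regions propagate only through the context you pass along explicitly.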
Comment From: felixge
I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into https://github.com/golang/go/issues/57159 . Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15s of prod activity) yesterday, and it was a real eye opener for how much the current UI struggles.
Comment From: tbg
I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer where it would freeze on me all the time, esp. in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace so it's not as great, but I was glad to have something that worked.
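For anyone who wants to script that workaround, here is a minimal sketch; the address is whatever go tool trace prints when it starts, and the goroutine ID is just an example:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// go tool trace serves its web UI on a local address that it prints
	// at startup; the address and goroutine ID below are just examples.
	const url = "http://127.0.0.1:33333/jsontrace?goid=42"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine42.json")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// The resulting JSON can be loaded into ui.perfetto.dev.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```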
Comment From: thediveo
Would the pprof labels also show up in goroutine traces?
Comment From: qmuntal
I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?
Comment From: mknyszek
@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.
@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer, that's certainly more in the same spirit. Do you have an issue for tracking that already?
Comment From: qmuntal
Do you have an issue for tracking that already?
I still have to prepare the proposal, I plan to submit it next week.
unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).
Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.
Comment From: qmuntal
Do you have an issue for tracking that already?
I do now #57302 😄
Comment From: gopherbot
Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead
Comment From: mknyszek
2022-12-22 Sync
Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh
- Organizational stuff
- OK to record meetings?
- Meeting recorded with transcript this week (please ask if you would like to see it).
- Trace overhead benchmarks
- https://go.dev/cl/459095
name old time/op new time/op delta
BiogoIgor 17.7s ± 3% 17.5s ± 4% ~ (p=0.190 n=10+10)
BiogoKrishna 15.1s ± 4% 15.1s ± 4% ~ (p=0.739 n=10+10)
BleveIndexBatch100 5.78s ± 7% 5.76s ±11% ~ (p=0.853 n=10+10)
BleveQuery 2.37s ± 0% 2.37s ± 0% -0.26% (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat 16.9s ± 9% 16.9s ± 7% ~ (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1 36.7s ± 1% 44.4s ± 2% +21.01% (p=0.000 n=10+10)
GoBuildKubelet 47.0s ± 2% 48.8s ± 3% +3.72% (p=0.000 n=10+10)
GoBuildKubeletLink 8.89s ± 2% 8.88s ± 4% ~ (p=0.720 n=10+9)
GoBuildIstioctl 45.9s ± 1% 47.8s ± 2% +4.09% (p=0.000 n=10+10)
GoBuildIstioctlLink 9.07s ± 2% 8.99s ± 2% ~ (p=0.095 n=10+9)
GoBuildFrontend 15.7s ± 4% 16.1s ± 2% +2.45% (p=0.043 n=10+10)
GoBuildFrontendLink 1.38s ± 2% 1.37s ± 3% ~ (p=0.529 n=10+10)
GopherLuaKNucleotide 27.9s ± 0% 27.9s ± 1% ~ (p=0.853 n=10+10)
MarkdownRenderXHTML 256ms ± 2% 256ms ± 2% ~ (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest 618µs ± 7% 657µs ±10% +6.30% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 722µs ± 6% 773µs ± 4% +6.96% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 508µs ± 3% 532µs ± 3% +4.73% (p=0.000 n=10+10)
name old average-RSS-bytes new average-RSS-bytes delta
BiogoIgor 68.8MB ± 2% 71.8MB ± 4% +4.40% (p=0.000 n=10+10)
BiogoKrishna 4.42GB ± 0% 4.42GB ± 0% ~ (p=0.739 n=10+10)
BleveIndexBatch100 194MB ± 2% 198MB ± 3% +1.91% (p=0.008 n=9+10)
BleveQuery 536MB ± 0% 537MB ± 1% ~ (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat 444MB ± 1% 446MB ± 0% +0.41% (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1 132MB ± 1% 142MB ± 4% +7.61% (p=0.000 n=10+10)
GoBuildKubelet 1.75GB ± 1% 1.85GB ± 1% +5.51% (p=0.000 n=10+10)
GoBuildIstioctl 1.35GB ± 1% 1.42GB ± 1% +5.49% (p=0.000 n=10+9)
GoBuildFrontend 511MB ± 2% 543MB ± 1% +6.31% (p=0.000 n=10+9)
GopherLuaKNucleotide 37.0MB ± 1% 40.4MB ± 2% +9.24% (p=0.000 n=9+10)
MarkdownRenderXHTML 21.8MB ± 3% 24.0MB ± 3% +10.14% (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest 5.40GB ± 1% 5.38GB ± 1% ~ (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest 5.72GB ± 1% 5.71GB ± 1% ~ (p=0.971 n=10+10)
Tile38KNearestLimit100Request 7.26GB ± 0% 7.25GB ± 0% ~ (p=0.739 n=10+10)
name old peak-RSS-bytes new peak-RSS-bytes delta
BiogoIgor 95.9MB ± 4% 98.5MB ± 3% +2.70% (p=0.030 n=10+10)
BiogoKrishna 4.49GB ± 0% 4.49GB ± 0% ~ (p=0.356 n=9+10)
BleveIndexBatch100 282MB ± 3% 284MB ± 4% ~ (p=0.436 n=10+10)
BleveQuery 537MB ± 0% 538MB ± 1% ~ (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat 485MB ± 1% 483MB ± 0% ~ (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1 180MB ± 2% 193MB ± 3% +7.19% (p=0.000 n=10+10)
GopherLuaKNucleotide 39.8MB ± 3% 46.0MB ±20% +15.56% (p=0.000 n=9+10)
MarkdownRenderXHTML 22.1MB ± 3% 25.5MB ± 7% +15.45% (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest 5.70GB ± 1% 5.68GB ± 1% -0.45% (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest 5.93GB ± 1% 5.91GB ± 2% ~ (p=0.631 n=10+10)
Tile38KNearestLimit100Request 7.47GB ± 1% 7.46GB ± 0% ~ (p=0.579 n=10+10)
name old peak-VM-bytes new peak-VM-bytes delta
BiogoIgor 802MB ± 0% 803MB ± 0% +0.11% (p=0.000 n=10+10)
BiogoKrishna 5.24GB ± 0% 5.24GB ± 0% +0.01% (p=0.001 n=10+10)
BleveIndexBatch100 1.79GB ± 0% 1.79GB ± 0% +0.05% (p=0.000 n=8+8)
BleveQuery 3.53GB ± 1% 3.53GB ± 1% ~ (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat 1.21GB ± 0% 1.16GB ± 4% ~ (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1 875MB ± 0% 884MB ± 0% +1.02% (p=0.000 n=10+10)
GopherLuaKNucleotide 733MB ± 0% 734MB ± 0% +0.11% (p=0.000 n=9+10)
MarkdownRenderXHTML 733MB ± 0% 734MB ± 0% +0.10% (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest 6.42GB ± 0% 6.39GB ± 1% ~ (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest 6.62GB ± 1% 6.61GB ± 2% ~ (p=0.927 n=10+10)
Tile38KNearestLimit100Request 8.16GB ± 1% 8.18GB ± 0% ~ (p=0.649 n=10+8)
name old p50-latency-ns new p50-latency-ns delta
Tile38WithinCircle100kmRequest 144k ± 3% 159k ± 3% +10.56% (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest 215k ± 1% 232k ± 2% +7.91% (p=0.000 n=9+10)
Tile38KNearestLimit100Request 347k ± 2% 373k ± 1% +7.21% (p=0.000 n=10+10)
name old p90-latency-ns new p90-latency-ns delta
Tile38WithinCircle100kmRequest 908k ± 6% 956k ± 9% +5.22% (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest 1.07M ± 4% 1.11M ± 5% +4.33% (p=0.001 n=10+10)
Tile38KNearestLimit100Request 1.03M ± 3% 1.05M ± 4% +2.64% (p=0.011 n=10+10)
name old p99-latency-ns new p99-latency-ns delta
Tile38WithinCircle100kmRequest 7.55M ± 9% 7.93M ±13% ~ (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest 7.81M ± 8% 8.39M ± 2% +7.36% (p=0.000 n=10+8)
Tile38KNearestLimit100Request 2.03M ± 4% 2.08M ± 5% +2.52% (p=0.019 n=10+10)
name old ops/s new ops/s delta
Tile38WithinCircle100kmRequest 9.73k ± 7% 9.16k ±11% -5.83% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 8.31k ± 6% 7.77k ± 4% -6.55% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 11.8k ± 3% 11.3k ± 3% -4.51% (p=0.000 n=10+10)
- Introduction: Bryan Boreham, Grafana Labs
- Questions within the team about whether useful information has been derived from Go execution traces.
- Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
- Michael K: Interesting data point about usability.
- Michael P: Hard to link application behavior to trace.
- Bryan: Example: channels. Still don't really know where to find that data.
- Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
- Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
- Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
- Tooling on the way to get open sourced.
- Heap analysis plan (#57447)
- Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
- https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
- Austin: Does Perfetto do better with instantaneous events?
- Michael P: Yes, there's a 20px wide arrow but we have so many.
- Rhys: Hold shift, draw a box. If you aim well, you get what you want.
- Rhys: Why is there only one timestamp on some events?
- Austin: We can add another timestamp.
- Michael P: Syscall fast path does a lot less.
- pprof labels in traces
- Michael K: I think I've unblocked Nick. Michael and I are reviewing.
- runtime.gentraceback cleanup
- Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
- Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
- Action items:
- We're slowing down for the holidays, so no strong expectations
- Michael K:
- Try to land execution trace benchmarking.
- Might look into heap analysis stuff.
- After break, might want to start working on trace format more seriously.
- Happy holidays!
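As a point of comparison for the "make gentraceback an iterator" idea above, the public runtime.CallersFrames API already has the pull-one-frame-at-a-time shape being discussed. This is the existing public API, shown only as an analogy for the internal change, not the change itself:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(1, pcs) // 1 skips the runtime.Callers frame itself

	// CallersFrames is an iterator: each Next call yields one frame,
	// including inline-expanded frames, until more is false.
	frames := runtime.CallersFrames(pcs[:n])
	for {
		frame, more := frames.Next()
		fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
}
```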
Comment From: mknyszek
2023-01-05 Sync
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole
- Organizational stuff
  - @mknyszek is out today, @aclements running the meeting.
  - @dashpole from Google OSS Telemetry joining us.
- Benchmarks: Can we add a goroutine ping pong example? (Felix)
  - Tracer benchmarks all show relatively low overhead. Can we add a benchmark that demonstrates the worst case?
  - Austin: Sweet probably isn't the right place because that's application-level. Maybe add to Bent?
  - Felix: Next step on these benchmarks? Land MK's trace benchmark support?
  - Austin: It's certainly fine to land. We don't have a good way to integrate these "extra dimensions" into our automated benchmarking.
  - AI(austin): Bring up monitoring extra benchmarking dimensions.
  - Austin: "Unit benchmarks" would be the perfect place for a ping pong benchmark (we already have one in the main repo), but we never quite got to integrating these into automated monitoring.
- Are only GC STWs recorded? Would it make sense to record other STW events (read metrics, goroutine profile, heap dump)? (Felix)
  - Rhys: You get ProcStop events.
  - Austin: Yeah, you're right that we trace high-level GC STW events.
  - Rhys: Currently the GC traces the "best case" STW, which can be really misleading.
  - Austin: We could definitely have a "stopping the world" and a "world stopped". Maybe don't need that for start.
  - Felix: That would be great. We're investigating rare long STWs right now.
  - Rhys: Starting the world can take a while. Problems with heap lock contention. I would love to have more visibility into the runtime locks.
  - Austin: Runtime locks are a bit of a mess. I also wonder if they should be "scalable".
  - Rhys: I'd love to discuss that. C&R office hours?
  - Austin: Perfect.
  - Conclusion: Let's add events for all STWs and also separate "stopping" from "stopped".
- Updates on Perfetto UI (Felix and Nick)
  - Add to UI CL: https://go.dev/cl/457716
  - Felix: The JSON currently produced by the trace tool is basically compatible with Perfetto. Doesn't let us open really large traces without splitting, which was one of the hopes. And it takes a while to load. I was able to use the command line tools to load a 280MB trace into a 9.8GB JSON trace and load that in Perfetto, but it took 20 minutes. Nick has been working on outputting proto directly, which will hopefully produce less data than JSON.
  - Rhys: When I tried this a while ago, the connection of data flow wasn't quite right.
  - Felix: This CL doesn't fix that. I'm hoping it's an upstream issue, which they're pretty responsive to. I'm hoping protobuf will just make it go away, since that's their canonical input.
  - Nick: Stack traces seem to be missing from protobuf, which we definitely want. We might need upstream changes to support that.
  - Felix: I suspect there may be some long tail of issues. But the initial plan would be to keep both viewers until we feel this is solid.
  - Austin: How does the streaming work?
  - Felix: They have an in-memory column store with a SQL interface on top of it. Large traces would still be a problem because they'd need to be loaded fully into memory.
  - Austin: In principle we could swap out that column store for our own streaming thing, but that sounds like a significant amount of work.
  - Felix: On Go Time someone said they only use runtime trace when they're really desperate and then they can't figure it out anyway. Most people don't think about their program from the perspective of the scheduler. I'd like to have different pivoting, like one timeline per G (or M). We sort of have that in the goroutine analysis, but that only shows on-CPU time. Dominik did that in gotraceui.
- Updates on pprof labels (Nick)
  - Nick: In MK's recent comments on the pprof labels CL, he wondered about a size limit on labels being recorded in the trace. Thinking about trace overhead. Users can also add arbitrary logs (limited by trace buffer size). My thought is that users are deciding to make these as big or as small as they want.
  - Austin: My reaction is "do what the user said".
  - Rhys: It seems like we already don't have a limit on the pprof labels (number/length/etc) and maybe it would have been good to have a limit, but we already don't.
  - Bryan: For me it's more important to be able to find out how much damage you're doing with this data. Inevitably people want one more byte than the limit and will be frustrated.
  - Felix: Two sides to this problem: how to get the data in the trace while keeping overhead low, and the other is keeping the memory usage low for keeping all these labels. For trace overhead, I'm thinking we want two or three levels of filtering: filter what events, filter events by properties (e.g., duration). JFR supports both of these. And potentially a way to modify events (maybe too far), like truncation. At some point you can almost guarantee fixed-cost tracing. E.g., turn off everything except profile events; now you have timestamps on profile events without all the other overhead.
  - Austin: MK and I have definitely been thinking in that direction. The current trace viewer is almost purpose-built for analyzing the scheduler and needs to understand how a lot of events relate. But if we open up reading traces, the trace viewer becomes just another tool and maybe it's fine for it to say "I need these events" (kind of like "perf sched" or similar).
  - Felix: I can ask my Java colleagues about how this works in JFR.
  - Rhys: Curious how you're thinking about filtering.
  - Felix: Simpler is better. You could imagine a callback, but that's not simple. Probably something like runtime/metrics where you can discover the events and select.
  - Rhys: Definitely need a header saying which events are included.
  - Felix: Agreed. Also nice for viewers so they don't have to hard-code all of the events.
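A minimal sketch of the goroutine ping-pong worst case requested above: two goroutines bouncing a token over unbuffered channels, so nearly all of the work is blocking and unblocking and per-event tracing overhead is maximized. The benchmark below only illustrates the shape; it is not the benchmark that eventually landed:

```go
package pingpong

import "testing"

// BenchmarkPingPong bounces a token between two goroutines over
// unbuffered channels. Almost all of the time is spent blocking and
// unblocking, so per-event tracing cost shows up at its worst.
func BenchmarkPingPong(b *testing.B) {
	ping := make(chan struct{})
	pong := make(chan struct{})
	done := make(chan struct{})

	go func() {
		for {
			select {
			case <-ping:
				pong <- struct{}{}
			case <-done:
				return
			}
		}
	}()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ping <- struct{}{}
		<-pong
	}
	b.StopTimer()
	close(done)
}
```

Comparing go test -bench PingPong with and without the -trace flag gives a rough worst-case number for tracing overhead.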
Comment From: mknyszek
2023-01-19 Sync
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole
- Felix: gentraceback iterator refactoring
- Felix: What's the progress?
- Austin: Made progress. Running into issues with write barriers and trying to knock down all the write barriers one by one. Big open question of testing; so many horrible corner cases. No good answers.
- Felix: Tried to do it incrementally instead of all at once; also painful. RE: testing, would it be useful to have the ability to instrument a PC and do a traceback from there?
- Austin: That would help. The worst parts are weird though, like signals. If we had a good way to inject a symbol, like a breakpoint, that would help a lot.
- Idea: could use hardware breakpoints via perf_event_open (Linux only, but at least architecture-independent), which could get enough coverage for Austin to be happy.
- Could potentially synthesize other signal tests from a single signal.
- Felix: I'll give it a shot.
- Michael K: What work could we do in parallel?
- Felix: Could write a frame pointer unwinder separately for tracing just to get an idea of the overhead.
- Austin: +1. Tricky things include logic in gentraceback for filtering out frames. Maybe it doesn't matter for the trace viewer (i.e. don't filter). Also inline unwinding. Trying to totally separate inline unwinding in gentraceback. Once it's its own separate thing, it'd be straightforward to plumb that into a frame pointer unwinder.
- Michael K: Could we skip inline unwinding for the experiment?
- Austin: Yeah.
- Michael P: +1 to separating out inline unwinding. Already "runtime_expandFinalInlineFrame" in the runtime which is a good reference point for this.
- Felix: Also all the complexity with cgo traceback, but we should just ignore that for the experiment.
- Michael K: The cgo traceback tests are also really flaky, and if we could have better testing around that that would be great.
- Felix: Perfetto UI blues … (timeline bug, link bug, stack traces, large traces, small screens, protocol buffer format) … gotraceui w/ wasm? Having an online tool with independent release cycle is tempting?
- CL out that makes Perfetto work. Limitations:
- Limited for very large traces as-is.
- Doesn't seem easy to make it work as well as go tool trace (bugs), e.g. timelines not named correctly, events not connected correctly.
- Harder: getting stack traces to show up. Nick has tried to make it work. Protobuf format doesn't have an obvious stack trace format?
- Nick: Not a one-to-one mapping between Catapult format and Perfetto. Can stick a single location in the Perfetto format, but not a full stack trace. Little things in the protobuf format that aren't well-documented. e.g. string interning only works if you include a number in the header.
- Michael K: MP and I looked into this. Perfetto knows how to do this for some traces, but it’s built into a C++ library, so we’d have to rewrite that in Go or call into it from Go. I’m not sure it even has strong backwards compatibility.
- Michael P: There is the Perfetto tool that runs the RPC server. (trace_processor.) That loads into a SQLite in-memory DB, but does do better than the fully in-browser implementation. It can do bigger traces, though is still limited. That seems like enough of an improvement to me.
- Felix: I have a 280MB trace that gets split into 90 parts for 15 seconds on a busy server. Maybe we should start with deciding what size trace we want to have a good experience for.
- Michael K: I think 280MB is a big trace, though it’s only 15 seconds. I think we should be targeting bigger than that. It’s easy to get a 1GB trace. But we can start with Perfetto as long as it’s better and work toward that.
- Austin: Is that better with Perfetto?
- Felix: I think it would be better. Maybe 5x better, so a second at a time (don’t quote me on that).
- Michael P: The trace_processor is better, but still limited by the in-memory SQLite DB. Presumably that could be on disk. I don't know if the trace loading is also linear in the trace size.
- Rhys: What do you even do with an execution trace that large? How do you get value out of that?
- Felix: This trace was from a colleague from an instance that was struggling with pauses. It looked like a straggling procstop. It was debugging the behavior of a whole application that was behaving poorly.
- Rhys: So you were looking for behavior that was pretty zoomed-out.
- Felix: Yeah.
- Michael K: Part of the problem with existing traces is the usability of this. I think it’s a valid question about whether big traces are all that useful. Sometimes you’re not even really sure what you’re looking for. Say I wanted to run a full trace on every invocation of the compiler. You don’t necessarily know what you’re looking for to improve compiler speed.
- Austin: I bet if you were to profile the space of a large trace file, the vast majority of that would not be useful to you looking at it at a high level. Suggests a solution here for filtering is to just reduce what goes into the trace.
- 280MB Trace Size Breakdown
- Michael K: Maybe just proc start/proc stop for what Felix was describing.
- Rhys: But once you find the problem, you want more detail. It's hard to catch the end of a garbage collection cycle because of the rules of starting a trace during a GC cycle.
- Michael K: Fixing the mark phase issue should be easier than before.
- Austin: Awesome breakdown!
- User group said "please don't do this" because Perfetto isn't nice to small screens.
- Felix: gotraceui
- Viewing timelines for goroutines is great.
- Would like Dominik to talk about gotraceui some more.
- I want to be intentional about choosing Perfetto.
- Michael K: I think the dependency on gio was a concern.
- Dominik: Gio (the UI library I use) supports wasm, so it should be fairly straightforward to have gotraceui run in the browser if we want to go down that road.
- Dominik: I still rely on loading entire traces into memory (but using significantly less memory than current go tool trace), but with the upcoming format changes, streaming data might be possible. We currently load everything into memory because when the user zooms out far enough, we need all events to compute what we display. But we could probably precompute these zoom levels, similar to mipmaps.
- Dominik: For the current trace format, gotraceui needs roughly 30x the size of the trace in memory. so a 300 MB trace needs 9 GB.
- Michael K: I have been thinking about an HTML UI that does something like Google Maps tiles to scale. We could skip a lot of work if we could take gotraceui as the UI, but port it into something more portable than Gio. OTOH, it’s even more work to build something from scratch.
- Dominik: WRT gotraceui's use of Gio, there'll be a pretty rich UI, and I don't fancy writing UIs in HTML/JS. But all of the processing of trace data could live externally.
- Michael P: It’s not necessarily a hard requirement that the Go project itself ship a trace viewer. We have to now because there’s no API. But if we shipped an API, it wouldn’t be a hard requirement. Much like we don’t ship a debugger.
- Michael K: One option is that we ignore the UI situation entirely and build something that you can parse separately and ship something really bare later. In the meantime, point at a little tool that will shove it into trace_processor and point people at Perfetto. For a brief time, stop shipping our own. It’s very convenient that you only need a Go installation to view these traces, but I think you’re right that we could stop shipping a UI. We could also keep the existing UI working/limping while we do other things in parallel.
- Felix: Is Dominik looking for contributors? (That comes with its own overheads)
- Dominik: I'm usually not big on contributions in the form of code; but ideas and feedback are hugely appreciated
- Michael K: We don’t have to make a decision on using Perfetto now. Maybe we should plug along for two more weeks (with Perfetto) and figure out if we can fix the issues without too much effort, and then make a hard decision on what to do at the next meeting.
- 👍
- Felix: traceutils anonymize & breakdown and ideas: (flamescope, graphviz, tracer overhead)
- Implemented anonymization of traces. Breakdowns, too.
- Tracer overhead tool that uses profile samples in the trace to identify overheads.
- Felix: Format: Consistent message framing, remove varint padding for stacks
- 4 different cases for how an event can be laid out.
- Maybe a way to skip messages and layouts it doesn't understand.
- Austin: self-descriptive header giving lengths for each opcode
- Michael K: Any state in the trace makes it hard to push traces up into OTel, since that's completely stateless.
- Felix: We’re actually trying to do two things in OTel. Including binary data blobs, like pprof and JFRs. And something to send stateful things like stack traces, etc, where you can refer back to them efficiently.
- David: For trace I wouldn’t expect a stateful protocol to be introduced any time soon. But for profiling it may be a possibility.
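To make the "self-descriptive header giving lengths for each opcode" idea above concrete, here is a toy sketch; the opcodes, argument counts, and varint encoding are all invented for illustration and are not the real trace format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Toy "header": the producer declares, per opcode, how many varint
	// arguments follow it. A consumer can then skip opcodes whose
	// meaning it doesn't know, as long as they're declared here.
	header := map[byte]int{
		0x01: 2, // e.g. "goroutine start": goroutine ID, timestamp delta
		0x02: 1, // e.g. "proc stop": timestamp delta
		0x7f: 3, // an event type this consumer doesn't understand
	}

	// Toy event stream: known event, unknown event, known event.
	stream := []byte{0x01, 5, 10, 0x7f, 1, 2, 3, 0x02, 7}

	for i := 0; i < len(stream); {
		op := stream[i]
		i++
		n, ok := header[op]
		if !ok {
			fmt.Printf("opcode %#x missing from header; cannot continue\n", op)
			return
		}
		args := make([]uint64, 0, n)
		for j := 0; j < n; j++ {
			v, sz := binary.Uvarint(stream[i:])
			if sz <= 0 {
				fmt.Println("malformed varint")
				return
			}
			args = append(args, v)
			i += sz
		}
		switch op {
		case 0x01, 0x02:
			fmt.Printf("event %#x args %v\n", op, args)
		default:
			// Declared in the header but semantically unknown: skipped cleanly.
			fmt.Printf("skipped unknown event %#x\n", op)
		}
	}
}
```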
Comment From: mknyszek
2023-02-02 Sync
Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic
- Felix: Discuss results from frame pointer unwinding experiments (blog, sweet results) and next steps
- Targeted ping-pong example, worst case. Worth noting that the stack depth in that benchmark is 2. Went from +773% -> +30%, apparently doing 50% more work too!
- Sweet: 10% -> 2% overhead!
- Michael K: Michael P mentioned missed cases.
- Michael P: Inlined frames are one example. Maybe we just accept slightly less accurate traces in the tracer.
- Austin: +1 to missing inlined frames, but we can also expand that after the fact.
- Michael K: Do you need the binary for that?
- Austin: Today, yes.
- Felix: The tracer already de-duplicates stack traces. If we do inline expansion at the end, there's probably not that much work to do.
- Michael P: Other avenue, do we need stack traces on every event? Maybe remove stack traces for some events?
- Michael K: Where does the rest of the time go?
- Felix: In the blog post. Frame pointer unwinding is only 9% of the trace overhead. 28% is cputicks. 21% is stack put.
- Austin: Shocked that cputicks is 28%. It's one instruction. I guess that's a good sign?
- Austin: (FP unwinding is also relevant for #53286. In that case it’s the kernel’s FP unwinder, but it means our FP data is going to have to be high quality for both.)
- Thepudds: Or maybe an option for sampling of stack traces?
- Michael K: I think it depends. As traces are used today, you probably want 100% sampling. For larger scale aggregation, I think it's a solid option.
- Michael K: Dream of nanotime to line up clocks.
- Austin: It might not be that bad. RDTSC is serializing so the extra math in nanotime might not make much of a difference in overhead.
- Michael K: We should definitely pursue this, at least for tracing.
- Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.
- Austin: Status of FPs is better than it once was. Saving grace of the tracer is you often don't have an assembly frame on the stack. Talked about making vet complain if you clobber the frame pointer in assembly code. Would be surprised if there are problems in the compiler generated code; worry much more about assembly.
- Felix: Worried about stack shrinking / relocation. Growing shouldn't happen while in unwinding, but not sure about shrinking.
- Austin: I think you always see a fully formed stack.
- Felix: There's no chance of seeing the stack mid-move?
- Austin: The goroutine that's getting moved has to be stopped.
- Nick: If unwinding happens asynchronously then it's a problem, like CPU profiling. We could use gentraceback in the difficult cases.
- Felix: Plan on working on better unwind testing. That machinery could be used to harden frame pointer unwinding as well.
- Michael K and Austin: Not a blocker to have the testing.
- Austin: FP on x86 is specified as part of the Go internal ABI. If the compiler is messing that up that's a violation of the ABI and definitely a bug. Doesn't apply to hand-written assembly.
- thepudds: One of the older CLs mentioned its approach depended on the stack not being copied while walking the frames, along with the comment “currently ok, but won't be if we preempt at loop backedges”... but maybe that old concern is not a current concern....
- Michael K: I think loop backedges aren't a concern, and async preemption as it exists shouldn't be an issue.
- Michael P: Traceback itself would disable preemption just for consistency; because it's in the runtime package, we won't ever async preempt it anyway.
- Austin: I'm not sure why loop backedges would be a concern.
- Michael K: I don't think we should block on inline expansion, but maybe cgo tracebacks.
- Austin: As an intermediate step, use gentraceback for if there’s a cgoTracebacker and cgo on the stack. Will work for 99% of our users.
- Felix: Like the idea of making it the new default, but with the ability to switch back.
- Michael K: We could add a GODEBUG flag
- Felix: Flight recorder / ring buffer mode
- Felix: We’d like to capture traces of slow spans. Wait for a p99 response and then get the last N MB of trace. I’m currently working on an experiment to see if this can be done in user space.
- Michael K: I think finding the oldest batch is an O(N) operation. Ordering the batches is difficult because we assume everything will show up eventually.
- Austin: The tracer is really stateful, so it's really difficult to actually manage a ring buffer. debuglog is a ring buffer, and what it does is consume its own format in order to manage a snapshot of the state.
- Felix: I’d be okay with getting a non-perfect trace at the end. At least understand what the goroutines are doing. Maybe we could every once in a while emit a “synchronization” event. If a complete redesign of the format is required, [flight recorder mode] is something we’d be interested in.
- Michael K: I’d like to find out what the overhead of writing the trace is. Say you have no stack traces, where is the rest of the time going? That’s important information for redesigning the trace format. I’ve already been thinking about redesigning the format. At the cost of using more space, it has to end up less stateful. Regularly synchronizing is one way to do that. That’s kind of where I was going: a “trace” is really a collection of self-contained traces. With the tooling able to be more resilient at the edges. Synchronization wouldn’t necessarily be STW, but you have a ragged barrier across the Ps that sync them all to the next trace chunk. That gets complicated in a ring buffer. I was thinking of gathering the requirements for a new trace format. Because there’s so much state, it’s hard to make it completely stateless without ballooning the trace.
- Felix: JFR does that ... splitting the stream up into self-contained chunks.
- Michael K: We’re definitely on the same page [wrt flight recorder]. The Go team arrived at this, too. We’re also trying to make ELF core dumps the source of truth for heap analysis. Ideally we’d be able to pull the ring buffer out of a core dump so you can see exactly what was happening before crashing.
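A sketch of the user-space flight recorder experiment Felix describes above: tracing into a bounded in-memory ring and snapshotting it on demand. As the discussion notes, the current trace format is stateful, so a tail captured this way is not generally parseable on its own; this only shows the shape of the idea.

```go
package flightrec

import (
	"runtime/trace"
	"sync"
)

// RingWriter keeps roughly the last limit bytes written to it. It is a
// toy: a real flight recorder would need to cut on trace batch
// boundaries and re-emit enough state for the tail to be parseable.
type RingWriter struct {
	mu    sync.Mutex
	limit int
	buf   []byte
}

func (w *RingWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf = append(w.buf, p...)
	if excess := len(w.buf) - w.limit; excess > 0 {
		w.buf = w.buf[excess:] // drop the oldest bytes
	}
	return len(p), nil
}

// Snapshot returns a copy of the retained tail of the trace stream.
func (w *RingWriter) Snapshot() []byte {
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]byte(nil), w.buf...)
}

// Start begins tracing into a bounded in-memory ring.
func Start(limitBytes int) (*RingWriter, error) {
	w := &RingWriter{limit: limitBytes}
	if err := trace.Start(w); err != nil {
		return nil, err
	}
	return w, nil
}
```

On a p99-slow request the application would call Snapshot (and eventually trace.Stop) and ship the bytes off for inspection.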
Comment From: qmuntal
Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.
FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whatever the fix for that turns out to be, it would be good to take SEH into account as well.
Comment From: mknyszek
2023-02-16 Sync
Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh
- 468301: runtime: delete gentraceback
- Austin: Needs more testing.
- Austin: Nice things to do as a result, listed in the issue. e.g.
- Simpler defer processing
- CPU profiles have a low limit on the number of frames they'll capture.
- Iterator makes this much more tenable to fix.
- Years-old squirrely bug in the race detector.
- Felix: I’m happy to look into testing using perf, but I’m not sure when I can get to it.
- Rhys: If there are more frames than you want to record, could you add some context by including N outermost frames and M innermost frames. Maybe a “runtime._Elided” frame in the middle.
- Michael P: We’ve thought about doing that for panic tracebacks.
- 463835: runtime: frame pointer unwinding for tracer (Felix Geisendörfer): WIP, but almost ready for review
- Are slight stack trace differences acceptable?
- Michael K: I think that’s fine. As we move toward letting people parse the format, I think lining up traces with stacks from other sources could become more of a problem.
- Felix: The current patch passes most of the tests of tracebacks in traces.
- Should it use an unwinder interface similar to austin’s patches?
- Could systemstack be changed to push frame pointers? Otherwise the caller frame is lost. Naive attempts to make this change caused crashes.
- Austin: Yes please.
- Weird issue with syscalls on BSDs losing a frame.
- Austin: That’s probably lazy assembly.
- Felix: Another option is to only enable FP unwinding on Linux for now.
- Austin: As long as it works on Linux, Windows, and Darwin I’m happy.
- Cgo unwinders
- Austin: It’s fine to take the slow path if the current goroutine has cgo frames and there’s a cgo unwinder.
- Felix: I got inlining to work (when traces are finalized). Benchmark numbers are still holding.
- Are slight stack trace differences acceptable?
- Michael K: Once all of the backtrace stuff is settled, I want to try using the monotonic clock (nanotime) rather than CPU ticks.
- Nick: Could you record nanotime at the beginning of a batch and then CPU ticks after that?
- Michael P: To do that safely, you’d need to know when you migrate CPUs. Linux’s restartable sequences can get you that.
- Michael K: There might not be a performance gap between nanotime and cputicks.
- Austin: If there’s even a performance gap, you could push more of the nanotime computation into the trace reader.
$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
│ CPUTicks │ Nanotime │
│ sec/op │ sec/op vs base │
*-8 10.75n ± 0% 16.11n ± 0% +49.88% (p=0.000 n=20)
- runtime: copystack doesn't adjust frame pointers on arm64 (#58432, Felix Geisendörfer)
- It was relatively easy to fix once I understood what was going on, but there appear to be dragons there.
- Boolean in the runtime does a double check of FPs on stack copies.
- Would like to treat arm64 as a separate issue, but I plan to get to it.
- 460541: runtime: reduce sysmon goroutine preemption (Felix Geisendörfer)
- Michael P: There are likely issues here with special cases in the scheduler. Not sure they're easy to fix.
- cmd/pprof: macOS 12.6.1 (M1) profile overcounts system calls (again) #57722 (Felix Geisendörfer)
- Michael P: C reproducer and handing off to Apple (if it works) seems like a reasonable next step. No guarantee we'll get a fix though.
- proposal: runtime: add per-goroutine CPU stats · Issue #41554 (Felix Geisendörfer)
- Felix: Initial justification was along the lines of billing, which seems better served by pprof. Then it shifted to fast control loops to throttle users. It seems better to have scheduling priorities, but barring that it seems good to let user space do priorities.
- Michael P: We’ve been discussing having tracing that’s cheap enough to have on all the time, and a parsing library. Maybe a user could do this by enabling tracing and parsing their own trace. Is this generally the right approach to user throttling at all?
- Rhys: I think a cheap trace that can be parsed in the app is good and flexible. I’m not sure per-goroutine stats is the right approach. E.g., if I use the net/http client, there are a bunch of goroutines involved that I don’t control but I want to understand the latency of.
- Felix: One trade-off of the trace route is the latency of reading your own trace.
- Rhys: It would be useful if the app could say, “I need a ragged barrier ASAP and I’m willing to take some performance hit.”
- Michael K: The other complication is how fast we can make the parsing. That might add unacceptable latency.
- Felix: I think the “explain analyze” case is not the best example. The most difficult is trying to throttle a user of the database that’s doing something you don’t want. In that case you don’t know ahead of time, so you’d be doing the ragged barrier all the time.
- Michael P: I think that’s a good argument for actual priorities in the scheduler. If you have some background goroutine watching for bad behavior, that might not get scheduled if there’s bad behavior.
- Austin: Swirling around problems that people have been thinking about for decades. Would love to see a summary of the current state-of-the-art is here.
- Michael K: Probably only OS APIs.
- Austin: That's not a bad thing. If it's a good API, we can consider replicating it.
- AIs
- Michael K: Writing down trace requirements in earnest
- Michael K: Testing for x/debug
- Michael P: Need to review Austin's CL stack.
- Michael P: debug/gosym proposal.
- Felix: Clean up the tracer FP unwind patch (for amd64) to get it ready for review.
- Austin: Try to keep moving along gentraceback stack. Think about test-hook-coverage aspect.
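Rhys's suggestion above, keeping the N outermost and M innermost frames with a marker in between, is easy to sketch; runtime._Elided is the hypothetical placeholder name from the notes:

```go
package main

import "fmt"

// truncateStack keeps the m innermost and n outermost frames of a stack
// (listed innermost first) and inserts a marker where frames were
// dropped. "runtime._Elided" is the hypothetical placeholder name from
// the discussion above.
func truncateStack(frames []string, m, n int) []string {
	if len(frames) <= m+n {
		return frames
	}
	out := make([]string, 0, m+n+1)
	out = append(out, frames[:m]...)             // innermost frames
	out = append(out, "runtime._Elided")         // dropped middle
	out = append(out, frames[len(frames)-n:]...) // outermost frames
	return out
}

func main() {
	stack := []string{"leaf", "f3", "f2", "f1", "main.main", "runtime.main"}
	fmt.Println(truncateStack(stack, 2, 2))
	// Prints: [leaf f3 runtime._Elided main.main runtime.main]
}
```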
Comment From: mknyszek
2023-03-02 Sync
Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham
- Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
- Most of the time is being spent detangling the existing tracer, documenting it, and using that to justify next decisions. Hopefully I'll have a draft to share before next time.
- [Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
- Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
- There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
- Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
- This choice forces us into avoiding a stop-the-world.
- Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
- Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings.
- Trace events are sequences of 4-byte words whose internal structure respects byte boundaries and field alignment, to allow encoding/decoding events to just be memcpys and state management.
- Using Felix's 280 MiB trace breakdown as a motivating example. By my calculations the current design would use around 10% more. Personally that seems acceptable for the other gains.
- Every G event has an explicit G ID, but it's derived from a "G context" event. G IDs are also compressed.
- Michael K: We could make the stack table faster by only checking the hash instead of an exact match. Small chance of error.
- Rhys: Let's be cautious about making sure that traces actually work.
- Michael K: That's a good point. We should put an explicit bound on the likelihood of error. If it's astronomically small, is that fine?
- Rhys: Astronomical is good.
- Rhys: Would the new trace format still enumerate every goroutine? Currently can get stuck in many-millisecond STW waiting for tracing to enumerate all goroutines.
- Michael K: My plan was no. Goroutine profiles if you want that?
- Rhys: That's good. Yeah, you should be able to correlate a goroutine profile with a corresponding STW event in a trace. Happy about no STW in general for traces too.
- Rhys: RE: correlating things with traces, do we want to keep things orthogonal in general? Thinking about CPU profile events in traces.
- Michael P: I see where you're coming from in that you might want just the CPU profile events from a trace (with timestamps) and it's weird to get the whole trace and throw most of it away. We discussed having an API for configuring the trace and which events get emitted, so that might be a good place for that.
- Austin: There's a relevant issue about making CPU profiles more configurable as well, so maybe that's a good place for it too?
- Michael P: I think there are a lot of API questions here. Do you configure CPU profile in tracing at the CPU profile side or at the tracing side? The most natural way sounds like the tracing side because that's your actual output format, but I'm not sure. And then it gets complicated if you turn on CPU profiling in the tracing API and then you separately turn on CPU profiling, is that allowed? Right now you can't turn on profiling twice. And that's even more complicated, if we let you figure the sampling rate and they're not the same.
- Rhys: One of the difficulties that I've had in using execution traces and CPU profiles at the same time is that even though the CPU profile doesn't exactly stream its output while it's going. It's tricky to juggle two different output formats. At the same time that I'm trying to put into a single zip file to upload to blob storage. A single buffer would be handy.
- Michael P: A single buffer is ideal, but we don't have a converter that could pull a CPU profile out of a trace. We're missing information.
- Rhys: For one, we're definitely missing goroutine labels, though there's a patch out for that. We're also missing /proc/<pid>/maps for binary/symbol information.
- Austin: It occurs to me that Linux perf basically starts with /proc/<pid>/maps.
- Michael P: Perhaps we should also dump build information. We've been brainstorming about including this information for PGO.
- Michael K: There's room for as much as we want at the beginning of the trace, basically, so I'm all for adding more there.
- Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
- Felix: Would the old trace format fit in the new parser?
- Michael K: That was my goal. We'd have to do some retrofitting, but the old parser already exists. Caveat: parsing old traces would still have the same overall properties as the trace parser currently does.
- Felix: Frame pointer unwinding patch for tracer is ready to review. It’s only amd64 for now and a bit rough around the edges. We should discuss what needs to be done before landing. Cgo is still missing, but I’m working on adding that.
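As a rough bound for the hash-only stack table question above: if stack identity were checked only by a 64-bit hash, the chance of any collision among N distinct stacks is about N(N-1)/2 divided by 2^64 (the birthday bound), which works out to roughly 2.7e-12 for N = 10^4 distinct stacks and 2.7e-8 for N = 10^6. Whether that counts as astronomically small is exactly the kind of explicit bound Rhys asked for.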
Comment From: dominikh
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
Comment From: mknyszek
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has its buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)
Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, moving partitions forward doesn't have to even involve preemption.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
Comment From: dominikh
In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
That has two implications, however:
1. goroutines that don't unblock during the trace will be unaccounted for
2. the states of all goroutines can't be determined without looking at the entire trace
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but would be a barrier to entry for people less familiar with the underlying file format.)
Comment From: mknyszek
That has two implications, however:
- goroutines that don't unblock during the trace will be unaccounted for
- the states of all goroutines can't be determined without looking at the entire trace
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?
Both of those things are good points.
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps there's some use-case I'm missing entirely. FTR, I fully recognize that we're losing something here in the trace, but I argue the net benefit is worth that cost.
Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
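For illustration, a minimal sketch of that "start the trace, then immediately grab a goroutine profile" pattern, using only the existing runtime/trace and runtime/pprof APIs (file names are placeholders):

```go
package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	tf, err := os.Create("prog.trace")
	if err != nil {
		panic(err)
	}
	defer tf.Close()

	// Start the execution trace first...
	if err := trace.Start(tf); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// ...then immediately snapshot every goroutine. The goroutine
	// profile's stop-the-world can later be correlated with the STW
	// event near the start of the trace to recover the state of
	// goroutines that never run while tracing is active.
	gf, err := os.Create("goroutines.pb.gz")
	if err != nil {
		panic(err)
	}
	defer gf.Close()
	if err := pprof.Lookup("goroutine").WriteTo(gf, 0); err != nil {
		panic(err)
	}

	// ... the workload to be traced runs here ...
}
```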
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)
That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P
My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).
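Purely as a hypothetical illustration of that shape (opaque events, lazy decoding, fields behind methods) and not a settled design, a low-level API might look roughly like this; every name here is made up:

```go
package traceparse

import "io"

// Reader streams events out of an encoded trace without expanding
// them all into memory up front. Hypothetical; for illustration only.
type Reader struct {
	// ... decoding state would live here ...
}

// NewReader consumes trace data lazily from r.
func NewReader(r io.Reader) (*Reader, error) {
	// ... set up buffers, read the header ...
	return &Reader{}, nil
}

// ReadEvent returns the next event in the stream, or io.EOF at the
// end of the trace.
func (r *Reader) ReadEvent() (Event, error) {
	// ... decode the next event in place ...
	return Event{}, io.EOF
}

// Event is opaque: fields are exposed through methods so the
// implementation can point into the encoded input rather than copy
// everything out eagerly. There is deliberately no Link-style field,
// since that would require keeping the whole trace in memory.
type Event struct {
	// ... offset into the underlying buffer, cached header fields ...
}

// EventKind identifies the type of an event (GoStart, GoBlock, ...).
type EventKind uint8

func (e Event) Kind() EventKind   { return 0 }
func (e Event) Goroutine() uint64 { return 0 }
func (e Event) Timestamp() int64  { return 0 }
```

The eventual API could differ substantially; this only illustrates why a Link-style field doesn't fit a streaming design.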
Comment From: dominikh
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.
More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
At least this would give people the choice if they want to tolerate STW for more detailed traces.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
Probably, yeah.
Comment From: mknyszek
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
It's not quite that it didn't pan out and more that it just doesn't work with a per-M approach given other design constraints.
The ragged barrier I mentioned in an earlier design sketch is the forEachP one, which is ultimately still P-focused. Part of the reason I want to switch to a per-M approach is to remove the GoSysExit complexity that comes from the fact that goroutines can in fact run without Ps sometimes. That complexity is part of the event's semantics, so it tends to leak everywhere.
A per-M approach can side-step a lot of that complexity, but it means we need a way to synchronize all Ms that doesn't involve waiting until the M gets back into the scheduler. What I wrote above is a rough sketch of a proposed lightweight synchronization mechanism that most of the time doesn't require preemption. I think that in general we can't require preemption in a per-M approach if we want to be able to simplify the no-P edge cases and also get events out of e.g. sysmon, which always runs without a P. (EDIT: D'oh. I keep forgetting that the current tracer can indeed emit events without a P. So it's really more that we don't currently have a great way of tracking Ms in general. I would like to add more explicit M-related events. The GoSysExit point still stands because it races with a trace stop-the-world, which is the main source of complexity. If we synchronize via Ms that goes away.)
(In effect, I am proposing to shift the GoSysExit complexity somewhere else, but I hope that in the end it will be less complexity overall because the M synchronization details can probably be written in a way such that the details don't leak as much.)
Comment From: dominikh
An aside that might steer you closer to a per-M approach: I tried adding per-M timelines to gotraceui using the current format and found it impossible due to the current event sorting logic. I ran into scenarios where a P would start on an M while the M was still blocked in a syscall.
Comment From: rhysh
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
Yes, goroutine profiles are STW snapshots, but the duration of the STW pause does not vary based on the number of goroutines. Go 1.19 includes https://go.dev/cl/387415, which says "... do only a fixed amount of bookkeeping while the world is stopped. Install a barrier so the scheduler confirms that a goroutine appears in the profile, with its stack recorded exactly as it was during the stop-the-world pause, before it allows that goroutine to execute."
We'd want something like a "STW" event in the execution trace that we could tie back to the particular goroutine profile (maybe record the goroutine ID of the initiator and either a call stack or the reason string); although a protobuf-formatted goroutine profile will include a timestamp, it's determined by the runtime/pprof package after all of the stacks have been collected, rather than by the runtime package during the STW that the snapshot represents.
goroutines can in fact run without Ps sometimes
This goes for g0 as well: CPUSample events include the G and P (when they're available), but an M can run without either, such as when "spinning" near the bottom of findRunnable. That event should have included a reference to the M. As it is, it's tricky to attribute the on-CPU cost of spinning Ms, and to find the application behaviors that result in higher/lower costs there.
Having a tracing buffer available to the M could also simplify the way we get information from the SIGPROF handler into the trace.
Comment From: mknyszek
2023-03-16 Sync
Attendees: @mknyszek @felixge @bboreham @nsrip-dd @dominikh @rhysh @thepudds
- Frame pointer unwinding for tracer CL stack is ready for review. No change to existing test cases is needed anymore. https://go-review.googlesource.com/c/go/+/476235 Felix Geisendörfer
- Works without any changes to tests! Produces the same traces as gentraceback (or the new thing). Slightly hacky inline expansion handling. Will be sorted out in review.
- The next frontier is arm64.
- Execution traces v2 document (public document coming very soon, sorry!)
- Michael K: Rhys, what did you mean in your comment about traces not being usable until a goroutine profile is collected?
- Rhys: Today there's an instant, but in the new design the trace start might be ragged. Effectively the STW for the goroutine profile becomes the synchronization point.
- Michael K: The new design shifted from what I might've said before: it does actually have a single global start point, it just might be slightly wrong in the case of races.
- Felix: When a goroutine unblocks for the first time after the trace starts, the previous state is recorded, giving us the knowledge of what that goroutine was doing from the start of the trace until now.
- Rhys: The doc mentions that stack traces might not be recorded for GoStart events.
- Michael K: I might want to walk back on this.
- Michael K: Doubling the network cost of a trace wouldn’t be an issue; does anybody disagree with this?
- Rhys: Good to revisit all the things.
- Felix: Network bandwidth costs could be prohibitive for non-stop tracing.
- Michael K: Flight recorder mode recording into a ring buffer with a single pc stack trace. …
- Michael K: Filtering events could shrink the size of the trace in the future as well.
- Michael K: Maybe we should pursue making the trace fast by default, and then have other ways to deal with it later.
- Felix: Maybe compression can be done after trace recording.
- Michael K: Sounds reasonable. I’m also more convinced of the 8 byte format now.
- thepudds: Something like lz4 or snappy in memory by the runtime + zstd for network transport would compose well for total compression on the wire. (See the compression sketch after these notes.)
- Michael K: On-the-fly compression seems like it would be too slow.
- thepudds: They claim "faster than memcpy" for some lz4-style implementations, though it's not always true depending on exact usage.
- Michael K: We should explore this more. Not very familiar with compression algorithms.
- Felix: Anecdote: Adding an efficient timestamp encoding to pprof didn’t beat a naive way of adding them via pprof labels after compression.
- Michael K: Could users use a goroutine to take the things out of a trace that they want to decide what to send?
- Felix: That would seem very useful.
- Rhys: It’s not only network bandwidth. It’s also memory in process. Currently recording traces/profiles in a very simple way that is safe for all the app owners. Wants to be careful of how much memory is used. Need to buffer all data in a zip before sending it because remote blob storage expects the size to be known upfront. Had this in prod for several years.
- Rhys: Made a comment elsewhere that call stacks are useful for tools not just for people. Can be used to write instrumentation after the fact rather than deploying an update to prod.
- Michael K: How much easier would your life be if a parser could say that while you’re parsing you could take a slice. I’m streaming this in, but I know there is a partition, so I cut it off now. You get a slice of the traces.
- Rhys: Apps might be able to do flight recording use cases themselves.
- Michael K: If trace parsing API can make cuts efficiently, that would help.
- Rhys: Yes.
- Dominik: if it buffers the same amount, it'll have to send off more frequently; but the memory usage should stay the same, right?
- Dominik: you'd be able to send off individual partitions
- Rhys: In some contexts network bandwidth is essentially free. Storage is not.
- Felix: Is 1% memory overhead reasonable? 100MB for a 10GB app?
- Michael K: Expects 16-64kB buffer per M. For most apps it will be much less than 1%.
- Rhys: On a big machine 1% gives you a lot of room to work with. One design constraint for my library in prod is to not allow users to configure it all. Sometimes it’s in little daemons that don’t own the machine and just use 10s of MBs total that they use.
- Felix: So we want to figure out if design should scale down?
- Rhys: Yes.
- Michael K: Yes, we should care about it. I suspect we can get substantially less than 1%.
- Michael K: Question is how much is held in memory at once before writing it out? Is the network bandwidth a concern?
- Felix: The new design seems like it would allow recording many small tables more often.
- Michael K: Writing out stack tables will usually be less than 1000 entries; expect it to be fast, e.g. microseconds.
- Felix: Question about how the OS syncs clock across cores without massive overhead.
- Michael K: It’s not perfect, the clock can still be broken. You still have a RDTSC every time. The kernel does clever fixing up.
- Michael K: You’re right, it doesn’t get around the global sync issue. Michael Pratt pointed out the details regarding VDSO with regards to that.
- Michael K: The other benefits are worth it.
- Michael K: One benefit is aligning with other data sources; the other is making the timestamps monotonic.
- Dominik: the OS tries to fix things like cross-core/CPU drift of tsc.
- Michael K: Manually calling RDTSC is likely to be problematic when a G/M/P migrates CPU.
- Rhys: Trace writing should be pretty fast? I had to do some tricky things with buffer allocations to make this work. But that’s a problem I created for myself I think.
- Dominik: One "downside" is that the kernel detects bad TSC and falls back to other time sources that are much slower and not vDSO, and bad TSC include some modern consumer machines with faulty BIOSes. But it's probably fair to require well-working systems for Go tracing to work. It's not a regression from the current support, at any rate.
- Michael K: Seems like a fool's errand to make every faulty system out there work. I think we should do a timestamp fixup in the parser. We have a way to do partial ordering.
- Rhys: Does that interact poorly with the change to 8-bit sequence numbers?
- Michael K: Don’t think so. Sequence numbers are per-G. In an 8 byte scheme we can make this bigger. But that’s a good question. I will have to think about it more.
- Dominik: I remember relying entirely on the sequence IDs in the current format and skipping the check, and it did lead to invalid orderings. Not sure why
- Michael K: If timestamps are not fixed up and you’re consuming this … it depends on what you mean by invalid orderings.
- Dominik: It didn't pass the sanity checks of event ordering. Goroutines unblocking/starting/… out of order
- Michael K: I’ll catch up with you offline Dominik.
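As a rough sketch of the lz4/snappy-plus-zstd layering discussed in these notes, assuming the third-party github.com/golang/snappy and github.com/klauspost/compress/zstd packages (this is not something the runtime does; it only illustrates the composition):

```go
package tracecompress

import (
	"net"

	"github.com/golang/snappy"
	"github.com/klauspost/compress/zstd"
)

// shipTrace block-compresses each trace chunk cheaply as it is produced
// (the snappy/lz4-class step), then layers heavier zstd compression over
// the whole stream for network transport.
func shipTrace(conn net.Conn, chunks <-chan []byte) error {
	zw, err := zstd.NewWriter(conn)
	if err != nil {
		return err
	}
	for chunk := range chunks {
		// Cheap in-memory block compression.
		block := snappy.Encode(nil, chunk)
		// Stronger transport compression on top.
		if _, err := zw.Write(block); err != nil {
			zw.Close()
			return err
		}
	}
	return zw.Close()
}
```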
Comment From: mknyszek
2023-03-30 Sync
Attendees: @prattmic @mknyszek @felixge @nsrip-dd @bboreham @thepudds
- Michael P: #58474 - proposal: debug/gosym: expose function start line
- Also inlined calls
- Polar signals wants to use this for PGO.
- Planned for x/debug/gocore?
- FYI: Going to land in x/exp first to iterate on an API.
- Initial package sketch in https://go.dev/cl/474543
- Michael K: Also plan to do the same for x/debug/internal/gocore before making it public.
- Felix: Not tied to the Go release cycle?
- Michael P: Yeah.
- Michael P: Undecided on how far back to support Go binaries in the library.
- Felix: Start with the supported ones and if people complain then add more?
- Michael P: Do we actively drop support or just don't fix them if they break?
- If we just let them break then everyone might expect it to just work in the long term, and if we break it then we need to fix it.
- Michael K: Crazy idea: self-description?
- Michael P: I'm skeptical but maybe.
- Felix: Frame pointer unwinding updates
- Current status is that the Intel patch set is ready for review. Comments all addressed.
- It would be great to get another round of review and get it submitted soon.
- Nick and I have been looking into getting it working on ARM. I think we're down to the last issue. Feeling good about getting it working for the upcoming release.
- Michael P: I think your CL is ready, just needs another +2.
- Michael K: I updated the execution traces v2 document with a few things (public document coming tomorrow).
- I measured what we save by de-duplicating stack IDs between GoBlock/GoStart pairs and we got under 1% CPU usage with FP tracebacks! So, I removed the proposed API to control traceback depth.
- The event encoding section has been rewritten with a much simpler encoding. I expect about a 2x-2.5x trace size increase. I also propose a fast compression scheme as an out.
- Fleshed out the self-description section to be more concrete. Should the self-description data use some well-known encoding format?
- Added a use-cases section to help cement what the new design actually lets us do.
- Felix: We're worried about trace size in the long run. Also, what does the roadmap look like?
- Michael K: I think we can probably get this in behind a GOEXPERIMENT for Go 1.21. Definitely the default for Go 1.22.
- Felix: It would help if we could toggle between CPU samples and the full trace. We could do a lot with just CPU samples and timestamps. Our users are debugging p99 spans, and we can confidently say how much of that time was on-CPU. Currently impossible with pprof labels.
- Michael K: Does streaming help?
- Felix: Unfortunately we have to move data between clouds which eats up egress bandwidth. Also, it would have to be processed in-process.
- Michael P: I think it was always a follow-up to choose what you want in the trace, but it wasn't the highest priority. Maybe it should be higher.
- Michael K: I'm thinking we should look into APIs now.
- Michael P: Maybe there should be an API to decide in process? The former is more efficient, but it might be complicated.
- Felix: We might be able to do it today.
- Michael K: Maybe we try the in-process one first, and then later decide to try the API because it's a longer road. We need a parsing API for the new trace anyway, so we can just dump a copy into x/exp for experimentation.
- Michael K: I'd like to make a small collection of representative traces. I can look within Google. Any other ideas?
- Trace anonymization?
- Felix: I already have a prototype but it wasn't aggressive enough (thanks Rhys for pointing that out). My version keeps standard library frames. I can enhance it to a point where it's less scary and we can look at it.
- Bryan: I can volunteer traces out of Prometheus.
- Michael K: For CPU profile samples in the new trace format, would we be OK with a scheme that's simpler but might lose samples for the tracer itself?
- Idea: Write the event directly from the signal handler. When we try to pick up the buffer from the current M, we check if it's already being written to. If it is, drop the sample.
- The rest of the code should be safe to run from a signal handler.
- Is there something I'm missing?
- Michael P: What happens if the signal lands on a C thread that has never run Go code.
- Michael K: Can we do that today?
- Michael P: Today we write to a global profile buffer.
- Michael K: We could keep doing that just for that case. An advantage of both schemes is that at least Go code doesn't have to be best-effort about this.
- Michael P: The problem is we only read from the profile buffer when we read trace events. Could be exacerbated if a program is mostly C with just a little bit of Go code. Though, it is already a problem today. That could also be fixed in today's scheme. The CPU buffer reader could push into the trace instead.
- Felix: What’s the compatibility policy for diagnostic data (traces, pprof)? Does the latest version of Go really need to be able to read data from unsupported versions of Go? Some cmd/trace code goes back to go1.5 (2015).
- Michael K: For the current code, I'd honestly be OK just supporting the last 2 Go versions. The other versions are always there if you need them. However, I think if we make a trace parsing API, we may need to support arbitrarily old formats.
- Michael K: We should probably just go with whatever decision we make for x/debug.
- Michael K: Spoke with David Ashpole and others about OTLP integration down the line and what that might look like. (Long-term goal: can we push tracing information up efficiently?)
- One thing that came up was possible conversion overhead to OTLP. Another option is to smuggle our trace data as a binary blob and have the collector figure out what to do with it.
- Felix: Working with the OTel profiling group which is trying to standardize across profiling formats. It's still an open question whether we want to allow opaque binary blobs to flow through the collector. Want to be able to process things going through. At the same time, having an intermediate format that's a superset of JFR seems like a bad time.
Comment From: mknyszek
2023-04-13 Sync
Attendees: @nsrip-dd @bboreham @thepudds @prattmic @mknyszek
- Michael K: Turns out that my "optimization" to reduce duplicate stacks was actually just dropping stacks. It's not actually useful. (Thanks Nick!)
- thepudds: What kind of event rate are you expecting for 1% overhead? 500,000 events per second per CPU at 1% overhead gives a 20 ns budget per event.
- Michael K: I did a worst-case analysis based on ~600 KiB/s for 6 cores.
- Michael P: Depends on parallelism.
- thepudds: 1% of total CPU available right?
- Michael P: If the program itself has limited parallelism, then we don't want 1% of all available CPU. I like the way you're thinking about the per-event budget.
- Michael K: +1.
- thepudds: Trade-off between what's done in-line or what's done out-of-line. There's a fair amount of sensitivity to what's done in-line.
- Michael P: That's true, there's some cost to the trace reader.
- thepudds: I wonder how much people will appreciate that subtlety. Also per-M buffer memory use?
- Michael K: I'm not worried about it because of global buffer flushes per-partition. It comes down to how many threads can be active over 1 second.
- AI(mknyszek): Write down per-M buffer memory use subtleties.
- Michael P: If a lot of threads are going to sleep on syscalls, the scheduler might be your real bottleneck.
- Michael K: Akin to idle-priority GC workers where maybe we have an idle-priority trace compressor? Optimistic compression?
- Michael P: CPU-bound apps with lots of running goroutines is sort of the worst case, and it's a cascading failure in a sense.
- thepudds: Experimenting with bring-your-own-compression.
- Nick: Felix’s amd64 frame pointer unwinding CLs were merged! We’d like to do arm64 as well. Main blocker right now is https://go-review.googlesource.com/c/go/+/241158 (to fix up frame pointers when copying stacks), needs review.
- Michael K: I'll unblock you after this meeting.
- Michael K: Did a few trace performance investigations:
- Reducing trace depth didn't meaningfully reduce trace overhead.
- A probabilistic stack experiment I tried also didn't meaningfully change things (though, smaller trace size due to fewer stacks).
- The last runs I did seemed to indicate <1% overhead with what's at tip-of-tree, but I need to verify that again.
- Compared varint trace path for 3 arguments + timestamp with an earlier version of the proposal (4-byte words, 16-bit timestamp deltas, etc.):
name time/op
Varint-48 13.9ns ± 0%
Word-48 3.20ns ± 1%
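For anyone who wants to reproduce the flavor of that comparison, a standalone micro-benchmark along these lines can be written with just the standard library; it is only a sketch of the two encoding strategies, not the actual benchmark above, and numbers will differ:

```go
package traceenc_test

import (
	"encoding/binary"
	"testing"
)

// Three event arguments plus a timestamp delta, matching the shape of
// the comparison above. The values themselves are made up.
var args = [4]uint64{1 << 20, 42, 7, 513}

var sink int

// BenchmarkVarint encodes the four values as LEB128-style varints,
// roughly what the current trace encoder does per event.
func BenchmarkVarint(b *testing.B) {
	var buf [4 * binary.MaxVarintLen64]byte
	for i := 0; i < b.N; i++ {
		n := 0
		for _, v := range args {
			n += binary.PutUvarint(buf[n:], v)
		}
		sink = n
	}
}

// BenchmarkWord writes each value as a fixed 8-byte little-endian word,
// roughly the "word"-style encoding discussed for the new format.
func BenchmarkWord(b *testing.B) {
	var buf [4 * 8]byte
	for i := 0; i < b.N; i++ {
		for j, v := range args {
			binary.LittleEndian.PutUint64(buf[j*8:], v)
		}
		sink = len(buf)
	}
}
```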
Comment From: felixge
@mknyszek @thepudds interesting overhead discussion ❤️! FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
@mknyszek thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
Sorry for missing the meeting, I'm on vacation right now 🏖️.
Comment From: mknyszek
FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
That's fair, and I'm considering putting it to rest for now and just proceeding with everything else. These experiments were easy to run in the background since I was on rotation last week, but in hindsight I think I should've checked a couple other things first. Specifically, I should've checked whether it's worth changing the encoding solely for the encoder-side costs at all:
- What is the cost of switching to nanotime?
- If we remove the bulk of the encoding work, what is the upper limit on what we can save?
I'll run these before the next meeting and I should have the answers I need on whether to pursue this further in the near-term.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases. I think there are still a lot of questions there but I also want to leave open as many doors as possible. I'll add a line to the agenda for next time about this.
(Since we're already overhauling the trace format, I'd like to make sure these kinds of use-cases are considered to avoid making more breaking changes than necessary down the line.)
Sorry for missing the meeting, I'm on vacation right now 🏖️.
No worries at all; enjoy your vacation! :)
Comment From: felixge
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
We're still working out the details, but yeah, ideally the overhead is <1% (cpu, memory, latency) for 95% of applications while the tracer is recording.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases.
Yeah, I think there are many good reasons for moving away from LEB128 in the long run. In-process decoding is an excellent reason. If we could find an approach that doesn't increase data volumes or encoding overhead, it'd be an absolute no-brainer. But if that's not possible we just need to be careful that we're hitting a good balance.
Comment From: mknyszek
2023-04-27 Sync
Attendees: @mknyszek @thepudds @nsrip-dd @felixge @prattmic @bboreham @cagedmantis @irfansharif
- Felix & Nick: Patches that need merging so we can set tracefpunwindoff=0 for amd64 and arm64:
- https://go-review.googlesource.com/c/go/+/489015 (fix for amd64 crash michael k found)
- https://go-review.googlesource.com/c/go/+/488755 (incgocallback crash nick found on arm)
- https://go-review.googlesource.com/c/go/+/481636 (nice to have, but just needed to make debugCheckBP happy)
- https://go-review.googlesource.com/c/go/+/489117 (re-enable frame pointer unwinding for amd64)
- one more CL (not submitted yet) to set tracefpunwindoff=0 for arm64 by default.
- Michael K: Further performance experiments.
- First, I think FP tracebacks alone have actually hit <1%, woo! I've confirmed it multiple times now.
- Second, if I switch to nanotime in the current implementation there's no statistically significant performance difference if the time granularity is roughly the same as it was for cputicks. Currently we divide cputicks by 64, which gives us ~20 ns granularity for a 3 GHz CPU. If I just do plain nanotime, so 1 ns granularity, there's about a 0.5% regression (!). If I apply the same division by 64 to nanotime, so 64 ns granularity, it's a 0.2% improvement (!!). This difference is strongly correlated with trace size. All in all, I think this means 2 things: (1) nanotime is not worth worrying about and (2) maybe the trace encoding does actually matter a good bit? Or at least, how the timestamp is encoded specifically. Like, if it can often stay in 1-2 bytes, that apparently makes a fairly big difference. I'm not exactly certain why. (See the granularity sketch after these notes.)
- Thirdly, if I comment out all of the trace buffer writes but still make sure to grab the stack trace, I get a 0.5% regression (!?). The trace size is minuscule now, confirming that in fact nothing is being written out. This contradicts ~everything above and I have no idea why. I'm probably just doing something wrong.
- Felix: I think we could go up to 128 ns because scheduling goroutines takes ~200 ns.
- Michael P: We get some more headroom for larger granularity as long as we have a partial ordering.
- Michael K: We already have partial ordering, so I might go ahead and do the timestamp fixups for Go 1.21.
- Michael K: I'm going to go ahead and make the proposal public, modulo questions about trace encoding. I'd like to get started on landing parts of the proposal as a GOEXPERIMENT for Go 1.21, so we can start working on a parsing API for it. Also gives us a space to iterate on a trace encoding.
- thepudds: Event filtering has been discussed verbally but hasn't been put into the proposal. Is that worth having a marker for a future consideration?
- Michael K: I think we should write it down.
- thepudds: There's also interplay with the encoding.
- Felix: The doc does mention filtering, but in the use-cases section (online analysis).
- Michael K: Yes, we’ll try filtering in user-space first, but having a dedicated API for it is not off the table.
- thepudds: Maybe it doesn’t matter if it's fast enough. But if we filter on the producer side we could even skip taking a timestamp - could be relevant in some cases.
- thepudds: Can you clarify the trace size decrease from using nanotime instead of cputicks?
- Michael K: Saw 1.69% decrease in size when going to 64ns resolution from 20ns (roughly, using the current encoding).
- thepudds: In practice it will probably depend on the delta between the timestamps and how compression plays out.
- thepudds: Any progress on representative traces?
- Felix: We have the execution tracer running on staging collecting data. We can look into anonymizing and sharing that data. My traceutils can anonymize traces now and I think is good enough to pass security review.
- Michael K: Haven't had time recently, but definitely want to do this.
- Trace parsing
- Michael K: Will work on this during the freeze. Dominik probably has lots of input on this.
- Michael P: Freeze is the last week of May (beginning or end?).
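To make the timestamp-granularity point above concrete, here is a small illustrative calculation of how many bytes a varint-encoded timestamp delta takes at different granularities; the 3µs event gap is made up:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A made-up gap of ~3µs between consecutive events.
	const gapNanos = 3000

	var buf [binary.MaxVarintLen64]byte
	for _, granularity := range []int64{1, 20, 64} { // 1ns, ~cputicks/64, 64ns
		delta := uint64(gapNanos / granularity)
		n := binary.PutUvarint(buf[:], delta)
		fmt.Printf("%2dns granularity: delta %4d -> %d varint byte(s)\n", granularity, delta, n)
	}
}
```

At 64 ns granularity this particular delta fits in one byte instead of two, which is consistent with the trace-size correlation described in the notes.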
Comment From: bboreham
Regarding representative traces, I attach two traces from the Open Source monitoring system Prometheus: prom.trace1.gz prom.trace2.gz
Both cover 5 seconds; the first one is from a small Prometheus and the second from a much bigger instance. Let me know if something longer/bigger/different would help more.
Comment From: mknyszek
2023-05-11 Sync
Attendees: @mknyszek @rhysh @bboreham @thepudds @aclements @nsrip-dd @dominikh @prattmic
- Michael K: Goroutine profiles!
- The new proposed execution trace changes will mean that we don't capture the state of goroutines that never run while the trace is active. A proposed workaround is to use goroutine profiles to achieve basically the same thing: start tracing then immediately grab a goroutine profile. The trace tooling can then correlate the two to obtain a complete picture of all goroutines.
- This works, except that only one type of goroutine profile actually distinguishes distinct goroutines, and that format is not very machine-readable.
- We could add this to the gzipped pprof proto goroutine profile, but if we add label information on all samples, then we might significantly increase the size of the proto (differing labels prevent merging).
- Three options:
- Add a new goroutine profile type (debug=3) that's a gzipped pprof proto with the additional information.
- Add the information to the default (debug=0) goroutine profile type. Maybe the increase in size isn't actually that bad?
- Just make traces STW on trace start in the same way as we do for goroutine profiles (to avoid keeping the world stopped while we inspect every single goroutine). This would only happen on trace start, not on new partitions.
- I'm sort of inclined to just give up and do the third option at this point. Though, if there's value in the other two outside of traces, maybe worthwhile to still consider?
- FYI, I think I also figured out how to get the state of goroutines without stopping the world.
- Felix: +1 on just doing it in the trace.
- Austin: Easier from a user perspective.
- (earlier) Michael K: Goroutine Profile Matrix is useful.
- Bryan: Hit this problem of options on the goroutine profile type. The amount of time spent asleep is useful but only available on the text one [debug=2]. (See the sketch after these notes.)
- Felix: +1 to that.
- Austin: The pprof proto is pretty extensible.
- Rhys: We've got a lot of goroutines on some services, and those teams have complained about the STW pauses from goroutine profiles and traces.
- Michael P: re: Gscan, how does that work with the GC?
- Michael K: Gscan acts like a spinlock.
- Michael P: If we wanted to get fancy, the Gscan bit could be more like a semaphore count, because everything looking at the G is read-only.
- Felix G:
- Two small cleanup patches (CL 481617 and CL 489095) are “Ready to submit”. CL 481636 for arm64 needs one more review (it’s just for debugCheckBP, not needed for tracer).
- Try to enable debugCheckBP by default?
- Austin: Maybe a debug mode on a new builder? longtest builder?
- Felix G: Considered changing the const to a var to make it easier to enable.
- Michael K: I think it's just the fact that we're close to the freeze.
- Austin: If it was the start of the release cycle I would be OK with turning it on. Worried about production crashes. Also concerned about user assembly code that does weird stuff to the frame pointer. I would want a decent amount of soak time. We test in google3 but we can roll it back quickly.
- Felix G: Based on that, would it be OK to land a CL to put it behind a GOEXPERIMENT?
- Austin: Go for it.
- Michael P: We can run it against Google-internal tests for coverage.
- Expand frame pointer usage into more places? E.g. replacing funcspdelta lookup in unwinder? Also runtime.Callers and similar APIs, e.g. non-CPU profilers?
- Sharing traces.
- Bryan shared some traces; thank you!
- Looking into it internally. What I could do earlier is extract statistical information. Hoping to eventually share full traces.
- Michael K:
- FYI tracer refactoring https://go.dev/cl/494192
- Michael K: Just plugging along, hope to have it landed before the freeze. Plan to split the tracer into 2 implementations.
- Felix G: Nick and I discussed that part of the proposal captures syscall durations, which would mean capturing both start and end times. We're concerned that it could produce a lot more events where someone is making a lot of syscalls or calling into cgo often. We also couldn't think of a use-case for the short syscalls. Can get an upper bound on time between syscalls when you're doing a lot of them.
- Dominik: "really short" is relative, it can be up to 10 ms.
- Rhys: I've felt a need for this in the last 6 months. crypto/tls does a write syscall, but it's hard to determine how much time is spent in the syscall vs. in application time. Knowing what to go after would be useful.
- Felix G: Wouldn't the CPU profiler be more appropriate for this case?
- Rhys: If your application does only one kind of work, then sure. But if your application does two kinds of work, sometimes small, sometimes large, then it's harder. Could be done, but it's harder.
- Michael K: Not worried about performance since syscalls are already expensive because of Meltdown mitigations. The "syscall end" event is also pretty easy to make small.
- Felix G: Concern about short cgo calls in a loop.
- Michael K: Maybe we should just treat cgo calls differently.
- Austin: It is a little weird that we trace them. Might just be a historical artifact.
- Felix G: I support not tracing cgo calls. We don't trace Go calls for a reason.
- Dominik: If cgo got a lot faster then I agree it shouldn't be in the trace.
- Felix G: Not tracing cgo could also be a good use case for enabling/disabling individual events via an API.
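For reference, the existing goroutine profile variants discussed above can be requested like this today (the proposed debug=3 variant does not exist); a small sketch:

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	p := pprof.Lookup("goroutine")

	// debug=0: gzipped pprof proto; goroutines with identical stacks are
	// merged into one sample, so individual goroutines are not
	// distinguished.
	p.WriteTo(os.Stdout, 0)

	// debug=1: human-readable text, still aggregated by stack.
	p.WriteTo(os.Stdout, 1)

	// debug=2: one entry per goroutine, including its state and how long
	// it has been waiting; today this is the only variant that
	// distinguishes distinct goroutines.
	p.WriteTo(os.Stdout, 2)
}
```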
Comment From: mknyszek
2023-05-25 Sync
Attendees: @mknyszek @thepudds @aclements
Before the notes, I'd like to note that this sync barely happened. I showed up 15 minutes late (very sorry about that), though it seems like many of us were unavailable for the sync anyway, so perhaps it was just not meant to be. For anyone that showed up in those first 15 minutes but left confused, I'm very sorry about that; that's a mistake on my part. Austin and I joined late and briefly discussed some topics with thepudds, but the sync did not run very long at all.
- Michael K: No updates; I've been working on getting stuff in before the code freeze.
- thepudds: I did some work on the trace event encoding that I'd like to share. I'll sync up with Michael K briefly before the next meeting.
- thepudds: RE: #58106, I'm not certain if we have everything we need after the new heap metrics landed (#56857) to allow users to detect when the heap is saturating GOMEMLIMIT. I'll send Michael K a pointer and we'll discuss further.
Comment From: mknyszek
2023-06-08 Sync
Attendees: @mknyszek @prattmic @bboreham @leitzler @nsrip-dd @rhysh @felixge @irfansharif
- Michael K: Working on new tracer implementation. Will finally (finally) make the doc public in the proposal repo and file an issue to track work.
- Felix: Update. Working on building features on top of the execution tracer. Eagerly waiting on the next release. Hoping to share more on this publicly soon. Curious about plans for the next release cycle. One theme we're interested in is getting frame pointer unwinding in more places. CPU profiling should be straightforward, but the GC or copystack might see some improvements as well.
- Michael K: Sounds good to me. Would be nice to have just one unwinder on amd64 and arm64.
- Felix: We can't totally replace it, but we can swap the frame data lookup with just a pointer delta. It wouldn't be the tight loop we have in the tracer, but still better. Ideal state would be to not use the PC delta lookup tables at all (and not include them in the binary), but this will be difficult in some places.
- Michael P: A few months ago Felix sent a CL for sysmon to do fewer preemptions. This came up yesterday when debugging the scheduler. Turns out the extra preemptions in sysmon are masking work conservation issues. We started discussing again the possibility of reducing preemptions from sysmon, at least as a debug mode initially.
- Michael K: Plan to look into core analysis libraries and making them public.
- Bryan: This may be off-topic but I was looking for tools to investigate growth in /memory/classes/heap/unused:bytes. Would viewcore be good for that?
- Michael K: Yeah, I think so.
- Michael K: Flight recorder mode?
- Nick: Yeah, I think we'd use it.
- Rhys: Would use it for one-offs. Currently trying to collect traces one second every hour from services. Allow the app to say "now would be a good time to collect a trace." (A rough user-space sketch follows these notes.)
- Michael P: Would be handy combined with core files. Might only be useful to me debugging the scheduler.
- Rhys: Core files are great because they have all the info, but haven't put in the time to figure out how to use core files in practice because they use everything. For instance, correctly accounting for customer data.
- Michael P: That's fair. Where I see it used is for one-offs where one person has access to one core file for debugging.
- Michael K: That's one thing the old defunct heap dump format had over core files: it basically was just a memory map.
- Rhys: Do core files include precise "pointer or not" data for all data? Would it be feasible to write a program that overwrites all data that is not a pointer?
- Michael K: I think this is feasible. The API can't write though.
- Rhys: Could one just identify the byte offsets in the core file of what not to keep and write a tool to copy the core file?
- Michael K: Seems reasonable.
- Rhys: This is the main reason I don't feel comfortable using core files.
- Felix: Use-case we're interested in is OOM kills. Heap profile collected regularly is not enough, because it's usually not "more of the same" causing the OOM, but something new and recent. Core dumps aren't the only solution. V8 has a callback that allows you to be informed when one is about to run out of memory.
- Michael P: Linux OOM killer might disable core dumping.
- Felix: In most cases we see the cgroups OOM killer, which is a bit different. There are many flavors of OOM killer.
- Michael P: I will double-check.
- Michael K: One of the motivations behind the viewcore improvements is debugging OOMs. Hard memory limits seem undesirable and maybe easier in the V8 world.
- Felix: Ideally it would be something like everything else stops while a goroutine uploads some data when things are really bad. Might be difficult because that goroutine probably needs to do complicated things.
- Michael P: Asynchronous signal to applications for reclaim, but it's not ideal. You might just let memory balloon elsewhere e.g. in the kernel.
- Felix: One problem with cgroups is you can be allocating so quickly that you can't actually catch the problem before you fall over. Another problem is that the OOM killer often kills innocent programs, since it can't really identify who's at fault.
- Michael P: The kernel OOM killer kills processes with SIGKILL unconditionally, which means it never writes a core dump.
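A rough user-space approximation of the flight-recorder idea referenced in these notes, using only the existing runtime/trace API: keep the last few short traces in memory and dump the most recent one on demand. Each buffer is an independent trace, there are small gaps between them, and the interval and retention count are arbitrary:

```go
package flightrec

import (
	"bytes"
	"os"
	"runtime/trace"
	"sync"
	"time"
)

// Recorder keeps the most recent short execution traces in memory.
type Recorder struct {
	mu   sync.Mutex
	last []*bytes.Buffer // most recent completed traces, oldest first
}

// Run records back-to-back 1-second traces, retaining the last keep of them.
func (r *Recorder) Run(keep int) {
	for {
		buf := new(bytes.Buffer)
		if err := trace.Start(buf); err != nil {
			time.Sleep(time.Second) // tracing already active elsewhere
			continue
		}
		time.Sleep(time.Second)
		trace.Stop()

		r.mu.Lock()
		r.last = append(r.last, buf)
		if len(r.last) > keep {
			r.last = r.last[1:]
		}
		r.mu.Unlock()
	}
}

// Dump writes the most recent completed trace to a file, e.g. when the
// application decides "now would be a good time".
func (r *Recorder) Dump(path string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.last) == 0 {
		return nil
	}
	return os.WriteFile(path, r.last[len(r.last)-1].Bytes(), 0o644)
}
```

Anything stronger (no gaps between buffers, precise trigger semantics, lower overhead) would need runtime support, which is what the flight-recorder discussion is about.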
Comment From: mknyszek
2023-06-23 Sync
Attendees: @aclements @mknyszek @prattmic @nsrip-dd @rhysh @felixge
- Michael K: Published the execution tracer doc. Working on the tracer.
- Felix: Continuing to push the idea of sharing anonymized traces. Haven't been shut down entirely and security seems to say that it's OK if I think it's a good idea.
- Austin: Curious about what goes into trace anonymization, specifically for PGO.
- Felix: My approach is going through the string table and replacing everything that's not a known std symbol and replacing it with XXX. Also replacing log messages with XXX. Still leaks some information like CPU utilization, size of the instance running, etc. Doesn't seem like a deal breaker for just sharing some data. Might not be useful for PGO.
- Rhys: That means that you can write out a new execution trace. Can you do it in the general sense or can you only make edits to the string table? Being able to modify the in-memory version of an execution trace and write it out is powerful. Usually write instrumentation after the fact. It would be nice to write that out directly into the trace. To what extent do you have that capability?
- Felix: Currently I have a streaming decoder for the trace. Not sufficient for manipulating timestamps. We have produced some artificial traces internally, but they're one-offs. Have thought about more full-featured execution trace parser. Some loss of information like batches.
- Rhys: The tools I've been working on for traces have tests, and the tests are built around a string format. There are a lot more tools that work with profiles, and part of the reason is it's just easier. There's a protobuf format and a text format (e.g. heap profile text format, debug=1 or 2 or something). The bar for getting started with execution traces is much higher.
- Felix: There's the "-d" flag on go tool trace. Does that work?
- Rhys: Can round-trip most things, but not if it has a stack. Stacks don't work. Strings aren't quoted. Once you have the text format, there's nothing that reserializes it.
- Austin: go tool pprof does have a -raw flag (and a -traces flag?) for producing a text format.
- Felix: Another text format that's useful is Brendan Gregg's "folded" format. (Rhys: "stackcollapse"?)
- Michael P: Really interesting ideas. Talked in this meeting about having a public parsing package. Haven't talked about a public writing package.
- Michael K: Starting with a new parser, with a nicer API. Lower level, streamable. Might be not good to have multiple formats.
- Austin: It doesn't have to be another format, just a flag to say it's already sorted.
- Felix: One challenge with having a package where you can stream the events in and out is when you want to shift an event in time. Any API to allow really flexible trace construction would require having all the events in memory.
- Rhys: As far as having different formats goes, the old CPU profile format was streamed in whatever format the runtime produced. As soon as you put it through the pprof tool you got the canonicalized version.
- Felix: Interesting to keep in mind for testing purposes for a low-level trace parser is that you can round-trip the trace.
- Austin: If you want to do anything more complex like event insertion then you have to break the batches. It's not clear that you want to keep the capability to not break them.
- Michael P: Summarizing: use-cases we have for writing traces:
- Anonymization
- Insert trace events (post-instrumentation)
- Felix: Wanted to create an artificial trace for testing.
- Rhys: With pprof there's the -focus flag. Want to be able to cut out just the interesting part to avoid long wait times in tooling. For tests, want to be able to check in something that I can read, but then generate the in-memory or binary format for that. If you want to write tests for the go tool trace UI it would currently be really onerous, even for little things like just the latency histograms. Text format would help.
- Michael K: go.dev/fast idea. Go community survey asks for more optimization guidance. Could tell you how to write fast code, but also how to use diagnostic tooling. Would be interested in guest writers. Inspired by internal Google newsletters, but also abseil.io/fast.
- Michael P: Writing a whole guide is a huge undertaking. One motivation behind a newsletter is that it can be incremental.
- Felix: Would it be focused on particular examples, tutorial-style? Or is it more about the mental model?
- Michael P: I think it would be both. You probably have some baseline ones that are more about fundamentals, but some more tooling focused ones that go through examples. The ones inside of Google try to be fairly actionable. "Here's a thing you can actually do." They do still provide background, though.
- Michael K: Mental model vs. tutorial-style came up a lot. One thought is to build a "mental model" guide out of a constellation of tutorials. A piece of feedback I received a few times on the GC guide was that people wished it got down to specific examples and solving their problems, as opposed to building a mental model.
- Felix: Could be a good place to experiment with different ways of explaining things on different levels. A guide needs to commit to an approach, but this would let us shine a light on topics from different angles.
- Michael K: RE: OOM-killing and core dumping, see https://lwn.net/Articles/590960/.
- Michael P: kdump works like this. Boots a new kernel in reserve memory to create a core dump of the kernel. But it does exist and does get used.
Comment From: mknyszek
2023-07-06 Sync
Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @bboreham @rhysh
- Michael P: Heads-up that I started doing refactoring on viewcore's internal libraries. Main goal was to add support for PIE binaries. It was a pain to edit the code, so I first did a refactor.
- Michael K: I have some additional trace format tweaks to propose:
- Put the string dictionary and stack table in specially-marked batches, so they can be discovered quickly when collecting batches.
- Make trace events more uniform by removing the inline string exception.*
- *Strings can appear either in the dictionary, or just before use. Either way, they allocate a unique ID. For an inline string, we would emit the string first, the parser would add it to the dictionary on-the-fly, and then the following event would reference it.
- No additional memory use, small trace size increase (vs. today, emits an extra ID that appears twice), simpler tracer and parser. Allows for a more flexible "inline string" policy in the future by making all strings uniform.
- Rhys: Experience report re: strings sometimes embedded in batches. When writing trace processors, it’s usually “run it and see if it works”, but that makes it really easy to miss these rare cases. Maybe the solution is more documentation. But in execution traces there are so many little interactions that happen only some of the time or basically never. This seems like another stumbling block.
- Felix: I think having an official parsing library should help with this to some degree?
- Rhys: It depends how fast it is. If you want a tool that’s “dump out all the strings”. Maybe you could do that very quickly if you went through all the batches and wrote out the string batches, but that would have a subtle bug in this case.
- Michael: I’m hoping the fact that all strings have the same structure will help. For a tool like that, look for all string events (ignoring “string block” header).
- Felix: I’ve written a tool to dump and anonymize all strings. It almost missed the inline strings. It sounds like we’re moving toward something less subtle.
- Rhys: I agree the consistency in that part of the format is an improvement.
- Michael: One design I considered was to emit the inline string into the string buffer. The string table would also hold a buffer, which would get flushed out regularly. “Inline strings” would go into this buffer but wouldn’t be stored in the intern table. Maybe I should just do that? Then we don’t need an exception at all.
- Austin: Yes.
- Rhys: maybe you need a "torture test", a trace file that contains all the rare cases ?
- Michael: We have that torture test.
- (User logs in traces)
- Rhys: The user log strings wouldn’t get deduplicated?
- Right
- Rhys: I don’t work in an environment that has a lot of user logging, or a clear ideology on what to put in the category and what to put in the message. It’s not clear as a user how to use those. E.g., is it beneficial as a user to have a fixed log message followed by another with the variable parts? That’s one reason logging hasn’t been picked up much. There are a lot of unknowns on the overhead.
- Michael: With the switch to partitions, it becomes a lot more okay to use categories that don’t deduplicate well.
- Felix: With more people using execution tracing, we should think about how (and if) it would integrate with structured logging (slog). For our current use case (log span id) we found the existing logging to work fine.
- Michael: jba (working on slog) has been interested in how it integrates with the trace. Gets to the question of whether it’s better to push data down into the trace or push trace data up into other logs.
- Combining/correlating logs
- Felix: Uncertain about pushing all logging down into the execution trace. Traces are more about execution (from the runtime’s perspective), logs are more about application-level concerns. Reinventing the wheel (e.g. OpenTelemetry’s Logging Signal) in execution traces might lead to weird situations down the road.
- Rhys: Some of the things that end up in user logs are sensitive and there's not a whole lot of information in execution traces that are sensitive in the same way. Wouldn't want to mix the two, unconditionally.
- Nick: One thought of things to pull up out of the trace are e.g. trace ID. Putting in logging to correlate trace data with higher-level spans. If there was a way to have the annotation in the trace already, that would be ideal.
- Austin: It's a false binary to push one trace into the other. A third approach is to correlate them really well.
- Rhys: Joke suggestion: goroutine ID. That's the thing that shows up really well in execution traces. There are good reasons for not putting that in userland, but something like that + a timestamp for correlation works.
- Austin: Kind of want to just export the goroutine ID (sub-joke) as a type that cannot be used with "==" (not comparable).
- Michael P: If you can convert it to a string, then plenty of users would be perfectly happy to just convert it back. If you can't convert it to a string, then you can't log it. slog-only doesn't help because the slog backend could just route it back to the application.
- Rhys: The thing that makes goroutine IDs so useful in execution traces is that it shows up on every event. If you miss the correlation key (say, it gets emitted 10ms before tracing starts) then you have a problem.
- Felix: Already plenty of ways to access goroutine IDs and the world hasn't ended. What stands between that and complete chaos is the runtime's strong stance against it.
- Michael K: I think all the ways to get it are slow?
- Felix: There are some packages that use linkname.
- Michael P: The easiest way is to just write assembly. linkname is fragile and you can't linkname getg.
- Michael K: Maybe worth revisiting goroutine labels?
- Felix: Labels assume a strong relationship between parent and child goroutines. Breaks down in multiple ways: in some cases you want the relationship and you don't get it, and in some cases you don't want the relationship and you do get it. In the end it's goroutine IDs connecting things.
- Rhys: Have had huge success when users put request ID in the pprof label.
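A minimal sketch of the pattern Rhys describes, attaching a request ID as a pprof label with the standard runtime/pprof API (the header name is illustrative). Labels set this way show up on CPU profile samples, and goroutines started inside the callback inherit them:

```go
package main

import (
	"context"
	"net/http"
	"runtime/pprof"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// The header name is illustrative.
	reqID := r.Header.Get("X-Request-Id")

	// Attach the request ID as a pprof label for the duration of the
	// request; goroutines spawned inside inherit the label set.
	labels := pprof.Labels("request_id", reqID)
	pprof.Do(r.Context(), labels, func(ctx context.Context) {
		serve(ctx, w, r)
	})
}

func serve(ctx context.Context, w http.ResponseWriter, r *http.Request) {
	// ... real request handling ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```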
Comment From: dominikh
I'm leaving this here as it's somewhat related and possibly of interest: user_events: Enable user processes to create and write to trace events
Comment From: dominikh
@mknyszek regarding our earlier conversation about EvGoSysCall and cgo calls emitting those events, I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Maybe it's best if we clean that up after all and make sure only actual syscalls emit that event. If we later decide that Cgo calls deserve to be included, too, they could get their own event type.
Comment From: mknyszek
I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Huh... Thanks for pointing that out. That's a bit surprising to me; maybe a bug? Can you share the runtime part of the stack trace for the event? I'm curious as to where it's getting emitted. (Sounds benign, but maybe not!)
+1 to cgo calls just having their own event type though. Alternatively the syscall event could become a more general "we're leaving Go" event with a subtype byte (since otherwise we'd have to copy the complexity of the syscall event's sequencing into the cgo events).
Comment From: dominikh
runtime.gcStart
/home/dominikh/prj/go/src/runtime/mgc.go:667
runtime.mallocgc
/home/dominikh/prj/go/src/runtime/malloc.go:1242
runtime.newobject
/home/dominikh/prj/go/src/runtime/malloc.go:1324
runtime/trace.StartRegion
/home/dominikh/prj/go/src/runtime/trace/annotation.go:158
main.fn1
/home/dominikh/prj/src/example.com/foo.go:20
Where foo.go:20 is rg := trace.StartRegion(context.Background(), "foo region").
The Go version was go version devel go1.21-e6ec2a34dc Thu Jul 6 20:35:47 2023 +0000 linux/amd64
Comment From: mknyszek
Thanks! The source of that is the notetsleepg in gcBgMarkStartWorkers, which calls entersyscallblock. This case is kind of interesting, because it is a blocking syscall (the thread blocks in FUTEX_WAIT but the goroutine gets descheduled and its P is released). It's just one invoked by the runtime itself.
I'd argue maybe this one is working as intended? I see two possible problems here:
- It is kind of noisy for the user debugging their own code to see this extra syscall that's entirely runtime-produced.
- There are frames skipped which make it very unclear what the actual source of the syscall is. Maybe the skip count is the right amount for any syscalls going through the syscall package, but it's not right for this case. This stack trace makes it appear as if the GC somehow represents a syscall, but it's just a side-effect.
Maybe we capture more stack frames? That makes noisiness potentially worse, but in this case I think it would help a lot to identify "OK it's just the GC spinning up workers and waiting for them to start" from the extra stack frames. TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem and I'd be inclined to hide fewer frames in the trace itself (and leave it up to the UI to hide things), but I'm biased since those frames are ~always useful to me personally. :)
(From a performance perspective, I doubt including those frames would hurt much. Stack traces are likely to be just as dedup-able and the extra few frames would only increase trace size slightly.)
Comment From: dominikh
Based on your analysis I agree that this is working as intended. I'm also in favor of capturing more stack frames. With a proper stack, this event might even be useful to users, since it's a possible source of latency. And if it's too noisy, the UI should allow filtering it away.
TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem
It wouldn't be the first time that Gotraceui analyses stack traces, either. I already do so to walk backwards out of the runtime for events such as blocking channel sends (the user doesn't care that this happens in runtime.chansend1).
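A rough sketch of the kind of stack post-processing Dominik describes (this is not Gotraceui's actual code): given an event's frames innermost-first, skip the leading runtime-internal frames so the user's call site is what gets attributed.

```go
// Package stackutil sketches the post-processing described above.
package stackutil

import "strings"

// UserFrames drops leading runtime-internal frames from a stack trace, given
// frame function names ordered innermost-first, so that e.g. a blocking
// channel send is attributed to the user's send site rather than
// runtime.chansend1.
func UserFrames(frames []string) []string {
	for i, fn := range frames {
		if !strings.HasPrefix(fn, "runtime.") {
			return frames[i:]
		}
	}
	return nil // the entire stack was inside the runtime
}
```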
Comment From: mknyszek
2023-07-20 Sync
Attendees: @mknyszek @prattmic @rhysh @bboreham @dominikh
- Michael K: Continuing to make progress, but nothing new to talk about.
- Michael P: FYI the tree is going to be reopened for development imminently.
- Rhys: Filed some bugs recently about lock contention. Filed about a runtime/metrics metric to measure runtime-internal lock contention. Michael P mentioned that it would be nice to write up runtime-internal locks to the contention profile.
- Dominik: I briefly wondered about sync.Mutex and execution tracing and if there could be a (cheap) unique ID for mutexes to analyze contention in detail (i.e. find goroutines blocked on the same mutex).
- Michael K: They do have a unique ID: their address.
- Dominik: The problem with the address would be address reuse.
- Michael P: Addresses could be reused, but you're tracking contention on small timescales, and unless a GC occurs, addresses won't be reused.
- Rhys: There is sometimes lock contention in the http2 implementation: something grabs the lock for the whole transport, then does a blocking write to the http2 connection, but blocks everything (seen for up to 15 minutes). These are just bugs, but the way I've found them is via goroutine profiles. The unique ID would be helpful to filter down which goroutine owns and/or is impacted by the lock.
- Michael P: What does it mean for a goroutine to hold a lock? Locks can be released by other goroutines.
- Rhys: Would still help for a lot of cases.
- Michael P: My old team's locking was complicated. I made a wrapper for sync.Mutex that enforced unlocking by the same goroutine (see the sketch after these notes), and there was exactly one case found in the relatively large codebase where a different goroutine released the lock.
- Rhys: It's useful to see how goroutines interact. I see this information when mutexes are contended, but not when they're uncontended. I wonder if it would be useful to occasionally sample uncontended locks as well to understand the structure of the program when it's not overloaded.
- Michael P: That's an interesting idea. Would you imagine that it would be integrated into a contention profile? (Contention is zero on a sample.) Or would it be a separate profile?
- Rhys: I was thinking of execution traces, e.g. this goroutine would have unblocked this other goroutine if they were executing differently. Right now I get no data about locks that might become contended until they are.
- Michael P: In the limit, it would be nice if every lock and unlock was a trace event, but that's very expensive.
- Dominik: It would probably be too expensive to emit lock/unlock events for uncontended locks, right?
- Rhys: Maybe we can do 1% or 1 per million of lock/unlock pairs.
- Michael P: Uncontended lock is already pretty fast: CAS and you're done. But maybe a (non-atomic, per-P) counter or something is fine with a slow path if that counter gets big.
- Rhys: Also applies to channels.
- Dominik: And the unlock checks a flag to tell if the lock was sampled?
- Michael K: I think that flag would have to be on the mutex.
- Michael P: (a) It's not immediately obvious that we should sample the unlock (but that would be a nice property). (b) There are unused state bits in the mutex state. We probably couldn't use them for a counter, but we could probably use one as a flag.
- Rhys: I think this nice property is important for execution traces. "Had this unlock not happened, this lock would not be able to proceed."
- Bryan: The mention of RWMutex made me think about getting data on how useful RWMutex actually is. For example, how often the program is actually taking advantage of the read lock. I often wonder if the implementation cost is worth it.
- Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
- Michael K: You can imagine that the lock implementation detects it by writing down the G.
- Michael P: The problem here is that you don't have to unlock from the same G. You can do weird things like handing off lock ownership and (surprisingly) sidestepping deadlocks. This makes detection more difficult.
- Dominik: Would putting the address in the trace be difficult?
- Michael K: I don't think it would be that difficult. It would make the trace bigger, but you can have a side-table.
- Michael P: You could imagine a block context event or something that includes that address and comes after the block event.
- Dominik: I'm not sure how much address reuse actually matters. The address will refer to the same mutex for lock and unlock pairs, and if two goroutines block on the same address'd mutex, it has to be the same mutex, too. Sampling all locks/unlocks would be a different story, however.
- Although I'm not sure how this interacts with stack moving/the elusive "moving GC".
- Rhys: Is there data that would be helpful for you for debugging the three issues I filed (61426, 61427, 61428)? In other words, is there any way we could improve runtime diagnostics that would make those issues easier to resolve?
- Michael K: I think runtime-internal mutex contention might be sufficiently covered by CPU profiles? Because they lock up the whole thread, so samples land there. This isn't true for sync.Mutex.
- Rhys: Doesn't capture details in certain cases, e.g. channel lock contention. Can't distinguish individual mutexes in a profile. Filed an issue about this a while ago. It would be useful in this case too to have some identifier for the lock.
- Michael P: The sync.Mutex contention profile would benefit from a unique identifier or something to distinguish whether it's one mutex or 5. Possibly via a label.
- Rhys: The contention profile accumulates over all time, so attaching labels to it means that until we figure out something clever, that also leaks memory over all time. So we can't have labels.
- Michael P: sync.Mutex can appear on a CPU profile because the contended locker has to call into the runtime scheduler. It's just not quite as expensive as calling futex.
- Michael P: I think the problem is slightly less bad for mutex profiles, but once you turn them on they keep everything forever. Could have a diff profiling mode.
- Michael K: Could a diff profiling mode have a label?
- Rhys: An execution trace is start and stop, but you can't get goroutine labels from it, and you can't get allocation events. Well, you kind of can with the HeapAlloc event.
- Michael K: That's super interesting; I think HeapAlloc being a good sampler is sort of accidental, but that's a good use-case. I'm not certain that was intended…
- Michael P: Hyrum's Law!
- Rhys: We do collect a stack every half-megabyte to put into an allocation profile. If we put that into an execution trace too, that would be fine. Over time we could see that a certain type of request allocates a certain type of memory looking over many samples.
- Michael K: Wrote an allocation tracer similar to the execution trace that emits an event for every alloc and free. Had about a 30% allocation overhead, but the traces were huge.
- Dominik: There's a GODEBUG flag to trace all allocs, isn't there?
- Michael K: Yeah, but it's really, really slow, since it just writes to STDERR.
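A rough sketch of the kind of sync.Mutex wrapper Michael P mentions above, which enforces that the unlocking goroutine is the one that locked it. It relies on parsing the goroutine ID out of runtime.Stack output, which is a well-known hack rather than a supported API:

```go
// Package lockcheck sketches the wrapper described above; the goroutine-ID
// parsing below is a hack, not a supported API.
package lockcheck

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
)

// goid returns the current goroutine's ID by parsing the first line of
// runtime.Stack output ("goroutine N [running]: ...").
func goid() int64 {
	buf := make([]byte, 64)
	buf = buf[:runtime.Stack(buf, false)]
	buf = bytes.TrimPrefix(buf, []byte("goroutine "))
	id, _ := strconv.ParseInt(string(buf[:bytes.IndexByte(buf, ' ')]), 10, 64)
	return id
}

// Mutex is a sync.Mutex that panics if it is unlocked by a goroutine other
// than the one that locked it.
type Mutex struct {
	mu    sync.Mutex
	owner atomic.Int64 // goroutine ID of the holder, 0 when unlocked
}

func (m *Mutex) Lock() {
	m.mu.Lock()
	m.owner.Store(goid())
}

func (m *Mutex) Unlock() {
	if g, owner := goid(), m.owner.Load(); g != owner {
		panic(fmt.Sprintf("lockcheck: locked by goroutine %d, unlocked by %d", owner, g))
	}
	m.owner.Store(0)
	m.mu.Unlock()
}
```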
Comment From: tbg
Just a small comment,
Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
https://github.com/sasha-s/go-deadlock does this (at runtime) by recording and comparing the stack frames seen for each (manually instrumented) mutex (very expensive, we only use it in nightly stress testing). Of course this requires observing a re-entrant lock acquisition. It seems difficult to do this sort of thing statically, but even doing it cheaply enough to have it always on would be helpful.
Comment From: prattmic
@tbg We discussed detection briefly. It is unclear to me how to do robust deadlock detection given that lock and unlock are not required to be on the same goroutine. e.g., consider:
- Goroutine 1 calls RLock.
- Goroutine 2 takes "ownership" of this critical section.
- Goroutine 1 calls RLock again.
- Goroutine 2 calls RUnlock.
Step 3 naively looks like a deadlock, but as long as goroutine 2 is guaranteed to make progress it is not. How does go-deadlock handle this situation?
Comment From: tbg
It gives a false positive:
https://github.com/sasha-s/go-deadlock/blob/5afde13977e624ab3bd64e5801f75f9e8eb1f41b/deadlock.go#L284-L300
It's difficult to handle locks that pass between goroutines. Which is a shame, because it's also pretty rare.
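For concreteness, here is a runnable illustration of the handoff pattern Michael P describes: per the sync documentation, one goroutine may RLock a RWMutex and arrange for another goroutine to RUnlock it, so a checker that tracks acquisitions per goroutine will flag the second RLock as re-entrant even though the program is fine as long as the second goroutine makes progress.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.RWMutex
	handoff := make(chan struct{})
	done := make(chan struct{})

	mu.RLock() // step 1: goroutine 1 takes a read lock
	go func() {
		<-handoff    // step 2: goroutine 2 takes "ownership" of that critical section
		mu.RUnlock() // step 4: goroutine 2 releases it on goroutine 1's behalf
		close(done)
	}()
	handoff <- struct{}{}
	mu.RLock() // step 3: goroutine 1 read-locks again; looks re-entrant to a per-goroutine checker
	<-done
	mu.RUnlock()
	fmt.Println("no deadlock")
}
```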
Comment From: mknyszek
2023-08-03 Sync
Attendees: @mknyszek @rhysh @nsrip-dd @felixge @bboreham @dominikh
- Michael K: First pass implemented, no parser yet. (CL) Veered away from doc a little based on learnings from implementation. Couldn’t get around a brief STW on trace start. Advancing the partitions or stopping the trace doesn’t need STW.
- Rhys: Can you quantify “brief STW”?
- Michael K: It stops the world, prevents sysmon from doing its thing, sets the global trace.on value, and emits one event per P, though this might not need to be done during the STW. Briefly spins waiting for things to exit syscalls. Couldn't figure out how to get around that.
- Rhys: Syscall stuff is a wild card, but otherwise it sounds like less STW than what we see from a GC?
- Michael K: Yeah, it’s extremely brief. Getting attention of every G in a non-racy way was difficult and changed the design a bit.
- Dominik: Will STW be recorded?
- Michael K: There's currently a bug that it doesn't, but it easily can.
- Rhys: Some of the applications I collect traces from have long GCs, and tracer can't start or stop during a GC cycle, so I miss things, especially when I'm collecting only 1 second traces. Does the new tracer need to wait for a GC to end?
- Michael K: It does, but that restriction can be easily lifted in the future. The trace just needs to contain more information about what's currently in progress.
- Felix: Could waiting for syscall exit cause a very long stop-the-world, since goroutines might be blocking on I/O?
- Michael K: Ah, no. To clarify, the syscall exit is just making sure that there are no threads in a very short critical section on the syscall exit path. Threads are guaranteed to drain from that critical section.
- Rhys: Could the spinning for that wait still be a problem, say if that thread gets descheduled?
- Michael K: It could, but it's very short (hundreds of nanoseconds at most), and working around that would require the tracer to be wait-free. Though, any thread writing to the trace is indeed wait-free. The core of the design is that every thread holds its own seqlock (a thread-local atomic counter) while it's writing events; the tracer then makes sure nothing is writing to the trace by ensuring the counter is not observed to be an odd number (which would indicate that the thread is in its critical section). A sketch of the idea follows these notes.
- Michael K: As an aside, I think I may have also mitigated some of the issues with the CPU profile samples being inserted into the tracer only in a best-effort manner. I basically just spun up a different goroutine because it was simpler. There may still be some data loss across trace generations (partitions) since the CPU profile buffer isn't forcibly flushed on each generation. We can probably do a bit more to minimize that, but on the whole the loss should be small.
- Rhys: Does this mean we also get M information for each profile sample?
- Michael K: Yup!
- Nick: You mentioned you're planning on writing a parser. There was some discussion of fixing up timestamps. Have you reached that point yet?
- Michael K: Not yet, but I expect it to be fairly straightforward. The algorithm is going to treat event partial orders as the source of truth and try to treat timestamps as annotations. It can't in general, but my goal is to keep it as close to that as possible. The timestamp fixup will just happen when timestamps don't make sense with respect to the partial order.
- Dominik: Will we end up with two parsers, one of the old, one for the new?
- Michael K: My goal was no, but I'm starting with a parser for the new one and then I'll go back and retrofit the old format.
- Dominik: Can fixing up timestamps introduce severe inaccuracies?
- Michael K: No more inaccurate than the timestamps being out of order in the first place! As a side note, I plan to write a wire format spec at some point.
- Dominik: Would it be possible for two events to have the same timestamp?
- Michael K: Yeah, it'll be possible.
- Rhys: It's useful in order to identify events uniquely. But that's fine actually, a good alternative is the byte offset in the trace which is what I already do.
- Michael K: Yeah, I think you'll have to keep doing that.
- Rhys: One interesting property about the current trace is that CPU profile samples are put in the trace buffer a bit after the stack is collected. So if you look closely in the trace you can see it shifted right quite a bit and get an idea of how long it takes to write a trace event.
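A minimal sketch of the per-writer seqlock idea Michael K describes above (an illustration, not the runtime's implementation): each writer flips its own counter to an odd value while writing and back to even when done, and the trace reader/stopper only proceeds once it observes an even value.

```go
// Package traceseq sketches the per-writer seqlock described above.
package traceseq

import "sync/atomic"

// writerSeq is owned by a single writer (one per thread in the design
// described above). The counter is odd exactly while the writer is inside
// its event-writing critical section.
type writerSeq struct {
	seq atomic.Uint64
}

func (w *writerSeq) beginWrite() { w.seq.Add(1) } // even -> odd
func (w *writerSeq) endWrite()   { w.seq.Add(1) } // odd -> even

// waitIdle spins until the writer is observed outside its critical section.
// The critical section is bounded (hundreds of nanoseconds in the discussion
// above), so a short spin on the reader side is acceptable.
func waitIdle(w *writerSeq) uint64 {
	for {
		if s := w.seq.Load(); s%2 == 0 {
			return s
		}
	}
}
```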
Comment From: mknyszek
2023-08-17 Sync
Attendees: @mknyszek @aclements @felixge @nsrip-dd @rhysh @dominikh
- Austin: Wondering if anyone had seen runtime.step or runtime.pcvalue in their profiles, re: PC value data encoding experiments.
- Rhys: I have an app that does a lot of copystack. 6% in runtime.pcvalue+runtime.step in the profile with "the most pcvalue".
- Austin: That's where I'd expect to see it hit the hardest. I suspect it's pretty low on average, but some apps hit it hard.
- Rhys: This app starts a lot of goroutines that grow their stacks.
- Michael K: I've been working on the trace parser. I'm getting to the partial ordering part of the parser. I got all of the structural elements out of the way. I plan to start with just a streaming API and to put it in x/exp. What would be the best way to have this parser both exist publicly and use it from within std? Exp isn't great for this because we don't want to vendor exp. Maybe we could vendor x/debug?
- Austin: What's so bad about vendoring from exp?
- Michael K: I suspect Russ will be like "we do not want to depend on exp in any official capacity". I do want this to live in two places so people can start using it, with the knowledge that the API might break. The long-term plan is to have a stable API, but the devil's in the details of how events look.
- Michael K: Leaning toward opaque value types. I want to leave the option open for lazy decoding. E.g., switch over event type, then say "event.GoStart()" and that returns the event as a GoStart.
- Rhys: I will note there's a lot of stuff vendored into cmd.
- Felix: Lazy decoding sounds good from the perspective of online processing.
- Michael K: Right now it's a completely streaming API, but I want to be able to derive one of these streamers from a ReaderAt. Mmap the entire trace, and the reader either does a buffered read or keeps just offsets to minimize the number of copies that happen. We could keep references into a mmapped byte slice.
- Rhys: For the programs I have that read execution trace data, often they first sort it by goroutine. I don't know if there's a way to do that that's still streaming without loading the entire thing into memory. If there was a way to have multiple iterators without having to reparse the entire file…
- Michael K: Because of partial orders across Ms, things have to come out in time order and obey the partial order. If you're just streaming it out, the parser already has to have state for every G and P, you could process it in total order, but copy the events out to separate output streams for each G, etc.
- Rhys: Could we have an API that allows filtering over the streaming data, without discarding all of the work it does to figure out where the chunk boundaries are? But maybe calculating the partial orders is expensive enough that you have to keep everything in memory.
- Michael K: No matter what you have to buffer an entire partition at a time.
- Felix: Re sorting events by G, how do you handle events that refer to two goroutines?
- Rhys: I look at the goroutine that emitted the event. E.g., if one goroutine unblocks another, we have the Unblock on the source and a GoStart on the target.
- Michael K: I was expecting the parser API to be pretty low level. E.g., for go tool trace, I expect this to sit below the current high-level allocation-heavy work it does (at least to start). If you want to do higher-level analysis on top of that, it should be possible. If you want to keep more state around, that should be up to the consumer.
- Austin: I think there are asymptotic issues at play. You can write processed streams out to disk, and I think that might be necessary to scale up arbitrarily.
- Rhys: Pprof is an example of writing to disk. E.g., you can process a pprof with a focus flag and write out a much smaller profile that you can then work with further. Maybe the first time you run "go tool trace" it rewrites the trace in sorted order and flags it as already-sorted. Or add filtering in the future. Going to disk is fine, but I want to keep things in the same format so the ecosystem keeps working.
- Dominik (Chat): what I found is that no order is convenient for all analyses of the events. All of G, P, and M orders are useful.
- Michael K: That might be doable. I'm reluctant to have a second format.
- Michael K: Filtering by G is interesting… Early on, Russ said to me that a trace is most useful when you have the full context.
- Austin: If people have a lightweight way to refer to events, is that useful? It wouldn't have all the order information (so you couldn't seek the streamer to that point, though you could do that but it would be much heavier-weight).
- Many: nod
- Rhys: The equivalent of focus for an execution trace is "I have a 5min execution trace, but I just want to see this 30 second window starting here." One of the things you can do with go tool pprof is you can have 100 profiles, but all these samples are from the same machine so you can merge them all together by adding a label. The equivalent in the execution trace might be to process a trace and add extra information, say what request a goroutine was handling, and write that back out to be processed by other tools.
- Michael K: For focusing on a time window, partitions get you that. Slicing a trace is very easy now. You still have to go over the trace to figure out the time windows (which is very fast), but then you can copy out the partitions you need. Long-term I proposed the trace itself could write out a partition table at the end so you don't have to do that yourself, but I'm parking a bunch of features so I can just get this done. (Also self-description.)
- Felix: On our end we're creating tasks for every request that we're tracing as part of DataDog's distributed tracing, and we're using the logging feature to log the span ID (request ID). We've been able to build this ourselves but it would be nice to have this for all the tools. We're currently considering goroutines involved the way pprof does (inherited from parents). Having a way to slice-and-dice based on these user-level log messages seems useful.
- Michael K: Is the issue with the current API that you can have tasks, but you can't associate your own data with it? If you could associate arbitrary data with the task, would that be helpful?
- (A task just has a single string name. You can then log with key/value pairs.)
- Dominik (Chat): Yeah. Right now, the only custom data we have is logging strings, and the association is implicit, and really bound to a G, not a task. Arguably it could also be useful for regions.
- Felix: I don't necessarily need the information associated at task creation time, as long as tools agree on how to process this.
- Rhys: More on annotations, when you make an HTTP request you don't know what the method is, the account ID, etc. They're things you might want to associate with a task but you don't know until later. There are also places where interesting things happen and it's not possible to annotate with tasks and regions. There's DNS requests that happen, for instance. I think it's powerful to be able to add instrumentation after the fact and be able to write that out and use that with all the existing tools.
- Michael K: Writing out an updated trace should be straightforward. I'm not sure I follow the instrumentation after the fact.
- Rhys: The idea is to look for events with a particular stack that look a particular way.
- Michael K: This annotation seems somewhat fragile, and so I was thinking of std packages pushing trace events into the execution trace.
- Rhys: It is fragile, but it only changes once every six months (at most) and it's easy to update. It also depends on having more than just one stack frame of context.
- Michael K: Yeah, I abandoned that idea. It's not actually meaningfully faster.
- Rhys: Also sometimes I want to go back in time and figure out how many resources (CPU time, memory allocations) it took to figure out that we even had an HTTP request to handle.
- Dominik (Chat): We should put runtime/metrics into the trace.
- Michael K: Agreed. Michael P recently suggested the same thing.
- Michael K: Felix, you mentioned issues with worker pools. Couldn't it be instrumented with regions and contexts passed in? You don't necessarily need the library to instrument.
- Felix: I consider all the various background goroutines as part of that, and that's basically impossible to instrument.
- Nick (Chat): The main missing spot is non-blocking channel operations. If the worker pool is also instrumented, it doesn't matter. But otherwise you only have the correlation if there's some synchronization that points to the worker goroutine.
- Dominik (Chat): Nick, but regions are annotated with the task from the context, isn't that sufficient?
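As a sketch of the per-goroutine demultiplexing idea discussed above (splitting a single in-order event stream into per-G sequences without reparsing the trace), using a purely hypothetical Event type since the parsing API was still being designed at the time:

```go
// Package tracedemux illustrates the demultiplexing idea from the notes above.
package tracedemux

// Event is a hypothetical stand-in for a parsed trace event.
type Event struct {
	Goroutine uint64 // 0 if the event isn't associated with a goroutine
	// ... timestamp, event type, arguments, stack ID ...
}

// DemuxByGoroutine consumes events in trace order once and copies them out
// into per-goroutine sequences, so per-G analyses don't have to reparse the
// file. For very large traces the per-G sequences could be spilled to disk
// instead of held in memory, as discussed above.
func DemuxByGoroutine(events <-chan Event) map[uint64][]Event {
	byG := make(map[uint64][]Event)
	for ev := range events {
		byG[ev.Goroutine] = append(byG[ev.Goroutine], ev)
	}
	return byG
}
```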
Comment From: mknyszek
2023-08-31 Sync
Attendees: @mknyszek @rhysh @bboreham
- Most people were out. Michael K and Rhys discussed a few things, but there isn't much to note down that wasn't mentioned later. Bryan had to leave early.
Comment From: mknyszek
2023-09-14 Sync
Attendees: @rhysh @prattmic @mknyszek @felixge @nsrip-dd
- https://go.dev/cl/527875: trace: add experimental execution trace parsing API
- https://go.dev/issue/62627: proposal: x/debug/trace: add package for parsing execution traces
- https://go.dev/issue/57071: runtime/metrics: add a way to measure runtime-internal lock contention
- Have code, including for Michael P's suggestion of getting call stacks
- How to test??? There's no elective blocking on runtime-internal locks, as there is in sync.
- OTel: adding profiling as a supported data type
- https://github.com/open-telemetry/oteps/pull/237
- "What would we like to fix about the pprof format"
- Timestamps are first-class
- Still similar to pprof
- Should it be pprof / should pprof be it?
- Active debate ongoing
TODO(mknyszek): Go back and fill in the notes from the recording.
Comment From: mknyszek
2023-09-28 Sync
Cancelled due to Gophercon US.
Comment From: mknyszek
2023-10-12 Sync
Attendees: @mknyszek @prattmic @felixge @rhysh @nsrip-dd @bboreham
- Michael K: Working on fixing bugs in the 1.22 tracer. Based on how this goes we'll decide whether it will be enabled by default in 1.22. Should be possible to opt out either way. The existing big test is very good at finding problems.
- Felix: We’d be interested in contributing to the new parsing library so it can read < go1.22 traces.
- Michael K: Parser source of truth will be in stdlib and copied to /x/ on a regular basis. Vendoring stuff into the stdlib the other way around is more painful.
- Felix: Sharing go1.21 trace STW analysis results from Datadog services.
- Interested in flight recorder. Tried to build a version of it with the current trace. While we're doing that we were looking to see if we could ship it. The concern was STW pauses. How bad are the pauses? For most services it's OK. We saw on average between 5 and 40 ms. 15k - 200k goroutines. Proportional to the number of goroutines. 300ns per goroutine approximately. Internal service using 2M goroutines, that one took 530 ms. The removal of pauses in the new tracer will matter. Will be nice to get some real-world experience with flight recording.
- Michael K: Only one STW pause on starting tracing and it's "constant" time.
- Rhys: Out of curiosity, is anything in the runtime proportional to the number of Ms?
- Michael K: Not really. There are some things that are explicitly per-M, but the GC and scheduler only really cares about running Ms. The burden is usually on Ms without Ps to synchronize with the runtime.
- Michael P: Are you by any chance increasing the max thread count?
- Rhys: I haven't worked with programs that would exceed the limit since before the limit existed.
- Rhys: https://go.dev/issue/63340 “I am also open to deleting /gc/pauses:seconds entirely.” (Is this the right group to discuss runtime/metrics deprecation posture?)
- Thanks to Michael P for looking into it. I like the split you've come up with.
- Michael K: This is not too hard of a case because we can offer an off-ramp for the metric. If the metric has a clear use-case, we don't want to remove it without offering an alternative. Deprecation and waiting a few releases is fine; that gives users an opportunity to notice that the metric is going away (see the sketch after these notes for how a program can detect an unsupported metric today).
- Michael P: The documentation as it exists is not great for discovering metrics or whether they're deprecated. Also, should there be a GODEBUG flag for reenabled deprecated metrics?
- Michael K: RE: documentation, I was thinking of listing all deprecated metrics in perpetuity. We shouldn't reintroduce a dead metric with a different meaning.
- Michael P: One issue is that you have to go to the docs at all to identify that a metric is deprecated. There's no vet tool.
- Rhys: Two ideas. The first idea is to let the metric be there but present no data. The second idea is to have a metric to query whether any deprecated metrics are being used.
- Michael K: We could do both?
- Rhys: GODEBUG is interesting because the app owner can fix their program without fixing the code of a library, which might be hard for them to do.
- Michael P: I'm not thrilled about the zero value idea because there are quite a few metrics that stay zero for a long time. We could use a max uint value, but it's still awkward.
- Felix: +1 to zero being a bad value to indicate deprecation. The metrics we released into Go 1.21 were originally broken and all zeroes, and so that would be indistinguishable from deprecation.
- Rhys: Is the Go version in the go.mod file something that's available for this purpose? Basically, you can update your Go toolchain but metrics would retain their old names.
- Michael P: IIUC if the Go version in the main module is less than the Go version of the toolchain, GODEBUG values will automatically be set. If we wanted to rely on this, just using GODEBUG seems to make sense.
- Michael K: What do we do about metric values that don't make sense in a new runtime? Hm, maybe the zero value is fine in that case.
- Michael P: I agree it's a theoretical concern, but I can't think of a case where the zero value wouldn't work. If you're collecting a bunch of metrics and summing them to get a total, and you're not aware of a new metric, your total might not include it. For example, some new memory class.
- Michael K: For that specific case, I think it's OK. We have a catch-all memory metric (other) and a total. So we could just add new memory into the catch-all for compatibility.
- Rhys: If you set the version of runtime metrics to be Go 1.21 for Go 1.22, do you not see new metrics?
- Michael K: New metrics are fairly harmless, so it might be OK.
- Rhys: Moving metrics around could be tricky in some cases. Let's discuss the details on the issue tracker.
- Michael P: Do we emulate an old version of Go, or do we just do the bare minimum for compatibility? It can probably be argued both ways. GODEBUG traditionally hasn't been the former, but we are also trying to make it more comprehensive. Let's discuss further on the issue tracker. For the STW metrics proposal we can move forward on that by simply keeping both.
- Bryan: post-meeting note: wanted to link prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today. Currently it implements "some set from 2017" and "everything". (We will follow-up on this in the next meeting.)
- Michael K: Should Ps be part of the trace parsing API? They're kind of an internal implementation detail.
- Felix: How likely is it that there's a future where Ps go away? What about rendering P-based timelines?
- Michael K: I'm not certain of the value of rendering P-based timelines when you have per-M timelines.
- Rhys: Currently I use Ps to guess what runqueue a G was in.
- Michael P: As someone who works on the runtime, I'm biased toward knowing implementation details and having per-P details. I agree a per-P timeline mostly isn't directly useful, but it does compress the view. Minor but handy that more fits on screen.
- Felix: +1. It impacts the visualization. The use-case of understanding the amount of runnable goroutines.
- Michael K: There is an accepted proposal for the number of goroutines in each state, which may resolve that use-case.
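Related to the deprecation discussion above, the runtime/metrics API already gives programs a way to notice that a metric they ask for isn't supported by the running toolchain: the returned value has kind metrics.KindBad. A small sketch:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /gc/pauses:seconds is the metric discussed in go.dev/issue/63340.
	samples := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindBad {
			fmt.Printf("%s is not supported by this runtime\n", s.Name)
		} else {
			fmt.Printf("%s is supported\n", s.Name)
		}
	}
}
```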
Comment From: mknyszek
2023-10-26 Sync
Attendees: @mknyszek @prattmic @felixge @rhysh
- Michael K: New tracer in good shape, stability-wise.
- Michael K: Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)
- Felix: How much CPU overhead for trace parsing?
- Michael K: Parser should be able to do 26 MiB/s. Demo is ingesting ~4.6 MiB/s.
- Rhys: Maybe we could use something like this to build a top for Go, e.g. show CPU usage of goroutines in a list?
- Michael K: New tracer Sweet benchmark results. (Raw data and new vs. old tracing overhead comparison.)
- Michael K: New tracer is slightly slower than old tracer. Fixed some problems already, but needs more debugging. Trace size is ~12-20% bigger. Haven’t analyzed why yet.
- Michael K: Previous tracer had an optimization for goroutines running on the same P again that they were running on before. Got rid of this because of complexity.
- Michael K: Ps have to have sequence numbers too.
- Michael K: If you have a region that starts before tracing starts, you’re going to lose that information today.
- Michael K: If you slice the trace, it’s no worse than if you started the trace at this point.
- Rhys: Did you get rid of listing all goroutines at the start of the trace?
- Michael K: Yes.
- Felix: Can you publish these results somewhere?
- Michael K: yes.
- Felix: Trace size increase is a little worrying, but probably okay given the other improvements around partitioning.
- Michael K: Will make the parser faster, but might not have a chance to work on making traces smaller.
- Michael K: Current trace code has zero abstraction over trace events. Can't take the new event API and plug it into the old one. It would be difficult to fit the new clean syscall handling into the old events. Probably need to fork cmd/trace. I could use help with this if somebody is interested?
- Rhys: I got a lot of code relying on the old trace events without a lot of abstraction.
- Felix: I could make some time next week.
- Michael K: Would be great; I want it to go in, and ideally be on by default.
- Michael K: Tracer seems stable, haven't gotten a crash in one week. One edge case is C calling back into Go. I haven't had a chance to debug this yet. Doesn't crash in Sweet otherwise.
- Felix: Will you have time to fix the cgo stuff?
- Michael K: Yes.
- Rhys: Seeking tradeoff advice on the runtime-internal mutex profiler. Collecting the call stack once we know that we're going to keep the record extends the critical section "probably too much". In a production app with GOMAXPROCS=36 and MutexProfileFraction=10 and a high level of contention on a particular channel lock, CPU usage increased by 0.3x. So, capture the stack earlier? Larger M, or larger stack requirement in lock2?
- Rhys: Where do we store call stacks for contended locks when we're not able to allocate memory? Made a field on M that can store one of them. If there is a second, it chooses between the two using the nanoseconds-of-contention value. I'd like it to be fair. I delay capturing the stack into M until I know I'm going to save it. This means I have to delay until we capture the timestamp. We only know the duration once we get the lock (?). … It was using too much CPU. Maybe the mutex profile fraction is too high and should be scaled back? It seems like I need to move capturing the stack out of the critical section, but where can I store it? Worried about putting 32 pointer-sized variables on the stack for every call to lock.
- Michael P: Can you do it in unlock?
- Rhys: Then I need somewhere to stash the duration. That would mean making every lock/channel bigger?
- Rhys: Are locks always unlocked by same M, no hand-offs?
- Michael P & Michael K: Yes
- (more discussion)
- Rhys: I’ll stash a pointer to the lock on the M and capture the stack in unlock.
- Felix: Deep stacks grinding Go runtime to a halt.
- Michael K: Do you use MEMLIMIT? Will trigger GC more often if stack size grows.
- Felix: Will share more info once we debug this.
- Michael K: How should GOMAXPROCS be represented in the trace?
- Currently, when Ps are named they stay alive forever in the parser. If GOMAXPROCS changes the number of Ps, then nothing happens. Those Ps just stay idle.
- Question: should GOMAXPROCS represent some kind of Idle->NotExist transitions for these Ps? Or is this OK?
- Felix: It's interesting to understand when Ps actually disappear, but I haven't had a use-case for it before. If it's easy to add, I don't see a reason not to add it.
- Michael K: I don't think it's too complex, but I'd have to think through everything. It would be something like a ProcCreate/ProcDestroy event pair.
- Felix: There are a few places where GOMAXPROCS emits events even when the number of procs doesn't change.
- Michael K: It's emitted on every start-the-world. To be clear, it would only emit create/destroy events for the diff in GOMAXPROCS.
- Rhys: How hard would it be to identify what runqueue a goroutine came from?
- Michael P: We're talking about Ps being in _Pidle, but that also can't have anything on its runqueue.
- Rhys: It can have a non-empty mcache. But runqueues are just the more interesting thing to me today. What becomes harder to track if we don't have a ProcDestroy event?
- Michael P: What's the reason to not make procresize have not-exist transitions? Seems like the consensus is that it's maybe not critical, but it would be easier to make sense of the traces.
- Michael K: Mostly just worried because we don't have much time until the freeze.
- Rhys: The wire protocol includes the count of Ps. This seems like it's a question about what the parsing API should present to applications using the parsing API. If the parsing code can figure out which Ps are dead, then it can just expose that.
- Michael K: It's a bit harder to get it in the parser today because the parser likes to have a 1:1 correspondence between wire format events and the events it emits. The new events would actually be easier.
- Rhys: Uber has a library that changes GOMAXPROCS but not super often.
- Michael K: Even a few times an hour, it's probably not going to be a big deal.
- Michael P: Given that the API change here is pretty small and it's something we can leave for now, we can do it in Go 1.23.
Comment From: mknyszek
Since it came up in the last sync and I don't want it to be missed, I updated the previous sync notes with a link to the raw performance data from Sweet: Raw data and new vs. old tracing overhead comparison.
Comment From: narqo
Demo. (Summary: live trace analysis of a service under load. Ask for the recording if you're curious!)
@mknyszek if you could upload the recording to a public channel, and add the link to the file to the notes, this would be cool
Comment From: mknyszek
@narqo Hey, sorry for not getting back to you; I've been working on getting stuff (like the tracer :)) in before the code freeze. I'll get back to this soon. Thanks for your interest and sorry for the delay.
Comment From: mknyszek
2023-11-09 Sync
- Attendees: @rhysh @mknyszek @prattmic @nsrip-dd @bboreham @felixge
- Michael K and Felix: Status update. New tracer will land for Go 1.22, gonna try to get cmd/trace ported in time. We have a freeze exception for that. Whether it's enabled by default will depend on cmd/trace being finished in time.
- Bryan: post-meeting note from two meetings ago: prometheus/client_golang#994; it seems to me the Prometheus Go library could host some opinions about what are a good set of metrics to collect today. Currently it implements "some set from 2017" and "everything".
- Michael P: Were you thinking of documentation as to what's the most interesting or an API that would expose this?
- Bryan: I'm thinking that Prometheus would just pick the good set. The Go runtime internally could document the good set.
- Felix: Having a good set of metrics as a default for libraries that ingest metrics is valuable and we're interested in that too. I'm not sure if that should be curated by the runtime or if some provider can make that choice. I'm also involved in an internal project to determine what the most interesting metrics are that should be in a default dashboard that doesn't overwhelm people.
- Rhys: The discussion on the Prometheus issue seems to be about histograms and bucket alignment, etc. Is this about interesting metrics or names of metrics?
- Bryan: I didn't mean to focus on histogram metrics. IIRC the last time this was a problem, it added 88 metrics per process. The set that was present in 2017 is probably not a great set today, but the set of everything will probably make people unhappy. My target is to remove some of the ones from 2017 and to add some new interesting ones.
- Rhys: Could this be a blog post or some living document describing the most interesting metrics?
- Bryan: That would help.
- Michael K: I'm slightly worried about picking a 'good set' because it may hide the fact that more detailed metrics exist. Perhaps that's just a communication problem.
- Felix: I can get away probably with exposing the metrics we want that are useful to us, but then there's still the dashboard question.
- Rhys: My strategy is to first record across the cluster if there's something interesting to look at, and then separately outside the metrics system to have more data at a per-process level in a cheaper way. e.g. goroutine profiles, heap profiles, etc. They're available and they're cheap.
- Michael P: Internally at Google it tends to work this way. The metrics system will only collect a small subset of metrics company-wide, but every process has a dump of everything that's always available to look at.
- Felix: This came up in the OpenTelemetry profiling discussions. Pixie would store things like profiles on the host, but it's difficult to run queries to learn things about aggregate behavior (e.g. bash scripts to aggregate the data).
- Michael P: Pricing per sample. Histograms are charged per bucket so they're more expensive. But our histogram metrics are some of our most interesting metrics.
- Michael K: I think I'm convinced we should do this.
- Rhys: Looking forward to having an encoding/json library that can serialize histograms and doesn't choke on +Inf and -Inf. (A workaround sketch follows these notes.)
- Felix: Why does it currently choke?
- Rhys: Maybe I should open an issue with the details.
- Rhys: We should add cross-linking between ReadMemStats and runtime/metrics.
- Michael P: Along those lines, someone sent me some sample code that used ReadMemStats. It made me realize the docs for that don't even mention runtime/metrics.
- Rhys: Should we add an API to runtime/metrics to write data into MemStats?
- Michael P/Rhys: What to do about mutex profile semantics: https://go-review.googlesource.com/c/go/+/528657/6..8/src/runtime/mprof.go#b669
- Michael P: Last time Rhys was discussing whether we should collect stack samples at lock or at unlock. We decided unlock would be simpler to implement and would match mutex profiles better. That's true, but we've realized that even though both are collected on unlock, there's a difference in semantics. The sample value in the current (sync.Mutex) profile is how long a waiter has waited for the lock. In contrast, the sample value in the new runtime-internal lock profile is how long the lock holder waited. Rhys proposed in the CL to copy the sync.Mutex behavior so that we do record how long waiters are waiting. I like this. But we almost certainly can't do that for 1.22. Should we submit what we have or should we wait for 1.23?
- Rhys: This would mean the way we use the key in lock futex would be unnecessary for the core function of locks. It's only necessary for the profiling. I don't know if this going to slow down locks on Linux.
- Michael P: I don't think it should affect performance on the fast path, only if the lock is contended. If the lock is contended you'll be sleeping anyway. But we'll see.
- Michael K: Inventory cost? If the correct semantics need a full rewrite then I'm not sure it's necessarily worth landing.
- Rhys: That's not my only concern with the patch. An earlier version of the patch did show an increase in CPU time. I wrote a benchmark for the newest version and it does slow down the contended case from 6µs per op to 7µs per op. I'm wary about landing it now. Perhaps we can add a GOEXPERIMENT and put the awkward semantics behind that.
- Michael P: I support a GOEXPERIMENT. We need tests to make sure the GOEXPERIMENT doesn't break.
- Michael K: +1 to GOEXPERIMENT.
- Rhys: Do GOEXPERIMENTS have an on-going cost? Some GOEXPERIMENTS end up as runtime/metrics and stick around for 4 releases.
- Michael P: I think you're thinking of GODEBUG which is technically different. GODEBUG is set at run time, GOEXPERIMENT is set at build time. At this point, all GOEXPERIMENTS are completely unsupported. All GODEBUGs we add specifically for compatibility, so it's kinda in-between, because there are old GODEBUGs that are more in the unsupported case.
- Felix: Small question re: the GOEXPERIMENT. Do you have to rebuild the toolchain?
- Michael P: No. There are a few that may be specifically for the compiler that need to be set when building the compiler, but in general they can just be set at go build time. Though, that makes me think that perhaps this could be a GODEBUG. The main benefit of the GOEXPERIMENT is that it would be a build flag and you could be absolutely sure there's no performance impact. There would still be a flag check for GODEBUG.
- Rhys: It seems like there are a bunch of "ifs" already in lock/unlock. I'm also worried that teams have set the mutex profile fraction that's appropriate for their use of sync.Mutex, but that's no longer appropriate for runtime locks.
- Michael K: Is this actually a breaking change?
- Rhys: CPU went up by +0.5x in the earlier version I mentioned before. They run with a mutex profile fraction of 10 and GOMAXPROCS=36.
- Michael P: I think this is a real concern in the final version. This might be a good example of a compatibility GODEBUG added for the final version. My hope would be that we get things to a point where it's not a problem for anyone.
- Rhys: That's my hope too, but we're less than 2 weeks from the freeze so I think we just didn't make it for Go 1.22.
- Michael P: I think the conclusion we're coming to is that we can add a GODEBUG, but it would be off for Go 1.22. It wouldn't affect anyone by default. In Go 1.23 we'll fix the semantics and turn it on by default, but they'll be able to turn it off.
- Michael K: GOEXPERIMENT or GODEBUG?
- Rhys: Not worried about the extra branch; it's only in the contended case.
- Michael P: If that's not acceptable now, it also won't be acceptable later.
- Bryan: we found we were spending up to 5% of CPU in runtime.morestack ref https://github.com/golang/go/issues/18138
- The default behavior of gRPC is to fire up a new goroutine per unary call. Eventually someone noticed in a profile that it's spending 5% of its time in morestack. It has an option to use a worker pool. The end result of this realization is that everyone is going to build worker pools and we lose the idea of goroutines along the way. I'd like to avoid that. There is mention on the issue that PGO could solve the issue. It did take a very long time for someone to notice.
- Rhys: I don't think that we have a profile that shows how big a goroutine's current stack needs are, or how big it's grown to. It shows up in the CPU profile but we don't have the numbers.
- Michael P: I agree we could have better visibility. Since a recent release the starting stack size is adjusted based on average overall stack needs. It's fairly conservative. It would be interesting to see if it's firing at all.
- Michael K: The current strategy is conservative because it's a global knob. If we made it more fine-grained we could be aggressive. Keith (who implemented the adaptive stack size stuff) has had some thoughts about it. But he went with the global knob to begin with because it wasn't very hard to implement.
- Rhys: This is 5% of the CPU time that's spent, but it's on the critical path for processing requests. One thing on my mind is that I'd like to see that in execution traces. What do you think about stack growth events?
- Michael K: Stack growth happens a lot so I'd be concerned about the volume of events. I'd be in favor of it if we sampled it.
- Rhys: Sampling will make it harder to reason about. You can kind of already infer this from the CPU sample event, too.
- Bryan: The number of people reporting the issue may not be related to the number of people experiencing the issue.
- Michael P: That's a good point. Part of the confusion is that 18138 doesn't mention or include adaptive stack growth. Maybe we should have closed the issue and made a new one. Definitely still worth posting on the issue. RE: stack growth events, I'm less concerned about the volume of stack growth events. They happen the most when new goroutines are created, so the number of stack growth events should be linearly related to the goroutine creation events.
- Nick: You can kind of figure out how big a stack needed to be from a traceback, since you know the size of each function. Maybe that can help.
- Michael K: This is actually fairly easy to do in gdb as it turns out.
- Rhys: That's a great point. That requires the executable though, and there aren't many tools that we discuss here that require the executable. Could that information go into the symbol table that we ship with goroutine and CPU profiles?
- Nick (in chat): I thought about this mainly in a PGO context, where you know the frame sizes at compile time. But I guess that wouldn't tell you what they were at profile collection time, to your point.
- Michael P: You can attach information to samples but not to frames in the pprof format, which is a bit annoying.
- Felix: If you have more thoughts on what's annoying and missing in pprof, there's a group in OTel discussing pprof and possibly merging the profiling formats, so I could take those ideas to that group.
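A sketch of the ±Inf problem Rhys mentions and one possible workaround: encoding/json refuses to marshal infinite float64 values, and runtime/metrics histograms can have infinite bucket boundaries, so they need rewriting before serialization (the string encoding below is just an illustrative choice).

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"runtime/metrics"
)

// jsonHistogram is a JSON-safe view of a runtime/metrics Float64Histogram:
// infinite bucket boundaries are replaced with strings, since encoding/json
// returns an error for +Inf and -Inf float64 values.
type jsonHistogram struct {
	Counts  []uint64 `json:"counts"`
	Buckets []any    `json:"buckets"`
}

func toJSONHistogram(h *metrics.Float64Histogram) jsonHistogram {
	out := jsonHistogram{Counts: h.Counts}
	for _, b := range h.Buckets {
		switch {
		case math.IsInf(b, 1):
			out.Buckets = append(out.Buckets, "+Inf")
		case math.IsInf(b, -1):
			out.Buckets = append(out.Buckets, "-Inf")
		default:
			out.Buckets = append(out.Buckets, b)
		}
	}
	return out
}

func main() {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	data, err := json.Marshal(toJSONHistogram(samples[0].Value.Float64Histogram()))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```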
Comment From: mknyszek
2023-12-07 Sync
Attendees: @rhysh @mknyszek @prattmic @bboreham @nsrip-dd @felixge
- Michael K: Everything's in for the new tracer. Lots of cleanup to do for next release, like moving old traces behind the new API. Working on a prototype of the flight recorder for x/exp.
- Michael P: There's also x/debug that needs fixing, but I might not get to it before the holidays.
- Felix: RE: making the new API compatible with old traces, I thought that might be too hard?
- Michael K: The difficulty was with exporting the new traces behind the old (internal) API. I don't think it'll be hard to export the old traces behind the new API. It's just work.
- Michael K: Russ Cox was concerned about the performance of traceAcquire, since there's a trace.gen atomic load to check if tracing is enabled. I didn't notice any problems in scheduler-heavy microbenchmarks on amd64 or arm64, but his point is that coroutine switches might be happening an order of magnitude more often, so every little bit (especially with respect to atomics) counts a lot more. He suggested adding a non-atomic bool that's set during the trace start STW and then unset when tracing stops later. This would replace the first check of trace.gen, so if tracing is disabled there's only the non-atomic check. I'm pretty sure this works because we always double-check trace.gen, but I have to think about it more in the case of repeated trace starts and stops.
- Michael P: The atomic load of trace.gen adds a happens-before edge, so there shouldn't be any flip-flopping. I think it sounds reasonable. I was going to suggest making trace stop and start get the attention of every M and set a bool on them, but that's more complicated.
- Felix: I'm curious what y'alls thoughts are about coroutines and diagnostics.
- Michael K: In tracing it's just handing off goroutines.
- Michael P: +1, it's just goroutines. The CPU profile will treat them as separate goroutines.
- Michael K: That reminds me; there was some discussion about tracebacks merging the stack traces from multiple goroutines.
- Felix: Will there be a parent/child relationship between the coroutines?
- Michael P: I would expect so.
- Michael K: I'll note that this is only for iter.Pull.
- Michael P: The proposal isn't yet fully accepted, so this can still be discussed on the proposal. This probably won't stop the proposal though, unless we decide that the current proposal makes it impossible to create sensible information for debugging. Delve might have some fun with this.
- Topic: OS Metrics (Process RSS) for runtime/metrics (Felix Geisendörfer)
- Backstory: playing a lot with runtime/metrics, especially the memory classes. I've been writing a blog post explaining the memory classes and how this relates to OS metrics.
- One thing I noticed that's lacking is estimating how much non-Go memory a process is using. Our strategy is to take a best guess of the physical memory used by the Go runtime and subtract that from the OS's RSS numbers (see the sketch after these notes)… assuming there's no shenanigans like ballasts going on (which should be less common now). From that perspective, does it sound reasonable to add OS metrics? At some point you have to reconcile those numbers anyway. Lots of libraries collect these metrics differently, making for a confusing landscape. Are there any good reasons not to make this a proposal?
- Bryan: I like the suggestion, but I think it's going to be very hard to get meaning from the OS numbers. RSS includes memory the OS just hasn't kicked out yet. If you want to know if you're running out of memory you need the PSI metrics from cgroups v2.
- Felix: Are you specifically thinking about the page cache or something else?
- Bryan: If you write a bunch of stuff to disk it'll sit in your RSS. But specifically that's not the working set, which is a much tighter definition.
- Rhys: Are you talking about a write syscall or something like a memory mapped file?
- Bryan: I'm talking about a write syscall. I proposed a talk to KubeCon along the lines of "why does my process OOM." I proposed this to try and force myself to figure this out.
- Michael P: I have a similar concern to Bryan's. There are so many metrics and it's hard to make sense of them. But Felix is also right: at some point somebody has to understand it. The metrics also differ on different OSes. Should we have different metric sets for different OSes?
- Felix: I would also like to point out there might be some low-hanging fruit. One particular issue is looking at your memory usage and getting surprised that it doesn't match with the heap profile.
- Michael K: At the bare minimum, I think we should communicate that the heap profile isn't exactly intended to map directly to actual memory usage.
- Felix: Except that in practice I find the heap profile to be pretty accurate, within single-digit percent error.
- Rhys: Haven't been closely involved with teams debugging OOMs, but my understanding is that the heap profile is confusing and insufficient for this use-case. Confusing because it's off by a factor of GOGC, and insufficient because it only includes reports of allocations once they've been around for a cycle or two. You don't get data on the most recent allocations that caused the OOM.
- Michael P: Just double-checked the Linux kernel source that Bryan is correct. The RSS is effectively the number of pages mapped into your page table. So if you map a file, the faulted in pages are in your page table. If your memory is low then those pages get zapped. That's the core reason why it's usually confusing. Anonymous memory is pretty straightforward, but files get complicated.
- Felix: Thanks for confirming that, but I think that just shows that it's important to give users more guidance. I think what most people want is what happened just before dying, but it's really hard because the OOM killer sends a SIGKILL. But at the same time, if only 20% caused the failure, what about the other 80%? Maybe that's more important, and we can get that information.
- Bryan: I second that the post-mortem heap profile is difficult to get.
- Michael K: The state of OOMs on Linux is fairly dismal, and we could spend a lot of time working around it, but I wonder if maybe we should put some effort into engaging with that community to change things. Regarding where to export OS metrics, I think Jonathan Amsterdam was starting discussions around standard library metrics and package for that. If that allows users to hook in their own metrics, we could provide an x/debug package that inserts platform-specific memory metrics. I feel more comfortable with that because platforms all provide different metrics and the x/ repositories have less strict compatibility guarantees, so we could evolve with the various platforms more naturally. The barrier to fixing up these metrics as things change also goes down. (Not said in the meeting, but in my mind this is like the difference between the "syscall" package, now frozen, and the "x/sys/unix" package which is allowed to continue to evolve relatively fluidly.) If we could spend some time and apply the expertise in this virtual room to do this work once, I think that would bring a big benefit to a lot of people without that much additional work. As a final note, I've also run into plenty of situations where, for example, some system provides a metric like "bytes of memory in use" with a similarly vague description, and figuring out exactly what that means and how its derived is endlessly frustrating.
- Rhys: The frequent cause of OOMs are not giant processes on one machine, but usually tiny processes with small resource limits. So if it's OOMing, it's doing so on a regular basis. So what I want to do is stash a heap profile to disk every once in a while. But it only includes data from a couple GCs ago, so it'll always be stale. RE: RSS, I don't know if it's helpful, but I thought there was a syscall on Linux that you can ask for what part of memory is actually mapped and sets a bitmap.
- Felix: /proc/* has this bitmap.
- Bryan: For the case where you really only have anonymous memory and no swap (most processes), the Go runtime could figure out what the limit is and generate the post-mortem diagnostic close to the limit.
- Michael P: I agree we could just do it ourselves and perhaps should. cgroups even have an event that sends a signal when close to the limit. My biggest annoyance with this limit is that all sorts of reclaimable memory gets counted against you, but it throws it away as soon as you get close to the limit. So you might spend a lot of time thrashing for memory but the kernel could drop some memory too. It's annoying that today these would have to be two completely different systems that don't know about each other.
- Felix: Just to recap, it's probably not worth a proposal for runtime/metrics, but maybe it could be in an x/debug package that hooks into some more general metrics package, like what Jonathan Amsterdam is starting discussions around.
- Michael K: That's my gut feeling, but I don't want to say for sure. Maybe I'm wrong about runtime/metrics being in the wrong place. In theory we could export to expvar, but that package is in a weird place. It's not a great API, yet there are people out there using it.
- Michael P: I think we should invite Jonathan Amsterdam to this meeting. I'll talk to him.
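(Not from the meeting, but for illustration: a minimal sketch of the "stash a heap profile to disk every once in a while" idea Rhys mentions above, so there's something recent to look at after an OOM kill. The directory layout, interval, and ring size are placeholder choices, not recommendations.)

```go
package heapstash

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"time"
)

// Run periodically writes the heap profile to dir, keeping a small ring of
// files so the most recent ones survive an OOM kill.
func Run(dir string, every time.Duration) {
	const keep = 10
	for i := 0; ; i++ {
		time.Sleep(every)
		path := filepath.Join(dir, fmt.Sprintf("heap-%d.pb.gz", i%keep))
		f, err := os.Create(path)
		if err != nil {
			continue // best effort; try again next tick
		}
		// debug=0 writes the gzip-compressed pprof proto.
		_ = pprof.Lookup("heap").WriteTo(f, 0)
		f.Close()
	}
}
```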
- Topic: Why is CPU profiling so different from other profile types (start/stop, given that there’s no streaming in the current implementation). Implications on emitting labels for other profile types, or including them in execution trace (especially heap profile sample events?). (Rhys)
- Michael K: I think CPU profiles are different because they implicitly require a duration (first Start and then later Stop). I'm not really sure why it's not just always-on, but my assumption is it's higher overhead. Regarding emitting heap profile sample events into execution traces, I'm hoping to use the new tracing infrastructure to bring in certain optional diagnostics into traces, like allocation events (if sampled, that's approximately like the heap profile), page allocation events, or possibly even pointer-write events to identify precise lifetimes for heap objects. This information would be useless to most people, but super useful for future runtime/compiler work.
- Felix: Thank you Rhys for pointing out the footgun of CPU profiles accumulating data over time without freeing it. That's a lesson I probably avoided learning the hard way. A workaround could be to
- Rhys: Michael K said CPU profiles are expensive. It's 0.6% or 0.7%, which is not huge. Maybe we could have something on all the time that samples even less frequently.
- Michael P: I find myself using Linux perf for the initialization part, since that's what you're stuck with. I wonder if in the SIGPROF handler we could check if we're at an async preempt safepoint and use frame pointer unwinding at that point. So we could reduce the overhead quite a bit. Signal delivery probably costs a bit, but I imagine tracebacks are the biggest cost.
- Michael P: The pprof.Profile API lets you make a custom profile. Has anyone actually done this? I don't see this much.
- Rhys: I've done it to count concurrent HTTP requests to find places where we didn't close the response body. Also, custom profiles are only for active handles. It doesn't help with accumulating things like bytes.
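(Not from the meeting, but a sketch of the kind of custom profile Rhys describes: tracking open HTTP response bodies with pprof.NewProfile so that leaked, never-closed bodies show up under the stack that opened them. The profile name and wrapper type are invented for the example; as Rhys notes, this only counts active handles.)

```go
package httptrack

import (
	"io"
	"net/http"
	"runtime/pprof"
)

// openBodies records a stack trace for every response body that is still open.
var openBodies = pprof.NewProfile("http.open-response-bodies")

type trackedBody struct {
	io.ReadCloser
	key *byte // unique value identifying this entry in the profile
}

func (b *trackedBody) Close() error {
	openBodies.Remove(b.key)
	return b.ReadCloser.Close()
}

// Get wraps http.Get so the returned body is tracked until Close is called.
func Get(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	b := &trackedBody{ReadCloser: resp.Body, key: new(byte)}
	openBodies.Add(b.key, 1) // skip=1 attributes the entry to Get's caller
	resp.Body = b
	return resp, nil
}
```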
- Felix: Have we ever measured the overhead of the CPU profile? I'm just wondering how much we're going to get.
- Rhys: When I look at execution trace data with timestamps and call stacks, there's a minimum time delta between execution trace events that include call stacks. My understanding from that is that a normal traceback takes ~30µs. I also don't know how much time it takes the kernel to do all its timer stuff. Linux perf would have the answers.
- No meeting in two weeks since it would be on December 21st. See you next year!
Comment From: mknyszek
2024-01-04 Sync
Attendees: @nsrip-dd @felixge @bboreham @prattmic @dominikh @rhysh @thepudds @mknyszek Daniel Schwartz-Narbonne
- Bryan: It came up that GOMEMLIMIT should replace a heap ballast, but we've found in practice that it doesn't work and we've had to roll back to a heap ballast. [Thread.]
- One example is a web proxy that handles a ton of requests: if you do nothing, you get a really small heap and very frequent GCs. GOMEMLIMIT makes you pick a number, so you can't burst up a bit. In practice we didn't like having to pick that number, because picking it too low causes issues.
- Felix: What about GOGC?
- Bryan: You need to have it at 1000 or 2000, so you get these really big spikes if you do nothing else. Some combination of the two ought to work; we made a couple of attempts but had to roll them back in production. In practice we set a heap ballast of ~1 GiB.
- Michael K: For some context, Bryan's situation involves overcommit of container memory, so simply setting GOMEMLIMIT and GOGC=off is problematic, since the runtime will just use all memory up to GOMEMLIMIT. We discussed both the idea of setting GOGC to some value and a much higher GOMEMLIMIT (closer to the true maximum, after overcommit), as well as other alternatives. The memory target proposal is a much closer fit, but it didn't get much traction. A similar issue also suggesting GOMEMLIMIT isn't good enough for all use cases was filed by someone. That was a little different since it involved a persistent memory leak, but the recommendations are about the same.
- Rhys: Ballasts are trivial, why have them in the runtime?
- Felix: Quick reply to ballast is that it messes with heap inuse profiles and metrics in ways that are confusing if you don't know there's a ballast.
- Bryan: We believed that you could map memory without Go touching it, but we saw in some specific cases that the runtime actually does end up zeroing the memory, paging it all in. +1 to Felix's point.
- Rhys: There's a blog post that points out a whole bunch of things that need to line up for it to work out. It's only easy if you get all the subtle parts right.
- thepudds: Is the memory target something that can at least be approximated in user space?
- Michael K: There are two ways: one is a ballast, and one is with GOGC (a rough sketch of the GOGC approach follows this discussion). The downside of the latter is that you're not going to get perfect responsiveness, but if you're just using it as an optimization, that's probably fine.
- thepudds: Someone could implement it as a package.
- Bryan: The mismatch between k8s CPU numbers and GOMAXPROCS is a pain, but yeah when the GC kicks up in response to memory pressure (as defined by GOMEMLIMIT) it can become a real problem.
- thepudds: Is SetFinalizer responsive enough?
- Michael K: Yeah, the Go compiler actually does this and it works out OK.
- David Chase CL for cmd/compile: https://go.dev/cl/436235
- Rhys: In the compiler we know that nothing is going to attach a finalizer that just blocks forever; if you know that's true in your application too, then it's responsive enough. But you can also write a finalizer that blocks all other finalizers from running, which is a hazard.
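(Not from the meeting: a rough sketch of the "approximate a memory target in user space with GOGC" idea above, using a finalizer as a per-GC-cycle heartbeat in the spirit of the approach Michael K mentions the compiler taking. The package name, bounds, and the use of ReadMemStats are all assumptions, not an endorsed API.)

```go
package memtarget

import (
	"runtime"
	"runtime/debug"
)

// Set adjusts GOGC after every GC cycle so that the next heap goal lands
// near target bytes, roughly live * (1 + GOGC/100) ≈ target.
func Set(target uint64) {
	arm(target)
}

// arm registers a finalizer on a throwaway object; it runs shortly after the
// next GC cycle, giving us a cheap per-cycle heartbeat.
func arm(target uint64) {
	s := new(byte)
	runtime.SetFinalizer(s, func(*byte) {
		adjust(target)
		arm(target) // re-arm for the next cycle
	})
}

func adjust(target uint64) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // brief stop-the-world; tolerable at GC frequency
	live := ms.HeapAlloc
	gogc := 100
	if live > 0 && target > live {
		gogc = int((target - live) * 100 / live)
	}
	if gogc < 50 {
		gogc = 50 // don't let a transient spike drive GOGC toward zero
	}
	debug.SetGCPercent(gogc)
}
```

As noted above, this only reacts at GC frequency, so it's an optimization rather than a hard limit.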
- Nick/Daniel: Adding GC roots to the inuse heap profile to improve memory leak debugging (proof-of-concept CL)
- Michael P: Seems interesting. We considered modifying the heap dump tooling to emit a valid core file, so the same tooling works on crashes and heap dumps. I think the core idea of figuring out why memory is held live is very valuable.
- Bryan: Wanted to call out that I got viewcore working again for 1.20 and 1.21.
- Can you give more details as to what this looks like?
- Nick: These are labels attached to profile samples that tell you something about the root keeping the object alive. The prototype just picks a random number of roots and it's non-deterministic. RE: core dumps, I don't know how hard it is to handle core dumps operationally. In our context, we know our users might not feel great about sending a full dump of memory to us.
- Daniel: The nice thing about this is that it does tie into the existing heap profile, and not yet another new type of thing.
- Bryan: About getting the data, CRIU is interesting but not something I've tried. It is part of k8s though. It's an alpha feature as of 1.26, haven't checked what it is today. You can ask for a checkpoint of a container. I assume that's analogous to a core file. → Link
- Michael P: You raise a good point about the PII stuff. A heap profile doesn't actually contain the contents of memory. In theory if we're making these artificial core files, we could exclude non-pointer data. Also agreed that it's nice to have things in the same heap profile format.
- Daniel: Dumping the heap graph could be really useful, but our prototype doesn't do that.
- Bryan: I have to think for a reasonably sized program it takes too long to STW to dump the full heap graph. You probably want some kind of virtual memory copy-on-write thing. In .NET you could get a dump of the heap graph. Just trying to get back there personally.
- Daniel: One of the nice things about this approach is that we're already doing most of the work in the GC. The overhead is about 7% slower when disabled, but 12% slower when enabled. With some more careful tuning it might be possible to get this to zero when off, and lower when enabled.
- Bryan: Yeah, if it's possible to get it cheap enough maybe that's a good path.
- Daniel: You could also get a partial graph and that could be useful.
- Rhys: Can the chain be something simpler like the type? What if we represent the chains as a graph of types? Is that small enough that that could fit in a new kind of hash table? Is that enough information to be useful?
- Michael P: I don't really know the answer to your question as to whether that's enough, but I'll note that the roots aren't really call stacks at all. The chain that it's reachable through is indeed more about types. It's unclear to me exactly what information we want.
- Rhys: The mprof hash table today has a list of call stacks as a unique key. What if this were a different sort of profile where instead of a list of call stacks, it's a list of types?
- thepudds: Almost anything here would be a big step forward. Even if you just had a number of bytes tied to some root, it can be an 80% solution.
- Michael K: I worry that the investment to make this fast would be really high. 7% represents >1 years of investment in improving the GC hot paths. But I could be wrong, there could be a way to make this fast. (Disclaimer: I have not looked closely at the patch.)
- Daniel: How hard would it be to have a GC visitor framework? Are there other use-cases? RE: double counting, I don't think that should be an issue. Doesn't the GC only mark objects once?
- Michael K: GC is inherently racy for performance reasons. It’s possible (though rare) that objects get marked twice.
- thepudds: The ideal solution has zero cost when it's off; you pay the cost only when you dump something, but then it's also more work to analyze a new kind of dump.
- Nick: It makes a lot of sense to me that the bar for performance here is really high. But it also wasn't totally clear to me whether the cost would be worth it, which is one reason why I wanted to bring it up here.
- Rhys: What is standing in the way of a process snapshotting its own memory? Aside from it being hard, is there type information that's hard to get, or something that gets in the way of a process analyzing its own memory? If we want zero overhead and not have the runtime commit to supporting new things, maybe we can have the runtime support a simple thing that lets programs do introspection into their own memory.
- Michael K: This sounds similar to what Daniel was saying about the GC visitor framework. I vaguely remember checkmark mode doing something like this.
- Bryan: The important information is mostly in internal packages, so it's a bit of the antithesis of what Go normally exports, but if it's a delve-like thing, then why not?
- Michael K: On a final note, it would be great to fix this. It's definitely a gap that Go has compared to other languages.
- Michael K: Flight recorder prototype landed. Please try it out!
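(Not from the meeting, but for anyone who wants to try the prototype: a minimal sketch of using the flight recorder from golang.org/x/exp/trace, as I understand the API at the time of writing; treat the exact method set as an assumption and check the package docs.)

```go
package main

import (
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	fr := trace.NewFlightRecorder()
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	doWork() // the interesting part of the program

	// Snapshot the last few seconds of execution trace data, e.g. when
	// something anomalous (a timeout, a slow request) is detected.
	f, err := os.Create("flight.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Fatal(err)
	}
}

func doWork() {}
```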
- Felix: Compiler regression in go1.22?
- Trying out PGO in 1.22; in a benchmark decoding protobuf data, even a plain baseline build was 4% slower on darwin/arm64.
- Michael K: I got a message on Slack about a possible regression. It was 5% faster on linux/amd64 and 5% slower on darwin/arm64. I've been meaning to come back to that. Nothing really stood out in the profile. I was also looking at the performance dashboard and saw this yesterday. Looks like the regression came from August 2023, which was mostly compiler CLs.
- Michael P: You're probably already aware of this, but in case you're not: if you're benchmarking a tight loop, it might turn out to be a code alignment thing. If it's protobuf, the code is probably big enough that that's unlikely.
- Felix: I'll double-check. I'll also double check to see exactly what platforms the regression happens on.
Comment From: mknyszek
2024-01-18 Sync
Attendees: @prattmic @mknyszek @rhysh @bboreham @felixge @nsrip-dd @thepudds
- Felix: How can callers of metrics.Read understand whether a metric value is the same because it has not changed since the last read, or because it has not been updated yet (e.g. by the GC)? I.e., how can one avoid computing incorrect /cpu/classes/ deltas? (A sketch of one workaround follows this discussion.)
- Felix: Considering using GC count metrics, but it doesn't feel right.
- Michael K: We should just fix this. The issue is non-monotonicity of some metrics (that are computed as subtractions) because other metrics are computed only once per GC cycle.
- Rhys: You can use idle time as an in-band proxy for whether anything changed.
- Felix: How would I interpret an increase in the number of GC cycles but no change in stats?
- Rhys: It could be a system with very broken timers.
- Felix: Not urgent, just wanted to bring it to this group's attention. Wanted to avoid manipulating metrics from the runtime before they reach the user because they're harder to interpret.
- Michael K: I mostly added the gc metrics to replace GCCPUFraction.
- Felix: Yeah, one problem with this metric was also that it counted the GC fraction since the program started.
- Michael P: There might be a race between GC count and the CPU metrics. If a GC STW happens in between while the metrics are being read, they might be inconsistent.
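(Not from the meeting: a concrete version of the workaround discussed above, only trusting /cpu/classes/ deltas when the GC cycle count has advanced. The metric names are the ones I believe exist in current releases; verify them against metrics.All().)

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
	}
	var lastCycles uint64
	var lastGCCPU float64
	for range time.Tick(15 * time.Second) {
		metrics.Read(samples)
		cycles := samples[0].Value.Uint64()
		gcCPU := samples[1].Value.Float64()
		if cycles == lastCycles {
			// The /cpu/classes/ estimates are only refreshed once per GC
			// cycle, so a delta here would be misleading; skip it.
			continue
		}
		fmt.Printf("GC CPU: %.3fs over %d cycle(s)\n", gcCPU-lastGCCPU, cycles-lastCycles)
		lastCycles, lastGCCPU = cycles, gcCPU
	}
}
```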
- Felix: testing: writeProfiles is not called after panic #65129
- Felix: We were debugging a test hang in CI, but we had a test that panicked, which made it hard to get a trace out.
- Michael K: We could call traceAdvance on a fatal panic, and then the trace tool would have to understand that it's OK to have garbage data at the tail of the trace.
- Michael P: The testing package already has a cleanup function that could stop acquiring the execution trace and other diagnostics. It's weird that it doesn't.
- Felix: Do we see any good reasons for not doing this in the defers that are already in place? My only concern is what would happen if the panic was so bad that these functions would hang. But I think that problem already exists, so we wouldn't be creating a new problem.
- Rhys: Complex proposal: we spoke months ago about how great it would be if we could extract an execution trace from a core dump. Is that a thing that we want to do? We could have the testing package be the thing that uses it. Flushing traces still has a problem because the data is being fed through an io.Writer, which has no strong way to fully flush, especially when plumbed across many goroutines.
- Michael P: Would like to have this, but I'm more skeptical about the testing package using this. OSes can make it difficult to enable core files. On Linux there's no other way to request a core file except for changing the global core pattern.
- Felix: Worried about being able to do this in CI environments. It might be really hard to set this up in, for example, GitHub Actions. Just looking up how to get core dumps on macOS seems complicated.
- Rhys: I wasn't necessarily thinking of a core file, but the parent process can attach in the same way a debugger might. Just like ptrace, it would stop the process before it fully crashed. I imagine this would be more portable.
- Michael K: I think ptrace is going to be less portable probably; delve is only available on two platforms currently and that's part of the reason why. Also, if we export a core dump file via WriteHeapDump, then we can have a portable way to dump this data. If flight recording is enabled, we can even lift the trace data directly from the flight recorder buffers, which we know will be whole.
- Michael P: Shout-out that one of our internal users was noticing runtime lock contention in their profiles and I was able to figure out what was going on. Thank you to Rhys for contributing runtime lock contention to the mutex profiles!
- Michael K: We're gonna be shifting gears a little bit toward efficiency and performance analysis this year. Before that happens, I want to get some trace cleanup done, and specifically the last big chunk of work is getting the old trace parser behind the new API. If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know! Dominik offered to pitch in a little bit.
- Felix: If I can find some time I'll see if that's something I can help with. I'll also talk to Dominik.
- Felix: Update: have not confirmed the compiler regression from 2 weeks ago. I'll ping someone on Slack if I find anything. I also found a fun PGO thing where the average goroutine stack size went up 30% after enabling PGO.
- Michael P: You can look at the assembly to check stack frame sizes, but there's a compiler bisect tool that can be used to identify which decisions the compiler made actually resulted in the observed change. It was created for the loop variable scope change. It's called something like x/tools/cmd/bisect.
- Felix: The challenge would be to write a reproducer, since this is a big application that's using a very large number of goroutines.
- Michael P: If there's a common stack, then you could look at those frames specifically. I can send instructions on how to find out what's inlined due to PGO.
- Felix: Sometimes I see a different stack trace from PGO binaries because of inlined frame attribution differences.
- Michael P: I think what you're referring to is if a function calls a closure which gets inlined, its name gets more complicated and confusing. Otherwise it shouldn't really impact things.
- Felix: It does make profiles harder to compare.
- Michael P: Matthew Dempsky has been thinking about how to adjust these names.
Comment From: dominikh
> If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know!
I don't have ideas regarding overall design yet. However, since there'll be no way around reading old traces entirely into memory to sort them, I'd like to suggest using the trace parser of Gotraceui. It started out as a copy of `go tool trace`'s parser, but has been optimized to use less memory, put no pressure on the GC, and it uses a smarter way of merging batches that lowers CPU time. It's how Gotraceui is able to load traces in a fraction of the time and memory needed by the old `go tool trace`.
It's at https://github.com/dominikh/gotraceui/tree/8fbc7cfaeb3cebed8890efbc030a62a7f1ff3f81/trace — since it started as a direct copy of Go's parser, the git log for the folder should show all relevant changes, and includes some benchmarks.
There are some changes that might have to be reverted if we want to support very old traces. My parser dropped support for Go 1.10 and older, and it doesn't handle EvFutileWakeup events, which haven't been a thing since Nov 2015.
The only change that might not be great is 2c5675443eebc969ee07cd4f2063c2d7476f7b9b which removes support for trace formats older than Go 1.11. That change could be reverted if need be, but traces that old wouldn't benefit from my improvements to batch merging.
Edit: I've nearly completed an implementation of the conversion.
Comment From: mknyszek
2024-02-01 Sync
Attendees: @prattmic @felixge @dominikh @nsrip-dd @rhysh @bboreham @mknyszek
- Felix: RE: Parsing old traces in the new API. I see that Dominik has implemented it. We've tried it and it seems to work really well.
- Michael K: On my TODO list to review, I will get to it soon. And thank you again to Dominik for working on that!
- Felix: RE: Panics in tests cause broken execution traces.
- Got reviews. Thanks!
- Felix: RE: Tried PGO, found increase in stack memory usage.
- Have a prototype tool that takes a goroutine profile and a binary and produces a plausible stack frame size profile. (A simpler runtime/metrics-based check is sketched after this discussion.)
- Worth a proposal as a profile type?
- Michael P: Could be another sample type in the goroutine profile, possibly. Your proposed way of estimating the stack size should be pretty close. It probably omits some wrapper functions that add a few bytes. If we do make this a real profile type, we can put in the real stack size. The runtime has to look at the goroutine anyway.
- Rhys: How common is it for goroutines to idle with a small number of stack frames, but grow their stacks pretty large transiently? That could hide stack space with this way of estimating the stack size. RE: expanding the goroutine profile, could we add extra little debug things to execution traces to get this data? For example, when stack growths happen.
- Felix: Lots of good ideas. RE: the concern about goroutines that grow and then shrink again, are you worried about capturing the HWM of stack size?
- Rhys:
- Felix: Wouldn't this be the same issue with a live heap profile?
- Rhys: I think there's a bound on the live heap.
- Michael P: I think there are two types of growing being discussed here. One is how much of the stack is being used, and the other is how much stack space is allocated. Rhys's case suggests that stacks shouldn't shrink too often, since we need a full GC to shrink. I think the case where the runtime directly records the allocated size handles that.
- Michael K: I think Felix's approach would still work well if you take many samples. Eventually you'd catch the spikes. You're going into the problem already knowing your stack memory usage is high.
- Felix: Another reason this would work well in practice is that you only care about it when you have a lot of goroutines, because that's when you have a lot of stack space.
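(Not from the meeting, and not the binary-analysis approach Felix describes, but a cheap first look at the same kind of regression: runtime/metrics exposes the runtime's own accounting of stack memory. The metric name below is the one I believe is current; check it against metrics.All().)

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

func main() {
	s := []metrics.Sample{{Name: "/memory/classes/heap/stacks:bytes"}}
	metrics.Read(s)
	stackBytes := s[0].Value.Uint64()
	n := runtime.NumGoroutine()
	fmt.Printf("%d goroutines, %d KiB of stack space (%.1f KiB average)\n",
		n, stackBytes/1024, float64(stackBytes)/1024/float64(n))
}
```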
- Michael K: Flush generation on panic and accept broken trace tails
- Michael P: In theory if you have a core dump, it shouldn't be too difficult to simulate what traceAdvance does, and finish up the trace. That way you can fully extract the trace out of the core dump.
- Rhys: Does this only work well if the trace is being directed to an os.File, or does it work well in the general case? What happens when the io.Writer goroutine needs to execute?
- Michael P: When a panic starts we freeze the world. RE: what Michael was talking about, we were thinking more about
- Michael K: I was thinking more about crashes as somewhat isolated, so that most of the application could still continue running. Maybe that's wrong.
- Michael P: We could investigate that, but we'd have to not freeze the world.
- Michael P: Just merged SetCrashOutput; it lets you specify an additional FD to write the panic string to, in addition to stderr.
- You could imagine the flight recorder has a similar API that writes out to an os.File, which will be a lot more robust than a general io.Writer.
- Michael K: Still need to look into the slight regressions we've seen due to recent diagnostics work. It's only been microbenchmarks, so we expect this to not actually impact real applications. Rolling out at a bigger scale hasn't shown any significant regressions. See https://perf.golang.org/dashboard/?benchmark=regressions&unit=&repository=go&branch=master&days=180&end=2024-02-01T16%3A37 for the benchmark list we have.
- Rhys: Did a bit of investigation as well: https://github.com/golang/go/issues/64455.
Comment From: mknyszek
2024-02-15 Sync
Attendees: @nsrip-dd @rhysh @mknyszek @bboreham @felixge @prattmic @dashpole
- Sysmon Preemption
- Michael P: Felix's patch came up when somebody reported a go1.22 regression that might have been caused by preemption. https://go.dev/issue/65647
- Felix: Came up on my end this week as well. I found the preemptions useful to debug OS latency, seeing 500ms goroutine time slices in the execution trace gave me confidence that OS scheduling latency was high.
- Schedviz tool: Michael P started a prototype to combine kernel scheduler traces and Go execution traces, effectively adding a CPU dimension to the execution trace.
- Michael K: Default metrics set for OpenTelemetry
- Following up on a previous discussion.
- Michael K: Exporting all metrics might be too much. I wrote up a proposal for an API to give a default set of metrics that are widely useful. Michael P convinced me that this should maybe start as documentation rather than an API.
- Michael K: API would take a go language version and give back a set of metrics. If you ask for go1.17, you get the defaults for that version.
func Default(toolchain string) ([]Description, error)
- Bryan: There is a Prometheus client_golang issue (link?). Maybe we (Grafana) could dig up information on which Go metrics are collected by our customers in the cloud via Prometheus. We can scan dashboards and queries over 30 days and report on this in aggregate.
- Michael K: That would be useful to have. I came up with a default set in my doc, but I’m not super happy with it. I could share the doc, but feels like it might get deprecated right away.
- Michael K: The default set would have strict compatibility guarantees. Anything in the set would be supported effectively forever. You could unconditionally rely on those metrics to still exist (but they might become 0).
- Rhys: Questions about RSS, VSS, why heap profile is ½ of RSS come up often. Maybe the GC guide could address some of this. I don’t understand why the API to get good metrics needs to be in the core rather than a 3rd party thing.
- Michael K: Yeah, could also go into /x/ as a middle ground.
- Rhys: I don’t know what it means for a metric to be a default. Different apps have different needs. This is also coming up b/c metrics are expensive. They are small relative to heap profiles which many folks collect.
- Michael K: I had the same thoughts – why is this a problem at all.
- Bryan: We charge $8 / month for a thousand time series. I think runtime/metrics has ~100 metrics, that would be 80 cents per month. This is multiplied by the number of processes you run.
- Felix: Usually what's expensive is metrics with tags with high cardinality. runtime/metrics are typically exceptions to this because they're process-global. I think it's cheap-ish, but different people have different opinions.
- Rhys: I agree that putting them as time series in practice gets expensive, but there's a lot of data we can get out of the runtime that we wouldn't turn into a time series. Like heap profiles and execution traces. In practice I'd like to have all the data sitting around.
- David: Our motivation is primarily around stability. Unless it's very very expensive, and it sounds like it's not, that's probably not much of a concern for us. We would like to be able to say "here's 1.0, we won't break you."
- Bryan: In Prometheus there will be 3 options: the 2017 set, all of them, and a recommended set. Aren't heap profiles just as large? (E.g. you have 100s of stacks and frames; if you have a bunch of heap profiles, isn't that the same?)
- Rhys: I'm not sure we're looking at this the same way. For stability, could we have a string translation API?
- David: That seems OK. The library would hide this translation from users. The harder question is: is there a set of metrics that we think will continue to be useful in Go 1.30?
- Rhys: I would expect most runtime/metrics to not change. Recently, the STW metric changed.
- Michael P: I see where you're coming from David. I think to Rhys' point that most metrics don't change. But there are also details about, for example, memory classes. New classes can get added, and some might not make sense as the GC changes.
- Felix: I think the argument of stability resonates with me. I recently built a dashboard that breaks down memory and it could break. I also think I got what Bryan was saying earlier: you can see heap profiles similarly to metrics, but the profile blob has different performance characteristics. Also, they tend to get taken less frequently.
- David: It's also good to have some space for metrics to be introduced that are in development.
- Michael K: Shall we move forward with documenting metrics? Would this work for you David?
- David: What this would look like is we would have a documented set that depends on Go's documented set. This would be fine to unblock us. Going forward, it would be good to have a way to make this more maintainable and detect breakage from the Go side.
- Felix: Last I looked it seemed like the Go metrics support in OTel was incomplete. What's the plan?
- David: We plan to throw away what we have and start from scratch. We have two approaches in progress: one would be to have a stable set, and the other would be to have some configuration so that users that want different metrics or more metrics don't have to wait for us.
- Felix: What's the best way to follow along this stream of work?
- David: There is an issue on the semantic conventions repo: https://github.com/open-telemetry/semantic-conventions/issues/535
- David: There are also WIP discussions/proposals in the Go sig: runtime instrumentation replacement proposal
- OpenTelemetry OTEP 239 (pprofextended format proposed as new profiling signal format) - Felix Geisendörfer
- I wanted to call some attention to this and it's in the last call stage. It will get merged soon.
- At this point the spec could still be changed, but the big picture will probably stay the same.
- pprofextended: https://github.com/open-telemetry/opentelemetry-proto-profile/compare/2cf711b3cfcc1edd4e46f9b82d19d016d6d0aa2a...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2
- Michael K: I'll bring it up to the team.
- Michael P: Alexey has been following this, which gives me confidence that there will be some alignment.
- Felix: Yeah, he basically co-designed this with Dmitri from Grafana. But AFAIK he was not the one that could make the call on what could ultimately happen with the pprof project. It's more about whether OTel determines pprof's future.
- Michael P: Yeah, it seems like the question is how they're going to be kept in sync (or not).
- Felix: Divergence would be highly undesirable because great pains were taken to make pprofextended to maintain compatibility.
- Felix: Deterministic Testing for Go
- Felix: A colleague approached me with questions on how to make the Go runtime more deterministic for testing.
- Michael K: We have considered adding scheduler hooks or more deterministic scheduling.
- Michael P: https://github.com/golang/go/issues/54475
- Rhys: There are some places where the Go project intentionally randomizes.
- Felix: I think my colleague was actually trying to randomize but with a fixed seed.
- Michael K: Race mode already randomizes a whole bunch of things. You could imagine setting the seed for that.
- Felix: A seed that is used for everything would be interesting.
- Michael P: Some of the randomness isn't explicit, like two threads racing for the same work. It needs explicit support to become deterministic.
- Rhys: You could imagine a record-replay type system.
- Michael K: It occurs to me that an execution trace might have the information you need for that already. It's not designed for it, but it might work.
- Michael P: That's interesting, it might work. There will still be some weird things you need to do, for example to make an async preemption land at exactly some instruction.
Comment From: mknyszek
2024-02-29 Sync
Attendees: @mknyszek @prattmic @bboreham @rhysh @nsrip-dd @dashpole Arun (from DataDog)
- Deterministic simulations in Go – Arun (from Datadog)
- Arun: Testing distributed systems. We have used formal methods like TLA+ and lightweight simulation like SimPy, but they have problems. The goal is to inject failures and run the runtime in a deterministic way. FoundationDB is an example of applying the determinism concept.
- Arun: The Go runtime doesn't provide determinism. Goroutine scheduling, map iteration order, channel selects, etc.
- Arun: Exploring trying to side-step making Go deterministic. Using Hermit from Facebook. Containerizes a program and intercepts syscalls.
- Michael P: Interesting idea; we (the runtime team) have discussed testing runtime internals with more deterministic scheduling. For you, this is more about user applications, which seems a bit easier to solve. The Go runtime could run everything on a single thread and pick a specific goroutine execution order. It wouldn't be a small feature in the runtime, but it seems feasible.
- Arun: Could we have a seed that captures the order in which the scheduler decides to do things?
- Michael P: It comes down to the implementation not taking determinism into account. There's a lot of intentional racing and time measurement that is inherent to the scheduler's implementation. At a baseline we'd need only cooperative preemption, and we'd need to enforce that we always preempt at the same time. Limiting to a single thread helps.
- Michael P: Question about Hermit: sounds promising, but how does it deal with the same sort of scheduling issues? Like Linux scheduling determinism? Does it deal with that at all?
- Arun: Not sure, but when testing by spinning up multiple goroutines, Hermit does actually seem to run them in the same order.
- Michael P: While I say that it feels implementable in Go, a third party project is probably better to invest in. For example, the advantage of also running your C code (in cgo cases) deterministically.
- Arun: cgo doesn't seem to work. Hermit seems to crash on a vfork call in the runtime. Can the Go runtime do anything about C code?
- Michael P: Not really.
- Bryan: I saw the blog post from Antithesis, but both then and today I'm missing something. I get that determinism is useful in trying to track down a problem, but I don't get how taking away that problem is enough for root cause analysis.
- Arun: Determinism solves part of the issue, but test cases are needed as well, for testing distributed systems.
- Michael K: Is the missing link record-replay?
- Bryan: That would do it.
- Michael P: You could fuzz the scheduler order, and use that as a regression test. Have these projects done anything to solve perturbation of deterministic orders due to source changes? That seems like a substantial problem. Does the determinism need to be defined at a higher level?
- Arun: Not aware of anyone who's solved that problem. You can definitely at least capture higher-level operations, like RPCs, and record and replay those.
- Rhys: You mentioned you haven't been able to check the model against real code execution, but I've heard of groups that run a simplified version of their model as assertions. Agree with Michael P that if determinism is set at too low a level, little changes will throw off regression tests.
- Arun: There's a Rust library that lets you run model checking against your implementation, but I'm not sure if we'll still run into challenges with determinism in Go.
- Michael K: Bringing back up idea from two weeks ago about traces for this.
- Michael P: The main missing feature though is still that you don't know how long to run before yielding, and that's not in the trace.
- Michael K: True, you'd need a function call trace or something. Didn't Jane Street have a thing for function call tracing?
- Michael P: Yeah, it uses Intel PT.
- Michael K: With the disclaimer that I haven't read your doc yet, I think we'd be open to a prototype. We would like to be able to use this ourselves anyway.
- Michael P: Logistically, it would be interesting with respect to the faketime implementation, in terms of making the Go playground more deterministic.
- Michael P: https://go.dev/issue/54475 scheduler testing issue
- Bryan B: Topic: profiling maps
- Idea comes from watching Matt Kulukundis' talk on Swiss Tables
- Some portion of every map is randomly chosen for sampling
- Fetch the size, capacity, few other things.
- thepudds: There was a recent issue about map and slice presizing with PGO.
- Nick: The heap profiler today does capture map growth, but somebody made the decision to hide runtime functions at the tail end of the call stack.
- Bryan: It's hard to figure out if the problem is lots of small maps, or a few big maps. Something explicit for maps would take away a lot of the guesswork.
- Felix: It would be interesting to profile channels as well. With traces we've gained significant capabilities for figuring out when channels are contended, but we don't really know why we block on a channel, or whether the channel's capacity is an issue. "How full are my channels" would be another interesting profile (a rough user-space stand-in is sketched after this discussion).
- Michael K: You could imagine that in the execution trace we just dump more information when blocking on channels.
- Felix: It's really useful to know when you're not blocked yet. Or never will be (e.g. when using select to send).
- Rhys: Knowing when you're not blocked in the execution trace would be really interesting. Sometimes you don't get to see that two goroutines don't have an interaction at all.
- Michael K: You could imagine extra trace events that log every channel operation.
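(Not from the meeting: purely a user-space stand-in for the "how full are my channels" idea above, assuming you can enumerate the channels you care about yourself. The names and sampling scheme are illustrative.)

```go
package chanwatch

import "time"

// Watch samples a buffered channel's fullness once per interval and reports
// the len/cap ratio to a caller-provided sink (a metrics library, a log, ...).
func Watch[T any](name string, ch chan T, interval time.Duration, report func(name string, fullness float64)) {
	if cap(ch) == 0 {
		return // unbuffered channels have no meaningful fullness
	}
	go func() {
		for range time.Tick(interval) {
			report(name, float64(len(ch))/float64(cap(ch)))
		}
	}()
}
```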
- thepudds: Escape analysis could be exploited further to drop old slice backing stores.
- Michael K: Definitely something we're thinking about, just reusing memory in general. TODO: Add in the rest.
- thepudds: Getting the runtime and compiler to collaborate on this seems like it would be the most beneficial.
- Michael K: +1.
- Michael K: Have not gotten around to creating a proposal for a documented set of recommended metrics, I'll try to get on that.
- David: Reached out to someone on the dashboards team internally to try and figure out what metrics from Prometheus people are actually using. Reached out to other people in the OTel group to get similar information from other companies.
- Felix: We can see what we can get. Customers tend to just use MemStats, so the analysis will be harder for us. We are internally using runtime/metrics though. Did an analysis of MemStats and I can at least tell you what's definitely not being used there.
- Michael K: Also an update on more goroutines in traces: I sent a CL out. There's one case left, but it's unfortunate we're playing whack-a-mole. We should perhaps do something more principled.
Comment From: mknyszek
2024-03-14 Sync
Attendees: @mknyszek @prattmic @felixge @dashpole @nsrip-dd @rhysh
- Nick: Checking out viewcore lately. Wondering if anybody has any experience with something like viewcore but with a UI that we can learn from.
- Felix: Maybe JMC (Java Mission Control) has something?
- Nick: Yeah, it's called "VisualVM", which has a ton of stuff, but I haven't tried it myself yet.
- Nick: I was able to recreate what my colleague did with root tracking with tracking a heap profile from a core dump.
- Michael P: One thing we wanted to do is to make the internals of viewcore a nice public API. We haven't gotten to it yet. We have to decide what layer that API operates at. RE: extracting a heap profile, is that something the API should be able to express, or .
- Michael K: I think the API might need to know about that. I was thinking that you could also grab a trace out.
- Michael P: You would effectively do traceAdvance from outside of the running program. I don't think it would be too difficult to do.
- Rhys: Seems useful to think of it in terms of "implement an API similar to what the runtime does, for accessing the runtime's internal datastructures".
- Michael K: I'm pretty sure viewcore is still broken for Go 1.22 (sorry…).
- Nick: Yeah, I was using Go 1.20 and it seems to work. I can try to fix it, but no promises.
- Michael K: Let me know if you need any help. I'm pretty sure Michael P tried and the main issue he ran into was that it tries to collect all GC metadata up-front. It would be simpler to type the heap up-front.
- Michael P: It's also really crufty. It would definitely be less crufty if we could delete old codepaths.
- Michael K: Been a bit of a struggle to prioritize viewcore.
- Nick: Maybe what it needs is a strong use-case?
- Felix: Would there be reviewer bandwidth for viewcore?
- Michael K: Yeah definitely. I really should just fix Go 1.22. But the main work here is that someone needs to sit down and think about the API surface.
- Michael P: We were also thinking of landing it in x/exp so it's not final the moment it lands. That way we can gather feedback on the API.
- Michael K: Ran into some issues with the tracer trying it out on some big services. The issues seem old.
- Rhys: I've run the execution tracer many many times for 1 second. It usually works. There's still a low rate of not being able to create a good trace (1.20 era). I haven't seen anything that makes me think there's O(n^2) behavior in the old tracer.
- Felix: We have used the tracer similar to Rhys, 5 MiB or 60 seconds, whichever comes first. We also haven't tested with Go 1.22 much yet.
- Michael K: Also dumping stacks was non-preemptible, that might be new. I have a fix out for that.
- Rhys: Is there an interaction also with CPU profiling?
- Nick: Asynchronous preemption can introduce more unique stacks, too.
- Michael P: One more unique stack case is that Google does a lot of protobufs, so you end up with lots of slight variations.
- Michael K: FYI, blog post going out probably today (https://go.dev/cl/570603).
- Michael K: Haven't made progress on default metrics, but I should just do that. Does it need a proposal?
- Michael P: Technically not, but we are kinda making stronger guarantees.
- Rhys: Would a test be sufficient?
- Michael K: I think so?
- Rhys: You could imagine viewcore is also able to extract metrics and validate them. That could be the test. You could also look at all the Ms and see if you're in the runtime on any of them, and refuse to proceed if they are.
- Michael P: I was thinking the same thing, specifically with holding locks.
- Michael K: So… what's next for diagnostics? Just brainstorming. (We already talked about viewcore, which is one thing.)
- Felix: Known papercuts with stack trace size. Relatively low stack depth specifically. In terms of novel things I still intend to create a proposal for a stack profiler. viewcore and anything to help with memory leaks and OOM kills is going to be very interesting going forward.
- Rhys: I'll agree that OOM kills are an underserved need. I have unfinished business with the execution tracer, but TBA. Perhaps with post-hoc analysis.
- Michael K: On my end, I intend to add new event types to the tracer for runtime work. Another new use-case that came up recently was possibly adding timer-related events.
- Michael P: Perfetto uses SQL queries to do post-hoc analysis.
- Felix: At some point we discussed the fact that different profile types are overlapping with each other. We discussed having an API for choosing what goes into the trace.
- Michael K: Definitely on my mind.
- Rhys:
- Michael P: Post-hoc analyses seem potentially fragile since they depend on implementation details (for example, some net functionality has to actually enter a syscall to see a stack trace in the trace itself).
- Felix: Would people here be interested in trying out something SQL based for execution traces?
- Michael P: It seems like in theory you should be able to do it already by importing JSON traces into Perfetto and running SQL queries, though you lose some information. Rhys, you said it's difficult to decide when a connection was created?
- Rhys: There can be in-bound keep-alive connections and there isn't
Comment From: mknyszek
2024-03-28 Sync
Attendees: @felixge @mknyszek @prattmic @nsrip-dd @rhysh @thepudds
- Michael K: Default metrics proposal.
- Felix: Being able to discover metrics programmatically is nice, but I'm concerned about guaranteeing metrics exist and return zero.
- Michael K: I see what you're saying. Returning zero is supposed to be a last resort.
- Felix: I think
- thepudds: Zero is well-defined / equivalent to “no data” for some metrics like throughput or count, but it's annoying downstream to have zeroes show up meaning “no data” for other types of metrics (e.g., an average latency).
- Michael K: That's fair.
- Felix: Something I would like to ensure is that the totals always correctly represent the total of whatever "subdirectory" they represent.
- Rhys: I still don't totally understand this need. Having a single field that's just a boolean seems too simplistic. What about tags? Also concerned about the battles waged over which metrics should be default or not. Things that the Go team says matter a lot, in strange ways.
- Michael K: The Default boolean felt incremental enough that it would be OK.
- Bryan: I find the heap goal metric really useful but I noticed it was cut.
- Michael K: Yeah, I might go back on that.
- thepudds: One thing that would go a long way would be to take the godebug list out of the main list, since it tends to clutter it. Being able to just get one metric is useful too, and it's fairly annoying today. Perhaps an example modification could help here. Also, I noticed blog posts are still using runtime.ReadMemStats.
- Michael K: runtime.ReadMemStats still has the advantage of being able to be used for measuring individual allocations, because the runtime metrics don’t flush the needed stats.
- thepudds: maybe a new runtime/metrics example calling runtime.GC and measuring individual allocations? (Calling runtime.GC might be what the testing package does as part of reporting allocations; not sure).
- Michael P: On the topic of godebug metrics and the proposal, I'm not sure
- Michael K: I'd understood the godebug metrics live indefinitely.
- thepudds, Michael P: Nope, 2 years for many GODEBUGs. (Some GODEBUGs might be supported longer or even indefinitely).
- thepudds: I don’t know if there’s been an explicit statement about what happens to a ‘/godebug/…’ metric if the associated GODEBUG ages out.
- Michael K: Summed up the changes. Are we OK with programmatic access?
- Michael P: I think programmatic access is OK, but I worry that the "Default" field is a bit too binary and restrictive.
- Rhys: The MemStats struct kinda was the default, and was simultaneously too big and too small for many years.
- Felix: Another thought on the Default boolean flag: what are the chances of new metrics becoming default? I suspect it would be low, and the main utility of this flag for me would be discovering new metrics.
- Michael K: All good points.
- Felix G:
- Increase stack depth for mem/block/mutex to 128 (issue, CL)
- Michael K: Clever. I'm supportive.
- Michael P: Also supportive. I know that changing the [32]uintptr to a slice does theoretically break compatibility, but maybe nobody is relying on that.
- Felix: My concern is using the record as a map key (see the illustration after this discussion). I went out of my way to make sure that works.
- Michael P: We can check the open source ecosystem to see if anybody's doing that.
- Rhys: I'm paranoid about the hidden slice containing different information from the PC array if someone modifies it.
- Felix: Currently that might get confusing, yeah. An alternative could be to have a "stack remainder" instead of having the hidden slice contain the full stack, since that would solve that problem. Also, the Stack method always returns a copy.
- Michael K: It's not clear to me that changing the API would help that much, and this wouldn't have to go through the proposal process.
- Michael P: I don't have a concrete alternative, but the API is also very weird. Maybe not that weird, since the array is already called Stack0.
- Felix: If we do change our mind if we want to change the array to a slice in the future, we could still do that.
- Rhys: This might affect CPU profiles, but I see this isn't for CPU profiles.
- Felix: Yeah, it's not. I think it's OK to generally increase it because the memory use increase only affects those with truncated stacks, who I suspect are willing to pay the cost given how annoying it is to deal with.
- Rhys: How far are we from being able to set this with GODEBUG?
- Felix: I haven't tried yet, but I think this is pretty easy to wire through a GODEBUG.
- Rhys: It couldn't be a constant if it's a GODEBUG.
- Felix: We can hash it out in the CL.
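(Not from the meeting: to make the compatibility concern above concrete, and not as a recommendation. MemProfileRecord is a comparable struct today because Stack0 is a fixed [32]uintptr array, so code like the following compiles, and it is exactly the sort of use that turning the stack into a slice field would break.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	recs := make([]runtime.MemProfileRecord, 512)
	n, ok := runtime.MemProfile(recs, true)
	if !ok {
		// Not enough room; grow and retry once.
		recs = make([]runtime.MemProfileRecord, n+64)
		n, _ = runtime.MemProfile(recs, true)
	}
	// Using the record value itself as a map key relies on the struct
	// (including Stack0) being comparable.
	seen := make(map[runtime.MemProfileRecord]bool)
	for _, r := range recs[:n] {
		seen[r] = true
	}
	fmt.Println("records:", n, "unique:", len(seen))
}
```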
- Stack memory profiler proposal (issue, CL)
- Reduce pprof label overhead (CL)
- Felix: This is fun. The idea is to represent the labels as a LabelSet directly internally and to merge LabelSet lists with a merge sort. It turns out to be a lot faster. There are some applications that set this a lot and it costs a lot of CPU.
- Rhys: I'm a fan of setting them eagerly and often for requests.
- Felix: The only remaining overhead in terms of allocations is it's not possible to set goroutine labels without creating a new context, which forces an allocation. We'd need a different API. I would first try this CL.
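(Not from the meeting, but for context on where that label-setting overhead shows up: the common pattern is wrapping each request with pprof.Do, so every profile sample taken while the handler runs carries the labels. The label keys here are examples.)

```go
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

func handler(w http.ResponseWriter, r *http.Request) {
	labels := pprof.Labels("endpoint", r.URL.Path, "method", r.Method)
	pprof.Do(r.Context(), labels, func(ctx context.Context) {
		serve(ctx, w, r) // all samples during this call carry the labels
	})
}

func serve(ctx context.Context, w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```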
- Michael K: Annotating heap profiles further: GC cost proxy (scannable bytes * GCs survived as a counter per stack) and precise type information (as a tag, will split stack samples in the format, probably not a lot unless there's a lot of very dynamic reflection).
- thepudds: Do we have precise type information for all size classes?
- Michael K: Yeah, at the point of allocation. We can write that down.
- Felix: heapage
Comment From: mknyszek
2024-04-11 Sync
Attendees: @nsrip-dd @rhysh @bboreham @cagedmantis @dominikh @felixge @mknyszek
- Michael K: Updated recommended metrics proposal.
- Please take a look. Might be two proposals in one, and might have to break it out. If anyone has ideas on other possible future tags we might want to add to metrics, please comment on the doc! Thanks.
- Rhys: Open-sourced the tracing and profiling tools I've been using for ~6 years.
- autoprof: produces, tags, and zips up profiling data for collecting at scale. Nice for teams that don't have existing continuous profiling infrastructure. Goal is to make profiling less "one-off."
- go-tracing-toolbox: a collection of tools for poking at execution traces. etgrep allows grepping trace data in all sorts of fun ways. Other tools do things like take the CPU profile samples that happen between two events and build up an idea of what is happening between those events.
- Felix: CL for getting deeper stack traces.
- Backstory: Investigating service using GOMEMLIMIT and we noticed that heap profiles were getting truncated.
- From last time: dealing with public API being limited to 32 frames. Went with Rhys' suggestion to hide the extra frames inside the struct.
- Tried to make the max depth configurable at runtime, had all of it working, but it's probably overkill. What we could do instead is just make it configurable at startup via GODEBUG.
- Dominik: On the note of stacks, my biggest problem with the truncation in the context of tracing is not knowing that truncation happened, or how many levels got truncated.
- Michael K: Yeah, we can always do the full unwinding and add a dummy frame indicating how many frames were lost.
- Michael K: Might be worth pooling the stack frame buffers.
- Rhys: How to decide when to attach to the M or the P.
- Michael K: Generally we try to prefer Ps because there's a fixed number and usually fewer in number than Ms. But diagnostics stuff is moving to the Ms because we want visibility into what the runtime is doing. Ps are generally about user code executing.
- Felix: Adding a new buffer of PCs to the M. I noticed there was already one there and considered just using it.
- Rhys: Yes, adding a new stack to the mprof hash map involves taking a lock, so I think it's hazardous to use that buffer.
- Michael K: Core dumps!
- Would love to get this done but need help. I'll try to get Go 1.22 and Go 1.23 working after the freeze (6 weeks from now).
- Bryan: I can help. I've been trying to get an answer to questions I have about a service and what memory is being kept live. It uses a very specific data structure that stays live for a long time, and my hypothesis is that it's referenced from a slice in a pool. Manually cleaning out elements in pools still didn't fix it.
- Rhys: Does viewcore integrate with heap profile data? Such that if you get lucky, it'll tell you the callstack that led to an allocation.
- Nick: Yup, I did this. I can send a patch, but the first order problem is unbreaking viewcore.
- Felix: The service I mentioned earlier, it was really about tracking down live memory.
- Bryan: viewcore makes assumptions that are only correct for x86 (didn't work on macOS). (Frame pointers.)
- Nick: I know that it only has logic for ELF core dumps, so it might be that macOS core dumps just aren't supported.
- Bryan: That seemed to work.
- Nick: I'm happy to help too!
Comment From: mknyszek
2024-04-25 Sync
Attendees: @mknyszek @felixge @dashpole @rhysh @prattmic @nsrip-dd
- Rhys: Mutex Contention Patches
- Freeze is in 4 weeks.
- Working on runtime mutex contention patches. Struggling with testing. Don’t want the tests to be flaky.
- Michael K: Making the mutex bigger is probably fine, especially if it doesn't bump a size class.
- Rhys: That might make it simpler. Trade-off in code complexity vs. performance costs?
- Michael P: It's subjective.
- Felix:
- OpenTelemetry for Go Profiling / Metrics / Traces … ? (gh issue) (quote: “Could we integrate into the new “hot thing”, OpenTelemetry?”)
- The idea would be something like having the runtime (or runtime/metrics), via an environment variable, export data via OTLP or some vendor-neutral protocol to an endpoint.
- Michael P: I'm not sure how this is very different from a package you import.
- Felix: A big barrier to adopting observability, specifically adopting it across a large number of repositories, is that users don't want to go and make a change to, for example, a package.
- Rhys: Could we make importing packages easier as an alternative?
- Felix: We're looking into that, but lots of organizations are dysfunctional in that way where adding an environment variable is much easier than modifying code across the org. It's a second-best option.
- Michael P: We can't pull gRPC into the runtime, for example, but marshalling a few protos manually is probably OK.
- Rhys: Is there something simpler and file-based we could do?
- Felix: I think that could be a middle-ground, since something else on the host could ship them around.
- Rhys: Go's profiles are really good for understanding Go programs, and execution traces are unusual, but good for Go programs. OpenTelemetry is applicable widely, but not designed for Go or Go's strengths. That makes the idea of having something pick up the Go-specific files and do the conversion or pass them on seem better.
- Michael P: Worried a bit about folks being upset about binary size if every Go binary includes net/http. There may be workarounds.
- Felix: That makes sense. Any standard interface would be nice. There are issues with integrating all the things (e.g. runtime/metrics into expvar) and with manually passing distributed tracing information through contexts.
- Michael P: Worked on gVisor. We'd export diagnostic information over a Unix domain socket. Doesn't Java have something similar?
- Michael K: Java has everything. Maybe you're thinking of JVMTI?
- Michael P: If your binary is not a server, maybe we don't care? For example, automatically included if you're using net/http.
- Felix: Sounds like it's not a definite no? Would it be worth it to take these ideas and write up a proposal?
- Michael K: I'm in support of anything that brings all the diagnostics together and exported in an easy way. Maybe this will go the direction of memory profiles.
- Michael P: If I recall correctly, the linker disables memory profiling if memory profiles aren't ever referenced.
- Rhys: What if the API is "do a core dump" or "attach a debugger"? Is there anything there?
- Michael P: That sounds really fun. The API for "attach a debugger" is "ptrace attach." The debugger can do a core dump. The runtime could do a core dump, and we've discussed it as making it the new heap dump format.
- Felix: I really want to see more stuff being done with memory dumps, but it might be too high overhead for the use-cases I have in mind.
- Rhys: The out-of-the-box experience could just be attaching to the process for debugging, but just doing a very simple thing.
- Michael P: It would be very nice if we had a plan9-style interface where a process can export a new filesystem, like /proc/<pid>/go.
- Rhys: My read on why keeping memory profiling always on is OK is that the memory profile is just metadata on heap memory. But it means that other profiles just aren't used because they're disabled by default: the overhead is unclear and nobody knows what to set the profiling rates to.
- Felix: I like the idea of turning it on by default. It's definitely hard to decide what a good value is for block and mutex profiles.
- Michael K: I saw Nick had CLs out to make these profiles cheaper. Maybe cheap enough to be on by default?
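(Not from the meeting: the knobs under discussion, for reference. The specific values below are arbitrary examples rather than recommendations, since picking good defaults is exactly the open question.)

```go
package main

import "runtime"

func enableContentionProfiling() {
	// Report one in every 100 mutex contention events in the mutex profile.
	runtime.SetMutexProfileFraction(100)
	// Sample roughly one blocking event per 100µs spent blocked in the
	// block profile (the rate is in nanoseconds).
	runtime.SetBlockProfileRate(100_000)
}

func main() {
	enableContentionProfiling()
	// ... run the program; fetch profiles via net/http/pprof or pprof.Lookup.
}
```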
- Workaround PGO inline issue by dropping samples? (pr)
- The workaround is to hack up the pprof profiles passed to PGO to prevent profiling of the problematic frames in the meantime.
- Michael P commented on the PR.
- Ready for high-level review: Configurable stack depth CLs for heap/block/mutex/goroutine profile.
- Michael K: I'll take a look.
- OpenTelemetry for Go Profiling / Metrics / Traces … ? (gh issue) (quote: “Could we integrate into the new “hot thing”, OpenTelemetry?”)
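To make the file-based idea above concrete, here is a minimal sketch (not something the runtime offers today) of a program that periodically dumps runtime/metrics as JSON to a file named by an environment variable, so a host-side agent could ship or convert it. The GO_METRICS_FILE name and the 30-second interval are invented for illustration:

```go
package main

import (
	"encoding/json"
	"os"
	"runtime/metrics"
	"time"
)

// dumpMetrics reads every supported runtime/metrics value and writes a flat
// JSON snapshot to path.
func dumpMetrics(path string) error {
	descs := metrics.All()
	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	out := make(map[string]any, len(samples))
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			out[s.Name] = s.Value.Uint64()
		case metrics.KindFloat64:
			out[s.Name] = s.Value.Float64()
		case metrics.KindFloat64Histogram:
			h := s.Value.Float64Histogram()
			out[s.Name] = map[string]any{"counts": h.Counts, "buckets": h.Buckets}
		}
	}
	b, err := json.Marshal(out)
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0o644)
}

func main() {
	path := os.Getenv("GO_METRICS_FILE") // hypothetical opt-in knob
	if path == "" {
		return
	}
	for range time.Tick(30 * time.Second) {
		_ = dumpMetrics(path) // best-effort; an agent tails or converts the file
	}
}
```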
Comment From: mknyszek
2024-05-09 Sync
Attendees: @prattmic @rhysh Jon B (from DataDog) @nsrip-dd @bboreham @cagedmantis
- Freeze soon: code reviews?
- Carlos: removing the cmd/trace/v2 and internal/trace/v2 "v2" components. CLs almost ready.
- Nick: Frame pointer unwinding for block/mutex profiles is ready but needs to be submitted.
- Felix: On the deeper-stacks CL, Cherry asked for a proposal. A follow-up CL sent yesterday may avoid the need.
- Rhys: Fixing runtime mutex contention profile semantics. Is this too subtle for this point in the cycle?
- MP: I’m not too concerned, but haven’t seen the CL yet.
- Jon: Looking to add profiling/tracing without changing the source code.
- See notes from last week.
- Jon: Been playing with Go plugins. If a plugin is available, net/http, etc., may log via the plugin. (A rough sketch of the plugin-loading idea appears after these notes.)
- Jon: Trying to bridge the gap with other language platforms.
- MP: Feels like we want a .so provided via environment variable, but Go doesn’t support that well.
- Bryan: https://github.com/grafana/beyla is an eBPF kernel-based approach. Maybe Go should make it easier for these kinds of tools to find the data they need?
- Rhys: In the past, we’ve collected periodic profiles/execution traces. Then do post-hoc analysis of the trace to determine the higher-level events. Perhaps net/http, etc, could add to execution traces?
- Jon: We’d like to combine this with distributed tracing, so we’d like request ID, etc metadata.
- MP: We’ve discussed adding high-level events to trace before. Adding context is a bit tricky, as we need some sort of standard of what to include.
- Rhys: It may be a bit much to automatically include certain headers, etc. in traces.
- Jon: OpenTelemetry has standardized HTTP headers for tracing, so that could serve as a baseline of context to include.
- MP: If we want this one enabled automatically (via environment variable), there could also be an option for extra HTTP tracing opt-in.
- Rhys: There is a debug call API.
- Jon: ptrace is slow, is that problematic?
- MP: Debug call is used by Delve. It doesn’t technically need ptrace, perhaps an LD_PRELOAD shared object could call it? I’m not sure what the ABI is?
- https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=1843;drc=a8ba163695f605c72156c90de97a06b3e40566ff has the ABI/requirements. I don’t think this would be usable from a shared library as it more-or-less depends on goroutines being frozen.
- Rhys: Perhaps a debug call into a common package to register a socket/listener/etc to receive events? Maybe this is no better than an environment variable?
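As a rough illustration of the plugin prototype Jon described (the environment variable and symbol name below are invented, and, as noted above, Go plugins have no stable ABI and only work on some platforms), this is roughly what loading an optional instrumentation hook looks like with the standard library plugin package:

```go
package main

import (
	"log"
	"os"
	"plugin"
)

// loadTraceHook loads an optional instrumentation hook from a Go plugin named
// by GO_TRACE_PLUGIN (a hypothetical variable). The plugin is expected to
// export: func TraceHook(event string).
func loadTraceHook() func(string) {
	path := os.Getenv("GO_TRACE_PLUGIN")
	if path == "" {
		return nil // instrumentation not requested
	}
	p, err := plugin.Open(path)
	if err != nil {
		log.Printf("trace plugin disabled: %v", err)
		return nil
	}
	sym, err := p.Lookup("TraceHook")
	if err != nil {
		log.Printf("trace plugin missing TraceHook: %v", err)
		return nil
	}
	hook, ok := sym.(func(string))
	if !ok {
		log.Printf("TraceHook has unexpected type %T", sym)
		return nil
	}
	return hook
}

func main() {
	if hook := loadTraceHook(); hook != nil {
		hook("server start")
	}
}
```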
Comment From: mknyszek
2024-05-23 Sync
Attendees: @jba Jon B (from DataDog) @prattmic @rhysh @felixge @nsrip-dd @mknyszek Milind (from Uber)
- Rhys: Got in an update to runtime-internal lock contention before the freeze, but it's a bit slow, as it turns out.
- Felix: Does that collect stack traces? I mostly see the "missing" indicator from the last release.
- Rhys: It currently can capture stack traces behind a GODEBUG, but the stack trace doesn't really match the mutex profile exactly (kind of halfway between a block and a mutex profile). The change that I have in plumbs through the amount of delay the current lock holder is causing, so it's all attributed to the right call stack. That is, it makes it one kind of profile, which means we can get rid of the GODEBUG. However, it looks like it made the lock critical section ~100 ns slower, and it's causing regressions.
- Michael K: The regression is minimal at 16 cores, but significant at 88 cores.
- Michael P: Did you confirm whether it was locks failing to acquire when they should have been acquired because of some bug in the lock, vs. legitimately failing to acquire because of a longer critical section?
- Rhys: What I've seen so far is consistent with the latter. I haven't seen data that makes me worried that it'll get slower and slower with bigger machines. What I think is going on is that apps have some lock that is often the most popular lock. My idea is that it reaches some threshold and everything falls apart at that point.
- Automatic Instrumentation for Go (Jon B)
- JVMs allow users to attach tracing at runtime. OpenTelemetry currently uses eBPF tricks (in a way they're not really meant to be used) to achieve something similar for some programs.
- Built a prototype where net/http will load a Go plugin and start generating trace IDs if an environment variable is set. It just writes to a log. Go plugins probably won't actually work because there's no stable ABI, but if there's any way we can set a flag and let the Go runtime start writing out data to a UDS, that would be awesome. Don't want to be too prescriptive, but this is the problem we want to solve.
- Jonathan A: Only interested in tracing, not metrics?
- Jon B: There are ways to get metric information out already, but there isn't a great way to get distributed tracing out.
- Jonathan A: You'd get runtime metrics, but not other kinds of metrics. Like from net/http.
- Jon B: Want to get the trace ID from a header and then propagate it and send it back to whatever is collecting spans.
- Jonathan A: Ah, so it's about leaf services (final span).
- Jon B: Or if it's part of a chain of requests. Each server would have this kind of tracing support in it and each server would report back and the collector would correlate them.
- Jonathan A: What would the server write to the UDS? What is the format?
- Jon B: For now it was just a little custom binary blob.
- Jonathan A: Would it be OK if it was JSON?
- Jon B: It's not a lot of data for each span, so probably. There was some discussion of including OTel in the runtime, but it's probably too big.
- Jonathan A: We wouldn't vendor the OTel library, but we could just follow the spec for serialization.
- Felix: I want to correct something from earlier, we're interested in not only tracing, but also metrics and profiling. We want to get all kinds of stuff out.
- Jonathan A: I was thinking of something similar to expvar but the output format would be OTel JSON or something.
- Rhys: It sounds like what you're describing is a read-only API where the process dumps information out. But you've also been talking about chains of services that need to know who called them. It sounds like it's not entirely read-only. It sounds like there's a need to read a header and apply it to the outbound calls.
- Jonathan A: That's actually really easy. We just put it in a context and make sure outbound requests put it in the headers. We should probably do it.
- Rhys: You're saying that's something we'd be willing to have in the standard library for all requests?
- Jonathan A: Seems like a reasonable thing to me, but I don't speak for the whole team.
- Rhys: I haven't found that there's a clear answer that we should definitely trust the headers we're getting. But I'm also hearing that there's a desire not to have the application owner have any dependencies on packages that would make these explicit choices. Am I understanding that right?
- Jon B: Yes, with an asterisk. If the trace generates a header that we're generating trace information
- Rhys: It seems like there are many components that need to line up for this to work, and it's not clear to me that it can be done without the application owner having a say in it. If the application owner is going to have to import a package anyway, why not make that the entire design?
- Jonathan A: I don't see a need to require an import except if the user wanted to add more.
- Rhys: There are security implications here.
- Jonathan A: The good news is that the user is going to opt in, and every server in the chain has to opt-in.
- Felix: In terms of the opt-in, the ideal state is that it could be set at runtime (e.g. via an env var). With net/http/pprof you have to make a small code change in main().
- Michael K:
- Felix: +1.
- Michael P: Reiterate and +1 Rhys' concern about some of the automatic trace ID propagation through net/http. That may be the most worrisome point. That one does feel potentially problematic from a security perspective. I know we said you have to opt in, but we're not so much thinking about individual applications opting in as a cloud provider opting everyone in. I don't know if it's really a problem, but it's something we definitely have to think about more.
- Michael K: We have a bunch of different ways of getting data out of Go processes, and there are a bunch of different formats for that and a bit of a discoverability problem. [more.] Maybe it’s time for an issue, for active discussion and planning?
- Jon B: I can do that.
- Michael K: Or a draft design. This is genuinely exciting!
- Work on metric for number of “runnable” goroutines (proposal). (Felix)
- Some renewed interest to get a runnable goroutine metric with the idea to do load shedding based on it. It's a little bit more direct than scheduling latencies. Just wanted to call dibs on working on this early in Go 1.24.
- Rhys: I think the idea of this metric for the purpose of load shedding has been on the issue tracker for a long time, and it existed as a patch, but it didn't really work for Google. Totally normal programs can spin up a zillion goroutines that instantly finish.
- Felix: There are internal users who insist that runnable goroutines are the answer. My idea is to just give it to them and let them experiment.
- Rhys: +1 that the scheduling latency histogram is useful.
- Felix: Want to check the temperature on implementing the proposal.
- Michael K: I think it's fine to have the metrics. Go for it!
- Milind: Speaking of the runnable goroutine metric: we need a metric of blocked goroutines with blocked wall-clock time in pprof. This helps detect deadlocked goroutines.
- Michael K: This has also been on the issue tracker for a while but we haven't come back to it.
- Rhys: +1 for collecting goroutine profiles every few minutes. Being able to connect them back to the process that created them is incredibly useful.
- Milind: So we may not need to put block time in the profiles, and there's a different way to get the same information?
- Rhys: Yeah, that'll show up in goroutine profiles.
- Milind: Those that leak slowly are harder.
- Michael K: In theory if you made goroutine profiles inconsistent, you could drop the STW.
- Rhys: In the design I found that a consistent snapshot is important.
- Nick: There is a way to get a goroutine profile with waiting time; it's one of the goroutine profile debug levels. Is that good enough? (See the sketch after these notes.)
- Milind: I'll take a look.
- Milind: Is this issue I created some time back in the scope of this group's area of work? https://github.com/golang/go/issues/67221
- Felix: Yeah, I think so.
- Michael K: +1. I think we could do it optimistically based on whether heap profiling is on. Collecting stacks on each allocation is quite expensive.
- Milind: Yeah, so if someone wanted to catch it with high certainty, they can turn up the sample rate.
- AI(mknyszek): Comment on the issue.
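For reference on the goroutine profile debug levels Nick mentioned: the text format at debug=2 prints each goroutine in traceback form, including its wait reason and an approximate wait duration for goroutines that have been blocked for a while. A minimal sketch:

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2: one traceback per goroutine, e.g. "goroutine 7 [chan receive, 12 minutes]:".
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)

	// debug=1: goroutines aggregated by identical stacks, without durations.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
}
```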
Comment From: mknyszek
2024-06-06 Sync
Attendees: @rhysh @felixge @nsrip-dd @bboreham @cagedmantis Milind (from DataDog) @dashpole @aclements
- Felix: Any thoughts on shadow stacks? Obviously not the hacky thing I did, but about a future where operating systems provide us access to hardware shadow stacks?
- Felix: Inspired by Bytehound’s profiler’s cache. Caches the walked stack and leaves a trampoline on a return address in the stack to invalidate that cache. Tried it out in Go with frame pointer unwinding and in the best case it’s 8x faster! Even if the stack is constantly changing, it helps for really deep stacks. I don’t intend to really propose this because it’s scary and FP unwinding is already really fast.
- Felix: The hardware is already capable of maintaining shadow stacks. If in the future the OS provided them (there are Linux patches in flight), would we be interested in using them? Maybe FPs would no longer be needed.
- Austin: I think it all depends on cost. In particular, if it makes it expensive to switch goroutines, then I can’t see it being an overall win. But if not, then sure I think we’d be open to it.
- Nick: coroutines in execution traces
- Nick: One of the biggest changes for execution traces is that we now have coroutines. We’re looking at how to represent these in the trace UI. It looks like it’s implemented as a lightweight goroutine switch. In our current UI, we have lanes for each G. With coroutines, I think this will look like a rapid ping-pong between goroutines. Has anyone thought about whether that’s good or if there are better ways to represent this?
- Felix: I think I brought this up in the previous meeting. My concern was more on the data volume size. If this becomes popular, it could bloat traces. In terms of visualizing, if we chose not to record this data for size reasons, there’s nothing to visualize.
- Austin: We’re hoping that this doesn’t get used all over the place.
- Felix: Are we leaning toward not using iterator patterns all over std?
- Nick: I think the idea is to use the push pattern, which doesn’t involve coroutines.
- Austin: Right. We do plan to add iterators to many places in std, but those just use yield functions, not coroutines.
- Felix: I think it’s probably okay to just represent these as goroutine switches.
- Nick: If the expectation is that these won’t be common, then maybe we just leave it alone unless people complain.
- Austin: I will note that these coroutines are first-class, so most of the time it’ll be a ping-pong between two goroutines, but you can pass them to other goroutines and then things get really weird.
- Bryan: pprof labels for memory profiles? go/issues/23458
- Bryan: A colleague asked about this. Is there any recent thinking? If you did have “tell me all the memory allocations in the last 30 seconds”, you could do something like CPU profiles.
- Rhys: I would like to see memory profile sample events in the execution trace. You can kind of see which goroutines are allocating memory from the “heap in use” change events. If we had a trace event for when an allocation is profiled, that would be really useful.
- Austin: Ah, then you could construct a labeled memory profile from a trace.
- Bryan: I guess I am proposing memory delta profiles.
- Felix: I would love to have that option in the execution trace, as long as it's an option. I'm worried about data volume. At some point we need to think about an API for controlling what goes into a trace.
- Nick: I think we can trade one for the other. I think we might get more events from the current “heap in use” events than you would from memory profile events.
- Rhys: Heap in use also goes down, so I don’t know if we could do a direct trade. At the current default memory profile rate, you wouldn’t get many events, but if you set that to 1 you’ll be sad.
- Rhys: A few years ago Austin talked about log-based profiling with offline aggregation. Is the execution tracer the way to do that, with some API to filter?
- Austin: I think that would be a good way to do that. We’ve already gone down that path a bit! I don’t think we can change the pprof format, though.
- Rhys: A notion of time in profiles is amazing. See the Firefox profiler.
- Felix: OpenTelemetry has had a lot of discussions about timestamps in their profile format.
- Austin: Also it’s not that hard to write a perf.data file!
- Austin: Back to memory profile labels. I think we could have labels on the “inuse” profile, but not the “alloc” profile.
- Bryan: We’re attaching trace IDs to allocations. I think they’d have to age out.
- Austin: The other thing is memory delta profiles. I think the execution trace approach is a good way to do that. Or we could add an API specifically for a delta profile: "allocations between point A and point B." (A userland approximation is sketched after these notes.)
- Felix: To me, goroutines can be grouped together by what initial function they execute. You might be able to infer heap labels from labels in a CPU profile.
- Bryan: The pathological case is you have something allocating a 1 MiB buffer. It won’t spend much time doing that, versus something allocating lots of 8 byte objects.
- Felix: What use case are you trying to solve? OOMs?
- Bryan: I’m interested in the things allocating and freeing a lot of memory, which is driving GC.
- Rhys: I think execution traces are really good. If you have a profile you’re trying to solve, you’re going to want other data from the execution trace to pin that down once you know the general area that’s in. Also, if tracing is too high overhead, but this is happening a lot, you can sample when you’re tracing.
- Bryan: We have a distributed trace with head sampling and we can attribute CPU profiles to the trace, but not memory profiles.
- Austin: And you’re using labels on the CPU profile to correlate it with the distributed trace spans?
- Bryan: Yes.
- Austin: And an “inuse” profile doesn’t make much sense for attributing to a distributed trace. You really want a memory delta profile with labels so you can say “this span allocated stuff.”
- Bryan: Yes.
- Rhys: Maybe if we can really trim down the events recorded by the execution tracer, the overhead would be low enough that you could use it for this.
- Felix: +1 to being able to collect execution traces with just CPU samples. You get timestamps and you can afford to send the data somewhere. I would love to just do that.
- Rhys: And then you also don’t have the mysterious 200ms in stopping the CPU profile before you can start a new one. https://github.com/golang/go/issues/63043
- Bryan: So memory allocation events would have to be added to the execution trace, and there would have to be a filtering API. It seems like nothing else we talked about really solved the problem.
- Austin: We want that filtering API anyway. That’s the hard part but very desirable.
- Felix: Another complication is that we currently don’t include labels in the execution trace.
- Felix: For the next release, if there’s interest in working on a filtering API, we’d be interested in proposing something.
- Austin: +1.
- Rhys: Clocks: why do some platforms have cputicks == nanotime, and is there a reason we don't (or shouldn't start to) use cntvct_el0 on arm64? (Relates to the runtime-internal mutex contention profile, and perhaps also the execution tracer.)
- Rhys: I assume cputicks is supposed to be really fast. Is there a reason we haven’t used the instruction on ARM?
- Austin: How tightly are the clocks synchronized between CPUs? And are they constant rate or do they fluctuate with the clock rate?
- Milind: is it possible to get the full ancestry of goroutines in data race reports?
- Austin: Those stacks are tracked and printed by TSAN itself, so I think this would require support on the TSAN side. It unfortunately doesn't just call back into the runtime for these.
- Rhys: If you set the tracebackancestors GODEBUG to a large value, you might be able to correlate the output of the race detector with the Go traceback.
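There is no built-in "allocations between point A and point B" API today; a userland approximation, assuming the github.com/google/pprof/profile package, is to snapshot the cumulative allocs profile twice and subtract:

```go
package main

import (
	"bytes"
	"os"
	"runtime/pprof"
	"time"

	"github.com/google/pprof/profile"
)

// snapshotAllocs captures the cumulative allocation profile in pprof format.
func snapshotAllocs() (*profile.Profile, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("allocs").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return profile.Parse(&buf)
}

func main() {
	before, err := snapshotAllocs()
	if err != nil {
		panic(err)
	}

	time.Sleep(30 * time.Second) // the window of interest

	after, err := snapshotAllocs()
	if err != nil {
		panic(err)
	}

	// Negate the "before" snapshot and merge: the result approximates
	// allocations made during the window.
	before.Scale(-1)
	delta, err := profile.Merge([]*profile.Profile{after, before})
	if err != nil {
		panic(err)
	}
	f, err := os.Create("allocs-delta.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := delta.Write(f); err != nil {
		panic(err)
	}
}
```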
Comment From: mknyszek
2024-06-20 Sync
Attendees: @rhysh @bboreham @cagedmantis @dominikh @mknyszek @prattmic
- Rhys: Is a mutex contention profile really the best way to find problematic contention (especially for runtime-internal locks), or is there a way we could measure the amount of time that a particular lock is held, to provide an earlier warning? (Relates to the scalability of apps where a single runtime-internal lock is the bottleneck, either sched.lock or in a channel.)
- Runtime-internal lock contention is not really a problem until it's a terrible problem. We had a problem with sync.Mutex in Go 1.9 (queuing slow path) and Go 1.13 (priority boost when unlocking).
- Looking for prior art on identifying locks with heavy load before contention.
- Michael K: Worried about whether such a metric would be ignored by default.
- Rhys: But for runtime-internal contention, this could be really useful for maintainers.
- Michael K: Interesting. Is this apparent in CPU profiles?
- Bryan: Had fun with a read-write lock. One pending writer holds up all future pending readers, and people don't know this. There are a few PRs in Prometheus related to this. It would be great in these situations to know better when these things are happening. (A small demonstration follows these notes.)
- Michael K: Maybe this is extractable from a trace? Maybe we already have this (or a weaker version of it) in the trace viewer?
- Michael P: You could imagine a mutex viewer in the trace tool that indexes by mutex.
- Rhys: We don't have the address of the sync.Mutex in the execution trace, and we also don't see mutex operations that didn't cause blocking/unblocking.
- Rhys: There's an interesting instrumentation technique from a paper Milind shared that could be used to annotate CPU profile samples. Could be an easy way to get data on what's going on in critical sections for runtime-internal locks.
- Bryan: Observation: traces are not a tool we're reaching for; historically they were expensive and not that great. A different but related thing is a tool that would warn you that you used an RWMutex when you should have used a Mutex.
- Michael K: RE: RWMutex, maybe we should put a warning in the docs. This is definitely not the first time someone was burned by it.
- Rhys: What information do you want in a bug report?
- Michael K: It really depends. Maybe we need a flow chart?
- Michael P: RE: CPU profiling to estimate time in critical sections. Could the lock code itself sample the duration of critical sections even in uncontended sections? The downside is having to make the uncontended fast path slower. It would give us numbers that we could be more confident in.
- Rhys: Should we put mp.locks in the CPU profile sample? And also, could we get a measure of lock load before locks get contended? And yeah, we also have to be careful about adding anything to the critical section.
- Rhys: What other tools would be helpful? For example, would better bpf support (tell a large swathe of users to run a bpf program to get some data) help in making decisions? This is in the same vein as a Go build cache experiment to understand how people were actually interacting with the cache. Not perfect, but it's still useful data.
- Michael K: Runtime diagnostics face a bigger hurdle than build diagnostics because production environments tend to be hard to get data out for.
- Rhys: Continuous profiling/tracing helps. For example, getting a page trace out is hard but getting a trace with certain fields set is much easier because of policy.
- Michael K: The page trace is now replaced with some "experimental events" that you can parse out of a trace: https://pkg.go.dev/golang.org/x/exp/trace#ExperimentalBatch.
- Michael P: https://go.dev/cl/588696
- Dominik: Do the alloc/free experimental events have stack traces?
- Michael K: Not at the moment because I was worried about overhead, but you could imagine stealing the heap profiling stack when we're taking it anyway as a low-cost way.
- Dominik: One user just said they'd be happy with opt-in stacks that have a 90% slowdown, for the sake of good data. It's useful in dev environments, at least.
- Michael K: Related to adding mp.locks to CPU profile samples, I personally would really like to see # of GCs survived on heap profile samples.
- Rhys: Are you thinking something like being able to see a histogram for a heap profile sample of how long it survived?
- Michael K: Yeah.
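A small demonstration of the sync.RWMutex behavior Bryan described: a blocked writer prevents later readers from acquiring the lock (deliberately, to avoid writer starvation), so one slow reader plus one pending writer can stall everything. A self-contained sketch:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // a long-running reader holds the lock

	go func() {
		mu.Lock() // pending writer: waits for the reader above
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer start waiting

	done := make(chan struct{})
	go func() {
		mu.RLock() // new reader: queues behind the pending writer
		mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second reader acquired the lock")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("second reader is stuck behind the pending writer")
	}

	mu.RUnlock() // release the original read lock; everyone proceeds
}
```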
Comment From: mknyszek
2024-07-18 Sync
Attendees: @mknyszek @felixge @rhysh @nsrip-dd @bboreham @cagedmantis
- Felix: New Trace Viewer
- Working on a nicer trace viewer with minimal JS dependencies to maybe hopefully make it upstream. Anything you all would want out of this?
- Michael K: Awesome! No concurrent plans on our side.
- Rhys: Grouping related goroutines or a subset of them. Also composability of features and deep-linking of UI elements back to trace data.
- Michael K: Would like the trace viewer to be more all-encompassing (so features are more discoverable) and for the trace viewer to scale to arbitrarily sized traces with help from the server.
- Rhys: Could the experience be good even if the execution trace is not on the same machine?
- Felix: Do you mean live streaming of the trace?
- Rhys: I just mean the file being remotely accessed and relatively quick to load. So that you could build a web service to view execution traces. Also to clarify viewing related goroutines, I mean interactions with other tools (so, controlling what's viewed on screen via external tooling).
- Felix: Were you imagining a view that's goroutine-centric as opposed to the default proc-centric view?
- Rhys: Ideally I'd have a view that's centered on the right goroutines. Probably can't be done in the general sense, but fine for investigating specific issues.
- Michael K: I wonder if there's a good high-level data model for visualizing trace data (something higher level than what the parser spits out).
- Felix: Perfetto uses SQL, and I played around with that a bit, but it was pretty hard to make work nicely.
- Rhys: I recently built a tool that extracts state machines from individual goroutines, so if you have an HTTP/1.1 client, the execution trace data plus this tool can turn that into a graph (SVG) with a call stack at each node. A step toward a call graph that spans goroutines describing what it takes to serve an RPC. (A tiny trace-parsing sketch in this spirit follows these notes.)
- Bryan: The arrows just represent the next thing that happened regardless of how long it took.
- Rhys: Right, there's no concept of time. The thickness of the line describes frequency of the event, however.
- Michael K: Really cool. Could put a distribution of time spent in each state on each box or something.
- Felix: +1, really cool. This is always the same goroutine right?
- Rhys: This is a couple thousand of the same goroutine. But it's also hard to say "same goroutine" because it's hard to name a goroutine. Sometimes it's easy (where did it start) but with things like errgroup and goroutine pools, it's much harder.
- Felix: I did something similar trying to map communication across goroutines.
- Nick: I was confused by some of these state transitions. If a goroutine is runnable, shouldn't it always be running next?
- Rhys: The transition from runnable to waiting passed through running, and the execution trace tells us how long, but we don't get any call stacks, so we don't know exactly what it's doing. Also, in the tool, if you mouse over any arrow it'll give you an example goroutine ID and a timestamp. I also did a whole bunch of filtering out.
- Bryan: RE: finding/naming goroutines, there are goroutine (pprof) labels, and then there was https://github.com/golang/go/issues/35178. JetBrains also came up with a workaround.
- Felix: Why is Running sometimes there?
- Rhys: Unblock events.
- Michael K: You could put in dummy events to retain the original state machine.
- Rhys: Yeah, we could do that, but it was hard to name them for dot. RE: naming goroutines, names can be situational
- Michael K: How did GopherCon EU go re: execution tracing?
- Felix: I seem to have gotten more people excited about execution traces, so success!
- Felix: Talked with @jba about runtime observability out of the box.
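As a tiny, hedged sketch of the kind of analysis Rhys described (his tool itself isn't public), this counts goroutine state-transition edges in a trace using the experimental golang.org/x/exp/trace parser as it existed around the time of these notes; API names may shift:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	f, err := os.Open(os.Args[1]) // a trace produced by runtime/trace
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := trace.NewReader(bufio.NewReader(f))
	if err != nil {
		log.Fatal(err)
	}

	// Count edges like "Runnable -> Running" across all goroutines.
	edges := make(map[string]int)
	for {
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if ev.Kind() != trace.EventStateTransition {
			continue
		}
		st := ev.StateTransition()
		if st.Resource.Kind != trace.ResourceGoroutine {
			continue
		}
		from, to := st.Goroutine()
		edges[fmt.Sprintf("%v -> %v", from, to)]++
	}
	for edge, n := range edges {
		fmt.Printf("%8d %s\n", n, edge)
	}
}
```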
Comment From: mknyszek
2024-08-01 Sync
Attendees: @mknyszek @prattmic @bboreham @felixge @nsrip-dd @rhysh Milind (from Uber)
- Felix: Trace Viewer Demo.
- Super fast, renders a ton of goroutines quickly.
- Immediately get data when you open the trace – streaming.
- Felix: It would be nice to have a public API to skip over the trace quickly.
- Michael K: Agreed. You can do it, it's not that much code, but it is annoying to have to write it out and depend on the wire format.
- Michael P: +1, nice to have the public API for skipping batches for things like trimming (simple, high-level operations).
- Felix: Working on streaming a trace. Basically, you connect the tool to the online trace
- Michael K: Did you run into any issues with CPU usage in the x/exp/trace parser? I saw 25 MB/s, which isn’t as fast as I’d like.
- Felix: Right now it seems good. 10 MB trace in 300 ms in the Go program, plus 300 ms more in the browser.
- Michael K: There is probably low-hanging fruit. Lots of places where maps were convenient, but maybe we could do better with specific data structures.
- Felix: As long as the new tool is 10x faster than what we have today, it’s a win.
- Rhys: Really cool! When can we use it? :) Going back to two weeks ago, can we configure the goroutine sorting functionality you depicted? For example, custom clustering.
- Felix: Short answer yes, long answer is that I tried to come up with a query language but quickly realized that was a rabbit hole. But what would be really easy is to just let someone hook into the JavaScript objects containing all the goroutine data before presenting it.
- Michael P: Perfetto exposes a SQL query box on their page, which is horrible, but that's how they do it at least. A more serious suggestion that might work well for Go users: make the tool importable (app.Main(...)) so you can configure custom sorting on the server side in Go.
- Felix: Not opposed, but I prefer to do a lot of the heavy lifting on the browser side because serialization and sending data around is expensive. I was thinking about dynamically loading parts of the trace under memory constraints, but maybe it won't be necessary – I think I can get the JavaScript object encoding down even smaller.
- Michael P: I was suggesting a JS text box, not actually SQL
- Rhys: Could just the sorting be pluggable? And the grouping.
- Felix: Already have some ideas for grouping. Open to contributions.
- Felix: What dependencies might be allowed if this is to become part of the Go project? It’s lightweight today, a couple MIT/Apache2.0 libraries.
- Michael K: There’s potential. We have several other JS dependencies. It would need to go through proposal review.
- Felix: Currently I don't really need a framework (just use canvas) but for more traditional UI elements and tracking their state, a framework would streamline things a lot.
- Michael K: May want to reconsider the rest of the UI as well.
- Felix: Yeah, I'd want to also drop users directly into something useful instead of asking them to make a choice up-front. We'd probably want to keep some things, like the MMU views. We'd also want some new stuff, like a flame chart view on a specific goroutine. Or like Rhys' graphviz visualization of a goroutine's "state machine".
- Rhys: Some stuff was already broken in cmd/trace like the pprof. How do we know if it's being used?
- Michael K: Telemetry (the new opt-in kind)? We already have it for invocation counts.
- Felix: When tools are broken, it's hard for inexperienced users to know that there's something wrong, rather than they just did something wrong. Absence of bug reports != absence of usage. These tools also tend to be considered lower priority, and so get fewer bug reports and also less attention.
- Michael K: RE: trust in tools, a good testing strategy would go a long way. cmd/trace never had a great testing story.
- Michael P: You can check telemetry.go.dev for what's configured. Non-compiler non-go-command tools aren't currently collected in the default configuration. All the tools are instrumented, though.
- https://telemetry.go.dev/config
- Felix: RE: testing, I've thought about it while writing this, but my initial approach has been not to test anything yet. UI testing is hard. Testing might freeze the implementation and inhibit exploration. We may even need to burn it to the ground and build it back up. Already done so 3 times following some dead ends.
- Michael K: +1, it's more of a "this would be a big win" when it comes to the proposal to replace the existing cmd/trace. Definitely a long-term concern.
- Michael P: Public gocore / gobinary packages
- https://go.dev/issue/57447 Proposal from a while ago
- Mostly a heads-up.
- There are two parts to it: analyzing binaries, and analyzing core files (with the binary). We'll probably start on analyzing binaries since that's making things harder for other people. For example, govulncheck manages a fork of debug/gosym. Delve has a fork of govulncheck's fork.
- Michael K: Taking a different approach, will probably try to come up with a new API. Getting rid of the old cruft may also make work happen faster.
- Michael P: We also plan to do things in pieces, and not necessarily try to design the entire API at once.
Comment From: mknyszek
2024-08-15 Sync
Attendees: @mknyszek @rhysh @felixge @nsrip-dd @cagedmantis
- Rhys: I’ve been thinking about runtime-internal locks, and wondering if we should consider having a different implementation we use for important singletons like “sched.lock” or “mheap_.lock” than the one we use for, say, the zillions of chan values an app might have. I expect we’ll want to collect data to help make that decision. The question for this group is: Where/how should the instrumentation to collect that sort of data exist? I’m thinking of something that measures the length of critical sections and reports a sort of histogram of that via the mutex profile, or the high-water mark of the number of locks held by the critical section (once m.locks returns to 0). Leave it in a CL that gets updates for a couple releases and then goes to rot? Or something like a GOEXPERIMENT or build tag that’s usually compiled out completely, but is more ready to use? Or, send code first and we’ll discuss it case by case :)
- Michael K: I think this could work as any of GODEBUG, GOEXPERIMENT, or a build tag. Feel free to just send the code and we can hash it out. Sounds useful, I've definitely written one-off things in the past to dump critical section time histograms.
- Felix: Small updates on trace viewer and chats with Lalit Maganti (Google) about Perfetto.
- Made backwards progress; taking everything apart and putting it back together. Data transfer between Go and the browser was slower than it should be by an order of magnitude. I was batching JSON-marshalled messages, but now I'm using a streamed varint-encoded (but higher level) trace format. Zooming out for big traces also has framerate issues, so I'm also working on making summaries.
- Got an email from Lalit Maganti who is interested in trying to spend some time to drive adoption of Perfetto in open source projects, and is interested in Go. I asked if there is a way to stream the data in and it turns out no. There's also no stack trace support and Perfetto has no concept of associating a time slice with a stack trace. You can kind of align it with a separate track, but it's not first-class.
- Michael K: Thanks!
- Rhys: I had some practical experience with what I feel is a real need for rendering a remote trace. I had a 600 MiB trace and cmd/trace needed 51 GiB of memory to work. Launched the server on a separate machine and viewed the trace remotely.
- Felix: I intend to support this use-case. But is cmd/trace's approach good enough, or are you looking for something different?
- Rhys: It worked, so I think so! I was trying to find GCs and it was really annoying to try to find them.
- Michael K: It might be straightforward now: thanks to Felix's refactoring of the trace viewer API, it's much easier to add new views. We can have a simple GC-only view in the existing go tool trace.
- Felix: Here's a sketch of how this would look for this use case. My vision for the trace viewer is that you should be able to look at a high-level summary and smoothly look across the trace.
- Rhys: I have a tool kind of like go tool trace -d but it allows filtering. Would this be useful to send upstream?
- Michael K: We'd be happy to take that contribution. Just file an issue to get others aware and to state intent. I don't think it needs a proposal; the trace CLI is (IIRC) not covered by the compatibility guarantee.
- Felix: +1 for Rhys' tool. I usually use awk but it's cumbersome.
- Felix: Nick wrote a blog post: https://nsrip.com/posts/oneinstruction.html
- Nick: Clarifying comments on the flight recorder issue (https://github.com/golang/go/issues/63185#issuecomment-2290155854)
- Michael K: See https://github.com/golang/go/issues/63185 (I described this in the meeting). A usage sketch of the experimental API follows these notes.
- Rhys: Maybe call it a Limit and don't specify upper or lower? :P I think this could benefit from worked-out examples. Russ also said that
- Michael K: This is making me think the config should just always be global. It would be way too complex to have different subscribers have different configurations. I will sketch out a different API.
- Felix: +1 to both worked out examples and a global configuration. I'll note that this is consistent with how profiles work, too.
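For readers following along, a minimal usage sketch of the experimental flight recorder as published in golang.org/x/exp/trace around the time of these notes (names and configuration knobs may change as the proposal evolves):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	// Continuously retain a recent window of execution trace data in memory.
	fr := trace.NewFlightRecorder()
	fr.Start()
	defer fr.Stop()

	runWorkload()

	// Something interesting happened (e.g. a latency spike was detected):
	// snapshot the retained window to a file for later analysis.
	f, err := os.Create("flight.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Fatal(err)
	}
}

func runWorkload() {
	// Placeholder for the application's real work.
}
```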
Comment From: mknyszek
2024-08-29 Sync
Attendees: @prattmic @bboreham @felixge @nsrip-dd
- Michael K: FYI I have a fix for gocore for Go 1.22+. It was easier than expected.
- MP: Runtime lock contention on channels
- Fastly issue discussed at GopherCon UK: https://exactly-right-airedale.edgecompute.app/Go%20Channels%20Slow%20Down.pdf (a related mutex-profile sketch follows these notes).
- Felix: trace viewer updates
- Demo!
- Go program parsing is the bottleneck (1.3 GiB trace: 42s to load the first time, 20s second time).
- Generations could be parsed concurrently?
- Parsing papers over some differences, e.g., generations parsed independently start goroutines in an undetermined state; the parser fixes this up.
- Felix: Frame pointer cleanups
- Lots of different representations of stack traces (logical, physical), expanded or not. Felix would like to clean this up for 1.24.
- Nick: Looking at getting rid of the 100ms sleep in the profile reader.
- Mostly working for standard reader. Trace reader still has difficulties with doing the wakeups at the right point.
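Related to the channel-lock discussion above: runtime-internal lock contention (including channel locks) is reported in the mutex profile in recent releases once mutex profiling is enabled. A minimal sketch of turning it on and dumping the profile:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Sample roughly 1 in 100 mutex contention events; runtime-internal
	// contention shows up in this profile as well in recent Go releases.
	runtime.SetMutexProfileFraction(100)

	runWorkload()

	f, err := os.Create("mutex.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}

func runWorkload() {
	// Placeholder for code that contends on shared channels and mutexes.
}
```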
Comment From: mknyszek
2024-09-12 Sync
Attendees: @mknyszek @prattmic @felixge @nsrip-dd @chabbimilind
- Michael K: WDYT about slightly expanding the scope of this meeting to include all runtime topics?
- Most other runtime topics are closely related to diagnostics, for example performance.
- The regular attendees seem like a good group for such discussions.
- Everyone in attendance seems supportive, we'll discuss next time and try to get more feedback on the topic from more attendees.
- Michael K: See my latest comment on the flight recording proposal.
- TL;DR: I think we should stick with the original proposal, but decouple the internal rate of traceAdvance calls from the amount of trace data kept around for flight recording.
- If we want to expose the internal rate of traceAdvance calls, let's do so in a separate proposal.
- Felix: WriteTo calls traceAdvance.
- Michael K: That's true, Cherry and I were thinking
- Felix: What is the use case for instantiating multiple flight recordings?
- Michael K: Mostly about consistency across different diagnostic APIs. e.g., CPU profiling
- Michael P: How does the global and flight recording
- Felix: I think the rates make more sense to configure globally. We probably don't want to have different instances. Same for CPU profilers. Consuming the data should allow multiple subscribers, but if resources can be shared (flight recording buffers) that would be ideal. It would be neat if we could have packages realize there is a clash in settings, but I don't know how to do that.
- Milind: RE: concurrent entities trying to profile, we have services for continuous profiling (e.g. pyroscope) but we also have a use-case where users of the services want to take a one-shot profile. When the continuous profiler is running, these other tools don't get anything. But these one-shot profiles are higher-priority. Could it be a queue of subscribers? Having multiple subscribers probably solves that problem.
- Felix: That's exactly what I had in mind to support.
- Michael P: This feels like an important and common case. For CPU profiling all you can control is the rate, and AFAIK nobody modifies the rate. It gets complicated when there are other settings.
- Milind: With concurrent profile collectors, each profiler would probably want its own buffer to avoid seeing old samples.
- Michael P: We would have to direct samples to multiple buffers. It should be possible, but there may be some scalability concerns. 1000 profiler consumers would be problematic. 2 is fine.
- Milind: Is stack merging happening in the signal handling or is it done in a separate goroutine?
- Several people: Discussing some implementation details on how this would work. Basically, the profiler goroutine should split out and duplicate the samples, not the signal handler.
- Felix: RE: the central/global configuration issue, the semantics we want are kind of like RWMutex. Many readers, but only one "writer" that controls the settings. Only one owner can get control of the tracer/profiler/etc, but anyone can subscribe.
- Michael K: This can be done with the tracer as well.
- Michael P: It depends a bit on whether the consumer cares what the settings are.
- Michael K: Maybe the "writer" queues up, as Milind suggested earlier?
- Michael P: In perf, with specific events, you can say sampled is OK, and you might get a warning. If you want it to be enabled all the time, you'll actually get an error if perf cannot arrange that.
- Milind: In Intel PMU events you can say "no skid events" and it works similarly.
- Michael P: Most recent proposal: https://go.dev/issue/53286. It paused partially because figuring out new C-created threads is complicated.
- Milind: This is one I filed earlier on: https://go.dev/issue/36821.
- Michael P: perf just scrapes /proc/* racily.
- Milind: I think there's a way to do that not racily with a two-pass algorithm.
- Nick: Mutex/block profiler frame pointer troubles. Recent issues for example: https://github.com/golang/go/issues/69294 https://github.com/golang/go/issues/69335
- Trying to introduce frame pointer unwinding for mutex/block profiling caused a bunch of bugs, mainly that it doesn't quite work exactly like the old profiling. Adding lots of kludges and special cases to make it work.
- Michael K: I think we should keep it, if you've got time to keep fixing bugs. Diagnostics are hard – we're not going to get as good coverage around them as we do for other things. We (unfortunately) have to ship it to find out.
- Michael P: From a maintainer's perspective, reverting changes also comes with a risk.
Comment From: mknyszek
2024-09-26 Sync
Attendees: @mknyszek @prattmic @cagedmantis @nsrip-dd @rhysh @thepudds @bboreham @chabbimilind
- Michael P: Heap profiler memory usage increase in Go 1.23 (https://go.dev/issue/69590)
- Thinking about potential mitigations and couldn't decide why this wasn't already an issue. It would already be an issue if you had 32 random stack frames and generated all permutations of those stack frames. What seems to explain why this is not a problem now is that most of the time the bottom frames are identical.
- Nick: Agreed. In theory this is a problem today even before the Go 1.23 changes.
- Michael P: This makes it seem less concerning to me then.
- Michael K: Should we do more about deduplicating? We only deduplicate whole stacks.
- Michael P: Thinking about bounding the maximum size by dropping samples. That means the profiling rate would basically change.
- Milind: Could use reservoir sampling (see the sketch after these notes).
- Michael P: Right, that's what I was thinking of.
- Rhys: One thing that makes this tricky is that it's not even the heap profile that's growing necessarily but the off-heap memory the runtime uses.
- Nick: The way we found out about this is someone reported a memory leak. We're working on getting memory stats in front of people for exactly this reason. We also report delta profiles, and we wouldn't notice in our own metrics because the delta is basically zero most of the time. If we had a built-in delta that would help us.
- Rhys: Could we have a dummy frame for off-heap memory in the heap profile?
- Michael K: We could actually take stack traces from sysAlloc, etc.
- Rhys: Should we have log-based profiles internally and then turn that into whatever we want? (RE: delta heap profiles.)
- Michael P: So it's not really a public API, but the way CPU profiles work is effectively that. They're events that get processed and grouped into buckets.
- Rhys: Block and mutex profiles also accumulate forever, but you have to ask for them. Teams haven't turned them on and off. But, execution traces let us get similar information and that is possible to enable for short bursts.
- Michael K: The memory profiling code is very old and if there's an opportunity to just eliminate the problem that started this discussion, that would be ideal.
- Michael P: What kind of deduplication do you have in mind? The random stacks case is highly incompressible.
- Michael K: Nothing too specific, possibly deduplicating smaller sections of stacks, but you have to be careful about where you set boundaries. I wonder where the rest of the world has gone on this in the meantime.
- Rhys: Many of the records are not "in use," as in not part of the in-use profile. While a record corresponds to an in-use memory allocation, it represents 512 KiB of allocations (on average, by default), so we can afford to spend a small fraction of that on metadata. But when the memory is freed, the metadata in the "allocs" profile is no longer a small fraction of the memory it describes, since the "live" part of the allocation is now 0 bytes.
- Michael P: What if we had a “reset” mode on WriteHeapProfile that reset all allocation counts to zero? Then it could throw away records from anything not in use anymore.
- Rhys: Resetting the global state makes me nervous because it's global state. Changing the heap profile rate once the program has really gotten started is not something I've seen anyone do. We've had similar discussions about the flight recorder and execution traces. It's fine if only one part of the application cares about the data, but without complete coordination, there's always a concern.
- Michael K: It's interesting that this discussion applies to heap profiles too. The current heap profile API also has some really annoying implementation restrictions in malloc. I also found a bug (not that anyone would run into it in practice).
- Rhys: One way to get programs to coordinate could be to set the baseline heap profile sample rate to zero. CPU profiles and traces are somewhat easier to coordinate.
- Michael P: It's early for Friday afternoon bad ideas, but the compiler could rewrite writes to MemProfileRate to function calls.
- Rhys: Is there a way to make sure MemProfileRate is assigned to its own page and forces a fault on access?
- Michael K: Continue meeting scope discussion.
- Seems all are in favor.
- Rhys: If we're still not sure, we can still give priority to diagnostics topics.
- Rhys: If the above is approved, I’d like to discuss strategy around landing mutex changes
- TLA+ explained the bug I fixed in PS 10 of https://go.dev/cl/601597
- Using Xchg8 requires per-arch support. Or, we could use Xchg (Xchg32) and support all 64-bit architectures right away
- Michael K: Supporting just what's needed to start is fine.
- Is using futex a good idea? The semaphore-based implementation gives a linked list of waiters, which is very convenient!
- All: shrug Maybe worth an experiment?
- Michael P: Might be worth investigating the futex API which lets you pick which thread to wake.
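For context on the reservoir-sampling suggestion: the idea is to keep a bounded, uniformly random subset of an unbounded stream (here, profile records) in O(k) memory. A generic, purely illustrative sketch of Algorithm R:

```go
package main

import (
	"fmt"
	"math/rand"
)

// reservoir keeps a uniform random sample of at most k items from a stream of
// unknown length (Algorithm R).
type reservoir[T any] struct {
	items []T
	k     int
	seen  int
}

func newReservoir[T any](k int) *reservoir[T] {
	return &reservoir[T]{items: make([]T, 0, k), k: k}
}

func (r *reservoir[T]) add(item T) {
	r.seen++
	if len(r.items) < r.k {
		r.items = append(r.items, item)
		return
	}
	// Keep the new item with probability k/seen, replacing a random slot.
	if j := rand.Intn(r.seen); j < r.k {
		r.items[j] = item
	}
}

func main() {
	r := newReservoir[int](10)
	for i := 0; i < 1_000_000; i++ {
		r.add(i)
	}
	fmt.Println(r.items) // 10 values drawn uniformly from the whole stream
}
```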
Comment From: mknyszek
2024-10-10 Sync
Attendees: Michael K, Tim, Michael P, Bryan, Rhys, thepudds, Nick, Felix, Carlos, Milind
- Michael P: Thank you to Nick for digging really deeply into that frame pointer issue!
- Nick: Some custom assembly code clobbered the frame pointer and PGO happened to inline the thing that called it and the clobbered frame pointer lived long enough to have frame pointer unwinding visit the bad pointer.
- Rhys: Followed the issue enough to see that it stemmed from machine-generated assembly, but generated from a tool outside the Go ecosystem. It seemed to have different ideas of which registers were available for general use. Avo seems to not have this issue.
- Nick: Found another unrelated crash on arm64 that doesn't know we save the frame pointer under the frame.
- Michael P: We get lucky with the external assembly generators with ABI0. I like Nick's idea of trying to improve vet to catch this.
- Felix: Nick and I were also discussing the differences in stack layout between amd64 and arm64. We were wondering what the reason for that was. Is it true that we can't change it because it's baked into ABI0?
- Michael K: I think so. I'll ask Cherry.
- Felix: Go GC Regions draft design follow-up
- Michael K:
- Felix: A summary of why we are interested in cross-goroutine regions: they would fit very well in our model of sending data through a pipeline; the region could track the data passed through the pipeline. In theory we could rearchitect our code to use regions as-is, and they're probably better than arenas, so it still might make sense.
- Rhys: Wanted to confirm my understanding that you get the benefit of regions when you close the region.
- Michael K & others: Discussion about the state of the design draft and how to move forward.
- thepudds: If there were something like 100 performance-sensitive applications, how many of them would the technique apply to?
- Michael K:
- Rhys: Re “server that processes 100 or 1000 requests per GC cycle”, do Memory Regions pay off when used with html/template or encoding/json, especially for simple/unoptimized apps? If this is an easier way of getting the low-hanging fruit (applies well for unoptimized apps, doesn’t do much for highly-optimized apps), is that a win?
- Michael K:
- Felix: Stability of -toolexec for DataDog/orchestrion and similar projects. E.g. alibaba/opentelemetry-go-auto-instrumentation and pijng/prep.
- We are very happy with toolexec for instrumentation, and now it's out there and propagating. At what point does it become a compatibility issue? We wanted to raise the issue early. (A minimal wrapper illustrating the -toolexec calling convention follows these notes.)
- Michael P: One of the problems with toolexec is that the Go module cache doesn't integrate with it, so you often have to tell it to rebuild everything. That feels like something we should solve if we want people to actually use toolexec. toolexec was originally intended for developers of Go.
- Felix: On our end, we encountered this problem for sure. I think if we could get buy-in on the problem statement of instrumenting code at compile time, we could create a proposal for something better. And also argue why other approaches to instrumentation are more painful and less useful.
- Michael P: I think describing the problem and starting discussion sounds like the right way to go to me. My gut feeling is that the discussion would veer toward arbitrary code execution at compile time as something we've always avoided, but at least in this case you're opt-ing into it.
- Felix: Where's the best place to start this discussion?
- Michael K: Filing a regular issue is a good start.
- Felix: This seems like it overlaps with linkname in that it's maybe using a tool in a way it wasn't quite intended, and we want to be good citizens and not be in the same position as linkname.
- thepudds: People have been using toolexec for a long time and I think it would make sense to loop in the tools team and maybe bring it up in the bimonthly tools call (https://go.dev/wiki/golang-tools#calls).
- thepudds: Also, here's an older proposal regarding caching issues for toolexec: https://github.com/golang/go/issues/41145.
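For readers unfamiliar with -toolexec: the go command invokes each build tool (compile, asm, link, ...) through the named wrapper, which is what makes compile-time instrumentation possible. A minimal pass-through wrapper, with no instrumentation, looks roughly like this; an instrumenting tool would rewrite the arguments before running the real tool:

```go
// Invoked by the go command as "wrapper <tool> <tool args...>" when building
// with: go build -toolexec=/path/to/wrapper ./...
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: wrapper <tool> [args...]")
	}
	tool, args := os.Args[1], os.Args[2:]

	// An instrumenting wrapper would inspect or rewrite args here, e.g. only
	// when filepath.Base(tool) == "compile".

	cmd := exec.Command(tool, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			os.Exit(ee.ExitCode())
		}
		log.Fatal(err)
	}
}
```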
Comment From: mknyszek
2024-10-24 Sync
Attendees: Michael P, Rhys, Nick, Milind, thepudds, Michael K
- Rhys: Still working on runtime.mutex, trying to address tail latency. Planning to add an additional layer of complexity to make direct handoff to another thread efficient without blocking the unlocker, or blocking other users of the lock.
- We still have some complexity budget remaining 🙂
- Michael K: Very brief regions update.
- Plan to post publicly.
- Milind: Goroutine deadlock detection via GC
- Milind: Goroutines leak in production. A goroutine is blocked on a channel, but the other end of the channel is no longer reachable. The goroutine (and its stack) can’t be collected. Enough leaks eventually lead to OOM.
- Milind: Liveness of the goroutine can be determined by the garbage collector. i.e., the concurrency primitive must be reachable from another live goroutine.
- Milind: Two options: just report leaks, or actually collect the goroutine.
- Rhys: Two goroutines both selecting on each other; I think GC could detect this case?
- Milind: Yes, cycles are not a problem.
- Rhys: Absolute goroutine counts haven't been super helpful, but goroutine profiles that break down what each goroutine is doing have been more useful.
- Milind: We do take daily goroutine profiles and look for suspicious cases, but there are lots of false positives/negatives. Time series analysis may be interesting, but not a sure solution.
- thepudds: Is the main goal reporting or collection?
- Milind: It’s an open question. Collecting is nice because it doesn’t cause outages, but if that is too complex, we’d get lots of value just from reporting.
- thepudds: Some previous discussions
- https://github.com/golang/go/issues/19702#issuecomment-290223996
- https://github.com/golang/go/issues/19702#issuecomment-290433926
- https://github.com/golang/go/issues/19702#issuecomment-289825220
- Michael K: If we were to collect the goroutines, what about finalizers that only those goroutines reference? It’s not safe to run the finalizers, and it’s not safe to not run the finalizers. There’s a bunch of other details in those issues.
- Milind: 95% recall on microbenchmark tests, many unique bugs found in Uber unit tests.
- Rhys: You could add a request ID as a pprof label, and then when you collect a goroutine profile, note goroutines that have a request ID for a request that has already completed. (A sketch of this labeling follows these notes.)
- Felix: It would be very helpful for goroutine profiles to include how long goroutines have been blocked. We had to work around it by parsing runtime/debug.Stack output (which incurred a STW; it's better these days).
- thepudds: The built-in race detector has made writing concurrent code more approachable; a built-in approach for this would be similarly nice.
- Rhys: goroutine IDs provide a very rough age, since they are sequential.
- Michael K: The patch is complicated, but resolvable in code review. Putting up a CL could help answer questions, such as cost when disabled.
- Michael K: The second question is about bang-for-your-buck. Goroutine profiles are less precise, but can help catch cases that would be false negatives with the GC approach (such as goroutines reachable via a global).
- Michael K: IIRC, it wasn’t very effective in production?
- Milind: Not true anymore. Newer experiments are working well. e.g., changing unbounded growth to flat.
- Michael K: We could reduce complexity by decoupling from the GC and building as a separate pass that is opt-in. Something that is run on-demand (e.g., a profile).
- Milind: I think people would be happy with that.
- Milind: New goroutine states may not be strictly necessary.
- thepudds: do you (Michael K) have an off-the-cuff guesstimate on latency impact of an on-demand stop-the-world based approach? (Sounds like it could be parallel but not concurrent w/ the GC, or similar?)
- Michael K: It would be on the order of milliseconds, probably, which may or may not be acceptable.
- Michael K: But the main question is whether it's acceptable to do it very occasionally. Is such a pause acceptable once per day? I dunno.
- Rhys: This analysis could be done on core dumps, if they were easy to work with (both re tools and compliance).
- Milind: Too much PII in core dumps to touch them.
- Rhys: Maybe analyze the core dump on the machine? If there were a good way in the community to analyze core dumps in situ, that could address both compliance (doesn’t leave the machine/container) and tools (they’d get regular use, bug reports, and fixes) problems.
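A sketch of the labeling approach Rhys mentioned: tag a request's work with its ID via pprof.Do, so goroutines still carrying the ID of a completed request stand out in later goroutine profiles. The header and label key here are arbitrary choices:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

func handler(w http.ResponseWriter, r *http.Request) {
	id := r.Header.Get("X-Request-Id") // however the request ID is obtained
	pprof.Do(r.Context(), pprof.Labels("request_id", id), func(ctx context.Context) {
		handle(ctx, w)
	})
	// Once this request has completed, any goroutine still showing this
	// request_id label in a goroutine profile is a leak candidate.
}

func handle(ctx context.Context, w http.ResponseWriter) {
	// Real work; goroutines started here inherit the labels via ctx.
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```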
Comment From: mknyszek
2024-11-07 Sync
Attendees: Nick, Michael P, Rhys, Felix, Milind, Bryan, thepudds, Tim
- Michael K: Will post publicly about GC regions in a GitHub discussion sometime soon.
- Michael K: Flight recording updates (gh issue)
- When there are multiple consumers, there’s tension between what they all want
- Thinking of taking the union of all requested events, the highest requested granularity, etc
- Trying to make this all composable, with multiple consumers not conflicting. Some may get more data than they requested, if a concurrent user asked for more (more events, finer granularity).
- Consumers can filter, the runtime can do the simple work of sending the same data to multiple consumers
- Michael P pointed out that reversing that decision is backwards compatible; we can decide later to make each consumer get exactly what they requested.
- Milind: Maybe there can be a stdlib package that does filtering?
- Michael K: Maybe. The trace parsing API is low level enough that users can build their own filtering [editor: did I get this right?]
- Michael K: CPU profile rate is an exception that needs some extra work in the runtime (no big deal). We’d have to set the internal sample rate to the max that’s requested, and downsample for each consumer.
- Felix: Are we sure it’s just a filtering thing? Picking different sample rates can mean different responses to bias, if we collect data at different times.
- Michael P: We’d thought most about tracing, might need to think more about profile samples.
- Michael K: The pprof protobuf format has a single field of sample rate, so we need to make sure that’s right when taking CPU profiles.
- Rhys: Argued against making SetCPUProfileRate because CPU profiling was still broken on Linux, but that's fixed now, so I retract that position. It's more subtle than just changing the sampling rate high to get a high resolution. If the kernel samples too often, it stops representing actual on-CPU time, and the profile becomes less representative.
- Milind: So you're concerned with scaling the value?
- Rhys: If one user asks for 10,000 Hz and one for 100 Hz, then nobody's going to get what they want, it exceeds the maximum rate that Linux will give you (for setitimer/timer_create-based profiling). But also again, at a rate that's too high, the CPU profile ceases to be useful as it interferes too much with execution.
- Felix G: CLs to reduce pprof label overhead. Also helps eBPF profilers that want to get trace IDs / span IDs from various runtimes (eBPF code).
- We use pprof labels for distributed tracing internally. Very useful to keep track of all the work that's being done on behalf of an endpoint.
- 20-30% of creating a span is going to setting pprof labels.
- 1,000+ CPU cores doing nothing but setting labels.
- CL switches from a map to a slice of key-value pairs.
- Made optimizations based on keeping the labels in sorted order.
- Michael P: Uncomfortably excited to learn eBPF is reading Go maps.
- Michael K: Let's get it reviewed!
- Felix G / Nick R: Seeing issue where the execution tracer seems to interfere with scheduling.
- Nick: Goroutines all just run for extended periods of time, latency spike.
- Michael K: Sounds like the process getting descheduled, have seen this with Linux cgroups CPU quotas.
- Nick: Maybe, but we have other evidence. (We'll come back to this.)
- Nick: It corresponds to the execution tracer being active. It doesn’t correspond to an update to any software as far as we can tell.
- Nick: We have CPU profiling enabled at the same time, and when these goroutines appear as running for many tens of milliseconds (500ms?), we don’t see any CPU profile samples.
- Michael K: One of the expensive parts of tracing is traceAdvance. Is that happening? There’s also a forEachP.
- Michael P: Those goroutines look like they’re from the same root function, maybe they’re all independently trying to do something that’s hard to interrupt like some long assembly function. Does the M view of the execution trace show anything different? We’ve seen problems with getting to stop-the-world because of long assembly functions, on the scale of seconds.
- Michael K: Another time I’ve seen this is from services that run in cgroups without enough memory, and the binary itself ends up thrashing (getting swapped in and out). Might be worth looking at disk I/O metrics.
- Rhys: Try changing the clock source on CPU profiles to be wall-time based. It's not a CPU profile, but might give you samples for sleeping threads, revealing some interesting structure. Also take a very close look at the very end of the slow times: how do those threads become un-stuck, what’s the exact timing, etc?
- Felix: It seemed like a lot of events were blocked on a mutex in the tracer.
- Michael P: Maybe also look at CPU metrics in cgroups.
- Michael K: If one goroutine gets stuck with a runtime.mutex, the other goroutines that need that runtime.mutex will end up looking like that.
- Rhys: +1, but I haven't seen this sort of long-lasting starvation with the current runtime mutex. This is on linux/arm64, which is interesting. I would be surprised if it's starvation. I would write a program that uses the x/exp/trace parser that dumps out the CPU profile samples for the time range during which this is going on. I'd look to see if there are more samples than usual related to locks, or anything else. (A sketch of such a program follows these notes.)
- Felix: We'll take some of these ideas and try more things. We'll also try to anonymize some traces and post it to GitHub.
- Felix G: Trace timeline viewer.
- Will hopefully make some more progress soon!
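
A rough sketch of the kind of program Rhys suggests above, assuming the golang.org/x/exp/trace parser API (event kinds and accessors as of recent versions): dump the CPU profile samples embedded in an execution trace for a suspect time window. The flag names and output format are placeholders.

```go
// Sketch: print the CPU profile samples recorded inside an execution trace
// for a given time window, to see what goroutines were on-CPU (or absent)
// during a latency spike.
package main

import (
	"bufio"
	"flag"
	"fmt"
	"io"
	"log"
	"math"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	minTS := flag.Int64("min", 0, "window start (trace timestamp, ns)")
	maxTS := flag.Int64("max", math.MaxInt64, "window end (trace timestamp, ns)")
	flag.Parse()

	f, err := os.Open(flag.Arg(0))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := trace.NewReader(bufio.NewReader(f))
	if err != nil {
		log.Fatal(err)
	}
	for {
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// CPU profile samples appear in the trace as their own event kind.
		if ev.Kind() != trace.EventStackSample {
			continue
		}
		if ev.Time() < trace.Time(*minTS) || ev.Time() > trace.Time(*maxTS) {
			continue
		}
		// ev.Stack() carries the sampled call stack if deeper inspection is needed.
		fmt.Println(ev.Time(), ev.Goroutine())
	}
}
```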
Comment From: mknyszek
2024-11-21 Sync
Attendees: Carlos, Bryan, Rhys, thepudds, Nick, Michael K, Milind, Michael P
- Day of the freeze!
- thepudds: Have a few things I would like to merge. Are tests and documentation OK during the freeze?
- Michael K: Yup!
- thepudds: Some of it is behavioral changes that allow for more thorough testing.
- thepudds: A couple CLs for making the GC more stringent about bad pointers.
- Michael P: You're always welcome to send CLs that we merge after the freeze lifts.
- Rhys: Looked into some performance problems on arm64 with scalability a few months ago. Saw that when an application allocates a large byte slice, the allocation is not preemptible. https://github.com/golang/go/issues/69327
- Michael K: This is doable, but requires recovering why it's not OK to have preemption points in the clearing routines in mbarrier.go. For pointer-free memory, I do not see why it would be a problem. For memory with pointers, it might be OK to chunk the bulk write barrier with the clears.
- Michael K: Posted publicly about GC regions finally.
- Bryan: What are the next steps?
- Michael K: Prototype and removal of arenas experiment, but the timeline is TBD. Hopefully next year, but probably not early next year.
- Bryan: I want to retract my statement that I never got anything useful out of execution traces. My position changed in the last two weeks. I tried out go tool trace and gotraceui.
- Bryan: When I added trace regions that showed up clearly in gotraceui.
- Bryan: It would be nice if trace.Region matched up with OTel spans.
- Bryan: It would be nice if everything in slog also appeared in traces.
- Michael K: Spoke about that with Jonathan but we never got to actually implementing it.
- Michael K: There are still barriers to using execution traces, but things have improved a little bit.
- Nick: For Datadog, we do correlate traces with APM spans. We use tasks instead of regions for this. You can also use trace.Log to put the span ID into a trace. We tend not to put logs into traces because logs are huge. (A minimal sketch of this pattern follows these notes.)
- Michael K: Logs could be put into traces, but only behind a flag; for example, if the injection point is a slog.Handler.
- Rhys: You can put different kinds of logs into the execution traces, more focused around operational events, and not including sensitive info.
- Bryan: Ran into some problems with gotraceui and Go 1.22 and Go 1.23. Dominik mentioned that it wasn't updated for Go 1.23.
- Michael K: There was only one notable update in Go 1.23 IIRC and that's for direct coroutine-style switches in iter.Pull, to emit fewer events. Michael P had the idea to use the same event for direct switches done by channels, if we ever do that. This required no changes to the high level API though, so it should be very easy for Dominik to upgrade, hopefully.
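
A minimal sketch of the task-plus-trace.Log correlation Nick describes above; the span ID value and the function shape are stand-ins for whatever the APM integration actually provides.

```go
// Sketch: wrap a request's work in a trace task and record the APM span ID
// with trace.Log so execution traces can be joined with APM data.
package main

import (
	"context"
	"os"
	"runtime/trace"
)

func handleRequest(ctx context.Context, spanID string) {
	// One task per request/span; tasks are first-class units in
	// go tool trace and gotraceui.
	ctx, task := trace.NewTask(ctx, "handleRequest")
	defer task.End()

	// Record the span ID so trace tooling can be correlated with spans.
	trace.Log(ctx, "span.id", spanID)

	// Optional finer-grained structure within the task.
	defer trace.StartRegion(ctx, "work").End()
	// ... request work ...
}

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	handleRequest(context.Background(), "abc123")
}
```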
Comment From: mknyszek
2024-12-05 Sync
Attendees: Michael P, Felix, Rhys, thepudds, Bryan, Michael K, Carlos
- Felix: Quick updates on execution traces showing frozen goroutines (theory was execution traces are messing with scheduling) and flight recorder stuff
- Written a tool to turn execution traces into SQL that can be queried. (!)
- Demo.
- Michael P: Perfetto makes everything available via SQL
- thepudds: Java heap analysis, etc is also queryable with SQL
- Michael K: Plan to open source?
- Felix: Yeah probably.
- Michael K: Would love to have this on an "execution trace user manual" page.
- Rhys: It's so fast! Do you know how big the original execution trace was?
- Felix: Small execution trace, ~5 MiB, but DuckDB is far from capacity. 10x or 100x should be fine. Parsing the trace takes 1-2 seconds for a trace of this size. DuckDB has an optimized way of inserting bulk data.
- Rhys: I love that there are more tools to programmatically analyze execution traces, to answer questions like “does this trace file include the rare event I’m looking for” and “if so, what goroutine at what time”. Figuring out deep linking would be super useful.
- Felix: +1.
- Michael K: Would like to have a seeking API, the current API forces you to go through the whole trace once.
- Felix: Seeking would be very useful for my other project when dealing with large traces.
- Rhys: For a deep link, if you start with a tool that tells you what the interesting thing is, that could go into a secondary tool that takes a minute-long execution trace and trims out just to the time range of interest. Then that may be faster to load.
- Michael P: The deep link could be a query parameter on whatever URL is given and go tool trace wouldn't try to send all the data to the browser. It may have to parse the trace again, but wouldn't have to keep all the parts in memory.
- Rhys: Re-parsing but at what level? If you have a 20 second execution trace and you're looking 15 seconds in then the execution trace tool could do a "coarse" pass on the first 15 seconds and a finer-grained pass on the last 5 seconds. We often end up with huge execution traces, hundreds of MBs, and it takes a minute or two to load that (plus a bunch of RAM). Maybe we can’t make that fast to do, but we could instead skip over big chunks of what would be expensive work.
- Felix: Sounds doable. IIRC it should be easy to skip through the trace quickly by generation and each generation has the absolute timestamp it starts at (?).
- Michael K: +1. Wall-clock timestamp?
- Rhys: Pprof protos have a wall-clock timestamp, which is useful. It would be nice to have the timestamp available for traces too.
- Michael K: Could make the binary an optional thing to pass in; could be useful for disabling on-line symbolization.
- Rhys: Execution traces benefited a lot from symbolization by default.
- Michael K: Agreed that symbolization is the right default.
- Felix: Strongly agree. eBPF profiler has a problem where it gets profiling data but no symbols, and then it's a huge lift for people to figure out how to get symbols and upload them.
- Nick and I blocked a good amount of time next week to sit down and discuss it again.
- Michael K: Flight recorder proposal looks OK, some discussion of minor details
- Most recent discussion is about e.g. SetMinAge failing if flight recording has already started. The new CPU profiling API returns an error in this case. But there was some discussion over making FlightRecorder immutable, and having to pass that information up-front.
- Rhys: I’ve heard that there’s runtime scalability work coming up. Maybe from the anniversary blog post? Can you share the name of who’s working on that, so I can encourage people to tag them in issues I ask them to file when they have interesting-looking problems? Or maybe this is an issue label (like “Performance”, but specifically for runtime scalability).
- Rhys: Klaus Post’s short-lived goroutines with GMP=64 (or 128?)
- Michael P: A GitHub label would be helpful. Most of the people likely to work on it are in the room. I created a scalability label just now.
- Michael K: viewcore progress (no promises, but looking better)
- Lots of issues with refactorings of e.g. runtime._type
- Does anyone in this room care about viewcore for Go <1.22?
- Felix: I sometimes look at older Go cores, but if it helps viewcore work it's not a big deal; I could always use an older version of viewcore.
- Rhys: I wanted to clarify whether viewcore should no longer support versions that the Go team does not support.
- thepudds: Starting with just supporting tip is also great.
- Michael K: We've discussed supporting older versions as long as they didn't get in the way. "Good effort."
Comment From: mknyszek
2024-12-19 Sync
Attendees: Michael K, Rhys, Mike D, Nick, Carlos, Felix, Bryan, thepudds
- Mike D: Working at Odigos on eBPF. Interested in Swiss maps, profiling work, etc.
- Felix: I published an early version of sqlprof. Feedback welcome!
- Felix walked us through the use-cases in the README. Looks great!
- Rhys: What are the output formats? The SQL text or is it possible to do analysis in SQL and a gzipped pprof proto back out?
- Felix: Not implemented but open to it. For now you can do CSV output.
- thepudds: Very cool, and thanks for publishing it! The name made me think it was for profiling SQL execution. I had this idea to work a lot of contextual knowledge like that showcased in the examples (like categorizing the GC stacks) into the tools that ship with the Go distribution. For example, there could be a JSON file that is maintained and shipped by Go that pprof knows how to read to report categories, so pprof does not need prior knowledge of the categorization, but it is easy for Go to teach it. (A toy sketch of the categorization idea follows these notes.)
- Felix: Not married to the name, ideas welcome. RE: categorizing stack traces, it's something I'm super interested in. The general problem is hard, because I'd ideally like to break down the trace into even more categories automatically. Like serialization/deserialization, in addition to low-level hard-coded things. Subtle things like mallocgc/gcAssist make things complicated. I don't think the idea of proposing that the pprof tool accept a JSON file of categories is a bad one. I'll even propose it myself if you don't get to it.
- thepudds: A category like "other runtime" or "this is in the runtime but we don't know what exactly it is" is useful both to an end-user troubleshooting and to help triage issues on the Go issue tracker
- Michael K: I like shipping this JSON file with Go. It's testable! Make sure the symbols still exist.
- thepudds: You could also write a benchmark that makes a given category pop, as a more end-to-end way to double-check results and automate some of the correctness testing.
- Felix: Correcting myself from last time: the execution trace parser is definitely not the bottleneck.
- Felix: Should the GC guide mention the write barrier? Especially in the identify cost section? Datadog spends > 1k cores on it.
- Michael K: +1, please feel free to send a patch. Small caveat that we should document: the fast path likely does not get attributed to the write barrier.
- Michael K: Maybe we could have the compiler claim that the instructions that write pointers into the buffer slots are part of a special function that was inlined?
- Felix: If there's a lot of subtleness to documenting this properly, I'd prefer if you did it.
- Michael K: OK no problem. :)
- Rhys: RE: write barriers and making profiles easy to understand, I'm curious if the contributors and maintainers think there's a benefit to having a way for Go users with some level of experience to find out what about their programs is unusual. Like, 1M goroutines is within limits, but unusually high. Or a high GOMAXPROCS.
- Michael K: I like the idea and approach, but I'm not sure where to go with it.
- Rhys: Another approach is to put feelers in the community for programs that are unusual in some ways. For example, more than 1M goroutines, >1TiB heap, etc.
- Michael K: Putting out feelers has traditionally been hard.
- Rhys: Wouldn't want to give the impression that "Go is not suitable for X." If you can convince a user to apply some instrumentation and give back some data, that's certainly a good start.
- Felix: +1 that helping people understand what is unusual is an important challenge to overcome. Especially for people in stressful situations who are trying to fix an active problem.
- Felix: Has the Go runtime/compiler ever considered providing support for USDTs (implementation details)?
- Linux uprobes: rewrite an instruction to an int3 to trap when entering a function to track execution. Used with eBPF. USDTs are like a safer version of uprobes. Just put a nop instruction somewhere and put data in the binary that describes data associated with that instruction. You still patch the nop instruction.
- thepudds: it’s more stable over time because it’s in the source code, and can be placed in the middle of a sensitive function, which is harder to do and less stable over time w/ a uprobe.
- Rhys: Years ago we were interested, but didn't know exactly what we'd use them for.
- Mike D: +1 to what Felix was saying. We try to instrument all sorts of things and we patch specific functions, but we also read and write values.
- Michael K: How would this look?
- Felix: A magic function like "Probe(fn)" which normally compiles to a nop but can be patched with a trap that then executes the function.
- Mike D: This is definitely just gaining some traction now.
- Rhys: This is an opportunity to run eBPF code while a thread is paused in the runtime. This could be a way to add programmability to the runtime, if part of the contract for the particular tracepoint is that you write out some value to some stack offset.
- Felix: I do not know the answer for sure, but I think you basically have to chase down the pointer whether the probe is active or not.
- thepudds: Maybe I didn't understand the question, but if there's a pointer that's available, it's not going to dereference until the probe fires.
- Rhys: It sounded to me like having a stable API for a named probe means guaranteeing that all the information available to the probe remains available across releases.
- thepudds: In dtrace, which this is based on to some degree, you describe what you want to happen, and it only happens if the probe is executed. You can have a little C-ish expression that walks through pointers and that only happens if probe is enabled. (Caveat -- I’m more familiar with dtrace mechanics than the low-level eBPF mechanics).
- Rhys: Depends on the language you have to describe the probe, and I don't know that at all.
- Felix: I brought it up for the same use-case that Mike D has. It would be great if Go binaries could be as easily observed as other languages.
- thepudds: In the world of eBPF there was a more fundamental problem with Go: return probes due to stack moves. There are a few issues there. Commercial eBPF vendors I think work around it, but I think it’s a problem for other developers just trying to do basic things on their own with eBPF and Go. Has that changed?
- Mike D: I don't think we've hit too much of a problem with our workarounds. If we could solve the return probe problems that would help.
- Rhys: Are we meeting Jan 2?
- Nope!
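
A toy sketch of the stack-categorization idea from the discussion above. The category table is invented for illustration; nothing like it ships with Go today, and the mallocgc/gcAssist subtlety mentioned above is exactly the kind of case a real table would have to handle carefully.

```go
// Toy sketch of categorizing stacks by function-name prefix. The table is
// invented; a real version might be a JSON file shipped and tested with Go.
package main

import (
	"fmt"
	"strings"
)

// categories maps a category name to function-name prefixes belonging to it.
var categories = map[string][]string{
	"gc":        {"runtime.gcBgMarkWorker", "runtime.gcAssistAlloc", "runtime.bgsweep"},
	"scheduler": {"runtime.schedule", "runtime.findRunnable"},
}

// categorize walks a stack from the leaf frame outward and returns the first
// matching category. Stacks that are in the runtime but match nothing get
// "other runtime", which is also useful for triage.
func categorize(stack []string) string {
	for _, fn := range stack {
		for cat, prefixes := range categories {
			for _, p := range prefixes {
				if strings.HasPrefix(fn, p) {
					return cat
				}
			}
		}
	}
	if len(stack) > 0 && strings.HasPrefix(stack[0], "runtime.") {
		return "other runtime"
	}
	return "app"
}

func main() {
	fmt.Println(categorize([]string{"runtime.gcAssistAlloc", "runtime.mallocgc", "main.work"})) // gc
	fmt.Println(categorize([]string{"runtime.memmove", "main.copyBuf", "main.main"}))           // other runtime
}
```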
Comment From: mknyszek
2025-01-16 Sync
Attendees: Michael K, Michael P, Felix, Rhys, Dominik, Mike D, Carlos, Nick R, Nicolas H, thepudds, Milind
- Felix: Critical Path Analysis
- Try to use execution traces to figure out which goroutine at which point in time is relevant to the latency of a request. You have instrumentation for request starts and stops (tasks), so you can find bottlenecks in that time from the goroutines associated with those tasks. Our first idea was to "walk backward" from the end of the task to the start, but you run into issues with e.g. mutex contention, since you might follow backwards into a different request. It's not wrong to say that if that other request ran faster this one would too, but it's not exactly right. My solution is to only walk backwards while the original goroutine is contended instead of fully walking backwards into another request. This cuts off many of these "fake" paths. (A sketch of the basic wake-edge extraction this builds on follows these notes.)
- There is some research on this in the context of distributed tracing (e.g., from Google).
- And this PhD thesis specifically about applying critical path analysis to multithreaded programs.
- Nicolas: I've thought about it but haven't done it. Very interesting. So you stop walking back in a period of contention?
- Felix: Not quite, we do walk back into a different request, and that just provides more context.
- Nicolas: What are your plans for this?
- Felix: We have built a tool to show p95, p99 examples, where you find lots of interesting cases, but it is hard to tell if individual cases are common problems, so we’d like to aggregate them. You could diff aggregations of p50 vs p95.
- Rhys: I have explored things related to this but haven't ported this to the new trace format. The old code is in the Go tracing toolbox I've mentioned before. I've found it helpful to find a reason that any goroutine is running at any given time. This has helped with not accidentally going into other requests. I have other techniques in there that I'm happy to discuss.
- Nicolas: The more I think about it the more excited about it I get. The thing that I would like to do is have centralized tracing infrastructure for ad-hoc investigation. But if you can offer people aggregated analysis, that would be very useful and very cool.
- Milind: Really cool idea. Something of interest to me as well. At Uber, so far we haven't seen many cases where latency is a concern from execution. Often it's the network of RPCs that cause issues, and distributed tracing helps a lot with that. Maybe there are situations where there's an issue within a particular service. Do you have use-cases where this is important and frequent?
- Felix: I agree that in most cases in distributed systems most latency issues come from downstream services. I think we do see issues frequently enough that this is useful. Like where people spin up too many goroutines in a loop, or have poor load balancing. That often comes with mutex contention too. This all also lines up with the service being overloaded, often. Just trying to guide the user toward the overload vs. the mutex contention as the root cause. Part of this is to aggregate the data at scale and push people in the right direction.
- Milind: Could a higher level metric work instead?
- Felix: Yeah, we have scheduling latency, but it's hard to contextualize to people what that means.
- Milind: I think there's a bigger problem where if this isn't connected to the bigger picture, across RPC services, users may get confused. So they definitely need to be aggregated at the very least. We've tried something like this and the next question is frequently: "Is there anything lying behind that critical path? How are we certain this is worth fixing?"
- Rhys: I saw research at some point on injecting latency into individual goroutines in http2 as a way of figuring out what parts of it would respond most to being sped up. I see that as an example of a critical path being a property of a program that could matter. Also, I've seen services run into problems when they experience errors and need to log them, and run into contention on the logger's mutex. Mutex contention can show up in a surprise way and ruin a day, and surfacing that before it becomes a huge problem would be helpful. Getting CPU profile samples on the critical path is helpful.
- Milind: Paper on injecting latency: https://www.usenix.org/conference/atc16/technical-sessions/presentation/curtsinger
- Felix: Injecting latency is very interesting; I'm wondering if there's another path where you simulate injecting it instead.
- Nicolas: CPU samples in runtime traces can be very handy, +1. The asymmetry between what the Go runtime believes and the Linux kernel gives it has been giving us trouble, and this is evident in where the CPU samples land. We've seen in production that if you have large scheduling latency, it quickly starts affecting everything. Being able to aggregate all this and make inferences would be very helpful.
- Rhys: I've found larger machines with larger GOMAXPROCS tend to have larger scheduling latency because more local runqueues need to be considered for scheduling.
- Nicolas: It's not just that, if the machine is large enough to go over multiple NUMA nodes, that causes sizeable slowdowns in Go programs.
- Michael P: Rhys, I found your error + contention + logging example interesting, though I'm not sure this technique would help as much with this case. It's hard to tell whether contention has the potential to become a big problem and make the service fall over. Unclear how to prioritize which lock is the most dangerous.
- Rhys: Teams often only realize a lock exists when something goes wrong, but being aware the lock exists at all can be helpful. With information about how things change in different circumstances, it's usually not too hard to land at a useful conclusion.
- Nicolas: The locks discussion is also interesting in the context of "larger machines". An essentially-global lock becomes a bigger problem (proportionally) on bigger machines. This is related to the "consult more Ps" problem, but not entirely the same.
- Felix: Users tend not to be good at identifying what tasks are useful to cover with spans. My ideal instrumentation for Go would be to just instrument the incoming and outgoing stuff, and tags would just be logs or something.
- Felix:
- Michael P: The inability to see the flow of items in buffered channels makes me think it would be useful if execution traces could track data flow of all items in a channel (not just scheduling events). Though a trace event on each chan send/recv may be too much overhead.
- Felix: We're trying to understand data flow issues in pipeline architectures. Even outside of the critical path, it would be useful to understand how data is moved between channels, etc.
- Nicolas: Thanks for the presentation! We don't see the pattern of popping something from a channel and passing it through many goroutines a lot. We avoid worker pools. You could maybe use trace annotations and a framework to make sure the data ends up in an execution trace.
- Felix: I don't think I'll manage to change our architecture, but we might be able to do something with pprof labels and annotations. One issue with execution traces is that they're a firehose. I wonder if this critical path aggregation is a way toward a potential sampling strategy.
- Michael P: You're suggesting that the tracer only traces specific goroutines and specific related goroutines.
- Felix: Yeah, basically a more sophisticated version of how pprof labels propagate.
- Nicolas: Weeks ago I looked at an execution trace with Michael K and it looked like there were missing edges. I can't remember if that was a tracer issue or a bug in the UI.
- Felix: In my experience it's usually my code or a bug in the UI.
- Rhys: Yeah, the execution traces are a lot of data since they have the whole process view. And that whole process view is necessary for things like how the GC impacts latency. If the data seems like too much, does that mean that we just need better tools for processing it?
- Felix: I'm not gonna file a proposal to do this kind of sampling upstream because I totally agree. The main use-case I have for this is that some services just have long tail latencies but produce so much trace data that, to keep memory overheads low, we can't actually see a lot of stuff. Making the firehose less powerful would be helpful.
- Milind: If there is communication between goroutines through a buffered channel, will that avoid creating edges? If so, that's incomplete.
- Felix: Yes, it might be incomplete. But in practice, there's almost always a way to still get a lot closer to the root cause.
- Milind: Just for correctness, shouldn't we log those events too?
- Rhys: With application changes you can get to 100%.
- Michael K: Goroutines can synchronize with shared memory in so many ways that you basically need something as powerful and expensive as the race detector to track it, and it's not feasible for what we're trying to achieve for execution traces.
- Milind: But what about buffered channels specifically?
- Michael K: Maybe, we could do a trace experiment. It's still weird though, because from the scheduler's perspective it still looks like shared memory synchronization.
- Nicolas: Felix, for the time being we're also just trying to target like 5-10 MiB and the firehose will be a problem for us too.
- Felix: I can share one realistic example for an HTTP request.
- Notably, it stacks the stacks of goroutines waiting on one another in the pprof profile flame graph view.
- Rhys: Neat!
- Nicolas: Stack stacking (wakee/waker) is what Brendan Gregg also does in offcputime analysis (he's got a bunch of blog posts about this). I believe you're onto something here.
- Felix: I think Go has a real opportunity here by offering execution traces out of the box. Many other language platforms don't offer the same thing, and it makes it hard for these dev tools to take off.
- Nicolas: Michael K and I have been looking into viewcore. Has anyone tried it out?
- Nick: I tried it out and filed a bug and Michael fixed it, but I'm not sure if that was root caused.
- Michael K: I do think it's the correct fix in the end.
- Nicolas: Still working on clearing things up and fixing up simple cases.
- Michael K: Yeah, expect the heap not to be fully typed.
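
A rough sketch, assuming the golang.org/x/exp/trace API (field and method names may differ slightly between versions), of extracting the "wake edges" (which goroutine made which other goroutine runnable, and when) that a critical-path walk like the one described above is built from. A real tool would also need task begin/end events, per-goroutine running intervals, and the rule about only walking backwards while the original goroutine is contended.

```go
// Rough sketch of extracting "wake edges" (goroutine X made goroutine Y
// runnable at time T) from an execution trace, the basic building block of
// a backwards critical-path walk.
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/exp/trace"
)

type wakeEdge struct {
	at    trace.Time
	waker trace.GoID // goroutine that performed the unblock (may be trace.NoGoroutine)
	wakee trace.GoID // goroutine that became runnable
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := trace.NewReader(bufio.NewReader(f))
	if err != nil {
		log.Fatal(err)
	}
	var edges []wakeEdge
	for {
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if ev.Kind() != trace.EventStateTransition {
			continue
		}
		st := ev.StateTransition()
		if st.Resource.Kind != trace.ResourceGoroutine {
			continue
		}
		from, to := st.Goroutine()
		// A Waiting->Runnable transition is an unblock; the event's own
		// goroutine context is the waker (e.g. the channel sender or the
		// mutex unlocker).
		if from == trace.GoWaiting && to == trace.GoRunnable {
			edges = append(edges, wakeEdge{at: ev.Time(), waker: ev.Goroutine(), wakee: st.Resource.Goroutine()})
		}
	}
	fmt.Println("wake edges:", len(edges))
}
```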
Comment From: mknyszek
2025-01-30 Sync
Attendees: Michael K, Michael P, Bryan, Felix, Nick, Milind
- Felix: Do the weak pointers in go1.24 enable the implementation of string interning without allocation? (similar to this implementation, but with strings getting GC’ed when they are no longer referenced?).
- Strings in protobuf constructed from []byte of larger payload. We construct the same string over and over.
- Use a map[string]string (same key and value); look up m[string(b)] (which does not allocate) to see if the string has already been allocated. (A sketch of this, alongside a unique-based variant, follows these notes.)
- But what eviction policy to use?
- Michael K: It is safe to pass an unsafe string to unique.Make, which can avoid the allocation at unique.Make.
- Felix: unique.Make is ~3x more expensive than a map lookup.
- Michael K: You could potentially build something specialized with weak pointers and a standard map.
- thepudds: related, maybe 2-3 examples for weak and unique packages could go a long way. For interning, one could use sync.Map and one could use regular map. Could include a weak map. Two recent CLs, one of which was wrong or wrong-ish.
- https://go-review.googlesource.com/c/go/+/644176
- https://go-review.googlesource.com/c/go/+/643955
- The second one from Filippo has a reasonably clean weak pointer based cache example that could be lifted: https://go-review.googlesource.com/c/go/+/643955/1/src/crypto/internal/fips140cache/cache.go
- Also, maybe link to new sections in GC guide:
- https://tip.golang.org/doc/gc-guide#Finalizers_cleanups_and_weak_pointers
- Felix: I will give the unique package another shot. We also don't have good benchmarks for the concurrency situation.
- Michael P: Why not compare string data pointer for trivial comparison?
- Michael K: We want all compares (equal and unequal) to be trivial
- Michael K: I don't think the weak.Pointer route is a dead end. You can definitely beat the unique package because it's operating in a more general case, and weak pointers are intended to help you build canonicalization maps for things unique doesn't support. For strings, you will likely need to do some unsafe hackery.
- Michael K: There is also room to improve unique.Make. We have some optimizations in mind, but they'll take some work.
- thepudds: Strings don't actually do a pointer check first, at least as of two months ago. Maps did do the extra pointer check, but only the old maps, not the new maps. (Or at least, that was my impression from looking ~2 months ago; maybe wrong).
- Michael K: In terms of concurrency, unique.Make tends to perform much better under contention than a simple locked map.
- thepudds: It depends on if your goal was cheap comparisons or to deduplicate memory and/or eliminate allocations. For example, if you go from millions of allocated strings to 100s of strings or 1000s of strings, that can be a big win, even if those 100s of strings or 1000s of strings have some duplicates or are all duplicated a handful of times. (It would be nice if there was an example that covers this style of interning, which would not need a sync.Map and could use a regular map.)
- Felix: Allocations and memory overhead are the biggest concern.
- Bryan: Will there be an RC3?
- Carlos: Yes. Our goal is for it to be released Tuesday or Wednesday. There are still a couple blocking issues. No guarantees. Tree was supposed to be opened yesterday.
- Go 1.25 plans
- Michael P: Hoping to do a little more with builtin maps. Specifically maps that are cold and miss in cache. Interested in a variety of runtime and application CPU scalability stuff. Hoping we can get around to container-aware GOMAXPROCS. Also want to look at a variety of CPU scalability issues. Curious if anyone here has particular production scalability issues that haven't come up in this meeting before.
- Bryan: We have a talk accepted for "the biggest Prometheus you ever saw." 4 TiB, 128 cores. Some of our talk is about organizational matters, and we're not solely focusing on technical aspects. A lot of the time we didn't have a good idea of what was going on. Figuring out if it was NUMA, hyperthreading, caching issues etc. was quite hard.
- Milind: I can give you a data point on NUMA. In data centers we don't pin containers to CPU cores and about 5 years ago we started collecting hardware performance metrics. About 50% of memory accesses were cross-socket. Based on that observation we started pinning to sockets and we observed IPC went up by 20-25% and the CPU utilization of the containers went down 15-30%. Done at a much smaller scale than a whole data center though. We never took the next step of assigning containers to NUMA nodes. There was a secondary concern of how global scheduling of containers would work. May not be solvable in the Go runtime since the scheduler could migrate the container.
- Bryan: One way to think about it is if your NUMA domain size is 16 cores and you schedule 17 threads, then one of them will always be on the wrong domain.
- Michael K: Might be nice to make sure the GC dedicated workers are on the same NUMA node, so at least all the metadata is local. The heap itself may still be remote.
- thepudds: I have a couple of map tweaks I'm hoping to send out. One thing is about storage growing due to tombstones. The idea is that if iteration never happens then we can do a rehash in place, and otherwise fall back to a same-size-grow if needed. Also a few escape analysis things for reflect, interface receivers, etc. I also have a rough cut runtime free that I want to hook into escape analysis but that's a bit more exploratory. For example, make on dynamically-sized slices that don't otherwise escape, certain patterns of append, maybe backing storage for maps in some cases.
- thepudds: Is the map optimization fundamentally flawed in your mind Michael P?
- Michael P: I think you only need to know whether it's currently in progress, but there's no way to know if iteration stopped. I worry a little about the cost of doing the write when iteration starts.
- thepudds: The write only needs to happen once, and it purposefully doesn’t try to track when iteration stops
- Michael P: I don't know whether the write would actually be too expensive or not, it's worth measuring.
- Carlos: I'll be working on the flight recorder and a couple of other runtime diagnostics things. The majority of the work I'll be doing will also be CPU scalability related.
- Felix: I'll continue with the critical path analysis work. I hope that I can contribute that back upstream. Maybe some other smaller tracing stuff, like timestamps when generations start (system clock timestamps).
- It's a bit difficult to write tests against the trace API because you can't construct the types yourself.
- Nick: For Go 1.25, I have two frame pointer vet check improvements ready to land: basic arm64 support, and checking past conditional branches. Cherry also suggests trying to make the check work on package runtime (currently the check hard-codes skipping runtime). I'll give that a shot as well.
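
Two minimal sketches of the interning approaches discussed above: the map[string]string trick (the m[string(b)] lookup does not allocate) and a unique-based variant. The eviction question raised above is deliberately left open; the map version simply grows without bound.

```go
// Minimal interning sketches. The map-based version leans on the compiler
// optimization that m[string(b)] does not allocate for the lookup; it has
// no eviction policy. The unique-based version gets GC-driven reclamation
// for free, at the cost of a pricier Make.
package main

import (
	"fmt"
	"unique"
)

type interner struct {
	m map[string]string // key == value; grows without bound (no eviction)
}

func (in *interner) intern(b []byte) string {
	if s, ok := in.m[string(b)]; ok { // lookup does not allocate
		return s
	}
	s := string(b) // allocate once per distinct string
	in.m[s] = s
	return s
}

// internUnique uses the unique package instead: canonical entries can be
// dropped automatically once no handles remain.
func internUnique(b []byte) string {
	return unique.Make(string(b)).Value()
}

func main() {
	in := &interner{m: make(map[string]string)}
	payload := []byte("hello")
	fmt.Println(in.intern(payload) == in.intern(payload)) // true
	fmt.Println(internUnique(payload))
}
```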
Comment From: mknyszek
2025-02-13 Sync
Attendees: Michael K, Carlos, Michael P, Felix, Bryan, Nick, Kemal, Rhys
- Felix: CL 646095: decorate anonymous memory mappings with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...). (A standalone sketch of the call follows these notes.)
- Nick: Mutex profiling - consider sampling based on magnitude of contention, rather than a 1/N coin toss?
- Michael K: Reminds me of heap profiling.
- Rhys: One of the challenges with mutex profiling is you don't know at the sample decision time how long you're going to have to wait. Seems like a good thing to have but I don't know how to do it efficiently.
- Nick: Background is our previous issue with goroutines just sometimes not getting scheduled. The runtime is still doing things (sysmon events, CPU profile events from background mark workers, etc.) but there are 100ms+ stretches where goroutines just aren't getting scheduled. Suspect lock contention, maybe the trace lock.
- Michael P: Integrating with perf tracing would help here.
- Felix: Collecting scheduler traces from the kernel would be interesting, but the challenge there is the size of the traces and we'd need to line them up with our runtime trace sampling strategy.
- Rhys: We could ask the CPU profile timers to report wall time instead of on-CPU time. Ideally this would be exposed via CPU profile API knobs.
- Felix: I think that's a very interesting idea. I'm nervous that it's only useful for debugging specific weird things.
- Rhys: We have wall clock samples for goroutines (goroutine profiles) and those are useful. It doesn't seem more niche to me than perf events (cache misses, etc.).
- Felix: Makes sense, just a minor concern.
- Michael K: I think if there was motion on the CPU profile API proposal things here could move pretty quickly. One concern is how the wall time sampling (and perf event sampling) composes with multiple consumers. It doesn't necessarily have to, but it's probably the biggest open question.
- Proposed API: https://github.com/golang/go/issues/42502#issuecomment-1325561860
- Nick: Just to come back to mutex profiling, it sounds like we can't necessarily do this efficiently.
- Michael K: Might be able to do it in just the starvation case? Could be too hard to interpret the results.
- Rhys: RE: attributing runtime-internal lock contention to the unlocker, I think Go 1.24 does that now.
- Milind: NUMA affinity (pinning processes to sockets) (TODO: link)
- ~50% remote accesses
- Frequent CPU migrations
- ~30% wasted CPU stall cycles
- Pinning to individual sockets saves ~12% based on individual experiments.
- Michael P: Linux seems to be doing a bad job here.
- Bryan: Prometheus started collecting NUMA zone data for memory allocations. 50% of hosts on more than one socket.
- Milind: Can see 100% local allocation but 99% remote access.
- Rhys: At what level of detail do you know why this happens? (For example, stack samples for cross-socket accesses. Who's making the cross-socket accesses, and when? GC-related call stacks, or behavior post-STW when goroutines may be shuffled across Ps.)
- Milind: We don't have that data, since we did perf stat instead of perf record. It was difficult to have perf record running on all our hosts.
- Rhys: Could do it in the Go runtime instead.
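
A standalone sketch of the mapping-naming call behind the CL Felix mentions at the top of these notes. The mapping, the name string, and the error handling are illustrative; the prctl constants are copied from linux/prctl.h, and the call needs a kernel with anonymous-VMA naming enabled.

```go
// Sketch: name an anonymous mapping on Linux so it shows up labeled in
// /proc/<pid>/maps, the same mechanism the runtime CL uses for its mappings.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

const (
	prSetVMA         = 0x53564d41 // PR_SET_VMA (from linux/prctl.h)
	prSetVMAAnonName = 0          // PR_SET_VMA_ANON_NAME
)

func main() {
	// Create an anonymous mapping, roughly as the runtime does for heap
	// and metadata regions.
	mem, err := unix.Mmap(-1, 0, 1<<20,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	name := []byte("example region\x00") // NUL-terminated name
	// prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, name)
	err = unix.Prctl(prSetVMA, prSetVMAAnonName,
		uintptr(unsafe.Pointer(&mem[0])), uintptr(len(mem)),
		uintptr(unsafe.Pointer(&name[0])))
	fmt.Println("prctl:", err) // EINVAL on kernels without anon VMA names
}
```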
Comment From: mknyszek
2025-02-27 Sync
Attendees: Bryan, Rhys, thepudds, Carlos, Michael P, Michael K, Milind
- Michael K: Sharing some context on epoll/netpoll scalability
- BlueSky and Storj ran into netpoller scalability issues.
- Storj has a reproducer.
- Tried Go 1.24 + Ian's 1-line patch to only have one thread check the netpoller at a time.
- This fixed the problem. Rhys' runtime.mutex work seemed to be an important bit here too.
- Carlos: Trying to put together a benchmark to confirm Ian's fix and prevent us from regressing here. Any suggestions?
- thepudds: Maybe ask on the issue? Someone might just do it.
- Carlos: That's a good idea. We were hoping for some open source benchmark.
- thepudds: Makes sense.
- Michael P: The Storj people did provide a microbenchmark
- thepudds: TLDR on running sweet. Updating readme, including which benchmarks are more reliable vs. which are noisier. Better default config.
- thepudds: ~18 months ago it seemed some benchmarks were noisy ones that the core team knew to mostly ignore. I ran it for a few different CLs, but did not have confidence in the results.
- thepudds: What default config should I use? Which results are more trustworthy?
- Michael P: We did remove the biogo benchmarks in the last year. The benchmarks are also different sizes.
- thepudds: also maybe some sizing for a test VM. Maybe even something like “these 6 benchmarks can run on a VM with X cores with Y GB in about 90 minutes, which at typical cloud rates translates to < $1. These 8 require a larger VM”
- Michael K: there are builders that run this
- https://go.dev/wiki/PerformanceMonitoring#presubmit
- thepudds: excellent! 🎉
- thepudds: does the wiki have advice about not running it too often, or maybe not many people know currently to run them?
- Michael K.: lazy eval on that (don’t want to discourage people now).
- thepudds: I’ll probably add a link to cmd/compile/README and maybe runtime/HACKING, along with link to published benchmarks, sweet, bent
- Bent is already linked to from cmd/compile/README, but I don’t think I had a good destination for that link where an external contributor could get up & running quickly (would need to double-check)
- Ideally bent would have a TLDR too
- E.g., assume someone has toolstash setup, what are 4 lines to paste?
- thepudds: FYI for folks here, if you are curious about instrumenting or improving the compiler, I had added a ‘Tips’ section to compiler readme to help orient contributors on how to set things up to work on the compiler more conveniently, how to get different types of debug output (e.g., IR dumps at different points), using toolstash, etc.: https://github.com/golang/go/tree/master/src/cmd/compile/#8-tips
- thepudds: does PGO work on the sweet builders? (I have some CLs that rely on PGO)
- Michael K.: no. but we want to do that.
- Michael K: we also want to do compare vs. merge-base rather than with tip.
- Go 1.25 plans
- Rhys: Some work I think should be done, but I'm not ready to commit to doing it. I can send you an email.
- Nicolas: Interested in any movement on mutex profiling.
- Michael K: GC performance stuff. And the unique package (and a bit on runtime.AddCleanup).
- GC stuff is focused on memory bandwidth and cache misses, and also scalability to some degree (like with buffer management).
- Nicolas: Interested in better mutex profile sampling discussion from last time.
- Rhys: If you see a stack with a long duration, is that enough?
- Nicolas: We take delta profiles, and it occurs to me that it's easier to miss long delays that way.
- Rhys: We would take profiles and programmatically produce deltas after the fact. (Data collected via autoprof, summarized via apshuffle and its “-vs-prev” flag.)
- Nicolas: Thanks, this is pretty great and your approach is insightful. I don't think it's feasible to rework our big system, but I could use this approach in the single processes that I know are problematic, by extracting profiles myself. I honestly had not made the cumulative-vs-gauge connection for profiles before (while I've done so many times for metrics).
- Nicolas: This is basically a better design (cumulative rather than gauge). Unfortunately, we don't do it internally (I'm referring to this system). We take gauge points, using a delta profile. This allows normalizing over different processes/jobs/… Unfortunately in this case it also hides these spikes, especially with the current mutex profile sampling algorithm (1/N coin toss instead of exponential-delay). I guess I just had to adjust my expectations when seeing these profiles.
- Nicolas (added later): I've attempted to convince myself more that the uniform strategy is bad for our (Google's) approach of storing delta profiles (see gist). It's true, at low sample counts (which one would see with a delta profile), the error is higher for uniform sampling than exponential sampling. At high sample counts, as rhys.hiltner@gmail.com also mentioned during the meeting (just wait long enough), both approaches become similar in terms of error.
- Nicolas: …thinking about it, why can't we do biased sampling cheaply for the mutex profiler? Given that we only take the stack at the unlock site and we already know the contention length there, I assume this is not the important cost.
- thepudds: Overall design constraints when contemplating extending the export format?
- I looked at export format ~12-18 months ago, might be returning to it soon for some inter-related escape analysis & runtime changes.
- Is the export format smart about deduplicating information? (I recall seeing that, but maybe misunderstood at the time, and also want to understand more about how good it is, including the cost of N copies, such as is the cost maybe a varint per copy, vs. some X-byte hash per copy, vs ___, and how far does it scale down if you have small-sized things you want to de-dup, and at what point does it become a wash or no-op or pessimization).
- Michael P: Honestly, I don't know.
- thepudds: I think it might slurp in and re-export dependent packages?
- Michael P: Yes. I believe the export data for a package includes data needed from transitive packages. But it doesn't include everything. For example, some things don't get inlined due to trimmed export data. We've brainstormed and discussed flipping the model so export data is shallow (each package's export data only has its own data).
- thepudds: What's the sensitivity to export data size? Is 10% bigger a problem?
- Michael P: The size is a concern because it does affect large builds. This comes up in Google. I don’t have an exact threshold on when it starts to matter.
Comment From: mknyszek
2025-03-13 Sync
Attendees: Michael K, Michael P, Rhys, Felix, Bryan, Kemal, thepudds, Milind, Chris, Carlos, Nicolas
- Felix: CLs for emitting walltime events in execution traces: https://go-review.googlesource.com/c/go/+/653575/5
- MK: Does this work OK for old parsers parsing new traces?
- Felix: The parser happens to ignore extra data in the frequency batch, so it works OK, but raises a question of when we should bump the version.
- MK: Normal batches validate that events are known (since the parser doesn’t know how long unknown events are), but it happens to work for the frequency batch. Maybe that batch should have been validated as well?
- Felix: The API emits a final Sync event, but that is awkward now because that event has no wall time information.
- MK: The trailing Sync event is historical, I think we should remove it now.
- thepudds: It could be useful to have arbitrary key-value pairs especially if this is only emitted infrequently.
- MK: For now I don't think we need it, but that's a good idea.
- Felix: Unfortunately using walltime in the Catapult viewer looks very bad as you might expect, just a really long Unix timestamp.
- MP: Possible difficulties with skew between monotonic time and wall time.
- Felix: Is the skew just basically that you can't grab them at the same time, or is it something that gets worse over time?
- MP: I'm thinking the former. But also on Linux time is weird. If the kernel gets an update from NTP that your wall time is a second off, the wall time is going to be moving at a different rate (smeared) over some time period.
- thepudds: That's a well-configured system. It can also jump on unhappy systems.
- MK: You could maybe just use the first wall time timestamp only in parsing? Then you don't need to try to make sense of two notions of time.
- Rhys: Sounds like this is just a time.Now call that anyone could make. Could this just be a background goroutine that does a trace.Log? There's stuff in net/http that would be nice to switch on too. That is, should this be one layer up?
- Felix: I think we want it to be foundational to the degree that there's one timestamp per generation. The trace can be split at each generation.
- MK: I think the runtime should control time in traces. There are too many notions of time.
- Rhys: I do think wall clock time is a little less special. Also, per-generation data (insofar as wanting the data to survive splitting the trace), there's lots of other things I'd like to know. The process, the CPU, etc.
- MP: Generalizing things is a good idea. Maybe we need something slightly fancier than trace.Log.
- thepudds: It does make sense for time to be a first-class concept. Even if it's slightly wrong though, it's still super useful. Just need to do it carefully.
- MP: +1 to have the times even if it's slightly skewed.
- MK: Passing "constants" (constant on every generation) is relatively easy. Doing any computation (like getting a timestamp) is a bit harder.
- Rhys: You could imagine something like writing to a channel once per generation. I would also like constants in profiles.
- MK: Maybe just one runtime call? "Set process attributes"?
- Nicolas: Maybe it would be enough to set these as pprof labels on the root goroutine?
- Felix: RE: the pprof labels, there isn't really a root goroutine to add labels on that would apply to the whole profile.
- Nicolas: I was thinking that since pprof labels propagate to child goroutines, it would work.
- Felix: I think pprof labels are just operating at a different semantic level.
- Nicolas: If there were a way to interpose into the stream generation, you could insert whatever you want. Currently you can register an io.Writer, and that could parse and rewrite the trace.
- MP: Probable contributor summit at Gophercon EU
- Several Go team members are giving talks.
- thepudds: some type of pattern for runtime debug flags like 'GODEBUG=trial=X' 'GODEBUG=trial=Y', like 'GODEBUG=trial=profileruntimeallocs', maybe?
- It would be nice to have runtime flags that are (1) useful for debugging, and (2) might go away and (3) maybe are not documented or not formally documented.
- The cmd/compile folks don't hesitate at all to add debug flags, which helps them day-to-day, helps give triage steps on the issue tracker, helps other gophers.
- GODEBUG has become more formal.
- GOEXPERIMENT is heavier-weight, and has a specific meaning today.
- Example cmd/compile flags I was just running with multiple levels of name=value pairs: go test -trimpath -gcflags='-l -d=escapeifacemethparms=2 -m=2'
- It’s a bit of flag Inception, but it works reasonably well.
- Rhys: Nicolas H, did you want to discuss the comment thread we’ve had since the Feb 27 meeting, re mutex contention profile sampling / bucketing?
- Nicolas: I haven't done much more work, except to say I'm now more confident that biased sampling would be better for us (we essentially take 10-second mutex profiles at random intervals from random tasks, we'd prefer if long contentions have a higher chance of making it into a delta profile).
- Nicolas: My last comment (see down thread) asks why biased sampling would be much more expensive (which seems to be the assumption)? AFAICT, mutex profiling samples the unlock stack, so it's not like we need to sample all stacks if we go for biased sampling.
- Michael K: you need a timestamp every time
- Rhys: Determining that there was contention takes time. If there is zero contention we don't want to pay for that; something like 10% of time would go to that check. The “10%” isn’t in terms of total CPU overhead, since most programs don’t use runtime.mutex that much. What it means is that any scaling cliff for the throughput on a single runtime.mutex (a hot chan value in particular, or sched.lock, etc.) is 10% closer. It hurts the vertically-scaled parts of the app.
- Rhys: Maybe this is becoming a question of “mutex profile”, controlled by “fraction”, versus “block profile”, controlled by “rate”. You’re looking for the stack that made the unlock call, but with “rate” semantics. That seems like a different API promise than we have now. (A short sketch contrasting the two existing knobs follows these notes.)
- Nicolas: It's true that changing to biased sampling in this way would change the cost of sampling by sampling more/less than previously if people do not alter their MutexProfileRate. Even if biased sampling itself doesn't add any extra cost. Example: SetMutexProfileRate(100) now means sample 1 of every 100 contended unlocks. Afterwards, it could mean (depending on the unit we choose): sample on average every 100ms of contended time. If someone has many >100ms contentions, a much higher fraction will be sampled. Not sure how to handle this.
- Nicolas: Also, as you state, block profiling is currently already biased. I'm not sure why it's biased differently from heap sampling (which uses an exponential distribution), while block profiling uses a sort of linear/proportional sampling below the BlockProfileRate and always samples above the BlockProfileRate.
- Michael K: Maybe this should just be a different profile type, where users know that it's more expensive.
- Nicolas H: Or maybe SetMutexProfileRate() and SetBiasedMutexProfileRate(), whichever is called last determines the type of sampling.
- Nicolas H: But we have mutex profiling enabled in production (rate 100) for all jobs, so the cost does matter. If it's "too expensive" (I know, vague), we won't use it.
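
A short sketch contrasting the two existing knobs Rhys distinguishes above: mutex profiling is controlled by a sampling fraction (roughly 1 of every N contended events), while block profiling is controlled by a duration-based rate, so longer events are proportionally more likely to be recorded. The proposed "biased" mutex sampling would behave more like the latter.

```go
// Sketch of the two existing knobs. Mutex profiling samples roughly 1 of
// every N contended unlock events ("fraction" semantics); block profiling
// samples with probability proportional to event duration, and always
// records events at or above the configured rate ("rate" semantics).
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record on average 1 out of every 100 contended mutex unlock events,
	// regardless of how long each contention lasted.
	runtime.SetMutexProfileFraction(100)

	// Record blocking events with probability proportional to duration;
	// events of 1ms (1e6 ns) or longer are always recorded.
	runtime.SetBlockProfileRate(1000000)

	// ... run the workload ...

	pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
	pprof.Lookup("block").WriteTo(os.Stdout, 1)
}
```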
Comment From: mknyszek
2025-03-27 Sync
Attendees: Nick, Chris, Kemal, Michael K, Rhys, thepudds, Michael P, Milind, Felix
- Michael P: “Parallelism” / “latency” / “walltime” profiling by dvyukov
- (Context on dvyukov: wrote a lot of the Go scheduler, TSAN (so, the race detector), syzkaller, famously good at lock-free data structures, dvyukov/go-fuzz, etc.)
- Documentation commit: https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/?id=32ecca8d7a3e2400805f8b5564a1b07e05cfcd54
- Problem with CPU profiles is you don't see idle time.
- Look at what a CPU is doing at a regular interval, including whether it's idle.
- Already found some significant performance bugs, especially when system resources aren't being appropriately utilized.
- e.g. a bottleneck of single-threaded Python code.
- Felix: How is that different from what we can do with the execution tracer?
- Michael P: The execution tracer contains almost all the data you need for this. The big difference is the visualization of it.
- thepudds: It's a summarization. The use of the term 'wall clock time' in the commit feels atypical.
- Michael P: Agreed, it's a little strange. I connected it back to Felix's fgprof initially. No one has a great name for this.
- Milind: HPCToolkit sampling calls it REALTIME: https://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf (see section 5.3.3, REALTIME and CPUTIME).
- Michael K: The overlap with execution traces came up, but it's more of a convenience/usability thing. It's a useful short-cut.
- Rhys: What does it mean when you're looking at more complex concurrency situations? It makes sense when you're looking at a program doing a particular thing (for a single unit of work) with a particular amount of parallelism (e.g. the linker), but you want to know the amount of parallelism at any given time. For an RPC server that’s handling multiple work units, I expect you’d want to track the parallelism of each independently.
- Michael P: Agreed, this works the best for batch jobs. But if you can tease apart different parts of an individual program, you can maybe use it in a server.
- Michael P: One other visualization idea that came up, which Russ suggested: as a simple compromise, you can include all Ps, including idle time, in your CPU profile. You also lose information with an aggregate profile, whereas with an execution trace you can see which components are lacking parallelism.
- Milind: There's also the GC which makes it complicated on whose behalf the GC is running.
- Michael K: The GC is sort of inherently global and it's useful to think of it in a first-class way as a component of the system. But you can attribute assists.
- Milind: If there's a lot of time in GC, the problem isn't necessarily just the GC. Going purely by CPU profiles might lead you to believe the GC has regressed, but it's something else going on. HPCToolkit has a compact visualization that displays what each thread is doing. It's sampling-based. There's a third dimension to it, the stack depth.
- Michael K: This seems similar to fgprof.
- Michael P: I agree this could be built as a sampling profiler. I brought up tracing initially because it already contains a lot of the data you want in such a profile. In theory you don't even need to change the runtime to get this information. (A rough userland sketch of that follows these notes.)
- Milind: The main question to me is whether you want it to be lightweight (profiles) or heavyweight (traces).
- Michael K: You will need to be okay with skew if you want it to be lightweight.
- Milind: Yeah, there will be skew across threads as to when samples are taken.
- thepudds: perf is doing the "wall time" profile as a sampling profile, right?
- Michael P: Yes.
- Milind: Perf has no notion of tracing.
- Nicolas H: Is this true? I thought that when you register for scheduler events in perf, you get every single scheduler event by default. I agree this is true for events like the (cpu-)cycles events, where the -F flag determines how many of the events need to pass before a sample is taken.
- Michael P: I think this was part of a confusing tangent I brought up about having perf avoid doing aggregation like pprof does. But then I remembered that perf doesn't do aggregation at all in perf.data; each sample is its own event, and the aggregation is done post facto. So I shouldn't have said anything...
- thepudds: If you look at the data in the profiles right now, the absence of samples might imply idle. Is there enough information already? Like, if you're missing samples according to the sampling rate. (At least in simple cases, like ignoring if someone changes GOMAXPROCS)
- Michael K: Hard to tease apart the source of missing samples, but I do think in the simple case it implies idle time. If a VM is overloaded is another reason samples might go missing.
- Rhys: I think timestamps in CPU profiles are key. I may not understand the perf change because it's making an internal change, not just a visualization change. And perf can already emit timestamps for samples. If we have timestamps, we can do all sorts of analysis and reweighting. If you can apply labels, that's even better.
- Milind: sampled profiles + labels, +1
- Michael K: I think (but am not 100% sure) that an execution trace that just contains profile samples is still totally valid.
- Michael P: I may have confused things earlier regarding perf aggregating samples. [missing some notes]
- Rhys: Yeah, I'm suggesting just adding information to the data sources to be able to do the analyses users want in post-processing.
- Felix: It sounds like, in the use-cases, it seems to be trying to do something like critical path analysis, which I presented on a few weeks ago. I'm planning to pick that work up again soon.
- Michael K: At the very least, the technique presented here is very general. C doesn't have a runtime, but this gives some useful insights that aren't super accessible to C developers. (Some overlap with scheduler traces?)
- thepudds: It's good that this concept is being introduced to the world. But the naming still seems terrible.
- Michael P: Felix you made a good observation that this overlaps with critical path analysis. Just looking at an execution trace is in some sense more powerful.
- Felix: I want to +1 what you said about off-CPU time. I think that's one reason people like fgprof. One smart thing we can do for the tracer is to convert traces to pprof profiles. We have some of that, but the profiles are kind of weird. If we could do that, that would already be a very useful change, since the tracer doesn't have the same scalability issues fgprof has.
- Michael K: I think gotraceui might already provide this out of the box?
- thepudds: In terms of making fgprof built-in, if you get a stack dump you can get the reason a goroutine is sleeping, but you need to parse the stack dump. A general problem with off-CPU profiles is that something could just be waiting for more work. So knowing the reason something is waiting helps you programmatically differentiate why.
- Rhys: Isn't that implied by the call stack?
- thepudds: It is, but the runtime does already know this.
- Nick: For flame graphs and visualization of aggregated information, that might take you further away from the truth. You still want to be able to slice and dice the data, and put it in context. For parallelism especially, the Firefox profiler has the ability to look at an execution time window and the flame graph changes as you slice the data.
- Michael P: I believe that context was the idea behind perf providing a histogram and letting you zoom to portions of the histogram.
- Felix: There is at least one really important category that is not implied by the call stack if a goroutine is not running, and that's when a goroutine is runnable (scheduler latency). You can have a profile where the leaf frame is the reason why a goroutine stopped executing (weighted by how long it was there for), and you can also include the on-CPU time profile with the profile samples. This would be a single profile we could produce that would satisfy many use-cases. If I can find some time to implement this I might do it, because it's pretty straightforward and we can see how well it works.
- Nicolas: I'm curious about the results of this.
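The rough userland sketch referenced above (roughly the fgprof approach): sample every goroutine's stack at a fixed interval, whether it is on- or off-CPU, and count leaf functions. This is illustrative only; note that each runtime.GoroutineProfile call briefly stops the world, which is the scalability concern raised above.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	counts := make(map[string]int)
	for i := 0; i < 100; i++ {
		n, _ := runtime.GoroutineProfile(nil) // query the required size
		records := make([]runtime.StackRecord, n+10)
		if n, ok := runtime.GoroutineProfile(records); ok {
			for _, r := range records[:n] {
				frames := runtime.CallersFrames(r.Stack())
				// The leaf frame is where the goroutine is currently
				// running or waiting.
				if f, _ := frames.Next(); f.Function != "" {
					counts[f.Function]++
				}
			}
		}
		time.Sleep(10 * time.Millisecond)
	}
	for fn, c := range counts {
		fmt.Println(c, fn)
	}
}
```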
Comment From: mknyszek
2025-04-10 Sync
Attendees: Michael K, Rhys, Junyang, Nick, Dominik, Chris, thepudds, Felix, Kemal, Milind, Angelina, Michael
- Michael P: https://go.dev/issue/73193 proposal: runtime: CPU limit-aware GOMAXPROCS default
- Any feedback?
- thepudds: It might be harder to notice issues from rounding up, like tail latency issues, whereas rounding down might be more obvious in e.g. CPU utilization graphs. Rounding down seems like it might be a better default.
- Rhys: My position at the moment is to round up because it's a smaller change (this proposal already lowers GOMAXPROCS; rounding down lowers it even further) and won't mess with auto-scaling as much. I'm curious about removing GOMAXPROCS or making it more automatic. I'm also wondering how much of the problem is just idle mark workers.
- Michael P: The ideal situation is that Go just keeps scheduling threads as long as they keep running. It would work great on an empty machine, but has lots of problems in corner cases. Also, how do you behave fairly with tenants? How do you coordinate?
- Michael K: One idea we had was to schedule on an idle-priority thread. RE: idle GC workers, hard to get rid of them because they do help latency. There are two contrasting views of GOMAXPROCS. The scheduler sees it as parallelism. The GC sees it as available CPU time.
- Rhys: Seems similar to TCP congestion control. Here we're relying on the operator to tell us. Problems with big GOMAXPROCS: idle mark workers, and having to look through all Ps to find a stashed goroutine. The scheduling latency increases with GOMAXPROCS.
- thepudds: TCP congestion control is an interesting analogy... TCP congestion control relies on the peer TCPs in essence collaborating to be fair to each other... but for example Go in co-tenancy has peers that won't be collaborating.
- Michael P: Limiting idle workers according to the CPU request is in an appendix of the proposal, and that's blocked on k8s not telling you what your CPU request is. Some k8s folks were open to finding ways to exposing that number. RE: scaling down P runqueues, there's a bitmask of idle Ps and the scheduler will try not to steal from idle Ps. Might not work well if going from idle to running frequently.
- Michael K: Maybe we iterate over the bitmask? That would violate the random steal order though. Back to rounding. Sounds like we have two opposing views?
- Rhys: I think it comes down to dealing with tiny sidecar apps. My experience is with an environment that has a round number of CPUs, say 4, and you cut out like 0.25 CPUs for the sidecar.
- Michael P: That's an interesting perspective. I think that's an argument in favor of Milind's suggestion of just using math.Round. I worry that people will not think about it until they have a problem, and with math.Round they could have one of two problems. If we pick either floor or ceil, then you can only have one type of problem.
- Milind: For the example of 2.6, what is the harm in rounding down? In rounding up?
- Michael P: Rounding down can interfere with auto-scaling systems and idle CPU. Rounding up can cause throttling.
- thepudds: Who is this mostly for? Is it for sophisticated operators? Or is it for people who might be on a smaller team? (Sophisticated operators know more about the trade offs, can set GOMAXPROCS or have code dynamically adjust settings, etc.).
- thepudds: Separately, maybe part of the solution is 2-3 different community packages in the spirit of uber-go/automaxprocs that take different round-up / round-down / other behavior, or the Go team could publish a blog post that outlines 2-3 strategies that could be implemented in a small amount of code. (A sketch of the basic quota-plus-rounding logic follows these notes.)
- Michael K: Complicated solution that makes everyone happy: fractional Ps?
- thepudds: what time quantum? (Though interesting idea!)
- Michael P: It's interesting but I don't think it'll make everyone happy. The extra fractional core, in the argument for rounding down, is sometimes reserved for some sidecar process. Though perhaps in that case, they should be setting GOMAXPROCS explicitly anyway.
- Actually, they should already be setting GOMAXPROCS explicitly because otherwise GOMAXPROCS includes all cores, and then Go will still use the fractional quota.
- Rhys: Fractional Ps sound like latency. Right now Go is aware of the CPU mask, that's what it looks at, and that's what it could burst to. Even if it has a tiny averaged-over-time reservation in a cgroup. Interactive apps are currently able to burst, and now it seems like they're able to burst too high, but bursting is still useful. The more we constrain GOMAXPROCS the better it is for throughput, but throughput is not the only thing.
- Milind: We might be overfitting for Uber's use-case with rounding down. Uber's automaxprocs always rounds down because there are always sidecars at Uber.
- Nicolas: I agree with what's been said. Internally at Google we round up. I feel like it's a pretty good default vs. leaving some CPU on the table. Of course in my ideal world there would be no GOMAXPROCS. Whatever is done shouldn't preclude the removal of GOMAXPROCS.
- thepudds: Nicolas, what is the mechanism you use today to round up?
- thepudds: one ¼ baked idea is to use the two contrasting views of GOMAXPROCS (“The scheduler sees it as parallelism. The GC sees it as available CPU time.”) and take the problem of removing or reducing the importance of GOMAXPROCS and try to split that problem by those two different views. For example, maybe one of the ideas like “schedule an idle-priority thread” to effectively probe how much CPU is really available could be just for influencing how much the GC sees as available CPU (but not influence how the scheduler sees available parallelism). Or maybe that’s only ⅛ baked.
- thepudds: Maybe the more general comment is maybe “removing GOMAXPROCS” could be broken down into 2-3 smaller problems that could get tackled separately, given completely removing GOMAXPROCS is likely hard.
- Rhys: continued discussion of Nicolas H’s mutex profile feedback. And, for mutex and block profiles .. adding numeric tags for log-scale buckets on the duration of the delay (similar to what we have already for heap profile allocation sizes).
- Reminder: looking for a profile that is sampled on the magnitude of the contention, not the number of contentions. The idea is to catch long contention events. Your concern is that they're 'under-sampled' in some sense. Knowing the contention requires looking at the time. Unfortunately, nanotime takes about as long as an uncontended lock/unlock pair. I see that as an expensive thing. At a previous meeting we talked about how changing from fractional sample to weighted sample would be a change in semantics and likely a new profile.
- Idea: what if we attached the duration of the contention event to the profile sample?
- My position in general: can you profile for longer? If the sample is 1 in 100, maybe wait 100 times longer?
- Nicolas: If you just wait it's true that you'll see the long tail, but it's unfortunately not how our infrastructure works. Hearing you talk about the cost is surprising to me, but it also means it's not really workable. I might have to learn to live with the poor sampling. If we take a profile at program start and diff at view time instead of collection time that would work much better, but we don't do that.
- Nick: Going by duration rather than fraction is viable for sync.Mutex, since we're taking timestamps already anyway. But it's less viable for runtime-internal locks where doubling the overhead is really undesirable
- Rhys: With changed sampling you still end up with the samples mixed together. If there's one call stack that's sometimes fast and slow, it'll be jumbled together. You only see an average.
- Nicolas: True, but I think it would be an improvement.
- Rhys: How do you know that there are zero of them?
- Nicolas: We've measured the long contentions separately.
- Rhys: If the slow contentions have different call stacks then you can see them. But if there's a change to the weighting, everything is still in the same pile. The needle you're looking for will be bigger, but it's still in a haystack.
- Nicolas: Our problem is very multi-modal, everything gets stuck in these bad periods.
- Rhys: Maybe look at a goroutine profile instead?
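The sketch referenced above: minimal quota-reading-plus-rounding logic of the kind uber-go/automaxprocs-style packages implement (assuming cgroup v2 with cpu.max at the usual path; cgroup v1 and most error handling omitted). The choice between math.Ceil, math.Floor, and math.Round is exactly the open question in the discussion.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// cpuQuota returns the container's CPU limit in CPUs (e.g. 2.5), or ok=false
// if there is no limit. Assumes cgroup v2 mounted at /sys/fs/cgroup.
func cpuQuota() (cpus float64, ok bool) {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return 0, false
	}
	fields := strings.Fields(string(data)) // e.g. "250000 100000" or "max 100000"
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period == 0 {
		return 0, false
	}
	return quota / period, true
}

func main() {
	if cpus, ok := cpuQuota(); ok {
		// The whole debate: Floor avoids throttling, Ceil avoids leaving
		// CPU on the table, Round splits the difference.
		runtime.GOMAXPROCS(int(math.Max(1, math.Ceil(cpus))))
	}
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```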
Comment From: mknyszek
2025-04-24 Sync
Attendees: Michael K, Carlos, Junyang, Felix, Kemal, Milind, Chris, Rhys, thepudds, Angelina
- Milind Chabbi: Necro-reaper: Pruning away Dead Memory Traffic in Warehouse-Scale Computers mentions dead reads/writes as the cause of 26% of data center memory traffic. Can the Go compiler avoid zero initialization where it isn't needed? runtime.memclr* consumes 1% of CPU across Uber's fleet.
- Two categories: unused zero initializations and pulling in dirty memory from cache on reallocation when that memory will be overwritten.
- Very common to overwrite zero-initialized memory in Go.
- Milind: Preliminary numbers are 43% of time spent zeroing. (!!)
- Michael K: We’re seeing 1.5% in our fleet.
- Felix: 1.7% over here in runtime.memclr*.
- Junyang: It looks like hardware/software co-design that they're doing in the paper.
- Milind: The implementation uses instructions that store without pulling in the cache line.
- Junyang: This came up in our chat, we considered some compiler algorithms to avoid zeroing. Relying solely on PGO might not be sound. Curious what technique you have in mind. Using a data-flow analysis in the compiler might help.
- Milind: Haven't thought through how it would work in the compiler yet.
- Michael K: There's some trickiness here related to the GC. Pointer-free memory is easier.
- thepudds: There are some existing optimizations that try to avoid zeroing memory when the compiler can prove it is about to be overwritten. For example, if the compiler sees append(b, make([]byte, n)...), I think it doesn't zero out the first len(b) bytes, because they are about to be overwritten. Implemented via growslice. (A small illustration follows these notes.)
- Junyang: What does the JVM do?
- Michael K: I wouldn't be surprised if it's doing fancy things, at least in the JIT path.
- thepudds: There definitely are some optimizations already to avoid zeroing memory when possible. For example in append.
- Milind: Is that specific to slices?
- Michael K: I'm only aware of growslice. It also only works for pointer-free memory.
- thepudds: subtle conversation about zeroing between Michael K. and Keith:
- CL 659956: "runtime: prevent unnecessary zeroing of large objects with pointers"
- https://go-review.googlesource.com/c/go/+/659956
- Michael K: Cleanup/finalizer debug mode
- Michael K: Green Tea GC 🎉
- thepudds: Separately, gctrace=2 having OS-measured core metrics like CPU, RSS, Sys - HeapReleased, and GOMEMLIMIT would help individual Go developers, but would also help with supportability / triage on the issue tracker.
- For people reporting GitHub issues against the runtime, it's often easier for the user to understand the overall behavior if they can see their OS metrics easily, and having them in simple ASCII form makes it easier for others to see on the issue, alongside the corresponding gctrace=1 data.
- Could be best effort, just 3 primary platforms to start.
- Doesn’t need to be same on all platforms (best way to express Windows memory is different than Linux/macOS).
- “OS measured” is important (the runtime view of CPU usage needs to be taken with grain of salt, including when program or system is under pressure)
- Could be documented to change over time, or even undocumented
- GOMEMLIMIT behavior harder to interpret with just gctrace=1 today
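The small illustration referenced above: the append/make form the compiler recognizes. This is my sketch only; exactly which patterns avoid zeroing is an implementation detail and may change between releases.

```go
package main

import "fmt"

// grow extends b by n zero bytes. Written as append(b, make([]byte, n)...),
// the compiler lowers this through growslice without materializing (and
// zeroing) a temporary n-byte slice, and the first len(b) bytes of any new
// backing array don't need to be cleared because the copy of b overwrites
// them anyway.
func grow(b []byte, n int) []byte {
	return append(b, make([]byte, n)...)
}

func main() {
	b := []byte("header")
	b = grow(b, 1024) // reserve 1KiB of zeroed payload space
	fmt.Println(len(b), cap(b) >= len(b))
}
```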
Comment From: mknyszek
2025-05-08 Sync
Attendees: Michael P, Nick R, Nicolas H, thepudds, Chris, Rhys, Bryan, Michael K, Milind, Angelina, Carlos, Dominik
- FYI: GopherCon EU contributor summit interest form: https://groups.google.com/g/golang-dev/c/EfmNZZBqmQI/m/A4hHoOP0EAAJ
- Milind Chabbi: (may be a contentious topic) Concurrent map access (and the fatal exit behavior) caused a serious outage at Uber recently. How does making the default map (and maybe slice) a lock-free data structure sound? The compiler may help a bit by using the serial version when it can prove safety. It is not inconceivable to create a map that dynamically switches from a serial to a lock-free version. Attach an owner/creator; if a subsequent access is from a different goroutine, transition to lock-free.
- Michael K: We’re committed to the performance of the single-threaded version. The community has expectations of the performance. (There are always tradeoffs, and a concurrent version would trade away some performance.)
- Michael P: RE: the transition to a thread-safe map being transparent, I think we would at least need a barrier.
- Milind: Could we piggyback on the concurrent map check?
- Nicolas: We've seen crashes from this too, and on the whole, it's uncovered a whole bunch of other more serious races. The map was just the canary in the coal mine in most cases. I worry that it will just hide real issues to change the map behind the scenes.
- Milind: That's not quite true for us. The problem usually is limited to just the map.
- Michael K: The barrier would be expensive, especially on weak-memory architectures such as arm64. The map fast paths are very fast, and if you do them in a loop then the processor should be able to avoid frequent non-local memory operations. ARM would have a hard time if every fast path access needs to do an atomic operation to have a reliable “map in use” flag.
- Milind: Could this be an experiment?
- Michael K: Someone would have to do the work. It wouldn't be super easy.
- Michael P: Iteration semantics are also complicated enough for single-threaded maps. Getting that working with sync maps is also complicated, and we need to make choices.
- Bryan: IIRC, Java used to have concurrent-safe maps, and they decided to change that. Might have been 20 years ago. It was because you often want to do multiple operations under a larger lock (remove an item from the map and add it to a queue, etc).
- Milind: The bug was lingering for 2 years and a config change caused the difference. Tests couldn't catch it.
- Nicolas: It was the same for us.
- thepudds: One other approach for experimentation could be to change the sync map to use the builtin Swiss map, and then detect when to transition to the normal HashTrieMap used today in sync.Map. The sync.Map has simpler defined semantics (esp. for iteration) so might be easier to implement those semantics, and the code might be easier to work with (code changes might be at a higher level than trying to change the internal map, where the compiler for example has some knowledge of the layout of the internal map). It could give a sense of performance.
- Michael P: It looks like the Java history is true. The collections library switched to a thread-unsafe map when it was released.
- Milind: If we don't crash after a race, what are the consequences? "No such thing as a benign data race" but is that true?
- Rhys: We also voluntarily crash when you unlock a sync.Mutex that's already unlocked. That's another place we throw where we don't necessarily have to (but do, because it’s clearly a programming error). Data races or logical races could also lead there.
- Michael P: RE: what would happen if we didn't crash on a race, I mostly agree that basically anything can happen. Our memory model says values won't be produced out of thin air, but not much more. In the current implementation, in the typical case, it might be fine. But map growth is probably a problem. Your most likely failure case is a crash on a garbage pointer.
- thepudds: RE: using the sync map instead of the builtin map, the compiler is aware of the layout of the builtin map. The sync map is a bit simpler. Michael P's work does make the builtin map nicer to use for testing and swapping out the implementation.
- Milind: What would be the objectives of a sync map experiment?
- thepudds: To graduate to builtin maps, the highest bar is probably zero performance regression while also retaining iteration semantics. I think we should just see how bad it is. A positive (but not perfect) outcome might be a faster sync.Map when used in a single goroutine, and hence it could be used more often by more people (though syntactically not as nice, nor as convenient, as the builtin map).
- Michael K: When you have data races, if you have multi-word structs, you will see tearing / partial updates. You can end up with strange memory corruption there, and once you have a little bit of memory corruption it quickly spreads everywhere, unless you design your application very carefully to handle that. The runtime relies on the compiler to emit code that will only ever do full-pointer writes, so the worst case you have when reading a pointer is that you'll see an old value. (But please don't rely on this.) But interface values are two pointers. Updates to each of those pointers are atomic, but the value as a whole can be torn, so you end up with a mismatch between the data pointer and the type pointer. That's the entry-point to all sorts of bad results.
- thepudds: classic post by Russ on data races and multi-word values like interfaces: https://research.swtch.com/gorace -- “The race is fundamentally caused by being able to observe partial updates to Go's multiword values (slices, interfaces, and strings): the updates are not atomic.”
- Milind: What about non-pointers? Sometimes teams push back on racy access to integers used for metrics.
- Michael K: You’re never going to identify the exact boundary between “ok race” and “not ok race”. It’s fractal. So we just say “don’t do it”.
- Michael P: Our memory model prohibits out-of-thin-air values.
- Dominik: I go with "we care about all races so the race detector stays useful"
- Chris: I saw that sync.Map uses sync.Mutex rather than sync.RWMutex
- Michael K: Reads don't acquire the mutex. That's a core part of the design. So the only reason you're acquiring the mutex is because you're doing a write. (A minimal sketch of sync.Map vs. a mutex-guarded map follows these notes.)
- Chris: I found another (Go) concurrent map that’s much faster.
- Michael K: There’s a new implementation of sync.Map (shipped in Go 1.24) that’s based on a tree, it’s faster than the previous one. The old one was especially slow.
- Chris: I might have been benchmarking against the old implementation.
- Michael K: Oh, right, in the old one there was use of the mutex for some reads.
- thepudds: benchmarks on new HashTrieMap sync.map implementation: https://github.com/golang/go/issues/70683. (Short history with additional links: https://github.com/golang/go/issues/70683#issuecomment-2637038172)
- Chris: I was looking at https://github.com/orcaman/concurrent-map, which was faster than sync.Map of 3 or 4 years ago.
- Michael K: The old one was based on the built-in map under the hood. The new one is a bespoke datastructure.
- Michael P: https://go.dev/ref/mem: "If you must read the rest of this document to understand the behavior of your program, you are being too clever. Don't be clever."
- Michael K: “There’s no such thing as a benign race.”
- thepudds: You want to be able to get a clean run from the race detector. It’s a very useful tool.
- Michael K: You can mark stuff that you really know is benign as “norace”.
- Milind: I’m aware. People often misuse it.
- thepudds: from https://pkg.go.dev/cmd/compile#hdr-Function_Directives:
- //go:norace The //go:norace directive must be followed by a function declaration. It specifies that the function's memory accesses must be ignored by the race detector. This is most commonly used in low-level code invoked at times when it is unsafe to call into the race detector runtime.
- Michael K: This comes up every few years within the core Go team at Google … “wouldn’t it be nice if the built-in map implementation were safe to use concurrently”.
- thepudds and Michael K: [Discussion of David Chase’s proposal to make slice headers atomic, and the necessary field reordering.]
- Dominik: very old article about races on interface values https://research.swtch.com/gorace
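The minimal sketch referenced above, showing the supported alternatives today: a mutex-guarded map (which also allows multi-step operations under one critical section, Bryan's point about the Java history) and sync.Map (whose reads avoid the lock in the common case). Names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is the classic mutex-guarded map. It also lets you do multi-step
// operations (check-then-update, move between containers) under a single
// critical section.
type counter struct {
	mu sync.Mutex
	m  map[string]int
}

func (c *counter) inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

func main() {
	c := &counter{m: make(map[string]int)}
	var sm sync.Map // safe for concurrent use; reads skip the lock in the common case

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.inc("hits")
			sm.Store(i, "done")
		}(i)
	}
	wg.Wait()

	v, _ := sm.Load(3)
	fmt.Println(c.m["hits"], v)
}
```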
Comment From: mknyszek
2025-05-22 Sync
Attendees: Bryan, Nick, Felix, Rhys, Junyang, Chris, Michael P, Carlos, Dominik
- Bryan hit the "Go stores 2GB in /memory/classes/profiling/buckets:bytes" issue.
- Can we just stop detailed profiling at some point?
- Felix: We have internal monitoring for this, but it rarely triggers. Usually it's some third-party library that's doing something weird. It's still annoying. If you don't have people who understand this problem then it's hard to figure out what's going on. Maybe having a short-term stopgap is interesting, but I also think we should find a way to make delta profiling work internally.
- Michael P: I definitely think we should do something about this. I'm concerned about just stopping profiling in general because if the allocation pattern of the program changes over time then you can get confused that the profile only shows the beginning of the program. We may have discussed this before. One previous idea I liked the best was figuring out how to drop samples.
- Bryan: There are two services exhibiting this problem from the same project. That supports the idea that it's rare.
- Rhys: It sounds like what you want is a memory profile to identify the issue, but we only provide a heap profile. Have we considered making the heap profile a memory profile?
- Michael K: We’ve discussed that before, we could definitely do it. We can put all of the stuff from the runtime/metrics memory classes in there. And for stacks, we can even show the user-level functions that allocated those stacks.
- Michael K: On dropping samples: make the heap profile generational, in the same way that the execution tracer is generational, where there are epochs that allow you to drop old samples. Or possibly something like the ring buffer we use to read heap metrics in a consistent way. At every epoch, we can update just a single atomic to move writers over to updating a new set of buckets, and then we can make a decision on what to merge forward or drop. This is underbaked.
- Felix: Big +1 to adding more to the heap profile. Ideally the total is the same as the total virtual memory. One issue people also tend to have is that the heap profile shows the live heap at the end of the last GC cycle, which is not reflected in RSS.
- Michael K: Maybe we have both a heap profile and a memory profile?
- Felix: I think we would still be unclear whether the heap means live heap or the whole heap.
- Michael P: One other approach to showing the RSS is we could add a new sample type like RSS bytes. But maybe that's too hidden.
- Rhys: There's also the weird purgatory of programs that only had one or two GCs ever, and the heap profile only reports on a few GCs ago.
- Felix: That's a valid concern. I recently debugged an OOM kill and then I realized that there were two Go processes. One had GOGC=off set, which propagated to the other one. That was also a case where a GC never ran. So I took a core dump and used goref.
- thepudds: I would suggest caution around adding even more flavors. People already get confused by all the current flavors. Including everything and making it a UI problem might be the better choice. We could also make an executive decision not to include the heap released metric, just anything that estimates RSS.
- Felix: +1 to that. I find myself linking to this blog post I wrote frequently. We found that the /memory/classes/total:bytes − /memory/classes/heap/released:bytes estimate for Go memory usage works really well in practice, and it'd be great if memory profiling becomes the most detailed breakdown we can give of that number. (A small runtime/metrics sketch of that estimate follows these notes.)
- thepudds: If you do just randomly drop, is that statistically accurate?
- Michael K: It might work, but reasoning about time windows is a little easier.
- Felix: Go 1.25 runtime traces will have clock snapshots (trace, wall, mono) - see CLs (API: ev.Sync().ClockSnapshot)
- Thank you!
- Nick: 400ms stop the world seen in an execution trace.
- thepudds: Can we just time-bound how long we spend in this path?
- Michael P: We can retry the allocation if we've spent too long.
- Michael K: Yeah, we can just let the allocation proceed.
- Rhys: Is there any way to see how responsive an M is to a preemption request? Like, in tests, rather than waiting for us to get unlucky.
- Nick: I'll file an issue.
- Rhys: We have a CPU profile, so maybe we can decide that an M is taking suspiciously long to respond, so maybe we can send a SIGPROF or something like it to see what it's doing.
- Michael P: We might not even need a new signal. The async preemption signals can just count how many times they failed and then tell the caller to do something about it. Maybe it can take a stack trace?
- Rhys: FP-based unwinding doesn’t work while the stack is still being set up. That includes every cooperative preemption event (during morestack), https://go.dev/issue/68090. Why do we have three different traceback functions (the main one, the fp-based one, the fp-based one that expands some frames), rather than one (the main one, with a “try to be very low overhead” mode)? Michael K is already assigned to the issue.
- Felix: It's hard to have an unwinder that's both fast and can decide what to do at every step. We were hoping to use the unwinder in more places. The plan was to work toward unification but it's quite a bit of work.
- Rhys: For the block and mutex profile, for runtime-internal mutexes, I used the regular unwinder. This is important because they're often used on the system stack, and the frame pointer unwinder doesn't attempt to trace back through the system stack.
- Nick: Michael P landed some changes that will let us go across systemstack calls safely. We can't go across mcalls but we don't want that anyway.
- Michael P: I sent some CLs just to explicitly maintain the frame pointer in the systemstack frame. And also to explicitly zero it in mcall.
- Felix: One other complication worth noting is that the unwinder is used by the garbage collector. For that we need to keep track of more.
- Michael P: https://go.dev/issue/73753 (“cmd/compile: play better with perf”) is also relevant to making FP unwinding more robust.
- Felix: One of the many things the unwinder does is it looks up the SP offset. This can be replaced with a delta from the FP (?) and that does make the unwinder faster, but it still started crashing.
- Rhys: I think I'm running into the opposite problem. One problem is when we're in the preamble before we even have a proper FP set up. One thing we can do is do the slow unwinding for the first frame, but go fast on everything else.
- Carlos: The flight recorder is in the 1.25 release.
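The runtime/metrics sketch referenced above, computing the memory-usage estimate Felix describes (metric names as documented in runtime/metrics):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},
		{Name: "/memory/classes/heap/released:bytes"},
	}
	metrics.Read(samples)

	total := samples[0].Value.Uint64()
	released := samples[1].Value.Uint64()

	// Rough estimate of the Go runtime's contribution to RSS: everything
	// the runtime has mapped, minus what it has already returned to the OS.
	fmt.Printf("estimated Go memory usage: %d bytes\n", total-released)
}
```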
Comment From: prattmic
2025-06-05 Sync
Attendees: Jake, Bryan, Dominik, Rhys, Nick, Felix, Carlos, Junyang, Milind, Michael P, thepudds, Kemal, Nicolas
- Michael K: Who has tried the greentea GC so far?
- Felix: There is excitement but no follow-up yet.
- Michael P: We should have RC1 coming out soon.
- Jake: Meaning to look at it for tsgo.
- Michael K: Thinking about net/http/pprof.
- Michael K: There are problems and a backlog of proposals. It seemed like a great idea at the time, but the use of DefaultServeMux is a security footgun. How are you all using it?
- Felix: One issue is the lack of delta profiles. Another issue is that applications need to integrate with it.
- Felix: Applications have to have support for net/http/pprof, which is not standardized. It would be nice to have a way to get this data out of any Go application.
- Michael P: Previous discussion above
- Nick: Small clarification on delta profiles: there is a delta option which will take profiles a few seconds apart, but for heap profiles specifically, it would be
- Rhys: Long-time non-user and I have an additional set of problems. And I have opinions that don't align with your prompt. It's an HTTP handler so you need access across the network to your process instance, and you need to identify a single process instance. You might not have an address, or in a tiny isolated container. Being able to reach out and touch the process across the network always involves certain risk. I built a thing that's open source which deals with some of these issues. (In widespread use within one large company.) Applications profile themselves on a regular basis, zip them, and put them somewhere. Push model instead of a pull model.
- Michael P: Anecdotally, I talked to some people about PGO, and almost everyone I talked to that had a way to get profiles was using net/http/pprof. They didn't have strong opinions, it was just there. They either said "we never profile production" or "we use net/http/pprof".
- Milind: +1 for using net/http/pprof for production profiles.
- thepudds: On the behavior change, could you tie it to the go.mod version? GODEBUG? Regarding the delta, I haven't seen that proposal, but if it's just a list of fields to calculate the delta on, that seems pretty simple and flexible.
- Michael K: As an aside, there are 31k public imports.
- Rhys: I'm surprised, I would've assumed it was all 'package main' imports.
- thepudds: v2s are pretty slow to adopt.
- Michael P: If it's really a security concern, we should change the behavior, but it's a gray area. I don't know how many people use the anonymous import, but I suspect that's most of it, and I don't think most people would notice that it's not exported now.
- Jake: According to GitHub, about half of imports are anonymous. It's also always bugged me that runtime/pprof and net/http/pprof have the same name.
- Rhys: Some of the uses of net/http/pprof I've seen in response to security concerns is to just expose profiles that they think exist. Which leaves gaps in debuggability when new things get added. Goroutine profiles also include all the arguments on all the stacks.
- Michael K: We can probably make it more secure by default, and let users opt in to more details. Some people expose net/http/pprof on a Unix domain socket. (A minimal sketch of that pattern follows these notes.)
- Rhys: I'm not a fan of using in-bound TCP for debugging. A GODEBUG for "dump diagnostic data here" would make it way easier to gather diagnostic data.
- Michael K: Agreed.
- Rhys: I ran into this with x/benchmarks/sweet, where it doesn't understand other profiles, because downstream binaries don't necessarily understand other profiles uniformly.
- thepudds: It's probably too bold to automatically add flags, but maybe:
- Maybe environment variables for the final binary, opt-in based on flags to ‘go build’ / ‘go install’?
- Or maybe just flags to ‘go run’ in particular (no env var needed in this case), similar to how ‘go test’ takes very convenient performance-related flags.
- thepudds: This would be a nice quality-of-life improvement for people who are already knowledgeable about performance, but also a nice on-ramp for a wider set of gophers.
- Michael K: One problem with environment variables is you can't really tell a program where to write through an environment variable, since it's a security issue.
- Rhys: I think tests are great because it gives easy access to all the debugging stuff.
- Jake: How would 'go run' work with os.Exit, for example?
- Michael K: From within the runtime we can hook os.Exit. We currently hook os.Exit for flushing coverage data. We also hook panics to traceAdvance.
- Nicolas: Signals are still a problem, since they can't run traceAdvance.
- Michael P: We could freeze the world and do a best-effort traceAdvance to an os.File.
- Nicolas: This would not be so ideal for us (similar to another crash hook I saw added to the stdlib recently, which writes to a file IIRC). In our case, we want to write to a socket, and in fact we do it through an opaque API that does not implement io.Writer. Our machines do not have space available in general, and even if they do, the resulting file gets wiped quickly.
- thepudds: As a final thought, making understanding your program more accessible really pays dividends. So many times complex software boils down to poor understanding of performance.
- Rhys: Giving people confidence that they will be able to profile easily can encourage writing code that just works and is easy to understand.
- Felix: Make profile compression configurable, optional or use compress/zstd (after proposal is implemented).
- The gzip compression the runtime does today is not a good CPU/compressed-size trade-off.
- Michael P: Not opposed, as long as the upstream pprof tool also supports it.
- thepudds: You can sniff to see if something is gzipped. pprof could just do the right thing based on looking at the bytes.
- Michael P: An alternative may also be to just let the runtime give you an uncompressed profile proto.
- Michael K: I like the uncompressed idea since the default still works with the tooling and users are unlikely to use it accidentally.
- Felix: The testing we've done on a limited amount of data shows it's about a 20% size difference in using zstd over gzip for the same CPU time. I like the idea of adding an option to get an uncompressed proto.
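The sketch referenced above: pprof handlers registered on a private mux and served over a Unix domain socket, so nothing lands on http.DefaultServeMux and nothing is reachable over the network. The socket path is illustrative.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
)

func main() {
	// Register the pprof handlers on a private mux instead of relying on
	// the package's DefaultServeMux side effect.
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))

	// Serve it on a Unix domain socket so it is never reachable over the
	// network (illustrative path).
	const sock = "/tmp/myapp-pprof.sock"
	os.Remove(sock)
	l, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(l, mux))
}
```

Profiles can then be pulled locally with something like curl --unix-socket /tmp/myapp-pprof.sock http://localhost/debug/pprof/heap and fed to go tool pprof.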
Comment From: prattmic
2025-06-26 Sync
Attendees: Piotr B, Rhys, thepudds, Nick, Junyang, Carlos, Michael P
- Piotr B: We needed to rewrite gosym internally, for inline tree support and to minimize memory overhead. Is there interest in accepting it into the standard library (I am happy to send a PR)? We can keep the existing library interface, but we need to add methods with different return types to handle inlined cases (there can be many lines per PC); we can mark the existing methods as deprecated. (A short sketch of the current API follows these notes.)
- Michael P: https://go.dev/issue/58474
- Piotr: Memory overhead is important. Table has an exported slice of functions.
- Michael P: We have been thinking of making a new debug/gosym (v2) package, likely in x/debug, because the existing API is awful.
- Michael P: I will share API brainstorming for new package
- Piotr: Handling stripped binaries by searching for moduledata.
- Michael P: For future versions, we could put moduledata in its own section like gopclntab
- Michael P: https://go.dev/issue/57447 Related issue for core-parsing x/debug packages
- Rhys: STW times seem pretty bad (sometimes up to 4ms) in the Green Tea issue gctraces, both with and without greenteagc
- Michael K: I agree. This may be an effect of the extra page faults that are now fixed.
- Rhys: The only way to get gctraces is by setting an env var and reading stderr, which is awkward for production environments.
- Michael K: We might be able to put more data in the execution trace (for when that’s active).
- Rhys: The execution trace might already have what it needs: wall times, assist volume, heap goals. Missing things like the amount of stack and globals. And a tool to extract it.
- Rhys: I get the impression that the current GC STW goal is 100µs. Maybe we can get the data we need to bring those back down from combined execution traces + CPU profiles.
- Nick: I have execution traces with CPU profile samples. I’ll try finding some that include slow GC STWs.
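The sketch referenced above: resolving a PC with the current debug/gosym API (ELF only). It also shows the single-line-per-PC limitation Piotr raises, since PCToLine cannot report inlined frames. The PC chosen at the end is illustrative only.

```go
package main

import (
	"debug/elf"
	"debug/gosym"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := elf.Open(os.Args[1]) // path to a Go ELF binary
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	pclnSec, textSec := f.Section(".gopclntab"), f.Section(".text")
	if pclnSec == nil || textSec == nil {
		log.Fatal("missing .gopclntab or .text (stripped or unusual layout?)")
	}
	pclntab, err := pclnSec.Data()
	if err != nil {
		log.Fatal(err)
	}

	// .gosymtab is typically empty in modern binaries; the line table
	// carries the function data.
	var symtab []byte
	if s := f.Section(".gosymtab"); s != nil {
		symtab, _ = s.Data()
	}

	tab, err := gosym.NewTable(symtab, gosym.NewLineTable(pclntab, textSec.Addr))
	if err != nil {
		log.Fatal(err)
	}

	pc := textSec.Addr + 0x1000 // some PC of interest; illustrative
	file, line, fn := tab.PCToLine(pc)
	if fn != nil {
		// Only one file:line per PC -- inlined frames are invisible here.
		fmt.Printf("%#x -> %s (%s:%d)\n", pc, fn.Name, file, line)
	}
}
```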