Update, Jun 7 2023: runtime/metrics now exists, but there are a few metrics in the draft CL here that aren't yet exposed. See https://github.com/golang/go/issues/15490#issuecomment-1564615291.
MemStats provides a way to monitor allocation and garbage collection.
We need a similar facility to monitor the Scheduler.
Briefly:
- Total goroutines created
- Current number of goroutines
- Total number of goroutines scheduled
- Current number of goroutines scheduled
- Total thread starts
- Current number of threads
- Metrics on the delay between a goroutine being ready and running on a proc
Comment From: minux
You can get all of these statistics by parsing the tracing output from runtime/trace.
Comment From: aclements
@minux, while true, runtime/trace seems like a pretty high overhead way to collect what amounts to a fairly small amount of information. It's certainly low overhead for what it does, but what it does is much more than what's needed here. The metrics @deft-code wants are primarily intended for continuous monitoring (based on offline conversations), so it needs to be cheap.
Comment From: aclements
Here are the notes on the desired metrics I had from our meeting a while ago:
Ring buffer of sampled duration between entering and exiting runnable state
- With some probability, when a goroutine enters runnable, tag it, and when it exits runnable, add the runnable duration to a ring
- Consumers can do what they want with these samples, including just averaging them, or building distributions.
Four global stats
- Current number of goroutines
- Total number of goroutines ever created
- Current number of runnable goroutines
- Total number of runnable-to-running transitions
Maybe current number of running goroutines
Comment From: bradfitz
Assigning to @aclements to decide what we're willing to support long-term.
Comment From: adg
Ping @aclements
Comment From: rsc
Still up to @aclements.
Comment From: rsc
Ping @aclements. Can you look at this during the release candidate quiet?
Comment From: aclements
@deft-code, do the specific stats I suggested in https://github.com/golang/go/issues/15490#issuecomment-215921759 address your needs?
Comment From: aclements
Sorry, I'd lost track of the fact that there was a concrete proposal doc for this: https://github.com/deft-code/proposal/blob/master/design/15490-schedstats.md
@deft-code, could you mail a CL to add this to the go-proposal repository and, once submitted, edit your first post to link to it? Thanks.
Comment From: deft-code
I'll get on top of it.
Comment From: aclements
Thanks!
Comment From: gopherbot
CL https://golang.org/cl/38180 mentions this issue.
Comment From: rsc
Do we need to keep this issue open, or should we accept it?
Comment From: aclements
There's definitely still work to do on how and what exactly the API should expose, but I think it's pretty clear we need to provide some visibility into the scheduler.
Comment From: rsc
Teams inside Google are patching in CL 38180 and getting some experience with it. If others would like to do the same, please do. We'll probably wait until Go 1.10 to decide to add the API officially. Putting the proposal on hold until then.
Comment From: aclements
Update: Some teams inside Google tried CL 38180 for monitoring CPU load and performing load shedding. However, they've found that the stats provided in the CL aren't a good indicator of CPU load. In particular, it seemed that since goroutines are so cheap (compared to, say, using runnable threads as a load indicator), the runnable goroutine count often fluctuated dramatically, even when the system was under normal load. It was common to see a huge number of goroutines newly started or woken that would exit or sleep almost immediately once run. If the load shedder happened to sample the SchedStats during one of these spikes, it would think the system was overloaded.
There may be some other stat that is a more robust indicator of load. For example, maybe smoothing would help: the runtime could provide the time-integral of the runnable count, from which an application could compute the average runnable count over any desired time window. Or it could expose something similar for the running goroutines to give a measure of idle time (once a system was overloaded, this wouldn't be able to tell how overloaded it was).
Comment From: RaduBerinde
We are looking at ways of detecting overload in CockroachDB, and statistics like those described in https://github.com/golang/go/issues/15490#issuecomment-215921759 or in the proposal would be extremely helpful. Is there any hope of getting something like this in Go any time soon? I can help with the work if folks are in agreement on what to build. As a timid first step, I was thinking of adding a /sched/goroutines:runnable metric to runtime/metrics - any thoughts on that?
Also, is there any more information about the experiments to use the number of runnable goroutines as an overload indicator? I understand the fluctuations are an issue, but wouldn't sampling this info frequently and looking at an average over a reasonable timeframe address that?
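One way to picture the smoothing asked about here is an exponentially weighted moving average, which is a swapped-in technique rather than anything proposed in the thread. In this sketch `ewma` and the smoothing factor are hypothetical, and `runtime.NumGoroutine` stands in for a runnable-goroutine gauge, since no such gauge is established at this point in the discussion.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// ewma smooths a noisy gauge with an exponentially weighted moving average.
// Hypothetical helper for illustration only.
type ewma struct {
	alpha  float64 // weight of the newest sample, in (0, 1]
	value  float64
	primed bool // false until the first sample has been observed
}

// observe folds one sample into the average and returns the new value.
func (e *ewma) observe(v float64) float64 {
	if !e.primed {
		e.value, e.primed = v, true
		return e.value
	}
	e.value = e.alpha*v + (1-e.alpha)*e.value
	return e.value
}

func main() {
	avg := &ewma{alpha: 0.2}
	// Poll a stand-in gauge a few times; a real monitor would poll the
	// runnable count instead of the total goroutine count.
	for i := 0; i < 5; i++ {
		smoothed := avg.observe(float64(runtime.NumGoroutine()))
		fmt.Printf("smoothed gauge: %.2f\n", smoothed)
		time.Sleep(10 * time.Millisecond)
	}
}
```

Smoothing like this only sees the instants at which it polls, so spikes shorter than the polling interval can still be missed or over-weighted.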
Comment From: prattmic
Side-stepping whether or not we want this, but now that we have a runtime metrics API (#37112), if we do add these kinds of metrics, adding to the metric API will be the obvious place rather than a new SchedStats method/struct. (e.g., the metric @RaduBerinde proposes above).
cc @mknyszek
Comment From: ianlancetaylor
@golang/runtime @mknyszek Is there anything here that is not covered by the runtime metrics package?
Comment From: mknyszek
I think a bunch of the metrics in https://go.dev/cl/38180 are not covered by the runtime/metrics package. I've had it on my TODO list for a while but it's never quite made it to the top. Clearly not for this release, but I continue to hope.
In terms of the proposal process, I don't think this needs to be on hold anymore. Adding new metrics is a fair bit more lightweight than it used to be, so even if we don't see an obvious use-case right now, the bar is low enough that I'm comfortable with just adding the remaining metrics.
In terms of the original proposal, I think all we're missing is counts of goroutines in various states, total count of goroutines created (to create a rate metric), and a thread count. (We already have a histogram metric for time spent in "runnable.")
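As a sketch of reading the existing histogram: the comment does not name the metric, so this assumes it is `/sched/latencies:seconds` from runtime/metrics. The helper names `readSchedLatencies` and `totalCount` are hypothetical, and the code checks the value kind so it degrades gracefully on a Go version that lacks the metric.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readSchedLatencies reads the scheduling-latency histogram, reporting
// false if the running Go version does not provide it as a histogram.
func readSchedLatencies() (*metrics.Float64Histogram, bool) {
	s := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil, false
	}
	return s[0].Value.Float64Histogram(), true
}

// totalCount sums the per-bucket counts of a histogram.
func totalCount(h *metrics.Float64Histogram) uint64 {
	var n uint64
	for _, c := range h.Counts {
		n += c
	}
	return n
}

func main() {
	h, ok := readSchedLatencies()
	if !ok {
		fmt.Println("scheduling latency histogram not available")
		return
	}
	// Buckets holds the boundaries, so it has one more entry than Counts.
	fmt.Println("buckets:", len(h.Counts), "events recorded:", totalCount(h))
}
```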
Comment From: mknyszek
If someone would like to take a stab at implementing this, please be my guest. Otherwise, I'll get to it next cycle.
Comment From: ianlancetaylor
@mknyszek Thanks. Should we keep this issue open and retarget to runtime/metrics? Or should we open a new proposal?
Comment From: mknyszek
We can keep this issue open and retarget it. I'll update the header and such.
Comment From: rsc
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Comment From: rsc
Retitled based on https://github.com/golang/go/issues/15490#issuecomment-1564615291. This is about per-state goroutine counts, a count of total goroutines created, and total number of OS threads.
Have all concerns about this proposal been addressed?
Comment From: mknyszek
Yeah, I believe all concerns are addressed.
Comment From: rsc
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
Comment From: rsc
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
Comment From: mknyszek
As for what these metrics should be named, perhaps:
/sched/threads:threads
/sched/goroutines-created:goroutines (cumulative)
/sched/goroutines/waiting:goroutines
/sched/goroutines/runnable:goroutines
/sched/goroutines/running:goroutines
/sched/goroutines/not-in-go:goroutines
The goroutine state metrics come from https://go-review.googlesource.com/c/go/+/38180/9/src/runtime/pstats.go#18. I figure we can reuse most of that implementation.
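If names like these land, a consumer could read them through runtime/metrics as sketched below. The names are copied from the list above and are proposals only, so the final names may differ; `readUint64` is a hypothetical helper that silently skips any metric the running Go version does not support.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// proposed lists the metric names exactly as floated in this thread.
// They are assumptions here and may not exist in any given Go release.
var proposed = []string{
	"/sched/threads:threads",
	"/sched/goroutines-created:goroutines",
	"/sched/goroutines/waiting:goroutines",
	"/sched/goroutines/runnable:goroutines",
	"/sched/goroutines/running:goroutines",
	"/sched/goroutines/not-in-go:goroutines",
}

// readUint64 reads the named metrics and returns only those that the
// running Go version reports as uint64 values.
func readUint64(names []string) map[string]uint64 {
	samples := make([]metrics.Sample, len(names))
	for i, name := range names {
		samples[i].Name = name
	}
	metrics.Read(samples)
	out := make(map[string]uint64)
	for _, s := range samples {
		// Unknown names come back as KindBad and are skipped.
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

func main() {
	got := readUint64(proposed)
	for _, name := range proposed {
		if v, ok := got[name]; ok {
			fmt.Println(name, "=", v)
		} else {
			fmt.Println(name, "is not supported by this Go version")
		}
	}
}
```

Checking the kind before reading matters because calling `Uint64` on an unsupported sample would panic.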
Comment From: prattmic
/sched/goroutines/not-in-go:goroutines is intended to cover syscalls and cgo, I assume? i.e., it is _Gsyscall?
Comment From: gopherbot
This issue is currently labeled as early-in-cycle for Go 1.22. That time is now, so a friendly reminder to look at it again.
Comment From: gopherbot
This issue is currently labeled as early-in-cycle for Go 1.23. That time is now, so a friendly reminder to look at it again.
Comment From: ianlancetaylor
Too late for 1.23.
Comment From: gopherbot
This issue is currently labeled as early-in-cycle for Go 1.24. That time is now, so a friendly reminder to look at it again.
Comment From: serathius
Any progress? This would be pretty useful to implement load shedding.
Comment From: gopherbot
Change https://go.dev/cl/690397 mentions this issue: runtime/metrics: add metrics for goroutine sched states
Comment From: gopherbot
Change https://go.dev/cl/690399 mentions this issue: runtime/metrics: add metric for current in-Go thread count
Comment From: gopherbot
Change https://go.dev/cl/690398 mentions this issue: runtime/metrics: add metric for total goroutines created