What version of Go are you using (`go version`)?
$ go version
go version go1.17.2 darwin/arm64
Does this issue reproduce with the latest release?
Cannot test 1.17.3 currently.
What operating system and processor architecture are you using (`go env`)?
Apple M1 Max MacBook Pro 16.
`go env` Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/XXX/Library/Caches/go-build"
GOENV="/Users/XXX/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/XXX/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/XXX/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.17.2/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/06/1d1vrn6541133ymsn2s2vwvw0000gn/T/go-build3566489552=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
- Spawn a simple http server
package main

import "net/http"

func main() {
    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })

    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}
- Run some requests against it
wrk -c500 -d10 -t1 http://localhost:8080
What did you expect to see?
With previous macOS versions (as with every other OS I've tested lately), the app uses only a handful of threads (usually around 8-12 at all times).
What did you see instead?
Thread count goes very high and never drops:
Comment From: heschi
cc @neild
Comment From: odeke-em
@chrisprobst thank you for the report, and happy new year to you and to @heschi :-)
@chrisprobst, I am on macOS Monterey as well, and I've failed to reproduce this problem with various values of -c and -t across a range of settings.
go1.17.5
https://user-images.githubusercontent.com/4898263/148328684-07844562-0a32-48bf-b4c0-8f56e440635a.mp4
go1.18
https://user-images.githubusercontent.com/4898263/148328543-a217c428-b877-4fa1-aba6-b1ccfe501a04.mp4
@chrisprobst can you reproduce this even right now?
Comment From: chrisprobst
@odeke-em Hi, thanks for checking. I am running on an M1 Max CPU and macOS 12.0.1 (so not yet updated to 12.1; will do shortly). The problem exists with both 1.17.5 and 1.18beta1. I will notify you when I've updated to 12.1.
Comment From: chrisprobst
@odeke-em I've updated to 12.1: same results. Can you reproduce this with an M1 Max? I have the feeling it is linked to the M1 architecture, but either way the problem is still present. With Rust and other runtimes the problem does not occur, so it is probably related to Go's thread management.
Comment From: odeke-em
Interesting, thank you @chrisprobst! Let me kindly ask a colleague, @kirbyquerby, who has an M1 to try to reproduce it, and then I shall purchase an M1 Pro or M1 Max computer tomorrow morning as well and try to reproduce and then debug it.
Comment From: kirbyquerby
Here are my results with Go 1.17 on Big Sur and then after updating to Monterey. The number of threads seemed to increase (peak 22 on Big Sur -> 32 on Monterey) but didn't balloon to large values like in the OP.
Big Sur results: https://drive.google.com/file/d/1IXAvDKuBSsuCX3GfpXCju2gKljn-sarV/view?usp=sharing
Monterey results: https://drive.google.com/file/d/1GIg5I7rsUc19cp0aOvpnbXw7SHZX3y1-/view?usp=sharing
`go env` output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN="/Users/nathan/go/bin"
GOCACHE="/Users/nathan/Library/Caches/go-build"
GOENV="/Users/nathan/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/nathan/go/pkg/mod"
GONOPROXY="github.com/orijtech/*"
GONOSUMDB="github.com/orijtech/*"
GOOS="darwin"
GOPATH="/Users/nathan/go"
GOPRIVATE="github.com/orijtech/*"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/h5/hnjslbmn52zb11xzqg9dmy7w0000gn/T/go-build2021634348=/tmp/go-build -gno-record-gcc-switches -fno-common"
Let me know if I can help test anything else :D
Comment From: chrisprobst
I will record a video as soon as I can. In my example, the threads go up into the hundreds.
Comment From: chrisprobst
https://user-images.githubusercontent.com/976784/148617285-08a39909-d50c-4640-bc94-ff453365ba17.mp4
Comment From: chrisprobst
@odeke-em @kirbyquerby As you can see, the threads go extremely high. It's now macOS 12.1 and Go 1.17.5 (1.18beta1 is the same).
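For anyone trying to reproduce this, here is a minimal sketch (not part of the original reports) for logging a comparable thread count from inside the server itself, using the runtime's threadcreate profile; note that it counts OS threads created, so it tracks the peak rather than the live count:

package main

import (
    "log"
    "net/http"
    "runtime/pprof"
    "time"
)

func main() {
    // Once per second, log how many OS threads the runtime has created so far.
    // The "threadcreate" profile has one record per thread ever created, so
    // this approximates the peak thread count rather than the live count.
    go func() {
        for range time.Tick(time.Second) {
            log.Printf("OS threads created: %d", pprof.Lookup("threadcreate").Count())
        }
    }()

    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}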
Comment From: chrisprobst
@kirbyquerby I noticed that you are using a base M1; maybe it's related to the M1 Max? The low-power (efficiency) cores are different, so maybe it is somehow affected by that?
Comment From: davecheney
Is this possibly related to some security software on that laptop? Users have previously reported that “anti-virus” software from Symantec and McAfee tends to have a detrimental impact on macOS computers.
Comment From: kirbyquerby
I diff'd our `go env`s and noticed that you installed using brew. I tried the brew version of go1.17.5 and still don't get the issue, so we can rule that out :)
Comment From: chrisprobst
@davecheney It's a stock device used in our lab. Based on the feedback it really sounds like an M1 Max issue.
Comment From: chrisprobst
I tested this app on a few other M1 Pro and M1 Max devices (same CPU): same result. Still, what could cause this different behaviour? How can a CPU model have this impact on the app? Is this even theoretically possible? Or is it perhaps due to different system libraries that Apple might use for these CPUs?
Comment From: ianlancetaylor
CC @cherrymui
Comment From: odeke-em
Unfortunately there isn't much we can do for Go1.18 but we are keeping this open for investigation for Go1.19 and if/when we encounter a need for a fix, we can backport. Moving this to the Go1.19 milestone. Thank you @chrisprobst and everyone.
Comment From: chrisprobst
After the 1.18 release I did some more tests and made some interesting observations that I would like to share; maybe it helps.
The M1 Max (Pro) has 8 performance cores and 2 efficiency cores. When I limit the processor count using GOMAXPROCS, the behaviour changes completely.
If I set a value between 1 and 4 (GOMAXPROCS=1 ... GOMAXPROCS=4), the problem disappears. It suddenly starts at GOMAXPROCS=5 and above. Also, with a value between 1 and 4 the performance is superb, as usual and as expected. The numbers mean nothing in themselves, but the delta is interesting:
With 1-4 I get 175K req/s with wrk and the thread count stays under 10. With 5+ I get "only" 60-75K req/s and the thread count goes wild, 300+.
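For reference, the same GOMAXPROCS limit can also be applied from code instead of the environment; a minimal sketch of the server from the top of this issue with the processor count pinned (this only expresses the experiment in code, it is not a fix):

package main

import (
    "net/http"
    "runtime"
)

func main() {
    // Pin the number of Ps, equivalent to running with GOMAXPROCS=4 in the
    // environment. Per the observations above, values of 1-4 avoid the thread
    // growth, while 5 and higher reproduce it.
    runtime.GOMAXPROCS(4)

    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}

Running the unmodified server with GOMAXPROCS=4 go run . is equivalent.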
I am not deeply familiar with how Go's scheduler works, but I do know that slow syscalls can cause additional threads to be spawned to take over. Maybe the efficiency cores are too slow, so Go decides the syscall is taking too long and simply spawns another thread?
If this is true, even modern Alder Lake CPUs with their big.LITTLE-style design might have similar issues. Again, it's a guess, but I have the feeling it is related to this heterogeneous core design.
Let me know your thoughts.
@odeke-em @ianlancetaylor
Comment From: ianlancetaylor
CC @golang/runtime
I guess this didn't get investigated for 1.19. Rolling forward to 1.20.
Comment From: chrisprobst
Just tested again with the latest 1.19 on an up-to-date M1 Pro device: same result. The simple HTTP server listed above still generates hundreds of threads on macOS.
I've already mentioned that only GOMAXPROCS=5 and above causes issues. An M1 Pro/Max has 8 performance cores and 2 efficiency cores. A value of 4, i.e. half of the performance cores, works perfectly. Not sure whether that's linked.
I really believe the efficiency cores are the problem: maybe they take so much longer than the performance cores for certain operations that Go thinks it has to spawn a new thread to keep up with the load. Could this be related?
Comment From: chrisprobst
A lot of time has passed and I still have an eye on this topic. Meanwhile, I have access to an M4 Max chip (and have tested every generation over the years).
What is interesting is that Go's performance in simple HTTP benchmarks has degraded over the years on macOS. Sure, it's not a server environment, but newer chips get faster while Go, with growing version numbers, gets slower. This is by no means scientific and I don't want to open Pandora's benchmark box. Really, I don't.
But it is still interesting, because this simple HTTP benchmark (the one from the top of this issue) is enough to trigger entirely different behavior on macOS over the years.
Previously, the main driver of this issue was the high thread count. That appears to be solved; I cannot reproduce it anymore, and the thread count never grows wildly like it did at the beginning of this issue.
But now something else entirely is strange. The "Idle Wake Ups" shown in macOS' Activity Monitor grow like crazy. And I mean crazy: for a simple web server with wrk hammering it I get up to 50K wake-ups, which is totally insane.
This test maxes out all cores at roughly 80-90K req/s, which is really low. A Rust server (a stupid-simple Tokio echo) peaks at 210K req/s with barely above 100 Idle Wake Ups (yes, 100, not 50K!).
But don't get me wrong here. This is not Rust vs. Go, and I know that Rust is in theory faster, given Go's memory allocations.
But please bear with me, because it gets even more interesting for us Gophers. When I limit GOMAXPROCS to 1, I get 130K req/s in Go with still-high Idle Wake Ups, but fewer (about 10K-15K).
These numbers mean nothing in absolute terms; they simply show the delta.
I would like to revive this thread and ask for insights about the Go scheduler and, possibly related, its interrupt-based preemption.
I think the Go scheduler has issues on macOS in high-performance scenarios that shouldn't exist. If there is any interest from the Go team in investigating this, let me know, and I can provide more test data such as screen recordings.
Comment From: prattmic
Can you collect a CPU profile of this server while it is under the load test? My initial guess would be that something about the load makes it such that runtime threads are constantly going to sleep and waking back up rather than just staying awake like they should under high load.
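(In case it is useful for collecting that profile: since the test server registers its handler on the DefaultServeMux, importing net/http/pprof exposes the profiling endpoints on the same port. This is a minimal sketch of one possible setup, not necessarily how the profile below was obtained.)

package main

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the DefaultServeMux
)

func main() {
    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}

Then, while wrk is running:
$ go tool pprof 'http://localhost:8080/debug/pprof/profile?seconds=10'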
Comment From: chrisprobst
Sure, let's go. Just to clarify, we are still talking about the most basic HTTP hello-world server; it can't get simpler than that. I'll copy the code from above for reference:
package main

import "net/http"

func main() {
    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })

    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}
$ go run .
$ wrk -d10 -c250 -t4 http://localhost:8080 [23:07:34]
Running 10s test @ http://localhost:8080
4 threads and 250 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.09ms 352.57us 8.68ms 75.67%
Req/Sec 20.01k 575.48 21.29k 74.25%
796637 requests in 10.00s, 56.98MB read
Requests/sec: 79644.38
Transfer/sec: 5.70MB
(For reference, a Spring Boot server gets 120K req/s and a Rust Axum server gets 250K req/s. The numbers mean nothing in absolute terms, just the delta. Both Rust and Java do NOT cause any major Idle Wake Ups, only a few hundred; Go goes crazy with about 40-50K.)
Here is the result from profiling:
Showing nodes accounting for 56.32s, 99.00% of 56.89s total
Dropped 155 nodes (cum <= 0.28s)
Showing top 10 nodes out of 48
flat flat% sum% cum cum%
35.87s 63.05% 63.05% 35.87s 63.05% runtime.kevent
14.98s 26.33% 89.38% 14.99s 26.35% syscall.syscall
2.90s 5.10% 94.48% 2.90s 5.10% runtime.pthread_cond_wait
1.49s 2.62% 97.10% 1.49s 2.62% runtime.usleep
1.03s 1.81% 98.91% 1.03s 1.81% runtime.pthread_cond_signal
0.01s 0.018% 98.93% 7.30s 12.83% bufio.(*Reader).fill
0.01s 0.018% 98.95% 0.31s 0.54% net/http.(*connReader).backgroundRead
0.01s 0.018% 98.96% 0.34s 0.6% runtime.runqgrab
0.01s 0.018% 98.98% 2.94s 5.17% runtime.semasleep
0.01s 0.018% 99.00% 0.37s 0.65% runtime.stealWork
And the web visualization (sorry, I use the German interface; "Reaktivierung" means Idle Wake Ups):
I hope this helps. This should also be trivial to test and replicate. I tested it on M3 Pro/Max and M4 Pro/Max machines (latest macOS) and get the identical result every time.
Comment From: chrisprobst
To provide the whole picture:
When I run:
GOMAXPROCS=1 go run .
$ wrk -d10 -c1000 -t4 http://localhost:8080 [23:33:26]
Running 10s test @ http://localhost:8080
4 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.21ms 1.16ms 17.92ms 79.21%
Req/Sec 34.53k 4.02k 38.21k 86.25%
1374247 requests in 10.01s, 98.29MB read
Socket errors: connect 0, read 3342, write 0, timeout 0
Requests/sec: 137302.50
Transfer/sec: 9.82MB
I get very low Idle wake ups (around 50-80).
If I had to guess, it's somewhere in the Go scheduler: work-stealing, maybe interrupt-based preemption, or how kevent is used in Go. To me, this is clearly misbehavior on macOS. I have too little knowledge about Go's internals, but this is so easy to replicate that I wonder why it never caught someone's attention.
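(A code-free way to watch the runtime's thread activity during such a run, offered here only as a suggestion rather than something used above, is the built-in scheduler trace:)

$ GODEBUG=schedtrace=1000 go run .

Every second this prints a SCHED line with gomaxprocs, threads, idlethreads, and spinningthreads counts, which shows directly how many OS threads the runtime has and how many are parked or spinning at any moment.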
Comment From: prattmic
Thanks, this is helpful.
I can reproduce this on one of our remote testing gomotes (gotip-darwin-arm64_13, macOS 13, arm64). Since there is no GUI access for Activity Monitor, I used the built-in powermetrics command to collect similar metrics (thanks to this quick start) [1].
I see high wakeups:
Name ID CPU ms/s User% Deadlines (<2 ms, 2-5 ms) Wakeups (Intr, Pkg idle)
example.com 47287 2145.90 30.14 25460.88 0.58 25490.96 0.00
The command also shows the distribution of different interrupt types:
**** Interrupt distribution ****
CPU 0:
Total IRQ: 1448.47 interrupts/sec
|-> IPI: 1035.20 interrupts/sec
|-> TIMER: 375.03 interrupts/sec
CPU 1:
Total IRQ: 1346.17 interrupts/sec
|-> IPI: 920.29 interrupts/sec
|-> TIMER: 421.81 interrupts/sec
CPU 2:
Total IRQ: 1090.91 interrupts/sec
|-> IPI: 683.67 interrupts/sec
|-> TIMER: 401.81 interrupts/sec
CPU 3:
Total IRQ: 1011.72 interrupts/sec
|-> IPI: 666.19 interrupts/sec
|-> TIMER: 339.70 interrupts/sec
CPU 4:
Total IRQ: 31681.72 interrupts/sec
|-> IPI: 18133.82 interrupts/sec
|-> TIMER: 13457.45 interrupts/sec
CPU 5:
Total IRQ: 31290.39 interrupts/sec
|-> IPI: 17986.49 interrupts/sec
|-> TIMER: 13302.74 interrupts/sec
CPU 6:
Total IRQ: 31208.09 interrupts/sec
|-> IPI: 17874.29 interrupts/sec
|-> TIMER: 13333.60 interrupts/sec
CPU 7:
Total IRQ: 31143.06 interrupts/sec
|-> IPI: 17829.25 interrupts/sec
|-> TIMER: 13313.61 interrupts/sec
These are all less than 100/s when the load test is not running. Seems close to 50/50 between IPI (cross-CPU wakeup) and timers.
[1] For those reading along, the IDLEW column in top doesn't seem to be the same thing. It stayed at 0 the whole time.
Comment From: prattmic
The immediate cause appears to be, as I theorized, that wrk is not providing enough load to keep the server awake.
I found the wrk source a bit difficult to follow, but I believe it only sends one request at a time per thread.
The general flow that I believe is happening:
1. wrk has 4 worker threads (-t4). Go allows up to 8 worker threads (GOMAXPROCS=8 on my machine). In practice only 4 will run at a time.
2. wrk sends a request.
3. A Go thread wakes to handle the request.
4. Go sends a response and the goroutine waits for a new request.
5. This Go thread looks for something else to do. There is nothing to do, so the thread goes to sleep.
6. wrk receives the response and sends another request.
7. GOTO 3.
With 4 worker threads, I see ~40k wakeups. If I give wrk 8 worker threads, it should be able to keep Go busier, and indeed that drops the wakeups to ~9k. If I give wrk 16 worker threads, it drops further to ~3k wakeups.
I can also see in CPU profiles that time in the Go scheduler drops from 16% -> 5% -> 2%.
The final 3k number is still pretty high. I'm not sure why; it needs more investigation. It might just be that Go still ends up with no work because OS scheduling effects sometimes don't run enough wrk threads (we have more threads than CPUs now).
I'm also not sure why tokio is doing better by default. It may be running fewer worker threads, or it may be spinning longer waiting for work before going to sleep.
We have been thinking for a long time about ways to eliminate GOMAXPROCS entirely and dynamically determine the amount of parallelism currently available in order to adjust on the fly. This problem sort of fits in that domain, but it is backwards: the machine has plenty of available parallelism to run one thread per request, but the request handler is so fast that it would be more efficient to just run one thread and queue the requests behind one another (this is what happens with GOMAXPROCS=1). However, the Go scheduler today is work conserving (#54622), meaning that if there are idle CPUs and runnable work, we always use the idle CPUs. Determining a priori in the scheduler that it is better to break this property, because the cost of wake+sleep is higher than the latency of waiting to run sequentially, is difficult.
Comment From: prattmic
To be clear, this is a well-known (to me at least :)) issue with the Go scheduler. If there is lots to do, the Go scheduler does well keeping busy. If there are long periods of nothing to do, things go to sleep and it works well. However, if the application switches between active and idle very rapidly it behaves poorly due to the relatively high cost of going to sleep and waking back up. This is one of the cases that #54622 aims to improve.
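(For anyone who wants to observe this sleep/wake churn directly rather than infer it from a CPU profile, a minimal sketch, again just an illustration layered on top of the test server, that records a runtime execution trace during the load test; the trace viewer then shows each P going idle and each thread wake-up between requests:)

package main

import (
    "net/http"
    "os"
    "runtime/trace"
    "time"
)

func main() {
    // Record a 10-second execution trace starting shortly after launch; run
    // wrk during that window so the trace covers the load.
    go func() {
        f, err := os.Create("trace.out")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        if err := trace.Start(f); err != nil {
            panic(err)
        }
        time.Sleep(10 * time.Second)
        trace.Stop()
    }()

    http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
        rw.WriteHeader(200)
    })
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}

Afterwards, open the result with go tool trace trace.out.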
Comment From: prattmic
One more thought: in my worst case, I see 16% of CPU time in the Go scheduler, but your profile shows 72% of time in the Go scheduler, so you have somehow hit a much more extreme case than me.
I'm not sure why. Perhaps you have more cores, resulting in a bigger mismatch between GOMAXPROCS and wrk's 4 threads. (I do get up to 24% of time in the scheduler if I reduce wrk to 1 thread.)
Though I take these profiles with a grain of salt, as macOS has suspicious profiling behavior (#57722).
Comment From: andig
It's worth noting that the base M1 ARM CPU has 4 performance and 4 efficiency cores.