Go version
go version go1.22.0 darwin/arm64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='arm64'
GOBIN='/Users/***/gopath/bin'
GOCACHE='/Users/***/Library/Caches/go-build'
GOENV='/Users/***/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/***/gopath/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/***/gopath'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.22.0/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.22.0/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.0'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/0q/57ygjhqj3k59y4b8txgw839r0000gn/T/go-build2558853606=/tmp/go-build -gno-record-gcc-switches -fno-common'
What did you do?
While running my application I have been investigating some long-tail latency that is causing backlogs in the processing of incoming messages. The application processes between 4,000 and 50,000 incoming JSON messages a second, delivered over websocket connections. These messages are unmarshalled from between 8 and 16 separate sockets, with a goroutine doing the reading on each. All unmarshalled messages are sent over a single channel to another goroutine that performs time-sensitive calculations on the resultant state of those messages. There are virtually no memory allocations, and GC pauses typically happen once every 30 seconds. Having this goroutine complete its calculation quickly is critical to the success of the application.
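For illustration, here is a minimal sketch of the pipeline shape described above; all names are hypothetical stand-ins, not the actual application code:

    package pipeline

    import (
        "encoding/json"
        "time"
    )

    // message stands in for the application's decoded JSON payload.
    type message struct {
        Symbol string
        Price  float64
    }

    // reader represents one of the 8-16 goroutines, each draining a single
    // websocket connection; readFrame is a placeholder for the actual read.
    func reader(readFrame func() ([]byte, error), out chan<- message) {
        for {
            raw, err := readFrame()
            if err != nil {
                return
            }
            var m message
            if err := json.Unmarshal(raw, &m); err != nil {
                continue
            }
            out <- m // every reader funnels into the same channel
        }
    }

    // calculator is the latency-critical goroutine: an iteration normally
    // completes in tens of microseconds, so any multi-millisecond pause here
    // shows up as a backlog on the shared channel.
    func calculator(in <-chan message) {
        for m := range in {
            start := time.Now()
            _ = m                 // time-sensitive calculation on the accumulated state
            _ = time.Since(start) // iteration latency, the quantity graphed below
        }
    }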
What did you see happen?
Under normal compilation using Go 1.22, there is significant long-tail (0.9999 percentile) latency of 5-15 ms on the time taken to complete the calculation, which typically takes 30-90µs (0.99 percentile). With the asyncpreemptoff flag set in GODEBUG, this latency (0.9999 percentile) falls below 1ms. The following image graphs the change before and after setting the flag:
What did you expect to see?
I understand why preemption is useful for long-running goroutines that starve other goroutines of access to the CPU, but I would not expect its use to introduce this much latency. Being able to disable preemption for specific goroutines might be one solution. If you would like any more specific traces or profiles, I am happy to provide them. It would be difficult for me to provide a reproducer, as the timings of all involved goroutines are very dependent on third-party external state.
Comment From: mauri870
cc @golang/runtime
Comment From: prattmic
Interesting issue. The purpose of asynchronous preemption is to reduce tail latency, so seeing it increase tail latency instead isn't great.
One way I could imagine this happening would be if you have many goroutines, but they fall into two buckets: high-priority, latency-sensitive or low-priority, latency-tolerant. If the high-priority goroutines happen to have tight loops that don't have synchronous preemption points, then asynchronous preemption would preempt them more often. This would improve latency of the low-priority goroutines at the expense of the high-priority goroutines' latency. Does this describe your application?
One piece of data I'd like to see is /sched/latencies:seconds from https://pkg.go.dev/runtime/metrics with and without asyncpreemptoff. Note that this histogram is cumulative since process start, so you only need to collect it once at the end. https://go.dev/cl/513115 has some code you can copy to format the data into an ASCII graph.
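In case it is useful, a minimal sketch of collecting that metric once at shutdown (the ASCII-graph formatting itself lives in the linked CL; this only shows the raw read):

    package main

    import (
        "fmt"
        "runtime/metrics"
    )

    func main() {
        // The histogram is cumulative since process start, so a single read
        // at the end of the run is sufficient.
        samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
        metrics.Read(samples)

        hist := samples[0].Value.Float64Histogram()
        for i, count := range hist.Counts {
            // Buckets has one more element than Counts; Buckets[i] and
            // Buckets[i+1] bound the i-th bucket.
            fmt.Printf("%10d [%g, %g)\n", count, hist.Buckets[i], hist.Buckets[i+1])
        }
    }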
Comment From: prattmic
cc @golang/runtime @mknyszek @aclements
Comment From: donovanhide
Hi again.
I think I may have used some bad PromQL (irate vs rate with a push gateway and inconsistent intervals) in Grafana to generate that seemingly conclusive graph above. It seems that asyncpreemptoff is not related to my issue, as the /sched/latencies:seconds data below perhaps demonstrates. I could not reproduce the effect of asyncpreemptoff on latency again.
What I am still seeing, whether asyncpreemptoff is enabled or disabled, is that the CPU-bound function in the crucial goroutine is, once every 10,000 messages or so, taking many milliseconds to complete, despite the actual processing parts taking microseconds. This then creates a backlog in a queue, which has knock-on effects for the subsequent iterations of the crucial goroutine. As you describe, there are two sets of goroutines: a latency-sensitive one, which does not want to be interrupted, and many asynchronous ones waiting on bytes to come in over the network. The documentation for a scheduler in another language discusses the issue and possible solutions better than I can:
https://docs.rs/tokio/latest/tokio/#cpu-bound-tasks-and-blocking-code
It's not necessarily a priority problem, just a "please do not interrupt me until I reach a channel send or receive op, which I will probably do very soon". If such a flag could be set at the start of a goroutine, that would help my use case. If that flag were set for all goroutines, then I guess CPU starvation would be likely.
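To make that concrete, a sketch of the shape being asked for; noAsyncPreempt is a locally defined placeholder only, since no such per-goroutine opt-out exists in the runtime today:

    package sketch

    // noAsyncPreempt is a placeholder for the hypothetical per-goroutine
    // opt-out described above; the Go runtime provides no such call today.
    func noAsyncPreempt() {}

    type message struct{} // stands in for the decoded JSON state updates

    // crucial is the latency-sensitive goroutine: its CPU-bound section lasts
    // only tens of microseconds before it blocks on the channel again, which
    // is why "don't interrupt me until my next channel op" would suffice.
    func crucial(in <-chan message) {
        noAsyncPreempt() // hypothetical flag, set once at goroutine start
        for m := range in {
            _ = m // short CPU-bound calculation, then straight back to the receive
        }
    }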
Anyway, thanks for your time! Any tips gladly received! Please feel free to close the issue if desired.
Results with: GODEBUG=asyncpreemptoff=1
| 0 [ -Inf, 0.0)
********************| 611508 [ 0.0, 64.0)
| 23961 [ 64.0, 128.0)
***| 102487 [ 128.0, 192.0)
**| 74167 [ 192.0, 256.0)
| 19521 [ 256.0, 320.0)
*| 38501 [ 320.0, 384.0)
*********| 295493 [ 384.0, 448.0)
********| 266913 [ 448.0, 512.0)
*****| 174287 [ 512.0, 640.0)
**| 84142 [ 640.0, 768.0)
***| 119178 [ 768.0, 896.0)
**| 89576 [ 896.0, 1024.0)
***| 92763 [1024.0, 1280.0)
*| 37674 [1280.0, 1536.0)
| 18239 [1536.0, 1792.0)
| 16608 [1792.0, 2048.0)
| 26630 [2048.0, 2560.0)
| 20516 [2560.0, 3072.0)
| 17352 [3072.0, 3584.0)
| 19656 [3584.0, 4096.0)
| 29331 [4096.0, 5120.0)
| 20887 [5120.0, 6144.0)
*| 33623 [6144.0, 7168.0)
*| 41952 [7168.0, 8192.0)
**| 70560 [8192.0, 10240.0)
*| 43181 [10240.0, 12288.0)
*| 44811 [12288.0, 14336.0)
*| 43833 [14336.0, 16384.0)
*| 44537 [16384.0, 20480.0)
| 30200 [20480.0, 24576.0)
| 28594 [24576.0, 28672.0)
| 24048 [28672.0, 32768.0)
***| 107010 [32768.0, 40960.0)
*| 59413 [40960.0, 49152.0)
| 29004 [49152.0, 57344.0)
| 13850 [57344.0, 65536.0)
| 11805 [65536.0, 81920.0)
| 4973 [81920.0, 98304.0)
| 2553 [98304.0, 114688.0)
| 1749 [114688.0, 131072.0)
| 2346 [131072.0, 163840.0)
| 1385 [163840.0, 196608.0)
| 949 [196608.0, 229376.0)
| 694 [229376.0, 262144.0)
| 958 [262144.0, 327680.0)
| 562 [327680.0, 393216.0)
| 415 [393216.0, 458752.0)
| 229 [458752.0, 524288.0)
| 289 [524288.0, 655360.0)
| 167 [655360.0, 786432.0)
| 85 [786432.0, 917504.0)
| 60 [917504.0, 1048576.0)
| 69 [1048576.0, 1310720.0)
| 59 [1310720.0, 1572864.0)
| 41 [1572864.0, 1835008.0)
| 33 [1835008.0, 2097152.0)
| 51 [2097152.0, 2621440.0)
| 12 [2621440.0, 3145728.0)
| 14 [3145728.0, 3670016.0)
| 10 [3670016.0, 4194304.0)
| 22 [4194304.0, 5242880.0)
| 15 [5242880.0, 6291456.0)
| 11 [6291456.0, 7340032.0)
| 5 [7340032.0, 8388608.0)
| 15 [8388608.0, 10485760.0)
| 4 [10485760.0, 12582912.0)
| 7 [12582912.0, 14680064.0)
| 2 [14680064.0, 16777216.0)
| 6 [16777216.0, 20971520.0)
| 2 [20971520.0, 25165824.0)
| 1 [25165824.0, 29360128.0)
| 0 [29360128.0, 33554432.0)
| 1 [33554432.0, 41943040.0)
| 0 [41943040.0, 50331648.0)
| 0 [50331648.0, 58720256.0)
| ...
| 0 [105553116266496.0, 123145302310912.0)
| 0 [123145302310912.0, 140737488355328.0)
| 0 [140737488355328.0, +Inf)
Results without GODEBUG=asyncpreemptoff=1:
| 0 [ -Inf, 0.0)
********************| 636929 [ 0.0, 64.0)
| 24445 [ 64.0, 128.0)
***| 99320 [ 128.0, 192.0)
**| 82062 [ 192.0, 256.0)
| 22599 [ 256.0, 320.0)
*| 37629 [ 320.0, 384.0)
********| 273426 [ 384.0, 448.0)
********| 275386 [ 448.0, 512.0)
*****| 185945 [ 512.0, 640.0)
**| 80218 [ 640.0, 768.0)
****| 128796 [ 768.0, 896.0)
**| 85224 [ 896.0, 1024.0)
**| 83437 [1024.0, 1280.0)
*| 34046 [1280.0, 1536.0)
| 17497 [1536.0, 1792.0)
| 17253 [1792.0, 2048.0)
| 27411 [2048.0, 2560.0)
| 20776 [2560.0, 3072.0)
| 17502 [3072.0, 3584.0)
| 21533 [3584.0, 4096.0)
| 28812 [4096.0, 5120.0)
| 19218 [5120.0, 6144.0)
| 30163 [6144.0, 7168.0)
*| 39716 [7168.0, 8192.0)
**| 70413 [8192.0, 10240.0)
*| 43146 [10240.0, 12288.0)
*| 46100 [12288.0, 14336.0)
*| 46865 [14336.0, 16384.0)
*| 45291 [16384.0, 20480.0)
| 30861 [20480.0, 24576.0)
| 28138 [24576.0, 28672.0)
| 24143 [28672.0, 32768.0)
***| 114946 [32768.0, 40960.0)
*| 58894 [40960.0, 49152.0)
| 30010 [49152.0, 57344.0)
| 13650 [57344.0, 65536.0)
| 11804 [65536.0, 81920.0)
| 5049 [81920.0, 98304.0)
| 2803 [98304.0, 114688.0)
| 1888 [114688.0, 131072.0)
| 2275 [131072.0, 163840.0)
| 1445 [163840.0, 196608.0)
| 988 [196608.0, 229376.0)
| 728 [229376.0, 262144.0)
| 879 [262144.0, 327680.0)
| 558 [327680.0, 393216.0)
| 355 [393216.0, 458752.0)
| 228 [458752.0, 524288.0)
| 244 [524288.0, 655360.0)
| 143 [655360.0, 786432.0)
| 66 [786432.0, 917504.0)
| 55 [917504.0, 1048576.0)
| 74 [1048576.0, 1310720.0)
| 35 [1310720.0, 1572864.0)
| 36 [1572864.0, 1835008.0)
| 29 [1835008.0, 2097152.0)
| 45 [2097152.0, 2621440.0)
| 39 [2621440.0, 3145728.0)
| 9 [3145728.0, 3670016.0)
| 7 [3670016.0, 4194304.0)
| 14 [4194304.0, 5242880.0)
| 15 [5242880.0, 6291456.0)
| 6 [6291456.0, 7340032.0)
| 4 [7340032.0, 8388608.0)
| 14 [8388608.0, 10485760.0)
| 6 [10485760.0, 12582912.0)
| 4 [12582912.0, 14680064.0)
| 6 [14680064.0, 16777216.0)
| 3 [16777216.0, 20971520.0)
| 1 [20971520.0, 25165824.0)
| 2 [25165824.0, 29360128.0)
| 3 [29360128.0, 33554432.0)
| 0 [33554432.0, 41943040.0)
| 1 [41943040.0, 50331648.0)
| 0 [50331648.0, 58720256.0)
| 0 [58720256.0, 67108864.0)
| ...
| 0 [105553116266496.0, 123145302310912.0)
| 0 [123145302310912.0, 140737488355328.0)
| 0 [140737488355328.0, +Inf)
Comment From: prattmic
Thanks, I'm glad the problem is better understood now.
What you are asking for is a mechanism to disable preemption when executing some critical section. I don't think we have an existing issue for that, so this can serve that purpose, though I don't think it is something we clearly want to do.
I figured we would also have an issue for goroutine priorities, though I can't seem to find one.
Comment From: WinkelCode
I am currently investigating sporadic "IOHIDDeviceSetReport failed: (0xE00002BC) (iokit/common) general error" errors coming back from IOHIDDeviceSetReport on macOS: https://github.com/sstallion/go-hid/issues/15
I am not 100% confident about what is actually going wrong; however, it seems very probable that async preemption is to blame here: specifically, (A) it doesn't happen in a pure C test program, (B) asyncpreemptoff=1 fixes the issue, and (C) ignoring SIGURG in the thread calling the function also fixes it. Furthermore, so far, calling runtime.Gosched() right before IOHIDDeviceSetReport has helped avoid this error as well. Unfortunately, all of these solutions have drawbacks, as they either have to apply broadly across the program (e.g. potential side effects; a problem for libraries), or have a performance impact when called repeatedly in tight loops.
With this particular API, I don't think the advice of "looping to try the system call again"[1] is fully viable because the error is too generic. It would be greatly appreciated if there were either some guidance on the correct way to approach cases like this or a way to granularly disable async preemption.
[1] https://go.dev/doc/go1.14#runtime
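For context, a minimal sketch of the runtime.Gosched() workaround described above; sendReport is a stand-in for whatever cgo path ends up calling IOHIDDeviceSetReport, not a real function from go-hid:

    package sketch

    import "runtime"

    // sendReport is a placeholder for the cgo call that ultimately reaches
    // IOHIDDeviceSetReport; it is not part of any real package.
    func sendReport(report []byte) error { return nil }

    // writeWithYield yields to the scheduler immediately before the HID call.
    // Empirically this reduces the chance of a preemption signal (SIGURG)
    // landing while IOHIDDeviceSetReport is in flight, at the cost of an
    // extra scheduler round-trip on every call.
    func writeWithYield(report []byte) error {
        runtime.Gosched()
        return sendReport(report)
    }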
Comment From: cherrymui
In https://github.com/sstallion/go-hid/issues/15 you mentioned
Calling IOHIDDeviceSetReport directly via a minimal C program resulted in no errors
Perhaps you could try calling IOHIDDeviceSetReport in a C program that also receives and handles SIGURG, to determine whether IOHIDDeviceSetReport is signal safe.
While preemption signals are a major source of signals for a Go program, a signal could come from other places, e.g. profiling, or be sent from external programs. In general, it is probably better to make the program signal safe, instead of targeting preemption specifically.
Comment From: WinkelCode
@cherrymui This is the first time I am dealing with this particular area of programming, so please excuse me if I have misunderstood something.
I added a void sigurg_handler(int signo) {} function and a signal(SIGURG, sigurg_handler); call to main() of the C program.
Then I repeatedly sent SIGURG via: while true; killall -urg send_reports; sleep 0.1; end
The result is IOHIDDeviceSetReport repeatedly returning 0xE00002BC ((iokit/common) general error, via mach_error_string()).
Edit: Rewrote last sentence to be more precise.
Edit 2: I also ran it without sending the signal myself and didn't see any errors. Neither here nor with Go was I using any profilers or other programs that should send this signal to my knowledge.