What version of Go are you using (go version)?
$ go version
go version go1.20.5 linux/arm64
Does this issue reproduce with the latest release?
We can't use 1.20.6 because of https://github.com/golang/go/issues/61431
What operating system and processor architecture are you using (go env)?
$ go env
What did you do?
Our production environment runs tens of thousands of containers written in Go, most of them running for only a few minutes or hours. About once a day one of them crashes with runtime: traceback stuck. It is not always the same service, and this has been happening for months, across multiple Go versions going back to at least Go 1.18. We are not sure exactly when it started.
We did see a common pattern: the stack trace is always from the goroutine running our internal MemoryMonitor. It is a small library that runs in all our services, samples the cgroup memory parameters from procfs every second, and logs all running operations if we exceed 90% of available memory. When we turn off this functionality the problem disappears, so we know it is related. None of the containers that crashed reached this limit during their run, so only the sampling occurred.
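Roughly, the sampling loop looks like the following (a minimal sketch, not our exact code; the cgroup v2 file names and the 90% threshold are illustrative):

package memmonitor

import (
	"context"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// readUint reads a single unsigned integer from a cgroupfs/procfs file.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// run samples cgroup memory usage once per second and logs when usage
// crosses 90% of the limit.
func run(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			usage, err1 := readUint("/sys/fs/cgroup/memory.current")
			limit, err2 := readUint("/sys/fs/cgroup/memory.max")
			if err1 != nil || err2 != nil || limit == 0 {
				continue // e.g. memory.max contains "max" (no limit set)
			}
			if float64(usage) >= 0.9*float64(limit) {
				log.Printf("memory monitor: using %d of %d bytes (>=90%%), logging running operations", usage, limit)
			}
		}
	}
}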
Another thing we always see in the dump is an active runtime profile being taken by the DataDog agent we integrate with. It runs every 30 seconds and takes a CPU and memory profile using the standard pprof library. We are not sure if this is related.
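In standard-library terms, each profiling cycle amounts to roughly the following (a sketch only; the actual collection and upload are done by the DataDog profiler, and the sampling window is illustrative):

package profiling

import (
	"io"
	"runtime/pprof"
	"time"
)

// profileOnce collects one CPU profile and one heap profile, which is
// roughly what the agent triggers every 30 seconds.
func profileOnce(cpuOut, heapOut io.Writer) error {
	if err := pprof.StartCPUProfile(cpuOut); err != nil {
		return err
	}
	time.Sleep(10 * time.Second) // CPU sampling window (illustrative)
	pprof.StopCPUProfile()

	// debug=0 writes the binary pprof format.
	return pprof.Lookup("heap").WriteTo(heapOut, 0)
}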
What did you expect to see?
No crashes.
What did you see instead?
runtime: traceback stuck. pc=0xfdfcf0 sp=0x40097fbf90
stack: frame={sp:0x40097fbf90, fp:0x40097fbf90} stack=[0x40097fa000,0x40097fc000)
0x00000040097fbe90: 0x00000040075ae200 0x0000004003a53d00
0x00000040097fbea0: 0x0000004007587d40 0x0000000000000001
0x00000040097fbeb0: 0x0000000001b38580 <github.com/xxx/xxx/commonlib/memmonitor.(*memoryMonitor).run.func1.1+0x0000000000000000> 0x000000400771e500
0x00000040097fbec0: 0x0000004003a13e60 0x0000000000000000
0x00000040097fbed0: 0x0000000000000000 0x0000000000000000
0x00000040097fbee0: 0x00000040097fbeb0 0x0000004000507ef8
0x00000040097fbef0: 0x0000000000fe02c8 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1+0x0000000000000088> 0x0000004000507f58
0x00000040097fbf00: 0x0000000000fdfbfc <golang.org/x/sync/errgroup.(*Group).Go.func1+0x000000000000005c> 0x0000004004eed080
0x00000040097fbf10: 0x0300000000000000 0x0000000000000000
0x00000040097fbf20: 0x0000000000000000 0x0000000000fe0330 <github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1.1+0x0000000000000000>
0x00000040097fbf30: 0x0000000004324400 0x0000004007587d40
0x00000040097fbf40: 0x00000040097fbf18 0x00000000036f2020
0x00000040097fbf50: 0x00000040097fbf28 0x0000000000000000
0x00000040097fbf60: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x00000040076be780
0x00000040097fbf70: 0x0000004002347b60 0x0000000000000000
0x00000040097fbf80: 0x0100004004d079e0 0x0000004007585f00
0x00000040097fbf90: >0x0000000000fdfcf0 <golang.org/x/sync/errgroup.(*Group).Go.func1.2+0x0000000000000000> 0x0000004007585f00
0x00000040097fbfa0: 0x0000004000507f48 0x0000000000000000
0x00000040097fbfb0: 0x000000000007fd44 <runtime.goexit+0x0000000000000004> 0x000000400074f7a0
0x00000040097fbfc0: 0x00000040097fbf90 0x0000000000000000
0x00000040097fbfd0: 0x0000000000000000 0x0000000000000000
0x00000040097fbfe0: 0x0000000000000000 0x0000000000000000
0x00000040097fbff0: 0x0000000000000000 0x0000000000000000
fatal error: traceback stuck
runtime stack:
runtime.throw({0x30b0158?, 0x63afaa0?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x4000111bc0 sp=0x4000111b90 pc=0x49cd0
runtime.gentraceback(0x4000111f98?, 0x76480?, 0x50e55d58a9b8?, 0x4000102b60, 0x0, 0x4009100c00, 0x20, 0x0, 0x0?, 0x0)
/usr/local/go/src/runtime/traceback.go:487 +0xdd8 fp=0x4000111f30 sp=0x4000111bc0 pc=0x70658
runtime.saveg(0x0?, 0x7d92c?, 0x4005e36600?, 0x4009100c00)
/usr/local/go/src/runtime/mprof.go:1181 +0x44 fp=0x4000111f90 sp=0x4000111f30 pc=0x42154
runtime.doRecordGoroutineProfile.func1()
/usr/local/go/src/runtime/mprof.go:1093 +0x48 fp=0x4000111fc0 sp=0x4000111f90 pc=0x420e8
runtime.systemstack()
/usr/local/go/src/runtime/asm_arm64.s:243 +0x6c fp=0x4000111fd0 sp=0x4000111fc0 pc=0x7d92c
Comment From: cherrymui
It looks like it is stuck while collecting a goroutine profile, which takes stack traces of all goroutines. The place where it is stuck is the entry of a deferred function; unwinding should never reach there. The value looks like a stack slot used to save the defer record rather than a return PC, so the unwinding had already gone off the rails by this point. That could be due to failing to unwind a frame lower down the stack, or to the stack being corrupted.
@aciduck is your program a pure-Go program or it uses cgo? Does it use any unsafe code? Is there a way we can reproduce the failure? Thanks.
Comment From: aciduck
This happens in multiple services; some are compiled with cgo and some are not. We don't use unsafe code directly, but it might be used by one of our dependencies. Unfortunately we can't reproduce it consistently or in a simple program. The only pattern we have found is that it only happens when our MemoryMonitor code is running, and we suspect this is related to the fact that it executes frequent syscalls.
Comment From: cherrymui
Could you show a more complete error message? There should be other goroutine stack dumps below it, which may have information about how the unwinder got to that point. (It is okay to have symbol names redacted.) Thanks.
Comment From: aciduck
Full dump is in the attached file:
extract-2023-08-16T12_49_41.656Z - Copy.txt
Comment From: aciduck
Hi @cherrymui, is there anything else we can provide to help investigate this? We upgraded to 1.21.4 in the past few weeks and the problem still persists.
Comment From: cherrymui
Sorry for the delay. Thanks for the full stack trace.
Reading the stack trace, it seems to fail when unwinding the stack in the memory range [0x40097fa000,0x40097fc000):
runtime: traceback stuck. pc=0xfdfcf0 sp=0x40097fbf90
stack: frame={sp:0x40097fbf90, fp:0x40097fbf90} stack=[0x40097fa000,0x40097fc000)
This matches goroutine 594:
goroutine 594 [select, 2 minutes]:
runtime.gopark(0x400941fec0?, 0x2?, 0x2?, 0x0?, 0x400941fe84?)
/usr/local/go/src/runtime/proc.go:381 +0xe0 fp=0x40097fbd10 sp=0x40097fbcf0 pc=0x4c740
runtime.selectgo(0x40097fbec0, 0x400941fe80, 0x4007587d40?, 0x0, 0x2?, 0x1)
/usr/local/go/src/runtime/select.go:327 +0x68c fp=0x40097fbe20 sp=0x40097fbd10 pc=0x5d1ac
github.com/xxx/xxx/commonlib/memmonitor.(*memoryMonitor).run.func1()
/home/circleci/project/commonlib/memmonitor/monitor.go:73 +0xb0 fp=0x40097fbef0 sp=0x40097fbe20 pc=0x1b383f0
github.com/xxx/xxx/commonlib/waitgroup.(*SafeWaitGroup).Go.func1()
/home/circleci/project/commonlib/waitgroup/waitgroup.go:83 +0x24 fp=0x40097fbf00 sp=0x40097fbef0 pc=0xfe0684
github.com/xxx/xxx/commonlib/waitgroup.(*SafeErrorGroup).Go.func1()
/home/circleci/project/commonlib/waitgroup/waitgroup.go:40 +0x88 fp=0x40097fbf60 sp=0x40097fbf00 pc=0xfe02c8
golang.org/x/sync/errgroup.(*Group).Go.func1()
/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x5c fp=0x40097fbfd0 sp=0x40097fbf60 pc=0xfdfbfc
When the runtime crashed, it successfully unwound this stack. So it is weird that the first round of unwinding (for taking the goroutine profile) failed. It is possible that the goroutine executed and changed its stack between the two unwindings. But given that it had been sitting there for 2 minutes, and the crash probably took much less time, it probably didn't change.
Besides, goroutine 105 is interesting:
goroutine 105 [select, 2 minutes]:
runtime.gopark(0x40097fbf48?, 0x3?, 0x80?, 0xd?, 0x40097fbefa?)
/usr/local/go/src/runtime/proc.go:381 +0xe0 fp=0x4004b92d70 sp=0x4004b92d50 pc=0x4c740
runtime.selectgo(0x4004b92f48, 0x40097fbef4, 0x40044f0a00?, 0x0, 0x3?, 0x1)
/usr/local/go/src/runtime/select.go:327 +0x68c fp=0x4004b92e80 sp=0x4004b92d70 pc=0x5d1ac
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(0x40005f5220)
/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.16.0/trace/batch_span_processor.go:293 +0xd4 fp=0x4004b92f90 sp=0x4004b92e80 pc=0x912a74
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.16.0/trace/batch_span_processor.go:128 +0x58 fp=0x4004b92fd0 sp=0x4004b92f90 pc=0x911e68
runtime.goexit()
At the second frame, runtime.selectgo, the second argument 0x40097fbef4 (which is pointer-typed) points to goroutine 594's stack, which should not happen. Perhaps something weird is going on with that stack...
It seems this failure occurs while taking a goroutine profile. Could you try
- running the program without enabling goroutine profiles, and see if the error still occurs?
- with Go 1.20.x, changing go119ConcurrentGoroutineProfile to false, and see if the error still occurs? You'll need to use a Go 1.20.x source tree and modify the source code, then rebuild your program. (This constant only exists in Go 1.20.x, not Go 1.21 or later.) See the sketch below.
- running with Go 1.22rc1 (https://go.dev/dl/#go1.22rc1) or tip. If the error still occurs, this will print some additional information.
Thanks.
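For the second suggestion, the workflow is roughly the following (a sketch; the checkout directory name is illustrative, and grep will locate the exact file):

$ grep -rn go119ConcurrentGoroutineProfile go1.20.x/src/runtime/
# flip the constant from true to false, then rebuild the toolchain:
$ cd go1.20.x/src && ./make.bash
# then, from your module directory, rebuild the program with that toolchain:
$ /path/to/go1.20.x/bin/go build ./...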
Comment From: lizthegrey
We also continue to see traceback stuck approximately once a month across our arm64 fleet under continuous profiling, even after the fix to #54332, including on the most recent go1.21.6. Most recent dump here:
runtime: traceback stuck. pc=0x2c4670 sp=0x42091d9da0
stack: frame={sp:0x42091d9da0, fp:0x42091d9da0} stack=[0x420e72a000,0x420e72e000)
fatal error: traceback stuck
goroutine 0 [idle]:
runtime.throw({0x217a524?, 0x71700?})
/usr/local/go/src/runtime/panic.go:1077 +0x40 fp=0x4000514620 sp=0x40005145f0 pc=0x4a8d0
runtime.(*unwinder).next(0x4000514888)
/usr/local/go/src/runtime/traceback.go:486 +0x344 fp=0x40005146b0 sp=0x4000514620 pc=0x720a4
runtime.tracebackPCs(0x4000514888, 0x0?, {0x40005149e8, 0x40, 0x40})
/usr/local/go/src/runtime/traceback.go:628 +0x58 fp=0x4000514840 sp=0x40005146b0 pc=0x723c8
runtime.sigprof(0x25f88, 0x4000514c48?, 0x4000514c58?, 0x4000502b60, 0x400050c000)
/usr/local/go/src/runtime/proc.go:4969 +0x18c fp=0x4000514bf0 sp=0x4000514840 pc=0x5739c
runtime.sighandler(0x1b?, 0x400050c000?, 0x4000514c98?, 0x4000502b60?)
/usr/local/go/src/runtime/signal_unix.go:642 +0xf4 fp=0x4000514c60 sp=0x4000514bf0 pc=0x61f94
runtime.sigtrampgo(0x1b, 0x4000514da0, 0x4000514e20)
/usr/local/go/src/runtime/signal_unix.go:490 +0xd8 fp=0x4000514ce0 sp=0x4000514c60 pc=0x619f8
runtime.sigtramp()
/usr/local/go/src/runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000514da0 sp=0x4000514ce0 pc=0x832d4
I've symbolised 0x2c4670; it is:
$ go tool addr2line shepherd.full
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864
https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882
Comment From: cherrymui
Thanks @lizthegrey for the information.
$ go tool addr2line shepherd.full
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864
https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882
This looks like it is the entry address of a deferred function. Could you confirm that it is indeed the function entry, or whether it is somewhere in the function prologue, perhaps with go tool objdump?
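(For example, something along these lines, reusing the binary name from above; the regexp just selects that one symbol:)

$ go tool objdump -s 'net/http\.\(\*conn\)\.serve\.func1' shepherd.full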
What is PC 0x25f88? It is the first argument of runtime.sigprof, i.e. where the profiling signal lands and where the unwinding starts. That would be helpful to see why it unwinds to a function entry (or prologue). It looks like it is pointing to a runtime function?
Thanks.
Comment From: lizthegrey
$ go tool addr2line shepherd.full
0x2c4670
net/http.(*conn).serve.func1
/usr/local/go/src/net/http/server.go:1864
https://github.com/golang/go/blob/go1.21.6/src/net/http/server.go#L1864-L1882

This looks like it is the entry address of a deferred function. Could you confirm that it is indeed the function entry, or whether it is somewhere in the function prologue, perhaps with go tool objdump?
Confirm, function entry.
TEXT net/http.(*conn).serve.func1(SB) /usr/local/go/src/net/http/server.go
server.go:1864 0x2c4670 f9400b90 MOVD 16(R28), R16
server.go:1864 0x2c4674 d10083f1 SUB $32, RSP, R17
server.go:1864 0x2c4678 eb10023f CMP R16, R17
server.go:1864 0x2c467c 54000e09 BLS 112(PC)
server.go:1864 0x2c4680 f8160ffe MOVD.W R30, -160(RSP)
server.go:1864 0x2c4684 f81f83fd MOVD R29, -8(RSP)
server.go:1864 0x2c4688 d10023fd SUB $8, RSP, R29
server.go:1865 0x2c468c 910283e1 ADD $160, RSP, R1
server.go:1865 0x2c4690 91002020 ADD $8, R1, R0
server.go:1864 0x2c4694 f9400741 MOVD 8(R26), R1
server.go:1864 0x2c4698 f9002be1 MOVD R1, 80(RSP)
server.go:1864 0x2c469c f9400b42 MOVD 16(R26), R2
server.go:1864 0x2c46a0 f90033e2 MOVD R2, 96(RSP)
server.go:1865 0x2c46a4 97f61867 CALL runtime.gorecover(SB)
server.go:1865 0x2c46a8 f9001fe0 MOVD R0, 56(RSP)
server.go:1869 0x2c46ac f90027e1 MOVD R1, 72(RSP)
server.go:1865 0x2c46b0 b4000200 CBZ R0, 16(PC)
server.go:1865 0x2c46b4 b001d09b ADRP 60887040(PC), R27
server.go:1865 0x2c46b8 f947f363 MOVD 4064(R27), R3
server.go:1865 0x2c46bc b4000043 CBZ R3, 2(PC)
server.go:1865 0x2c46c0 f9400463 MOVD 8(R3), R3
server.go:1865 0x2c46c4 eb03001f CMP R3, R0
server.go:1865 0x2c46c8 54000060 BEQ 3(PC)
server.go:1865 0x2c46cc b24003e3 ORR $1, ZR, R3
server.go:1865 0x2c46d0 14000009 JMP 9(PC)
What is PC 0x25f88? It is the first argument of runtime.sigprof, i.e. where the profiling signal lands and where the unwinding starts. That would be helpful to see why it unwinds to a function entry (or prologue). It looks like it is pointing to a runtime function?
$ go tool addr2line shepherd.full
0x25f88
runtime.findObject
/usr/local/go/src/runtime/mbitmap.go:340
Comment From: aciduck
Hi, since upgrading to 1.21 the stack traces changed a bit, and it became obvious that the problem is triggered by the runtime profiling we run every 30 seconds. We kept CPU and heap profiling enabled and disabled goroutine profiling, and the problem almost disappeared: instead of a random container crashing almost daily, it happened only twice in the past 3 weeks. Both error dumps are attached. extract-2024-02-18T09_43_34.174Z.txt extract-2024-02-18T09_39_23.548Z.txt
Comment From: cherrymui
@aciduck have you tried Go 1.22? Also, in your crash dumps it seems some lines are missing or reordered, as if frames from different stacks are mixed together. Could you paste an unfiltered stack dump? (It is fine to redact function names.)
Also,
runtime: traceback stuck. pc=0x266f680 sp=0x400369f160
stack: frame={sp:0x400369f160, fp:0x400369f160} stack=[0x40167fe000,0x4016802000)
fatal error: traceback stuck
the SP and the frame are not within the bounds of the stack. Might it be that the G and the stack don't match? Are these crashes from a cgo program or pure Go? Thanks.
Comment From: aciduck
@cherrymui We haven't upgraded to 1.22 yet; that will probably happen in the coming weeks. This output is the best we have: we don't get it directly from the executables but from a log-collecting service, so some lines might have been swapped. Those two components were compiled with CGO_ENABLED=0.
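(For reference, the cgo setting is recorded in a binary's embedded build info for Go 1.18+ builds, so it can be double-checked with something like the following; the binary name is illustrative:)

$ go version -m ./service | grep CGO_ENABLED
	build	CGO_ENABLED=0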
Comment From: lizthegrey
We've done our 1.22 upgrade, will report if we see any stuck tracebacks.
Comment From: lizthegrey
@cherrymui: we got our first stuck traceback since go1.22. Let me know what you need symbolised from this:
runtime: traceback stuck. pc=0x25d80 sp=0x402ec0f2d0
stack: frame={sp:0x402ec0f2d0, fp:0x402ec0f2d0} stack=[0x402ec0e000,0x402ec10000)
0x000000402ec0f1d0: 0x3fc5c01a39fbd68c 0xd0d3df5a4e91cfca
0x000000402ec0f1e0: 0x73a17ce7dbd37d1b 0x4e7c0e5d2b3f15a7
0x000000402ec0f1f0: 0x64a1b1812af5a73d 0xffffffffffffffff
0x000000402ec0f200: 0xffffffffffffffff 0xffffffffffffffff
0x000000402ec0f210: 0xffffffffffffffff 0x82b34e68bd8b9149
0x000000402ec0f220: 0xcc4956d8c5ab7c86 0xcf938da9c9928796
0x000000402ec0f230: 0x713ee03c43e459c9 0xc56fd90763005b1a
0x000000402ec0f240: 0x8f1bbcdc8f1bbcdc 0xca62c1d6ca62c1d6
0x000000402ec0f250: 0x6ed9eba15a827999 0x00000000c6e15407
0x000000402ec0f260: 0x0fb105ae0fb105ae 0x81e09f8c81e09f8c
0x000000402ec0f270: 0x47686bec47686bec 0xc6608605c6608605
0x000000402ec0f280: 0x44badf1b44badf1b 0xfb169498fb169498
0x000000402ec0f290: 0x0bccfead0bccfead 0x14547546c6e6ffdc
0x000000402ec0f2a0: 0x276b0427333f7161 0x000000005f83d1d4
0x000000402ec0f2b0: 0x0605040702010003 0x000000402ec0f2c8
0x000000402ec0f2c0: 0x0000000000025d80 <runtime.mapaccess2_faststr+0x00000000000003f0> 0x000000402ec0f3c8
0x000000402ec0f2d0: >0x00000000005c4528 <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x0000000000000688> 0x000000414d184000
0x000000402ec0f2e0: 0x000000414ce60000 0x0000000000020000 <runtime.mallocgc+0x0000000000000400>
0x000000402ec0f2f0: 0x00000000005c460c <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x000000000000076c> 0x00000040d0456698
0x000000402ec0f300: 0x0000004137ad3800 0x0000000000000100
0x000000402ec0f310: 0x0000000000000100 0x0000000000000117
0x000000402ec0f320: 0x00074b0d0006ccc0 0x0000000900000000
0x000000402ec0f330: 0x0000000000000220 0x0000000c00000004
0x000000402ec0f340: 0x0000000000000001 0x00074bb60007fff6
0x000000402ec0f350: 0x0000021d00000563 0x0000001200074bc2
0x000000402ec0f360: 0x0000004100074b0c 0x000000402ec0f3c8
0x000000402ec0f370: 0x0000021d005c44d8 0x000000000000000b
0x000000402ec0f380: 0x0000000000080000 <runtime.goroutineProfileWithLabelsConcurrent.func2+0x0000000000000010> 0x0000000001000000 <github.com/bradfitz/gomemcache/memcache.(*ServerList).SetServers+0x0000000000000180>
0x000000402ec0f390: 0x00000000000010b9 0x0000000000020000 <runtime.mallocgc+0x0000000000000400>
0x000000402ec0f3a0: 0x0000000000000b0f 0x0000004153870000
0x000000402ec0f3b0: 0x000000414d184048 0x000000414d1c4048
0x000000402ec0f3c0: 0x000000414d3f0000 0x000000402ec0f558
fatal error: traceback stuck
goroutine 0 gp=0x4000802380 m=18 mp=0x4000780808 [idle]:
runtime.throw({0x21b3a9f?, 0xfffffff000000000?})
/usr/local/go/src/runtime/panic.go:1023 +0x40 fp=0x4000d0e730 sp=0x4000d0e700 pc=0x4cd90
runtime.(*unwinder).next(0x4000d0e978)
/usr/local/go/src/runtime/traceback.go:480 +0x340 fp=0x4000d0e7c0 sp=0x4000d0e730 pc=0x79b70
runtime.tracebackPCs(0x4000d0e978, 0x408f4b72d8?, {0x4000d0e9d8, 0x40, 0x40})
/usr/local/go/src/runtime/traceback.go:622 +0x5c fp=0x4000d0e940 sp=0x4000d0e7c0 pc=0x79e9c
runtime.sigprof(0x8a620, 0x4000d0ec48?, 0x4000d0ec58?, 0x401de59340, 0x4000780808)
/usr/local/go/src/runtime/proc.go:5375 +0x18c fp=0x4000d0ebf0 sp=0x4000d0e940 pc=0x5b4dc
runtime.sighandler(0x1b?, 0x4000780808?, 0x4000d0ec98?, 0x40041f48c0?)
/usr/local/go/src/runtime/signal_unix.go:642 +0x61c fp=0x4000d0ec60 sp=0x4000d0ebf0 pc=0x66c7c
runtime.sigtrampgo(0x1b, 0x4000d0eda0, 0x4000d0ee20)
/usr/local/go/src/runtime/signal_unix.go:490 +0xd8 fp=0x4000d0ece0 sp=0x4000d0ec60 pc=0x661b8
runtime.sigtramp()
/usr/local/go/src/runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000d0eda0 sp=0x4000d0ece0 pc=0x8ac14
goroutine 4203 gp=0x401de59340 m=18 mp=0x4000780808 [running]:
Comment From: aciduck
Hi, we upgraded to Go 1.22.1 and the problem keeps happening sporadically. Two such crashes are attached: extract-2024-03-27T12_14_33.042Z.txt extract-2024-03-27T12_12_46.194Z.txt
Comment From: lizthegrey
Another crash; let me know if you need me to symbolise it.
runtime: traceback stuck. pc=0x15900 sp=0x4399cc92d0
stack: frame={sp:0x4399cc92d0, fp:0x4399cc92d0} stack=[0x4399cc8000,0x4399cca000)
0x0000004399cc91d0: 0x73531f6ab7360646 0x645c308cfb8ce111
0x0000004399cc91e0: 0xd07788fc7b01bab7 0x50ebbb82c7975f4b
0x0000004399cc91f0: 0x37ba530109445563 0xe64d8eecc33e1c87
0x0000004399cc9200: 0xba6eb3e8ab4e6384 0xf54c2e3362ed086f
0x0000004399cc9210: 0x3c5b0fbb53a804d7 0x19f140fa74ab92f0
0x0000004399cc9220: 0xbf363a3e9d7e0b9e 0x934a39ccb5589239
0x0000004399cc9230: 0xf2d710aef8f89c7e 0xf26b17cdac4d3039
0x0000004399cc9240: 0x8f1bbcdc8f1bbcdc 0xca62c1d6ca62c1d6
0x0000004399cc9250: 0x6ed9eba15a827999 0x0000000043b3cc49
0x0000004399cc9260: 0x663d6a46663d6a46 0xd403e495d403e495
0x0000004399cc9270: 0x97263d2997263d29 0x3a9cc8fc3a9cc8fc
0x0000004399cc9280: 0xf5c2a259f5c2a259 0x8a78a8298a78a829
0x0000004399cc9290: 0xb22567d4b22567d4 0x1e376c0819a4c116
0x0000004399cc92a0: 0x4ed8aa4a391c0cb3 0x000000007bdcb9b0
0x0000004399cc92b0: 0x0605040702010003 0x0000004399cc92c8
0x0000004399cc92c0: 0x0000000000015900 <indexbody+0x0000000000000110> 0x0000004399cc93c8
0x0000004399cc92d0: >0x00000000005c35b8 <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x0000000000000688> 0x000000452dcd0000
0x0000004399cc92e0: 0x00000044e848c000 0x0000000000020000 <runtime.mallocgc+0x0000000000000400>
0x0000004399cc92f0: 0x00000000005c369c <github.com/klauspost/compress/zstd.(*Encoder).EncodeAll+0x000000000000076c> 0x00000040b3d80e78
0x0000004399cc9300: 0x00000044c3d15400 0x0000000000000100
0x0000004399cc9310: 0x0000000000000100 0x000000000000018f
0x0000004399cc9320: 0x00024e46000233b3 0x0000000200000006
0x0000004399cc9330: 0x0000000000001c92 0x0000000100000010
0x0000004399cc9340: 0x0000000000000001 0x000250420003fff6
0x0000004399cc9350: 0x00001c8f0000868c 0x0000000500025047
0x0000004399cc9360: 0x0000004400024e45 0x0000004399cc93c8
0x0000004399cc9370: 0x00001c8f005c3568 0x0000000000000000
0x0000004399cc9380: 0x0000000000040000 <runtime.(*pageAlloc).update+0x0000000000000490> 0x0000000001000000 <github.com/honeycombio/hound/keys.KeyList.UserIDs+0x00000000000002c0>
0x0000004399cc9390: 0x0000000000000da1 0x0000000000020000 <runtime.mallocgc+0x0000000000000400>
0x0000004399cc93a0: 0x00000000000002f3 0x000000452de12000
0x0000004399cc93b0: 0x000000452dcd0048 0x000000452dd10048
0x0000004399cc93c0: 0x00000044d6ca2000 0x0000004399cc9558
fatal error: traceback stuck
goroutine 0 gp=0x4000ba0e00 m=25 mp=0x400052e808 [idle]:
runtime.throw({0x21e954a?, 0xfffffff000000000?})
runtime/panic.go:1023 +0x40 fp=0x4000bfe730 sp=0x4000bfe700 pc=0x4cd90
runtime.(*unwinder).next(0x4000bfe978)
runtime/traceback.go:480 +0x340 fp=0x4000bfe7c0 sp=0x4000bfe730 pc=0x79b70
runtime.tracebackPCs(0x4000bfe978, 0x4018186f58?, {0x4000bfe9d8, 0x40, 0x40})
runtime/traceback.go:622 +0x5c fp=0x4000bfe940 sp=0x4000bfe7c0 pc=0x79e9c
runtime.sigprof(0x8a620, 0x4000bfec48?, 0x4000bfec58?, 0x401229b500, 0x400052e808)
runtime/proc.go:5375 +0x18c fp=0x4000bfebf0 sp=0x4000bfe940 pc=0x5b4dc
runtime.sighandler(0x1b?, 0x400052e808?, 0x4000bfec98?, 0x4000ba0fc0?)
runtime/signal_unix.go:642 +0x61c fp=0x4000bfec60 sp=0x4000bfebf0 pc=0x66c7c
runtime.sigtrampgo(0x1b, 0x4000bfeda0, 0x4000bfee20)
runtime/signal_unix.go:490 +0xd8 fp=0x4000bfece0 sp=0x4000bfec60 pc=0x661b8
runtime.sigtramp()
runtime/sys_linux_arm64.s:462 +0x54 fp=0x4000bfeda0 sp=0x4000bfece0 pc=0x8ac14
goroutine 4458 gp=0x401229b500 m=25 mp=0x400052e808 [running]:
Comment From: lizthegrey
Can we please get an update here? I know the 1.23 window is soon, and it would be nice to get a fix or additional telemetry into that release.
Comment From: cherrymui
Sorry for the delay. I still plan to look into this, and address it in 1.23 (we can still fix bugs during the freeze). Thanks.
Comment From: cherrymui
@aciduck in extract-2024-03-27T12_12_46.194Z.txt
runtime.sigprof(0x8a2cc, 0x4000150c48?, 0x4000150c58?, 0x4000005180, 0x400012d008)
The signal PC (where the traceback starts) is 0x8a2cc, which is close to
runtime.systemstack_switch() /usr/local/go/src/runtime/asm_arm64.s:200 +0x8 fp=0x4007620e60 sp=0x4007620e50 pc=0x8a218
So it may be something in the runtime, and probably assembly code. Is it possible to get what that PC is?
Also, from the SP,
runtime: traceback stuck. pc=0x170340 sp=0x4005a13ae0
stack: frame={sp:0x4005a13ae0, fp:0x4005a13ae0} stack=[0x4005a0e000,0x4005a16000)
it seems to be on this goroutine
goroutine 713 gp=0x4006610e00 m=7 mp=0x4000180808 [runnable]:
encoding/json.stateInString(0x4017495e00?, 0x6f?)
/usr/local/go/src/encoding/json/scanner.go:337 +0x1b4 fp=0x4005a13a10 sp=0x4005a13a10 pc=0x172544
encoding/json.appendCompact({0x40097dea80, 0x0, 0xa80}, {0x40097de000, 0x927, 0xa80}, 0x1)
/usr/local/go/src/encoding/json/indent.go:70 +0x44c fp=0x4005a13b00 sp=0x4005a13a10 pc=0x17002c
encoding/json.marshalerEncoder(0x4013aaaf00, {0x3572b20?, 0x4004ee0af0?, 0x35640a0?}, {0x20?, 0x2b?})
/usr/local/go/src/encoding/json/encode.go:444 +0x128 fp=0x4005a13bc0 sp=0x4005a13b00 pc=0x169f88
So apparently when it crashes, it actually can unwind the stack successfully: it starts with 0x4005a13a10, and the next frame is at 0x4005a13b00. Why is the profiling unwinding stuck in the middle of that frame, at 0x4005a13ae0? Also, the first frame has FP=SP, which looks like it is in the prologue, calling morestack. Maybe it is related to profiling from morestack?
@lizthegrey 's cases are weird.
runtime: traceback stuck. pc=0x25d80 sp=0x402ec0f2d0
0x000000402ec0f2c0: 0x0000000000025d80 0x000000402ec0f3c8
0x000000402ec0f2d0: >0x00000000005c4528 0x000000414d184000
It is stuck at SP=0x402ec0f2d0. The saved LR, or next PC, should be *SP=0x5c4528 (https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=373). But that is not what is reported on the first line, which is 0x25d80, i.e. actually *(SP-16). This is super weird. Unless these PC, SP, and LR are not from unwinding a frame, but something else? It is not the first frame where the traceback starts, as the argument of sigprof (the signal PC) is 0x8a620. A stack transition? In that case the PC, SP, and LR come from gp.sched, which is populated in morestack. But I don't see how morestack could produce such a PC and SP either... Nevertheless, it seems also related to morestack...
Comment From: lizthegrey
It's clearly not a very frequent crash, so we suspect it's an edge case of an edge case of an edge case.
Comment From: cherrymui
For unwinding for profiling, we are already a bit permissive -- in a number of cases if we can't find a valid frame, instead of throwing an error, we simply give up. Maybe we should do this here as well: if it is going to fall into an unwinding loop, just give up and leave a truncated stack trace.
On the other hand, the "traceback stuck" error did uncover a few real bugs. I'm not sure...
Comment From: lizthegrey
For unwinding for profiling, we are already a bit permissive -- in a number of cases if we can't find a valid frame, instead of throwing an error, we simply give up. Maybe we should do this here as well: if it is going to fall into an unwinding loop, just give up and leave a truncated stack trace.
On the other hand, the "traceback stuck" error did uncover a few real bugs. I'm not sure...
How about a compromise? Leave a truncated stack, but output the error message somewhere, so that we can pick it up and use it for debugging in the rare case where it does happen?
Comment From: AdallomRoy
Can we get an update here? This is still happening from time to time. There are a few suggestions in this thread to add additional debug info if it is still a mystery.
Comment From: srosenberg
Another example of a "stuck" traceback. I believe it's the first one we've seen while running CockroachDB nightlies. (It's also my first ever sighting of this in the wild.) This one was built with go1.23.7 for s390x; it ran in IBM Cloud. Any new hints?
runtime: traceback stuck. pc=0x51837d0 sp=0xc006181058
stack: frame={sp:0xc006181058, fp:0xc006181058} stack=[0xc006162000,0xc006182000)
0x000000c006180f58: 0x000000c008834cf0 0x0000000000000001
0x000000c006180f68: 0x000000c00242c858 0x0000000000000004
0x000000c006180f78: 0x000000c00242c7dc 0x0000000000000004
0x000000c006180f88: 0x0000000000000004 0x0000000000000004
0x000000c006180f98: 0x000000c00242c858 0x000000c00242c7dc
0x000000c006180fa8: 0x000000c004902af0 0x00000000051836c4 <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInCommitWaitState+0x00000000000002c4>
0x000000c006180fb8: 0x000000c014bc6008 0x0000000009c90fc0
0x000000c006180fc8: 0x000000000e629f00 0x0000000009136d30
0x000000c006180fd8: 0x0000000009136d38 0x0000000008ca61d0
0x000000c006180fe8: 0x000000c0025d24e0 0x0000000005163082 <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt+0x00000000000008a2>
0x000000c006180ff8: 0x000000c014bc6008 0x0000000009c67bc8
0x000000c006181008: 0x000000c001e3f4a0 0x0000000009c90fc0
0x000000c006181018: 0x000000000111ec08 <runtime.nanotime+0x0000000000000018> 0x0000000009135ac8
0x000000c006181028: 0x0000000000000000 0x0000000000000000
0x000000c006181038: 0x0000000000000000 0x0000000001dde882 <github.com/cockroachdb/crlib/crtime.NowMono+0x0000000000000042>
0x000000c006181048: 0x0000000001c1bf00 <github.com/cockroachdb/cockroach/pkg/util/log.HasSpan+0x0000000000000010> 0x0000000005163544 <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt+0x0000000000000d64>
0x000000c006181058: >0x00000000051837d0 <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInCommitWaitState.func1+0x0000000000000000> 0x000000c0061810f8
0x000000c006181068: 0x000000c014bc6008 0x0000000009c90fc0
0x000000c006181078: 0x000000000e629f00 0x00000000051628ba <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt+0x00000000000000da>
0x000000c006181088: 0x0000000009c67bc8 0x000000c001e3f4a0
0x000000c006181098: 0x0000000000007a70 0x000000c006181058
0x000000c0061810a8: 0x000000000514f66c <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1+0x000000000000049c> 0x000000c014bc6008
0x000000c0061810b8: 0x0000000009c67bc8 0x000000c001e3f4a0
0x000000c0061810c8: 0x0000000009c90fc0 0x000000000e629f00
0x000000c0061810d8: 0x000003ff4c1720a0 0x000000c0074be760
0x000000c0061810e8: 0x0000000009c117e0 0x000000000e629f00
0x000000c0061810f8: 0x0000000000000000 0x0000000000000000
0x000000c006181108: 0x0000000000004800 0x0000000000000000
0x000000c006181118: 0x0000000000000000 0x0000010000000000
0x000000c006181128: 0x000000000154b486 <crypto/tls.(*Conn).Write+0x00000000000000c6> 0x0000000000000000
0x000000c006181138: 0x000000000514c3e0 <github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd+0x0000000000001500> 0x0000000009c114c0
0x000000c006181148: 0x000000c014bc6be8 0x0000000007dd6f20
fatal error: traceback stuck
goroutine 0 gp=0xc007762a80 m=17 mp=0xc00a7be708 [idle]:
runtime.nanotime()
GOROOT/src/runtime/time_nofake.go:33 +0x18 fp=0xc006181058 sp=0xc006181048 pc=0x111ec08
github.com/cockroachdb/cockroach/pkg/util/log.HasSpan({0xc0061810f8, 0xc014bc6008})
pkg/util/log/trace.go:183 +0x10 fp=0xc006181058 sp=0xc006181058 pc=0x1c1bf00
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInCommitWaitState.func1()
pkg/sql/conn_executor_exec.go:3729 fp=0xc006181058 sp=0xc006181058 pc=0x51837d0
runtime: traceback stuck. pc=0x51837d0 sp=0xc006181058
stack: frame={sp:0xc006181058, fp:0xc006181058} stack=[0xc069d1a000,0xc069d22000)
fatal error: traceback stuck
panic during panic
goroutine 0 gp=0xc007762a80 m=17 mp=0xc00a7be708 [idle]:
runtime.nanotime()
GOROOT/src/runtime/time_nofake.go:33 +0x18 fp=0xc006181058 sp=0xc006181048 pc=0x111ec08
github.com/cockroachdb/cockroach/pkg/util/log.HasSpan({0xc0061810f8, 0xc014bc6008})
pkg/util/log/trace.go:183 +0x10 fp=0xc006181058 sp=0xc006181058 pc=0x1c1bf00
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInCommitWaitState.func1()
pkg/sql/conn_executor_exec.go:3729 fp=0xc006181058 sp=0xc006181058 pc=0x51837d0
runtime: traceback stuck. pc=0x51837d0 sp=0xc006181058
stack: frame={sp:0xc006181058, fp:0xc006181058} stack=[0xc069d1a000,0xc069d22000)
fatal error: traceback stuck
stack trace unavailable
Comment From: idanski
Hey, any news regarding this? Encountered in prod today as well 😬
Comment From: lizthegrey
Hey, any news regarding this? Encountered in prod today as well 😬
we've not reproed this in a year, oddly enough.