Go version
1.23.9 and 1.24.3 amd64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.9'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='0'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2929154795=/tmp/go-build -gno-record-gcc-switches'
What did you do?
Hello everyone. I'm not sure whether my problem belongs in an issue, but I don't know where else to take it. If there is a better place to discuss it, please point me to it and I will move the discussion.
We have a fairly simple pull request: https://github.com/deckhouse/deckhouse/pull/14195/files The main change is in this file: https://github.com/deckhouse/deckhouse/pull/14195/files#diff-b3ce967f2ebde578ed3a1a2a03e5e36e200aeb2f3cc1bb81efd46db1baa25cea; everything else was added during debugging and no further changes were planned.
I will briefly describe the purpose and flow of this pull request; I hope this helps give you more context. The component being modified, dhctl, is responsible for installing the Deckhouse Kubernetes Platform (roughly an analogue of OpenShift). dhctl does the following:
- using terraform/opentofu (we invoke the binary), deploys the basic cloud infrastructure and a virtual machine for the first control-plane node
- deploys the first Kubernetes control-plane node
- deploys our controller, which in turn deploys the remaining controllers and cluster components; we wait until the controller becomes Ready
- if necessary, uses terraform/opentofu to provision machines in the cloud for the remaining control-plane nodes and waits for them to join the cluster
- deploys the Kubernetes resources specified by the user
There is one minor nuance in the last step. Some of the CRDs are deployed to the cluster only after at least one control-plane node has been added, and that node may be created from the user's resources (under the hood we use MachineControllerManager/ClusterAPI to provision what we call CloudEphemeral nodes). So we create the user resources in a loop: if a CRD is not yet present in the cluster, we retry.
We have a problem in one of our setups when deploying user resources, so we needed to log all requests made by client-go. To do that, we decided to route client-go's log output into our own logger, which is done here: https://github.com/deckhouse/deckhouse/pull/14195/files#diff-b3ce967f2ebde578ed3a1a2a03e5e36e200aeb2f3cc1bb81efd46db1baa25cea. This should not have caused any problems.
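To give a rough idea of what such a change involves (this is a simplified sketch, not the actual PR code; the helper name and the verbosity level are assumptions), client-go logs via klog, so pointing klog's output at a custom io.Writer looks roughly like this:

```go
package logging

import (
	"flag"
	"io"

	"k8s.io/klog/v2"
)

// redirectClientGoLogs is a hypothetical helper: client-go logs through klog,
// so raising klog's verbosity and redirecting its output makes every
// client-go request visible in our own logger. The verbosity level 6 is an
// assumption (the level at which client-go logs individual HTTP requests).
func redirectClientGoLogs(w io.Writer) {
	fs := flag.NewFlagSet("klog", flag.ContinueOnError)
	klog.InitFlags(fs)
	_ = fs.Set("v", "6")

	klog.LogToStderr(false) // otherwise klog keeps writing straight to stderr
	klog.SetOutput(w)       // every klog line now goes through w.Write
}
```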
dhctl can run in CLI mode and in gRPC server mode. We have a specialized component called commander: it is deployed into an existing cluster as a web application that runs dhctl in gRPC server mode and talks to it via gRPC requests. So in gRPC mode dhctl is deployed as a Deployment and runs in a container. dhctl itself is built statically (https://github.com/deckhouse/deckhouse/blob/main/dhctl/Makefile#L112), without CGO, and runs inside a distroless container.
For logging we wrote our own package, https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/log/logger.go, which under the hood uses other packages to produce output. Two functions used for writing debug logs are of interest here; dhctl uses them in gRPC server mode: https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/log/logger.go#L259 https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/log/logger.go#L273
We run only one bootstrap process per container at a time.
They do the following: if a specialized structure (let's call it DebugLogWriter), which implements the io.Writer interface, was passed during logger initialization, the log message is written via its Write method. Then, if we are in debug mode, the message is also written using the logboek library, https://github.com/werf/logboek.
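In simplified form the pattern is as follows (type and field names here are illustrative, not the real ones from logger.go):

```go
package log

import (
	"fmt"
	"io"
)

// Sketch of the debug-logging fan-out described above.
type PrettyLogger struct {
	debugWriter io.Writer // nil in CLI mode; a DebugLogWriter in gRPC server mode
	isDebug     bool
}

func (l *PrettyLogger) LogDebugF(format string, a ...interface{}) {
	if l.debugWriter != nil {
		// Every debug message (including the newly added client-go output)
		// passes through DebugLogWriter.Write here.
		fmt.Fprintf(l.debugWriter, format, a...)
	}
	if l.isDebug {
		// In debug mode the message is also rendered via logboek
		// (omitted here to keep the sketch self-contained).
	}
}
```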
When we receive a request to create a cluster, we reinitialize the logger, passing it a DebugLogWriter: https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/rpc/dhctl/bootstrap.go#L175 https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/rpc/dhctl/bootstrap.go#L81
DebugLogWriter is a simple structure (https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/pkg/logger/writer.go#L68). During its initialization we pass an instance of slog, which writes whatever it receives to the container's stderr in JSON format (https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/pkg/logger/logger.go#L60), after performing some operations on the incoming string and an internal variable (https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/pkg/logger/writer.go#L85-L98). Since there is internal state we operate on, and Write is very likely to be called from several goroutines, Write takes a Mutex.
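A minimal sketch of that structure, with assumed field names and line-splitting logic (the real code is at the links above):

```go
package logger

import (
	"bytes"
	"log/slog"
	"sync"
)

// DebugLogWriter here is a simplified stand-in for the real writer. It
// buffers incoming bytes, splits them into lines, and forwards each line to
// a slog.Logger whose JSON handler writes to the container's stderr.
type DebugLogWriter struct {
	mu     sync.Mutex
	buf    []byte // internal state shared between Write calls
	logger *slog.Logger
}

func (w *DebugLogWriter) Write(p []byte) (int, error) {
	w.mu.Lock() // the hang described below happens on this Lock
	defer w.mu.Unlock()

	w.buf = append(w.buf, p...)
	for {
		i := bytes.IndexByte(w.buf, '\n')
		if i < 0 {
			break
		}
		w.logger.Debug(string(w.buf[:i]))
		w.buf = w.buf[i+1:]
	}
	return len(p), nil
}
```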
And now, after this long explanation (sorry for taking so long to get to the problem; it seemed important to describe all the mechanics and give you more context), we come to the main issue. After adding client-go logging, dhctl freezes while taking the lock in Write (https://github.com/deckhouse/deckhouse/blob/main/dhctl/pkg/server/pkg/logger/writer.go#L82) and hangs there indefinitely. No bootstrap actions are performed, but the logs show that the gRPC server itself is alive and responds to health checks:
{"time":"2025-07-02T21:16:48Z","level":"INFO","source":"/dhctl/pkg/server/pkg/interceptors/interceptors.go:62","msg":"started call","component":"server","protocol":"grpc","grpc.component":"server","grpc.service":"grpc.health.v1.Health","grpc.method":"Check","grpc.method_type":"unary","peer.address":"10.111.0.110:46834","grpc.start_time":"20 25-07-02T21:16:48Z","grpc.request.deadline":"2025-07-02T21:16:53Z","grpc.time_ms":"0.027"}
{"time":"2025-07-02T21:16:48Z","level":"INFO","source":"/dhctl/pkg/server/pkg/interceptors/interceptors.go:62","msg":"finished call","component":"server","protocol":"grpc","grpc.component":"server","grpc.service":"grpc.health.v1.Health","grpc.method":"Check","grpc.method_type":"unary","peer.address":"10.111.0.110:46834","grpc.start_time":"2025-07-02T21:16:48Z","grpc.request.deadline":"2025-07-02T21:16:53Z","grpc.code":"OK","grpc.time_ms":"0.293"}
Since debugging the problem in the container is very difficult, the following actions were taken.
First, we added log lines for each step of the Write function, written directly to stderr without going through the logger: https://github.com/deckhouse/deckhouse/pull/14195/commits/a117cafd6f7e42285ec21e5117b0bcd28efffbcb
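In simplified form, the probes sit roughly like this (continuing the DebugLogWriter sketch above; the exact messages in the commit differ, and "fmt" and "os" would need to be added to the imports):

```go
func (w *DebugLogWriter) Write(p []byte) (int, error) {
	fmt.Fprintln(os.Stderr, "debug writer: before taking lock")
	w.mu.Lock()
	fmt.Fprintln(os.Stderr, "debug writer: lock taken")
	defer w.mu.Unlock()

	// ... split into lines and write to slog as before ...

	fmt.Fprintln(os.Stderr, "debug writer: finished writing to slog")
	return len(p), nil
}
```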
We saw the following picture:
As you can see from the screenshot, the counts of "before taking lock" and "finished writing to slog" entries differ by exactly one. It follows that every previous Write had completed before the lock was taken; otherwise the counts would differ by two.
What else was done:
- We replaced sync.Mutex with this mutex implementation, https://github.com/linkdata/deadlock, hoping it would report a deadlock, but it reported nothing.
- We built dhctl with the -race flag (which required enabling CGO, abandoning the static build, and switching to a non-distroless image). This build reported some issues at the start of the bootstrap, but nothing at the moment of the freeze.
- We tried building with Go 1.23.9 (we currently use 1.24.3); the problem did not go away.
- We also tried the following Linux kernels on the machine where the container is launched (we use Ubuntu on these machines); the problem reproduces on all of them:
  - 5.15.0-131-generic
  - 5.15.0-92-generic
  - 5.15.0-46-generic
  - 6.11.0-26-generic
  - 6.11.0-29-generic
- We removed the slog output and replaced it with direct output to stderr (https://github.com/deckhouse/deckhouse/pull/14195/commits/90c544522d01e1ed58fdeebda178c147d7531e3b), in case some lock hidden inside slog was the culprit, but the problem did not go away.
- Finally, we dumped all goroutines before taking the lock (https://github.com/deckhouse/deckhouse/pull/14195/commits/31d6e61a4adcd43dc4fe0059fd8c95380648509c), hoping to see two goroutines calling Write at that moment, but we saw that Write is called by a single goroutine (a sketch of this kind of goroutine dump is shown below).
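For reference, a dump of all goroutine stacks of the kind used in that commit can be produced like this (a sketch, not the exact code from the commit):

```go
package logger

import (
	"os"
	"runtime"
)

// dumpAllGoroutines writes the stacks of all goroutines directly to stderr.
// We called something like this immediately before w.mu.Lock() in Write.
func dumpAllGoroutines() {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = include all goroutines
	os.Stderr.Write(buf[:n])
}
```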
I am attaching a truncated log in which the last two Write calls are visible; the full log is over 300 MB and may contain our credentials.
https://pastebin.com/jaJcJz5e
I also tried running dhctl under strace, but strace in the container does not print its results to stdout/stderr, and output to a file via the -o flag freezes at random moments; even the workaround from https://stackoverflow.com/a/8252457 did not help.
Unfortunately, I don't know what other information I can provide. Our internal developer meetings also didn't yield any results, so we decided to file an issue. Maybe we are missing something and you can help us. I've been debugging this problem for seven nights now and I am out of ideas.
P.S. We don't see any problems on the main branch; the problem appeared only after we started logging client-go requests. Perhaps it is somehow related to the number of Mutex operations, since the number of calls has increased many times over.
Thank you!
What did you see happen?
The application freezes while acquiring a sync.Mutex in Write.
What did you expect to see?
The sync.Mutex lock should be acquired.
Comment From: gabyhelp
Related Issues
- log/slog: deadlock using wrapped default handler with SetDefault #61892
- build: coordinator deadlock #10077 (closed)
- x/tools/gopls: Possible deadlock occurs between `*cmdClient.filesMu` and `*Server.diagnosticsMu` #56450 (closed)
- cmd/compile: test failure in k8s after prove pass rewrite #68809 (closed)
- golang.org/x/crypto/ssh: goroutine leak #47541 (closed)
- affected/package: sync #53769 (closed)
- My Go project has various crash problems while running #41972 (closed)
- runtime: waitforsingleobject wait_failed #65255 (closed)
- runtime: all thread is block in runtime.futex,service is hanging,may be deadlock #22398 (closed)
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Comment From: seankhliao
Unlike many projects, the Go project does not use GitHub Issues for general discussion or asking questions. GitHub Issues are used for tracking bugs and proposals only.
For questions please refer to https://github.com/golang/go/wiki/Questions