The AIX Go CI builder recently started failing with the error below.
--- FAIL: TestImportTestdata (0.05s)
gcimporter_test.go:58: compile: /dev/null:1: unknown directive "Disconnected"
gcimporter_test.go:59: go tool compile generics.go failed: exit status 1
--- FAIL: TestTypeNamingOrder (0.04s)
gcimporter_test.go:58: compile: /dev/null:1: unknown directive "Disconnected"
gcimporter_test.go:59: go tool compile g.go failed: exit status 1
https://build.golang.org/log/28dadf4b964f28d9137fb36e3f3181f03394faa7
It looks like there is a problem with the CI machine, because our internal CI is working fine. Any idea what the problem could be? Any hints would be useful for fixing that CI machine.
Comment From: cherrymui
cc @golang/aix
Comment From: cherrymui
These tests also show flaky but high-frequency failures on the netbsd-arm-bsiegert builder, e.g. https://build.golang.org/log/7d88bbec36f25867219a11c16791e0c24ea622e7 . I'm not sure whether these are separate builder issues or whether they are related.
cc @bsiegert
Comment From: pmur
I was pinged by the AIX maintainers to take a look. Something on the VM is misbehaving, /dev/null is returning a single line of sshd logging. It claims an uptime of 1011 days. It's probably due for a reboot, so I've rebooted it.
@ayappanec I suspect this VM might be due for updates. Would you be able to investigate or verify it is up to date?
Comment From: bsiegert
This may actually be caused by the Go test in some way. The other day, I noticed that /dev/null on the netbsd-arm builder had been overwritten with a file containing one line of text.
Could it be that something specifies /dev/null as output file and the program is misbehaving somehow?
Comment From: ayappanec
> I was pinged by the AIX maintainers to take a look. Something on the VM is misbehaving, /dev/null is returning a single line of sshd logging. It claims an uptime of 1011 days. It's probably due for a reboot, so I've rebooted it.
> @ayappanec I suspect this VM might be due for updates. Would you be able to investigate or verify it is up to date?
Yes, the machine is due for updates. I will check on this.
Comment From: cherrymui
It wouldn't be surprising if some tests write to /dev/null, but usually that should be fine, just as it is common to run things like some shell command > /dev/null, which should just discard the output. Unless /dev/null somehow becomes a regular file on the machine? I don't know how that could happen. Maybe by running as root, deleting /dev/null, and recreating it as a regular file?
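One way to check for the condition described above is to stat /dev/null and look at its file mode. The following is only a minimal sketch, assuming a Unix-like system; the helper name checkDevNull is made up for illustration and is not part of any Go tooling.

```go
package main

import (
	"fmt"
	"os"
)

// checkDevNull (hypothetical helper) returns an error if os.DevNull is
// missing or is not a character device, for example because it has been
// replaced by a regular file.
func checkDevNull() error {
	info, err := os.Stat(os.DevNull)
	if err != nil {
		return err
	}
	if info.Mode()&os.ModeCharDevice == 0 {
		return fmt.Errorf("%s is not a character device: mode %v, size %d bytes",
			os.DevNull, info.Mode(), info.Size())
	}
	return nil
}

func main() {
	if err := checkDevNull(); err != nil {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1)
	}
	fmt.Println(os.DevNull, "looks like a character device")
}
```

Running something like this before and after a test suite could help narrow down which step replaces the device node.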
Comment From: pmur
It appeared something was recreating /dev/null as a regular file. When non-empty, it had the full contents of an sshd log message, without the syslog prefix. It seems unlikely to be the fault of the Go tooling or CI tooling.
I modified sshd_config and syslog to place sshd's logs into /var/log/messages last night. It still seems intact. If we start seeing builds pass again, I'll close this issue.
I think everything is running as root, so there are few guardrails. I don't have enough background with AIX to say what is clobbering the file, but sshd seems suspect.
Comment From: bsiegert
I assume this happens if the software thinks it's extra clever by creating the file under a different name and using rename(2) to put it into place.
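To illustrate the write-then-rename pattern, here is a rough sketch, assuming a Unix-like system, that runs entirely in a scratch directory. It shows that rename replaces the destination with a different underlying file rather than writing through it, which is why a destination of /dev/null would stop being a device. The function name replacedByRename is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replacedByRename (hypothetical helper) writes a file, renames a freshly
// written temp file over it, and reports whether the destination now refers
// to a different underlying file than before.
func replacedByRename() (bool, error) {
	dir, err := os.MkdirTemp("", "rename-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "out")
	if err := os.WriteFile(dst, []byte("old"), 0o644); err != nil {
		return false, err
	}
	before, err := os.Stat(dst)
	if err != nil {
		return false, err
	}

	// The "extra clever" pattern: write elsewhere, then rename into place.
	tmp := filepath.Join(dir, "out.tmp")
	if err := os.WriteFile(tmp, []byte("new"), 0o644); err != nil {
		return false, err
	}
	if err := os.Rename(tmp, dst); err != nil {
		return false, err
	}

	after, err := os.Stat(dst)
	if err != nil {
		return false, err
	}
	return !os.SameFile(before, after), nil
}

func main() {
	replaced, err := replacedByRename()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("destination replaced by a different file:", replaced)
}
```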
Comment From: pmur
This is still happening. I wonder if a test is rewriting the file. I couldn't reproduce it when running the go dist tests on the latest 1.22 or 1.23rc releases.
Comment From: cherrymui
@bsiegert this sounds plausible. The go command does move the output file from the temp WORK directory to the output. But it special-cases os.DevNull: https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/build.go;l=499 . It could be that some other tool is not that careful. I would guess that should be pretty reproducible, though. Maybe try running tests for the subrepos as well?
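A guard of this kind might look roughly like the sketch below. This is not the code from cmd/go; moveOutput and devNullSurvivesMove are hypothetical names used only to illustrate the idea of special-casing os.DevNull on a Unix-like system.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// moveOutput (hypothetical helper) moves src into place at dst. When dst is
// os.DevNull it only discards src, because renaming over the device node
// would replace it with a regular file.
func moveOutput(src, dst string) error {
	if dst == os.DevNull {
		return os.Remove(src)
	}
	return os.Rename(src, dst)
}

// devNullSurvivesMove writes a scratch file, "moves" it to os.DevNull via
// moveOutput, and reports whether os.DevNull is still a device afterwards.
func devNullSurvivesMove() (bool, error) {
	dir, err := os.MkdirTemp("", "move-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "a.out")
	if err := os.WriteFile(src, []byte("build output"), 0o644); err != nil {
		return false, err
	}
	if err := moveOutput(src, os.DevNull); err != nil {
		return false, err
	}
	info, err := os.Stat(os.DevNull)
	if err != nil {
		return false, err
	}
	return info.Mode()&os.ModeDevice != 0, nil
}

func main() {
	ok, err := devNullSurvivesMove()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(os.DevNull, "still a device:", ok)
}
```

A tool that skips such a check and runs as root can silently turn /dev/null into a regular file, which matches the symptoms seen on the builder.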
Comment From: pmur
I think I found the culprit after cycling through the x repos. Something in x/oscar (which coincidentally is failing to build on CI now) seems to be causing /dev/null to be deleted. I'll look more into this later today.
Comment From: dmitshur
Thanks for finding that. Note that x/oscar is defined as PT.TOOL, a repository intended to only be tested on a few first-class platforms like Linux/Windows/macOS, in the LUCI build configuration (see here).
Hopefully this builder can be migrated to LUCI soon (issue #67299) since we're migrating away from the coordinator, and so the coordinator won't be maintained indefinitely. But in the short term it would be fine to reconfigure the coordinator not to test x/oscar on the GOOS=aix builder.
Comment From: pmur
And, the culprit is found. I've opened #68558.
Comment From: gopherbot
Change https://go.dev/cl/600515 mentions this issue: internal/httprr: do not delete /dev/null