Discovered while investigating #68311: telemetry counter file extension can cause windows processes to panic. For example: this crash is from a new end-to-end test added in https://go.dev/cl/597278, without any changes to the x/telemetry logic.

https://ci.chromium.org/ui/p/golang/builders/try/x_telemetry-go1.21-windows-amd64/b8742848988042691361/test-results?q=ExactID%3Agolang.org%2Fx%2Ftelemetry%2Finternal%2Fcounter.TestConcurrentExtension+VHash%3Afb236f0bcc2ba24d&clean=&sortby=&groupby=

I don't think this is a theoretical or pathological setup. With the high concurrency of toolchain processes, it seems likely that Windows users will encounter crashes due to this bug. Furthermore, these crashes will only become more likely as we add more instrumentation. Therefore, marking as a release blocker.

CC @golang/telemetry

Comment From: gabyhelp

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

Comment From: findleyr

Investigating briefly, it seems likely that this is due to the unmapping of the existing file before it is remapped: https://go.googlesource.com/telemetry/+/refs/tags/config/v0.25.0/internal/mmap/mmap_windows.go#18

Comment From: hyangah

Thanks for the investigation. What happens if you delete the munmapFile call? Is it there because Windows is unhappy about creating another map on the same file handle?

Comment From: findleyr

In https://go.dev/cl/597279, I tested exactly that (see TestMultipleMaps). Windows seems happy to provide multiple views of the same file.

My hypothesis is that the unmap is done only for tests, which fail to clean up if there are any outstanding mappings of the counter file. See also the SetFinalizer call to mappedFile.close - the need for a setfinalizer is that counter.file.current may be read without holding counter.file.mu, and therefore the lifecycle of mappedFiles is unmanaged.

I believe the correct solution is to turn counter.file.current into a refcounted value, managed by the counter.file, and ensure that mappings are correctly cleaned up in tests.

Comment From: findleyr

After a bit more testing (ensuring that other mappings remain valid when one is closed), I believe the solution described in https://github.com/golang/go/issues/68358#issuecomment-2220649188 is the path forward. Therefore marking this as NeedsFix.

I will aim for a fix today. The fix will not be completely trivial, but there are only 7 non-test calls to file.current.load that need to be updated, and the logic should be simple. The additional complexity of awaiting the unmapping will only affect tests.

Comment From: gopherbot

Change https://go.dev/cl/597278 mentions this issue: internal/counter: fix windows mapped file extension

Comment From: findleyr

Well, after much iteration I realized that the existing code was very close to unmapping files correctly: we just need to close the mapping after invalidating counters, which must be done after unlocking the counter.file. This was done after rotation, but not after extension (but coincidentally I'd left a TODO a while ago to investigate why it wasn't done after extension).

So, we're able to (1) remove the unsafe unmapping, (2) remove the runtime.SetFinalizer logic, and (3) resolve that TODO, without introducing additional mechanisms. A satisfactory change.

I believe this should be a relatively safe CL to land.