Go version

go version go1.24.6 linux/amd64 (go1.24 as contained in golang:1.24 docker image)

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1479637807=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.6'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I was using a sync.Map to store *time.Location values (caching the output of time.LoadLocation). It was called at the start of each HTTP request in a gin HTTP server that was handling 10-50 concurrent requests.

Here is the function:


var locationCache sync.Map

// Loads information about the given location, using a thread-safe
// cache for speed and to reduce memory pressure.
func LoadTimeLocationCached(name string) (*time.Location, error) {
    cachedValue, ok := locationCache.Load(name)
    if ok {
        switch v := cachedValue.(type) {
        case *time.Location:
            return v, nil
        }
    }
    loc, err := time.LoadLocation(name)
    if err != nil {
        return loc, err
    }
    locationCache.Store(name, loc)
    return loc, nil
}
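
For context, here is roughly how it was called (a minimal sketch, not the real handler: the route, header name, and response shape are made up for illustration, and it assumes LoadTimeLocationCached from above lives in the same package):

package main

import (
    "net/http"
    "time"

    "github.com/gin-gonic/gin"
)

func main() {
    r := gin.Default()
    r.GET("/now", func(c *gin.Context) {
        // "X-Time-Zone" is an assumed header name for illustration.
        name := c.GetHeader("X-Time-Zone")
        loc, err := LoadTimeLocationCached(name) // the cached loader defined above
        if err != nil {
            c.JSON(http.StatusBadRequest, gin.H{"error": "unknown time zone"})
            return
        }
        c.JSON(http.StatusOK, gin.H{"now": time.Now().In(loc).Format(time.RFC3339)})
    })
    r.Run(":8080")
}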

The code was running in a long-running container on Google Cloud Run (request concurrency was enabled for the Cloud Run service with a high limit, so the container was receiving concurrent requests). There is a fair amount of memory pressure in this container (GOGC=2000 and GOMEMLIMIT=2672MiB); the long-running HTTP requests allocate a lot of temporary data while parsing and manipulating JSON.

What did you see happen?

One day in production (without any deployment or code change), about 1 in every 5000 requests started to panic on the line locationCache.Store(name, loc); restarting the container didn't help. The panic message was internal/sync.HashTrieMap: ran out of hash bits while inserting. The input to LoadTimeLocationCached came from an HTTP request header, so a new combination of (valid) time location names is likely what started triggering the issue. Some of these location names can be quite similar, so maybe that is a hint.

LoadTimeLocationCached was being called with a low cardinality of inputs (only valid time.LoadLocation names could ever reach the locationCache.Store line).

Note that this is the entirety of the code that manipulated locationCache: nothing ever deleted values from it, and there shouldn't have been more than a few dozen values in it (a few hundred at most).

What did you expect to see?

The program should run without panicking. Switching to a sync.Mutex guarding a regular map[string]*time.Location fixed the issue.
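
For reference, the mutex-based replacement was along these lines (a minimal sketch, not the exact production code; the package and variable names are made up):

package timecache // hypothetical package name

import (
    "sync"
    "time"
)

var (
    locMu    sync.Mutex
    locCache = map[string]*time.Location{}
)

// LoadTimeLocationCached returns the *time.Location for name, caching
// results behind a plain mutex-protected map.
func LoadTimeLocationCached(name string) (*time.Location, error) {
    locMu.Lock()
    loc, ok := locCache[name]
    locMu.Unlock()
    if ok {
        return loc, nil
    }
    loc, err := time.LoadLocation(name)
    if err != nil {
        return nil, err
    }
    locMu.Lock()
    locCache[name] = loc
    locMu.Unlock()
    return loc, nil
}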

Unfortunately, I wasn't able to reproduce the issue locally, even when trying in a variety of ways.

Thanks to @mknyszek for encouraging me to submit an issue and helpfully pointing out the related issue #73427.

Comment From: mknyszek

Thanks! I walked over the code yet again and I'm still stuck unfortunately. However, the fact that your example does not include deletion in any way significantly narrows the search space.

Comment From: mknyszek

Here's what I've been mulling over:

To hit this crash, it must be true that the new hash is not equal to the old hash, and that there is a partial collision of the hash bits.

One funky thing is that we recompute the hash for the key we're partially colliding with. I don't think regular maps do this, for example, which makes sync.Map slightly more susceptible to its internal invariants breaking when string invariants (such as immutability) are violated. The thing is, even if the new hash were something totally busted, the insertion should still succeed; it's very unlikely for the new hash bits to all be the same.

Furthermore, given that there aren't very many entries in the tree, I would be very surprised if we actually have some kind of boundary-case issue (where the maximum tree depth is being hit). Plus, we already cover those cases with the "bad hash" and "truncated hash" tests, which verify that everything still works correctly under complete hash collisions.

So this is leading me toward less of a logical bug/boundary issue and more toward a "bad race" issue. But that also doesn't really make sense to me because (1) these mutations are protected by very straightforward fine-grained locking and (2) I would expect different aspects of the structure to be very messed up and for there to be a larger variety of failures. But perhaps in this simpler use-case you wouldn't see that. It might result in a failed lookup or something.

The fact that you suddenly got relatively consistent failures suggests to me that the tree is indeed somehow getting into some bad state. I would love to know more about this bad state. Perhaps, in the case of this panic, we should just dump the full tree structure. It's slow and best-effort, but we're going down anyway, and at this point any additional information would be useful.

Comment From: klondikedragon

After chatting with @mknyszek, it turns out this is being caused by user error on my part!

The string key that was being passed to LoadTimeLocationCached (and thus to sync.Map.Store) came from a string created with unsafe.String(&b[0], len(b)); the underlying byte slice came from an arena of bytes managed by the fastjson package, and that arena gets reset during the lifetime of the (global) sync.Map. The theory is that the regular mutex-protected map might not hit the issue because it likely never grows large enough to require rehashing its keys.
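
To illustrate the failure mode (a minimal sketch: the byte slice here just stands in for the fastjson arena, and all names are made up):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    // b stands in for the fastjson-managed arena, whose bytes are reused
    // between requests.
    b := []byte("America/Denver")

    // The bug: this string aliases b's backing array, so the "key" silently
    // changes whenever the arena is reset or reused. A string used as a map
    // key (sync.Map included) must never change after insertion.
    unsafeKey := unsafe.String(&b[0], len(b))

    // The fix: copy the bytes so the key is a genuinely immutable string.
    safeKey := string(b)

    copy(b, "Europe/London!") // simulate the arena being reused
    fmt.Println(unsafeKey)    // "Europe/London!" -- the key mutated in place
    fmt.Println(safeKey)      // "America/Denver" -- unaffected
}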

I'm going to fix the input to be a proper copy, see whether the issue still reproduces with the fixed code, and report back.

I appreciate the quick response and collaboration!

Comment From: klondikedragon

The updated code, which makes a copy of the string bytes before calling Store, went through testing/CI/CD and is now running in production without any panics (previously the panic was triggering a few times a minute). So it seems my mistake of passing in an unsafe, mutable string as the key was indeed the issue. Thanks for your patience and the time you spent on this!

As a hint to others who might misuse sync.Map this way in the future, perhaps the panic message could mention that this can occur when the map is given keys whose hash can change (e.g., strings made unsafely from mutable byte slices)?