Go version

go version go1.24.6 linux/amd64 (go1.24 as contained in golang:1.24 docker image)

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1479637807=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.6'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I was using a sync.Map to store *time.Location values (caching the output of time.LoadLocation). It was called at the start of each HTTP request in a gin HTTP server that was handling 10-50 concurrent requests.

Here is the function:


var locationCache sync.Map

// Loads information about the given location, using a thread-safe
// cache for speed and to reduce memory pressure.
func LoadTimeLocationCached(name string) (*time.Location, error) {
    cachedValue, ok := locationCache.Load(name)
    if ok {
        switch v := cachedValue.(type) {
        case *time.Location:
            return v, nil
        }
    }
    loc, err := time.LoadLocation(name)
    if err != nil {
        return loc, err
    }
    locationCache.Store(name, loc)
    return loc, nil
}
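
For context, here is roughly how it was called (a minimal sketch, not the real handler: the route, header name, and response shape are made up for illustration, and it assumes LoadTimeLocationCached from above lives in the same package):

package main

import (
    "net/http"
    "time"

    "github.com/gin-gonic/gin"
)

func main() {
    r := gin.Default()
    r.GET("/now", func(c *gin.Context) {
        // "X-Time-Zone" is an assumed header name for illustration.
        name := c.GetHeader("X-Time-Zone")
        loc, err := LoadTimeLocationCached(name) // the cached loader defined above
        if err != nil {
            c.JSON(http.StatusBadRequest, gin.H{"error": "unknown time zone"})
            return
        }
        c.JSON(http.StatusOK, gin.H{"now": time.Now().In(loc).Format(time.RFC3339)})
    })
    r.Run(":8080")
}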

The code was running in a long-running container on Google Cloud Run (request concurrency was enabled for the Cloud Run service with a high limit, so the container was receiving concurrent requests). There is a fair amount of memory pressure in this container (GOGC=2000 and GOMEMLIMIT=2672MiB); the long-running HTTP requests allocate a lot of temporary data while parsing and manipulating JSON.

What did you see happen?

One day in production (without any deployment or code change), about 1 in every 5000 requests started to panic on the line locationCache.Store(name, loc); restarting the container didn't help. The panic message was internal/sync.HashTrieMap: ran out of hash bits while inserting. The input to LoadTimeLocationCached came from an HTTP request header, so a new combination of (valid) time location names is likely what started triggering the issue. Some of these location names can be quite similar, so maybe that is a hint.

LoadTimeLocationCached was being called with a low cardinality of inputs (only valid time.LoadLocation names could ever reach the locationCache.Store line).

Note that this is the entirety of the code that manipulated locationCache: nothing ever deleted values from it, and there shouldn't have been more than a few dozen values in it (a few hundred at most).

What did you expect to see?

The program should run without panicking. Switching to a sync.Mutex guarding a regular map[string]*time.Location fixed the issue.
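
For reference, the mutex-based replacement was along these lines (a minimal sketch, not the exact production code; the package and variable names are made up):

package timecache // hypothetical package name

import (
    "sync"
    "time"
)

var (
    locMu    sync.Mutex
    locCache = map[string]*time.Location{}
)

// LoadTimeLocationCached returns the *time.Location for name, caching
// results behind a plain mutex-protected map.
func LoadTimeLocationCached(name string) (*time.Location, error) {
    locMu.Lock()
    loc, ok := locCache[name]
    locMu.Unlock()
    if ok {
        return loc, nil
    }
    loc, err := time.LoadLocation(name)
    if err != nil {
        return nil, err
    }
    locMu.Lock()
    locCache[name] = loc
    locMu.Unlock()
    return loc, nil
}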

Unfortunately, I wasn't able to reproduce the issue locally, even when trying in a variety of ways.

Thanks to @mknyszek for encouraging me to submit an issue and helpfully pointing out the related issue #73427.

Comment From: mknyszek

Thanks! I walked over the code yet again and I'm still stuck unfortunately. However, the fact that your example does not include deletion in any way significantly narrows the search space.

Comment From: mknyszek

Here's what I've been mulling over:

To hit this crash, it must be true that the new hash is not equal to the old hash, and that there is a partial collision of the hash bits.

One funky thing is that we recompute the hash for the key we're partially colliding with. I don't think regular maps do this, for example, which makes sync.Map slightly more susceptible to its internal invariants breaking when string invariants (such as immutability) are violated. The thing is, even if the new hash were something totally busted, the insertion should still succeed; it's very unlikely for the new hash bits to all be the same.

Furthermore, given that there aren't very many entries in the tree, I would be very surprised if we actually have some kind of boundary-case issue (where the maximum tree depth is being hit). Plus, we already cover those cases with the "bad hash" and "truncated hash" tests, which verify that everything still works correctly under complete hash collisions.

So this is leading me toward less of a logical bug/boundary issue and more toward a "bad race" issue. But that also doesn't really make sense to me because (1) these mutations are protected by very straightforward fine-grained locking and (2) I would expect different aspects of the structure to be very messed up and for there to be a larger variety of failures. But perhaps in this simpler use-case you wouldn't see that. It might result in a failed lookup or something.

The fact that you suddenly got relatively consistent failures suggests to me that the tree is indeed somehow getting into some bad state. I would love to know more about this bad state. Perhaps, in the case of this panic, we should just dump the full tree structure. It's slow and best-effort, but we're going down anyway, and at this point any additional information would be useful.

Comment From: klondikedragon

After chatting with @mknyszek, it turns out this is being caused by user error on my part!

The string key that was being passed to LoadTimeLocationCached (and thus to sync.Map.Store) came from a string created with unsafe.String(&b[0], len(b)); the underlying byte slice came from an arena of bytes managed by the fastjson package, and that arena gets reset during the lifetime of the (global) sync.Map. The theory is that the regular mutex-protected map might not hit the issue because it likely never grows large enough to require rehashing its keys.
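
To illustrate the failure mode (a minimal sketch: the byte slice here just stands in for the fastjson arena, and all names are made up):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    // b stands in for the fastjson-managed arena, whose bytes are reused
    // between requests.
    b := []byte("America/Denver")

    // The bug: this string aliases b's backing array, so the "key" silently
    // changes whenever the arena is reset or reused. A string used as a map
    // key (sync.Map included) must never change after insertion.
    unsafeKey := unsafe.String(&b[0], len(b))

    // The fix: copy the bytes so the key is a genuinely immutable string.
    safeKey := string(b)

    copy(b, "Europe/London!") // simulate the arena being reused
    fmt.Println(unsafeKey)    // "Europe/London!" -- the key mutated in place
    fmt.Println(safeKey)      // "America/Denver" -- unaffected
}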

I'm going to fix the input to be a proper copy, see whether the issue still reproduces with the fixed code, and report back.

I appreciate the quick response and collaboration!

Comment From: klondikedragon

The updated code, which makes a copy of the string bytes before calling Store, went through testing/CI/CD and is now running in production without any panics (previously the panic was triggering a few times a minute). So it seems my mistake of passing in an unsafe, mutable string as the key was indeed the issue. Thanks for your patience and the time you spent on this!

As a hint to others who might misuse sync.Map this way in the future, perhaps the panic message could mention that this can occur when the map is given keys whose hash can change (e.g., strings made unsafely from mutable byte slices)?