There is room for improvement when some or all of a map is not in cache. The hierarchical design we use to enable incremental growth leaves us with several different allocations that can experience cache misses:
1. The `Map` itself (on `m.Used`)
2. The directory slice (on `m.directoryAt`)
3. The table (on `t.groups.lengthMask`)
4. The control word (on `matchH2`)
5. The slot key (on key comparison)
6. The slot value (on access in the caller)
(4), (5), and (6) are in the same object, so they may share a cache line depending on how far apart they are.
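To make the pointer chase concrete, here is a minimal sketch of the allocations a lookup walks, using simplified, hypothetical types and field names (the real internal/runtime/maps structures, and the extendible hashing used to index the directory, differ):

```go
// Hypothetical, simplified layout; the names are illustrative, not the real
// internal/runtime/maps types. Numbered comments match the list above.
package sketch

type group struct {
	ctrl uint64    // (4) control word: per-slot hash metadata
	keys [8]uint64 // (5) slot keys
	vals [8]uint64 // (6) slot values
}

type table struct {
	lengthMask uint64  // (3) table metadata read to pick a group
	groups     []group // backing array of groups
}

type Map struct {
	used      uint64   // (1) touched first, e.g. to short-circuit empty maps
	directory []*table // (2) directory slice; each entry is a separately allocated table
}

// lookup sketches the dependent loads a cold lookup performs. The real code
// filters candidate slots with matchH2 on the control word rather than
// scanning every key, and indexes the directory via extendible hashing.
func (m *Map) lookup(hash, key uint64) (uint64, bool) {
	if m.used == 0 { // (1) the Map itself
		return 0, false
	}
	t := m.directory[hash%uint64(len(m.directory))] // (2) the directory slice
	g := &t.groups[hash&t.lengthMask]               // (3) the table, then its group data
	_ = g.ctrl                                      // (4) the control word
	for i := range g.keys {
		if g.keys[i] == key { // (5) the slot key
			return g.vals[i], true // (6) the slot value, read by the caller
		}
	}
	return 0, false
}
```

In the worst case each numbered load lands in a different cache line; as noted above, (4), (5), and (6) live in the same group object and may share one.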
Some changes that may help (or may make things worse!):
- Grouping control words together (prototype).
  - Good for locality of control words. Probably most useful for lookup of keys that aren't in the map (as the key has worse locality with the control word).
  - Also good for optimistic prefetch of the first control word, something that Abseil does.
- Change the directory from `[]*table` to `[]table` (prototype); see the sketch after this list.
  - Effectively merges cases (2) and (3).
- Store group data directly in the table. The table becomes variable size, with a `table8` containing 1 group, a `table16` containing 2 groups, and so on up to `table1024`.
  - Effectively merges cases (3) and (4).
  - Allows optimistic prefetch of the control word / first key before even touching the table, as they are at fixed offsets.
  - We'd likely want to store the size in the directory as well to enable the above.
  - This is unfortunately mutually exclusive with the previous option, as the table is variable-sized.
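As a rough sketch of the `[]*table` → `[]table` directory change mentioned above (hypothetical type names), the question is whether the directory load and the table-header load hit two separate allocations or the same backing array:

```go
// Hypothetical sketch of the two directory layouts; names are illustrative.
package sketch

// table stands in for the per-table header (length mask, group data, ...).
type table struct {
	lengthMask uint64
	groups     []byte // group storage, elided
}

// Current layout: the directory holds pointers, so reaching a table's
// metadata takes two dependent loads that can miss separately
// (cases (2) and (3)).
type mapWithPointerDirectory struct {
	directory []*table
}

// Prototyped layout: table headers are stored inline in the directory's
// backing array, so the directory load and the table-metadata load hit the
// same allocation, effectively merging cases (2) and (3).
type mapWithInlineDirectory struct {
	directory []table
}
```

The inline form is also why this is incompatible with the variable-sized-table idea: a `[]table` needs a fixed element size.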
There is also the question of whether to store the key and value together (KVKVKV...), as we do today, or to group the keys together (KKKVVV...), as we do in the old maps. Both cases have pros and cons:
K/V together:
- Map hit
  - Cache miss on control word
  - Cache miss on key comparison
  - No cache miss on value use in parent
- Map miss
  - Cache miss on control word
  - Cache miss on key comparison in false positive matches (~1%)
  - No value

All keys together:
- Map hit
  - Cache miss on control word
  - No cache miss on key comparison
  - Cache miss on value use in parent (if used)
- Map miss
  - Cache miss on control word
  - No cache miss on key comparison in false positive matches (~1%)
  - No value
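Concretely, for an 8-slot group of uint64 keys and values the two layouts might look like this (hypothetical types; slot count and element types are simplified):

```go
// Hypothetical 8-slot group layouts; slot count and element types simplified.
package sketch

// KVKVKV... (the current layout): each key sits next to its value, so a hit
// that reads the value after a successful key comparison usually stays in the
// same cache line, but probing several keys touches more of the group.
type groupInterleaved struct {
	ctrl  uint64
	slots [8]struct {
		key uint64
		val uint64
	}
}

// KKK...VVV... (the old-map layout): keys are packed together, so key
// comparisons stay within fewer cache lines, but a hit pays a separate access
// for the value.
type groupSplit struct {
	ctrl uint64
	keys [8]uint64
	vals [8]uint64
}
```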
Comment From: gabyhelp
Related Issues
Related Code Changes
- internal/runtime/maps: initial swiss table map implementation
- runtime: use SwissTable
- internal/runtime/maps: small maps point directly to a group
- internal/runtime/maps: use match to skip non-full slots in iteration
- cmd/compile,internal/runtime/maps: add extendible hashing
- maps,runtime: improve maps.Keys
- internal/runtime/maps: search group using simple for loop
Comment From: bboreham
I noticed Prometheus uses more CPU with Go 1.24, in a couple of places where it accesses a large map.
This was hard to see with existing benchmarks, so I modified this benchmark to generate more data (and stripped out extra test cases to make it easier to run).
Check out that branch and run like this:
go test -run XXX -bench 'BenchmarkLoadWLs' -cpuprofile=cpu.pprof ./tsdb/
goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Intel(R) Core(TM) i7-14700K
BenchmarkLoadWLs/batches=1000,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0,oooSeriesPct=0.000,oooSamplesPct=0.000,oooCapMax=0,missingSeriesPct=0.000-28 1 25232921108 ns/op
PASS
ok github.com/prometheus/prometheus/tsdb 33.909s
Using Go 1.23 I get 82 seconds of CPU in `runtime.mapaccess1_fast64`; with Go 1.24.2 it's 108 seconds.
Profile from the latter: cpu.pprof.gz
[Gnarly extra info: Prometheus shards this map 16,000 ways by default, to reduce lock contention. But when I change that number down to 128 this benchmark still looks much the same.]