Edit: Initially, I thought this was an "optimization opportunity" and wrote it up as such, but later realized it appears to be a regression from Go 1.23 (see the later comment comparing against the old map implementation). Changed title accordingly.
I was reading through the (excellent!) new Swiss Table implementation from #54766 and noticed a potential optimization opportunity for lookups with small maps (<= 8 elems) for the normal case of non-specialized keys.
I suspect the current linear scan for these maps might be better off if we instead build a match bitmap and jump to the candidate match. I mailed https://go.dev/cl/634396 with this change.
For predictable keys, it might be a small win or close to a wash, but for unpredictable keys, it might be a larger win.
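To make the comparison concrete, here is a minimal sketch of the two lookup strategies for a single 8-slot group. It is not the code from the CL: the `group` layout, `toyH2`, and the other names are hypothetical, and the real implementation works on untyped slots and can use SIMD on amd64. The match step uses a portable bit trick that may report false positives, which the key comparison then filters out.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	lsb = 0x0101010101010101 // low bit of every byte
	msb = 0x8080808080808080 // high bit of every byte
)

// group is a hypothetical 8-slot group. Each control byte holds a
// 7-bit hash fragment (h2) for a full slot, or 0x80 for an empty one.
type group struct {
	ctrl uint64 // 8 control bytes, slot 0 in the low byte
	keys [8]uint32
	vals [8]int32
}

func newGroup() *group { return &group{ctrl: msb} } // all slots empty

// toyH2 is a stand-in for the 7 hash bits stored in a control byte.
func toyH2(key uint32) uint8 { return uint8(key * 2654435761 >> 25) }

func (g *group) put(i int, key uint32, val int32) {
	shift := uint(i) * 8
	g.ctrl = g.ctrl&^(uint64(0xff)<<shift) | uint64(toyH2(key))<<shift
	g.keys[i], g.vals[i] = key, val
}

// lookupLinear walks the slots in order. Where the loop exits depends
// on which slot holds the key, which is hard to predict when lookups
// hit random keys.
func (g *group) lookupLinear(key uint32) (int32, bool) {
	h2, ctrl := toyH2(key), g.ctrl
	for i := 0; i < 8; i++ {
		c := uint8(ctrl)
		ctrl >>= 8
		if c == h2 && g.keys[i] == key {
			return g.vals[i], true
		}
	}
	return 0, false
}

// matchH2 returns a bitmap with the high bit set in every byte of
// ctrl that equals h2. It may also flag a few non-matching bytes.
func matchH2(ctrl uint64, h2 uint8) uint64 {
	v := ctrl ^ (lsb * uint64(h2))
	return (v - lsb) &^ v & msb
}

// lookupMatch builds the bitmap once and jumps straight to the
// candidate slots, usually just one.
func (g *group) lookupMatch(key uint32) (int32, bool) {
	for m := matchH2(g.ctrl, toyH2(key)); m != 0; m &= m - 1 {
		i := bits.TrailingZeros64(m) / 8
		if g.keys[i] == key {
			return g.vals[i], true
		}
	}
	return 0, false
}

func main() {
	g := newGroup()
	for i := 0; i < 6; i++ {
		g.put(i, uint32(100+i), int32(i))
	}
	fmt.Println(g.lookupLinear(103))
	fmt.Println(g.lookupMatch(103))
	fmt.Println(g.lookupMatch(999))
}
```

In the linear version the number of loop iterations varies with the key's slot; in the match version the work before the first key comparison is the same for every key.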
To better illustrate this (as well as to help with analyzing a couple other experiments I'm trying), I also updated the benchmarks (#70700) to shuffle the keys for the newer benchmarks, and also added a new "Hot" benchmark that repeatedly looks up a single key (Hot=1 below) or a small number of random keys (Hot=3 below).
The geomean here is -28.69%. These results are for amd64 and use the SIMD optimizations. I did not test on arm or without the SIMD optimizations.
│ no-fix-new-bmarks │ fix-with-new-bmarks │
│ sec/op │ sec/op vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4 24.55n ± 0% 13.63n ± 0% -44.48% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4 12.500n ± 0% 9.517n ± 32% -23.86% (p=0.007 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4 13.73n ± 4% 13.54n ± 0% ~ (p=0.096 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=3-4 21.93n ± 1% 13.57n ± 1% -38.12% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 13.07n ± 0% 13.54n ± 0% +3.60% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 19.05n ± 0% 13.53n ± 0% -28.98% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 21.46n ± 0% 13.52n ± 0% -37.00% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 23.06n ± 0% 13.53n ± 0% -41.33% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 23.94n ± 0% 13.53n ± 0% -43.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 24.46n ± 0% 13.54n ± 0% -44.64% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 24.91n ± 0% 13.54n ± 0% -45.64% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 25.12n ± 0% 13.56n ± 0% -46.02% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 12.480n ± 0% 9.523n ± 0% -23.69% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 12.480n ± 0% 9.516n ± 0% -23.75% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 12.490n ± 0% 9.516n ± 0% -23.81% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 12.480n ± 0% 9.520n ± 0% -23.72% (p=0.001 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 12.480n ± 0% 9.520n ± 0% -23.72% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 12.490n ± 20% 9.527n ± 32% -23.72% (p=0.003 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 12.48n ± 0% 12.09n ± 21% -3.12% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 12.49n ± 16% 11.77n ± 19% -5.76% (p=0.000 n=25)
geomean 16.58n 11.82n -28.69%
For the miss benchmarks, I suspect the miss itself is predictable in all cases; what differs is whether the key being looked up is predictable or not. For tables with a single group, as all of these are, that probably doesn't matter too much.
For the hits and misses, these are the ones that I expect to have predictable keys:
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4 13.73n ± 4% 13.54n ± 0% ~ (p=0.096 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 13.07n ± 0% 13.54n ± 0% +3.60% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 12.480n ± 0% 9.523n ± 0% -23.69% (p=0.000 n=25)
(Those three also appear in the larger list above; I'm pulling them out here for commentary.)
The first has predictable keys because a single key is looked up repeatedly per run (with 6 elements in the small map). The other two have predictable keys because the map holds only a single element. In other words, the keys are shuffled in all ~20 of these benchmarks, but in the three here, the shuffling is effectively a no-op.
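As a rough illustration of the difference between predictable and unpredictable lookups, the sketch below times lookups in a 6-element map while drawing keys at random from a "hot" subset. It is an assumption about the general shape of such a benchmark, not the code from #70700: `lookupSeq`, `benchLookup`, the sequence length, and the seed are all hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

// lookupSeq returns count keys drawn at random from the keys
// 0..hot-1. With hot == 1 every lookup uses the same key, so the
// random draw is effectively a no-op.
func lookupSeq(hot, count int, seed int64) []int32 {
	r := rand.New(rand.NewSource(seed))
	seq := make([]int32, count)
	for i := range seq {
		seq[i] = int32(r.Intn(hot))
	}
	return seq
}

// benchLookup returns a benchmark that looks up the keys in seq over
// and over, wrapping around at the end of the sequence.
func benchLookup(m map[int32]int32, seq []int32) func(b *testing.B) {
	return func(b *testing.B) {
		var sink int32
		j := 0
		for i := 0; i < b.N; i++ {
			sink += m[seq[j]]
			j++
			if j == len(seq) {
				j = 0
			}
		}
		_ = sink
	}
}

func main() {
	testing.Init() // needed when calling testing.Benchmark outside "go test"
	m := make(map[int32]int32, 6)
	for k := int32(0); k < 6; k++ {
		m[k] = k
	}
	for _, hot := range []int{1, 3, 6} {
		res := testing.Benchmark(benchLookup(m, lookupSeq(hot, 1<<12, 1)))
		fmt.Printf("len=6 hot=%d: %v\n", hot, res)
	}
}
```

Drawing a long random sequence, rather than cycling through a short shuffled list, is a deliberate choice here so that the pattern of lookups is less likely to become learnable by the branch predictor.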
CC @prattmic
Comment From: gabyhelp
Related Issues
- runtime: some possible SwissTable map benchmark improvements #70700
- runtime: map cold cache improvements #70835
Related Code Changes
- internal/runtime/maps: speed up small map lookups ~1.7x for unpredictable keys
- internal/runtime/maps: search group using simple for loop
- internal/runtime/maps: simplify small group lookup
- internal/runtime/maps: use match to skip non-full slots in iteration
- runtime: vectorized map bucket lookup
- runtime: optimize small map lookups with int64 keys
- internal/runtime/maps: initial swiss table map implementation
- runtime: exit early when scanning map buckets
Comment From: thepudds
If we instead compare master from a week or so ago (8c3e391573) vs. just making the map lookup behavior change (but without changing benchmark behavior), we get:
│ master-8c3e391573 │ just-fix-with-old-bmarks │
│ sec/op │ sec/op vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4 22.75n ± 0% 21.81n ± 0% -4.11% (p=0.001 n=20)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4 20.02n ± 24% 17.34n ± 2% -13.39% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 20.87n ± 0% 21.76n ± 0% +4.26% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 21.22n ± 0% 21.72n ± 0% +2.33% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 21.72n ± 0% 21.71n ± 0% ~ (p=0.522 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 22.06n ± 0% 21.72n ± 0% -1.54% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 22.42n ± 0% 21.73n ± 0% -3.06% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 22.75n ± 0% 21.74n ± 0% -4.44% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 23.00n ± 0% 21.73n ± 0% -5.52% (p=0.001 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 23.22n ± 0% 21.73n ± 0% -6.40% (p=0.006 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 19.95n ± 0% 17.31n ± 0% -13.21% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 19.96n ± 0% 17.32n ± 0% -13.25% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 19.95n ± 0% 17.31n ± 0% -13.23% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 19.99n ± 0% 17.30n ± 0% -13.43% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 19.96n ± 0% 17.32n ± 0% -13.23% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 19.97n ± 0% 17.33n ± 2% -13.22% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 19.98n ± 20% 17.83n ± 13% -10.76% (p=0.012 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 22.65n ± 12% 17.34n ± 14% -23.47% (p=0.000 n=20)
geomean 21.21n 19.44n -8.36%
Note that both columns from this run are generally slower overall than the first benchmarks presented above. This is likely for a couple of reasons, including that the benchmarks here still have a somewhat expensive mod operation in the benchmark code itself.
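For illustration, here are two ways a benchmark loop might cycle through its keys. This is a hedged sketch with hypothetical function names, not the actual benchmark code: the first variant performs an integer modulo on every iteration, while the second visits the same keys in the same order using a compare-and-reset.

```go
package main

import "fmt"

// sumMod cycles through keys using a modulo on every iteration.
func sumMod(m map[int32]int32, keys []int32, n int) (sum int32) {
	for i := 0; i < n; i++ {
		sum += m[keys[i%len(keys)]]
	}
	return sum
}

// sumWrap visits the same keys in the same order, but replaces the
// modulo with a separate index that is reset when it reaches the end.
func sumWrap(m map[int32]int32, keys []int32, n int) (sum int32) {
	j := 0
	for i := 0; i < n; i++ {
		sum += m[keys[j]]
		j++
		if j == len(keys) {
			j = 0
		}
	}
	return sum
}

func main() {
	m := map[int32]int32{0: 1, 1: 2, 2: 3}
	keys := []int32{2, 0, 1}
	fmt.Println(sumMod(m, keys, 10) == sumWrap(m, keys, 10))
}
```

Because the length of `keys` is not a compile-time constant, the modulo generally cannot be reduced to a cheap mask, so its cost is added to every measured lookup.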
The most interesting change here from the top comment might be MapSmallAccessHit/Key=smallType/Elem=int32/len=2, which is a -28.98% win above vs. a +2.33% loss here. (In the older flavor of the benchmarks used here, two keys are looked up repeatedly in exactly the same order, which is presumably predictable, whereas in the run in the top comment the two keys are randomly shuffled and not predictable. It seems at least plausible that this explains the difference, with a predictable shorter loop in the old version doing better.)
Also, I initially poked at the implementation in master a bit with perf on Linux, but have not yet had a chance to do so since I started making changes... And in general, please read any theorizing in either of these two comments as best guesses based on the results so far. 😅
Comment From: thepudds
Finally, still using the old benchmarks (with predictable keys), if we compare master as of a week or so ago with and without the Swiss Table enabled, we can see that on these benchmarks the default Swiss Table in master does seem slower compared to the old runtime map (geomean of +9.31% worse).
In other words, there is a decent chance that master has a performance regression in these lookup benchmarks compared to Go 1.23, though I did not check Go 1.23 directly.
│ disable-swissmap │ master-8c3e391573 │
│ sec/op │ sec/op vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4 20.94n ± 0% 22.75n ± 0% +8.65% (p=0.000 n=20)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4 18.79n ± 2% 20.02n ± 24% +6.52% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 19.68n ± 0% 20.87n ± 0% +6.05% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 19.92n ± 0% 21.22n ± 0% +6.55% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 20.09n ± 0% 21.72n ± 0% +8.11% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 20.38n ± 0% 22.06n ± 0% +8.24% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 20.57n ± 0% 22.42n ± 0% +8.99% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 21.04n ± 0% 22.75n ± 0% +8.13% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 21.37n ± 2% 23.00n ± 0% +7.65% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 21.78n ± 0% 23.22n ± 0% +6.61% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 16.77n ± 0% 19.95n ± 0% +18.93% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 16.99n ± 0% 19.96n ± 0% +17.51% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 17.22n ± 0% 19.95n ± 0% +15.85% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 17.79n ± 0% 19.99n ± 0% +12.37% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 18.25n ± 0% 19.96n ± 0% +9.40% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 18.77n ± 1% 19.97n ± 0% +6.42% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 19.36n ± 13% 19.98n ± 20% +3.20% (p=0.021 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 20.63n ± 1% 22.65n ± 12% ~ (p=0.180 n=20)
geomean 19.40n 21.21n +9.31%
(Those are still the slower version of the benchmarks, which include the somewhat expensive mod operation in the benchmark code, so the denominator of the percent change here is larger than in our first set of benchmarks above.)
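The effect of a larger denominator can be seen with a small worked example. The numbers below are hypothetical, chosen only to show the arithmetic, and are not measurements from this issue: the same 1ns absolute slowdown reads as 10% on a 10ns base but only about 5.6% once 8ns of fixed benchmark overhead is added to both sides.

```go
package main

import "fmt"

// pctChange is the relative change from base to next, in percent.
func pctChange(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// Hypothetical numbers, not measurements: a lookup that goes
	// from 10ns to 11ns, with and without 8ns of fixed
	// per-iteration overhead in the benchmark loop.
	const base, next, overhead = 10.0, 11.0, 8.0
	fmt.Printf("without overhead: %+.2f%%\n", pctChange(base, next))
	fmt.Printf("with overhead:    %+.2f%%\n", pctChange(base+overhead, next+overhead))
}
```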
I'll add a caveat that I was juggling different machines and different versions, and apologies in advance if I crossed some wires here.
Comment From: gopherbot
Change https://go.dev/cl/634396 mentions this issue: internal/runtime/maps: speed up small map lookups ~1.7x for unpredictable keys
Comment From: prattmic
Thanks for the thorough investigation! I can indeed reproduce this using the existing benchmarks as you did in https://github.com/golang/go/issues/70849#issuecomment-2543374547.
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
│ tip │ matchH2 │ matchH2-v3 │
│ sec/op │ sec/op vs base │ sec/op vs base │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-12 17.48n ± 1% 18.09n ± 1% +3.43% (p=0.002 n=6) 17.99n ± 1% +2.86% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-12 17.88n ± 1% 18.07n ± 9% ~ (p=0.065 n=6) 17.79n ± 1% ~ (p=0.455 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-12 18.21n ± 2% 18.22n ± 3% ~ (p=0.900 n=6) 17.94n ± 2% -1.46% (p=0.048 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-12 18.60n ± 2% 18.11n ± 3% -2.63% (p=0.009 n=6) 18.00n ± 5% -3.20% (p=0.041 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-12 19.14n ± 3% 18.10n ± 2% -5.46% (p=0.002 n=6) 17.91n ± 1% -6.43% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-12 19.53n ± 3% 18.15n ± 28% ~ (p=0.058 n=6) 18.07n ± 4% -7.50% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-12 19.34n ± 2% 17.96n ± 19% ~ (p=0.394 n=6) 18.08n ± 21% ~ (p=0.065 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-12 19.64n ± 3% 18.22n ± 16% ~ (p=0.065 n=6) 18.29n ± 15% ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-12 17.61n ± 2% 17.95n ± 3% +1.96% (p=0.015 n=6) 17.93n ± 4% +1.79% (p=0.017 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-12 17.82n ± 6% 18.11n ± 3% ~ (p=0.240 n=6) 18.12n ± 2% ~ (p=0.310 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-12 18.38n ± 2% 18.34n ± 2% ~ (p=0.669 n=6) 18.11n ± 2% -1.47% (p=0.015 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-12 18.73n ± 2% 18.21n ± 2% -2.80% (p=0.002 n=6) 17.93n ± 1% -4.30% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-12 18.86n ± 3% 18.25n ± 2% -3.24% (p=0.002 n=6) 18.07n ± 2% -4.19% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-12 19.05n ± 2% 18.46n ± 27% ~ (p=0.394 n=6) 18.06n ± 27% ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-12 19.29n ± 1% 18.07n ± 1% -6.35% (p=0.002 n=6) 19.54n ± 10% ~ (p=1.000 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-12 19.59n ± 2% 18.15n ± 15% ~ (p=0.065 n=6) 18.21n ± 6% -7.02% (p=0.004 n=6)
geomean 18.68n 18.15n -2.83% 18.12n -3.00%
matchH2 is a change equivalent to your CL. matchH2-v3 is the same with GOAMD64=v3 (slightly different SIMD instruction selection).
For reference, the same comparison against the old map implementation:
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
│ noswiss │ tip │ matchH2 │ matchH2-v3 │
│ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-12 16.68n ± 2% 17.48n ± 1% +4.79% (p=0.002 n=6) 18.09n ± 1% +8.39% (p=0.002 n=6) 17.99n ± 1% +7.79% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-12 17.04n ± 3% 17.88n ± 1% +4.87% (p=0.002 n=6) 18.07n ± 9% +5.98% (p=0.002 n=6) 17.79n ± 1% +4.34% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-12 17.16n ± 1% 18.21n ± 2% +6.09% (p=0.002 n=6) 18.22n ± 3% +6.18% (p=0.002 n=6) 17.94n ± 2% +4.55% (p=0.002 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-12 17.58n ± 3% 18.60n ± 2% +5.80% (p=0.002 n=6) 18.11n ± 3% +3.01% (p=0.026 n=6) 18.00n ± 5% +2.42% (p=0.026 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-12 18.27n ± 2% 19.14n ± 3% +4.76% (p=0.002 n=6) 18.10n ± 2% ~ (p=0.589 n=6) 17.91n ± 1% ~ (p=0.093 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-12 18.64n ± 3% 19.53n ± 3% +4.83% (p=0.004 n=6) 18.15n ± 28% ~ (p=0.093 n=6) 18.07n ± 4% ~ (p=0.063 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-12 18.72n ± 6% 19.34n ± 2% ~ (p=0.065 n=6) 17.96n ± 19% ~ (p=0.394 n=6) 18.08n ± 21% ~ (p=0.065 n=6)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-12 19.32n ± 3% 19.64n ± 3% ~ (p=0.240 n=6) 18.22n ± 16% ~ (p=0.065 n=6) 18.29n ± 15% ~ (p=0.065 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-12 16.80n ± 5% 17.61n ± 2% +4.82% (p=0.013 n=6) 17.95n ± 3% +6.88% (p=0.002 n=6) 17.93n ± 4% +6.70% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-12 17.16n ± 2% 17.82n ± 6% +3.88% (p=0.002 n=6) 18.11n ± 3% +5.60% (p=0.002 n=6) 18.12n ± 2% +5.65% (p=0.002 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-12 17.56n ± 4% 18.38n ± 2% +4.64% (p=0.002 n=6) 18.34n ± 2% +4.44% (p=0.004 n=6) 18.11n ± 2% +3.10% (p=0.026 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-12 17.80n ± 1% 18.73n ± 2% +5.22% (p=0.002 n=6) 18.21n ± 2% +2.27% (p=0.015 n=6) 17.93n ± 1% ~ (p=0.180 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-12 18.21n ± 1% 18.86n ± 3% +3.54% (p=0.002 n=6) 18.25n ± 2% ~ (p=0.416 n=6) 18.07n ± 2% ~ (p=0.699 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-12 18.61n ± 8% 19.05n ± 2% ~ (p=0.331 n=6) 18.46n ± 27% ~ (p=0.974 n=6) 18.06n ± 27% ~ (p=0.132 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-12 18.61n ± 2% 19.29n ± 1% +3.63% (p=0.004 n=6) 18.07n ± 1% -2.95% (p=0.002 n=6) 19.54n ± 10% ~ (p=1.000 n=6)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-12 19.20n ± 7% 19.59n ± 2% ~ (p=0.087 n=6) 18.15n ± 15% ~ (p=0.058 n=6) 18.21n ± 6% -5.11% (p=0.024 n=6)
geomean 17.94n 18.68n +4.14% 18.15n +1.19% 18.12n +1.02%
It's interesting since this linear scan was an improvement when I wrote https://go.dev/cl/611189, but a lot has changed since then so I'm not too surprised.
What CPU are you testing on? It's also interesting that you seem to see more extreme results than me. I'm on Intel(R) Xeon(R) W-2135 (Skylake).
I'll also try out your new benchmarks, I just haven't gotten around to it yet.
Comment From: thepudds
> What CPU are you testing on? It's also interesting that you seem to see more extreme results than me. I'm on Intel(R) Xeon(R) W-2135 (Skylake).
Hi @prattmic, sorry I missed that question, which came in while I was traveling for the holidays.
For the benchmark results in the comments above, this is what my stored copy of those benchmark results says for cpu:
goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
GCE console and gcc currently both say that benchmark VM is running on Cascade Lake.
Also, I probably should have mentioned that the results in the comments above are for GOAMD64=v4, though I'm not sure how much of a difference that might have made.
(The results in the comments above are from a cloud VM, where I used GOAMD64=v4. Locally on my laptop I was using GOAMD64=v3 and saw fairly similar but not identical improvements. I could dig up the laptop results, but my laptop was probably not as quiescent as it could have been, so if there is interest in seeing GOAMD64=v3 results, I'd probably just re-run on a cloud VM.)
I did not benchmark on arm or with SIMD disabled. I'm happy to do that if you think it would be useful, or to re-run benchmarks with GOAMD64=v3, and/or if you have any other suggestions for next steps.
(Also, my take is this change here is mostly independent of whether or not we tweak the benchmarks as suggested in #70700, though I'd also be OK to treat them more together if you think that's better).
Thanks!
edit: originally in this comment, I referenced the results in the CL, but meant to reference the results in the comments above. edited to fix.
Comment From: thepudds
Hi @prattmic, I ran the benchmarks on arm64 (GCE Axion), which means there is no SIMD in these results.
I ran both the new benchmarks (with mostly unpredictable keys) and the old benchmarks (with mostly predictable keys). I did not set GOARM64, which I think means it defaults to GOARM64=v8.0.
The short version is the arm64 results seem broadly similar to the amd64 GOAMD64=v4 results in the comments above, with a caveat for the arm64 results for small map access hit with predictable keys (discussed with second table below).
The observed improvement with the fix is better for the benchmarks with mostly unpredictable keys (-30.45% geomean for these arm64 results) than with mostly predictable keys (-16.23% geomean for these arm64 results), which I think is expected (and as discussed in the comments above and in the CL).
The first table here of arm64 results is comparing the new benchmarks (mostly unpredictable keys) without the fix vs. with the fix. (This is similar to the first table of results in the first comment above in https://github.com/golang/go/issues/70849#issue-2740178859, which was for amd64 with a -28.69% geomean, vs. a -30.45% geomean here).
goos: linux
goarch: arm64
pkg: runtime
│ no-fix-new-bmarks-arm │ fix-new-bmarks-arm │
│ sec/op │ sec/op vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4 17.610n ± 0% 9.331n ± 0% -47.01% (p=0.000 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=1-4 8.840n ± 2% 9.282n ± 0% +5.00% (p=0.001 n=25)
MapAccessHitHot/Key=smallType/Elem=int32/len=6/Hot=3-4 15.620n ± 1% 9.296n ± 1% -40.49% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4 8.555n ± 21% 5.817n ± 0% -32.00% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 8.821n ± 0% 9.296n ± 0% +5.38% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 13.040n ± 0% 9.297n ± 0% -28.70% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 15.360n ± 0% 9.287n ± 0% -39.54% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 15.950n ± 0% 9.291n ± 0% -41.75% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 16.710n ± 0% 9.286n ± 0% -44.43% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 17.010n ± 0% 9.327n ± 0% -45.17% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 17.640n ± 0% 9.336n ± 0% -47.07% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 18.140n ± 0% 9.379n ± 0% -48.30% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 8.488n ± 9% 6.795n ± 0% -19.95% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 8.473n ± 3% 6.795n ± 0% -19.80% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 8.444n ± 4% 6.796n ± 0% -19.52% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 9.307n ± 9% 6.796n ± 0% -26.98% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 8.487n ± 4% 6.797n ± 0% -19.91% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 8.750n ± 18% 6.796n ± 0% -22.33% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 9.510n ± 11% 6.795n ± 0% -28.55% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 9.336n ± 7% 7.853n ± 13% -15.88% (p=0.000 n=25)
geomean 11.61n 8.076n -30.45%
The next table here for arm64 results is comparing the old benchmarks (mostly predictable keys) without the fix vs. with the fix. (This is similar to the second table of results in the comments above in https://github.com/golang/go/issues/70849#issuecomment-2543374547, which was for amd64 with a -8.36% geomean, vs. a -16.23% geomean here).
For these results on arm64, the small map access hit with predictable keys (MapSmallAccessHit) is roughly +8% worse on average with the fix for the 8 different sizes, but the small map access miss (MapSmallAccessMiss) has a bigger improvement here compared to amd64.
goos: linux
goarch: arm64
pkg: runtime
│ master-8391579ece-arm │ just-fix-with-old-bmarks-arm │
│ sec/op │ sec/op vs base │
MapAccessHit/Key=smallType/Elem=int32/len=6-4 8.826n ± 4% 9.633n ± 0% +9.14% (p=0.000 n=25)
MapAccessMiss/Key=smallType/Elem=int32/len=6-4 8.703n ± 3% 5.615n ± 9% -35.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 8.433n ± 0% 9.092n ± 1% +7.81% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 8.427n ± 0% 9.095n ± 0% +7.93% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 8.522n ± 0% 9.652n ± 0% +13.26% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 8.466n ± 2% 9.099n ± 0% +7.48% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 8.681n ± 3% 9.573n ± 0% +10.28% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 8.759n ± 2% 9.568n ± 0% +9.24% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 8.913n ± 2% 9.563n ± 0% +7.29% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 8.917n ± 1% 9.118n ± 0% +2.25% (p=0.001 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 8.449n ± 3% 5.419n ± 0% -35.86% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 8.386n ± 0% 5.417n ± 0% -35.40% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 8.394n ± 0% 5.644n ± 0% -32.76% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 8.621n ± 5% 5.415n ± 0% -37.19% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 8.484n ± 4% 5.628n ± 0% -33.66% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 8.704n ± 3% 5.620n ± 0% -35.43% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 8.437n ± 3% 5.624n ± 8% -33.34% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 8.657n ± 3% 5.424n ± 7% -37.35% (p=0.000 n=25)
geomean 8.597n 7.202n -16.23%
Finally, a few days ago in my immediately prior comment https://github.com/golang/go/issues/70849#issuecomment-2704872725, I said "For the benchmark results in the CL..." and then described the benchmark hardware, but I meant to refer to the benchmark results in the issue comments above (rather than the results in the CL). I'll edit that comment to be clearer, and I'll also separately update the description in the CL.