Discovered a regression in sec/op of 6.16% for benchmark PendingDemotion10000-72 at bfc2095.
CC @thepudds @prattmic
This is a microbenchmark, so there might not be anything actionable here, but worth looking into.
Comment From: prattmic
github.com/ethereum/go-ethereum/core@v1.10.9, BenchmarkPendingDemotion10000: https://go-mod-viewer.appspot.com/github.com/ethereum/go-ethereum@v1.10.9/core/tx_pool_test.go#L2432
Comment From: thepudds
Hi @mknyszek, thanks for finding this. I will look at it likely tomorrow.
Comment From: thepudds
A quick & incomplete update: I am able to reproduce something in this ballpark on an arm64 VM (GCE Axion).
For the map in question, the synthetic benchmark repeatedly does 2 lookups in (I think) the same map, and the benchmark is set up so that the map always has 1 element and the lookup is always a hit. In other words, the key is likely very predictable.
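To make that access pattern concrete, here is a rough standalone sketch (not the ethereum code itself; the benchmark name and key value are just for illustration) of a 1-element map hit lookup done twice per iteration:

```go
package maps_test

import "testing"

// BenchmarkOneElementMapHit mimics the access pattern described above: a map
// that always holds exactly one element, with two hit lookups of the same key
// per iteration, so the key is highly predictable.
func BenchmarkOneElementMapHit(b *testing.B) {
	m := map[int64]int64{42: 1} // single-element map; lookups always hit
	var sink int64
	for i := 0; i < b.N; i++ {
		sink += m[42] // first lookup
		sink += m[42] // second lookup of the same key
	}
	_ = sink
}
```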
Here are some of the arm64 map benchmarks I ran back in March with my change (the snippet below is excerpted from my comment in https://github.com/golang/go/issues/70849#issuecomment-2709024108).
I suspect the ethereum benchmark is basically hitting the case shown on the first line below (the line with the <<<< added), which is a hit benchmark doing a lookup in a 1-element map, and hence the key is predictable. (See the comments in https://github.com/golang/go/issues/70849#issuecomment-2709024108 and the CL for a bit more on why that's "expected" performance; the short version is that the change helps unpredictable keys much more than predictable keys, and the performance spread between those effects is larger on arm64, which does not have the swisstable SIMD support yet.)
goos: linux
goarch: arm64
pkg: runtime
│ no-fix-new-bmarks-arm │ fix-new-bmarks-arm │
│ sec/op │ sec/op vs base │
[...]
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 <<<< 8.821n ± 0% 9.296n ± 0% +5.38% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 13.040n ± 0% 9.297n ± 0% -28.70% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 15.360n ± 0% 9.287n ± 0% -39.54% (p=0.000 n=25)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 15.950n ± 0% 9.291n ± 0% -41.75% (p=0.000 n=25)
[...]
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 8.488n ± 9% 6.795n ± 0% -19.95% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 8.473n ± 3% 6.795n ± 0% -19.80% (p=0.000 n=25)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 8.444n ± 4% 6.796n ± 0% -19.52% (p=0.000 n=25)
[...]
However, the ns/op delta in the ethereum benchmark, at least at first blush, seems larger than "expected" from my older arm64 benchmarks, and it includes more time spent in runtime.memequal in the faster vs. the slower commit for the ethereum benchmark (which I think is probably via the typ.Key.Equal check in the map code). So maybe there is a secondary effect on top of the first effect, or maybe something completely different is also going on.
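As a very rough illustration of the kind of check I mean (this is a simplified sketch with an invented function and signature, not the actual code in internal/runtime/maps), a non-SIMD small-map lookup can amount to scanning the few stored keys and comparing each one with the key type's equality function, which for byte-comparable keys is backed by runtime.memequal:

```go
package sketch

import "unsafe"

// lookupSmall is a simplified illustration only: scan the stored keys of a
// small map and compare each one with the key type's equality function
// (the typ.Key.Equal-style check mentioned above).
func lookupSmall(keys []unsafe.Pointer, key unsafe.Pointer, equal func(a, b unsafe.Pointer) bool) int {
	for i := range keys {
		if equal(keys[i], key) {
			return i // hit
		}
	}
	return -1 // miss
}
```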
In any event, just wanted to post a brief update.
I will look at this some more, including cross-checking whether what I wrote above is actually true. 😅
Comment From: thepudds
So having looked at this some more, what I said in my immediately prior comment continues to seem plausible (or at least, I haven't seen anything additional that contradicts that summary so far).
As I mentioned just above, it's "expected" that CL 634396 (bfc2095) has some slowdown for the 1-element map hit case on arm64, based on the benchmarks I ran at the time. (In short, it's because that case is a very predictable case for the code prior to that CL, and the larger challenge for the prior code was unpredictable cases. That CL did help other predictable cases on arm64, but the 1-element map hit case on arm64 was among its weakest cases. Separately, amd64 was a different and better story.)
However, as I also mentioned above, the delta reported here for the synthetic ethereum benchmark is larger than the arm64 change in the microbenchmarks for CL 634396 for this case: that was a ~0.7ns loss, whereas the ethereum benchmark here shows closer to a ~10ns loss. Each loop iteration in the ethereum PendingDemotion10000 benchmark seems to include two hit lookups of the same key in the same 1-element map.
It might be that the predictability of the 1-element map allowed the older code to better speculate and overlap other work, so the loss in PendingDemotion10000 is not simply 0.7ns + 0.7ns = 1.4ns, but something larger.
I have a CL up for consideration that seems to recover the loss in the ethereum benchmark; I ran it overnight via the pre-submit perf builders on arm64:
│ baseline │ experiment │
│ sec/op │ sec/op vs base │
PendingDemotion10000-72 105.50n ± 0% 95.16n ± 0% -9.81% (p=0.000 n=10)
I'll try to finish the writeup for that CL shortly, and I'll also post some more detailed benchmarks I ran in that CL, but I wanted to give a quick update here.
Comment From: mknyszek
Thank you for looking into this @thepudds! Just to be abundantly clear, this is not necessarily a super important benchmark, but one we happen to run and on which we noticed a regression. I appreciate the thorough investigation, and it's awesome that you found a way to possibly improve this one case. (I think the fact that you were able to improve upon it is strong evidence for your hypothesis.)
Depending on what the change looks like, we might be fine with shipping Go 1.25 as-is and landing your CL in Go 1.26. If it's simple enough, we may consider it for Go 1.25 as "polishing" this improvement; just to set expectations. (Considered according to the highly rigorous benchmark of "how we all feel about it," with the caveat that "we feel worse the closer we get to release." :P)
Comment From: gopherbot
Change https://go.dev/cl/680736 mentions this issue: internal/runtime/maps: optimize map access for single element non-specialized maps