Discovered a 7.57% sec/op regression in benchmark ReplaceShortcodeTokens-72 at commit 0028532.

Comment From: prattmic

github.com/gohugoio/hugo/hugolib@v0.88.1, BenchmarkReplaceShortcodeTokens

https://go-mod-viewer.appspot.com/github.com/gohugoio/hugo@v0.88.1/hugolib/shortcode_test.go#L772

Comment From: mknyszek

Aside from removing a system goroutine, this commit essentially only touches the unique package. I see no evidence in the code or in profiles of unique being used:

(pprof) top 20
Showing nodes accounting for 0.19s, 0.93% of 20.54s total
Dropped 7 nodes (cum <= 0.10s)
Showing top 20 nodes out of 254
      flat  flat%   sum%        cum   cum%
    -0.40s  1.95%  1.95%     -0.40s  1.95%  runtime.memmove
     0.29s  1.41%  0.54%      0.29s  1.41%  runtime.memclrNoHeapPointers
     0.21s  1.02%  0.49%      0.21s  1.02%  internal/bytealg.IndexByteString
     0.15s  0.73%  1.22%      0.17s  0.83%  runtime.lock2
    -0.13s  0.63%  0.58%     -0.12s  0.58%  runtime.findObject
    -0.13s  0.63% 0.049%     -0.11s  0.54%  runtime.sweepone
     0.11s  0.54%  0.49%      0.11s  0.54%  internal/runtime/atomic.(*Uint32).Add (inline)
    -0.10s  0.49%     0%      0.49s  2.39%  github.com/gohugoio/hugo/hugolib.replaceShortcodeTokens
     0.10s  0.49%  0.49%      0.52s  2.53%  runtime.mallocgcSmallNoscan
     0.10s  0.49%  0.97%      0.10s  0.49%  runtime.markBits.setMarked (inline)
     0.09s  0.44%  1.41%      0.09s  0.44%  runtime.futex
    -0.08s  0.39%  1.02%      0.14s  0.68%  bytes.Index
    -0.08s  0.39%  0.63%     -0.08s  0.39%  runtime.(*gcBitsArena).tryAlloc (inline)
     0.07s  0.34%  0.97%      0.07s  0.34%  runtime.(*mspan).init
    -0.07s  0.34%  0.63%     -0.12s  0.58%  runtime.gcDrain
     0.06s  0.29%  0.93%      0.02s 0.097%  runtime.(*mheap).freeSpanLocked
    -0.05s  0.24%  0.68%     -0.05s  0.24%  internal/runtime/atomic.(*Int64).Load (inline)
     0.05s  0.24%  0.93%      0.05s  0.24%  runtime.(*lfstack).pop (inline)
     0.05s  0.24%  1.17%      0.28s  1.36%  runtime.(*mheap).allocSpan
    -0.05s  0.24%  0.93%     -0.03s  0.15%  runtime.(*sweepLocked).sweep

I'm suspicious this is a binary layout issue. I will try the linker's -randlayout flag.

Comment From: mknyszek

Hm, randlayout doesn't seem to make a difference.

Comment From: mknyszek

(I should say, I can reproduce the benchmark results.)

Comment From: mknyszek

Hm... I began to suspect some kind of ballast effect, and while that's not quite happening, there is something weird going on in this benchmark. Below are gctrace results for one benchmark iteration.

Before:

gc 36 @6.529s 0%: 0.058+0.69+0.010 ms clock, 0.93+0/2.0/3.7+0.16 ms cpu, 2->2->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 37 @6.531s 0%: 0.050+0.61+0.090 ms clock, 0.81+0/2.0/3.8+1.4 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 38 @6.532s 0%: 0.11+0.50+0.020 ms clock, 1.7+0.35/1.7/3.4+0.33 ms cpu, 4->5->5 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 39 @6.533s 0%: 0.046+0.77+0.095 ms clock, 0.74+0.40/2.0/3.1+1.5 ms cpu, 9->10->10 MB, 10 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 40 @6.535s 0%: 0.040+0.85+0.088 ms clock, 0.65+0.33/2.1/3.1+1.4 ms cpu, 19->22->22 MB, 21 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 41 @6.538s 0%: 0.041+1.0+0.11 ms clock, 0.66+0.19/2.2/3.1+1.8 ms cpu, 38->45->45 MB, 45 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 42 @6.543s 0%: 0.045+1.1+0.023 ms clock, 0.72+0.13/3.0/3.5+0.38 ms cpu, 78->86->86 MB, 91 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 43 @6.565s 0%: 0.042+0.71+0.085 ms clock, 0.68+0/2.0/3.4+1.3 ms cpu, 157->157->2 MB, 173 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 44 @6.570s 0%: 0.032+0.69+0.090 ms clock, 0.51+0/2.0/2.9+1.4 ms cpu, 213->213->213 MB, 213 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 45 @6.592s 0%: 0.041+2.6+0.022 ms clock, 0.67+0.12/9.7/24+0.35 ms cpu, 421->444->441 MB, 427 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 46 @6.625s 0%: 0.044+3.7+0.022 ms clock, 0.71+0.33/11/27+0.36 ms cpu, 752->783->779 MB, 884 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 47 @6.685s 0%: 0.051+4.8+0.023 ms clock, 0.82+0.37/18/29+0.37 ms cpu, 1327->1367->1361 MB, 1559 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 48 @6.816s 0%: 0.047+5.2+0.18 ms clock, 0.76+0.39/20/47+2.9 ms cpu, 2622->2667->2653 MB, 2723 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 49 @7.059s 0%: 0.044+7.8+0.029 ms clock, 0.71+0.45/29/60+0.47 ms cpu, 5162->5234->5207 MB, 5307 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 50 @7.710s 0%: 0.047+13+0.026 ms clock, 0.76+0.11/51/91+0.43 ms cpu, 10201->10218->9977 MB, 10414 MB goal, 0 MB stacks, 0 MB globals, 16 P

After:

gc 37 @6.446s 0%: 0.049+0.61+0.010 ms clock, 0.78+0/2.0/4.9+0.17 ms cpu, 2->2->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 38 @6.447s 0%: 0.048+0.73+0.090 ms clock, 0.77+0/2.1/4.5+1.4 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 39 @6.448s 0%: 0.045+0.62+0.086 ms clock, 0.72+0.29/1.8/3.6+1.3 ms cpu, 4->6->6 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 40 @6.449s 0%: 0.042+0.74+0.024 ms clock, 0.68+0.34/2.1/3.9+0.39 ms cpu, 11->14->14 MB, 13 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 41 @6.452s 0%: 0.044+1.0+0.079 ms clock, 0.70+0.20/2.2/3.5+1.2 ms cpu, 24->30->29 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 42 @6.455s 0%: 0.045+1.0+0.087 ms clock, 0.72+0.22/2.1/4.4+1.3 ms cpu, 51->57->57 MB, 60 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 43 @6.462s 0%: 0.046+1.5+0.022 ms clock, 0.74+0.17/3.3/4.1+0.35 ms cpu, 101->112->111 MB, 114 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 44 @6.482s 0%: 0.041+0.73+0.077 ms clock, 0.65+0/2.2/4.6+1.2 ms cpu, 157->157->2 MB, 223 MB goal, 0 MB stacks, 0 MB globals, 16 P (forced)
gc 45 @6.487s 0%: 0.033+0.72+0.099 ms clock, 0.53+0/2.0/3.8+1.5 ms cpu, 211->211->211 MB, 211 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 46 @6.509s 0%: 0.045+3.0+0.023 ms clock, 0.73+0.10/10/24+0.38 ms cpu, 414->439->436 MB, 423 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 47 @6.542s 0%: 0.043+3.6+0.027 ms clock, 0.69+0.30/13/27+0.43 ms cpu, 744->772->768 MB, 874 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 48 @6.603s 0%: 0.045+5.0+0.023 ms clock, 0.73+0.25/17/29+0.37 ms cpu, 1310->1352->1346 MB, 1538 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 49 @6.706s 0%: 0.042+5.8+0.024 ms clock, 0.68+0.24/21/40+0.38 ms cpu, 2292->2342->2332 MB, 2692 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 50 @6.918s 0%: 0.049+8.1+0.025 ms clock, 0.78+0.46/29/53+0.40 ms cpu, 4528->4601->4577 MB, 4664 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 51 @7.357s 0%: 0.047+14+0.024 ms clock, 0.75+0.47/57/72+0.39 ms cpu, 8959->9098->9050 MB, 9155 MB goal, 0 MB stacks, 0 MB globals, 16 P

The results are generally the same, including the same number of GC cycles... but the benchmark is weird in that the heap just grows monotonically within each iteration. Notably, however, in "after" there's about 1 GiB less total heap by the end of the iteration. Other than that, I don't see any obvious differences in GC behavior.

The actual benchmarked code is very... simple: https://github.com/satya-nutella/hugo/blob/f518b4f71e1a61b09d660b5c284121ebf3b3b86b/hugolib/shortcode.go#L590. There's an append that might explain the monotonically growing heap, but I don't see why the implicated commit would change anything here.

Comment From: mknyszek

Also, weirdly, the regression only happens on arm64.

Comment From: mknyszek

If I pull the code out of the package, the differences seem to mostly go away... It seems like maybe the test package's initialization, or perhaps some other dependent package's initialization, is affecting the heap? Giving the GC slightly more or less work in the latter case?

Comment From: mknyszek

I still haven't gotten to the bottom of this, but the weirdness above makes me think this is not going to be some serious issue with the unique package, which is basically all the commit touches, but rather some second-order effect of the surrounding package, hugolib. I am going to put this aside for now and focus my time on some other issues.