Green Tea 🍵 Garbage Collector

Authors: Michael Knyszek, Austin Clements

Updated: 2 May 2025

This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce a more detailed design document once we're ready to commit to a design. For now, Green Tea is available as an experiment at tip-of-tree and is planned to be available as an opt-in experiment in Go 1.25 once it releases. We encourage teams to try it out.

Introduction

Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts place more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.

Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.

As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.

In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,¹ is at least memory-aware, in that it endeavors to process objects close to one another together.

This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.

Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.

Design

The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned, so that when the block is finally dequeued, it’s likely that more than one object in that block can be scanned. This, in turn, improves locality of memory access and better amortizes per-scan costs.

Prototype implementation

In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.

A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.

When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued, so that a span is enqueued at most once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.
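
As a rough, single-threaded Go sketch of this bookkeeping (all names here, such as `spanMeta`, `markSmall`, and `scanSpan`, are invented for illustration; the real runtime uses atomic bit operations and different data structures):

```go
package sketch

import (
	"math/bits"
	"sync/atomic"
)

// spanQueue and scanObject are stand-ins for the scan queue and the
// per-object scan routine.
type spanQueue struct{ spans []*spanMeta }

func (q *spanQueue) push(s *spanMeta) { q.spans = append(q.spans, s) }

func scanObject(addr uintptr) { /* follow the object's pointers; omitted */ }

// spanMeta holds the per-span mark state: one gray bit and one black
// bit per object slot, plus a flag saying whether the span is queued.
type spanMeta struct {
	base     uintptr     // span start address, 8 KiB aligned
	elemSize uintptr     // object size for this span's size class
	gray     []uint64    // one bit per object slot: queued to be scanned
	black    []uint64    // one bit per object slot: already scanned
	enqueued atomic.Bool // span is currently on a scan queue
}

// markSmall sets the object's gray bit and enqueues the span if it
// isn't queued already, so a span is enqueued at most once at a time.
func markSmall(s *spanMeta, obj uintptr, q *spanQueue) {
	i := (obj - s.base) / s.elemSize
	w, bit := i/64, uint64(1)<<(i%64)
	if s.gray[w]&bit != 0 {
		return // already gray (or black)
	}
	s.gray[w] |= bit
	if s.enqueued.CompareAndSwap(false, true) {
		q.push(s)
	}
}

// scanSpan scans every object whose gray bit is set but whose black
// bit is not, then copies the gray bits into the black bits.
func scanSpan(s *spanMeta) {
	s.enqueued.Store(false)
	for w := range s.gray {
		toScan := s.gray[w] &^ s.black[w]
		s.black[w] = s.gray[w]
		for toScan != 0 {
			i := bits.TrailingZeros64(toScan)
			toScan &^= 1 << i
			scanObject(s.base + uintptr(w*64+i)*s.elemSize)
		}
	}
}
```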

Limiting scope to small objects

The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.

The choice of which algorithm to use is made when scanning encounters a pointer. The span allocator maintains a bitmap with one bit for each 8 KiB page in the heap to indicate whether that page is backed by a small object span. The footprint of this fits easily into cache even for very large heaps, and contention is extremely low.

Since small object spans are always 8 KiB large and 8 KiB aligned, once the scanner knows the target of a pointer is in a small object span, it can use simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads that seriously harm performance.
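
For illustration, here is a sketch of the page-bitmap check and the address arithmetic this enables. The names (`smallSpanPages`, `heapBase`, `isSmallObjectSpan`, `spanBaseAndIndex`) are invented, and the real allocator's metadata layout differs:

```go
package sketch

// pageSize is the size and alignment of small-object spans: 8 KiB.
const pageSize = 8 << 10

// smallSpanPages is a hypothetical bitmap with one bit per 8 KiB heap
// page, set when that page is backed by a small-object span. heapBase
// is the (assumed) lowest heap address. Both stand in for the real
// allocator's metadata.
var (
	smallSpanPages []uint64
	heapBase       uintptr
)

// isSmallObjectSpan reports whether p points into a small-object span,
// using only the bitmap: no per-object or per-span loads.
func isSmallObjectSpan(p uintptr) bool {
	page := (p - heapBase) / pageSize
	return smallSpanPages[page/64]&(1<<(page%64)) != 0
}

// spanBaseAndIndex locates the enclosing 8 KiB span and the object
// slot within it by pure address arithmetic, given the span's object
// size (which would come from the span's own metadata).
func spanBaseAndIndex(p, elemSize uintptr) (base, index uintptr) {
	base = p &^ (pageSize - 1) // round down to the 8 KiB boundary
	index = (p - base) / elemSize
	return base, index
}
```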

Work distribution

Go’s current garbage collector distributes work across scanners by having each scanner maintain a local fixed-sized stack of object pointers. However, in order to ensure parallelism, each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.

The prototype implementation has a separate queue dedicated to spans, modeled on the distributed work-stealing runqueues used by the goroutine scheduler. The ability to steal work directly from other workers means less contention on global lists. Furthermore, by queuing spans instead of individual objects, there are far fewer items to queue and thus inherently lower contention on the queues.
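
A simplified sketch of this distribution scheme, reusing the `spanMeta` type from the earlier sketch. The real queues are lock-free rings like the scheduler's runqueues; this illustration uses a mutex per worker for brevity and dequeues in FIFO order, the policy discussed below:

```go
package sketch

import "sync"

// workerQueue is a per-worker span queue in the spirit of the
// scheduler's runqueues.
type workerQueue struct {
	mu    sync.Mutex
	local []*spanMeta
}

// globalList is the shared fallback list.
type globalList struct {
	mu    sync.Mutex
	spans []*spanMeta
}

// pop returns the next span to scan: local work first, then half of
// another worker's queue, then the global list. It returns nil when
// no work is found. others should not include w itself.
func (w *workerQueue) pop(others []*workerQueue, g *globalList) *spanMeta {
	// Local work first.
	w.mu.Lock()
	if len(w.local) > 0 {
		s := w.local[0]
		w.local = w.local[1:]
		w.mu.Unlock()
		return s
	}
	w.mu.Unlock()

	// Steal half of a victim's queue.
	for _, v := range others {
		v.mu.Lock()
		if n := len(v.local); n > 0 {
			stolen := append([]*spanMeta(nil), v.local[n/2:]...)
			v.local = v.local[:n/2]
			v.mu.Unlock()
			w.mu.Lock()
			w.local = append(w.local, stolen[1:]...)
			w.mu.Unlock()
			return stolen[0]
		}
		v.mu.Unlock()
	}

	// Fall back to the shared list.
	g.mu.Lock()
	defer g.mu.Unlock()
	if len(g.spans) > 0 {
		s := g.spans[0]
		g.spans = g.spans[1:]
		return s
	}
	return nil
}
```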

Span work may be ordered in several different ways. We explored several policies, including FIFO, LIFO, sparsest-first, densest-first, random, and address-ordered. FIFO turned out to accumulate the highest average density of objects to scan per span by the time the span was dequeued for scanning.

Single-object scan optimization

If a span has only a single object to scan when it is dequeued, the new algorithm will have done more work to handle that single object than the current, object-centric algorithm does.

To bring the performance of the single-object-per-span case more in line with the current marking algorithm, we apply two tricks. First, we track the object that was marked when the span was enqueued. This object becomes the span's representative until the span is scanned. Next, we add a hit flag to the span that indicates an object was marked while the span was queued; that is, that at least two objects are marked. When scanning a span, if the hit flag is not set, then the garbage collector can directly scan the span’s representative, instead of processing the entire span.
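
Continuing the earlier sketch, here is a hypothetical extension with the representative and hit flag (again, names and structure are illustrative rather than the runtime's):

```go
package sketch

import "sync/atomic"

// spanMeta2 extends the earlier spanMeta sketch with the fields for
// the single-object fast path.
type spanMeta2 struct {
	spanMeta
	rep uintptr     // object marked when the span was enqueued
	hit atomic.Bool // another object was marked while the span was queued
}

type spanQueue2 struct{ spans []*spanMeta2 }

func (q *spanQueue2) push(s *spanMeta2) { q.spans = append(q.spans, s) }

// markSmall2 is markSmall plus the representative/hit bookkeeping.
func markSmall2(s *spanMeta2, obj uintptr, q *spanQueue2) {
	i := (obj - s.base) / s.elemSize
	w, bit := i/64, uint64(1)<<(i%64)
	if s.gray[w]&bit != 0 {
		return // already gray (or black)
	}
	s.gray[w] |= bit
	if s.enqueued.CompareAndSwap(false, true) {
		s.rep = obj // first object: becomes the representative
		s.hit.Store(false)
		q.push(s)
	} else {
		s.hit.Store(true) // at least two objects are now marked
	}
}

// scanSpan2 takes the single-object fast path when only the
// representative was marked by the time the span is dequeued.
func scanSpan2(s *spanMeta2) {
	s.enqueued.Store(false)
	if !s.hit.Load() {
		i := (s.rep - s.base) / s.elemSize
		s.black[i/64] |= uint64(1) << (i % 64)
		scanObject(s.rep) // scan just this object; skip the bitmap walk
		return
	}
	scanSpan(&s.spanMeta) // dense case: walk gray &^ black as before
}
```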

Prototype evaluation

We evaluated the prototype implementation across a variety of benchmarks, on low-CPU-count, high-CPU-count, amd64, and arm64 Linux virtual machines. The rest of this section summarizes the results, focusing primarily on the differences in garbage collection CPU cost.

In select GC-heavy microbenchmarks (garbage, from golang.org/x/benchmarks, and binary-trees Go #2, from the Computer Language Benchmarks Game), depending on core count, we observed anywhere from a 10–50% reduction in GC CPU costs compared to the existing Go GC. The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation. Furthermore, the number of L1 and L2 cache misses was reduced by half in these benchmarks.

On our bent and sweet benchmark suites, the results were more varied: positive overall, but a mix of improvements and regressions.

Most benchmarks were either unaffected by the changes to the garbage collector, or regressed or improved solely due to changes that had little to do with the garbage collector, such as code alignment changes. Some benchmarks regressed even though less CPU time is spent in the garbage collector. One reason is that the garbage collector's mark phase is active for less time, producing less floating garbage, which acts as a ballast in some benchmarks. Another is that less time spent in the garbage collector means more time spent in other scalability bottlenecks, either in the runtime or in user code, leading to a net apparent regression.

The Go compiler benchmarks inconsistently show a very slight regression (about 0.5%). Given the magnitude and inconsistency of the regression, these benchmarks appear to be rather insensitive to this change. One hypothesis is that the occasional regression is due to an out-of-date PGO profile, but this remains to be investigated.

The tile38 benchmark shows substantial improvements across throughput, latency, and memory use. In addition, it showed a 35% reduction in garbage collection overheads. This benchmark queries a local instance of a Tile38 in-memory geospatial database pre-seeded with data. Most of the heap consists of a high-fanout tree, so Green Tea is able to quickly generate large amounts of work and high densities.

The bleve-index benchmark has a heap topology that is quite difficult for Green Tea, though performance overall is a wash. Most of the heap consists of a low-fanout binary tree that is rapidly mutated by the benchmark. Green Tea struggles to generate locality in this benchmark, and half of all span scans only scan a single object. Our current hypothesis is that because of frequent tree rotations, the tree structure itself becomes shuffled across a large heap (100+ MiB). In contrast, the binary-trees benchmark does not perform rotations, so the tree layout retains the good locality of its initial allocations. This suggests that Green Tea has good locality when the application itself has good locality, unlike the current Go GC; but Green Tea, unsurprisingly, can’t create locality out of nothing.

Both Linux perf and CPU profiles in the 16-core amd64 environment indicate a small ~2% regression in garbage collection overhead. Overall, the single-object scan optimization was integral to making this benchmark perform well in the 16-core amd64 environment. In the 72- and 88-core environments we see a significant improvement, due to the design's improved many-core scalability. There remains a small overall regression in the 16-core arm64 environment that still needs to be investigated.

Future work

SIMD-accelerated scanning kernels

Scanning memory in larger, contiguous blocks unlocks the ability to apply SIMD to scanning small objects in the garbage collector. The core idea is to generate a unique scanning kernel for each size class and use SIMD bit-manipulation and permutation instructions to load, mask, swizzle, pack, and enqueue pointers. The regularity of object layout within a single span and Go’s packed representation of pointer/scalar metadata for small objects both play a major role in making this feasible.

Austin Clements developed prototype AVX512-based scanning kernels that reduce garbage collection overheads by another 15–20% in the benchmarks where we already saw improvements. The prototype implementation does not currently use these kernels because they only apply to a small subset of objects at this point in time.

These SIMD kernels require a higher density of objects to scan in order to outperform scanning objects individually within a span. The kernels are still being developed, so the prototype does not use them by default; when they are enabled, SIMD scanning is only used once a span reaches a minimum density threshold.

Concentrator network

Austin's original design for Green Tea used a sorting network called the concentrator network to achieve the even higher pointer density required by SIMD-based scanning, and to generate locality even for metadata operations like setting gray bits. The network was carefully designed to minimize queuing costs, so individual pointers could still be processed efficiently when sufficient scan density was unavailable.

There are two main reasons we did not pursue this direction in the short term. First, we found that even very low density can produce good results, especially with the single-object scan optimization. Second, the concentrator network is more complex to implement, as it is a greater departure from the existing algorithm. However, this design is still an avenue we plan to explore, since it is far more general and tunable.

Acknowledgements

Credit to Austin Clements for the inception of the algorithm and initial prototyping. (They wrote down the key idea in 2018!)

Credit to Yves Vandriessche from Intel for providing many microarchitectural insights that were vital to making this design viable. Many of his suggestions were applied to the core scan loop, including proper prefetching, batching subsequent pointer loads to hide their memory latency, and simpler iteration over object pointers.

Comment From: gopherbot

Change https://go.dev/cl/658036 mentions this issue: runtime: mark and scan small objects in whole spans [green tea]

Comment From: mknyszek

How to try it out

First,

go install golang.org/dl/go1.25rc1@latest

Then build your program with GOEXPERIMENT=greenteagc go1.25rc1 build.

Sending feedback

Please focus your attention on whole programs. Microbenchmarks tend to be poor representatives of garbage collection behavior in real programs. Large changes in the foundations such as these can move the needle for all sorts of reasons, unrelated to (or even inversely related to) how efficient the garbage collector actually is.

If you encounter a situation where Green Tea isn't working out for you, we'd appreciate it if you could share some details with us; they would be really helpful moving forward. (This is also helpful if it is working for you.)

  1. The platform you're running on.
  2. The CPU model (or if in the cloud, the VM type) you're running your code on.
  3. Excerpts from stderr of your program running with GODEBUG=gctrace=2, both with and without Green Tea.
  4. A CPU profile of your program running with and without Green Tea.
  5. An execution trace capturing a few complete GC cycles, with and without Green Tea (a few seconds is usually enough).

Note: you can use GOEXPERIMENT=nogreenteagc to explicitly disable Green Tea at build time.

Feel free to post to this GitHub issue, or you can email me directly at mknyszek(at)golang.org.

Thank you!

Comment From: gopherbot

Change https://go.dev/cl/669655 mentions this issue: main.star: add greenteagc builders

Comment From: dominikh

~No discernible change~ Minor improvement in CPU time for Staticcheck.

CPU: AMD Ryzen 9 3950X 16-Core Processor

> ./staticcheck -debug.version | head -1
staticcheck (devel, v0.7.0-0.dev.0.20250511162810-63832850a84e)

Vanilla:
> GOEXPERIMENT=nogreenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=vanilla.cpu.pprof std

 Performance counter stats for './staticcheck -debug.cpuprofile=vanilla.cpu.pprof std':

        197,121.11 msec task-clock:u                     #   18.932 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           741,678      page-faults:u                    #    3.763 K/sec                     
   738,518,295,442      cycles:u                         #    3.747 GHz                       
   124,817,958,966      stalled-cycles-frontend:u        #   16.90% frontend cycles idle      
   785,988,849,431      instructions:u                   #    1.06  insn per cycle            
                                                  #    0.16  stalled cycles per insn   
   163,480,505,449      branches:u                       #  829.340 M/sec                     
     2,197,680,754      branch-misses:u                  #    1.34% of all branches           

      10.411895545 seconds time elapsed

     182.959316000 seconds user
      10.483651000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=vanilla.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>vanilla.gctrace # GC trace collected separately


Green tea:
> GOEXPERIMENT=greenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700 X:greenteagc
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=greentea.cpu.pprof std
 Performance counter stats for './staticcheck -debug.cpuprofile=greentea.cpu.pprof std':

        193,681.28 msec task-clock:u                     #   18.553 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           782,480      page-faults:u                    #    4.040 K/sec                     
   723,000,663,331      cycles:u                         #    3.733 GHz                       
   130,424,001,536      stalled-cycles-frontend:u        #   18.04% frontend cycles idle      
   788,115,346,312      instructions:u                   #    1.09  insn per cycle            
                                                  #    0.17  stalled cycles per insn   
   158,001,376,147      branches:u                       #  815.780 M/sec                     
     2,432,170,668      branch-misses:u                  #    1.54% of all branches           

      10.439577978 seconds time elapsed

     178.367421000 seconds user
      11.257257000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=greentea.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>greentea.gctrace # GC trace collected separately

The attached files contain the CPU profiles and the gctrace output. I also have execution traces, but they span the whole execution and are ~50 MB each when compressed, which GitHub won't let me attach.

greentea.tar.gz vanilla.tar.gz

perf stat before/after courtesy of normalize-benchfmt:

             │ vanilla.bench │          greentea.bench           │
             │      sec      │    sec      vs base               │
Task-clock:u      199.3 ± 0%   194.4 ± 1%  -2.50% (p=0.000 n=10)
Wall-time         10.11 ± 0%   10.18 ± 1%  +0.74% (p=0.005 n=10)
User-time         184.9 ± 0%   178.9 ± 1%  -3.26% (p=0.000 n=10)
System-time       10.66 ± 2%   11.33 ± 3%  +6.32% (p=0.000 n=10)
geomean           44.63        44.75       +0.25%

                          │ vanilla.bench │            greentea.bench             │
                          │      val      │     val      vs base                  │
Context-switches:u           0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Cpu-migrations:u             0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Page-faults:u               796.2k ± 4%     804.4k ± 4%        ~ (p=0.579 n=10)
Cycles:u                    739.5G ± 0%     719.6G ± 1%   -2.69% (p=0.000 n=10)
Stalled-cycles-frontend:u   125.9G ± 0%     131.1G ± 0%   +4.07% (p=0.000 n=10)
Instructions:u              786.6G ± 0%     789.0G ± 0%   +0.30% (p=0.000 n=10)
Branches:u                  163.6G ± 0%     158.1G ± 0%   -3.36% (p=0.000 n=10)
Branch-misses:u             2.205G ± 0%     2.436G ± 0%  +10.48% (p=0.000 n=10)
geomean                                 ²                 +1.15%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Comment From: mknyszek

Thanks for sharing! I think there is a win here, though it's not a massive improvement. I think there are some effects that might be masking a bigger win.

At least according to the gctrace, I see less overall CPU time spent in any individual GC cycle: on average, around 10-15% less.

Also, here are the distributions of wall time spent in the mark phase:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 152  sum 3269.07  mean 21.507  gmean 12.3254  std dev 19.7445  variance 389.844

     min 0.28
   1%ile 0.365667
   5%ile 0.9985
  25%ile 4.94167
  median 16.5
  75%ile 29.5833
  95%ile 59.1
  99%ile 82.1533
     max 101

⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠒⠒⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.036
⠀⠀⠀⠀⠀⠀⠀⡰⠋⠀⠀⠀⠀⠈⠓⠒⠒⠤⠒⠒⠒⠒⠒⠦⢤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠞⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠋⠉⠉⠙⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠁
        0                          50                          100
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 139  sum 3502.72  mean 25.1994  gmean 15.1408  std dev 20.7032  variance 428.621

     min 0.28
   1%ile 0.3018
   5%ile 1.13
  25%ile 6.28333
  median 21
  75%ile 36.6667
  95%ile 66.4
  99%ile 80.2733
     max 81

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢤⣀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.028
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠈⠓⠒⠒⠊⠉⠉⠉⠁⠀⠈⠉⠙⠒⠦⢤⣀⡀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠁⠈⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠒⠒⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0

Simultaneously, because each GC cycle is shorter, there's less floating garbage, leading to a smaller live heap. This means for the same amount of allocation work, we're going to have more GC cycles, if GOGC is constant (which it is here). Proof of that is in the fact that with Green Tea there are 152 GC cycles, whereas the original GC only executes 139 cycles. Despite this increase in the number of cycles, the CPU profile suggests that around 10% less time is spent in the garbage collector.
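
To make that concrete with illustrative numbers (not measurements from this run, and ignoring stacks and globals in the pacer): the heap goal is roughly

    heap goal ≈ live heap × (1 + GOGC/100)

so with GOGC=100 the goal is about twice the live heap. If less floating garbage shrinks the live heap by, say, 7%, the next cycle's goal shrinks by about 7% too, and the same allocation rate reaches the goal proportionally sooner, which is consistent with seeing more GC cycles over the same run.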

Here are the heap size distributions:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 152  sum 78714  mean 517.855  gmean 295.01  std dev 441.656  variance 195060

     min 3
   1%ile 4.71333
   5%ile 23.85
  25%ile 118.167
  median 425
  75%ile 746.583
  95%ile 1422.05
  99%ile 1576.32
     max 1644

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠙⠒⠦⠤⠤⠖⠒⠋⠉⠉⠉⠓⠒⠒⠒⠲⠤⠤⠤⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠉⠉⠙⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0                              1000
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 139  sum 77043  mean 554.266  gmean 314.515  std dev 450.016  variance 202514

     min 3
   1%ile 4.45333
   5%ile 22
  25%ile 124.333
  median 504
  75%ile 820.833
  95%ile 1492.1
  99%ile 1602.14
     max 1661

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠒⠒⠒⠦⣄⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠞⠁⠀⠀⠀⠀⠀⠈⠑⠦⠤⠤⠤⠖⠋⠉⠀⠀⠀⠈⠉⠉⠙⠒⠦⠤⠤⢤⣀⣀⣀⡀⠀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠁⠀⠀⠀⠈⠉⠉⠉⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠙⠒⠒⠒⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁

As for why this isn't improving overall end-to-end time much, I suspect that:

1. The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?
2. The increase in the number of GC cycles from a reduction in memory is eroding some of the CPU win.

(I could probably analyze this more efficiently if I just made a tool to parse gctraces... I've been doing quite a lot of this lately. 😅)

Comment From: dominikh

> The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?

The actual analysis phase of Staticcheck tries to use GOMAXPROCS goroutines at once, parallelizing on packages as well as individual analyzers per package.

I've updated my original comment with a proper comparison of before and after perf stat. Maybe that can help shed some more light. The increase in system time surprises me.

Comment From: mknyszek

The increase in system time might be related to the use of a runtime.mutex for the global span list. I tried to swap that out with a lock-free stack (and a lock-free stack of blocks of spans, not coalesced) and it actually scaled significantly worse. With a lock, we can amortize the lock's use over several spans (we try to almost never take just one off the list once we get the lock) whereas a lock-free stack requires an atomic operation for each span we want to take off the list. I'm not quite sure how to confirm this. (It's possibly also a second-order effect from (apparently) having a little less GC pressure. Maybe there's some other system resource that staticcheck relies on that's now hammered slightly harder. Like a kernel lock or something.)
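
As a sketch of that amortization argument (illustrative only, not the runtime's code): a batched, mutex-protected pop pays for one lock acquisition per batch, whereas a lock-free stack pays one atomic operation per span.

```go
package sketch

import "sync"

// span stands in for the runtime's span metadata.
type span struct{ /* marking metadata elided */ }

// lockedSpanList is a mutex-protected global list of spans.
type lockedSpanList struct {
	mu    sync.Mutex
	spans []*span
}

// takeBatch removes up to max spans with a single lock acquisition,
// amortizing the synchronization cost over the whole batch.
func (l *lockedSpanList) takeBatch(max int) []*span {
	l.mu.Lock()
	defer l.mu.Unlock()
	n := len(l.spans)
	if n > max {
		n = max
	}
	batch := append([]*span(nil), l.spans[len(l.spans)-n:]...)
	l.spans = l.spans[:len(l.spans)-n]
	return batch
}
```

A Treiber-style lock-free stack, by contrast, needs a compare-and-swap per popped span, which is where the per-span overhead comes from.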

As an aside, there may be a more clever lock-free data structure we could use that would allow us to take many spans off the global list at once, but it's probably going to be quite complicated.

I also checked out the new scan stats I added and based on that, Green Tea's core hypothesis does seem to apply to staticcheck, which is good. A dozen or two objects are scanned on every span scan, which is not bad.

Comment From: mxmauro

Hi, will the new GC implement a method to remove/re-add an object completely from the allocated objects list?

Of course, this may introduce memory leaks if the developer does not take care, but it would be useful when a large number of objects are allocated, like an in-memory key/value store where the value is not a slice of bytes.

If I'm right, the experimental arena feature was abandoned.

Comment From: mknyszek

> Hi, will the new GC implement a method to remove/re-add an object completely from the allocated objects list?

No, that's out of scope. This is just changing the implementation, not adding any user-visible APIs.

> Of course, this may introduce memory leaks if the developer does not take care, but it would be useful when a large number of objects are allocated, like an in-memory key/value store where the value is not a slice of bytes.

If the value contains any pointers, the garbage collector must observe it to prevent a bad free. In that case it's worse than just a memory leak; it's potential memory corruption.

> If I'm right, the experimental arena feature was abandoned.

You may be interested in the discussion of memory regions, which is more likely to be the successor of the arenas experiment. See https://github.com/golang/go/discussions/70257.

Comment From: mknyszek

With the release of Go 1.25 RC1, it should be substantially easier to try out Green Tea, since a tip toolchain is no longer needed. I updated the comment containing the instructions above.

As before, please try it out!

Comment From: DeedleFake

@mknyszek

The new instructions are not working for me. If I just set GOTOOLCHAIN and then run go version, I get go version go1.25rc1 linux/amd64, but if I set GOEXPERIMENT=greenteagc, I get

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x3 pc=0x81a081]

goroutine 1 [running]:
cmd/go/internal/fips140.Init()
    cmd/go/internal/fips140/fips140.go:117 +0x101
cmd/go/internal/modload.Init()
    cmd/go/internal/modload/init.go:419 +0x33
cmd/go/internal/toolchain.Exec({0xc0000261bc, 0x9})
    cmd/go/internal/toolchain/select.go:356 +0x3fc
cmd/go/internal/toolchain.Select()
    cmd/go/internal/toolchain/select.go:284 +0x11a5
main.main()
    cmd/go/main.go:107 +0x4f

This doesn't seem to be something specific to GOEXPERIMENT=greenteagc. I also get the exact same crash with GOEXPERIMENT=jsonv2. The go command just crashes immediately regardless of arguments. The value of GOTOOLCHAIN doesn't seem to matter, either, as long as it's a valid version. If GOTOOLCHAIN is set to a valid version and GOEXPERIMENT is set to anything at all, running the go command instantly crashes with the above.

For context, I'm on Arch and have Go installed via the go package.

Comment From: prattmic

Oops, that looks like a bug in cmd/go (in your installed version, not 1.25rc1) where it crashes if there is a GOEXPERIMENT the installed version doesn't know about before it switches to the new toolchain. I'll file an issue shortly.

As a workaround, it should work to go install golang.org/dl/go1.25rc1@latest and then use GOEXPERIMENT=greenteagc go1.25rc1 build.

Comment From: prattmic

Filed #74111.

Comment From: mknyszek

Ugh, my apologies. I ran the command in the instructions myself, but with a toolchain that understood the GOEXPERIMENT. I will fix up the instructions. Thanks for reporting so promptly! Thank you also to @prattmic for quickly filing a bug and figuring out the problem.


  1. "The Garbage Collection Handbook" by Richard Jones, Anthony Hosking, Eliot Moss, a canonical source for garbage collection techniques, divides parallel garbage collection algorithms into two categories, "processor-centric" and "memory-centric." Loosely, the former category consists of algorithms that aggressively balance work across processors to maximize parallelism, while the latter consists of algorithms that may not parallelize as well, but process contiguous blocks of the heap. The current marking algorithm in the Go garbage collector is firmly processor-centric. Unfortunately for us, the handbook's wisdom ends there, as it goes on to state that "[a]ll known parallel marking algorithms are processor-centric." (page 279)