Overview

Meta makes wide use of cgo, giving Gophers access to a vast array of C++ libraries. We previously advocated the defer Close() pattern for C++-backed objects; however, because these objects look like regular Go objects, the defer call was frequently overlooked, leading to memory leaks. Consequently, we adopted automated memory management through runtime.AddCleanup.
Unfortunately, attaching these cleanups incurs a performance penalty, concentrated in the runtime.AddCleanup call itself.

Benchmark

Before implementing this change, we aimed to understand its performance implications. Several online blog posts compare manual deallocation with runtime.SetFinalizer, highlighting the poor performance of finalizers. We ran a similar benchmark against the new runtime.AddCleanup, introduced in Go 1.24. While runtime.AddCleanup is roughly 2x faster than runtime.SetFinalizer, it still lags well behind manual deallocation.

The following benchmark results on Go 1.25.0 on a 14" MacBook Pro (M1) indicate approximately a 5x slowdown compared to manual deallocation (benchmarking on a Linux server yielded similar results):

BenchmarkAllocateFree-8                 15169078                73.42 ns/op
BenchmarkAllocateDefer-8                15890224                75.04 ns/op
BenchmarkAllocateAddCleanup-8            3516811               359.7 ns/op
BenchmarkAllocateWithFinalizer-8         1739244               676.9 ns/op

For the benchmark code and additional results, please refer to podtserkovskiy/slow-addcleanup-repro.

Further investigation

Delving deeper into this issue, we believe the root cause lies in the linear search through the sorted linked list of special objects on each mspan, specifically in the (*mspan).specialFindSplicePoint function. A CPU profile generated on Go 1.25rc2 with GOGC=off (for a less noisy picture; it doesn't significantly affect the results) illustrates this:

[CPU profile image]

Is there potential to accelerate (*mspan).specialFindSplicePoint (e.g., by using a skip list instead of a plain linked list)?

Comment From: podtserkovskiy

cc @mknyszek

Comment From: mknyszek

I have some ideas on how to make this faster, at the cost of some memory overhead. What we brainstormed at GopherCon was essentially:

1. Add an optional array of pointers to each mspan which contains a pointer to the first special for each object.
2. Create this array on demand, when the first special is added.

If the size of the array is a problem (as it might be for, say, 8 byte objects) we can also try compressing it. For example, doing a linked list of chunks with some special compact encoding.

The microbenchmark is maximally pessimal, since it forces a full walk of the linked list. In practice I suspect that it's not quite so bad, since the linked list for each span is going to be fairly short.

That being said, it's still not fast. How big are the objects that you're actually applying the cleanup to in production?