I have a WIP CL 673695 with a better and more complete description, but opening this issue here to help with discussion.

Background

CL 673695 starts with:

For reducing allocations, there is sometimes a chicken and egg problem.

For example, #72036 is a WIP approach to update escape analysis so that interface arguments do not always escape, with a primary target of w.Write(buf) and friends. Many interesting byte slices, however, have sizes known only at runtime, and currently a make with an unknown capacity (beyond 32 bytes) causes the slice to be heap allocated. Solely improving escape analysis in this case therefore has limited impact on actual applications -- if a slice is heap allocated for two reasons, eliminating one reason just means it is still heap allocated for the other reason, which lessens the value of doing the work to eliminate the first reason.

In an attempt to help triangulate on what is possible for the runtime, compiler, and standard library to do together to reduce how much work the GC must do, this CL has a proof-of-concept set of runtime.free implementations.

Back in February, as part of my write-up for #72036, I mentioned that a runtime.free could help with this chicken-and-egg problem:

I have a mostly working runtime.free that can be inserted by the compiler (and not for use by normal Go code of course), and I'm looking at hooking it up a few different ways, including when a slice is both variable size and also used as an interface method arg. (I'll probably open a separate issue on this, but I'm also attempting to free memory in some runtime map growth cases, some append cases, etc.)

Prototype

CL 673695 contains that mostly working runtime.free. (It's "mostly working" in the sense that it passes its own tests, including accumulating a few hundred CPU hours of running those tests successfully under stress. That of course does not mean it is bug free, and it still takes some shortcuts, for example in how it updates user-facing stats.)

I also have some additional WIP CLs that update the compiler to automatically insert free calls for slices allocated via make when the compiler can prove it is safe to do so. Separately, other CLs update maps to free map storage during growth and split, as well as a small number of lower-level stdlib types like strings.Builder and bytes.Buffer (which are "low level" in the sense that they already encapsulate some low-level performance tricks, including use of unsafe, but also let other higher-level pieces of the stdlib safely benefit from their performance). It is not intended that higher-level portions of the stdlib (like net/http) would have manual calls to a runtime.free.

Patterns of append usage like slices.Collect and slices.AppendSeq are also targets, with some good initial results. (slices.Collect, for example, could be updated manually to start, which I've tried as an experiment, or it could wait for the compiler to automatically recognize its pattern.)

Note that runtime.free and the related runtime entry points are not public APIs, and cannot be called directly by end users.

Basic approach

See the more detailed write-up in the CL 673695 commit message, which covers these four runtime entry points to start:

 func free(ptr unsafe.Pointer)
 func freesized(ptr unsafe.Pointer, size uintptr, noscan bool)
 func freetracked(trackedObjs *[]trackedObj)
 func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer

At least for the approach I have so far, the free work tries to take advantage of things the compiler already knows or can be taught with modest effort (for example, in escape analysis).

Performance

Some preliminary benchmarks. See CL 673695 for commentary on these benchmarks.

strings.Builder.Write with runtime.freesized

                                               │       old        │    new                              │
                                               │      sec/op      │   sec/op     vs base                │
BuildString_Builder/1Write_36Bytes_NoGrow-4           55.82n ± 2%   55.86n ± 2%        ~ (p=0.794 n=20)
BuildString_Builder/2Write_36Bytes_NoGrow-4           125.2n ± 2%   115.4n ± 1%   -7.86% (p=0.000 n=20)
BuildString_Builder/3Write_36Bytes_NoGrow-4           224.0n ± 1%   188.2n ± 2%  -16.00% (p=0.000 n=20)
BuildString_Builder/5Write_36Bytes_NoGrow-4           239.1n ± 9%   205.1n ± 1%  -14.20% (p=0.000 n=20)
BuildString_Builder/8Write_36Bytes_NoGrow-4           422.8n ± 3%   325.4n ± 1%  -23.04% (p=0.000 n=20)
BuildString_Builder/10Write_36Bytes_NoGrow-4          436.9n ± 2%   342.3n ± 1%  -21.64% (p=0.000 n=20)
BuildString_Builder/100Write_36Bytes_NoGrow-4         4.403µ ± 1%   2.381µ ± 2%  -45.91% (p=0.000 n=20)
BuildString_Builder/1000Write_36Bytes_NoGrow-4        48.28µ ± 2%   21.38µ ± 2%  -55.71% (p=0.000 n=20)

Normal allocation (with these changes, but no free/reuse happening)

The reuse code path for reusing a freed object is currently somewhat interwoven with the normal allocation path in runtime/malloc.go, but some initial micro-benchmarks suggest the impact of these changes on the normal allocation path when no reuse is happening could be small to negligible:

                        │         old        │  new                             │
                        │        sec/op      │ sec/op     vs base               │
Malloc8-16                      11.01n ± 1%   10.94n ± 1%  -0.68% (p=0.038 n=20)
Malloc16-16                     17.15n ± 1%   17.05n ± 0%  -0.55% (p=0.007 n=20)
Malloc32-16                     18.65n ± 1%   18.42n ± 0%  -1.26% (p=0.000 n=20)
MallocTypeInfo8-16              18.63n ± 0%   18.36n ± 0%  -1.45% (p=0.000 n=20)
MallocTypeInfo16-16             22.32n ± 0%   22.65n ± 0%  +1.50% (p=0.000 n=20)
MallocTypeInfo32-16             23.37n ± 0%   23.89n ± 0%  +2.23% (p=0.000 n=20)
geomean                         18.02n        18.01n       -0.05%

The initial performance results seem promising, but are still preliminary.

In general, some possible performance improvements might include:

  1. less CPU used by GC, which might matter most when CPU is constrained or exhausted (either on a sustained basis, or over some shorter but meaningful time interval).
  2. longer time between GCs, with less time overall with the write barrier enabled (helping application code execute faster).
  3. more cache-friendly allocs for user code (esp. if it ends up with LIFO or near-LIFO free/alloc).
  4. with less GC work overall, maybe fewer assists or other ways the GC can intrude on the app.
  5. maybe slightly more friendly to the new Green Tea GC if it can result in a higher density per span (speculation 😅).

The first is maybe the most obvious, but it might not justify the change for apps that have spare CPU, where the GC CPU costs are almost entirely in the "background" compared to the rest of the app work.

The third is particularly interesting because it might help make freeing a net win in CPU costs even before considering the cost of running the GC. (I suspect I've observed this effect in some of the benchmarks, but need to look more carefully.)

One approach to mitigate the cost for applications that don't care much about GC costs today is the size-based threshold the free-related work currently uses (currently 16 bytes). That threshold could be raised to the point where the other costs of allocating (including usually zeroing) or operating on the result (like writing data and doing useful work) are large enough that the free-related overhead becomes comparatively small. (For compiler-inserted frees of slices, it would be nice if the free threshold could stay low enough to reach the 32-byte threshold for the new Go 1.25 stack allocation optimization introduced by Keith, or maybe that 32-byte threshold ends up being raised.) Also, the current cost in the prototype can likely be made cheaper.

That said, it might be easier to sort through some of the performance trade-offs by seeing more concrete performance examples, which is one of the main goals of this current prototype.


There are multiple ways that runtime.free and related entry points could be used, but I'm hoping this issue could be focused on some possible first steps.

In any event, this is still exploratory. @mknyszek took a look a ~month ago and gave some helpful initial feedback. Absent someone pointing out a fatal flaw in the overall approach, I plan to try to polish things up some more and get more feedback. It might be that this exact approach does not pan out, but maybe we'll learn something via the exploration.
