Consider the following program:
package x

var x func(a, b, c, d uint64)

//go:nosplit
//go:noinline
func y(a, b, c, d uint64) {
	x(a, b, c, d)
}

//go:nosplit
func z(a, b, c, d uint64) {
	y(a, b, c, d)
}
When assembled, it looks like this:
TEXT .y(SB), NOSPLIT|ABIInternal, $40-32
	PUSHQ BP
	MOVQ SP, BP
	SUBQ $32, SP
	MOVQ .x(SB), DX
	MOVQ (DX), SI
	PCDATA $1, $0
	CALL SI
	ADDQ $32, SP
	POPQ BP
	RET

TEXT .z(SB), NOSPLIT|ABIInternal, $40-32
	CMPQ SP, 16(R14)
	JLS // morestack
	PUSHQ BP
	MOVQ SP, BP
	SUBQ $32, SP
	CALL .y(SB)
	ADDQ $32, SP
	POPQ BP
	RET
The primary thing to note here is that both .y and .z reserve 32 bytes of spill space for their callees to spill their arguments. However, .z's sole callee is a nosplit function, which therefore does not contain a morestack check. As far as I know, these 32 bytes are never written to in any code path.
This has a few unfortunate side effects, but the itch I'm trying to scratch is that I have a bunch of performance-critical nosplit functions whose arguments/returns fully saturate the argument and return registers, and are never spilled. On x86, I am limited to about 10 stack frames before I hit the nosplit limit in the linker.
This is all well and fine: 10 frames is a lot. Unfortunately, this assumes two things:
- No further stack variables are created. I have a custom build tag that turns on debug instrumentation, which blows up the size of the three nested nosplit frames I actually have just enough that my program fails to link.
- No extra calls are inserted. Turning on fuzzing inserts calls to nosplit instrumentation functions, which cause me to blow the stack and fail to link.
I have been working around this in a few different ways, because this is a very niche problem being suffered by a performance weirdo. However, I did notice that 72 bytes of each frame go unused: the morestack spill path.
As the ABI documentation observes, there are many options for improving this situation. I'd like to suggest an improvement that should be simple to implement, and will go some way to eliminating redundant stack growth: if a function only calls functions declared as nosplit, treat it like a leaf function for the purposes of the prologue.
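As a minimal sketch of that rule, the check might look something like the following. The `fn` type and `needsSpillArea` are hypothetical names invented for this illustration; they are not real cmd/compile types, and the sketch conservatively treats any indirect call as possibly reaching morestack.

```go
package main

import "fmt"

// fn is a hypothetical, simplified view of a function's call behavior.
type fn struct {
	nosplit  bool     // declared //go:nosplit
	indirect bool     // makes at least one call through a func value or interface
	callees  []string // statically known direct callees
}

// needsSpillArea reports whether name must reserve argument spill space.
// It returns false only when every call is a direct call to a nosplit
// function, which includes true leaf functions with no calls at all.
func needsSpillArea(prog map[string]fn, name string) bool {
	f := prog[name]
	if f.indirect {
		return true // unknown target may contain a morestack check
	}
	for _, c := range f.callees {
		callee, ok := prog[c]
		if !ok || !callee.nosplit {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the program at the top: y calls through the variable x,
	// and z calls only y.
	prog := map[string]fn{
		"y": {nosplit: true, indirect: true},
		"z": {nosplit: true, callees: []string{"y"}},
	}
	for _, name := range []string{"y", "z"} {
		fmt.Println(name, "needs spill area:", needsSpillArea(prog, name))
	}
}
```

In this model y keeps its spill area because its callee is unknown, while z is treated like a leaf.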
Of course, this isn't quite so simple. First, the argument registers no longer have a natural home, so those will need to be allocated if they are in fact necessary. Second, there might be a place in reflect that expects this spill area to be here, but I'm not certain. It also messes up traceback printing, which will need to be aware of spill-space-less functions.
This also only benefits nosplit code, which isn't particularly common. I'm more-or-less hitting a pathological case. The real fix is to modify stack growth to allow callees to reserve their own space, as the ABI document details.
Comment From: gabyhelp
Related Issues
- cmd/compile: structs with more than the hardcoded 4 words limits will always be spilled onto the stack even when passed in and out through registers by regabi #72897 (closed)
- cmd/compile: prefer to cheaply re-materialize after call site instead of spilling #32255
- cmd/compile: reduce binary space used for register spilling #47970
- cmd/compile: minimize morestack calls text footprint #29067
Comment From: mcy
Actually, come to think of it... is there any reason that there isn't a maximally large spill area in the G struct that morestack can just toss all the argument registers into? Is it because of the per-G overhead? Each G needs to be able to save all of its registers somewhere to deal with async preemption. Given ~10 argument gprs and vector regs, the cost is 3*10*8 = 240 bytes, or the average cost of 15 stack frames' spill areas, assuming no arguments are spilled.
Seems like that would solve the issue far more cleanly than playing some kind of callgraph analysis that can't help virtual calls.
I guess the main pain here would be making sure that stack scanning actually sees those spilled registers as stack roots when scanning a paused G, which means that each frame now needs to specify which argument gprs contain a pointer, in addition to which frame offsets contain pointers...
Comment From: Jorropo
> Actually, come to think of it... is there any reason that there isn't a maximally large spill area in the G struct that morestack can just toss all the argument registers into?
I remember @dr2chase wrote this for the interprocedural clobber sets CLs and it slowed things down, although I can't find where we had this conversation, whether a solution was found, or whether it is exactly 1:1 with what you are talking about here.
Edit: after rereading the code in CL 636838, it looks very comparable, except that, due to the change in ABI behavior, all functions must now save all registers, whereas currently, and with what you are proposing, we could keep the argument-based generation but relative to G+off rather than SP+off.