fitzgen opened PR #6478 from fitzgen:stack-to-pop-in-ret
to bytecodealliance:main
:
This is necessary for implementing callee-pops calling conventions, as is required for tail calls. This is just a small part of tail calls, and doesn't implement everything, but is a good piece to land on its own so that eventual PR isn't so huge.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
fitzgen requested cfallin for a review on PR #6478.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #6478.
fitzgen edited PR #6478:
This is necessary for implementing callee-pops calling conventions, as is required for tail calls. This is just a small part of tail calls, and doesn't implement everything, but is a good piece to land on its own so that eventual PR isn't so huge. Just trying to shrink the yak stack.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin submitted PR review:
Looks great; thanks for splitting this out!
One thought on the pseudo-assembly printing in VCode (for aarch64 but similar thoughts on riscv64); and one request for a bit more in a doc-comment; but otherwise LGTM.
cfallin submitted PR review:
Looks great; thanks for splitting this out!
One thought on the pseudo-assembly printing in VCode (for aarch64 but similar thoughts on riscv64); and one request for a bit more in a doc-comment; but otherwise LGTM.
cfallin created PR review comment:
Rather than print a pseudo-inst here, could we print e.g. `add sp, sp, #imm ; ret" or similar?
cfallin created PR review comment:
"At the same time as returning" is slightly ambiguous: does the pop happen logically just before popping the return address, or just after? (I think after?) A diagram of the stack before-and-after might be helpful actually, if it's not too much trouble to draw one quickly!
jameysharp created PR review comment:
The exact sequence of events is target-specific. The return address may not be on the stack at all by the time we get to the return instruction; x86 is the only target we support where
ret
pops an address off the stack for you. So yes, on x86 this emits an instruction which first pops the return address, and then removes the arguments from the caller's frame. Everywhere else, the return address is in a register by that point. So we can emit separate instructions to adjust the stack pointer and then branch through the appropriate register.
fitzgen updated PR #6478.
fitzgen updated PR #6478.
fitzgen has enabled auto merge for PR #6478.
cfallin created PR review comment:
Unresolving -- this could still use a diagram or better description in the comment, I think.
The sequencing question is really about the order of items on the stack: are the additionally-popped items before (below) the return address or after (above)? I think the latter, but I have to think about it to come to that conclusion by reasoning through it; I'd prefer for it to be clear in the comment here.
cfallin created PR review comment:
Ah, I see the just-pushed commit does resolve the ambiguity, so I'm happy now!
fitzgen updated PR #6478.
fitzgen updated PR #6478.
bjorn3 created PR review comment:
Less commonly used instructions from the 16bit x86 era can be slower than writing multiple simpler instructions. Does this also apply to
retn
or is it just as fast asadd rsp, n
+ret
?
cfallin created PR review comment:
According to Agner Fog's instruction tables, and picking a random semi-recent microarchitecture (Skylake),
ret
andret imm
are both 2 µops; the latter has a throughput of 0.5 per cycle, the former 1 per cycle. So it looks like there's a tiny penalty but it's no worse than rearranging the stack manually. (And keep in mind that the popped data is above the return address, so one would need to load that, store it higher up, move rsp, andret
; the store alone in that sequence would kill the perf advantage.)
fitzgen created PR review comment:
add rsp, n; ret
isn't equivalent because the return address must be on top of the stack in both cases. That is,ret n
pops the RA and then popsn
bytes.So I don't know if it is actually slower or not, but it is not semantically equivalent, so the point is moot to some degree.
The actual alternative would be something like
mov tmp, [rsp] add rsp, n mov [rsp], tmp ret
I kind of doubt that is any better than
ret n
, but it is certainly possible. Would be interesting to see what LLVM does here.
cfallin edited PR review comment.
fitzgen merged PR #6478.
cfallin created PR review comment:
I did a fun little experiment with a
ret imm
implementation, a "move the return address andret
" implementation, and a "load the return address, manually pop, and jump-indirect" implementation.On my system (Zen 2 core, Ryzen 3900X) the first two have identical performance, and the last is 25% faster. Huh!
I sort of suspect what's happening here is:
- With enough spare pipeline capacity, the store of the return address in the new location is "free", and the memory-renaming that Zen is known to do (tracking constant offsets of rsp stored from registers and using that register directly) eliminate a critical path on re-loading the RA as part of
ret
.- With only one returned-to address in this microbenchmark, the
jmp rdi
is trivial for the indirect predictor to get right; but this example will overflow the return-stack predictor and so in a real program it would do poorly.- In other words, this is kind of hard to test without having a "real program" surrounding it.
Anyway it's sort of getting into the weeds and
ret imm
isn't obviously worse so I think we should keep this until/unless we discover it becoming a bottleneck.
bjorn3 created PR review comment:
Makes sense. I didn't know that
ret imm
would move the stack pointer after reading the return address.
Last updated: Jan 24 2025 at 00:11 UTC