elliottt opened PR #8246 from elliottt:trevor/copy-callee-saves
to bytecodealliance:main
:
Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
elliottt requested fitzgen for a review on PR #8246.
elliottt requested wasmtime-compiler-reviewers for a review on PR #8246.
elliottt edited PR #8246:
Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
elliottt updated PR #8246.
elliottt edited PR #8246:
Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
TODO
- [ ] Account for the extra space used by
GrowFrame
in the stack check- [ ] Revisit the use of
r14
for the stack limit in theTail
calling conventionCo-authored-by: Jamey Sharp <jsharp@fastly.com>
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
github-actions[bot] commented on PR #8246:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
jameysharp submitted PR review:
I'm proud of the work Trevor and I did on this!
Here's a few quick comments edits after giving this another complete read-through.
jameysharp submitted PR review:
I'm proud of the work Trevor and I did on this!
Here's a few quick comments edits after giving this another complete read-through.
jameysharp created PR review comment:
Nitpick:
// pointer that are pushed on the stack already.
jameysharp created PR review comment:
After we wrote this we discovered the
fp_to_arg_offset
method which I now realize we could also use here instead of hard-coding 16. That might also be handy for generalizing this in shared code when we extend this work to other targets.
jameysharp created PR review comment:
We should update this comment. I guess this will do:
// Finally, do the actual tail call!
jameysharp commented on PR #8246:
Also I see we have only one function in the precise-output compile filetests that exercises
grow_frame
, and none that exerciseshrink_frame
. I suppose we ought to extend the tests.
elliottt edited PR #8246:
Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
TODO
- [ ] Account for the extra space used by
GrowFrame
in the stack check- [ ] Revisit the use of
r14
for the stack limit in theTail
calling conventionCo-authored-by: Jamey Sharp <jsharp@fastly.com>
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
fitzgen commented on PR #8246:
Super excited for this! Going to start digging in.
Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves https://github.com/bytecodealliance/wasmtime/issues/6759 ?
fitzgen commented on PR #8246:
Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?
Or I guess the comparison we really care about is
main
with default wasm features, versus- this branch with wasm tail calls enabled
This gives us the "wasmtime" vs "tail" calling conventions comparison we want.
fitzgen submitted PR review:
Thanks!! Lots of nitpicks, but generally really like this.
+1 to comment about more filetests and the lack of
shrink_frame
testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:
- tail call from a function with multiple stack args to a function with a single stack arg
- tail call from a function with multiple stack args to a function with zero stack args
- tail call from a function with a single stack arg to a function with multiple stack args
- tail call from a function with zero stack args to a function with multiple stack args
fitzgen submitted PR review:
Thanks!! Lots of nitpicks, but generally really like this.
+1 to comment about more filetests and the lack of
shrink_frame
testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:
- tail call from a function with multiple stack args to a function with a single stack arg
- tail call from a function with multiple stack args to a function with zero stack args
- tail call from a function with a single stack arg to a function with multiple stack args
- tail call from a function with zero stack args to a function with multiple stack args
fitzgen created PR review comment:
Is this doing more than just manipulating SP? I guess it also has to adjust FP and the return pointer to add more stack slots.
As written, the docs make it sound like this is doing the full "reconstruct the frame" thing, but I thought (and maybe I've gotten lost given all the different sequencing versions we've talked about for this stuff) we were planning on avoiding that from now on?
All this to say, I feel like the docs for this pseudo inst (and
ShrinkFrame
below) could be a bit more precise.
fitzgen created PR review comment:
This does not include stack arguments right? I think it is worth noting that in the comment.
(Sorry if these nitpicks on comments are too nitpicky. It's just that all this ABI code is so subtle and finicky and has to be Just Right, that I am constantly questioning everything, and if my questions aren't immediately answered in the comments I get really nervous that I am missing/overlooking something)
fitzgen created PR review comment:
Has this
TODO
been addressed?
fitzgen created PR review comment:
Ditto regarding naming here in
ShrinkFrame
as well.
fitzgen created PR review comment:
Same suggestion on rewording this comment the way I suggested for the corresponding comment in
GrowFrame
.
fitzgen created PR review comment:
Ditto regarding comment and whether args are included in the size and ditto regarding the hardcoding of
16
vs usingfp_to_arg_offset
fitzgen created PR review comment:
// Copy the `i`th word in the stack from `SP + amount + i * 8` // to `SP + i * 8`. Do this from lower to higher addresses to // avoid clobbering words we haven't copied yet.
fitzgen created PR review comment:
I think this snippet of code would be easier to read if we didn't name the readable- and writable-GPR versions of
tmp
assrc
anddst
. This is essentially doingmov tmp, [SP + N]; move [SP + M], tmp
but that isn't obvious to me as a reader.If we renamed
src
toreadable_tmp
anddst
towritable_tmp
(or something, open to bikeshedding so long as it is clear they are naming the same register) I think this code would be much clearer, and that clarity is easily worth losing the struct literal field shorthand notation here.
fitzgen created PR review comment:
Likewise about frame pointers being enabled.
fitzgen created PR review comment:
FWIW, this is the kind of thing I'm referring to when I talk about diagrams being nice to have in
{Grow,Shrink}Frame
.
fitzgen created PR review comment:
Is this going to break return calls on non-x64 until we get around to poring these architectures to the new tail call strategy? I don't think this PR should regress any existing functionality.
fitzgen created PR review comment:
Can we document what base this is an offset from? SP right?
fitzgen created PR review comment:
Nice.
fitzgen created PR review comment:
Can we debug assert that we have frame pointers enabled here?
fitzgen created PR review comment:
Can we debug assert that frame pointers are enabled here? Or guard this on frame pointers being enabled?
fitzgen created PR review comment:
I guess the offset is either SP or FP depending on whether the call is a tail call or not? That is pretty subtle and seems relatively easy to mix up. Is there a reason we can't keep using
StackAMode
here? That would avoid mix ups because we would always have to match on the amode to determine whether we are dealing with SP or FP offsets. I think it would be fine to haveunreachable!()
arms in these matches if we always use SP for regular calls and FP for tail calls, but avoiding the mix-up footgun is nice.
fitzgen created PR review comment:
(And presumably this change is because we don't need the full generality of
StackAMode
anymore?)
fitzgen created PR review comment:
Diagrams of the stack frame, with labeled sections/slots, could help alleviate some of my fears/questions here. Y'all should know I'm a sucker for ASCII diagrams in comments by now :-p
fitzgen created PR review comment:
+1 for using
fp_to_arg_offset
and avoiding the hard coding here (or at least consolidating the hard coding in that method).
elliottt updated PR #8246.
jameysharp submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
No. I haven't traced through enough to understand how we should influence the stack check yet.
jameysharp created PR review comment:
GrowFrame
does a memmove of everything in the frame except for the argument area, to make more room in the argument area. That includes all the stack slots, the callee-saved registers, and the saved FP and return address. To keep the stack pointers in sync with that change, it also subtracts the given amount from both the FP and SP registers.ShrinkFrame
is the same but in the other direction.Ulrich's suggestion in today's Cranelift meeting that we allocate the maximum necessary space during the prologue instead lets us get rid of this memmove and only maybe have to move the return address, but we want to land this version first.
jameysharp created PR review comment:
We haven't removed the assert in
emit_return_call_common_sequence
that frame pointers are enabled, but yeah, we should add that in bothShrinkFrame
andGrowFrame
as well.
jameysharp created PR review comment:
This offset always describes the number of bytes into the argument area. Whether we find the argument area relative to FP or SP depends on how
gen_arg
's caller is set up, not on ABI details. That's why we switched away from encoding that decision here.
jameysharp created PR review comment:
Nope, we're using
Call
instead ofReturnCall
here specifically to avoid breaking aarch64 and riscv64 until we can get around to applying similar changes to them.from_func
was always usingCall
for the opcode before so passing that here now doesn't change what data structures that function constructs, and then we rely on the distinction in the newis_tail_call
helper to ensure that the generated code isn't changed.
elliottt updated PR #8246.
elliottt updated PR #8246.
jameysharp submitted PR review.
jameysharp created PR review comment:
On all targets, the stack limit check is based solely on
stackslots_size
plus the number of spillslots. (And spillslots are always word-sized, I see, which I'm confused about, but that's not the point right now.)As far as I can tell, this means the stack limit check never accounts for the size of a callee's argument or setup areas. So I guess we don't need to worry about it for tail calls either?
elliottt edited PR #8246:
Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
TODO
- [ ] Account for the extra space used by
GrowFrame
in the stack check- [x] Revisit the use of
r14
for the stack limit in theTail
calling conventionCo-authored-by: Jamey Sharp <jsharp@fastly.com>
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
elliottt updated PR #8246.
elliottt submitted PR review.
elliottt created PR review comment:
I went with
tmp
andtmp_w
like we had in the previous implementation :+1:
elliottt updated PR #8246.
elliottt updated PR #8246.
elliottt updated PR #8246.
fitzgen submitted PR review:
Looks great! Thanks! :tada:
Excited for the benchmark results!
fitzgen created PR review comment:
Nice, thanks for these tests.
fitzgen submitted PR review:
Looks great! Thanks! :tada:
Excited for the benchmark results!
fitzgen created PR review comment:
These comments are fantastic -- thanks!
elliottt commented on PR #8246:
Here are the execution benchmarks for
spidermonkey
,bz2
, andpulldown-cmark
relative to the current tail calls implementation on main:execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 204255214.50 ± 513219.80 (confidence = 99%) tail-calls.so is 1.08x to 1.08x faster than main.so! [2862185961 2862312218.75 2862598037] main.so [2657985419 2658057004.25 2658128352] tail-calls.so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 13264568.75 ± 1.43 (confidence = 99%) tail-calls.so is 1.06x to 1.06x faster than main.so! [239534828 239534828.25 239534829] main.so [226270259 226270259.50 226270260] tail-calls.so execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 391358.50 ± 894.30 (confidence = 99%) tail-calls.so is 1.02x to 1.02x faster than main.so! [20465937 20466114.00 20466584] main.so [20074442 20074755.50 20075106] tail-calls.so
Running the same benchmarks on main without tail calls and this branch with tail calls separately, yields the following results:
main
execution benchmarks/bz2/benchmark.wasm instructions-retired [226286674 226286674.00 226286674] /home/trevor/src/main.so benchmarks/pulldown-cmark/benchmark.wasm instructions-retired [20080325 20080512.75 20080999] /home/trevor/src/main.so benchmarks/spidermonkey/benchmark.wasm instructions-retired [2659583881 2659624337.75 2659672231] /home/trevor/src/main.so
this branch
execution benchmarks/bz2/benchmark.wasm instructions-retired [226270259 226270259.50 226270260] /home/trevor/src/tail-calls.so benchmarks/pulldown-cmark/benchmark.wasm instructions-retired [20074405 20074419.00 20074445] /home/trevor/src/tail-calls.so benchmarks/spidermonkey/benchmark.wasm instructions-retired [2657979471 2658099662.50 2658220755] /home/trevor/src/tail-calls.so
It's exciting to see that the max for the tail-calls branch is smaller than the min of main for spidermonkey, though even if that's spurious they're still in the same ballpark :tada:
jameysharp edited PR #8246:
Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:
- Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
- As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
- Finally, when emitting a
Inst::ReturnCall
in the x64 backend, reuse thegen_clobber_save
andgen_epilogue_frame_restore
functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
TODO
- [x] Account for the extra space used by
GrowFrame
in the stack check- [x] Revisit the use of
r14
for the stack limit in theTail
calling conventionCo-authored-by: Jamey Sharp <jsharp@fastly.com>
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
fitzgen commented on PR #8246:
Ship it!
elliottt merged PR #8246.
Last updated: Jan 24 2025 at 00:11 UTC