wasmtime / PR #8246 cranelift: Support callee-saved regis... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #8246 cranelift: Support callee-saved regis...

Wasmtime GitHub notifications bot (Mar 26 2024 at 22:58):

elliottt opened PR #8246 from elliottt:trevor/copy-callee-saves to bytecodealliance:main:

Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

Wasmtime GitHub notifications bot (Mar 26 2024 at 22:58):

elliottt requested fitzgen for a review on PR #8246.

Wasmtime GitHub notifications bot (Mar 26 2024 at 22:58):

elliottt requested wasmtime-compiler-reviewers for a review on PR #8246.

Wasmtime GitHub notifications bot (Mar 26 2024 at 22:58):

elliottt edited PR #8246:

Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Wasmtime GitHub notifications bot (Mar 26 2024 at 23:25):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 26 2024 at 23:27):

elliottt edited PR #8246:

Rework the tail calling convention on x64 to support tail calls by changing the complication strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

[ ] Account for the extra space used by GrowFrame in the stack check

[ ] Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Wasmtime GitHub notifications bot (Mar 26 2024 at 23:44):

github-actions[bot] commented on PR #8246:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle

fitzgen: isle

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:00):

jameysharp submitted PR review:

I'm proud of the work Trevor and I did on this!

Here's a few quick comments edits after giving this another complete read-through.

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:00):

jameysharp submitted PR review:

I'm proud of the work Trevor and I did on this!

Here's a few quick comments edits after giving this another complete read-through.

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:00):

jameysharp created PR review comment:

Nitpick:

            // pointer that are pushed on the stack already.

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:00):

jameysharp created PR review comment:

After we wrote this we discovered the fp_to_arg_offset method which I now realize we could also use here instead of hard-coding 16. That might also be handy for generalizing this in shared code when we extend this work to other targets.

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:00):

jameysharp created PR review comment:

We should update this comment. I guess this will do:
        // Finally, do the actual tail call!

Wasmtime GitHub notifications bot (Mar 27 2024 at 01:03):

jameysharp commented on PR #8246:

Also I see we have only one function in the precise-output compile filetests that exercises grow_frame, and none that exercise shrink_frame. I suppose we ought to extend the tests.

Wasmtime GitHub notifications bot (Mar 27 2024 at 03:52):

elliottt edited PR #8246:

Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

[ ] Account for the extra space used by GrowFrame in the stack check

[ ] Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Wasmtime GitHub notifications bot (Mar 27 2024 at 15:04):

fitzgen commented on PR #8246:

Super excited for this! Going to start digging in.

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves https://github.com/bytecodealliance/wasmtime/issues/6759 ?

Wasmtime GitHub notifications bot (Mar 27 2024 at 15:10):

fitzgen commented on PR #8246:

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?

Or I guess the comparison we really care about is

main with default wasm features, versus

this branch with wasm tail calls enabled

This gives us the "wasmtime" vs "tail" calling conventions comparison we want.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen submitted PR review:

Thanks!! Lots of nitpicks, but generally really like this.

+1 to comment about more filetests and the lack of shrink_frame testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:

tail call from a function with multiple stack args to a function with a single stack arg

tail call from a function with multiple stack args to a function with zero stack args

tail call from a function with a single stack arg to a function with multiple stack args

tail call from a function with zero stack args to a function with multiple stack args

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen submitted PR review:

Thanks!! Lots of nitpicks, but generally really like this.

+1 to comment about more filetests and the lack of shrink_frame testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:

tail call from a function with multiple stack args to a function with a single stack arg

tail call from a function with multiple stack args to a function with zero stack args

tail call from a function with a single stack arg to a function with multiple stack args

tail call from a function with zero stack args to a function with multiple stack args

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Is this doing more than just manipulating SP? I guess it also has to adjust FP and the return pointer to add more stack slots.

As written, the docs make it sound like this is doing the full "reconstruct the frame" thing, but I thought (and maybe I've gotten lost given all the different sequencing versions we've talked about for this stuff) we were planning on avoiding that from now on?

All this to say, I feel like the docs for this pseudo inst (and ShrinkFrame below) could be a bit more precise.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

This does not include stack arguments right? I think it is worth noting that in the comment.

(Sorry if these nitpicks on comments are too nitpicky. It's just that all this ABI code is so subtle and finicky and has to be Just Right, that I am constantly questioning everything, and if my questions aren't immediately answered in the comments I get really nervous that I am missing/overlooking something)

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Has this TODO been addressed?

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Ditto regarding naming here in ShrinkFrame as well.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Same suggestion on rewording this comment the way I suggested for the corresponding comment in GrowFrame.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Ditto regarding comment and whether args are included in the size and ditto regarding the hardcoding of 16 vs using fp_to_arg_offset

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

            // Copy the `i`th word in the stack from `SP + amount + i * 8`
            // to `SP + i * 8`. Do this from lower to higher addresses to
            // avoid clobbering words we haven't copied yet.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

I think this snippet of code would be easier to read if we didn't name the readable- and writable-GPR versions of tmp as src and dst. This is essentially doing mov tmp, [SP + N]; move [SP + M], tmp but that isn't obvious to me as a reader.

If we renamed src to readable_tmp and dst to writable_tmp (or something, open to bikeshedding so long as it is clear they are naming the same register) I think this code would be much clearer, and that clarity is easily worth losing the struct literal field shorthand notation here.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Likewise about frame pointers being enabled.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

FWIW, this is the kind of thing I'm referring to when I talk about diagrams being nice to have in {Grow,Shrink}Frame.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Is this going to break return calls on non-x64 until we get around to poring these architectures to the new tail call strategy? I don't think this PR should regress any existing functionality.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Can we document what base this is an offset from? SP right?

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Nice.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Can we debug assert that we have frame pointers enabled here?

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Can we debug assert that frame pointers are enabled here? Or guard this on frame pointers being enabled?

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

I guess the offset is either SP or FP depending on whether the call is a tail call or not? That is pretty subtle and seems relatively easy to mix up. Is there a reason we can't keep using StackAMode here? That would avoid mix ups because we would always have to match on the amode to determine whether we are dealing with SP or FP offsets. I think it would be fine to have unreachable!() arms in these matches if we always use SP for regular calls and FP for tail calls, but avoiding the mix-up footgun is nice.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

(And presumably this change is because we don't need the full generality of StackAMode anymore?)

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

Diagrams of the stack frame, with labeled sections/slots, could help alleviate some of my fears/questions here. Y'all should know I'm a sucker for ASCII diagrams in comments by now :-p

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:27):

fitzgen created PR review comment:

+1 for using fp_to_arg_offset and avoiding the hard coding here (or at least consolidating the hard coding in that method).

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:29):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp created PR review comment:

No. I haven't traced through enough to understand how we should influence the stack check yet.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp created PR review comment:

GrowFrame does a memmove of everything in the frame except for the argument area, to make more room in the argument area. That includes all the stack slots, the callee-saved registers, and the saved FP and return address. To keep the stack pointers in sync with that change, it also subtracts the given amount from both the FP and SP registers. ShrinkFrame is the same but in the other direction.

Ulrich's suggestion in today's Cranelift meeting that we allocate the maximum necessary space during the prologue instead lets us get rid of this memmove and only maybe have to move the return address, but we want to land this version first.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp created PR review comment:

We haven't removed the assert in emit_return_call_common_sequence that frame pointers are enabled, but yeah, we should add that in both ShrinkFrame and GrowFrame as well.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp created PR review comment:

This offset always describes the number of bytes into the argument area. Whether we find the argument area relative to FP or SP depends on how gen_arg's caller is set up, not on ABI details. That's why we switched away from encoding that decision here.

Wasmtime GitHub notifications bot (Mar 27 2024 at 17:58):

jameysharp created PR review comment:

Nope, we're using Call instead of ReturnCall here specifically to avoid breaking aarch64 and riscv64 until we can get around to applying similar changes to them. from_func was always using Call for the opcode before so passing that here now doesn't change what data structures that function constructs, and then we rely on the distinction in the new is_tail_call helper to ensure that the generated code isn't changed.

Wasmtime GitHub notifications bot (Mar 27 2024 at 18:57):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 18:58):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:10):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:10):

jameysharp created PR review comment:

On all targets, the stack limit check is based solely on stackslots_size plus the number of spillslots. (And spillslots are always word-sized, I see, which I'm confused about, but that's not the point right now.)

As far as I can tell, this means the stack limit check never accounts for the size of a callee's argument or setup areas. So I guess we don't need to worry about it for tail calls either?

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:19):

elliottt edited PR #8246:

Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

[ ] Account for the extra space used by GrowFrame in the stack check

[x] Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:35):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:53):

elliottt submitted PR review.

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:53):

elliottt created PR review comment:

I went with tmp and tmp_w like we had in the previous implementation :+1:

Wasmtime GitHub notifications bot (Mar 27 2024 at 20:59):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 22:22):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 22:49):

elliottt updated PR #8246.

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:16):

fitzgen submitted PR review:

Looks great! Thanks! :tada:

Excited for the benchmark results!

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:16):

fitzgen created PR review comment:

Nice, thanks for these tests.

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:16):

fitzgen submitted PR review:

Looks great! Thanks! :tada:

Excited for the benchmark results!

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:16):

fitzgen created PR review comment:

These comments are fantastic -- thanks!

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:18):

elliottt commented on PR #8246:

Here are the execution benchmarks for spidermonkey, bz2, and pulldown-cmark relative to the current tail calls implementation on main:

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 204255214.50 ± 513219.80 (confidence = 99%)

  tail-calls.so is 1.08x to 1.08x faster than main.so!

  [2862185961 2862312218.75 2862598037] main.so
  [2657985419 2658057004.25 2658128352] tail-calls.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 13264568.75 ± 1.43 (confidence = 99%)

  tail-calls.so is 1.06x to 1.06x faster than main.so!

  [239534828 239534828.25 239534829] main.so
  [226270259 226270259.50 226270260] tail-calls.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 391358.50 ± 894.30 (confidence = 99%)

  tail-calls.so is 1.02x to 1.02x faster than main.so!

  [20465937 20466114.00 20466584] main.so
  [20074442 20074755.50 20075106] tail-calls.so

Running the same benchmarks on main without tail calls and this branch with tail calls separately, yields the following results:

main

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226286674 226286674.00 226286674] /home/trevor/src/main.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20080325 20080512.75 20080999] /home/trevor/src/main.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2659583881 2659624337.75 2659672231] /home/trevor/src/main.so

this branch

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226270259 226270259.50 226270260] /home/trevor/src/tail-calls.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20074405 20074419.00 20074445] /home/trevor/src/tail-calls.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2657979471 2658099662.50 2658220755] /home/trevor/src/tail-calls.so

It's exciting to see that the max for the tail-calls branch is smaller than the min of main for spidermonkey, though even if that's spurious they're still in the same ballpark :tada:

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:19):

jameysharp edited PR #8246:

Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.

As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.

Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

[x] Account for the extra space used by GrowFrame in the stack check

[x] Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:22):

fitzgen commented on PR #8246:

Ship it!

Wasmtime GitHub notifications bot (Mar 27 2024 at 23:50):

elliottt merged PR #8246.

Last updated: Apr 18 2025 at 13:08 UTC