Stream: git-wasmtime

Topic: wasmtime / issue #6530 Cranelift: get tail calls working ...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2023 at 17:53):

fitzgen edited issue #6530:

I'm having a hard time figuring out the details of s390x here, so I am going to disable support for the tail calling convention on s390x (with a loud assertion) to land this PR. We can get s390x working in follow ups.

_Originally posted by @fitzgen in https://github.com/bytecodealliance/wasmtime/issues/6500#issuecomment-1579427510_

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2023 at 17:53):

fitzgen commented on issue #6530:

Hey @uweigand, would really appreciate your help here. There are two things that need to be done (or at least, these are the two steps I took for the other backends):

  1. Get regular, non-tail calls working with the tail calling convention.

    This has involved modifying the calling convention such that callees (rather than callers) pop stack arguments as part of their function epilogue and before/at the same time as returning to their caller.

    There is a runtest that uses the tail calling convention without actually doing tail calls; it is currently disabled on s390x but would presumably be enabled during this step. I also recommend copying that test to filetests/isa/s390x and turning it into a compile precise-output test as well, as the other backends have done.

  2. Implement tail calls.

    These are only valid CLIF when both the caller and callee use the tail calling convention, so that is something we can rely upon.

    None of the other backends do parallel move solving during tail calls (which would ensure that new arguments end up in the correct locations without clobbering old stack arguments that become new arguments elsewhere on the stack). Instead, they bump SP, construct a temporary frame, move that frame down on top of the current frame, and then initialize SP for the callee. I recommend starting with this simple approach; we can later do parallel move solving as an optimization, but whatever works for you.

    There are a few runtests exercising tail calls: filetests/runtests/return-call{.clif,-indirect.clif,-loop.clif}. Each backend also has a copy of return-call.clif (which actually covers both colocated and non-colocated calls) in its ISA-specific tests as a compile precise-output test, and I recommend that s390x get a copy as well.

    I've been adding emit_return_call methods to each backend's abi::CallSite<M> instantiation, so that the top-level entry point is arch-specific rather than going through a common generic helper. The ISLE lowering calls the external gen_return_call[_indirect] constructor, which each backend implements. This external constructor creates an abi::CallSite<MyBackend> and then calls its emit_return_call method (see the sketch after this list). There is a generic helper to create the temporary tail-call frame that you can use if you want, but in general you have direct control of everything here; you don't have to fit into the shape of generic hooks and callbacks. I think it is easier to read and write the code this way, and hopefully you'll find the same.

    For reference, the last commit in https://github.com/bytecodealliance/wasmtime/pull/6749 shows what implementing tail calls for riscv64 entailed. It was pretty straightforward, and hopefully can serve as a template to reference and crib from for s390x.
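
Below is a rough Rust sketch of the plumbing described in step 2. Only abi::CallSite, emit_return_call, and gen_return_call are names from the discussion above; every signature here is a simplified placeholder, not Cranelift's actual API (see the riscv64 commit linked above for the real thing).

// Hypothetical, simplified shape of the per-backend tail-call glue.
struct CallSite; // stands in for abi::CallSite<MyBackend>

impl CallSite {
    // Backend-specific tail-call emission, using the simple approach
    // described above: build a temporary frame, move it down over the
    // current frame, set up SP for the callee, and jump.
    fn emit_return_call(self) {
        // 1. bump SP and materialize the new argument area
        // 2. move that temporary frame down on top of the current frame
        // 3. restore clobbers, initialize SP for the callee, and jump
    }
}

// The external constructor that ISLE lowering calls; each backend
// implements this by creating a CallSite and delegating to its method.
fn gen_return_call() {
    let site = CallSite;
    site.emit_return_call();
}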

Don't hesitate to reach out with any questions or clarifications or anything like that! Thanks very much!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2023 at 17:54):

fitzgen commented on issue #6530:

Oh also, since Wasmtime will use the tail calling convention for all Wasm functions, I think we need to mark the tail calling convention as having little endian lane ordering on s390x. I am not sure where that change needs to be made.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 15:25):

uweigand commented on issue #6530:

This has involved modifying the calling convention such that callees (rather than callers) pop stack arguments as part of their function epilogue and before/at the same time as returning to their caller.

This is probably the core of the problem: in the current s390x ABI, neither callers nor callees pop stack arguments as part of the calling sequence. Instead, the stack pointer stays constant around calls, and the space for all arguments of all calls the current function may ever make is statically allocated in its prologue. Therefore the tail calling convention will have to be implemented differently, either by switching over to a completely different convention that actually pushes/pops argument space around calls, or by some other means, like allowing calls to not preserve the stack pointer (which in turn would force the use of a frame pointer, which we also currently do not have on s390x ...).

I'll need to look into those options a bit more, but I won't get to this before I'm back from vacation in early August.

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:18):

uweigand commented on issue #6530:

Here's finally my current design of a tail-call ABI for s390x.

The SystemV ABI has the following stack layout:

//!                              +---------------------------+
//!                              |          ...              |
//! CFA                  ----->  | stack args                |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | 160 bytes reg save area   |
//!                              | (used to save GPRs)       |
//! SP at function entry ----->  | (incl. caller's backchain)|
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//!                              | (used to save FPRs)       |
//! unwind-frame base     ---->  | (alloc'd by prologue)     |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | stack slots               |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | outgoing calls return buf |
//!                              | outgoing calls args       |
//!                              | outgoing reg save area    |
//!                              | (alloc'd by prologue)     |
//! SP during function  ------>  | (incl. callee's backchain)|
//!                              +---------------------------+

The new tail-call ABI would instead use this layout:

//!                              +---------------------------+
//!                              |          ...              |
//! CFA                  ----->  | (caller's frame)          |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | 160 bytes reg save area   |
//!                              | (used to save GPRs)       |
//! SP at function return----->  | (incl. caller's backchain)|
//!                              +---------------------------+
//!                              |          ...              |
//!                              | incoming stack args       |
//! SP at function entry ----->  | (incl. backchain copy)    |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | outgoing tail call args   |
//!                              | (overlaps incoming args)  |
//!                              | (incl. backchain copy)    |
//! SP at tail call       ---->  | (alloc'd by prologue)     |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//!                              | (used to save FPRs)       |
//! unwind-frame base     ---->  | (alloc'd by prologue)     |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | stack slots               |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | outgoing calls return buf |
//!                              | outgoing reg save area    |
//!                              | (alloc'd by prologue)     |
//! SP during function  ------>  | (incl. callee's backchain)|
//!                              +---------------------------+
//!                              |          ...              |
//!                              | outgoing stack args       |
//!                              | (alloc'd by call sequence)|
//! SP at non-tail call  ----->  | (incl. backchain copy)    |
//!                              +---------------------------+

Notably, the handling of the stack backchain with this tail-call ABI remains fully compatible with the SystemV ABI. Existing unwinder routines should be able to correctly unwind through any combination of SystemV and tail-call ABI frames. To enable asynchronous unwinding during the call sequence, we store copies of the innermost backchain value at the bottom of the outgoing (or tail-call) argument area. These only remain live during the call sequence; they are no longer accessed once the callee has finished setting up its own stack frame. Note that if unwinding during that window is not required, it would be possible to omit storing those copies and avoid some overhead.
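
To make the call-site handling concrete, here is a minimal pseudo-Rust sketch of a non-tail call under this layout; the helpers are placeholders standing in for the emitted s390x instructions, not a real emission API.

// Pseudo-Rust for the non-tail call sequence under the proposed ABI.
fn load(_addr: usize) -> usize {
    0 // placeholder: read the word at `addr`
}

fn store(_addr: usize, _val: usize) {
    // placeholder: write `val` to the word at `addr`
}

fn non_tail_call_sequence(sp: &mut usize, outgoing_stack_args: usize) {
    if outgoing_stack_args > 0 {
        let backchain = load(*sp);  // innermost backchain value
        *sp -= outgoing_stack_args; // area alloc'd by the call sequence
        store(*sp, backchain);      // backchain copy keeps the stack
                                    // asynchronously unwindable here
        // ... store the stack arguments above the backchain copy ...
    }
    // ... emit the call; the callee's prologue takes over from here ...
    *sp += outgoing_stack_args;     // deallocated after the call returns
}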

In addition, the tail-call ABI would use both r6 and r7 as argument registers, and make them both non-callee-saved. (r6 is a callee-saved argument register in the SystemV ABI, which is incompatible with tail calls. As we're making changes anyway, it makes sense to also change r7 so we have 6 argument registers instead of 5.)

To implement the new ABI, the following changes would need to be made:

  1. Tail-call ABI prologue:

     (Overhead compared to SystemV ABI: compute backchain value; none if no incoming stack args)

  2. Tail-call ABI epilogue:

     (Overhead compared to SystemV ABI: compute restored SP; none if no incoming stack args)

  3. Tail-call sequence:

     (Overhead compared to SystemV ABI: n/a, doesn't have tail calls)

  4. Non-tail calls:

     (Overhead compared to SystemV ABI: allocating outgoing args area & storing backchain copy; none if no outgoing stack args)

Cross-calling-convention calls would also be possible:

  1. SystemV ABI to tail-call ABI
  2. tail-call ABI to SystemV ABI

CC @elliottt @jameysharp @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 16:34):

fitzgen commented on issue #6530:

Noting my comment from today's Cranelift meeting for posterity: the async stack walking doesn't seem too important (it can be relegated to a default-off setting, imo) if we preserve on s390x the invariant the other backends have, that we don't interleave argument evaluation with calling convention code (since argument evaluation can trap, and we must be able to walk the stack at trap sites). This would be easy to do, and wouldn't require moving s390x to the shared ABI framework code, by having s390x's calling convention code in ISLE take Regs rather than Values, forcing the evaluation to happen at the boundary before we begin calling convention stuff.
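
A hypothetical contrast of the two boundary shapes, in Rust pseudocode rather than actual ISLE; Value and Reg mirror the lowering concepts, but these functions are illustrative placeholders, not the real backend API.

struct Value; // an SSA value; evaluating it into a register can trap
struct Reg;   // an already-evaluated virtual register

// Interleaved shape: calling-convention code receives `Value`s and may
// evaluate them (possibly trapping) in the middle of the call sequence,
// while the stack is in a state the unwinder cannot describe.
fn lower_call_from_values(_args: &[Value]) { /* ... */ }

// Boundary shape: callers must evaluate every argument into a `Reg`
// first, so any trapping work happens before calling-convention code
// begins and the stack remains walkable at every trap site.
fn lower_call_from_regs(_args: &[Reg]) { /* ... */ }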

Ulrich, you mentioned that this would prevent us from performing an existing sign-/zero-extend optimization. I guess my question then is which has more overhead: missing that optimization or storing a copy of the backchain?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 26 2024 at 13:43):

uweigand commented on issue #6530:

I've now completed a prototype implementation of the above design. This successfully passes all tests (including the wasm tail-call tests), with one exception: gc_and_tail_calls_and_stack_arguments.

Unfortunately, this indicates a fundamental problem: the above design requires dynamic stack adjustments around calls. However, common code support for that was removed recently. I thought I could work around this in the back-end, and that does indeed work fine for code generation - but then common code gets the stack maps at call sites wrong, as it no longer takes this dynamic adjustment into account.

So at this point we can either add that common code support back, or else re-design the ABI to avoid dynamic stack adjustments. The latter is a bit tricky if we want to avoid frequent copying of the back chain value. I'll need to think about that more.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 26 2024 at 19:59):

jameysharp commented on issue #6530:

I haven't been able to focus on this enough to understand what problem you're solving with dynamic stack adjustments, but I think we wanted to do something similar enough someday in other backends that it may be worth talking through how to get this right in common code.

In particular, to keep things simple, @elliottt and I placed outgoing arguments at the bottom of the outgoing argument area, so the stack pointer is always in the same place on entry to a call. But we planned eventually to place those arguments at the top of the outgoing argument area instead and move the stack pointer before and after the call, to reduce stack usage in deeply nested call stacks. I'm not sure but perhaps that would mean we would run into the same issue you're seeing with stack maps.

Of course, @fitzgen is currently working on significant changes to the stack map infrastructure, which might also change this situation, I'm not sure.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2024 at 01:16):

fitzgen commented on issue #6530:

While walking the stack to identify on-stack GC roots, the runtime is given a series of frame pointers (backchains on s390x) and PCs for each frame. It looks up the appropriate stack map metadata based on the PC. The stack map provides the offset-from-SP of every GC root in the stack frame and the size of the stack frame that it is mapping. The runtime takes the frame's FP, subtracts the stack map's frame size from FP, which yields the SP. It then uses the stack map and the SP to note all of the live, on-stack GC roots.
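
As a minimal sketch of that FP-to-SP computation, with illustrative types rather than Wasmtime's actual stack-map structures:

// Illustrative only; not Wasmtime's real stack-map representation.
struct StackMap {
    frame_size: usize,           // size of the frame this map describes
    gc_root_offsets: Vec<usize>, // offsets from SP of live GC roots
}

// `fp` is the frame pointer (the backchain value on s390x).
fn visit_gc_roots(fp: usize, map: &StackMap, mut visit: impl FnMut(usize)) {
    let sp = fp - map.frame_size; // FP minus mapped frame size yields SP
    for off in &map.gc_root_offsets {
        visit(sp + *off); // address of one live, on-stack GC root
    }
}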

(I have a branch, not quite ready yet, that moves Wasmtime over from the "old" stack maps (the current situation) to the new "user" stack maps (the stuff Trevor and I have been poking at for the last couple of weeks). But it shouldn't actually change anything with regards to the above; all of that should be just as true for the new system as for the old.)

All this to say: if you are adjusting the SP at call sites, then it is possible (likely?) that the frame size noted in the stack maps has gotten out of sync with the frame's actual size, which could explain why that test is failing.

Some relevant bits of code:

The stack maps are then passed to the EmitState via the pre_safepoint method. They are later taken out of the EmitState via take_stack_map calls when emitting a call/call-indirect instruction in the backends. It seems like updating the frame size of the stack map in this call-emit code, if you adjust the SP for the call, might be the easiest solution here.
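
A hedged sketch of that flow and the suggested fix-up point; pre_safepoint and take_stack_map are the method names mentioned above, but the types and bodies here are simplified guesses, not the real Cranelift code.

// Simplified model of the emit-time stack-map flow.
struct StackMap { frame_size: u32 }

struct EmitState { pending_stack_map: Option<StackMap> }

impl EmitState {
    fn pre_safepoint(&mut self, map: StackMap) {
        self.pending_stack_map = Some(map);
    }
    fn take_stack_map(&mut self) -> Option<StackMap> {
        self.pending_stack_map.take()
    }
}

// In the backend's call emission: if the call sequence moved SP,
// account for that in the map's frame size (hypothetical fix-up).
fn emit_call(state: &mut EmitState, sp_adjustment: u32) {
    if let Some(mut map) = state.take_stack_map() {
        map.frame_size += sp_adjustment;
        // ... record `map` as the safepoint for this call ...
    }
    // ... emit the actual call instruction ...
}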

view this post on Zulip Wasmtime GitHub notifications bot (Jul 29 2024 at 14:04):

uweigand commented on issue #6530:

While walking the stack to identify on-stack GC roots, the runtime is given a series of frame pointers (backchains on s390x) and PCs for each frame. It looks up the appropriate stack map metadata based on the PC. The stack map provides the offset-from-SP of every GC root in the stack frame and the size of the stack frame that it is mapping. The runtime takes the frame's FP, subtracts the stack map's frame size from FP, which yields the SP. It then uses the stack map and the SP to note all of the live, on-stack GC roots.

Thanks for the explanation! With this approach, "local" SP manipulations around call sites should actually not affect use of the stack map at all, as long as runtime unwinding provides correct frame pointer values and the stack map provides the correct size of the "nominal" stack.

All this to say: if you are adjusting the SP at call sites, then it is possible (likely?) that the frame size noted in the stack maps has gotten out of sync with the frame's actual size, which could explain why that test is failing.

And this was indeed the case, though not because of the SP adjustment: the frame size of tail-call frames was just generally wrong. In my approach, incoming tail-call arguments are part of the callee's frame (as opposed to the caller's), and therefore must be accounted for as part of the callee's frame size.

With this fixed, I now see the GC test (and all other tests) pass.

I'll have to do a bit more cleanup, and then I should be ready to post a PR.

I haven't been able to focus on this enough to understand what problem you're solving with dynamic stack adjustments, but I think we wanted to do something similar enough someday in other backends that it may be worth talking through how to get this right in common code.

To elaborate on that: in the s390x ABI we do not have a frame pointer; instead, stack backtracing works by having a stack "backchain" that is always present directly at the location the stack pointer points to. This is usually quite efficient, but has the drawback that any adjustment of the stack pointer also requires updating the backchain field, unless you can prove it must already have been set in the past. In particular, for function return in the current ABI, we know that we'll restore SP back to where the caller already placed the correct backchain value, so we do not have to re-write the backchain on every return.

With the approach you're currently taking for tail calls on Intel and ARM, however, incoming tail-call arguments are part of the caller's frame, and so are outgoing arguments. This may require growing the caller's frame. If we were to do that on s390x, that growing stack frame would overwrite the caller's backchain value, and would require us to re-write a backchain on every return (actually only on every return from a tail-call chain that somewhere grew the frame, but as we don't know whether this is the case, we'd in fact have to re-write on every return). This would be quite some overhead, which I'm avoiding by making tail-call arguments always part of the callee's frame. But that means the caller has to start allocating that bit of the callee's frame, and therefore temporarily adjust SP around the call site.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 01 2024 at 20:22):

fitzgen closed issue #6530:

I'm having a hard time figuring out the details of s390x here, so I am going to disable support for the tail calling convention on s390x (with a loud assertion) to land this PR. We can get s390x working in follow ups.

_Originally posted by @fitzgen in https://github.com/bytecodealliance/wasmtime/issues/6500#issuecomment-1579427510_

