fitzgen edited issue #6530:
I'm having a hard time figuring out the details of s390x here, so I am going to disable support for the
tail
calling convention on s390x (with a loud assertion) to land this PR. We can get s390x working in follow ups._Originally posted by @fitzgen in https://github.com/bytecodealliance/wasmtime/issues/6500#issuecomment-1579427510_
fitzgen commented on issue #6530:
Hey @uweigand, would really appreciate your help here. There are two things that need to be done (or at least, I have done it with these two different steps for the other backends):
Get regular, non-tail calls working with the
tail
calling convention.This has involved modifying the calling convention such that callees (rather than callers) pop stack arguments as part of their function epilogue and before/at the same time as returning to their caller.
There is a runtest that use the
tail
calling convention without actually doing tail calls, it is currently disabled on s390x but would presumably be enabled during this step. I also recommend copying that test tofiletests/isa/x390x
and turning it into acompile precise-output
test as well, which the other backends have done too.Implement tail calls.
These are only valid CLIF when both the caller and callee use the
tail
calling convention, so that is something we can rely upon.None of the other backends have been doing parallel move solving during tail calls to make sure new arguments end up in the correct locations without overwriting old arguments that are on the stack but become new arguments elsewhere on the stack in the tail call. Instead, they bump SP and construct a temporary frame and then move the frame down on top of the current frame and then initialize SP for the callee. I recommend starting with this simple approach, and we can later do parallel move solving as an optimization, but whatever works for you.
There are a few runtests for exercising tail calls:
filetests/runtests/return-call{.clif,-indirect.clif,-loop.clif}
. Each backend also has a copy ofreturn-call.clif
(which actually covers both colocated and not calls) to their ISA-specific tests ascompile precise-output
tests, and I recommend that s390x has a copy as well.I've been adding
emit_return_call
methods to each backend'sabi::CallSite<M>
instantiation, so that the top-level entry is arch-specific rather than going through a common generic helper. The ISLE lowering calls the externalgen_return_call[_indirect]
constructor, which each backend implements. This external constructor creates anabi::CallSite<MyBackend>
and then calls itsemit_return_call
method. There is a generic helper to create the temporary tail call frame that you can use if you want, but in general you have direct control of everything we do here, you don't have to fit into the shape of generic hooks and callbacks. I think it is easier to read and write the code this way, hopefully you find the same.For reference, the last commit in https://github.com/bytecodealliance/wasmtime/pull/6749 shows what implementing tail calls for riscv64 entailed. It was pretty straightforward, and hopefully can serve as a template to reference and crib from for s390x.
Don't hesitate to reach out with any questions or clarifications or anything like that! Thanks very much!
fitzgen commented on issue #6530:
Oh also, since Wasmtime will use the
tail
calling convention for all Wasm functions, I think we need to mark thetail
calling convention as having little endian lane ordering on s390x. I am not sure where that change needs to be made.
uweigand commented on issue #6530:
This has involved modifying the calling convention such that callees (rather than callers) pop stack arguments as part of their function epilogue and before/at the same time as returning to their caller.
This is probably the core of the problem - in the current s390x ABI, neither callers nor callees pop stack arguments as part of the calling sequence. Instead, the stack pointer stays constant around calls, and the space for all arguments for all calls the current function may ever make is statically allocated in its prologue. Therefore the tail calling convention will have to be implemented differently, either by switching over to a completely different convention that actually pushes/pops argument space around calls, or by some other means like allowing calls to not preserve the stack pointer (which in turn would enforce use of a frame pointer, which we also currently do not have on s390x ...).
I'll need to look into those options a bit more, but I won't get to this before I'm back from vacation in early August.
uweigand commented on issue #6530:
Here's finally my current design of a tail-call ABI for s390x.
The SystemV ABI has the following stack layout:
//! +---------------------------+ //! | ... | //! CFA -----> | stack args | //! +---------------------------+ //! | ... | //! | 160 bytes reg save area | //! | (used to save GPRs) | //! SP at function entry -----> | (incl. caller's backchain)| //! +---------------------------+ //! | ... | //! | clobbered callee-saves | //! | (used to save FPRs) | //! unwind-frame base ----> | (alloc'd by prologue) | //! +---------------------------+ //! | ... | //! | spill slots | //! | (accessed via nominal SP) | //! | ... | //! | stack slots | //! | (accessed via nominal SP) | //! nominal SP ---------------> | (alloc'd by prologue) | //! +---------------------------+ //! | ... | //! | outgoing calls return buf | //! | outgoing calls args | //! | outgoing reg save area | //! | (alloc'd by prologue) | //! SP during function ------> | (incl. callee's backchain)| //! +---------------------------+
The new tail-call ABI would instead use this layout:
//! +---------------------------+ //! | ... | //! CFA -----> | (caller's frame) | //! +---------------------------+ //! | ... | //! | 160 bytes reg save area | //! | (used to save GPRs) | //! SP at function return-----> | (incl. caller's backchain)| //! +---------------------------+ //! | ... | //! | incoming stack args | //! SP at function entry -----> | (incl. backchain copy) | //! +---------------------------+ //! | ... | //! | outgoing tail call args | //! | (overlaps incoming args) | //! | (incl. backchain copy) | //! SP at tail cail ----> | (alloc'd by prologue) | //! +---------------------------+ //! | ... | //! | clobbered callee-saves | //! | (used to save FPRs) | //! unwind-frame base ----> | (alloc'd by prologue) | //! +---------------------------+ //! | ... | //! | spill slots | //! | (accessed via nominal SP) | //! | ... | //! | stack slots | //! | (accessed via nominal SP) | //! nominal SP ---------------> | (alloc'd by prologue) | //! +---------------------------+ //! | ... | //! | outgoing calls return buf | //! | outgoing reg save area | //! | (alloc'd by prologue) | //! SP during function ------> | (incl. callee's backchain)| //! +---------------------------+ //! | ... | //! | outgoing stack args | //! | (alloc'd by call sequence)| //! SP at non-tail call -----> | (incl. backchain copy) | //! +---------------------------+
Notably, the handling of the stack backchain with this tail-call ABI remains fully compatible with the SystemV ABI. Existing unwinder routines should be able to correctly unwind through any combination of SystemV and tail-call ABI frames. To enable asynchronous unwinding during the call sequence, we store copies of the innermost backchain value at the bottom of the outgoing (or tail-call) argument area. These only remain live during the call sequence, they are no longer accessed after the call sequence is completed by the callee setting up it's own stack frame. Note that if unwinding during that period is not required, it would be possible to omit storing those copies to avoid some overhead.
In addition, the tail-call ABI would use both
r6
andr7
as argument registers, and make them both non-callee-saved. (r6
is a callee-saved argument register in the SystemV ABI, which is incompatible with tail calls. As we're making changes anway, it makes sense to also changer7
so we have 6 argument registers instead of 5.)To implement the new ABI, the following changes would need to be made:
- Tail-call ABI prologue:
- Save callee-saved GPRs into incoming reg save area
- Compute size of tail-call argument area (extending incoming argument area)
- Allocate local stack frame, including
- tail-call argument area
- clobbered callee-saves area
- spill slots
- stack slots
- outgoing calls return buffer
- outgoing reg save area
- Store backchain, pointing to caller's backchain (not the copy!)
- Save callee-saved FPRs into callee-saves area
(Overhead compared to SystemV ABI: compute backchain value; none if no incoming stack args)
- Tail-call ABI epilogue:
- Restore callee-saved FPRs from callee-saves area
- Restore callee-saved GPRs (except SP, including %r14) from incoming reg save area
- Restore SP to value at function return (popping incoming args area)
- Return by branching to (restored) %r14
(Overhead compared to SystemV ABI: compute restored SP; none if no incoming stack args)
- Tail-call sequence:
- (If present) Construct outgoing args into (possibly extended) incoming args area
- Includes storing copy of caller's backchain to the bottom of the outgoing args
- Load up arguments into registers
- (If present) restore return buffer address register to incoming value
- Restore callee-saved FPRs from callee-saves area
- Restore callee-saved GPRs (except SP, including %r14) from incoming reg save area
- Set SP to bottom of outgoing args area for this call
- Jump to target address
(Overhead compared to SystemV ABI: n/a, doesn't have tail calls)
- Non-tail calls:
- (If present) Set up outgoing args area
- Allocate area by decrementing stack pointer
- Store copy of callee's backchain to the bottom of the outgoing args
- Construct outgoing args into outgoing args area
- Load up arguments into register
- (If present) set up return buffer address register to return buffer
- Perform function call (storing return address into %r14)
- Retrieve return values from registers and/or return buffer
(Overhead compared to SystemV ABI: allocating outgoing args area & storing backchain copy; none if no outgoing stack args)
Cross-calling-convention calls would also be possible:
- SystemV ABI to tail-call ABI
- mostly the same as tail-call to tail-call calls
- GPRs 6 and 7 must be considered clobbered by callee
- tail-call ABI to SystemV ABI
- allocate outgoing args in caller's frame as for SystemV ABI
CC @elliottt @jameysharp @cfallin
fitzgen commented on issue #6530:
Noting my comment from today's Cranelift meeting for posterity: the async stack walking doesn't seem too important (can be relegated to a default off setting, imo) if we preserve the invariant on s390x that other backends have that we don't interleave argument evaluation with calling convention code. This would be easy to do, and wouldn't require moving s390x to the shared ABI framework code, by having s390x's calling convention code in ISLE take
Reg
s rather thanValue
s, forcing the evaluation to happen at the boundary before we begin calling convention stuff.Ulrich, you mentioned that this would prevent us from performing an existing sign-/zero-extend optimization. I guess my question then is which has more overhead: missing that optimization or storing a copy of the backchain?
fitzgen edited a comment on issue #6530:
Noting my comment from today's Cranelift meeting for posterity: the async stack walking doesn't seem too important (can be relegated to a default off setting, imo) if we preserve the invariant on s390x that other backends have that we don't interleave argument evaluation with calling convention code (since argument evaluation can trap, and we must be able to walk the stack at trap sites. This would be easy to do, and wouldn't require moving s390x to the shared ABI framework code, by having s390x's calling convention code in ISLE take
Reg
s rather thanValue
s, forcing the evaluation to happen at the boundary before we begin calling convention stuff.Ulrich, you mentioned that this would prevent us from performing an existing sign-/zero-extend optimization. I guess my question then is which has more overhead: missing that optimization or storing a copy of the backchain?
fitzgen edited a comment on issue #6530:
Noting my comment from today's Cranelift meeting for posterity: the async stack walking doesn't seem too important (can be relegated to a default off setting, imo) if we preserve the invariant on s390x that other backends have that we don't interleave argument evaluation with calling convention code (since argument evaluation can trap, and we must be able to walk the stack at trap sites). This would be easy to do, and wouldn't require moving s390x to the shared ABI framework code, by having s390x's calling convention code in ISLE take
Reg
s rather thanValue
s, forcing the evaluation to happen at the boundary before we begin calling convention stuff.Ulrich, you mentioned that this would prevent us from performing an existing sign-/zero-extend optimization. I guess my question then is which has more overhead: missing that optimization or storing a copy of the backchain?
uweigand commented on issue #6530:
I've now completed a prototype implementation of the above design. This sucessfully passes all tests (including the wasm tail-call tests), with one execption:
gc_and_tail_calls_and_stack_arguments
.Unfortunately, this indicates a fundamental problem: the above design requires dynamic stack adjustments around calls. However, common code support for that was removed recently. I thought I could work around this in the back-end, and that does indeed work fine for code generation - but then common code gets the stack maps at call sites wrong, as it no longer takes this dynamic adjustment into account.
So at this point we can either add that common code support back, or else re-design the ABI to avoid dynamic stack adjustments. The latter is a bit tricky if we want to avoid frequent copying of the back chain value. I'll need to think about that more.
jameysharp commented on issue #6530:
I haven't been able to focus on this enough to understand what problem you're solving with dynamic stack adjustments, but I think we wanted to do something similar enough someday in other backends that it may be worth talking through how to get this right in common code.
In particular, to keep things simple, @elliottt and I placed outgoing arguments at the bottom of the outgoing argument area, so the stack pointer is always in the same place on entry to a call. But we planned eventually to place those arguments at the top of the outgoing argument area instead and move the stack pointer before and after the call, to reduce stack usage in deeply nested call stacks. I'm not sure but perhaps that would mean we would run into the same issue you're seeing with stack maps.
Of course, @fitzgen is currently working on significant changes to the stack map infrastructure, which might also change this situation, I'm not sure.
fitzgen commented on issue #6530:
While walking the stack to identify on-stack GC roots, the runtime is given a series of frame pointers (backchains on s390x) and PCs for each frame. It looks up the appropriate stack map metadata based on the PC. The stack map provides the offset-from-SP of every GC root in the stack frame and the size of the stack frame that it is mapping. The runtime takes the frame's FP, subtracts the stack map's frame size from FP, which yields the SP. It then uses the stack map and the SP to note all of the live, on-stack GC roots.
(I have a branch that is just not quite ready yet to move Wasmtime over from the "old" stack maps (current situation) to the new "user" stack maps (stuff myself and Trevor have been poking at the last couple weeks). But it shouldn't actually change anything with regards to the above; all that should be just as true for the new system as it is for the old system.)
All this to say: If you are adjusting the SP at call sites, then it is possible (likely?) that the frame size noted in the stack maps have gotten out of sync with the frame's actual size, which could explain why that test is failing.
Some relevant bits of code:
- The method that the new "user" stack maps use to get the frame size: https://github.com/bytecodealliance/wasmtime/blob/3767f76b487959194b10ec9e17ddb8d1c3a37116/cranelift/codegen/src/machinst/abi.rs#L986
- The open-coded version of getting the frame size for the "old" stack maps: https://github.com/bytecodealliance/wasmtime/blob/3767f76b487959194b10ec9e17ddb8d1c3a37116/cranelift/codegen/src/machinst/abi.rs#L1661
The stack maps are then passed to the
EmitState
via thepre_safepoint
method. Then they are taken out of theEmitState
during viatake_stack_map
calls when emitting a call/call-indirect instruction in the backends. It seems like maybe updating the frame size of the stack map in this call-emit code, if you adjust the SP for the call, might be the easiest solution here.
uweigand commented on issue #6530:
While walking the stack to identify on-stack GC roots, the runtime is given a series of frame pointers (backchains on s390x) and PCs for each frame. It looks up the appropriate stack map metadata based on the PC. The stack map provides the offset-from-SP of every GC root in the stack frame and the size of the stack frame that it is mapping. The runtime takes the frame's FP, subtracts the stack map's frame size from FP, which yields the SP. It then uses the stack map and the SP to note all of the live, on-stack GC roots.
Thanks for the explanation! With this approach, "local" SP manipulations around call sites should actually not affect use of the stack map at all, as long as runtime unwinding provides correct frame pointer values and the stack map provides the correct size of the "nominal" stack.
All this to say: If you are adjusting the SP at call sites, then it is possible (likely?) that the frame size noted in the stack maps have gotten out of sync with the frame's actual size, which could explain why that test is failing.
And this was indeed the case, but not because of SP adjusting, but because the frame size of tail call frames was just generally wrong. In my approach, incoming tail call arguments are part of the callee's frame (as opposed to the caller's), and therefore must be accounted as part of the callee's frame size.
With this fixed, I now see the GC test (and all other tests) pass.
I'll have to do a bit more cleanup, and then I should be ready to post a PR.
I haven't been able to focus on this enough to understand what problem you're solving with dynamic stack adjustments, but I think we wanted to do something similar enough someday in other backends that it may be worth talking through how to get this right in common code.
To elaborate on that: in the s390x ABI we do not have a frame pointer, but instead stack backtracing works by having a stack "backchain" that is always present directly at the location the stack pointer points to. This is usually quite efficient, but has the drawback that any adjustment of the stack pointer requires also updating the back chain field - unless you can prove it must have already been set in the past. In particular, for function return in the current ABI, we know that we'll restore SP back to where the caller already placed the correct backchain value, so we do not have to re-write the backchain on every return.
With the approach you're currently taking for tail calls on Intel and ARM, however, incoming tail call arguments are part of the caller's frame, and so are outgoing arguments. This may require growing the caller's frame. If we were to do that on s390x, that growing stack frame would overwrite the caller's backchain value, and would require us to re-write a backchain on every return (actually only on every return from a tail-call chain that somewhere grew the frame, but as we don't know whether this is the case, we'd in fact have to re-write on every return). This would be quite some overhead, which I'm avoiding by making tail call arguments always part of the callee's frame. But that means the caller has to start allocated that bit of the callee's frame and therefore temporarily adjust SP around the call site.
uweigand edited a comment on issue #6530:
While walking the stack to identify on-stack GC roots, the runtime is given a series of frame pointers (backchains on s390x) and PCs for each frame. It looks up the appropriate stack map metadata based on the PC. The stack map provides the offset-from-SP of every GC root in the stack frame and the size of the stack frame that it is mapping. The runtime takes the frame's FP, subtracts the stack map's frame size from FP, which yields the SP. It then uses the stack map and the SP to note all of the live, on-stack GC roots.
Thanks for the explanation! With this approach, "local" SP manipulations around call sites should actually not affect use of the stack map at all, as long as runtime unwinding provides correct frame pointer values and the stack map provides the correct size of the "nominal" stack.
All this to say: If you are adjusting the SP at call sites, then it is possible (likely?) that the frame size noted in the stack maps have gotten out of sync with the frame's actual size, which could explain why that test is failing.
And this was indeed the case, but not because of SP adjusting, but because the frame size of tail call frames was just generally wrong. In my approach, incoming tail call arguments are part of the callee's frame (as opposed to the caller's), and therefore must be accounted as part of the callee's frame size.
With this fixed, I now see the GC test (and all other tests) pass.
I'll have to do a bit more cleanup, and then I should be ready to post a PR.
I haven't been able to focus on this enough to understand what problem you're solving with dynamic stack adjustments, but I think we wanted to do something similar enough someday in other backends that it may be worth talking through how to get this right in common code.
To elaborate on that: in the s390x ABI we do not have a frame pointer, but instead stack backtracing works by having a stack "backchain" that is always present directly at the location the stack pointer points to. This is usually quite efficient, but has the drawback that any adjustment of the stack pointer requires also updating the back chain field - unless you can prove it must have already been set in the past. In particular, for function return in the current ABI, we know that we'll restore SP back to where the caller already placed the correct backchain value, so we do not have to re-write the backchain on every return.
With the approach you're currently taking for tail calls on Intel and ARM, however, incoming tail call arguments are part of the caller's frame, and so are outgoing arguments. This may require growing the caller's frame. If we were to do that on s390x, that growing stack frame would overwrite the caller's backchain value, and would require us to re-write a backchain on every return (actually only on every return from a tail-call chain that somewhere grew the frame, but as we don't know whether this is the case, we'd in fact have to re-write on every return). This would be quite some overhead, which I'm avoiding by making tail call arguments always part of the callee's frame. But that means the caller has to start allocating that bit of the callee's frame and therefore temporarily adjust SP around the call site.
fitzgen closed issue #6530:
I'm having a hard time figuring out the details of s390x here, so I am going to disable support for the
tail
calling convention on s390x (with a loud assertion) to land this PR. We can get s390x working in follow ups._Originally posted by @fitzgen in https://github.com/bytecodealliance/wasmtime/issues/6500#issuecomment-1579427510_
Last updated: Jan 24 2025 at 00:11 UTC