wasmtime / issue #9732 winch(x64): Tail call · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #9732 winch(x64): Tail call

Wasmtime GitHub notifications bot (Jan 06 2025 at 15:18):

saulecabrera edited issue #9732:

The Wasm Tail Call proposal is considered Tier 1 according to Wasmtime's Tiers of support.

Winch currently doesn't support this proposal.

[ ] return_call

[ ] return_call_indirect

Wasmtime GitHub notifications bot (Jan 17 2025 at 00:24):

saulecabrera assigned saulecabrera to issue #9732.

Wasmtime GitHub notifications bot (Jan 17 2025 at 14:49):

MarinPostma commented on issue #9732:

@saulecabrera you can assign me to that :+1:

Wasmtime GitHub notifications bot (Jan 17 2025 at 14:50):

saulecabrera edited issue #9732:

The Wasm Tail Call proposal is considered Tier 1 according to Wasmtime's Tiers of support.

Winch currently doesn't support this proposal.

[ ] return_call @MarinPostma

[ ] return_call_indirect @MarinPostma

Wasmtime GitHub notifications bot (Jan 19 2025 at 17:19):

MarinPostma commented on issue #9732:

@saulecabrera

Did you have a sketch of a plan on how to implement that already?

I did some homework this weekend, trying to find a path toward tail calls. I think we could do something like described in this post about v8 baseline compiler, or as initially implemented in cranelift, basically:

set up the frame as if we were about to perform a normal call

memcpy the frame down to override the caller frame

jump to the callee entry-point

However this is over simplistic. There are many things getting in our way. Here's my understanding of the situation, thus far:

1) The callee and the caller may not require the same amount of stack space for their arguments. In the Winch calling convention, it is the caller that cleans up the stack space for the arguments. This is an issue for tail calls, because the caller doesn't know ahead of time how much stack space will need to be cleaned up when the callee returns. cranelift TailCall convention solves that by requiring that the callee cleans-up it's stack space. I suppose there are alternatives to let the caller know dynamically the amount of stack space to pop on cleanup, but in any case, it seems to me that we need a calling convention that supports that.
2) The ret_area is inherited from the caller, rather and allocated in the caller stack. This is easy to do: we can get it from the caller's initial control frame. (see there). This works, because the callee must have the same return as the caller.
3) We need to emit a jump to the callee, rather than a call, because we must not push a return address, since the state is already set up during the return_call preparation. Winch relies of cranelift to emit relocs for the callee address. The story around converting a function ref to a MachLabel that can be used for a jump is not entirely clear to me yet. Alternatively, we could pop the FP back into it's register and let call push it back (this is what v8 baseline does).
4) Cranelift has Inst::CallReturn* macro-instruction, but it's not clear to me if we can use it, since it emits code specific to cranelift Tail calling convention.

I am glancing over a lot of things here, obviously. My principal concern is with 1) right now; I don't see a way forward without addressing it.

Wasmtime GitHub notifications bot (Feb 10 2025 at 12:51):

saulecabrera commented on issue #9732:

Some preliminary context:

Winch has a single, internal calling convention, which is mostly used between Wasm-to-Wasm calls.

Non Wasm-to-Wasm calls (e.g., calls to host-defined functions), go through

trampolines, which are currently emitted
through Cranelift. These trampolines convert between different calling conventions.

Winch supports other calling conventions (e.g., SysV) used function calls
which don't go through trampolines (calls to built-in function like
f32.abs)

--

Regarding the implementation of tail-calls:

I don't have a one-size-fits-all answer unfortunately, but we did discuss some
ideas in the past, unfortunately they are not officially documented anywhere; in
the next couple of sentences I'll try to do a summary, taking into account your
points above:

I suppose there are alternatives to let the caller know dynamically the amount
of stack space to pop on cleanup, but in any case, it seems to me that we need
a calling convention that supports that.

Indeed, there have been attempts in other production ready Wasm compilers to avoid having
the callee clean up the stack space used for arguments. Given that (i) for tail
calls it's assumed that we're using the current frame to resolve calls and (ii)
even though that we don't know the exact space needed for arguments, there are
some expectations around the stack pointer position before/after the call, which
could be helpful to derive the correct stack pointer position in the presence of
tail calls and stack arguments. I was hoping to explore this idea, through
a prototype. The main benefit that I see here is minimizing ABI changes.

The alternative, as you note, is to have the callee clean up any stack space
allocated for stack arguments. Even though, this seems like the obvious
approach, I would not be comfortable introducing this change without a proper
exploration of the other alternatives; a change like this represents
a fundamental shift in the following fronts:

ABI

Frame handling

Multi-value

In either case, I don't think we need to introduce a different calling
convention, my expectation is that modifying Winch's default calling convention
should suffice.

The ret_area is inherited from the caller, rather and allocated in the caller
stack. This is easy to do: we can get it from the caller's initial control
frame. (see there). This works, because the callee must have the same return
as the caller.

In principle nothing fundamental should change for multiple return values in
the presence of tail calls. Note that if the callee cleans the space for stack
arguments, some updates will be needed to handle the stack return area.

We need to emit a jump to the callee, rather than a call, because we must not
push a return address, since the state is already set up during the
return_call preparation. Winch relies of cranelift to emit relocs for the
callee address. The story around converting a function ref to a MachLabel that
can be used for a jump is not entirely clear to me yet. Alternatively, we
could pop the FP back into it's register and let call push it back (this is
what v8 baseline does).

There are many assumptions in this statement, however, if I'm understanding this
statement correctly, I think that relocations and FP/return address handling are
two, largely orthogonal issues that we need to think about regardless.

--

In general, I agree that resolving how to handle stack arguments is a
pre-requisite for supporting tail-calls. My expectation is to take an informed
decision based on prototyping, given that there are many trade-offs to consider,
not only around the complexity of the potential ABI changes, but also the
impact of any new changes on:

Compilation performance

Runtime performance

Other non-x86-64 backends

After spending some time thinking a bit more about tail calls, and in
particular, with the last point above (non-x86-64 backends), I'd also recommend
waiting until we can fully run spec tests for aarch64. As I've been working on
trying to get them all passing I've found myself fixing some ABI-related bugs [1]
to ensure that our generated code meets all the aarc64 ABI requirements.
I suspect that if we introduce any new compiler-wide ABI changes, before the
aarch64 backend is fully complete, there's a risk of conflicts down the road,
more concreely around the stack pointer handling and relocations side of things.

[1]: https://github.com/bytecodealliance/wasmtime/pull/10146

Wasmtime GitHub notifications bot (Feb 12 2025 at 13:11):

saulecabrera edited issue #9732:

The Wasm Tail Call proposal is considered Tier 1 according to Wasmtime's Tiers of support.

Winch currently doesn't support this proposal.

[ ] return_call

[ ] return_call_indirect

Last updated: Apr 18 2025 at 07:03 UTC