Stream: git-wasmtime

Topic: wasmtime / issue #9510 Cranelift: Remove support for impl...


Wasmtime GitHub notifications bot (Oct 24 2024 at 16:41):

bjorn3 opened issue #9510:

Feature

When there are more returns than available return registers, the extra return values have to be returned through a return area pointer. This can be done either explicitly using an ArgumentPurpose::StructReturn parameter (as cg_clif does) or implicitly by Cranelift. The support for implicitly adding it adds a lot of complexity to the ABI handling code in Cranelift (which is already way too complex) and has two ABI incompatibilities:

Given that there is a lot of complexity in Cranelift caused by this and it is effectively broken anyway for native calling conventions, I propose removing support for implicit return area pointers, requiring the frontend to use explicit return area pointers instead.
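For illustration only (this sketch is not from the issue itself), the two forms look roughly like this when building signatures with the cranelift-codegen API; the calling convention and value types here are placeholders, assuming a 64-bit target:

```rust
use cranelift_codegen::ir::{types, AbiParam, ArgumentPurpose, Signature};
use cranelift_codegen::isa::CallConv;

fn example_signatures() -> (Signature, Signature) {
    // Explicit form (what cg_clif does): the frontend adds a pointer-sized
    // parameter tagged StructReturn and writes the extra results through it,
    // keeping the CLIF-level return list small.
    let mut explicit = Signature::new(CallConv::SystemV);
    explicit
        .params
        .push(AbiParam::special(types::I64, ArgumentPurpose::StructReturn));
    explicit.params.push(AbiParam::new(types::I64)); // an ordinary argument
    explicit.returns.push(AbiParam::new(types::I64)); // results that still fit in registers

    // Implicit form (what this issue proposes removing): the signature simply
    // lists more returns than there are return registers, and Cranelift
    // introduces the return area pointer behind the scenes.
    let mut implicit = Signature::new(CallConv::SystemV);
    implicit.params.push(AbiParam::new(types::I64));
    for _ in 0..8 {
        implicit.returns.push(AbiParam::new(types::I64));
    }

    (explicit, implicit)
}
```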

Benefit

Wasmtime GitHub notifications bot (Oct 24 2024 at 16:54):

cfallin commented on issue #9510:

@bjorn3 and I talked about this a bit offline -- I was initially skeptical for the reason that supporting arbitrary return values is ergonomic, fits well with the CLIF abstraction level in general, and we have it already. However, if we're facing some brokenness in our implementation already with respect to what the ABI says should happen, it may not be a bad idea to say that the Cranelift embedder (Wasmtime or cg_clif or other) needs to build their own mechanism for multiple return values. After all, "N args, 1 return" is also a perfectly consistent abstraction level and one that we live with in many other languages.

Practically speaking in this repo, this means a good amount of code removal in the ABI code as bjorn3 mentions, and then building the equivalent in cranelift-wasm. I suspect we could do it in a single pass still by, lazily when needed, creating a stackslot for returns, generating a CLIF call with extra arg, and loads of values from stackslot to define the return values. (The stackslot can be reused for all callsites, and we could update its size as we discover larger return-value sets.)
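A hedged sketch of what that caller-side lowering could look like with cranelift-frontend; the helper name, slot sizing, and alignment are placeholders, builder signatures vary a bit across Cranelift versions, and a 64-bit pointer type is assumed:

```rust
use cranelift_codegen::ir::{
    types, FuncRef, InstBuilder, MemFlags, StackSlotData, StackSlotKind, Type, Value,
};
use cranelift_frontend::FunctionBuilder;

/// Hypothetical helper: call `callee`, passing the address of a stack slot as
/// an extra trailing argument, then load each declared result back out of it.
fn call_with_return_area(
    builder: &mut FunctionBuilder,
    callee: FuncRef,
    args: &[Value],
    result_types: &[Type],
) -> Vec<Value> {
    // Size the slot for this call's results; a real implementation would
    // reuse one shared slot and grow it as larger return-value sets appear.
    let size: u32 = result_types.iter().map(|t| t.bytes()).sum();
    let slot = builder.create_sized_stack_slot(StackSlotData::new(
        StackSlotKind::ExplicitSlot,
        size,
        4, // log2 alignment; placeholder
    ));

    // Pass the slot's address as the extra return-area argument.
    let ret_area = builder.ins().stack_addr(types::I64, slot, 0);
    let mut full_args = args.to_vec();
    full_args.push(ret_area);
    builder.ins().call(callee, &full_args);

    // Reload the results the callee stored into the return area.
    let mut results = Vec::new();
    let mut offset = 0i32;
    for &ty in result_types {
        results.push(builder.ins().load(ty, MemFlags::trusted(), ret_area, offset));
        offset += ty.bytes() as i32;
    }
    results
}
```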

What do others think? We can talk about this in the next CL meeting if needed, too.

Wasmtime GitHub notifications bot (Oct 24 2024 at 16:56):

bjorn3 commented on issue #9510:

I've opened https://github.com/bytecodealliance/wasmtime/pull/9511 to gate the support behind an option. This should help with removing it in the future and reduces the likelihood of accidentally hitting the ABI issue in the short term.

Wasmtime GitHub notifications bot (Nov 07 2024 at 23:00):

fitzgen commented on issue #9510:

Practically speaking in this repo, this means a good amount of code removal in the ABI code as bjorn3 mentions, and then building the equivalent in cranelift-wasm. I suspect we could do it in a single pass still by, lazily when needed, creating a stackslot for returns, generating a CLIF call with extra arg, and loads of values from stackslot to define the return values. (The stackslot can be reused for all callsites, and we could update its size as we discover larger return-value sets.)

Clarification: Who is creating this stack slot? The caller of any function that returns multiple values? And then for tail calls, the caller would forward the return pointer it was itself given, so that the tail call type checks?

Idea: we could also stuff a buffer within the vmctx and avoid burning a register on the return pointer. It does mean we would need to eagerly reload values out of the buffer before any subsequent calls, since the buffer would presumably be too small to use as a full stack, but I think that's what you were thinking we'd do anyways.
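As a rough sketch of that idea (the vmctx field and its offset below are made up; nothing like this exists in Wasmtime today), the caller would eagerly reload the results from a fixed-size buffer embedded in the vmctx immediately after the call, before anything else can clobber it:

```rust
use cranelift_codegen::ir::{FuncRef, InstBuilder, MemFlags, Type, Value};
use cranelift_frontend::FunctionBuilder;

/// Hypothetical offset of a return buffer embedded in the vmctx.
const VMCTX_RETURN_BUFFER_OFFSET: i32 = 0x100;

/// Sketch: call `callee` and reload its results out of the vmctx-embedded
/// buffer right away, so no register is burned on a return area pointer.
fn call_via_vmctx_buffer(
    builder: &mut FunctionBuilder,
    callee: FuncRef,
    args: &[Value],
    vmctx: Value,
    result_types: &[Type],
) -> Vec<Value> {
    builder.ins().call(callee, args);

    let mut results = Vec::new();
    let mut offset = VMCTX_RETURN_BUFFER_OFFSET;
    for &ty in result_types {
        results.push(builder.ins().load(ty, MemFlags::trusted(), vmctx, offset));
        offset += ty.bytes() as i32;
    }
    results
}
```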

Wasmtime GitHub notifications bot (Nov 08 2024 at 06:22):

bjorn3 commented on issue #9510:

Clarification: Who is creating this stack slot? The caller of any function that returns multiple values? And then for tail calls, the caller would forward the return pointer it was itself given, so that the tail call type checks?

Yes

Idea: we could also stuff a buffer within the vmctx and avoid burning a register on the return pointer.

That would work provided that you know an upper bound on the size of the return area. You could also put the return area pointer in the vmctx instead and keep the return area on the stack. That would avoid expensive copies entirely at the cost of an extra indirection. That said, I don't think multi value return with more values than fit in registers is common enough to warrant avoiding a return area pointer register. The register only needs to be reserved if a return area is necessary at all. For the Tail and Winch call convs we could add more return registers than the native call conv to reduce the likelihood of needing a return area.

Wasmtime GitHub notifications bot (Nov 08 2024 at 17:19):

fitzgen commented on issue #9510:

For the Tail and Winch call convs we could add more return registers than the native call conv to reduce the likelihood of needing a return area.

Very true, and this is definitely the first thing we should do if we find ourselves optimizing the return of many values.

That would work provided that you know an upper bound on the size of the return area.

We can do that, since the Wasm spec allows implementations to impose limits on many things including number of results. We try to match the JS/Web limits for consistency and portability, and the JS limit on number of results is 1000.
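For a sense of scale (my arithmetic, not from the thread): with that limit of 1000 results and v128 (16 bytes) as the widest core Wasm value type, such a buffer needs at most 16,000 bytes, i.e. a bit under 16 KiB:

```rust
// Illustrative bound only, not Wasmtime's actual layout.
const MAX_WASM_RESULTS: usize = 1000; // JS/Web embedding limit on results
const MAX_VALUE_BYTES: usize = 16;    // v128 is the widest core Wasm value type
const MAX_RETURN_AREA_BYTES: usize = MAX_WASM_RESULTS * MAX_VALUE_BYTES; // 16_000
```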

In general, I'd prefer embedding the return area space within the vmctx over pointing to the return area space from the vmctx, just so we could avoid the indirection. If we get to the point where it makes sense for us to optimize this stuff by saving a register, we might as well go all the way and avoid the indirection too.


FWIW, the use case where I could imagine these kinds of calling convention micro optimizations actually mattering is when we are dealing with basic blocks that have been compiled down to branchless functions that all tail call each other, and so every function/block in the program takes (say) 6 arguments and generates 6 returns and function calls are frequent.

