Stream: git-wasmtime

Topic: wasmtime / issue #1065 Is there any support for tail call...


view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2021 at 13:59):

L-as commented on issue #1065:

Any update on this? Is it easy enough that an outsider could do it?

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2021 at 15:46):

cfallin commented on issue #1065:

Hi @L-as -- no, there unfortunately hasn't been progress on this. I'd love for that to be different; if you or another motivated person wants to tackle it, it's possible, but it's likely a month or more of delicate surgery updating our ABI and runtime code to use a new calling convention that supports tail calls.

It's the sort of problem that feels simple-ish at first -- indeed if you only have args that go in registers, and no callee-saves or spillslots, it can be as simple as some moves to set up the tailcall args and then a jump... but imposing those requirements on the CLIF producer to obtain a guaranteed tailcall would lead to some surprisingly brittle behavior where e.g. changing some unrelated code requires a spill and suddenly breaks the tailcall. Basically, in the general case, one needs to clean up one's own stackframe, restore callee-saves, and remove one's stack args from the stack, and then set up args on the stack for the callee, but possibly some of those args need to come from your spillslots or stack args (and if there are enough of them you can't hold all of them in the register file at once) so you need to do the arbitrary-state-permute possibly with some temps somewhere, or maybe in the worst case you set up the new frame before you pop your own then memmove() it up, but then what about return-area pointers that need to be fixed up... ah and take care to mind any stack-limit checks we need to do as well... in short, it's a mess.

The Bugzilla bug for SpiderMonkey's Wasm tail call support here contains a link to 25-page Google Doc that outlines in some good detail new ABI that the SpiderMonkey folks have designed to support this... it's a lot of work, but understanding that (or at least the reasons for its decisions) would be a good starting point.

Anyway, if you or anyone else has the resources to tackle this and want to dig in more, I'd be happy to help work out details!

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2022 at 20:25):

cfallin labeled issue #1065:

To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 01:13):

andrew-johnson-4 commented on issue #1065:

@cfallin This may be really naïve of me, but what happens if the caller is made responsible for the tail-call elimination. In the simplest case of a recursive tail-call this is just a reentrant function call with new parameters. How much of a pain and/or performance hit would this be. I am just interested in this issue in general, even if it isn't "the best option possible".

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 01:31):

cfallin commented on issue #1065:

@andrew-johnson-4 can you give a bit more detail what you mean by "caller is made responsible"? E.g., do you mean the callee should return some value that means "now call the next function"? (So f calls g tail-calls h becomes: f calls g, g returns to f, f calls h, somehow.) Or something else?

The general problem to wrestle with is easiest to see (IMHO) in a minimized example: f and g tail-call each other, both have enough arguments that some of them are passed on the stack, and they have different numbers of arguments. E.g., on x86-64 where the first six i64 args are passed in registers, say that f has 8 i64 args and g has 10 i64 args.

Then when f tail-calls g, it needs 32 bytes of stack-arg space (the remaining 4 i64 args), but it itself only was given 16 bytes of stack-arg space. Furthermore, the original caller of f, in the standard SysV ABI, is responsible for popping those 16 bytes of space. There's no way to satisfy all the constraints for a true tail-call in this case in the SysV ABI. That's why the usual way to achieve true tail-calls in the arbitrary case is to adopt a "callee-pop" ABI instead: f and g both expect a certain number of args on the stack, and pop them before returning. This way, f's tail-call to g can do that pop, and then push g's args; and when g returns to the original caller of f, there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args.

It's also possible to make it work in a "caller-pop" scheme, but requires dynamic information to be passed around about the size of argument and return-value areas. In the general case, dynamic approaches are less efficient (at the very least they require an extra arg and retval).

I suspect that what you're thinking of with "caller is made responsible" is either something like caller-pop, or else is a sort of "trampoline" scheme where the callee returns either "next func to call" or "final return" states and the callsite runs a worklist/driver loop. I'm curious to hear more what you have in mind though!

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 01:50):

andrew-johnson-4 commented on issue #1065:

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

For a test right now I am using this function in Javascript:

function f(x) {
   if x === 0 {
      return 1;
   } else {
      return f(x-1) + f(x-1)
   }
}

Without tce this is O(2^n) vs O(n) space.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 01:53):

andrew-johnson-4 edited a comment on issue #1065:

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 02:06):

andrew-johnson-4 edited a comment on issue #1065:

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 02:29):

andrew-johnson-4 edited a comment on issue #1065:

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.

"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 02:35):

andrew-johnson-4 edited a comment on issue #1065:

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.

"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

From a previously linked doc "Agreeing on common conventions allows SpiderMonkey to freely choose the compilation strategy for different WebAssembly functions, and to change that choice over time." This is a design constraint that I specifically don't care about right now, but probably wasmtime does.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 02:47):

cfallin commented on issue #1065:

Either<TailCall,RetVal>

Indeed, that's the "trampoline" approach. It's definitely used in the wild; e.g. when targeting Wasm there's this comment about a Haskell compiler, and IIRC, Clojure provides this (via an explicit API exposed to the user?) as a replacement for tail calls on the JVM. It seems like by far the easiest option at the moment, and could be extended to the corecursive case (tail-call to another function) by returning a "function pointer" (index into function table) as part of the TailCall type...

that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

Probably about a month of effort, depending on who's doing the work and their familiarity with the codebase :-) Adding a new ABI is not too technically challenging on its own, but the cross-cutting interactions with everything else -- backtraces, trampolines elsewhere in the system, unwind info, etc. -- make for a likely somewhat gnarly project.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 02:52):

andrew-johnson-4 commented on issue #1065:

I'll try to familiarize myself more with cranelift before I volunteer. There is lots of low-hanging fruit on my end before tce is important. I would be afraid of completely beggaring your devs if I tried now :O

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 03:18):

cfallin commented on issue #1065:

It's actually somewhat likely that @fitzgen will work on this sometime early next year, or at least, that was my most recent understanding :-) Having others interested in the issue as well is never a bad thing, of course!

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2022 at 07:08):

bjorn3 commented on issue #1065:

By the way tail calls to _Unwind_Resume will be necessary for exceptions. This should be less hard than the general case though as it doesn't require any args being passed on the stack.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 04 2023 at 22:14):

andrew-johnson-4 commented on issue #1065:

@cfallin Is it safe to assume that jumping to the entry block is one special case of a true tail-call? Does this violate SSA or other invariants potentially?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 04 2023 at 22:38):

jameysharp commented on issue #1065:

I don't know that we've documented this, but no, Cranelift doesn't allow branching back to the entry block. I know one reason is that there's no block to hoist loop-invariant code into if the loop's entry doesn't have a dominator. I feel like this has come up as a problem in other contexts too but I don't remember them.

That said, you can always emit an entry block that immediately jumps to another block with the same signature, forwarding the arguments, and branch back to that second block instead of tail-calling.

Either way this only works for a function that tail-calls itself. Cranelift has no way to express a jump to a block in a different function, since block IDs are function-scoped. But in that special case, yes, you could implement tail-calls today.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 10 2023 at 16:49):

fitzgen commented on issue #1065:

FWIW, I am working on an RFC for this at the moment.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 26 2023 at 19:05):

bjorn3 commented on issue #1065:

The RFC in question is https://github.com/bytecodealliance/rfcs/pull/29.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 21 2023 at 18:33):

fitzgen commented on issue #1065:

The big blocker for tail calls was overhauling Wasmtime's trampolines, which I've just posted a PR for: https://github.com/bytecodealliance/wasmtime/pull/6262

Will start on Cranelift support for tail calls when that lands.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 21 2023 at 18:33):

fitzgen assigned issue #1065 (assigned to fitzgen):

To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.


Last updated: Oct 23 2024 at 20:03 UTC