L-as commented on issue #1065:
Any update on this? Is it easy enough that an outsider could do it?
cfallin commented on issue #1065:
Hi @L-as -- no, there unfortunately hasn't been progress on this. I'd love for that to be different; if you or another motivated person wants to tackle it, it's possible, but it's likely a month or more of delicate surgery updating our ABI and runtime code to use a new calling convention that supports tail calls.
It's the sort of problem that feels simple-ish at first -- indeed if you only have args that go in registers, and no callee-saves or spillslots, it can be as simple as some moves to set up the tailcall args and then a jump... but imposing those requirements on the CLIF producer to obtain a guaranteed tailcall would lead to some surprisingly brittle behavior where e.g. changing some unrelated code requires a spill and suddenly breaks the tailcall. Basically, in the general case, one needs to clean up one's own stackframe, restore callee-saves, and remove one's stack args from the stack, and then set up args on the stack for the callee, but possibly some of those args need to come from your spillslots or stack args (and if there are enough of them you can't hold all of them in the register file at once) so you need to do the arbitrary-state-permute possibly with some temps somewhere, or maybe in the worst case you set up the new frame before you pop your own then
memmove()
it up, but then what about return-area pointers that need to be fixed up... ah and take care to mind any stack-limit checks we need to do as well... in short, it's a mess.The Bugzilla bug for SpiderMonkey's Wasm tail call support here contains a link to 25-page Google Doc that outlines in some good detail new ABI that the SpiderMonkey folks have designed to support this... it's a lot of work, but understanding that (or at least the reasons for its decisions) would be a good starting point.
Anyway, if you or anyone else has the resources to tackle this and want to dig in more, I'd be happy to help work out details!
cfallin labeled issue #1065:
To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.
andrew-johnson-4 commented on issue #1065:
@cfallin This may be really naïve of me, but what happens if the caller is made responsible for the tail-call elimination. In the simplest case of a recursive tail-call this is just a reentrant function call with new parameters. How much of a pain and/or performance hit would this be. I am just interested in this issue in general, even if it isn't "the best option possible".
cfallin commented on issue #1065:
@andrew-johnson-4 can you give a bit more detail what you mean by "caller is made responsible"? E.g., do you mean the callee should return some value that means "now call the next function"? (So
f
callsg
tail-callsh
becomes:f
callsg
,g
returns tof
,f
callsh
, somehow.) Or something else?The general problem to wrestle with is easiest to see (IMHO) in a minimized example:
f
andg
tail-call each other, both have enough arguments that some of them are passed on the stack, and they have different numbers of arguments. E.g., on x86-64 where the first sixi64
args are passed in registers, say thatf
has 8i64
args andg
has 10i64
args.Then when
f
tail-callsg
, it needs 32 bytes of stack-arg space (the remaining 4i64
args), but it itself only was given 16 bytes of stack-arg space. Furthermore, the original caller off
, in the standard SysV ABI, is responsible for popping those 16 bytes of space. There's no way to satisfy all the constraints for a true tail-call in this case in the SysV ABI. That's why the usual way to achieve true tail-calls in the arbitrary case is to adopt a "callee-pop" ABI instead:f
andg
both expect a certain number of args on the stack, and pop them before returning. This way,f
's tail-call tog
can do that pop, and then pushg
's args; and wheng
returns to the original caller off
, there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args.It's also possible to make it work in a "caller-pop" scheme, but requires dynamic information to be passed around about the size of argument and return-value areas. In the general case, dynamic approaches are less efficient (at the very least they require an extra arg and retval).
I suspect that what you're thinking of with "caller is made responsible" is either something like caller-pop, or else is a sort of "trampoline" scheme where the callee returns either "next func to call" or "final return" states and the callsite runs a worklist/driver loop. I'm curious to hear more what you have in mind though!
andrew-johnson-4 commented on issue #1065:
Yes, the
f
g
recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an
Either
return type ofEither<TailCall,RetVal>
. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This
Either
is the worst case scenario that would still potentially go into compiled code. If thef
andg
et. al. are finite and known then this can be optimized. Again, if it is onlyf
recursion that makes this so much simpler.For a test right now I am using this function in Javascript:
function f(x) { if x === 0 { return 1; } else { return f(x-1) + f(x-1) } }
Without tce this is O(2^n) vs O(n) space.
andrew-johnson-4 edited a comment on issue #1065:
Yes, the
f
g
recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an
Either
return type ofEither<TailCall,RetVal>
. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This
Either
is the worst case scenario that would still potentially go into compiled code. If thef
andg
et. al. are finite and known then this can be optimized. Again, if it is onlyf
recursion that makes this so much simpler.
andrew-johnson-4 edited a comment on issue #1065:
Yes, the
f
g
recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an
Either
return type ofEither<TailCall,RetVal>
. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This
Either
is the worst case scenario that would still potentially go into compiled code. If thef
andg
et. al. are finite and known then this can be optimized. Again, if it is onlyf
recursion that makes this so much simpler.I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.
andrew-johnson-4 edited a comment on issue #1065:
Yes, the
f
g
recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an
Either
return type ofEither<TailCall,RetVal>
. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This
Either
is the worst case scenario that would still potentially go into compiled code. If thef
andg
et. al. are finite and known then this can be optimized. Again, if it is onlyf
recursion that makes this so much simpler.I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.
"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?
andrew-johnson-4 edited a comment on issue #1065:
Yes, the
f
g
recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an
Either
return type ofEither<TailCall,RetVal>
. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This
Either
is the worst case scenario that would still potentially go into compiled code. If thef
andg
et. al. are finite and known then this can be optimized. Again, if it is onlyf
recursion that makes this so much simpler.I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.
"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?
From a previously linked doc "Agreeing on common conventions allows SpiderMonkey to freely choose the compilation strategy for different WebAssembly functions, and to change that choice over time." This is a design constraint that I specifically don't care about right now, but probably wasmtime does.
cfallin commented on issue #1065:
Either<TailCall,RetVal>
Indeed, that's the "trampoline" approach. It's definitely used in the wild; e.g. when targeting Wasm there's this comment about a Haskell compiler, and IIRC, Clojure provides this (via an explicit API exposed to the user?) as a replacement for tail calls on the JVM. It seems like by far the easiest option at the moment, and could be extended to the corecursive case (tail-call to another function) by returning a "function pointer" (index into function table) as part of the
TailCall
type...that is certainly advantageous to have a unified interface. What would that entail development-wise to support?
Probably about a month of effort, depending on who's doing the work and their familiarity with the codebase :-) Adding a new ABI is not too technically challenging on its own, but the cross-cutting interactions with everything else -- backtraces, trampolines elsewhere in the system, unwind info, etc. -- make for a likely somewhat gnarly project.
andrew-johnson-4 commented on issue #1065:
I'll try to familiarize myself more with cranelift before I volunteer. There is lots of low-hanging fruit on my end before tce is important. I would be afraid of completely beggaring your devs if I tried now :O
cfallin commented on issue #1065:
It's actually somewhat likely that @fitzgen will work on this sometime early next year, or at least, that was my most recent understanding :-) Having others interested in the issue as well is never a bad thing, of course!
bjorn3 commented on issue #1065:
By the way tail calls to
_Unwind_Resume
will be necessary for exceptions. This should be less hard than the general case though as it doesn't require any args being passed on the stack.
andrew-johnson-4 commented on issue #1065:
@cfallin Is it safe to assume that jumping to the entry block is one special case of a true tail-call? Does this violate SSA or other invariants potentially?
jameysharp commented on issue #1065:
I don't know that we've documented this, but no, Cranelift doesn't allow branching back to the entry block. I know one reason is that there's no block to hoist loop-invariant code into if the loop's entry doesn't have a dominator. I feel like this has come up as a problem in other contexts too but I don't remember them.
That said, you can always emit an entry block that immediately jumps to another block with the same signature, forwarding the arguments, and branch back to that second block instead of tail-calling.
Either way this only works for a function that tail-calls itself. Cranelift has no way to express a jump to a block in a different function, since block IDs are function-scoped. But in that special case, yes, you could implement tail-calls today.
fitzgen commented on issue #1065:
FWIW, I am working on an RFC for this at the moment.
bjorn3 commented on issue #1065:
The RFC in question is https://github.com/bytecodealliance/rfcs/pull/29.
fitzgen commented on issue #1065:
The big blocker for tail calls was overhauling Wasmtime's trampolines, which I've just posted a PR for: https://github.com/bytecodealliance/wasmtime/pull/6262
Will start on Cranelift support for tail calls when that lands.
fitzgen assigned issue #1065 (assigned to fitzgen):
To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.
Last updated: Dec 23 2024 at 12:05 UTC