alexcrichton opened issue #3927:
This issue comes out of a discussion that @lukewagner, @fitzgen, and I were having recently. We were thinking again about how Wasmtime implements calls into WebAssembly and about some of the overhead associated with that. Currently it's surprisingly expensive relative to wasm->host transitions: host->wasm is on the order of 20-30ns, whereas wasm->host is on the order of 3-5ns.
One of the major costs of entering WebAssembly is that we have to call `setjmp`. Not only is `setjmp` complicated since it's platform-specific, but as seen there it's also written in C. We can't call `setjmp` from Rust (since it "returns twice" and the Rust compiler doesn't inform LLVM of that, meaning optimizations could go awry), which means entering WebAssembly is even further de-optimized because all arguments must pass through the stack. This closure captures all arguments into WebAssembly and is forced to be on the stack as we pass a single pointer to C which is called back.
Another complication with this current strategy of entering WebAssembly is that in a future world with the wasm exceptions proposal, whatever is chosen to implement exceptions at the Cranelift level is highly unlikely to be exposed in the full fidelity required to native stable Rust, meaning that we couldn't actually write a "catch" block in Rust (and probably not C).
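The closure-through-a-single-pointer pattern described above can be sketched roughly as follows. This is only an illustration, not Wasmtime's actual code: `c_like_helper`, `shim`, `enter_wasm`, and `add_via_callback` are hypothetical names, and the "C" helper is written in Rust here purely so the example is self-contained (a real helper would call `setjmp` before invoking the callback).

```rust
use std::ffi::c_void;

/// Stand-in for the C helper. A real one would call `setjmp` first and
/// return non-zero if a trap `longjmp`'d back; this sketch never traps.
extern "C" fn c_like_helper(
    callback: extern "C" fn(*mut c_void),
    payload: *mut c_void,
) -> i32 {
    callback(payload);
    0
}

/// C-ABI shim that recovers the Rust closure from the opaque pointer.
extern "C" fn shim<F: FnMut()>(payload: *mut c_void) {
    // SAFETY: `payload` was created from `&mut F` in `enter_wasm` and the
    // closure is still alive on the caller's stack.
    let closure = unsafe { &mut *(payload as *mut F) };
    closure();
}

/// Hypothetical entry path: every argument is captured by `closure`, which
/// lives on the stack and is reachable only through a single pointer.
fn enter_wasm<F: FnMut()>(mut closure: F) -> bool {
    let payload = &mut closure as *mut F as *mut c_void;
    c_like_helper(shim::<F>, payload) == 0
}

/// Example "call": the arguments and the result slot all travel via the closure.
fn add_via_callback(a: i32, b: i32) -> Option<i32> {
    let mut result = None;
    let ok = enter_wasm(|| result = Some(a + b));
    if ok { result } else { None }
}

fn main() {
    println!("{:?}", add_via_callback(2, 3));
}
```

Because the helper only ever sees a function pointer and a `*mut c_void`, nothing about the arguments can be passed in registers across that boundary, which is the cost the paragraph above is pointing at.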
To solve all these issues, @lukewagner mentioned we could do something like SpiderMonkey which is to have specialized entry trampolines into WebAssembly code. Currently our trampolines are primarily just converting from a dynamic stack-based layout to a particular System-V ABI signature, which isn't really all that interesting. Instead, though, we could specifically have a trampoline that receives the appropriate arguments, sets up a "catch" frame, and then enters the desired WebAssembly code. This could have a number of benefits:
- Nothing is forced to be on the stack since we never cross between Rust & C. Instead Rust, when using `TypedFunc`, would make a System-V call to this entry trampoline and the entry trampoline would indicate via its return value whether a trap was caught. Note, though, that we may still want to store params/results on the stack for other reasons, such as communicating results since this trampoline would always have at least one result (whether the function trapped or not).
- We get to entirely define `setjmp` and/or the trap exception protocol within Cranelift. This entry trampoline would implement `setjmp`, or at least the pieces necessary, or whatever exception handling implementation we get around to using in Cranelift.
- This could further reduce the need for external C helpers which moves Wasmtime a bit closer to having 0 C dependencies.
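As a rough sketch of the calling convention the first bullet describes, a typed entry trampoline could take its arguments directly, report "did it trap?" as its return value, and write the real result through an out-pointer. Everything here is an assumption for illustration: `entry_trampoline_i32_i32_i32`, `wasm_div`, and `call_typed` are hypothetical names, and the "wasm" body is ordinary Rust standing in for compiled code.

```rust
/// Stand-in for compiled wasm code: "traps" on division by zero or overflow.
fn wasm_div(a: i32, b: i32) -> Result<i32, ()> {
    a.checked_div(b).ok_or(())
}

/// Hypothetical entry trampoline for the signature (i32, i32) -> i32.
/// Returns `true` if a trap was caught; otherwise writes the result to `ret`.
extern "C" fn entry_trampoline_i32_i32_i32(a: i32, b: i32, ret: *mut i32) -> bool {
    match wasm_div(a, b) {
        Ok(value) => {
            // SAFETY: the caller passes a valid, writable pointer.
            unsafe { ret.write(value) };
            false
        }
        Err(()) => true,
    }
}

/// Host-side wrapper, loosely analogous to what a typed call might do.
fn call_typed(a: i32, b: i32) -> Result<i32, &'static str> {
    let mut ret = 0i32;
    let trapped = entry_trampoline_i32_i32_i32(a, b, &mut ret);
    if trapped { Err("wasm trap") } else { Ok(ret) }
}

fn main() {
    println!("{:?}", call_typed(6, 3));
    println!("{:?}", call_typed(1, 0));
}
```

The out-pointer reflects the note in the first bullet: since the trampoline always has at least one result (the trap flag), the wasm-level results may still need to travel through memory.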
Another possible idea is that currently trampolines are one-per-function-signature which means that they always contain an indirect call to a target. Instead we could also explore a scheme where we have one-per-export which would enable the trampoline to statically call into the correct export (no indirect function call necessary) which is another route to possibly optimize this.
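The difference between the two schemes can be illustrated with a toy model, assuming plain Rust functions stand in for compiled exports. All names below are hypothetical, and in a toy like this the optimizer may well inline both forms; the point is only the shape of the call.

```rust
/// All functions with this signature share one per-signature trampoline.
type WasmFn = fn(i32) -> i32;

fn export_double(x: i32) -> i32 { x * 2 }
fn export_negate(x: i32) -> i32 { -x }

/// One-per-signature: the callee is a runtime value, so the call is indirect.
fn trampoline_per_signature(callee: WasmFn, arg: i32) -> i32 {
    callee(arg)
}

/// One-per-export: each trampoline names its target statically (direct call).
fn trampoline_for_double(arg: i32) -> i32 { export_double(arg) }
fn trampoline_for_negate(arg: i32) -> i32 { export_negate(arg) }

fn main() {
    println!("{}", trampoline_per_signature(export_double, 21));
    println!("{}", trampoline_for_double(21));
    println!("{}", trampoline_for_negate(7));
}
```

The trade-off is code size: one-per-export needs a trampoline for every exported function rather than one for every distinct signature.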
The implementation of setjmp/longjmp in Cranelift is likely to be pretty nontrivial for this which is why I wanted to open an issue on this and let it get some feedback before implementing. I also don't think that this is super pressing at this time to the point that we should implement it, but it's good to have in our back pocket if we run into issues with the overhead of host->wasm transitions. I'm not actually sure how we'd implement setjmp/longjmp in Cranelift (e.g. expose it and represent it in clif) myself. Implementation-wise we'd probably want to at least take inspiration if not scrutinize the SpiderMonkey implementation since we don't need a general setjmp/longjmp mechanism, only one that works for wasm traps.
lukewagner commented on issue #3927:
Agreed that you can get away with something much simpler than a general-purpose setjmp/longjmp implementation.
Just putting out some info and links for how SpiderMonkey does this in case anyone is interested later:
- When it's time to throw, the "normal" compiled wasm code jumps to the throw stub generated by `GenerateThrowStub`.
- The throw stub calls the C++ function `WasmHandleThrow`, which uses the stack iterator to determine the fp/sp to unwind to (by walking the stack frame structure maintained by the entry trampoline and normal calls).
- The throw stub executes a careful, but ultimately not too complicated, sequence of instructions to update to the sp found by the previous step (which points to the return address pushed by the original call from the entry trampoline into normal wasm code) and then executes a `ret`.
- Instead of using the ABI's normal return-value register to communicate the error (which is already used for the non-exceptional return value), the convention SpiderMonkey uses is storing `0xbad` in the fp register (right before returning from the throw stub to the entry trampoline), which the entry trampoline immediately branches on.
Ultimately, because the sp of the entry trampoline is restored to the same offset expected upon normal (non-exceptional) return, the entry trampoline doesn't need to do much specially: it simply places the data it needs for returning (exceptionally and normally) in the stack frame and branches right after the call into normal wasm code to detect the exceptional return.
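The sentinel-in-fp convention can be modeled very loosely in plain Rust. This is a toy, not SpiderMonkey's code: real registers are replaced by a struct, the stack walk is omitted entirely, and `Regs`, `run_wasm`, and `entry_trampoline` are hypothetical names. Only the `0xbad` value comes from the description above.

```rust
/// Sentinel stored in fp on the exceptional path, per the description above.
const TRAP_SENTINEL_FP: usize = 0xbad;

/// Toy register file: just the return-value register and the frame pointer.
struct Regs {
    ret: u64,
    fp: usize,
}

/// Stand-in for "normal" wasm code plus the throw stub. On a trap the
/// sentinel is left in fp and the return-value register is meaningless.
fn run_wasm(input: u64, caller_fp: usize) -> Regs {
    if input == 0 {
        Regs { ret: 0, fp: TRAP_SENTINEL_FP }
    } else {
        Regs { ret: 100 / input, fp: caller_fp }
    }
}

/// Toy entry trampoline: branch on fp right after the call returns.
fn entry_trampoline(input: u64) -> Option<u64> {
    let caller_fp = 0x7fff_0000; // arbitrary placeholder frame pointer
    let regs = run_wasm(input, caller_fp);
    if regs.fp == TRAP_SENTINEL_FP {
        None // exceptional return
    } else {
        Some(regs.ret) // normal return
    }
}

fn main() {
    println!("{:?}", entry_trampoline(4));
    println!("{:?}", entry_trampoline(0));
}
```

Keeping the error signal out of the return-value register means the normal path needs no extra decoding; the trampoline pays for a single compare-and-branch on fp.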
cfallin commented on issue #3927:
At a high level, I like this direction. Two major points:
When we talk about stack frame formats and such to allow for custom unwinding, #2459 comes to mind; I still have a latent desire to see our unwinding come more under our control, like in SpiderMonkey, and this is another argument in favor of that approach. I know that while discussing exception-handling approaches in #3427 we at least initially concluded that an approach built around libunwind and DWARF unwind-info would be best to address all use-cases, but I still feel that having more integration with our runtime is a cleaner, less brittle, and easier-to-optimize approach for Wasmtime in particular. So I guess this is just a +1 from me in that direction.
I wonder how general we can make our CLIF-level instructions: I suspect that providing actual `setjmp` and `longjmp` intrinsics may be a good way to address this need. In particular we need a bit more of the register save/restore actions than just SP/FP: I think the SpiderMonkey stubs (thanks Luke for the links!) get away with not saving/restoring registers other than SP/FP because SM's JIT ABI has no callee-save registers, whereas in Wasmtime we stick basically to SysV. Given that, I imagine we could add intrinsics that essentially have the semantics of `setjmp` and `longjmp` and use them.
I have some thoughts about how the intrinsics could work -- actually I think we could piggyback on the regalloc's clobber-saves / spilling by marking a `longjmp` pseudoinstruction as "clobbers everything". This doesn't save much when a function body is just longjmp/call (clobber saves on one side, jmpbuf on the other, same either way) but inline exception-handler paths could potentially be more efficient because the code will then only reload the registers it needs from ordinary spillslots.
bjorn3 commented on issue #3927:
Could we model setjmp as a terminator with two successors? The first successor is directly jumped to while the second successor is jumped to in case of a longjmp. Also I think setjmp needs to be marked as clobbers everything too. At least for the second successor.
cfallin commented on issue #3927:
Err, sorry, yeah, I meant `setjmp` above where I wrote `longjmp`. The `longjmp` is just an unconditional branch (terminator with zero successors) as far as the CFG is concerned I think.
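The CFG shape discussed in the last two comments (a `setjmp` terminator with two successors, a `longjmp` terminator with none) can be written down as a small data-structure sketch. This is not Cranelift's IR; `Block`, `Terminator`, and the variant names are hypothetical.

```rust
/// Hypothetical basic-block identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Block(u32);

/// Toy set of block terminators.
enum Terminator {
    /// Ordinary unconditional jump: one successor.
    Jump(Block),
    /// `setjmp` modeled as a terminator: control continues at `fallthrough`
    /// directly, or resumes at `on_longjmp` if a later `longjmp` fires.
    SetJmp { fallthrough: Block, on_longjmp: Block },
    /// `longjmp` leaves the function's CFG entirely: zero successors.
    LongJmp,
    /// Function return: zero successors.
    Return,
}

impl Terminator {
    /// CFG successors of a block ending in this terminator.
    fn successors(&self) -> Vec<Block> {
        match self {
            Terminator::Jump(target) => vec![*target],
            Terminator::SetJmp { fallthrough, on_longjmp } => {
                vec![*fallthrough, *on_longjmp]
            }
            Terminator::LongJmp | Terminator::Return => vec![],
        }
    }
}

fn main() {
    let t = Terminator::SetJmp { fallthrough: Block(1), on_longjmp: Block(2) };
    println!("{:?}", t.successors());
    println!("{:?}", Terminator::Jump(Block(3)).successors());
    println!("{:?}", Terminator::LongJmp.successors());
    println!("{:?}", Terminator::Return.successors());
}
```

In this model the "clobbers everything" property would attach to the `on_longjmp` edge at least, since no register contents can be assumed when control arrives there.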
alexcrichton labeled issue #3927:
Last updated: Jan 24 2025 at 00:11 UTC