alexcrichton opened PR #11592 from alexcrichton:trampoline-try-call to bytecodealliance:main:
Since Wasmtime's inception it's used the
setjmpandlongjmp
functions in C to implement handling of traps. While this solution was
easy to implement, relatively portable, and performant enough, there are
a number of downsides that have evolved over time to make this an
unattractive approach in the long run:
Using
setjmpfundamentally requires using C because Rust does not
understand a function that returns twice. It's fundamentally unsound
to invokesetjmpin Rust meaning that Wasmtime has forever needed a
C compiler configured and set up to build. This notably means that
cargo checkcannot check other targets easily.Using
longjmpmeans that Rust function frames are unwound on the
stack without running destructors. This is a dangerous operation of
which we get no protection from the compiler about. Both frames
entering wasm and frames exiting wasm are all skipped. Absolutely
minimizing this has been beneficial for portability to platforms such
as Pulley.Currently the no_std implementation of Wasmtime requires embedders to
providewasmtime_{setjmp,longjmp}which is a thorn in the side of
what is otherwise a mostly entirely independent implementation of
Wasmtime.There is a performance floor to using
setjmpandlongjmp. Calling
setjmprequires using C but Wasmtime is otherwise written in Rust
meaning that there's a Rust->C->Rust->Wasm boundary which
fundamentally can't be inlined without cross-language LTO which is
difficult to configure.With the implementation of the WebAssembly exceptions proposal
Wasmtime now has two means of unwinding the stack. Ideally Wasmtime
would only have one, and the more general one is the method of
exceptions.Jumping out of a signal handler on Unix is tricky business. While
we've made it work it's generally most robust of the signal handler
simply returns which it now does.With all of that in mind the purpose of this commit is to replace the
setjmp/longjmp mechanism of handling traps with the recently implemented
support for exceptions in Cranelift. That is intended to resolve all of
the above points in one swoop.One point in particular though that's nice about setjmp/longjmp is that
unwinding the stack on a trap is an O(1) operation. For situations such
as stack overflow that's a particularly nice property to have as we can
guarantee embedders that traps are a constant time (albeit somewhat
expensive with signals) operation. Exceptions naively require unwinding
the entire stack, and although frame pointers mean we're just traversing
a linked list I wanted to preserve the O(1) property here nonetheless.
To achieve this a solution is implemented where the array-to-wasm
(host-to-wasm) trampolines setup state inVMStoreContextso looking up
the current trap handler frame is an O(1) operation. Namely the sp/fp/pc
values for aHandlerare stored inline.Implementing this feature required supporting
relocations-to-offsets-in-functions which was not previously supported
by Wasmtime. This required Cranelift refactorings such as https://github.com/bytecodealliance/wasmtime/pull/11570, https://github.com/bytecodealliance/wasmtime/pull/11585,
and https://github.com/bytecodealliance/wasmtime/pull/11576. This then additionally required some more refactoring in
this commit which was difficult to split out as it otherwise wouldn't be
tested.Apart from the relocation-related business much of this change is about
updating the platform signal handlers to use exceptions instead of
longjmp to return. For example on Unix this means updating the
ucontext_twith register values that the handler specifies. Windows
involves updating similar contexts, and macOS mach ports ended up not
needing too many changes.In terms of overall performance the relevant benchmark from this
repository, compared to before this commit, is:sync/no-hook/core - host-to-wasm - typed - nop time: [10.552 ns 10.561 ns 10.571 ns] change: [−7.5238% −7.4011% −7.2786%] (p = 0.00 < 0.05) Performance has improved.Closes https://github.com/bytecodealliance/wasmtime/issues/3927
cc https://github.com/bytecodealliance/wasmtime/issues/10923
alexcrichton requested abrown for a review on PR #11592.
alexcrichton requested wasmtime-compiler-reviewers for a review on PR #11592.
alexcrichton requested dicej for a review on PR #11592.
alexcrichton requested wasmtime-core-reviewers for a review on PR #11592.
alexcrichton requested wasmtime-default-reviewers for a review on PR #11592.
alexcrichton commented on PR #11592:
Procedurally this is stacked on https://github.com/bytecodealliance/wasmtime/pull/11585, https://github.com/bytecodealliance/wasmtime/pull/11577, and https://github.com/bytecodealliance/wasmtime/pull/11576 at this time. I expect this to have a bit of a gauntlet on CI, however, so I wanted to get working on that sooner rather than later.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
alexcrichton edited a comment on PR #11592:
Procedurally this is stacked on
https://github.com/bytecodealliance/wasmtime/pull/11585, https://github.com/bytecodealliance/wasmtime/pull/11577, andhttps://github.com/bytecodealliance/wasmtime/pull/11576at this time. I expect this to have a bit of a gauntlet on CI, however, so I wanted to get working on that sooner rather than later.
alexcrichton updated PR #11592.
alexcrichton edited a comment on PR #11592:
Procedurally this is stacked on
https://github.com/bytecodealliance/wasmtime/pull/11585,https://github.com/bytecodealliance/wasmtime/pull/11577, andhttps://github.com/bytecodealliance/wasmtime/pull/11576at this time. I expect this to have a bit of a gauntlet on CI, however, so I wanted to get working on that sooner rather than later.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
For review this is particularly noteworthy, I had to enable this because when we compile for Winch now the entry trampolines (system ABI) are calling a Winch-defined function (winch ABI). The
try_callbeing done requires that the callee ABI (here: Winch) needs to support exceptions.I don't fully understand the implications of this flag, but I updated the aarch64/x64 backends to explicitly say that exceptions + winch clobbers all registers (like tail + winch). I'm not sure if there's more that needs be done, however.
alexcrichton created PR review comment:
This is an example of trampolines getting significantly larger than before. It might worthwhile to invest some effort in making a new set of trampolines where we have something like all static entrypoints have their normal trampoline signature they have today, but internally they dispatch to a signature-specific trampoline which takes the function-to-call as a function pointer. That way we'd effectively deduplicate trampolines at least per-signature as we used to do.
alexcrichton created PR review comment:
For review this is me from 3 years ago causing pain for myself. Years ago we switched from the
winapicrate to thewindows-syscrate. During that transition I found that thewindows-syscrate, at the time, did not haveEXCEPTION_CONTINUE_{SEARCH,EXECUTION}defined. It did, however, haveExceptionContinue{Search,Execution}. Being the naive little flower I am I assumed that this was some sort of mistake in the bindings and the values were all the same. Turns out this has always been wrong but it hasn't mattered since we practically never usedExceptionContinueExecution(only used for embedding-handled signals, which folks do sometimes on Linux but basically never on Windows).Turns out that
ExceptionContinue{Search,Execution}have different values thanEXCEPTION_CONTINUE_{SEARCH,EXECUTION}and they're both interpreted as "continue search". It's basically a bit of a miracle this never broke before now. In this PR though we're actually usingEXCEPTION_CONTINUE_EXECUTIONwhich required this change.
alexcrichton created PR review comment:
This is another consequence of this PR, the single-pass register allocator effectively no longer works with Wasmtime because it has issues with exceptions and Wasmtime unconditionally uses
try_callfor trampolines. This manifested in tests as crashes so I've disabled the usage in various tests.
alexcrichton created PR review comment:
This test is to double-check that we don't accidentally ever run exception handlers on traps now that we're almost implementing traps with exceptions.
alexcrichton updated PR #11592.
alexcrichton updated PR #11592.
github-actions[bot] commented on PR #11592:
Label Messager: wasmtime:config
It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:
[ ] If you added a new
Configmethod, you wrote extensive documentation for
it.<details>
Our documentation should be of the following form:
```text
Short, simple summary sentence.More details. These details can be multiple paragraphs. There should be
information about not just the method, but its parameters and results as
well.Is this method fallible? If so, when can it return an error?
Can this method panic? If so, when does it panic?
Example
Optional example here.
```</details>
[ ] If you added a new
Configmethod, or modified an existing one, you
ensured that this configuration is exercised by the fuzz targets.<details>
For example, if you expose a new strategy for allocating the next instance
slot inside the pooling allocator, you should ensure that at least one of our
fuzz targets exercises that new strategy.Often, all that is required of you is to ensure that there is a knob for this
configuration option in [wasmtime_fuzzing::Config][fuzzing-config] (or one
of its nestedstructs).Rarely, this may require authoring a new fuzz target to specifically test this
configuration. See [our docs on fuzzing][fuzzing-docs] for more details.</details>
[ ] If you are enabling a configuration option by default, make sure that it
has been fuzzed for at least two weeks before turning it on by default.[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html
<details>
To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.
To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.</details>
alexcrichton updated PR #11592.
alexcrichton has marked PR #11592 as ready for review.
alexcrichton requested fitzgen for a review on PR #11592.
alexcrichton requested cfallin for a review on PR #11592.
cfallin submitted PR review:
This is fantastic -- thanks for pushing toward the "unwinds only within Cranelift frames" goal!
Overall this all looks fine, with a few exceptions (no pun intended) around the handling of trap-handler addresses/relocations and the relocation hacks here. It's workable (and I left some comments on the MachBuffer changes below if we end up going this way) but I suspect there might be a simpler approach...
First: I think the handler address is always within the current function (there's an assert saying so), right? Then it seems like we shouldn't need to wait until we're using the text-section builder and working across functions to resolve the reference, channeling through the relocations mechanism; rather, it seems like we should be able to take the address of an exception handler target directly, and lower to an instruction/pair of instructions with the right
LabelUses, never creatingRelocs.What I'm thinking is that an
exception_handler_addressCLIF operator, taking a block entity, could directly give us what we need. It's a property of thetry_calllowering that we directly pass through the label and use that label to get an offset, so there is no "magic glue" between thetry_calland the block -- in the lowering we can just get the block address directly by using the appropriateLabelUse(s).That would remove the slightly brittle "take the one and only exception handler in this trampoline" logic, the special relocation and special
FuncKey, the magic late-resolved labels and offset/addend stuff, and make things generally much cleaner I think.What do you think?
cfallin created PR review comment:
The code in
buffer.rsis a little idiosyncratic in that it has an embedded informal correctness proof, but at the risk of being a slightly annoying reviewer, I think it's worth extending the correctness arguments here to really make sure we get it right...I was starting to write out a sketch of an argument that we can say "invariants don't matter when called from
finish()but that can invoke forced veneer creation that then calls into much of the rest of the mechanisms here so I'm not super-confident actually. I think it's fine becauseoptimize_branchesis the step that makes the invariants load-bearing, and the text-section builder doesn't invoke that afterfinishlike the VCode pipeline does here, but that's a fairly distributed and brittle argument. I think the fact that in practice the labels you bind late are not at the very end of the buffer may also be load-bearing here.I think what I'd rather see here is that
bind_labelbecomesbind_label_at_offsetwith its whole body intact (and add a surfacebind_labelthat delegates), and we do the labels-at-tail updates and optimize-branches step only ifoffset == self.cur_offset(). What do you think?
cfallin created PR review comment:
This is only used in the verifier, so I think this is fine practically. Semantically, we more or less have defined exception throws in the Winch ABI by virtue of adding the clobber and payload definitions -- so we now "support" it, even though the Winch-the-compiler doesn't yet have lowerings for any of the opcodes.
alexcrichton commented on PR #11592:
What I'm thinking is that an
exception_handler_addressCLIF operatorI definitely agree this would be nicer, but can you sketch out more fully what you're thinking? For example this would need to come before a
try_calland I'm not aware of anything else in CLIF where you'd want to refer to a label as a value (e.g. likelabel_addr block0or something). I've been under the impression that adding something a bit more first-class in Cranelift would be a pretty big lift and not necessarily worth it for this case where a "special reloc" worked well enough (you're right though that there's no need to defer until the end of theMachBufferbuilding, I think it could be done sooner)
alexcrichton submitted PR review.
alexcrichton created PR review comment:
To make sure I understand, you're saying revert this PR's changes, add an
offsetparameter tobind_labeland only apply the optimizations if it'scur_offset, and then add a new methodbind_label_at_current_offsetas a helper for setting theoffsetparameter toself.cur_offset(or similarly bikeshedded)?That all sounds reasonable to me yeah, I'd also be fine asserting that the labels are all empty since I think during text section creation we never hit those paths. If you're confident that running the optimizations is safe to only do conditionally though I'm happy to change to that too
cfallin commented on PR #11592:
Right, it would more or less be that -- an operator that takes a
Blockas an argument; and we define its semantics to be: return the address of the handler that would be placed in exception-table metadata if this block were an exception handler. I suppose it also needs an ABI for that exception transfer, and we check the signature as we do fortry_call. So something likev1 = exception_handler_address winch block1could appear anywhere in the function, and would lower to the ADRP/ADR, or LEA, or whatever is appropriate on the target.
I'm happy to whip this up tomorrow or late tonight if you like (about to disappear for the day right now)...
alexcrichton updated PR #11592.
alexcrichton commented on PR #11592:
Is it reasonable to expect Cranelift to always provide the guarantee that an exceptional entrypoint can be purely based on a block name alone? I would have expected that what we really want here is an identification of an edge, or a
try_callplusblockpair, as opposed to just theblock. Otherwise if we're comfortable going just with blocks, how comeexception_handler_addresswould be needed vs a barev1 = block_addr block2?I suppose put another way, if you wouldn't mind sketching this out I'd appreciate that. I'm having a bit of a block imagining what this would look like in CLIF but I fear I'm shackled by my current understanding of CLIF. This isn't that urgent though so don't feel the need to drop everything to sketch this out.
cfallin commented on PR #11592:
Is it reasonable to expect Cranelift to always provide the guarantee that an exceptional entrypoint can be purely based on a block name alone? I would have expected that what we really want here is an identification of an edge, or a
try_callplusblockpair, as opposed to just theblock. Otherwise if we're comfortable going just with blocks, how comeexception_handler_addresswould be needed vs a barev1 = block_addr block2?I suppose put another way, if you wouldn't mind sketching this out I'd appreciate that. I'm having a bit of a block imagining what this would look like in CLIF but I fear I'm shackled by my current understanding of CLIF. This isn't that urgent though so don't feel the need to drop everything to sketch this out.
So I dove down the surprisingly deep rabbithole here, at least for x64 (other backends left as an exercise for the reader...).
You're right that one actually needs to identify the edge, not the handler block; I ended up spec'ing the instruction to take a raw block entity reference to the callsite block (the one that ends with
try_call) and an immediate indicating which out-edge. Then given that, the idea is it gives exactly the PC that the exception table gives for that handler edge.Now that I wrote it out I'm not sure if I like it any better than the relocation-with-fake-external-
FuncKeyhack that you have here, though. It is more direct at least. I'll flesh it out for the other four backends and put up a PR if you'd like. Or if not, the one thing I would say about this PR is that, for avoidance of potential for subtle correctness bugs or unexpected coupling later, I'd prefer that we resolve the reloc eagerly after function compilation rather than add the deferred-label-resolution mechanism to theMachBuffer-- that's the part that sketches me out the most honestly. Thoughts?
alexcrichton commented on PR #11592:
I'd prefer that we resolve the reloc eagerly after function compilation
For this I'd defintely be ok switching to this behavior, but I'm not sure how to implement this. From Cranelift's perspective it just sees a
symbol_addrwith some opaque symbol and it's later embedder-specific knowlege "oh yeah that symbol is in the function here". Given that the function is completely finished by the time it's learned what the label is bound to, so no matter what theMachBufferwould need to support "bind this label at this offset which isn't necessarily the current offset".Were you thinking something else though? Or did you have an idea of how this would get plumbed to the
MachBufferduring construction?
bjorn3 commented on PR #11592:
v1 = exception_handler_address winch block1
You could emulate this in the frontend as
v1 = symbol_value gv1wheregv1gets a relocation applied immediately after compilation based on the exception table Cranelift emits, right?
alexcrichton commented on PR #11592:
I believe that's more-or-less what this PR does already though. Supporting that requires a bind-this-label-at-this-offset mechanism in
MachBufferwhich it seems @cfallin would prefer to avoid
cfallin commented on PR #11592:
Hmm, yeah, if we don't adopt the
get_exception_handler_addressapproach from my branch above, I was imagining a more direct approach where we process the relocs directly as they come out of Cranelift -- the key is not using the deferred-label mechanism, because I am not actually sure that it is correct in all cases (it violates the invariants at least on paper, but right at the end of processing, but there are edge cases that could still possibly happen wrt island insertion).Basically rather than converting the reloc to a label we would apply the reloc directly into the slice of function body. Given that we know the offset of the start of the function and the offset of the exception handler at that point we should have everything we need I think.
That said, I think it's probably best I finish other backends on my branch and put up a PR for the first-class operator instead -- it's a more straightforward design (even if a slightly heavy lift to add the new instruction format). Will do that shortly!
alexcrichton updated PR #11592.
alexcrichton commented on PR #11592:
Updated now to incorporate the work from https://github.com/bytecodealliance/wasmtime/pull/11629
cfallin submitted PR review:
Looks great -- thanks for the patience!
alexcrichton merged PR #11592.
Last updated: Dec 06 2025 at 07:03 UTC