frank-emrich opened PR #10388 from frank-emrich:stack-switching-infra to bytecodealliance:main:
This PR is part of a series that adds support for the Wasm stack switching proposal. The explainer document for the proposal is here. There's a tracking issue describing the overall progress and limitations here: #10248
The draft PR #10177 contains the whole (initial) implementation of the proposal. This PR contains all of the necessary infrastructure and runtime support. In other words, it contains the entire implementation, except for codegen. Tests are added at the end, after codegen has also been added in a follow-up PR.
This was developed together with @dhil.
General implementation notes
In Wasm, continuations are represented by values of type `(ref $ct)`, where `$ct` is a new composite type/heap type for continuations. In the implementation, these are represented by values of type `VMContObj`. These are fat pointers, consisting of a sequence counter and a pointer to a `VMContRef`; the latter type is used for the actual representation of continuations. The sequence counter part of a `VMContObj` is used to check that every continuation value can only be used once.

The `VMStoreContext` is extended to contain a "stack chain": it indicates what stack we are currently executing on. Logically, this is a linked list of stacks, since each continuation has a parent field. The chain stored in the `VMStoreContext` always ends with a value representing the initial stack. This is the stack we were on when entering Wasm, which will usually be the main stack.

Memory Management
Currently, memory management is very basic: the `StoreOpaque` provides a new method for allocating a new `VMContRef`, and keeps track of all continuations created this way. Continuations are never deallocated at the moment, meaning that they live until the store itself is deallocated.

The stack memory used by each allocation (represented by type `ContinuationStack`) is always `mmap`-ed, with a guard page at the end. There is currently no support for growing stacks or segmented stacks.

Backtrace generation
The existing backtrace generation mechanism is extended to be fully aware of continuations: when generating backtraces, the entire chain of continuations is traversed, not just the stack frames of the currently active stack/continuation.
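The chain traversal described above can be sketched as follows; the types are illustrative stand-ins, not Wasmtime's real frame or continuation structures:

```rust
// Illustrative traversal of a continuation chain when building a backtrace.

struct Continuation {
    // Stand-in for the stack frames captured on this particular stack.
    frames: Vec<&'static str>,
    // Parent continuation; the chain ends at the initial (usually main) stack.
    parent: Option<Box<Continuation>>,
}

// Collect frames from the active continuation all the way down the chain,
// rather than stopping at the currently active stack.
fn full_backtrace(active: &Continuation) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut cur = Some(active);
    while let Some(c) = cur {
        out.extend(c.frames.iter().copied());
        cur = c.parent.as_deref();
    }
    out
}
```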
Entering/Exiting Wasm
Prior to this PR, there were two separate mechanisms that save and restore some state of the runtime when entering/exiting Wasm:
- The functions `enter_wasm` and `exit_wasm` in `func.rs`
- `CallThreadState`, which saves and restores (on `drop`) parts of the `VMStoreContext`

This PR consolidates these two mechanisms, because it requires some additional state to be updated and restored on enter/exit: the type `wasmtime::runtime::func::EntryStoreContext` now stores all of the required runtime state and ensures that it's restored when exiting Wasm.

Tables
- The runtime part for resizing and filling tables containing continuations is added.
- Note that there is a potentially controversial change to how the pooling allocator works for tables: since a `VMContObj` is 2 pointers wide, the maximum size of an entry in a table doubles. As a result, this PR doubles the amount of virtual memory space occupied by the table pool (but the amount of pages actually used stays the same).
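The one-shot check via sequence counters described above can be sketched like this; the types, fields, and logic are illustrative only, not Wasmtime's actual `VMContObj`/`VMContRef` definitions:

```rust
// Illustrative sketch of a continuation fat pointer guarded by a sequence
// counter, so that each continuation value can be consumed at most once.

struct VMContRef {
    // Bumped whenever the continuation is used, so that stale fat pointers
    // referring to an earlier state can be detected.
    seqno: u64,
    // The "parent" link forming the stack chain (unused in this sketch).
    #[allow(dead_code)]
    parent: Option<Box<VMContRef>>,
}

#[derive(Clone, Copy)]
struct VMContObj {
    // Snapshot of the counter at the time this continuation value was made.
    seqno: u64,
    contref: *mut VMContRef,
}

impl VMContObj {
    /// Consume the continuation value; errors if it was already used once.
    fn take(self) -> Result<*mut VMContRef, &'static str> {
        let r = unsafe { &mut *self.contref };
        if r.seqno != self.seqno {
            return Err("continuation used more than once");
        }
        // Revoke every other outstanding fat pointer to this `VMContRef`.
        r.seqno += 1;
        Ok(self.contref)
    }
}
```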
frank-emrich requested fitzgen for a review on PR #10388.
frank-emrich requested wasmtime-fuzz-reviewers for a review on PR #10388.
frank-emrich requested wasmtime-default-reviewers for a review on PR #10388.
frank-emrich requested wasmtime-core-reviewers for a review on PR #10388.
github-actions[bot] commented on PR #10388:
Subscribe to Label Action
cc @fitzgen
<details>
This issue or pull request has been labeled: "fuzzing", "wasmtime:api", "wasmtime:c-api", "wasmtime:config", "wasmtime:ref-types". Thus the following users have been cc'd because of the following labels:
- fitzgen: fuzzing, wasmtime:ref-types
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
github-actions[bot] commented on PR #10388:
Label Messager: wasmtime:config
It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:
[ ] If you added a new `Config` method, you wrote extensive documentation for it.
<details>
Our documentation should be of the following form:
```text
Short, simple summary sentence.

More details. These details can be multiple paragraphs. There should be
information about not just the method, but its parameters and results as
well.

Is this method fallible? If so, when can it return an error?
Can this method panic? If so, when does it panic?

Example

Optional example here.
```
</details>
[ ] If you added a new `Config` method, or modified an existing one, you ensured that this configuration is exercised by the fuzz targets.
<details>
For example, if you expose a new strategy for allocating the next instance slot inside the pooling allocator, you should ensure that at least one of our fuzz targets exercises that new strategy.

Often, all that is required of you is to ensure that there is a knob for this configuration option in [`wasmtime_fuzzing::Config`][fuzzing-config] (or one of its nested `struct`s).

Rarely, this may require authoring a new fuzz target to specifically test this configuration. See [our docs on fuzzing][fuzzing-docs] for more details.
</details>
[ ] If you are enabling a configuration option by default, make sure that it has been fuzzed for at least two weeks before turning it on by default.

[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html
<details>
To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.
To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.</details>
fitzgen commented on PR #10388:
Haven't dug into the details of the PR yet, but I have some initial questions after reading the OP.
> The stack memory used by each allocation (represented by type `ContinuationStack`) is always `mmap`-ed, with a guard page at the end. There is currently no support for growing stacks, or segmented stacks.

Is there a particular reason this does not use the `InstanceAllocatorImpl::[de]allocate_fiber_stack` methods via [`store.engine().allocator().[de]allocate_fiber_stack()`](https://github.com/bytecodealliance/wasmtime/blob/9da52ede33a9f8996832b0210579ef15296addbd/crates/wasmtime/src/engine.rs#L659)? That would make it so that this feature automatically integrates with the `wasmtime::Config`, its various knobs for stack sizes and such, and whether the pooling or on-demand allocator is in use.

> Note that there is a potentially controversial change to how the pooling allocator works for tables: Since a `VMContObj` is 2 pointers wide, the maximum size of an entry in a table doubles. As a result, this PR doubles the amount of virtual memory space occupied by the table pool (but the amount of pages actually used stays the same).

Is this because `(ref cont)`/`VMContObj` tables are being shoe-horned into the same underlying representation as `anyref`/`externref`/`VMGcRef` tables? Because `VMContObj` has a different representation from `VMGcRef`, I think it would be best to add a new `wasmtime::runtime::vm::Table` variant for `VMContObj`, the same way that we have different variants for `VMGcRef` tables versus `*const VMFuncRef` tables.
fitzgen edited a comment on PR #10388.
fitzgen submitted PR review:
(review is still in progress, but I need to context switch for a little bit and might as well share what I have so far)
fitzgen created PR review comment:
Were these going to be removed in the next PR, or should they be removed now? Maybe throw an item into the meta task list if not now.
fitzgen created PR review comment:
Not something that needs to be handled in this PR, but there is no reason we can't enable GC and also maintain the existing memory management for continuations, where they have the same lifetime as the `Store`. We don't have to implement full GC of continuations before we allow enabling both GC and stack switching at the same time.
fitzgen created PR review comment:
FWIW, we already have `TRAP_INTERNAL_ASSERT` that we use for debug-only assertions inside JIT code.
fitzgen created PR review comment:
I don't see a corresponding function definition in `crates/c-api`; I think it is missing?
fitzgen created PR review comment:
Ditto to the debug printing stuff in this file.
fitzgen created PR review comment:
fyi, I tend to do `log::trace!(...)` for debug prints (both temporary ones that I will remove, and for ones that are useful enough to keep in tree).

When running the CLI, you can see them by setting the `WASMTIME_LOG=trace` env var, or only a specific crate/module via `WASMTIME_LOG=wasmtime::runtime::vm=trace`, or multiple crates/modules at different levels via `WASMTIME_LOG=debug,regalloc2=info,cranelift_codegen::machinst=trace`.

When running tests instead of the CLI, `WASMTIME_LOG` becomes `RUST_LOG`. Most tests initialize the env-logger, but if you aren't seeing the logs, then you can put `let _ = env_logger::try_init();` at the start of the test (or ideally in a shared test helper function that a bunch of related tests already call at the beginning of their execution).
fitzgen created PR review comment:
At least for now, I'm not sure it is worth having separate config knobs for stacks created via `cont.new` versus API-level fibers. This effectively creates two different kinds of stacks in the system, and I'd prefer to try and keep things simple for as long as we can by having a single kind of stack that we config and allocate and handle in the system.
fitzgen created PR review comment:
Similarly, I don't see a function definition for this one either.
fitzgen submitted PR review.
fitzgen submitted PR review:
(again, review still in progress, leaving intermediate notes)
fitzgen created PR review comment:
Reminder to fill this in.
fitzgen created PR review comment:
Two things:
Why does stack-switching need `std`?

We should be able to enable stack-switching for a compiler-without-runtime build to be able to compile stack-switching Wasm to be run elsewhere, and similarly should be able to enable stack-switching in a runtime-without-compiler build to run elsewhere-compiled stack-switching Wasm modules. As written, enabling the stack-switching feature will always enable the runtime, making the compiler-without-runtime version impossible. That is, unless compilation of stack-switching isn't gated on any cargo feature, but that is kind of funky given that this cargo feature exists and I expect users would think "I want to compile Wasm programs that use stack-switching, therefore I will enable this feature". This is a long way of basically saying, I think this should not unconditionally enable the runtime and, once compilation is added in the next PR, we ultimately end up with something like this:

```toml
stack-switching = [
  "wasmtime-cranelift?/stack-switching",
]
```

Neither of these things necessarily needs to be addressed now, before this PR merges, but it may end up being easier doing it now before more stuff gets entangled, making it harder to pull the `cfg`s apart.
fitzgen created PR review comment:
This seems unused. Does it get used in follow up PRs?
fitzgen created PR review comment:
Because (AFAICT) these don't contain any `usize`s or pointers or anything else that changes with the target's word size, you could move them into the `wasmtime_environ` crate as

```rust
#[repr(u32)]
pub enum VMControlEffect {
    Return = 0,
    // ...
}
```

and avoid the wordy and kind of annoying split between the definitions of the `enum`s and their discriminants.
fitzgen created PR review comment:
These sorts of config knobs that need to exist at the `wasmtime_environ` level usually go inside `wasmtime_environ::Tunables`. Unless there is a reason not to do that here, we should follow the existing convention.
fitzgen created PR review comment:
Nitpick: both the `Engine` and `Module` variants are interned, so perhaps we can rename this method to `unwrap_module_type_index`?
fitzgen created PR review comment:
This `mem::drop` dance is kind of funky -- can we instead pass ownership into `catch_traps` and let it handle the dropping, if necessary? Or are there callers that need to do different things?
fitzgen created PR review comment:
You'll also want to add some CI checks for these features being enabled/disabled in various combinations like this:
https://github.com/bytecodealliance/wasmtime/blob/main/.github/workflows/main.yml#L363-L365
fitzgen created PR review comment:
If they are written to directly from Wasm JIT code, then probably yes. This struct should also be `#[repr(C)]` in that case.

If it is not written to directly from Wasm, then it would only need to be unsafe cells if you are ever violating Rust's mutable/exclusive xor immutable/shared semantics.
I haven't read enough of these changes to say any more than that yet.
fitzgen created PR review comment:
Can you run the function call microbenchmarks on `main` and on this branch? I want to make sure that we aren't regressing performance here, as this path is very hot, and if we are we may need to figure out how to `cfg(...)` some of this stuff so that it only happens when the `stack-switching` feature is enabled.

This will run the microbenchmarks and will check the results for statistical significance:

```shell
$ git checkout main
$ cargo bench --bench call -- --save-baseline main
$ git checkout stack-switching-infra
$ cargo bench --bench call -- --baseline main
```

I can help figure things out if we need to do something here.
fitzgen created PR review comment:
Can you add an item to the meta task list for the embedder API, if you haven't already?
dhil submitted PR review.
dhil created PR review comment:
I think this is a stray declaration from some of our downstream experiments. It has been needed to capture the C shadow stack pointer that clang generates, which must be kept in sync with the stack switching done by the engine. To be clear, this is just due to the fact that clang doesn't know about stack switching.
dhil submitted PR review.
dhil created PR review comment:
Same reason as above.
dhil edited PR review comment.
fitzgen submitted PR review.
fitzgen created PR review comment:
FWIW, exposing these things in the C API would be good in general, so if you want to do it here or in a new PR, you're more than welcome to!
dhil submitted PR review.
dhil created PR review comment:
Yes sure! I am leaning towards adding this in a follow-up PR focused on the C API. What do you think @frank-emrich ?
dhil submitted PR review.
dhil created PR review comment:
As I discussed with @alexcrichton (https://github.com/bytecodealliance/wasmtime/pull/10177/files#r1965733715), I'd be OK with no bespoke stack switching configuration for now.
dhil submitted PR review.
dhil created PR review comment:
This is an interesting question, which Frank and I have discussed in the past. If my memory serves me right, then our layout is slightly different. However perhaps we can unify the layouts (or we can perhaps simply adopt the fibers' layout). I am not too sure about the implications, I think @frank-emrich has the key knowledge to best answer this question.
fitzgen submitted PR review.
fitzgen created PR review comment:
The automatic pooling allocator integration would also get you fast allocation of new stacks from a pool, which would look nice on your benchmarks and what have you ;)
dhil submitted PR review.
dhil created PR review comment:
My take is that they shouldn't make it into the `main` branch -- probably a stray include?
dhil submitted PR review.
dhil created PR review comment:
Yes, agreed! I totally see the appeal!
dhil submitted PR review.
dhil created PR review comment:
Yes, it gets used during the code translation. It should arguably be added along with it in the next patch.
frank-emrich commented on PR #10388:
(Sorry for the long delay, I've moved to London and started a new job, where I needed to settle in first)
@fitzgen
> Is there a particular reason this does not use the `InstanceAllocatorImpl::[de]allocate_fiber_stack` methods via `store.engine().allocator().[de]allocate_fiber_stack()`? That would make it so that this feature automatically integrates with the `wasmtime::Config`, its various knobs for stack sizes and such, and whether the pooling or on-demand allocator is in use.

The reason we do not use these verbatim is that our stacks are not the same as `wasmtime_fiber::FiberStack`. Trying to make our stacks compatible with that type would be quite difficult and IMO not worth the effort.

However, our medium-term plan was to add similar allocation and deallocation functions to `InstanceAllocatorImpl`. This shouldn't be too much work, but it's something we didn't see through for now. In particular, @dhil is very interested in restoring support for pooled allocation of stacks. We had implemented our own interface for that (outside of `InstanceAllocatorImpl`) and it greatly improves performance in those use cases where you frequently (de-)allocate continuations.

> Is this because (ref cont)/VMContObj tables are being shoe-horned into the same underlying representation as anyref/externref/VMGcRef tables? Because VMContObj has a different representation from VMGcRef, I think it would be best to add a new wasmtime::runtime::vm::Table variant for VMContObj, the same way that we have different variants for VMGcRef tables versus *const VMFuncRef tables.

We actually extend `StaticTable` and `TableElement` in that file with appropriate variants for our types. The problem is that in `TablePool::new` we need to `mmap` enough memory to hold `limits.total_tables` * `limits.table_elements` entries, independent from what data is stored in those tables. In order to support tables storing continuation pointers, this effectively doubles the required size of the mmapped area.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
From a technical perspective, there is no need for us to have our own config setting for the stack size. I guess the reason we didn't just re-use the existing `stack_size` option is that the latter always felt like a soft limit to me: AFAIK you can still occupy stack space beyond that limit, for example with host functions.

But I have no strong feelings about re-using the existing `stack_size` config option, or merging our `stack_switching_stack_size` with `async_stack_size`.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Good question! These only exist for some hacky debug printing code that should not end up in `main`. Initially, when we thought that there would only be a single, big PR, the plan was to keep them during reviewing, and then remove them at the last moment before merging. Now that we will land this in smaller pieces, we could either remove them now, or once all PRs from the series have landed.

The reason why we don't want to land this is explained here: on the codegen side, we just emit the addresses of Rust string literals as literals into generated code, with no relocation information. So this crashes and burns if you ever try to re-use the same generated code from a different invocation of `wasmtime`.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Haha, these were in `wasmtime_environ` originally, but then @alexcrichton suggested moving them into the runtime. I don't mind it either way.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Yeah, that `drop` call is definitely not pretty. We could get away with not explicitly calling `drop` by managing the scopes here slightly differently, but that seems equally funky.

The main reason for all this new `EntryStoreContext` business is the following: previously, there were two places that do some updates to the runtime on entering and exiting Wasm: in `catch_traps` (together with `CallThreadState`'s `drop`) and here in `func.rs` (using `enter_wasm`/`exit_wasm`). I believe @alexcrichton mentioned that this split is mostly for legacy reasons due to how things were separated into different crates until recently.

For stack switching, we now require an additional update to the runtime on entering and exiting. Intuitively, it felt to me like these changes should be made in `enter_wasm`/`exit_wasm`, but that doesn't quite work: we need to undo these changes if you `panic` out of a Wasm execution, so I would indeed need to do all of this in `catch_traps` instead, so that `CallThreadState`'s `drop` can do the steps needed on exiting the runtime.

But updating the stack chain in `catch_traps` to signal that we are entering Wasm seemed awkward to me, so I decided to just consolidate all the logic for changes that need to happen on entering and exiting Wasm into one place, which resulted in the birth of `EntryStoreContext`.

TL;DR: Technically, all that's needed for stack switching could be done in `catch_traps`, but that felt like it's morally the wrong spot to me.

Any suggestions? Should I just move it to `catch_traps`?
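The save/restore behavior under discussion boils down to an RAII-style guard whose `drop` runs on both normal and panicking exits. A minimal sketch of that pattern, with illustrative types that are not the real `EntryStoreContext`:

```rust
// Save runtime state on Wasm entry; restore it on exit, including exits via
// panic unwinding. `StoreState` is a stand-in for the real saved fields.

struct StoreState {
    stack_chain: usize, // stand-in for the runtime state that must survive
}

struct EntryGuard<'a> {
    store: &'a mut StoreState,
    saved: usize,
}

impl<'a> EntryGuard<'a> {
    fn enter(store: &'a mut StoreState, new_chain: usize) -> Self {
        let saved = store.stack_chain;
        store.stack_chain = new_chain; // state in effect while Wasm runs
        EntryGuard { store, saved }
    }
}

impl Drop for EntryGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal return *and* when a panic unwinds out of Wasm,
        // which is why `drop` (rather than explicit calls) does the restore.
        self.store.stack_chain = self.saved;
    }
}
```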
fitzgen commented on PR #10388:
(Sorry for the long delay, I've moved to London and started a new job, where I needed to settle in first)
No worries. Congrats on the move and the new gig!
> The reason we do not use these verbatim is that our stacks are not the same as `wasmtime_fiber::FiberStack`. Trying to make our stacks compatible with that type would be quite difficult and IMO not worth the effort.

I'd like to dive more into this because I would (perhaps naively?) assume that there is no fundamental reason they couldn't be the same. This is also one of the biggest concerns I have about the stack-switching implementation: the complexity of having multiple kinds of stacks that the runtime must manage; when there are different kinds of resources, we will eventually need different config knobs for tuning them; etc. Basically, this feels like a bunch of cognitive overhead for both maintainers and users. Keeping things uniform keeps things simpler. For example, https://github.com/bytecodealliance/wasmtime/issues/9350 (which is admittedly a beast of a PR; I have been splitting it into smaller pieces and landing those incrementally) will ultimately reduce complexity in Wasmtime by unifying our handling of Wasm linear memories and GC heaps. I just point this out so that you know I'm not trying to pick on you, and that I am attempting to hold myself to these standards as well.

But perhaps I am missing some key point that makes it so that these stack-switching stacks and fiber stacks really do have different requirements that do not overlap enough that sharing an implementation makes sense?

You say it would be quite difficult to unify them; can you expand on what difficulties you foresee?
> In particular, @dhil is very interested in restoring support for pooled allocation of stacks. We had implemented our own interface for that (outside of `InstanceAllocatorImpl`) and it greatly improves performance in those use cases where you frequently (de-)allocate continuations.

Totally, that's exactly what I'd expect: a free list pop should be much faster than `mmap`, especially under load in a concurrent, multi-threaded embedding of Wasmtime.

> We actually extend `StaticTable` and `TableElement` in that file with appropriate variants for our types. The problem is that in `TablePool::new` we need to `mmap` enough memory to hold `limits.total_tables` * `limits.table_elements` entries, independent from what data is stored in those tables. In order to support tables storing continuation pointers, this effectively doubles the required size of the mmapped area.

Ah okay, I get it now. You're right that we definitely don't want to cut the capacity of all existing tables in half in order to support a proposal that 99% of users won't be leveraging yet.
Two things:
- And this is a pre-existing issue: that calculation of table size should be a little more robust, and use something like `mem::size_of::<TableElementUnion>()` so that it is more future-proof for changes like we have here where we are adding new kinds of table elements.
- We could plumb through a boolean for whether the stack-switching proposal is enabled to this method, or the whole `WasmFeatures` or something, and then adjust the size of a single element in the table size calculation based on that value.

In the meantime, halving the capacity just for continuation tables (rather than all other existing tables) is fine. And we can do the above suggestions in follow up PRs.
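The first suggestion above could look roughly like this; `TableElementUnion`, `FuncRefElem`, and `ContObjElem` are hypothetical names made up purely to illustrate the size calculation:

```rust
// Size table-pool slots via a union over all element kinds, instead of
// hard-coding one pointer width. Types here are illustrative stand-ins.
use std::mem::size_of;

type FuncRefElem = *const u8; // stand-in for `*const VMFuncRef`: one pointer

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct ContObjElem {
    seqno: u64,         // `VMContObj` is two words wide:
    contref: *const u8, // a sequence counter plus a pointer
}

#[allow(dead_code)]
#[repr(C)]
union TableElementUnion {
    func: FuncRefElem,
    cont: ContObjElem,
}

// Reserving virtual memory based on the union's size automatically accounts
// for any future, larger element kind.
fn table_pool_reservation(total_tables: usize, table_elements: usize) -> usize {
    total_tables * table_elements * size_of::<TableElementUnion>()
}
```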
fitzgen submitted PR review.
fitzgen created PR review comment:
> AFAIK you can still occupy stack space beyond that limit, for example with host functions.

We are generally a little loosey-goosey with our nomenclature here, but to make things precise for a moment:
- "Host functions" = functions defined in native code by the Wasmtime embedder (e.g. via `wasmtime::Func::new`) and imported by Wasm
- "libcalls" = various bits of Wasm functionality (e.g. `memory.grow`) that are not implemented inline in JIT code, but instead in native functions inside of Wasmtime's runtime.

libcalls are part of Wasmtime's trusted compute base, host functions are not (modulo some hand waving around their usage of `unsafe`).

Given all that: we will execute libcalls on async fibers, but will never execute host functions on async fibers. When Wasm calls an imported host function, we first switch from the async fiber to the native stack, then run the host function, and finally switch back to the Wasm's async fiber stack to pass the function results back to Wasm JIT code.

A libcall should never access stack memory beyond the `async_stack_size` limit (and we rely on guard pages to help enforce this). The Wasm JIT code is given `max_wasm_stack` bytes for its own use, it must always be true that `async_stack_size > max_wasm_stack`, and therefore that leaves `async_stack_size - max_wasm_stack` bytes of stack space for libcalls in the worst case.

This is all a little bit of an aside and I'm mostly just explaining all this to make sure we are on the same page here. I don't think any of this needs to change for stack-switching, and I would also expect that we would call libcalls, but never host functions, on stack-switching stacks, same way we do things on our async fibers.
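The budget described above can be made concrete with a toy calculation; the function name and the numbers below are made up, only the invariant matters:

```rust
// Worst-case stack headroom for libcalls: whatever remains of the async
// stack after Wasm JIT code has used its max_wasm_stack allowance.
fn libcall_headroom(async_stack_size: usize, max_wasm_stack: usize) -> Option<usize> {
    // Only a configuration with async_stack_size > max_wasm_stack is valid.
    if async_stack_size > max_wasm_stack {
        Some(async_stack_size - max_wasm_stack)
    } else {
        None
    }
}
```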
fitzgen submitted PR review.
fitzgen created PR review comment:
> The reason why we don't want to land this is explained here: On the codegen side, we just emit the addresses of Rust string literals as literals into generated code, with no relocation information. So this crashes and burns if you ever try to re-use the same generated code from a different invocation of `wasmtime`.

Ah okay! Yes, definitely should not land these bits in that case.
fitzgen submitted PR review.
fitzgen created PR review comment:
Ha, let's keep it as-is for now then. I don't want to make you bounce back and forth.
fitzgen submitted PR review.
fitzgen created PR review comment:
> The main reason for all this new `EntryStoreContext` business is the following: previously, there were two places that do some updates to the runtime on entering and exiting Wasm: in `catch_traps` (together with `CallThreadState`'s `drop`) and here in `func.rs` (using `enter_wasm`/`exit_wasm`).

...

> Any suggestions? Should I just move it to `catch_traps`?

Do you mind splitting out just the `EntryStoreContext` bits (without even the stack chain preservation) into its own PR? I think that would help me get a better grasp of these refactorings and provide better review and feedback, especially since the Wasm entry/exit paths are both very fiddly and also perf sensitive.

I do like the idea of unifying these code paths a lot, but I just want to be very careful about changes to them in general. Honestly, it is fairly likely that we won't change the mechanisms any further than what you have already done in this PR here, but it would help a lot with alleviating my nibbling paranoia that we are introducing subtle bugs or perf regressions.
And I apologize for asking you to keep splitting these PRs even finer-grained. I know it can be a pain (again, see https://github.com/bytecodealliance/wasmtime/pull/10503) but it is super super super helpful for reviewers and making sure that we are following along with all the changes. Thanks!
Last updated: Apr 17 2025 at 08:04 UTC