alexcrichton opened PR #11430 from alexcrichton:internal-async-functions to bytecodealliance:main:
This commit is an initial step towards resolving https://github.com/bytecodealliance/wasmtime/issues/11262 by having async functions internally in Wasmtime actually be `async` instead of requiring the use of fibers. This is expected to have a number of benefits:

* The Rust compiler can be used to verify a future is `Send` instead of "please audit the whole codebase's stack-local variables".
* Raw pointer workarounds during table/memory growth will no longer be required since the arguments can properly be a split borrow of data in the store (eventually leading to unblocking https://github.com/bytecodealliance/wasmtime/issues/11178).
* Less duplication inside of Wasmtime and clearer implementations internally. For example the GC bits prior to this PR had duplicated sync/async entrypoints (sometimes a few layers deep) which eventually bottomed out in `*_maybe_async` bits which were `unsafe` and required fiber bits to be set up. All of that is now gone, with the `async` functions being the "source of truth" and the sync functions just calling them.
* Fibers are not required for operations such as a GC, growing memory, etc.

The general idea is that the source of truth for the implementation of Wasmtime internals is `async` functions. These functions are callable from synchronous functions in the API with a documented panic condition about avoiding them when `Config::async_support` is disabled. When `async_support` is disabled it's known internally that there should never be an `.await` point, meaning that we can poll the future of the async version once and assert that it's ready.
This commit is not the full realization of plumbing `async` everywhere internally in Wasmtime. Instead all this does is plumb through the async-ness of `ResourceLimiterAsync` and that's it, aka memory and table growth are now properly async. It turns out, though, that these limiters are extremely deep within Wasmtime and thus necessitated many changes to get this all working. In the end this ended up covering some of the trickier parts of dealing with async and propagating that throughout the runtime.

Most changes in this commit are intended to be straightforward, but a summary is:
* Many more functions are `async` and `.await` their internals.
* Some instances of run-a-closure-and-catch-the-error are now replaced with a type-with-`Drop` as that's the equivalent in the async world.
* Internal traits in Wasmtime are now `#[async_trait]` to be object safe. This has a performance impact detailed more below.
* `vm::assert_ready` is used in synchronous contexts to assert that the async version is done immediately. This is intended to always be accompanied with an assert about `async_support` nearby.
* `vm::one_poll` is used to test if an asynchronous computation is ready yet, and is used in a few locations where a synchronous public API says it'll work in `async_support` mode but fails with an async resource limiter (a rough sketch follows this list).
* GC and other internals were simplified where `async` functions are now the "guts" and sync functions are thin veneers over these `async` functions.
* An example of new async functions is that lazy GC store allocation and instance allocation are both async functions now.
* In a small number of locations a conditional check of `store.async_support()` is done. For example during GC if `async_support` is enabled arbitrary yield points are injected. For libcalls if it's enabled `block_on` is used, or otherwise it's asserted to complete synchronously.
* Previously `unsafe` functions requiring external fiber handling are now all safe and `async`.
* Libcalls have a `block_on!` helper macro which should itself be a function-taking-async-closure but requires future Rust features to make it a function.
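For illustration, a helper along the lines of the `vm::one_poll` described above could be sketched like this (again using a no-op waker; this is an assumption about the shape of the helper, not the actual Wasmtime code):

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

/// Poll a future exactly once, returning `Some(output)` if it completed
/// immediately and `None` if it would have to suspend (for example when
/// an async resource limiter is configured behind a synchronous API).
fn one_poll<F: Future>(fut: F) -> Option<F::Output> {
    let mut fut = pin!(fut);
    let mut cx = Context::from_waker(Waker::noop());
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```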
A consequence of this refactoring is that instantiation is now slower than before. For example from our `instantiation.rs` benchmark:

```
sequential/pooling/spidermonkey.wasm
                        time:   [2.6674 µs 2.6691 µs 2.6718 µs]
                        change: [+20.975% +21.039% +21.111%] (p = 0.00 < 0.05)
                        Performance has regressed.
```

Other benchmarks I've been looking at locally in `instantiation.rs` have
pretty wild swings from either a performance improvement in this PR of
10% to a regression of 20%. This benchmark in particular though, also
one of the more interesting ones, is consistently 20% slower with this
commit. Attempting to bottom out this performance difference it looks
like it's largely "just async state machines vs not" where nothing else
really jumps out in the profile to me. In terms of absolute numbers the
time-to-instantiate is still in the single-digit-microsecond range with
`madvise` being the dominant function.
alexcrichton commented on PR #11430:
I'm starting this as a draft for now while I sort out CI things, but I also want to have some discussion about this ideally before landing. I plan on bringing this up in tomorrow's Wasmtime meeting.
alexcrichton updated PR #11430.
alexcrichton updated PR #11430.
tschneidereit commented on PR #11430:
> Attempting to bottom out this performance difference it looks like it's largely "just async state machines vs not" where nothing else really jumps out in the profile to me.

IIUC, that means the overhead is a fixed cost that should be stable across different module types, as opposed to somehow scaling with _something_ about the module type itself? If so, that seems not ideal but okay to me, given that we're talking about 0.5us. Otherwise I'd like to understand the implications a bit more.
fitzgen commented on PR #11430:
For posterity, in today's Wasmtime meeting, we discussed this PR and ways to claw back some of the perf regression. The primary option we discussed was using Cranelift to compile a state-initialization function, which we have on file as https://github.com/bytecodealliance/wasmtime/issues/2639
alexcrichton commented on PR #11430:
Ok I've done some more performance profiling and analysis of this. After more thinking and more optimizing, I think I've got an idea for a design that is cheaper at runtime and also doesn't require `T: Send`. It'll require preparatory refactorings though, so I'm going to start things out in https://github.com/bytecodealliance/wasmtime/pull/11442 and we can go from there. I've got everything queued up in my head I think, but it'll take some time to get it all into PRs. The other benefit of all of this is that it's going to resolve a number of issues related to unsafe code and unnecessary `unsafe`, e.g. #11442 handles an outstanding unsafe block in `table.rs`.
alexcrichton commented on PR #11430:
Further work/investigation on https://github.com/bytecodealliance/wasmtime/pull/11468 revealed an optimization opportunity I was not aware of, but which makes sense in retrospect: in an `async` function, if an `.await` point is dynamically not executed then the function will execute faster. This makes sense to me because it avoids updating a state machine and/or spilling locals and execution continues as "normal", so hot-path/fast-path optimizations need to ensure, statically, that `.await` isn't necessary on the fast path.

With https://github.com/bytecodealliance/wasmtime/pull/11468 there's no performance regression currently. That's not the complete story, but I'm growing confident we can land this PR without `T: Send` and without a performance regression. Basically we get to have our cake and eat it too.
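A small, self-contained illustration of that observation (the names here are hypothetical, not Wasmtime's actual internals): if the `.await` sits behind a branch that the common case never takes, the common case never executes an `.await` at all.

```rust
struct Ctx {
    needs_async_limiter: bool,
}

// Stands in for awaiting an async resource limiter.
async fn slow_path_grow(delta: usize) -> bool {
    delta < 1024
}

async fn grow(ctx: &Ctx, delta: usize) -> bool {
    if ctx.needs_async_limiter {
        // Only this branch reaches an `.await` point and has to save
        // state into the generated state machine.
        slow_path_grow(delta).await
    } else {
        // The common case dynamically skips the `.await`, which is the
        // property observed above to keep the hot path fast.
        true
    }
}
```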
alexcrichton closed without merge PR #11430.
alexcrichton commented on PR #11430:
Ok, through all the various PRs above this PR is now entirely obsolete. All the benefits of this are on `main`, yay!

There's a 5% performance regression on `main` relative to when I started this work which is due to `#[async_trait]` making boxed futures. Otherwise though I think it all worked out well!
tschneidereit commented on PR #11430:
> There's a 5% performance regression on `main` relative to when I started this work which is due to `#[async_trait]` making boxed futures.

Can you say more about what kinds of things regressed? Or is this just "everything is pretty uniformly 5% slower"?
And separately, is there anything we can do to claw this back? And if so, can we track that somewhere?
alexcrichton commented on PR #11430:
Throughout this work I was watching the `sequential/pooling/(spidermonkey|wasi).wasm` benchmarks defined in `benches/instantiation.rs` in this repo. I copied `spidermonkey.wasm` from Sightglass, and otherwise this benchmark repeatedly instantiates these wasm modules in a loop. The 5% regression was in time-to-instantiate-and-tear-down-the-store as measured by Criterion. Numbers were in the ~2us range for both modules and the 5% regression was on that number as well.

https://github.com/bytecodealliance/wasmtime/pull/11470 was the cause of this change, and in profiling and analyzing that my conclusion was that it's more-or-less entirely due to `#[async_trait]`. Previously where we had only dynamic dispatch we now have dynamic dispatch plus heap-allocated futures. The extra heap allocation was the main thing showing up in the profile as different from before. Effectively each table and memory being allocated now requires a heap-allocated future to track the state of progressing through the allocation.

I don't really know of a great way to claw back this performance easily. One option is to wait for dyn-compatible async traits in Rust, but that's likely to take a while. Another option is to possibly have both an async and a sync trait method and dynamically select which one depending on the resource limiter that's been configured. For the small wins here though I'd say that's probably not worth it, personally. Given the scale of the numbers here and the micro-benchmark nature I also wasn't planning on tracking this, since we generally just try to get instantiation as fast as possible as opposed to "must be below this threshold at all times". In that sense it's a larger constant factor than before, but that's naturally going to fluctuate over time IMO.
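For context, the extra allocation comes from the usual `#[async_trait]` desugaring, which (roughly) turns an async trait method into one returning a boxed future so the trait remains object safe; the trait and method names below are illustrative, not the actual `ResourceLimiterAsync` definition:

```rust
use std::future::Future;
use std::pin::Pin;

// What is written with #[async_trait]:
//
//     #[async_trait]
//     trait LimiterAsync {
//         async fn memory_growing(&mut self, desired: usize) -> bool;
//     }
//
// expands to roughly this, where every call heap-allocates the future:
trait LimiterAsync {
    fn memory_growing<'a>(
        &'a mut self,
        desired: usize,
    ) -> Pin<Box<dyn Future<Output = bool> + Send + 'a>>;
}
```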
tschneidereit commented on PR #11430:
Thank you, that's very helpful. I was mildly concerned because I thought you were talking about _everything_ being 5% slower. If it's just instantiation (and I now remember you mentioning this earlier), not e.g. execution throughput, then that's much less concerning. I think that all seems fine, then.
Last updated: Dec 06 2025 at 07:03 UTC