alexcrichton opened PR #11430 from alexcrichton:internal-async-functions to bytecodealliance:main:
This commit is an initial step towards resolving https://github.com/bytecodealliance/wasmtime/issues/11262 by having async functions internally in Wasmtime actually be `async` instead of requiring the use of fibers. This is expected to have a number of benefits:

* The Rust compiler can be used to verify a future is `Send` instead of "please audit the whole codebase's stack-local variables".
* Raw pointer workarounds during table/memory growth will no longer be required since the arguments can properly be a split borrow of data in the store (eventually leading to unblocking https://github.com/bytecodealliance/wasmtime/issues/11178).
* Less duplication inside of Wasmtime and clearer implementations internally. For example the GC bits prior to this PR had duplicated sync/async entrypoints (sometimes a few layers deep) which eventually bottomed out in `*_maybe_async` bits which were `unsafe` and required fiber bits to be set up. All of that is now gone, with the `async` functions being the "source of truth" and the sync functions just calling them.
* Fibers are not required for operations such as a GC, growing memory, etc.

The general idea is that the source of truth for the implementation of Wasmtime internals is `async` functions. These functions are callable from synchronous functions in the API with a documented panic condition about avoiding them when `Config::async_support` is disabled. When `async_support` is disabled it's known internally that there should never be an `.await` point, meaning that we can poll the future of the async version once and assert that it's ready.
This commit is not the full realization of plumbing `async` everywhere internally in Wasmtime. Instead all this does is plumb through the async-ness of `ResourceLimiterAsync` and that's it, aka memory and table growth are now properly async. It turns out, though, that these limiters are extremely deep within Wasmtime and thus necessitated many changes to get this all working. In the end this ended up covering some of the trickier parts of dealing with async and propagating that throughout the runtime.

Most changes in this commit are intended to be straightforward, but a summary is:
* Many more functions are `async` and `.await` their internals.
* Some instances of run-a-closure-and-catch-the-error are now replaced with a type-with-`Drop` as that's the equivalent in the async world.
* Internal traits in Wasmtime are now `#[async_trait]` to be object safe. This has a performance impact detailed more below.
* `vm::assert_ready` is used in synchronous contexts to assert that the async version is done immediately. This is intended to always be accompanied with an assert about `async_support` nearby.
* `vm::one_poll` is used to test if an asynchronous computation is ready yet, and is used in a few locations where a synchronous public API says it'll work in `async_support` mode but fails with an async resource limiter (a rough sketch follows this list).
* GC and other internals were simplified where `async` functions are now the "guts" and sync functions are thin veneers over these `async` functions.
* An example of new async functions is that lazy GC store allocation and instance allocation are both async functions now.
* In a small number of locations a conditional check of `store.async_support()` is done. For example during GC if `async_support` is enabled arbitrary yield points are injected. For libcalls if it's enabled `block_on` is used, or otherwise it's asserted to complete synchronously.
* Previously `unsafe` functions requiring external fiber handling are now all safe and `async`.
* Libcalls have a `block_on!` helper macro which should itself be a function-taking-async-closure but requires future Rust features to make it a function.
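For illustration, a helper along the lines of the `vm::one_poll` described above could be sketched like this (again using a no-op waker; this is an assumption about the shape of the helper, not the actual Wasmtime code):

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

/// Poll a future exactly once, returning `Some(output)` if it completed
/// immediately and `None` if it would have to suspend (for example when
/// an async resource limiter is configured behind a synchronous API).
fn one_poll<F: Future>(fut: F) -> Option<F::Output> {
    let mut fut = pin!(fut);
    let mut cx = Context::from_waker(Waker::noop());
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```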
A consequence of this refactoring is that instantiation is now slower than before. For example from our `instantiation.rs` benchmark:

```
sequential/pooling/spidermonkey.wasm
                        time:   [2.6674 µs 2.6691 µs 2.6718 µs]
                        change: [+20.975% +21.039% +21.111%] (p = 0.00 < 0.05)
                        Performance has regressed.
```

Other benchmarks I've been looking at locally in `instantiation.rs` have
pretty wild swings from either a performance improvement in this PR of
10% to a regression of 20%. This benchmark in particular though, also
one of the more interesting ones, is consistently 20% slower with this
commit. Attempting to bottom out this performance difference it looks
like it's largely "just async state machines vs not" where nothing else
really jumps out in the profile to me. In terms of absolute numbers the
time-to-instantiate is still in the single-digit-microsecond range with
`madvise` being the dominant function.
alexcrichton commented on PR #11430:
I'm starting this as a draft for now while I sort out CI things, but I also want to have some discussion about this ideally before landing. I plan on bringing this up in tomorrow's Wasmtime meeting.
alexcrichton updated PR #11430.
alexcrichton updated PR #11430.
tschneidereit commented on PR #11430:
> Attempting to bottom out this performance difference it looks like it's largely "just async state machines vs not" where nothing else really jumps out in the profile to me.

IIUC, that means the overhead is a fixed cost that should be stable across different module types, as opposed to somehow scaling with _something_ about the module type itself? If so, that seems not ideal but okay to me, given that we're talking about 0.5us. Otherwise I'd like to understand the implications a bit more.
fitzgen commented on PR #11430:
For posterity, in today's Wasmtime meeting, we discussed this PR and ways to claw back some of the perf regression. The primary option we discussed was using Cranelift to compile a state-initialization function, which we have on file as https://github.com/bytecodealliance/wasmtime/issues/2639
alexcrichton commented on PR #11430:
Ok I've done some more performance profiling and analysis of this. After more thinking and more optimizing, I think I've got an idea for a design that is cheaper at runtime and also doesn't require `T: Send`. It'll require preparatory refactorings though, so I'm going to start things out in https://github.com/bytecodealliance/wasmtime/pull/11442 and we can go from there. I've got everything queued up in my head I think, but it'll take some time to get it all into PRs. The other benefit of all of this is that it's going to resolve a number of issues related to unsafe code and unnecessary `unsafe`, e.g. #11442 handles an outstanding unsafe block in `table.rs`.
alexcrichton commented on PR #11430:
Further work/investigation on https://github.com/bytecodealliance/wasmtime/pull/11468 revealed an optimization opportunity I was not aware of, but which makes sense in retrospect: in an `async` function, if an `.await` point is dynamically not executed then the function will execute faster. This makes sense to me because it avoids updating a state machine and/or spilling locals and execution continues as "normal", so hot-path/fast-path optimizations need to ensure, statically, that `.await` isn't necessary on the fast path.

With https://github.com/bytecodealliance/wasmtime/pull/11468 there's no performance regression currently. That's not the complete story, but I'm growing confident we can land this PR without `T: Send` and without a performance regression. Basically we get to have our cake and eat it too.
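A small, self-contained illustration of that observation (the names here are hypothetical, not Wasmtime's actual internals): if the `.await` sits behind a branch that the common case never takes, the common case never executes an `.await` at all.

```rust
struct Ctx {
    needs_async_limiter: bool,
}

// Stands in for awaiting an async resource limiter.
async fn slow_path_grow(delta: usize) -> bool {
    delta < 1024
}

async fn grow(ctx: &Ctx, delta: usize) -> bool {
    if ctx.needs_async_limiter {
        // Only this branch reaches an `.await` point and has to save
        // state into the generated state machine.
        slow_path_grow(delta).await
    } else {
        // The common case dynamically skips the `.await`, which is the
        // property observed above to keep the hot path fast.
        true
    }
}
```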
alexcrichton closed without merge PR #11430.
alexcrichton commented on PR #11430:
Ok, through all the various PRs above this PR is now entirely obsolete. All the benefits of this are on `main`, yay!

There's a 5% performance regression on `main` relative to when I started this work which is due to `#[async_trait]` making boxed futures. Otherwise though I think it all worked out well!
tschneidereit commented on PR #11430:
> There's a 5% performance regression on `main` relative to when I started this work which is due to `#[async_trait]` making boxed futures.

Can you say more about what kinds of things regressed? Or is this just "everything is pretty uniformly 5% slower"?
And separately, is there anything we can do to claw this back? And if so, can we track that somewhere?
alexcrichton commented on PR #11430:
Throughout this work I was watching the `sequential/pooling/(spidermonkey|wasi).wasm` benchmarks defined in `benches/instantiation.rs` in this repo. I copied `spidermonkey.wasm` from Sightglass, and otherwise this benchmark repeatedly instantiates these wasm modules in a loop. The 5% regression was in time-to-instantiate-and-tear-down-the-store as measured by Criterion. Numbers were in the ~2us range for both modules and the 5% regression was on that number as well.

https://github.com/bytecodealliance/wasmtime/pull/11470 was the cause of this change, and in profiling and analyzing that my conclusion was that it's more-or-less entirely due to `#[async_trait]`. Previously where we had only dynamic dispatch we now have dynamic dispatch plus heap-allocated futures. The extra heap allocation was the main thing showing up in the profile as different from before. Effectively each table and memory being allocated now requires a heap-allocated future to track the state of progressing through the allocation.

I don't really know of a great way to claw back this performance easily. One option is to wait for dyn-compatible async traits in Rust, but that's likely to take a while. Another option is to possibly have both an async and a sync trait method and dynamically select which one depending on the resource limiter that's been configured. For the small wins here though I'd say that's probably not worth it, personally. Given the scale of the numbers here and the micro-benchmark nature I also wasn't planning on tracking this, since we generally just try to get instantiation as fast as possible as opposed to "must be below this threshold at all times". In that sense it's a larger constant factor than before, but that's naturally going to fluctuate over time IMO.
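For context, the extra allocation comes from the usual `#[async_trait]` desugaring, which (roughly) turns an async trait method into one returning a boxed future so the trait remains object safe; the trait and method names below are illustrative, not the actual `ResourceLimiterAsync` definition:

```rust
use std::future::Future;
use std::pin::Pin;

// What is written with #[async_trait]:
//
//     #[async_trait]
//     trait LimiterAsync {
//         async fn memory_growing(&mut self, desired: usize) -> bool;
//     }
//
// expands to roughly this, where every call heap-allocates the future:
trait LimiterAsync {
    fn memory_growing<'a>(
        &'a mut self,
        desired: usize,
    ) -> Pin<Box<dyn Future<Output = bool> + Send + 'a>>;
}
```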
tschneidereit commented on PR #11430:
Thank you, that's very helpful. I was mildly concerned because I thought you were talking about _everything_ being 5% slower. If it's just instantiation (and I now remember you mentioning this earlier), not e.g. execution throughput, then that's much less concerning. I think that all seems fine, then.
Last updated: Dec 06 2025 at 07:03 UTC