Stream: git-wasmtime

Topic: wasmtime / PR #11430 Use `async fn` internally within Was...


Wasmtime GitHub notifications bot (Aug 13 2025 at 18:24):

alexcrichton opened PR #11430 from alexcrichton:internal-async-functions to bytecodealliance:main:

This commit is an initial step towards resolving https://github.com/bytecodealliance/wasmtime/issues/11262 by having async
functions internally within Wasmtime actually be async instead of requiring
the use of fibers. This is expected to have a number of benefits.
The general idea is that the source of truth for the implementation of
Wasmtime internals is a set of async functions. These functions are
callable from synchronous functions in the API, which carry a documented
panic condition restricting their use to when Config::async_support is
disabled. When async_support is disabled it's known internally that there
should never be an .await point, meaning that we can poll the future of the
async version once and assert that it's ready.
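
A minimal sketch of that poll-once-and-assert pattern, using only the standard library (the helper name is made up and this is not Wasmtime's actual internal code): poll the future a single time with a no-op waker and assert it's already ready.

    use std::future::Future;
    use std::pin::pin;
    use std::task::{Context, Poll, Waker};

    // Hypothetical helper: run an internal `async fn` from a synchronous API
    // entry point when Config::async_support is disabled. Since no .await can
    // actually suspend in that configuration, a single poll must complete.
    fn assert_ready<F: Future>(future: F) -> F::Output {
        let mut future = pin!(future);
        match future.as_mut().poll(&mut Context::from_waker(Waker::noop())) {
            Poll::Ready(value) => value,
            Poll::Pending => unreachable!("async_support is disabled, so this future must not suspend"),
        }
    }

The real code paths presumably propagate errors and traps rather than asserting, but the shape is the same.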

This commit is not the full realization of plumbing async everywhere
internally in Wasmtime. Instead all this does is plumb through the
async-ness of ResourceLimiterAsync and that's it, i.e. memory and table
growth are now properly async. It turns out, though, that these limiters
sit extremely deep within Wasmtime and thus necessitated many changes to
get this all working. In the end this covered some of the trickier parts
of dealing with async and propagating it throughout the runtime.

Most changes in this commit are intended to be straightforward.

A consequence of this refactoring is that instantiation is now slower
than before. For example from our instantiation.rs benchmark:

sequential/pooling/spidermonkey.wasm
                        time:   [2.6674 µs 2.6691 µs 2.6718 µs]
                        change: [+20.975% +21.039% +21.111%] (p = 0.00 < 0.05)
                        Performance has regressed.

Other benchmarks I've been looking at locally in instantiation.rs have
pretty wild swings, anywhere from a 10% performance improvement with this
PR to a 20% regression. This benchmark in particular, though, also one of
the more interesting ones, is consistently 20% slower with this commit.
Attempting to bottom out this performance difference, it looks like it's
largely "just async state machines vs not", as nothing else really jumps
out at me in the profile. In terms of absolute numbers the
time-to-instantiate is still in the single-digit-microsecond range, with
madvise being the dominant function.

Wasmtime GitHub notifications bot (Aug 13 2025 at 18:25):

alexcrichton commented on PR #11430:

I'm starting this as a draft for now while I sort out CI things, but I also want to have some discussion about this ideally before landing. I plan on bringing this up in tomorrow's Wasmtime meeting.

Wasmtime GitHub notifications bot (Aug 13 2025 at 18:30):

alexcrichton updated PR #11430.

Wasmtime GitHub notifications bot (Aug 13 2025 at 19:30):

alexcrichton updated PR #11430.

Wasmtime GitHub notifications bot (Aug 14 2025 at 08:51):

tschneidereit commented on PR #11430:

Attempting to bottom out this performance difference it looks
like it's largely "just async state machines vs not" where nothing else
really jumps out in the profile to me.

IIUC, that means the overhead is a fixed cost that should be stable across different module types, as opposed to somehow scaling with _something_ about the module type itself? If so, that seems not ideal but okay to me, given that we're talking about 0.5µs. Otherwise I'd like to understand the implications a bit more.

Wasmtime GitHub notifications bot (Aug 14 2025 at 18:13):

fitzgen commented on PR #11430:

For posterity, in today's Wasmtime meeting, we discussed this PR and ways to claw back some of the perf regression. The primary option we discussed was using Cranelift to compile a state-initialization function, which we have on file as https://github.com/bytecodealliance/wasmtime/issues/2639

Wasmtime GitHub notifications bot (Aug 15 2025 at 21:48):

alexcrichton commented on PR #11430:

Ok I've done some more performance profiling and analysis of this. After more thinking and more optimizing, I think I've got an idea for a design that is cheaper at runtime and also doesn't require T: Send. It'll require preparatory refactorings though, so I'm going to start things out in https://github.com/bytecodealliance/wasmtime/pull/11442 and we can go from there. I think I've got everything queued up in my head, but it'll take some time to get it all into PRs. The other benefit of all of this is that it's going to resolve a number of issues related to unsafe code and unnecessary unsafe, e.g. #11442 handles an outstanding unsafe block in table.rs.

Wasmtime GitHub notifications bot (Aug 19 2025 at 22:59):

alexcrichton commented on PR #11430:

Further work/investigation on https://github.com/bytecodealliance/wasmtime/pull/11468 revealed an optimization opportunity I was not aware of, but which makes sense in retrospect: in an async function, if an .await point is dynamically not executed then the function executes faster. This makes sense to me because it avoids updating a state machine and/or spilling locals, and execution continues as "normal", so hot-path/fast-path optimizations need to arrange, statically, that the .await isn't necessary.
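
A rough illustration of that effect with made-up types (this is not Wasmtime's actual code; ResourceLimiterAsync is only the inspiration): the .await sits behind a branch, so when no async limiter is configured the future completes without ever suspending and the compiled state machine stays on its cheap path.

    // Hypothetical stand-in for an async resource limiter; in Wasmtime this
    // role is played by ResourceLimiterAsync, but the types here are made up.
    struct Limiter {
        max: usize,
    }

    impl Limiter {
        async fn memory_growing(&mut self, _current: usize, desired: usize) -> bool {
            // A real limiter might await a host callback or I/O here.
            desired <= self.max
        }
    }

    struct Memory {
        size: usize,
        limiter: Option<Limiter>,
    }

    impl Memory {
        // The .await is only reached when a limiter is configured; when it
        // isn't, this future completes without ever suspending, which is the
        // fast-path effect described above.
        async fn grow(&mut self, delta: usize) -> Option<usize> {
            let desired = self.size + delta;
            if let Some(limiter) = self.limiter.as_mut() {
                if !limiter.memory_growing(self.size, desired).await {
                    return None;
                }
            }
            self.size = desired;
            Some(self.size)
        }
    }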

With https://github.com/bytecodealliance/wasmtime/pull/11468 there's no performance regression currently. That's not the complete story but I'm growing confident we can land this PR without T: Send and without a performance regression. Basically we get to have our cake and eat it too.

Wasmtime GitHub notifications bot (Aug 21 2025 at 01:15):

alexcrichton closed without merge PR #11430.

Wasmtime GitHub notifications bot (Aug 21 2025 at 01:15):

alexcrichton commented on PR #11430:

Ok, through all the various PRs above, this PR is now entirely obsolete. All the benefits of this are on main, yay!

There's a 5% performance regression on main relative to when I started this work which is due to #[async_trait] making boxed futures. Otherwise though I think it all worked out well!

Wasmtime GitHub notifications bot (Aug 21 2025 at 08:15):

tschneidereit commented on PR #11430:

There's a 5% performance regression on main relative to when I started this work which is due to #[async_trait] making boxed futures.

Can you say more about what kinds of things regressed? Or is this just "everything is pretty uniformly 5% slower"?

And separately, is there anything we can do to claw this back? And if so, can we track that somewhere?

Wasmtime GitHub notifications bot (Aug 21 2025 at 19:23):

alexcrichton commented on PR #11430:

Throughout this work I was watching the sequential/pooling/(spidermonkey|wasi).wasm benchmark defined in benches/instantiation.rs in this repo. I copied spidermonkey.wasm from Sightglass; otherwise the benchmark just repeatedly instantiates these wasm modules in a loop. The 5% regression was in time-to-instantiate-and-tear-down-the-store as measured by Criterion. Numbers were in the ~2µs range for both modules, and the 5% regression was on that number as well.
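
For reference, a stripped-down sketch of that kind of measurement (not the actual benches/instantiation.rs; the module path is a placeholder, and error handling and WASI setup are omitted, so it assumes a module with no unresolved imports): each iteration creates a Store, instantiates, and drops both, so Criterion reports time-to-instantiate-and-tear-down-the-store.

    use criterion::{criterion_group, criterion_main, Criterion};
    use wasmtime::{Engine, Linker, Module, Store};

    // Simplified stand-in for the sequential instantiation benchmark.
    fn bench_instantiation(c: &mut Criterion) {
        let engine = Engine::default();
        let module = Module::from_file(&engine, "module.wasm").unwrap();
        let linker: Linker<()> = Linker::new(&engine);
        c.bench_function("sequential/instantiate", |b| {
            b.iter(|| {
                // Store creation, instantiation, and teardown are all inside
                // the timed loop.
                let mut store = Store::new(&engine, ());
                linker.instantiate(&mut store, &module).unwrap();
            })
        });
    }

    criterion_group!(benches, bench_instantiation);
    criterion_main!(benches);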

https://github.com/bytecodealliance/wasmtime/pull/11470 was the cause of this change, and in profiling and analyzing it my conclusion was that it's more-or-less entirely due to #[async_trait]. Where previously we had only dynamic dispatch, we now have dynamic dispatch plus heap-allocated futures, and that extra heap allocation was the main thing showing up in the profile as different from before. Effectively, each table and memory being allocated now requires a heap-allocated future to track the state of progressing through that allocation.
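
For context, a minimal illustration of why #[async_trait] implies a per-call heap allocation (the trait here is made up for illustration and requires the async-trait crate; the real trait in Wasmtime is ResourceLimiterAsync): the macro rewrites each async fn into a method returning a boxed, pinned future.

    use async_trait::async_trait;

    // Hypothetical limiter trait, for illustration only.
    #[async_trait]
    trait Limiter {
        async fn memory_growing(&mut self, current: usize, desired: usize) -> bool;
    }

    // What the macro roughly expands to: dynamic dispatch still works, but
    // every call now allocates a Box to hold the future's state machine:
    //
    //     fn memory_growing<'a>(
    //         &'a mut self,
    //         current: usize,
    //         desired: usize,
    //     ) -> Pin<Box<dyn Future<Output = bool> + Send + 'a>>;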

I don't really know of a great way to claw back this performance easily. One option is to wait for dyn-compatible async traits in Rust, but that's likely to take a while. Another option is to have both an async and a sync trait method and dynamically select which one to use depending on the resource limiter that's been configured (roughly sketched below). For the small wins here, though, I'd say that's probably not worth it, personally. Given the scale of the numbers here and the micro-benchmark nature I also wasn't planning on tracking this, since we generally just try to get instantiation as fast as possible as opposed to enforcing "must be below this threshold at all times". In that sense it's a larger constant factor than before, but that's naturally going to fluctuate over time IMO.
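
A rough sketch of that second option, with made-up names (nothing here is a real Wasmtime API): keep a dyn-compatible async method for async limiters, plus a sync method that the hot path can use to skip the boxed future entirely.

    use std::future::Future;
    use std::pin::Pin;

    // Hypothetical trait combining a sync fast path with a boxed-async slow path.
    trait Limiter: Send {
        // Sync fast path; None means "fall back to the async hook".
        fn memory_growing_sync(&mut self, _current: usize, _desired: usize) -> Option<bool> {
            None
        }

        // Async slow path, boxed so the trait stays dyn-compatible.
        fn memory_growing_async<'a>(
            &'a mut self,
            current: usize,
            desired: usize,
        ) -> Pin<Box<dyn Future<Output = bool> + Send + 'a>>;
    }

    // The runtime picks whichever hook the configured limiter supports; only
    // async limiters pay for a heap-allocated future per call.
    async fn check_growth(limiter: &mut dyn Limiter, current: usize, desired: usize) -> bool {
        match limiter.memory_growing_sync(current, desired) {
            Some(ok) => ok,
            None => limiter.memory_growing_async(current, desired).await,
        }
    }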

Wasmtime GitHub notifications bot (Aug 21 2025 at 20:11):

tschneidereit commented on PR #11430:

Thank you, that's very helpful. I was mildly concerned because I thought you were talking about _everything_ being 5% slower. If it's just instantiation (and I now remember you mentioning this earlier), not e.g. execution throughput, then that's much less concerning. I think that all seems fine, then.

