Stream: wasmtime

Topic: Wasmtime sync<->sync adapter optimizability


view this post on Zulip Alex Crichton (Dec 11 2025 at 12:09):

@Chris Fallin @fitzgen (he/him) question for y'all. Components historically have had a may_{enter,leave} flag associated with them which is a spec-level definition about reentrance protection and things like that. In the course of the component-model-async work this has changed, notably may_enter, which is no longer a flag but rather morally a stack that's passed around. In the course of component-model-async development work @Joel Dice has run across a situation where the old may_enter handling is now broken and needs to be updated.

The easiest way to update things is to invoke a function on the host that Joel already has for component-model-async work. This is basically already how host<->guest calls work (or so I'm led to believe, Joel correct me if I'm wrong). The sticking point here is guest<->guest calls. Currently that manipulation/inspection of may_{enter,leave} is baked directly into the adapter between components. The flags are wasm globals defined in the VMComponentInstance.
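For readers unfamiliar with the flag-based guard, here is a minimal sketch in Rust of the check-and-set pattern described above. It is an illustration only, not Wasmtime's adapter code: the real adapters are generated wasm that reads and writes globals, and they also handle may_leave, which is omitted here. The names `Instance` and `guarded_call` are hypothetical.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a component instance with a single
/// instance-wide `may_enter` flag (the pre-async model).
struct Instance {
    may_enter: Cell<bool>,
}

/// Runs `body` as if it were a call into `callee`, trapping (modelled
/// here as `Err`) if the instance has already been entered.
fn guarded_call<T>(callee: &Instance, body: impl FnOnce() -> T) -> Result<T, &'static str> {
    // Check: refuse to enter an instance that is already on the stack.
    if !callee.may_enter.get() {
        return Err("cannot reenter component instance");
    }
    // Set: mark the instance as entered for the duration of the call.
    callee.may_enter.set(false);
    let result = body();
    // Restore the flag once the call has returned.
    callee.may_enter.set(true);
    Ok(result)
}
```

Because the whole guard is a load, a branch, and two stores, it can be inlined into the adapter, which is why replacing it with a host call matters for performance.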

This all brings me to the question for y'all: the easiest way to return to spec compliance is to update adapters to call a host function. I expect that this is unacceptable perf-wise for use cases y'all have envisioned for compile-time builtins. Are y'all in a position where you can't take a regression in compile-time-builtin runtime performance w.r.t. inlining and such? Or is there runway to land a host call for now and optimize later as necessary?

view this post on Zulip Alex Crichton (Dec 11 2025 at 12:11):

The change in question that this is coming up on is https://github.com/bytecodealliance/wasmtime/pull/12153

The spec says we should allow this, so now we do. Thanks to Alex for the test case! Fixes #12128

view this post on Zulip Chris Fallin (Dec 11 2025 at 17:34):

Hmm, thanks for the heads-up. Question: will this affect async-ABI components or "p2 style" (sync ABI) components as well?

It would be pretty unfortunate for our inlined "get a byte" accessor in a linked-in component to make a hostcall on every loop iteration in which it's called, to the extent that it would destroy our zero-copy story, so it would be nice if there were a path to not doing this. (FWIW I suspect once we have this zero-copy story working it'll be of interest to a lot of other embedded and network-stack folks as well.) Is there such a path or is it expected that this is the forever-solution?

view this post on Zulip fitzgen (he/him) (Dec 11 2025 at 17:39):

Is there a reason that continuing to use globals is difficult? As Chris said, this would be pretty damning to the compile-time builtins story. Even the global is a lot more overhead than we want

view this post on Zulip Alex Crichton (Dec 11 2025 at 17:53):

This would affect all component<->component interactions, including sync<->sync functions, which is what wasip2 today exclusively uses.

To clarify, optimizations are possible here. It requires more compilation infrastructure, more analyses, more runtime data structures, etc. It is inevitable to me that this'll get implemented.

What I'm mostly curious about is whether these optimizations are a hard requirement today for anything to land.

view this post on Zulip Joel Dice (Dec 11 2025 at 17:53):

(Note that Alex is out on PTO for a week, so probably won't respond soon. I'll try to answer as best I can.) EDIT: nevermind :)

Chris Fallin said:

Question: will this affect async-ABI components or "p2 style" (sync ABI) components as well?

It depends on how much up-front optimization we do. If no optimization, then yes, sync ABI components will be affected. One perhaps not-too-difficult optimization would be to detect that the component has only sync imports and exports at load time, carry that info through to the FACT module generator, and fall back to the existing check-and-set-a-flag behavior. There are also other optimizations available.
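The all-sync optimization mentioned above could be sketched roughly as follows. This is a hedged illustration in Rust, not Wasmtime's implementation: the types `Abi` and `ReentranceCheck` and the function `choose_check` are hypothetical, and the sketch assumes the ABI of each import and export is known at load time.

```rust
/// Hypothetical classification of a component function's ABI.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Abi {
    Sync,
    Async,
}

/// Hypothetical strategy the adapter generator would be told to emit.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ReentranceCheck {
    /// The existing check-and-set-a-flag behavior, inlined in the adapter.
    InlineFlag,
    /// The unoptimized path: ask the host to perform the check.
    HostCall,
}

/// Picks the cheap inline flag only when every import and export is sync.
fn choose_check(imports: &[Abi], exports: &[Abi]) -> ReentranceCheck {
    let all_sync = imports
        .iter()
        .chain(exports.iter())
        .all(|abi| *abi == Abi::Sync);
    if all_sync {
        ReentranceCheck::InlineFlag
    } else {
        ReentranceCheck::HostCall
    }
}
```

Under these assumptions a wasip2-style component, whose functions are all sync, would keep the current adapter code, while any component with an async import or export would take the host-call path.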

view this post on Zulip Joel Dice (Dec 11 2025 at 17:55):

If you search for trap_if_on_the_stack in https://github.com/WebAssembly/component-model/blob/main/design/mvp/CanonicalABI.md and scroll down to the prose about optimization, you can see some of the options.


view this post on Zulip Chris Fallin (Dec 11 2025 at 17:55):

I think if that single optimization were implemented (avoid all this if everything is sync) at least I would be happy for now (it would maintain our status-quo). We'd need to figure out how to avoid the hostcall when we eventually move to native async ABI but our embedding doesn't do that today.

It still seems like a pretty big perf regression in general for "tightly bound components" (little accessor calls) and I'm sure y'all have done a lot of thinking on this already but I'm happy to brainstorm or help on optimizations to make this logic inlined if needed!

view this post on Zulip Joel Dice (Dec 11 2025 at 17:56):

Alex felt the all-sync-component optimization would be onerous when we chatted last night. Alex, do you still feel that way? It doesn't feel like a huge lift to me.

view this post on Zulip fitzgen (he/him) (Dec 11 2025 at 18:00):

FWIW, we now have at least one component-level analysis in-tree already with the analyze_function_imports stuff: https://github.com/bytecodealliance/wasmtime/blob/99ef35d1477ce35754ac338f28bee542d1d42661/crates/environ/src/component/translate.rs#L550

view this post on Zulip Alex Crichton (Dec 11 2025 at 18:01):

I am not aware personally of anyone currently today relying on the absolute fastest possible performance between two components. I realize there are plans that could manifest in such a dependency, but I am under the impression that those plans are for the future. I am hoping to decouple the optimization work and get it off the critical path.

I would primarily want to make sure that the optimization work doesn't have accidental holes in it, so I'd want to be able to spend a nontrivial chunk of time thinking/reviewing/planning/etc, and that is something that I would prefer to not have on the critical path

view this post on Zulip Ralph (Dec 11 2025 at 19:16):

I'm not sure this is the kind of "relying" you mean, but we absolutely intend to have the fastest possible component<-->component performance, yes.

view this post on Zulip Ralph (Dec 11 2025 at 19:17):

HOWEVER, our time-to-market is likely so slow that this "requirement" doesn't have much impact. It might take 9-12 months before we could even offer something that takes advantage here -- I'm imagining that is sufficiently long that it isn't as directly relevant to you as it might have been.

view this post on Zulip Ralph (Dec 11 2025 at 19:17):

???

view this post on Zulip Chris Fallin (Dec 11 2025 at 20:00):

Yes, we don't literally have a dependence on this today either, but it's the sort of thing that needs to be resolved sooner rather than later from our PoV; the zero-copy approach we've sketched (and that Nick has built the foundations of) is pretty essential to us getting buy-in in the places where we plan to use Wasmtime. I guess I just want to make sure that we're not taking such a regression lightly.

I'm also curious -- we're not seeing the other side of the tradeoff here (or I haven't digested all the other threads that would give me that background, sorry) -- presumably this change is happening because of unforeseen corner cases? Or is it a spec change that has its own discussion and set of tradeoffs (such that we'd want to feed back this pain into that tradeoff decision)?

view this post on Zulip Joel Dice (Dec 11 2025 at 20:45):

Yes, this is the result of a spec change related to async tasks. Given that there can be more than one concurrent task for an instance at a given time, guarding against recursive reentrance requires not just asking "has this instance been entered" but "has this instance been entered for this task". So rather than just a single flag that's global to the instance, we at minimum need a per-task bitset with capacity for the number of instances in a composition. Please see the CanonicalABI.md discussion of trap_if_on_the_stack for a list of optimization opportunities.

In short, we have some great options for optimizing this; it's just a question of how urgent those optimizations are vs. other priorities.
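To make the per-task bitset idea concrete, here is a minimal sketch in Rust. It assumes instances in a composition are numbered from zero and that each task owns one bitset; the names `TaskEntered`, `enter`, and `leave` are hypothetical and are not taken from Wasmtime or the spec.

```rust
/// Hypothetical per-task record of which instances are on the task's stack.
/// One bit per instance in the composition.
struct TaskEntered {
    bits: Vec<u64>,
}

impl TaskEntered {
    /// Allocates enough 64-bit words to hold `num_instances` bits.
    fn new(num_instances: usize) -> Self {
        Self {
            bits: vec![0; (num_instances + 63) / 64],
        }
    }

    /// Marks `instance` as entered for this task. Returns `Err` (standing in
    /// for a trap) if this task has already entered that instance.
    /// Assumes `instance < num_instances`.
    fn enter(&mut self, instance: usize) -> Result<(), &'static str> {
        let (word, bit) = (instance / 64, 1u64 << (instance % 64));
        if self.bits[word] & bit != 0 {
            return Err("instance is already on this task's stack");
        }
        self.bits[word] |= bit;
        Ok(())
    }

    /// Clears the bit when the call into `instance` returns.
    fn leave(&mut self, instance: usize) {
        let (word, bit) = (instance / 64, 1u64 << (instance % 64));
        self.bits[word] &= !bit;
    }
}
```

The difference from a single instance-wide flag is that two different tasks can each enter the same instance, while a single task entering it twice is still rejected.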

view this post on Zulip Alex Crichton (Dec 15 2025 at 23:29):

My read on this, Joel, is that we should plan out and concretely know what optimizations we want to implement short-term; while we can defer a literal PR, we shouldn't defer planning such a PR

view this post on Zulip Alex Crichton (Dec 15 2025 at 23:29):

so e.g. we can split PRs up but they need to be ~weeks at most apart

view this post on Zulip Joel Dice (Dec 15 2025 at 23:35):

Even just the unoptimized version has proven a bit more complicated than I had initially expected, given that we need to check for recursive reentrance during initialization, when dropping resources, and during calls to [Typed]Func::call -- all places we previously relied on a per-instance flag and did not maintain a task stack. The good news is that once we have the unoptimized version passing all tests, optimization should be straightforward, especially with a fresh mental model of everything.

view this post on Zulip Alex Crichton (Dec 22 2025 at 19:07):

Wrote up the current state of affairs here. tl;dr: the current plan is to change the adapter, but not in a way that changes the current performance profile

view this post on Zulip Joel Dice (Dec 22 2025 at 19:10):

Thanks, Alex. I'm implementing all that now.


Last updated: Jan 09 2026 at 13:15 UTC