wasmtime / issue #12311 Optimize guest-to-guest sync-to-s... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #12311 Optimize guest-to-guest sync-to-s...

Wasmtime GitHub notifications bot (Jan 09 2026 at 21:52):

alexcrichton opened issue #12311:

This is a meta/tracking issue about remaining work necessary to optimize the guest-to-guest sync-to-sync adapter generated by Wasmtime when component-model-async is enabled. Some more historical discussion of this happened at #wasmtime > Wasmtime sync<->sync adapter optimizability @ 💬 as well, and I'll try to keep this up-to-date.

What is the problem

Wasmtime will compile an "adapter" with the FACT compiler when one guest component calls another. With the advent of component-model-async this adapter has a large number of permutations, for example the caller could be sync/async lowered, the callee could be sync/async lifted, and the function type itself could be sync or async. This specific issue is about the single case of a sync lowered caller, sync lifted callee, and sync function type. This doesn't mean the other permutations should be ignored, but that's the most interesting case for now.

Additionally with the advent of component-model-async it's required, spec-wise, to manage async-task-related-infrastructure when crossing component boundaries. Task infrastructure comes into play in a number of scenarios, such as:

When a task calls an imported function, that creates a new task. This new task has the current task as a parent task.

Intrinsics such as backpressure.{inc,dec} modify the backpressure counter in the current task.

When a task exits/returns all of its pending subtasks are "reparented" to the task's own parent.
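
To make the three scenarios above concrete, here is a minimal Rust sketch of the bookkeeping they imply. It is only an illustration: `TaskId`, `Task`, `Tasks`, and the methods on them are hypothetical names invented for this example and are not Wasmtime's actual data structures.

```rust
use std::collections::HashMap;

/// Hypothetical task identifier (not a Wasmtime type).
type TaskId = u32;

/// Hypothetical per-task state covering only the scenarios listed above.
#[derive(Debug, Default)]
struct Task {
    parent: Option<TaskId>,
    subtasks: Vec<TaskId>,
    backpressure: u32,
}

/// Hypothetical table of all live tasks.
#[derive(Default)]
struct Tasks {
    next_id: TaskId,
    tasks: HashMap<TaskId, Task>,
}

impl Tasks {
    /// Calling an imported function creates a new task whose parent is the caller.
    fn spawn(&mut self, parent: Option<TaskId>) -> TaskId {
        let id = self.next_id;
        self.next_id += 1;
        self.tasks.insert(id, Task { parent, ..Task::default() });
        if let Some(p) = parent {
            self.tasks.get_mut(&p).expect("parent exists").subtasks.push(id);
        }
        id
    }

    /// Models `backpressure.inc` acting on the given (current) task.
    fn backpressure_inc(&mut self, id: TaskId) {
        self.tasks.get_mut(&id).expect("task exists").backpressure += 1;
    }

    /// When a task exits, its pending subtasks are reparented to its own parent.
    fn exit(&mut self, id: TaskId) {
        let task = self.tasks.remove(&id).expect("task exists");
        if let Some(p) = task.parent {
            if let Some(parent) = self.tasks.get_mut(&p) {
                parent.subtasks.retain(|&t| t != id);
                parent.subtasks.extend(task.subtasks.iter().copied());
            }
        }
        for sub in task.subtasks {
            if let Some(t) = self.tasks.get_mut(&sub) {
                t.parent = task.parent;
            }
        }
    }
}
```

Every operation here touches a host-side table, which is why performing this bookkeeping from an adapter implies a call into the host.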

Effectively, there are substantial pieces of infrastructure that may be used across component boundaries, and thus Wasmtime needs to handle them. This leads us to the problem: with component-model-async disabled this task management is all ignored as it's not applicable, but with component-model-async enabled this task management must be performed. This means that the sync<->sync adapter will call a host function to manage task infrastructure pieces.

The cost of this hostcall is relative to the situation of the adaptation being performed, but the goal of the sync<->sync adapter is to, ideally, compile to a grand total of 0 instructions. Given that it's impossible to optimize away a call into the host, this issue is thus about the problem of solving the task infrastructure management problem without actually making a host call. This should restore the prior-to-component-model-async behavior of a sync<->sync adapter compiling to pure optimizable CLIF which mostly boils away.

History and Current Status

As of the time of this writing Wasmtime doesn't actually do any manipulation of task infrastructure on sync<->sync adapters. This is a bug and results in issues such as https://github.com/bytecodealliance/wasmtime/issues/12128 (plus many undocumented others we have since realized). @dicej will soon have a PR to fix this situation where task infrastructure will be maintained across these boundaries.

The plan is to have a PR which will enhance the sync<->sync adapter with task infrastructure management, conditionally. The condition will be based on whether the component-model-async wasm feature is enabled in the Config. This is intended to be a stopgap because embedders should not need to disable features for performance. For the time being though it'll retain the pre-p3 performance profile of sync adapters while retaining p3-relevant spec compliance.

Future plans for optimization

Enabling Cranelift to compile these adapters to zero instructions is going to require special care and a number of refactorings of Wasmtime's task infrastructure in addition to new Cranelift optimizations. The general rough idea for the implementation is:

A new VMAsyncTask type will be added. Fields this will contain are:

A "kind", more relevant in a moment

Fields for context.set {0,1}

A parent pointer for the parent task. Option<NonNull<VMAsyncTask>>

Backpressure fields (if necessary still, we've talked about removing backpressure)

A flag of whether this task can block or not.

The Rust-based "full" async task will contain this field as well as any other tables and such necessary. This will be similar to VMContext vs vm::Instance, for example.

The current task will be stored in VMContext or VMComponentContext (maybe both? unsure?)

Sync<->sync adapters will allocate, on the stack, a VMAsyncTask with just these fields. This will be initialized with the current task and then the current task will be set to this.

Manipulations of the current task will go directly through VMAsyncTask if applicable, e.g. context.{g,s}et {0,1}

Manipulations of the current task that require Rust data structures, for example adding a subtask, will "promote" the task from the stack to the Rust heap. This will go back through the entire chain of tasks and promote them all to the heap most likely too.

Returning from a sync<->sync adapter will restore the current task to its previous value.

Effectively, at a high level, sync<->sync adapters will allocate a task on the stack that, if necessary, will get promoted to the Rust heap to perform more expensive manipulations on. In essence Rust-level tasks are lazily created only as necessary for "more complicated" things, like spawning subtasks, while low-level actions like context.get will remain efficient.
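
The design above can be sketched in Rust roughly as follows. This is a hedged illustration under several assumptions, not the planned implementation: the field layout, `TaskKind`, the `Ctx` stand-in for VMContext, the `Restore` guard, and the `adapter`, `context_get`, and `context_set` helpers are all hypothetical names. The zero-initialized context slots and the `may_block` value are placeholders, and promotion to the Rust heap is not modeled.

```rust
use std::cell::Cell;
use std::ptr::NonNull;

/// Hypothetical "kind" discriminant: where the task node currently lives.
#[allow(dead_code)]
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum TaskKind {
    Stack,
    Heap,
}

/// Hypothetical layout for the proposed `VMAsyncTask`.
#[allow(dead_code)]
#[repr(C)]
struct VMAsyncTask {
    kind: TaskKind,
    /// Storage for `context.{get,set} {0,1}`.
    context: [Cell<u32>; 2],
    parent: Option<NonNull<VMAsyncTask>>,
    /// Whether this task can block (placeholder value below).
    may_block: bool,
}

/// Stand-in for the `current_task` slot that would live in VMContext.
struct Ctx {
    current_task: Cell<Option<NonNull<VMAsyncTask>>>,
}

/// Restores the previous current task on return, including on unwind.
struct Restore<'a> {
    ctx: &'a Ctx,
    prev: Option<NonNull<VMAsyncTask>>,
}

impl Drop for Restore<'_> {
    fn drop(&mut self) {
        self.ctx.current_task.set(self.prev);
    }
}

/// Models a sync<->sync adapter: push a stack node, call, then pop it.
fn adapter<R>(ctx: &Ctx, callee: impl FnOnce(&Ctx) -> R) -> R {
    let prev = ctx.current_task.get();
    let node = VMAsyncTask {
        kind: TaskKind::Stack,
        context: [Cell::new(0), Cell::new(0)],
        parent: prev,
        may_block: false,
    };
    // Declared after `node`, so it is dropped (and the pointer restored) first.
    let _restore = Restore { ctx, prev };
    ctx.current_task.set(Some(NonNull::from(&node)));
    callee(ctx)
}

/// Models `context.get`: reads directly from the current node, no host tables.
fn context_get(ctx: &Ctx, slot: usize) -> u32 {
    let task = ctx.current_task.get().expect("no current task");
    // SAFETY: `adapter` keeps the node alive while it is the current task.
    unsafe { task.as_ref().context[slot].get() }
}

/// Models `context.set`: writes directly to the current node.
fn context_set(ctx: &Ctx, slot: usize, val: u32) {
    let task = ctx.current_task.get().expect("no current task");
    // SAFETY: as in `context_get`.
    unsafe { task.as_ref().context[slot].set(val) }
}
```

Nested calls to `adapter` form a linked list through `parent`, which is the chain that would have to be walked if a task ever needed promoting to the heap.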

The resulting CLIF for a sync<->sync adapter will, in pseudo-code, look like:
void adapter(vmctx *vmctx) {
    vmtask *prev_head = vmctx->current_task;
    vmtask stack_node; // lives in this adapter's stack frame
    stack_node.kind = VMTASK_STACK;
    // ...
    stack_node.prev = prev_head;
    vmctx->current_task = &stack_node;

    the_callee_component(vmctx);

    vmctx->current_task = prev_head;
}
If the_callee_component(vmctx) is small enough the theory here is:

Cranelift will see that vmctx->current_task is loaded, stored to, then stored to with the previous value. If the_callee_component(vmctx) has no obviously aliasing regions, then it can eliminate both stores as dead.

If the_callee_component doesn't actually do anything like call the host then Cranelift will see that all the stores to stack_node are unused, so they're all eliminated.

If all the previous loads/stores were eliminated, then the load from vmctx->current_task is also dead, so that's also eliminated.

I don't believe that Cranelift currently performs all of these optimizations, but my understanding so far is that it is well within Cranelift's complexity budget and wheelhouse to implement optimizations like these.

Expected Timeline

The current plan is to ship the hostcall-to-manipulate-task-infrastructure with WASIp3 initially. Embeddings that need the highest performance on sync<->sync adapters will disable the component-model-async runtime feature (and maybe compile time feature). After WASIp3 ships and we have enough time to come back to this and design this all "for real" we'll implement this. At that point it won't matter whether engines turn the component-model-async feature on or off, the performance will be the same.

Another point to note here is that it's expected that in WASIp3 Wasmtime will need to pretty heavily optimize calls to context.{get,set}. This work, while not the same as optimizing get/set, is highly related and will likely be a prerequisite for this work. That's to say that this work isn't solely motivated by sync<->sync adapters, but instead it's motivated by other routes too.

Wasmtime GitHub notifications bot (Jan 09 2026 at 21:52):

alexcrichton added the wasm-proposal:component-model-async label to Issue #12311.

Wasmtime GitHub notifications bot (Jan 09 2026 at 22:14):

cfallin commented on issue #12311:

> If the_callee_component doesn't actually do anything like call the host then Cranelift will see that all the stores to stack_node are unused, so they're all eliminated.

Unless I'm misunderstanding the problem statement, I think this is outside the scope of ordinary dead-store elimination or the sort of thing Cranelift would tackle: it implies interprocedural program analysis, which is fundamentally hard.

Said another way: you're pushing a local alloc onto a linked list, then calling some arbitrary code; absent some global analysis, we can't know that that code won't eventually reach some behavior that will require observing that list, right? And that global analysis would need to reason about the callgraph, which depends on a value-range analysis and points-to analysis, both of which are extremely expensive, imprecise (overly conservative / brittle, easy to collapse with the wrong operator), or both.

Separately, we'd also need an escape analysis to not do that local alloc at all, right? That's a whole separate can of worms. Possible, but complex.

Overall: I'd be somewhat concerned waiting for a "sufficiently smart compiler" to get good component-to-component call performance; while it is definitely within scope to build new optimizations, trying to derive-from-first-principles why the code we emitted is unnecessary is always less preferable than modifying the runtime (or spec?) so we don't need that code.

Wasmtime GitHub notifications bot (Jan 09 2026 at 22:23):

alexcrichton commented on issue #12311:

No no, I understand that interprocedural analysis is off the table. I can try to expand more on this in a Cranelift meeting if desired to double-check the optimizations are in-scope.

What I want Cranelift to be able to optimize is something like:
v0 = stack_addr ;; some stack-based node
v1 = load vmctx+0x100 ; load prev
store vmctx+0x100, v0
;; some inlined version of `the_callee_component` that clearly doesn't store to vmctx based on alias analysis
store vmctx+0x100, v1
Here the first store is dead since it's never read, so it's eliminated. It's also into the vmctx so it's trusted/notrap/etc. The second store is then the same as what was loaded, so there's no need to load-then-store, so it's eliminated. Then the load is eliminated because it's dead.
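
As a rough illustration of that chain of reasoning, the following Rust sketch runs the three rewrites over a toy straight-line IR. It is not Cranelift's implementation or API: `Inst`, `uses`, `removable`, and `optimize` are hypothetical names, memory is modeled as distinct non-aliasing slots, and `Use` stands in for inlined callee code that never touches those slots.

```rust
/// Toy straight-line IR; values and slots are plain integers.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Inst {
    /// `dst = stack_addr`
    StackAddr { dst: u32 },
    /// `dst = load slot`
    Load { dst: u32, slot: u32 },
    /// `store slot, src`
    Store { slot: u32, src: u32 },
    /// Any instruction that uses a value but touches no slot.
    Use { src: u32 },
}

/// Does `inst` read the value `val`?
fn uses(inst: &Inst, val: u32) -> bool {
    matches!(inst, Inst::Store { src, .. } | Inst::Use { src } if *src == val)
}

/// Can the instruction at index `i` be deleted without changing behavior?
fn removable(block: &[Inst], i: usize) -> bool {
    match block[i] {
        Inst::Store { slot, src } => {
            // Dead store: the next access to the slot is another store.
            let next = block[i + 1..].iter().find(|inst| {
                matches!(inst, Inst::Load { slot: s, .. } | Inst::Store { slot: s, .. } if *s == slot)
            });
            let overwritten = matches!(next, Some(Inst::Store { .. }));
            // Redundant store: writes back the value loaded from the slot,
            // with no other store to the slot since that load.
            let prev = block[..i].iter().rev().find(|inst| {
                matches!(inst, Inst::Store { slot: s, .. } if *s == slot)
                    || matches!(inst, Inst::Load { dst, slot: s } if *s == slot && *dst == src)
            });
            let unchanged = matches!(prev, Some(Inst::Load { .. }));
            overwritten || unchanged
        }
        // Dead definition: the result is never used.
        Inst::Load { dst, .. } | Inst::StackAddr { dst } => {
            !block.iter().any(|inst| uses(inst, dst))
        }
        Inst::Use { .. } => false,
    }
}

/// Delete removable instructions one at a time until none remain.
fn optimize(mut block: Vec<Inst>) -> Vec<Inst> {
    loop {
        let next = (0..block.len()).find(|&i| removable(&block, i));
        match next {
            Some(i) => {
                block.remove(i);
            }
            None => return block,
        }
    }
}
```

Feeding this the five-instruction shape above (with a single `Use` as the inlined callee) leaves only the `Use`; if the callee instead loads from the slot, nothing is removed, which matches the requirement that the body be fully inlined and visibly non-aliasing.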

> I'd be somewhat concerned waiting for a "sufficiently smart compiler" to get good component-to-component call performance

Oh don't worry, I've worked long enough with Rust and optimizations that a sufficiently-smart-compiler is "this either works with simple-ish heuristics or not at all".

Basically I didn't explicitly say that the_callee_component(vmctx) was inlined, but for all optimizations above I meant "this optimization is only applicable when the entire body is fully inlined". The purpose is to ensure that, for component functions using unsafe intrinsics, which are expected to be fully inlined, the surrounding infrastructure boils away.

Wasmtime GitHub notifications bot (Jan 09 2026 at 22:26):

cfallin commented on issue #12311:

Ah, I see -- yeah, if we're also assuming cross-component inlining then this is again intraprocedural, and at least tractable. Thanks for the clarification!

Last updated: Feb 24 2026 at 05:28 UTC