dicej opened PR #46 from dicej:lower-component to bytecodealliance:main:
Great to see this. Happy to share notes on canonical ABI edge cases from Meld if useful.
cfallin submitted PR review:
This is really important work and I'm happy to see it being developed -- thanks!
cfallin created PR review comment:
This is a really good point and I think it's very important to solve "right", and not just reject components that instantiate a module more than once: that's a fundamental capability of the component model that core Wasm (without metadata/wrapper) doesn't have, and we don't want to bifurcate the ecosystem into components that fit this restriction and those that don't.
Function duplication (your second option) seems conceptually appealing because it hides the complexity, but in practice I suspect a large majority of functions will be duplicated, because almost everything will access memory...
Maybe the best option here is to actually define a "just the module linking, please" subset of the component model semantics that gives (i) a flat index space of core modules, (ii) a wiring diagram instantiating them and connecting imports and exports? The host already has to do some work to provide some intrinsics so this proposal is not "free" in any case; so ingesting such a format should not be too much of an additional sell (though there is certainly a step-function increase from "one core module" to "graph of core modules"). It's also conceptually the cleanest IMHO: this really is a thing that the component semantics can describe that a core Wasm module can't, but most core Wasm runtimes should have host APIs to instantiate a thing more than once, so we should just "pass it through".
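For concreteness, such a "wiring diagram" format might be modeled like this (a hypothetical sketch in Python; the type and field names are illustrative, not a proposed spec):

```python
from dataclasses import dataclass, field

@dataclass
class Instantiate:
    """One wiring command: instantiate a module from the flat index space."""
    module: int                          # index into Plan.modules
    # (import_module, import_name) -> (earlier instance index, export name)
    args: dict = field(default_factory=dict)

@dataclass
class Plan:
    modules: list      # flat list of core wasm module binaries
    commands: list     # Instantiate commands, executed in order

def instantiation_order_ok(plan: Plan) -> bool:
    """Imports may only be satisfied by exports of already-created instances,
    which is what makes the wiring diagram acyclic and host-executable."""
    for i, cmd in enumerate(plan.commands):
        for (src_instance, _export) in cmd.args.values():
            if src_instance >= i:
                return False
    return True
```

A host that can execute this shape of plan gets multiple instantiation of the same module "for free" by issuing two `Instantiate` commands with the same `module` index.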
Just to note it down, though I don't like it: I guess there could be a fourth option here, which is (at a high level) something like "reify the `vmctx` as actual Wasm state". That seems to be the most "honest" w.r.t. the lowering paradigm.
The idea is that one would reify data structures that look like Wasmtime's instance state as Wasm GC values. A Wasm memory could be an arrayref to an array-of-i8; a Wasm table could be an arrayref to an array-of-whatever. Given those, one could define a `vmctx` Wasm struct that contains memory refs and table refs as our native `vmctx` does today, as well as any globals, inlined; then the lowered functions take this `vmctx` struct ref as an implicit first arg.
This clearly would have nontrivial runtime overhead as well, since in essence we'd have two levels of indirection for any state access.
jellevandenhooff commented on PR #46:
Bit of a drive-by thought: My guess is that if the lowering tooling is performant enough, any wasm host would want to adopt it, and then the component-model spec splits in two: these new slim host bindings and the component-model guest bindings as today. Do you think you would end up committing to API stability on the host bindings part? Standardize them? I suspect wasm runtimes would want that.
Do you think you would end up committing to API stability on the host bindings part? Standardize them?
Yeah, I think for this to work the API would need to at least be "officially" documented in the same way the `dylink.0` convention is documented. Ideally, though, the "slim host bindings" API/ABI would just be a subset of the Component Model ABI (e.g. some or all of the `thread.*` and `context.*` canonical built-ins) and therefore not need to be documented or standardized separately. I _think_ that should work, in which case the TODO item in the "Host C API for Lowered Components" section will just be to describe the relevant CM ABI built-ins as C function declarations.
dicej submitted PR review.
dicej created PR review comment:
Maybe the best option here is to actually define a "just the module linking, please" subset of the component model semantics that gives (i) a flat index space of core modules, (ii) a wiring diagram instantiating them and connecting imports and exports?
Yeah, I expect this is what it would have to look like. One thought that crossed my mind would be to literally output a real component, but one that only uses the absolute minimum set of features needed to embed, instantiate, and link modules. Hosts would need to be able to parse and instantiate these "simple components" but not need to support the entire component model.
alexcrichton commented on PR #46:
Personally I'm all for reducing complexity as much as we can, and the motivation section of this RFC resonates with me accordingly and I agree it's a worthwhile problem to tackle. At the same time though I'm personally skeptical of this approach in terms of practicality. For example as-written the RFC is currently relatively hand-wavy in terms of what exact responsibilities lie where. I understand though this is a relatively early-stages proposal so it's naturally not going to have anything fully fleshed out on day 1, but nonetheless I want to point out that at least for me it's difficult to form a concrete opinion without having more concrete details.
As a general thrust of "make the component model simpler to implement and make components easier to run", that seems reasonable to have an RFC on-the-record from the BA blessing that approach. For me personally I don't find that too useful because if aspirations are high-level enough it runs the risk of getting agreement amongst lots of folks but being quite difficult to actually make progress.
So, a question for this: is that the purpose of this RFC? To get high-level agreement on the approach? We've done this with some debugging-related RFCs for example as an approach to have implementation details sketched but not fully fleshed out while still maintaining high-level agreement. If that's the goal then I'm happy to approve as-is. If the goal though is to get more in-depth discussion of the technical specifics, viability, etc, that's a pretty different conversation.
I also was thinking the same as @jellevandenhooff when reading over this -- whatever intermediate APIs are needed between the runtime and a lowered component effectively need to end up being standards for this to work (IMO). That raises the bar quite a lot in terms of expected quality and care to design which would be an important point to note.
So, a question for this: is that the purpose of this RFC? To get high-level agreement on the approach?
I opened this as a draft with some TODO items because I indeed wanted to gauge high-level agreement on the approach to begin with, but I also want to get into the details before calling it "done".
BTW, I went into a lot of detail in #38, but a lot of those details changed once we had real-world experience with the implementation. Personally, I think that's fine; the goal here is to be specific and make sure the details are not obviously wrong, but still be able to change things later during implementation as needed.
Anyway, yes, my goal is both high-level consensus and to get into the details as well. I'm aiming to add those by the end of the week, at which point I'll switch this out of draft mode.
alexcrichton commented on PR #46:
Ok makes sense, and yeah I would agree that trying to flesh out all the details up front is probably not worthwhile because of how much will change during the implementation as we get more experience. As the goal here is to be more-detail-oriented-than-high-level-goals, however, some thoughts I'd have on this are:
- A perhaps chief concern of mine is going to be performance/overhead. With native integration/implementation there's a lot of mechanisms to bypass overhead, and for example this change would likely require that all components have at least 2 linear memories (one for the guest, one for the runtime), which balloons 8G of virtual memory to 16G of virtual memory per-component. This is just one example, but I'd be initially wary that we would want to switch everything over in Wasmtime to this paradigm before being more confident in the performance profile, for example.
- Another thing I'd want to be pretty up-front about is that while lowering a component to a core wasm module certainly helps a lot there is still quite a lot of work for a host to do. Here it's under the guise of a bindgen but even just writing a bindgen requires significant effort/maintenance and is not something we can hand-wave away. This is fundamental to a host interacting with a component because somehow core wasm things need to get translated to host things, and this can get significantly complicated in the face of resources, futures, streams, lazy lowering, etc. Basically I don't want to give anyone the impression that this will basically delete 99% of component-model code in Wasmtime or other runtimes, my gut is that it'd be more like 50% in the end.
- Particularly w.r.t. async I don't actually know how a built-in wasm-based runtime could shave off a large chunk of the complexity burden from embedders. Of primary concern here to me is the lack of core wasm stack switching. With stack switching in theory a lot more can be moved to the guest, but without stack switching we're left with JSPI-like approaches which puts quite a lot more on the host. Even still, somehow the host's notion of async needs to be bridged into the wasm concept of async and that will inevitably require a lot of careful design and probably a lot of work on the host.
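To make the first bullet's arithmetic explicit (a trivial sketch; the 8G-per-memory virtual reservation figure comes from the comment above, not from a measurement):

```python
GIB = 1 << 30
# Assumption: each 32-bit linear memory costs ~8 GiB of virtual address space
# (4 GiB addressable plus guard region), per the figure cited above.
RESERVATION_PER_MEMORY = 8 * GIB

def per_component_reservation(n_memories: int) -> int:
    """Virtual address space reserved for one component's linear memories."""
    return n_memories * RESERVATION_PER_MEMORY

# guest memory only: 8 GiB; guest + injected runtime memory: 16 GiB
```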
Overall I think I'm actually relatively skeptical of this approach panning out in the long run. Despite that I do want this endeavor to succeed, however, but my point is that it's going to require significant investment and design to even just evaluate the approach. Personally at least I don't feel like there's a clear way to implement all of this which requires only figuring out some minor details; rather, the unknowns are much larger. In that sense I think it's worthwhile to experiment more here, but to truly feel comfortable about accepting this I'd personally want to see more proof-of-concept style work to flesh out more details about how these fundamentals are going to work and play out in the end.
In that sense I think it's worthwhile to experiment more here, but to truly feel comfortable about accepting this I'd personally want to see more proof-of-concept style work to flesh out more details about how these fundamentals are going to work and play out in the end
Yes, agreed that a PoC is needed before we'll really know whether this is (A) feasible, and (B) worth doing. If that means leaving this PR unmerged until the PoC is done, it's fine with me. Meanwhile, it's already generated some good discussion and serves as something we can point interested folks to.
Anyway, yes, my goal is both high-level consensus and to get into the details as well. I'm aiming to add those by the end of the week, at which point I'll switch this out of draft mode.
I didn't get around to this, but will try to do it early next week.
dicej updated PR #46.
dicej updated PR #46.
I just pushed an update which adds a bunch of detail regarding the proposed APIs.
dicej has marked PR #46 as ready for review.
dicej updated PR #46.
dicej updated PR #46.
dicej updated PR #46.
dicej updated PR #46.
fitzgen submitted PR review.
fitzgen created PR review comment:
I’ll echo Chris’s point here, even though I haven’t seen any disagreement with it: multiple instantiation is a core capability of the CM and we must cover all CM semantics.
I also think the output shouldn’t be a component with a minimal feature subset, because the idea is that we are implementing the CM desugaring for engines that don’t support it, so we shouldn’t assume that they can parse even a subset of it. The output should be a flat list of core modules (including those generated for fused adapters) and a flat list of instantiation and import-export wiring commands. Basically the simplest thing that covers the component model semantics, with no syntax sugar.
@alexcrichton
this change would require that all compoents likely to have at least 2 linear memories (one for the guest, one for the runtime)
Can you clarify why there would need to be a second memory for the runtime? I don’t follow how that would be required.
I agree that it would be a large problem however. I would personally be extremely surprised/concerned if we didn’t use the same number of memories in the desugared core output as were defined and instantiated in the input component.
Can you clarify why there would need to be a second memory for the runtime? I don’t follow how that would be required.
I believe he's referring to this part of the proposal (from the `lower-component` section):
In addition to the generated "fused adapter" code, the output module will
include component model runtime code, separately compiled from Rust source,
which handles, among other things:
- table management for resource and waitable values
- guest-to-guest stream and future I/O
- task and thread bookkeeping
That code will definitely need to allocate, which means it either needs to have its own memory or be able to allocate from another module's memory (e.g. via `cabi_realloc`, but note that we may be getting rid of that once lazy lowering arrives).
That code will definitely need to allocate, which means it either needs to have its own memory or be able to allocate from another module's memory (e.g. via `cabi_realloc`, but note that we may be getting rid of that once lazy lowering arrives).
Also, allocating from the memory of one of the (potentially malicious and/or buggy) modules taken from the input component invites the risks of tampering and information leaks.
The other option to avoid the extra memory is to compile the component runtime into native code and run it in the host instead. The tradeoff there is that it becomes part of the TCB along with all the other host code, but that's probably fine if the component runtime is written in Rust with zero unsafe code. The code would remain runtime-agnostic and thus reusable either way.
dicej updated PR #46.
I talked a little with Alex about this at Wasm I/O and I see now that there are two slightly different use cases and some of us (me) have perhaps been assuming one or the other:
1. Drawing the dividing line between existing core Wasm semantics and new component model semantics as an interface, effectively creating an interface for the host runtime to implement that has one function for each kind of CM intrinsic.
2. Making the interface as small as possible, so that runtimes would have to implement as little as possible to take this thing and get CM for as close to free as possible, even if that means "virtualizing" some intrinsics (e.g. implementing resource table management as a core Wasm module).[^virt]
[^virt]: Taken to the extreme limit, this is basically "just compile Wasmtime to Wasm and run components inside Wasmtime inside of the non-component-model runtime".
I think (2) is not something that Wasmtime could realistically share in its component model implementation because of things like the extra-memory issue.
I think (1) is something that Wasmtime could use in its component model implementation, although to achieve runtime performance on par with today's implementation, this would probably require self-hosting the interface definitions, using unsafe intrinsics to access vmctx data, and inlining. Perhaps if the tool also had a callback where the host compiler was able to either emit a Wasm function call for an intrinsic or some inline Wasm code we could avoid requiring inlining (and its hit to compilation performance) in order to match today's runtime performance.
But I also think that (2) is still a valid use case and additionally could be layered on top of (1).
Does that all make sense?
I'll add a little bit of thinking from that same conversation with Alex and Nick: I think that rather than building a monolithic, somewhat opaque runtime that requires an assortment of random host functionalities, trying to "factor out" core intrinsics or primitives and building the component model on top of them has a lot of pedagogical/explanatory value, which is important if we want the component model to be widely implemented and understood. For example, reifying unforgeable references (resource handles, capabilities) as a primitive provided by host intrinsics has value; so does defining exact mappings from async task primitives to something like stack switching or host intrinsics with equivalent semantics (pure fiber-switching).
Said another way, if the component model were decomposed into a "canonical 1-to-1 mapping to these N host primitives", that is a much more satisfying and convincing argument for a sound fundamental design and for reusability/generality. On the other hand, building a single canonical `libComponentModelRuntime.wasm` that runs in core Wasm is (as Nick said) more like compiling Wasmtime into Wasm, and feels like something close to admitting defeat, in the sense that we are saying things are complex enough that we just need to distribute a reference implementation. It's far better if the mapping is "thin" and the primitives are well-defined and reasonable to implement independently in many engines.
cfallin edited a comment on PR #46, appending:
EDIT: I realize the above is a little bit abstract, but the main point I'm trying to make is that there is a social-signalling aspect to the direction that we choose, and I'd prefer that we try to signal "it's built of reasonable primitives and here is the decomposition" rather than "just ship our opaque blob".
@fitzgen Yes, that makes sense. We could break the first use case you listed down even further:
1a. I want to run components on any runtime supporting core Wasm + fibers. If that runtime doesn't support components, I want to be able to pair it with a library (and maybe a host binding generator) which can parse, link (generating fused adapters on the fly if appropriate), and instantiate a component, deferring to the runtime for core Wasm and fiber operations.
1b. I want to flatten my component into a core module which imports a bunch of component model intrinsic functions and run it as in use case (1) above. In this case, the fused adapters would be generated as part of flattening, but the host library + Wasm runtime would take care of all runtime state.
In (1a), there's no need for `lower-component` and no need to address the question of multiply-instantiated modules. In (1b), `lower-component` still has a role, but (like (1a) and unlike (2)) needs no extra guest memory. I expect that (1b) will also require standardizing additional intrinsics for use in fused adapters (equivalent to the ones Wasmtime's FACT uses now for managing task and thread state during guest-to-guest calls) which aren't part of the component model.
On the face of it (1a) seems simpler, both for users ("I just want to run a component; don't bother me with more tools and more steps") and for us (no need to standardize the intrinsics fused adapters will need beyond those already defined in the component model). I'm wondering who might choose (1b) over (1a), and why.
@cfallin That sounds great; not sure exactly what it would look like, though. @lukewagner might have thoughts.
ttraenkler submitted PR review.
ttraenkler created PR review comment:
https://github.com/WebAssembly/component-model/issues/626 discusses this exact scenario of merging modules into a core wasm module, using multi memory to maintain memory isolation with imports and exports as the only shared surface. I think this could in many cases mean zero runtime cost for crossing the component boundary while maintaining isolation: function calls could be inlined trivially by wasm-opt, and in many cases even memory copies could be avoided altogether: at zero runtime cost for scalars and static memory indices and lengths, and at the cost of a bounds check for dynamic memory indices or lengths. This would be a form of "lazy lowering" if I understood @lukewagner correctly. Some details are left unspecified and I assume I am not alone with this idea, but I thought it would nevertheless be good to point out in case this has not been considered.
Following up with concrete findings from P3 async fusion work in meld:
P3 async components now fuse to valid core modules with multi-memory isolation. The async task primitives (`task.return`, `waitable-set.*`, `context.*`) flow through as host-provided imports, which aligns with the host intrinsic approach proposed here.
One design consideration for the host intrinsic API: after fusion, multiple original component instances share a single core module. Each has its own `task.return` with a different result type (tied to the original export's signature). The host needs to dispatch these correctly — the fused module uses distinct import slots per original `task.return` (e.g., `[task-return]0`, `[task-return]1`), so the host can use the import identity to determine the task context.
For component-model-native runtimes, wrapping the fused output back as a component hits `call_might_be_recursive` when internal async calls cross the now-collapsed instance boundary. The planned `recursive` effect would address this.
I'm currently working on both paths: the core module + host intrinsic path (via synth, an AOT compiler with its own runtime), and a nested component wrapper that preserves instance topology for wasmtime compatibility.
Correction to my earlier comment: the `call_might_be_recursive` issue I described was an artifact of the wrong architecture, not a fundamental limitation. As @lukewagner pointed out, a fused component shouldn't need internal `canon lift`/`canon lower` at all.
The correct approach for async cross-component calls after fusion: the adapter drives the callee's callback loop directly in core wasm — calling `[async-lift]` to start, `waitable-set-poll` (host import) to wait for events, and `[callback]` to drive progress until EXIT. `task.return` can be resolved as an in-module shim for result delivery. This keeps everything in core wasm with no component boundary.
The earlier points about host intrinsic design still hold: after fusion, each original async export's `task.return` has a distinct signature and import slot, so the host (or in-module shim) can dispatch by import identity.
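To illustrate the control flow described above, here is a toy Python model of the adapter driving a callee's callback loop (hypothetical: `async_lift`, `callback`, and the event list stand in for the `[async-lift]`, `[callback]`, and `waitable-set-poll` imports, and the callback codes are simplified, not the real ABI encoding):

```python
from collections import deque

EXIT, WAIT = "exit", "wait"   # simplified callback codes, not the real ABI values

class Callee:
    """Toy async callee that needs two events before it can produce a result."""
    def __init__(self):
        self.progress = 0
        self.result = None
    def async_lift(self, arg):            # stands in for [async-lift]
        self.arg = arg
        return WAIT                        # can't finish synchronously
    def callback(self, event):             # stands in for [callback]
        self.progress += 1
        if self.progress < 2:
            return WAIT
        self.result = self.arg * 2         # the task.return shim would deliver this
        return EXIT

def drive(callee, arg, events):
    """The fused adapter's loop: start the call, then poll for events (the
    waitable-set-poll host import) and feed them to the callback until EXIT."""
    pending = deque(events)
    code = callee.async_lift(arg)
    while code != EXIT:
        code = callee.callback(pending.popleft())
    return callee.result
```

For example, `drive(Callee(), 21, ["io-ready", "io-ready"])` completes after two events and returns `42`.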
alexcrichton commented on PR #46:
Reflecting on this more, reading over the current state of things, and digesting conversations I had at Wasm.io, my current thinking is that it would be best to pare back this RFC to just the `lower-component` tool, with the constraint that `lower-component` will not add any more linear memories than are already present within a component. I think it's also worth explicitly saying that multiply-instantiated components will be supported, and picking a strategy. To the extent that this tool wants to be used in Wasmtime we won't want the "duplicate the module items" approach, so that would necessitate the approach of "generate N modules + metadata".
I realize, however, that this is a bit of a spicy take on this RFC, so I want to expand more on the rationale as well.
Paring down to just `lower-component`
Personally I feel that this RFC is a bit too ambitious about what it's trying to specify at this time. I don't disagree with any of the end goals or means by which we get there, but I feel that there's just too much up in the air to make any real meaningful progress on evaluating/reviewing/etc. To me this feels similar to the arc of the series of debugging RFCs we have for Wasmtime where we started out (in my opinion) a bit too ambitious and further RFCs refined things saying "ok here's what we can more practically achieve in the near-term". While the sort of vision-setting of the entire arc can be valuable I'm not sure that bytecodealliance RFCs are necessarily the best venue by which to do that.
To me the abstraction level of core wasm is really the central part here. Everything else is ultimately a derivative of this abstraction boundary, which is a benefit, but also means that the work is separable and/or can have separate RFCs. For example `host-wit-bindgen`, while I agree it will be necessary, can be specified/implemented entirely in terms of "here's the shape of core module that pops out". The C APIs mentioned here I feel are a bit more out-there in terms of design. While I think we can reasonably work with the core wasm abstraction level, once you go all the way to a C API that feels way more specific and limiting. For example that doesn't handle GC at all, it glosses over multi-return details, assumptions about host runtime implementation are made, etc. While I again feel this is useful as a sort of vision-setting exercise I think it'll be most productive to have an RFC on-the-record for viable work that can be done in a realistic time frame.
Putting all that together I feel that `lower-component` is the juiciest part to get alignment on in this entire RFC. Everything else, while it should be considered, is effectively a direct result of dealing with the output of `lower-component`. Given all the questions/thoughts around `lower-component` as well, that's why I feel that this RFC should be pared down to just `lower-component` with possibly future designs/RFCs for the subsequent tools/APIs.
No extra linear memories
The next part I'm thinking is that we should take on a hard constraint that `lower-component` does not inject linear memories into the output. This would, for example, preclude the concept of a wasm module that is injected which implements more runtime functionality. More-or-less this boils down to "all component model intrinsics end up becoming host function calls". I realize, though, that this is in direct opposition to this RFC as-is, and would remove this part of the RFC:
the output module will include component model runtime code, separately compiled from Rust source, which handles, among other things:
My feelings here are from what we discussed at Wasm.io. I think it will be much more maintainable to be able to explain and document what all these host intrinsics are if they're not a sort of halfway point between what the component model intrinsic is and what the host needs to do. By having everything get routed directly to the host it'll make it much easier to document semantics.
One example of this is the `resource.new` intrinsic. While it's possible that this could be entirely implemented by an auxiliary runtime I think it'll be clearer/easier to have `lower-component`, by default, import a function to do this. Now unlike the component model intrinsic I'm thinking that this would look something like:

```wat
(import "cm-intrinsics" "resource.new.i32" (func (param i32 i32) (result i32)))
```

The extra `i32` parameter here would be documented as "this is the type of the resource being created" where the other `i32` is the raw value provided by the component itself. This feels easier to document/specify as it's largely just referring to preexisting intrinsic definitions.
Furthermore by not injecting linear memories and importing intrinsics this still empowers hosts to self-host some functionality in wasm. There's nothing stopping a host from implementing
`resource.new.i32` with a wasm function, for example. In that sense translating everything to imports is a lowest-common-denominator of an at-least-partially-self-hosted runtime.
Multiple instantiations
I feel similar to what @cfallin and @fitzgen mentioned up-thread about this where we should strive to support all input components in `lower-component`, insofar as I don't think it would be reasonable to reject components that multiply-instantiate sub-components. Between the two implementation strategies of "duplicate everything" and "emit multiple modules" I think only the latter is within scope for Wasmtime. Ideally I'd like to use `lower-component` within Wasmtime directly and that would also ideally come with a similar performance/compilation profile as we have today for components. To that end I think that necessitates the output being multiple modules.
I realize though that this is a significant increase in complexity relative to squashing a component into just a single core module. The good news is that most of the time this won't be necessary and just one core wasm module will continue to pop out. The bad news is that fully compliant hosts will have to handle the case that multiple modules appear.
In the end though I feel that emission of metadata is inevitable anyway. For example there will want to be metadata about how many resource types were created and other miscellaneous things about limits and such. Hosts can probably get away with ignoring most of the metadata most of the time, though.
I understand as well that others can feel differently about what exactly goes into this RFC and the various technical decisions here. So given that I'm curious what others think too on all this!
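As a concrete illustration of the intrinsic shape sketched in the comment above, a host-side implementation of `resource.new.i32` (plus matching rep/drop operations) might look like this hypothetical Python model; the table layout and method names are illustrative, and a real host would also enforce ownership/borrow rules:

```python
class ResourceTable:
    """Host-side table backing the cm-intrinsics imports: handles map to a
    (resource type index, guest representation) pair."""
    def __init__(self):
        self.entries = {}     # handle -> (type_idx, rep)
        self.next_handle = 1  # 0 reserved as a null/invalid handle

    def resource_new_i32(self, type_idx, rep):
        # shape of (import "cm-intrinsics" "resource.new.i32"
        #           (func (param i32 i32) (result i32)))
        handle = self.next_handle
        self.next_handle += 1
        self.entries[handle] = (type_idx, rep)
        return handle

    def resource_rep_i32(self, type_idx, handle):
        ty, rep = self.entries[handle]
        assert ty == type_idx, "handle used at the wrong resource type"
        return rep

    def resource_drop(self, handle):
        # a real host would also run the resource's destructor here
        del self.entries[handle]
```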
ttraenkler commented on PR #46:
Focusing on lowering components seems a clear and incremental actionable first step. :+1:
Multiple instantiations
I feel similar to what @cfallin and @fitzgen mentioned up-thread about this where we should strive to support all input components in
`lower-component`, insofar as I don't think it would be reasonable to reject components that multiply-instantiate sub-components. Between the two implementation strategies of "duplicate everything" and "emit multiple modules" I think only the latter is within scope for Wasmtime. Ideally I'd like to use `lower-component` within Wasmtime directly and that would also ideally come with a similar performance/compilation profile as we have today for components. To that end I think that necessitates the output being multiple modules.
I realize though that this is a significant increase in complexity relative to squashing a component into just a single core module. The good news is that most of the time this won't be necessary and just one core wasm module will continue to pop out. The bad news is that fully compliant hosts will have to handle the case that multiple modules appear.
Instantiating multiple modules would be ideal, but if the constraint is lowering into a single core wasm module, here is a workaround, as an alternative to the options presented, that works with multiple memories today without forcing N copies:
The idea is to rewrite exported functions, and those called by them that touch global state, with an additional module index parameter during the merge. Since memory instructions take the memory index as an immediate in core Wasm today, a workaround is to wrap these in a function that dispatches to the correct memory using a `br_table`.
This
```wat
(func (export "malloc") (param $size i32) (result i32)
  (local $old i32)
  (local.set $old (global.get $heap_end))
  (global.set $heap_end
    (i32.add (global.get $heap_end)
             (call $align_up (local.get $size) (i32.const 8))))
  (local.get $old))
```

becomes
```wat
;; Shared malloc — one copy, dispatches on $instance at runtime
(func $malloc (param $size i32) (param $instance i32) (result i32)
  (local $old i32)
  ;; old = heap_end[$instance] (via br_table)
  (block $done
    (block $b1
      (block $b0
        (br_table $b0 $b1 (local.get $instance)))
      (local.set $old (global.get 0))   ;; instance 0
      (br $done))
    (local.set $old (global.get 1)))    ;; instance 1
  ;; heap_end[$instance] += align_up(size, 8)
  (block $done2
    (block $b1
      (block $b0
        (br_table $b0 $b1 (local.get $instance)))
      (global.set 0 (i32.add (global.get 0)
                             (call $align_up (local.get $size) (i32.const 8))))
      (br $done2))
    (global.set 1 (i32.add (global.get 1)
                           (call $align_up (local.get $size) (i32.const 8)))))
  (local.get $old))
```

The overhead can be eliminated with an optimization pass for hot paths, where inlining would duplicate the code anyway. Where the overhead is negligible, this could avoid N copies altogether by duplicating only memory instructions, not the whole function or its callers (not even every call site of the instruction, if the pattern is wrapped in a helper function), but it's a tradeoff.
It is a workaround, but it can be efficient, and if lowering into a single module is a requirement, or inlining across modules is not possible but the call is performance sensitive, this could provide a solution.
The more elegant solution would of course be a dynamic memory index. Even if instantiating multiple modules is an option for the runtime in question, not all of them would allow inlining calls across modules; can V8, for example?
tschneidereit commented on PR #46:
Furthermore, by not injecting linear memories and importing intrinsics, this still empowers hosts to self-host some functionality in wasm. There's nothing stopping a host from implementing
`resource.new.i32` with a wasm function, for example. In that sense translating everything to imports is a lower-common-denominator of an at-least-partially-self-hosted runtime.
I agree with this, but wonder if maybe a different framing could be that it makes sense to split the RFC itself into two parts? One containing what @alexcrichton is describing, the other one what in the current plan is happening in the injected linear memory. Perhaps that part could even consist of multiple bits that can be used to self-host different parts of the host API, such that embedders can choose how much they want to implement on the host side vs use off-the-shelf.
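As a hedged sketch of what self-hosting one intrinsic could look like (the export name and signature are illustrative, not taken from the RFC): an embedder could satisfy a lowered module's `resource.new.i32` import with a wasm implementation that stores the representation in a side memory and hands out slot indices as handles, instead of providing a native host function.

```wat
;; Illustrative only: a wasm-implemented resource.new.i32 that a host could
;; link in place of a native intrinsic. The rep is stored in a side memory
;; and the slot index is returned as the handle. No deallocation shown.
(module
  (memory 1)
  (global $next (mut i32) (i32.const 0))
  (func (export "resource.new.i32") (param $rep i32) (result i32)
    (local $handle i32)
    (local.set $handle (global.get $next))
    ;; store the component-provided representation at handle * 4
    (i32.store (i32.mul (local.get $handle) (i32.const 4)) (local.get $rep))
    (global.set $next (i32.add (global.get $next) (i32.const 1)))
    (local.get $handle)))
```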
alexcrichton commented on PR #46:
@ttraenkler what you're describing is more-or-less duplicating the entire module though, right? That looks like it's effectively got the same code-size impact where if a core module is instantiated N times there'll be N copies of its machine code. You're right that if all instructions referencing wasm definitions took dynamic immediates it would be somewhat ameliorated, but that then casts doubt on performance, since the dynamic input would likely perform much worse than a static immediate.
Overall, personally, what I'm getting at is that there needs to be a core assumption in the
`lower-component` tool, and among users of the tool, that multiple core modules may be output. I don't personally think there's any viable way around this. It's an accurate representation of what's actually happening and what's desired on behalf of the component. Other models are trading off performance/complexity/etc for the goal of having "just" a single core wasm module, which I personally don't think is viable for a full-fledged runtime (e.g. in my opinion Wasmtime wouldn't use the mode that outputs just a single core wasm module).
@tschneidereit personally, along the lines of keeping things more tractable and easier to reason about, I'd say that the hypothetical at-least-somewhat-self-hosted wasm runtime should be deferred until after
`lower-component` is more fleshed out. I feel that we need more experience with what exactly the imports to this core wasm module are before we specify what the self-hosted version would be. Given a `lower-component` tool it wouldn't be too hard, in theory, to at least experiment with various shapes of a self-hosted runtime and then propose/standardize on the one that feels best. Or, better yet, maybe this is something that wouldn't need an RFC/standardization and could just become a "well known useful tool" or something like that.
Also, another somewhat unrelated thought. One axiom that's not necessarily explicitly spelled out here but I think might be worthwhile to explain and write down -- in my opinion the goal here is to be able to take a WIT world and then enumerate the set of intrinsics, via core wasm imports, that a host must provide (and will expect from a core module) to be able to run any component that adheres to that WIT world. The
`lower-component` tool would then generate core wasm modules that use a subset of these expectations of the host (e.g. not all components call all functions, use all intrinsics, etc). This describes a "tool" of sorts that's not described in this RFC, going from a WIT world to this set of functions, but I don't think that the tool necessarily needs to exist immediately. It'll more-or-less be the `host-wit-bindgen` step, however.
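To make that axiom concrete, here is a hedged sketch (the `"canon"` namespace, the WIT-world name, and all field names are invented for illustration; nothing here is specified by the RFC) of the kind of import surface a lowered component might declare, which a host targeting that world would be expected to be able to satisfy in full:

```wat
;; Illustrative import surface of a lowered component; names hypothetical.
(module
  ;; the component's WIT import, already lowered to core types (ptr, len)
  (import "wit:example/host" "log" (func $log (param i32 i32)))
  ;; canonical-ABI intrinsics the host must supply
  (import "canon" "resource.new.i32" (func $resource_new (param i32) (result i32)))
  (import "canon" "resource.drop.i32" (func $resource_drop (param i32)))
  (memory (export "memory") 1)
  (func (export "run") (param $rep i32) (result i32)
    ;; a given component may use only a subset of the world's intrinsics
    (call $resource_new (local.get $rep))))
```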
ttraenkler commented on PR #46:
@ttraenkler what you're describing is more-or-less duplicating the entire module though, right? That looks like it's effectively got the same code-size impact where if a core module is instantiated N times there'll be N copies of its machine code. You're right that if all instructions referencing wasm definitions took dynamic immediates it would be somewhat ameliorated, but that then lends doubt to performance since the dynamic input would likely perform much worse than a static immediate.
Not necessarily. We can avoid $O(N \times M)$ bloat by abstracting state-access into centralized dispatchers during lowering. Instead of duplicating the entire module logic, we rewrite state-touching instructions to call a
`br_table` wrapper exactly once per type.
This transforms the code-size impact from multiplicative to additive. It also leaves the performance trade-off in the hands of the runtime: the JIT can selectively inline these wrappers on hot paths to recover performance (effectively specializing the immediate), while leaving cold paths as small function calls to preserve a tiny binary footprint.
```wat
;; Centralized 'Virtual Instruction' (Defined once per module)
(func $dispatch_load (param $inst i32) (param $addr i32) (result i32)
  (block $m1
    (block $m0
      (br_table $m0 $m1 (local.get $inst)))
    (return (i32.load 0 (local.get $addr))))  ;; Hardcoded to Memory 0
  (return (i32.load 1 (local.get $addr))))    ;; Hardcoded to Memory 1

;; The Logic (One copy shared by all N instances)
(func $malloc_shared (param $size i32) (param $inst i32) (result i32)
  (local $ptr i32)
  ;; ... complex logic here ...
  ;; Instead of a hardcoded i32.load, we call the dispatcher.
  ;; The logic body is never duplicated.
  (local.set $ptr (call $dispatch_load (local.get $inst) (i32.const 0)))
  ;; ...
  (local.get $ptr))
```

For
`lower-component`, this provides a 'middle gear' that supports a single core module without forcing massive logic duplication in scenarios where binary size is a constraint and peak throughput for every single instruction is not the primary requirement.
Overall, personally, what I'm getting at is that there needs to be a core assumption in the
`lower-component` tool, and among users of the tool, that multiple core modules may be output. I don't personally think there's any viable way around this. It's an accurate representation of what's actually happening and what's desired on behalf of the component. Other models are trading off performance/complexity/etc for the goal of having "just" a single core wasm module, which I personally don't think are viable for a full-fledged runtime (e.g. in my opinion Wasmtime wouldn't use the mode that outputs just a single core wasm module).
Even if runtimes like Wasmtime prefer multiple modules, and I think that is a reasonable default, this approach ensures that "Single Module Lowering" remains a viable and efficient target for the broader ecosystem. For example in environments like V8, where cross-module inlining is currently limited, staying within a single module can actually provide a better optimization boundary for the JIT than multiple modules would.
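As a hedged sketch of what "specializing the immediate" could produce (the pass and the function name `$malloc_inst0` are hypothetical), inlining the dispatcher at a call site known to be instance 0 collapses the dynamic dispatch back into a static memory index:

```wat
;; Hypothetical result of specializing the shared function for $inst = 0:
;; the br_table disappears and the memory index is an immediate again.
(func $malloc_inst0 (param $size i32) (result i32)
  (local $ptr i32)
  ;; ... same logic body as the shared version ...
  (local.set $ptr (i32.load 0 (i32.const 0)))  ;; static memory index 0
  (local.get $ptr))
```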
alexcrichton commented on PR #46:
That's true, yeah, that the set of instructions modifying state is O(1) and the transformation could build a function-per-instruction which internally dispatches. For Wasmtime at least this would not be viable as the inlining you describe won't happen. Additionally I would be pretty surprised if any runtime could realistically get close to the performance of the multiple-instantiation strategy, due to all the checks necessary to even opportunistically inline. Effectively, in my (possibly naive) opinion, I'd say that the performance of that solution is not viable.
this approach ensures that "Single Module Lowering" remains a viable and efficient target for the broader ecosystem
Personally what I'm trying to say is that I don't think this is a viable way to think about the component model. Environments like V8 can already instantiate multiple core modules just fine, and I would be surprised if any production-ready runtime were incapable of instantiating multiple modules. Given that, I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
ttraenkler commented on PR #46:
Given that, I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model via a "bundling" step, in cases where it helps performance or deployment, might make it a better fit for building an ecosystem like JSR or NPM on top of it with fewer dependencies on specific toolchains, so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build an Deno like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a Deno like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR or NPM like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR (Deno's package registry, similar to NPM) like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR like secure and lightweight ecosystem on top of it with less dependencies on toolchains so I decided to provide the perspective for the broader discussion.
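To make the merge step concrete, here is a toy sketch in Python of what a linker-style merge does at the index-space level (hypothetical data structures for illustration, not the actual POC or the wasm binary format): each module's functions are appended into one flat index space, local call indices are rebased, and imports are resolved to direct calls, which is what makes cross-module calls zero-overhead after linking.

```python
def merge(modules):
    """Merge toy "modules" into one flat function index space.

    modules: dict name -> {"funcs": [body, ...], "exports": {name: idx}}
    A body is a list of (op, operand) pairs; ("call", i) refers to
    function i in that module's own index space, and
    ("call_import", name) names an export of another module.
    """
    offsets, exports, count = {}, {}, 0
    # Pass 1: assign each module a base offset in the merged index
    # space and record where each export lands.
    for name, mod in modules.items():
        offsets[name] = count
        for export_name, local_idx in mod["exports"].items():
            exports[export_name] = count + local_idx
        count += len(mod["funcs"])
    # Pass 2: rebase local call indices and turn imports into direct
    # intra-module calls.
    merged = []
    for name, mod in modules.items():
        base = offsets[name]
        for body in mod["funcs"]:
            merged.append([
                ("call", base + arg) if op == "call"
                else ("call", exports[arg]) if op == "call_import"
                else (op, arg)
                for op, arg in body
            ])
    return {"funcs": merged, "exports": exports}

lib = {"funcs": [[("i32.add", None)]], "exports": {"add": 0}}
app = {"funcs": [[("call_import", "add")]], "exports": {"main": 0}}
linked = merge({"lib": lib, "app": app})
# app's cross-module call is now a direct call to merged function 0
```

Once every call is a direct index, an AOT inlining pass (as in the POC) can run across what used to be a module boundary.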
alexcrichton commented on PR #46:
Those are excellent points, yeah; sorry, I was a bit too strong in my wording. Agreed it's good to explore from a trade-off perspective! I'm not aware of a table like the one you're looking for, insofar as which runtimes support multiple modules vs those that don't.
I'm not entirely sure what you mean by smaller and more composable though. A lowered component which generates multiple modules is almost guaranteed to be smaller than any of the techniques described in this thread about generating a single module (even the dispatch idea you have, although there the size difference is much less). Composability I would imagine is a function of whatever system is being embedded into, where for example if it's a module system of some kind (e.g. JS modules) then that's an orthogonal concern where JS glue would be required to wrap the multiple modules (e.g. wire up instantiations and such), but I don't see how composability in general is lost if there's more than one module output.
I also don't think it's quite accurate to characterize this as overhead of the component model. The purpose of emitting multiple modules is to precisely represent this with no overhead. If a component internally instantiates things twice then that's what any conforming runtime must do more-or-less, instantiate something twice. If a component doesn't actually instantiate a module or component twice then the lowering process will continue to produce just a single core wasm module, I'm not envisioning something where multiple modules are arbitrarily generated "just because" for example.
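The point that "instantiate twice" has irreducible operational meaning can be illustrated with a loose Python analogy (closures standing in for instances; not wasm semantics): each instantiation allocates fresh state, and the two resulting instances are observably independent, which a single merged module with one linear memory cannot express without extra machinery.

```python
def instantiate(page_size=65536):
    # Each instantiation allocates fresh linear memory; the returned
    # exports close over this instance's state, mirroring how a wasm
    # runtime gives each instance its own memory and globals.
    memory = bytearray(page_size)

    def store(addr, value):
        memory[addr] = value & 0xFF

    def load(addr):
        return memory[addr]

    return {"store": store, "load": load}

a = instantiate()
b = instantiate()
a["store"](0, 42)  # writes only into instance a's memory
```

If the component only instantiates each module once, this whole question disappears, which is why the single-module output remains the common case.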
ttraenkler commented on PR #46:
> I'm not entirely sure what you mean by smaller and more composable though. A lowered component which generates multiple modules is almost guaranteed to be smaller than any of the techniques described in this thread about generating a single module (even the dispatch idea you have, although there the size difference is much less). Composability I would imagine is a function of whatever system is being embedded into, where for example if it's a module system of some kind (e.g. JS modules) then that's an orthogonal concern where JS glue would be required to wrap the multiple modules (e.g. wire up instantiations and such), but I don't see how composability in general is lost if there's more than one module output.
In terms of binary size, I agree: nothing beats multiple instantiation. However, by "composable," I'm referring to the performance cost of fine-grained boundaries. As we shrink modules to be more modular, cross-module calls become more frequent. On hot paths, such as kernel functions manipulating bulk images, audio, or strings, these boundaries matter.
If a developer publishes a "String Utilities" or "Image Math kernels" component to an OCI registry, they currently have no guarantee that their fine-grained accessors will be optimized/inlined by the end-user's runtime. This potentially discourages using the Component Model for high-performance libraries, as "linking" effectively becomes a non-portable, toolchain-specific performance gamble.
While Wasmtime aims to eliminate cross-module costs, it seems many other runtimes (like V8) don't yet have clear paths or public plans for the optimizations required to make those boundaries zero-cost.
Providing a single-module fallback in lower-component ensures that:
- Producers don't have to worry about how a specific JIT handles boundaries, allowing them to choose the optimal balance of binary size and execution speed for their specific use case.
- Consumers get a "statically linked" guarantee that their hot paths can be optimized AOT during the merge.
Essentially, it makes the Component Model a safe "module substrate" even for fine-grained, performance-sensitive modules, regardless of the sophistication of the target runtime's cross-module inlining.
If components are to become the "currency of exchange" for the ecosystem, they should provide a path for these use cases without forcing developers to adopt a different model to achieve predictable performance. This lowers the buy-in and facilitates sharing across the OCI registry as .wasm binaries which could be a catalyst for kick-starting the Wasm OCI ecosystem.
cfallin commented on PR #46:
@ttraenkler you're focusing on performance of fine-grained composition, which is an excellent goal. However, the solution you propose to actually merge the modules does not seem (to me at least) like it will be very performant. In particular, where you propose

```wat
;; Centralized 'Virtual Instruction' (Defined once per module)
(func $dispatch_load (param $inst i32) (param $addr i32) (result i32) ...)
```

Are you proposing to literally replace every load instruction in the original core module with a call to `dispatch_load(id, addr)`? That's going to be absurdly slow (relative to the original code) on any reasonable engine, I think, even if the `dispatch_load` function is inlined, because its logic (the `br_table` etc.) is still dynamic, and it will inhibit all sorts of optimizations. In essence, you're getting halfway to an interpreter (!): the original code still exists statically, but as a big sequence of calls to a function-per-instruction.

I think that Alex is right here: fundamentally, to retain 1:1 performance expectations, a Wasm function needs to remain a Wasm function, without virtualization. Our options then are to merge all Wasm functions and entities into one module (linker-style) and either be done, if an original module is not instantiated more than once; or else actually have multiple modules at runtime.

Said another way: engines may not cross-module-inline well today, but they also certainly do not monomorphize at all, and your approach requires monomorphization to remove its substantial overhead. The alternative is to "manually monomorphize", aka produce a final artifact with multiple modules.
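As a loose analogy for the virtualization being critiqued here (Python standing in for wasm, with hypothetical names): the shared dispatcher resolves which instance's memory to touch on every single access, while the "monomorphized" variant resolves it once ahead of time, which is what lets an optimizer treat the inner loop as ordinary direct code.

```python
# Two "instances" sharing one merged code body: every memory access
# must first select which instance's memory to touch (the br_table
# analogue), whereas the specialized version accesses its memory
# directly. Illustrative analogy only, not wasm semantics.
memories = [bytearray(16), bytearray(16)]

def dispatch_load(inst, addr):
    # Runtime selection on every single access: this per-access
    # indirection is the optimization barrier discussed above.
    return memories[inst][addr]

def sum_dispatched(inst, n):
    return sum(dispatch_load(inst, i) for i in range(n))

def make_specialized(inst):
    # "Monomorphized" variant: the memory is resolved once, ahead of
    # time, so the inner loop touches it directly.
    mem = memories[inst]
    def sum_direct(n):
        return sum(mem[i] for i in range(n))
    return sum_direct

memories[0][:4] = bytes([1, 2, 3, 4])
```

The two forms compute the same result; the difference is entirely in how much work (and how many missed optimizations) each access carries.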
ttraenkler commented on PR #46:
Let me take a step back and close this point to let the discussion focus on the more fundamental issues, but I thought it was worth mentioning as a potential option for the RFC.
> Are you proposing to literally replace every load instruction in the original core module with a call to `dispatch_load(id, addr)`?

In the extreme case, yes, though in many cases only a fraction of library code is actually on a hot path or even used (and could be DCEd when linked ahead of time). The drafted POC currently replaces instructions directly with `br_table` logic inline, without a function call (built on top of `wasm-encode`/`wasm-parse`), and a later `wasm-opt -O4` pass eliminates the `br_table`. You're absolutely right about the performance floor; in my tests, the `br_table` overhead on Wasmtime without the `wasm-opt` pass was massive (+3000-4000%), whereas V8's speculative optimizations seemed to handle it much better (+100-200%).

However, the intent is to treat this as a link-time transformation. Just as a traditional linker can perform LTO to devirtualize calls, a tool like `wasm-opt` or `wasm-compose` could use this pattern as a baseline and then "manually monomorphize" (inline and specialize) only the hot paths. For the "cold" 90% of the code, we save significant binary size; for the hot 10%, we pay the duplication cost only where it moves the needle.

> Engines may not cross-module-inline well today, but they also certainly do not monomorphize at all, and your approach requires monomorphization to remove its substantial overhead.
I agree. Since engines don't monomorphize at runtime, my proposal essentially moves that responsibility to the producer-side tooling.
I certainly don't suggest this as the default path for high-performance runtimes. I see this as a fallback for when runtime support is lacking or where binary size becomes a constraint (e.g., many module instances sharing a large string or math library). It gives the developer a knob to turn between size and performance, making the trade-off predictable at compile-time rather than leaving it to the heuristics of a specific engine's JIT.
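The hot/cold split described above could be sketched as a simple producer-side planning step (a hypothetical heuristic for illustration, not the POC's actual logic): duplicate only the hot functions per instance, paying size for direct monomorphic code, and leave the cold majority shared behind the dispatcher.

```python
def plan_specialization(funcs, instances, hot_threshold=0.1):
    """Decide which functions to monomorphize at link time.

    funcs: name -> estimated fraction of runtime spent in it
           (profile data or a static heuristic; hypothetical input).
    Hot functions are duplicated once per instance; cold ones stay
    shared and go through the dispatch path.
    """
    specialized, shared = [], []
    for name, heat in funcs.items():
        if heat >= hot_threshold:
            specialized.extend(f"{name}@inst{i}" for i in range(instances))
        else:
            shared.append(name)
    return specialized, shared

plan = plan_specialization({"memcpy": 0.6, "log": 0.01}, instances=2)
```

This is the "knob" between size and performance: raising the threshold shrinks the binary, lowering it approaches full per-instance duplication.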
dicej commented on PR #46:
For context on a use case where single-module output matters: we use meld to fuse components for embedded targets (Cortex-M, KB-range RAM) in automotive. The Component Model gives us typed interface boundaries between components from different suppliers — useful for safety certification. At build time, the fused module is AOT compiled to native code. No JIT, no dynamic linking, no multi-module instantiation available.
For the common case (no multiply-instantiated modules), this aligns with what the RFC already plans — a single core module output. Multi-module for the general case works for us too as a future option. Just wanted to add the embedded perspective since it hasn't come up in the discussion.
Last updated: Apr 12 2026 at 23:10 UTC