dicej opened PR #46 from dicej:lower-component to bytecodealliance:main:
Great to see this. Happy to share notes on canonical ABI edge cases from Meld if useful.
cfallin submitted PR review:
This is really important work and I'm happy to see it being developed -- thanks!
cfallin created PR review comment:
This is a really good point and I think it's very important to solve "right", and not just reject components that instantiate a module more than once: that's a fundamental capability of the component model that core Wasm (without metadata/wrapper) doesn't have, and we don't want to bifurcate the ecosystem into components that fit this restriction and those that don't.
Function duplication (your second option) seems conceptually appealing because it hides the complexity, but in practice I suspect a large majority of functions will be duplicated, because almost everything will access memory...
Maybe the best option here is to actually define a "just the module linking, please" subset of the component model semantics that gives (i) a flat index space of core modules, (ii) a wiring diagram instantiating them and connecting imports and exports? The host already has to do some work to provide some intrinsics so this proposal is not "free" in any case; so ingesting such a format should not be too much of an additional sell (though there is certainly a step-function increase from "one core module" to "graph of core modules"). It's also conceptually the cleanest IMHO: this really is a thing that the component semantics can describe that a core Wasm module can't, but most core Wasm runtimes should have host APIs to instantiate a thing more than once, so we should just "pass it through".
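For concreteness, such a "wiring diagram" format might be modeled like this (a hypothetical sketch in Python; the type and field names are illustrative, not a proposed spec):

```python
from dataclasses import dataclass, field

@dataclass
class Instantiate:
    """One wiring command: instantiate a module from the flat index space."""
    module: int                          # index into Plan.modules
    # (import_module, import_name) -> (earlier instance index, export name)
    args: dict = field(default_factory=dict)

@dataclass
class Plan:
    modules: list      # flat list of core wasm module binaries
    commands: list     # Instantiate commands, executed in order

def instantiation_order_ok(plan: Plan) -> bool:
    """Imports may only be satisfied by exports of already-created instances,
    which is what makes the wiring diagram acyclic and host-executable."""
    for i, cmd in enumerate(plan.commands):
        for (src_instance, _export) in cmd.args.values():
            if src_instance >= i:
                return False
    return True
```

A host that can execute this shape of plan gets multiple instantiation of the same module "for free" by issuing two `Instantiate` commands with the same `module` index.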
Just to note it down, though I don't like it: I guess there could be a fourth option here, which is (at a high level) something like "reify the `vmctx` as actual Wasm state". That seems to be the most "honest" w.r.t. the lowering paradigm.
The idea is that one would reify data structures that look like Wasmtime's instance state as Wasm GC values. A Wasm memory could be an arrayref to an array-of-i8; a Wasm table could be an arrayref to an array-of-whatever. Given those, one could define a `vmctx` Wasm struct that contains memory refs and table refs as our native `vmctx` does today, as well as any globals, inlined; then the lowered functions take this `vmctx` struct ref as an implicit first arg.
This clearly would have nontrivial runtime overhead as well, since in essence we'd have two levels of indirection for any state access.
jellevandenhooff commented on PR #46:
Bit of a drive-by thought: My guess is that if the lowering tooling is performant enough, any wasm host would want to adopt it, and then the component-model spec splits in two: these new slim host bindings and the component-model guest bindings as today. Do you think you would end up committing to API stability on the host bindings part? Standardize them? I suspect wasm runtimes would want that.
Do you think you would end up committing to API stability on the host bindings part? Standardize them?
Yeah, I think for this to work the API would need to at least be "officially" documented in the same way the `dylink.0` convention is documented. Ideally, though, the "slim host bindings" API/ABI would just be a subset of the Component Model ABI (e.g. some or all of the `thread.*` and `context.*` canonical built-ins) and therefore not need to be documented or standardized separately. I _think_ that should work, in which case the TODO item in the "Host C API for Lowered Components" section will just be to describe the relevant CM ABI built-ins as C function declarations.
dicej submitted PR review.
dicej created PR review comment:
Maybe the best option here is to actually define a "just the module linking, please" subset of the component model semantics that gives (i) a flat index space of core modules, (ii) a wiring diagram instantiating them and connecting imports and exports?
Yeah, I expect this is what it would have to look like. One thought that crossed my mind would be to literally output a real component, but one that only uses the absolute minimum set of features needed to embed, instantiate, and link modules. Hosts would need to be able to parse and instantiate these "simple components" but not need to support the entire component model.
alexcrichton commented on PR #46:
Personally I'm all for reducing complexity as much as we can, and the motivation section of this RFC resonates with me accordingly and I agree it's a worthwhile problem to tackle. At the same time though I'm personally skeptical of this approach in terms of practicality. For example as-written the RFC is currently relatively hand-wavy in terms of what exact responsibilities lie where. I understand though this is a relatively early-stages proposal so it's naturally not going to have anything fully fleshed out on day 1, but nonetheless I want to point out that at least for me it's difficult to form a concrete opinion without having more concrete details.
As a general thrust of "make the component model simpler to implement and make components easier to run", that seems reasonable to have an RFC on-the-record from the BA blessing that approach. For me personally I don't find that too useful because if aspirations are high-level enough it runs the risk of getting agreement amongst lots of folks but being quite difficult to actually make progress.
So, a question for this: is that the purpose of this RFC? To get high-level agreement on the approach? We've done this with some debugging-related RFCs for example as an approach to have implementation details sketched but not fully fleshed out while still maintaining high-level agreement. If that's the goal then I'm happy to approve as-is. If the goal though is to get more in-depth discussion of the technical specifics, viability, etc, that's a pretty different conversation.
I also was thinking the same as @jellevandenhooff when reading over this -- whatever intermediate APIs are needed between the runtime and a lowered component effectively need to end up being standards for this to work (IMO). That raises the bar quite a lot in terms of expected quality and care to design which would be an important point to note.
So, a question for this: is that the purpose of this RFC? To get high-level agreement on the approach?
I opened this as a draft with some TODO items because I indeed wanted to gauge high-level agreement on the approach to begin with, but I also want to get into the details before calling it "done".
BTW, I went into a lot of detail in #38, but a lot of those details changed once we had real-world experience with the implementation. Personally, I think that's fine; the goal here is to be specific and make sure the details are not obviously wrong, but still be able to change things later during implementation as needed.
Anyway, yes, my goal is both high-level consensus and to get into the details as well. I'm aiming to add those by the end of the week, at which point I'll switch this out of draft mode.
alexcrichton commented on PR #46:
Ok makes sense, and yeah I would agree that trying to flesh out all the details up front is probably not worthwhile because of how much will change during the implementation as we get more experience. As the goal here is to be more-detail-oriented-than-high-level-goals, however, some thoughts I'd have on this are:
- A perhaps chief concern of mine is going to be performance/overhead. With native integration/implementation there's a lot of mechanisms to bypass overhead, and for example this change would likely require that all components have at least 2 linear memories (one for the guest, one for the runtime), which balloons 8G of virtual memory to 16G of virtual memory per-component. This is just one example, but I'd be initially wary that we would want to switch everything over in Wasmtime to this paradigm before being more confident in the performance profile, for example.
- Another thing I'd want to be pretty up-front about is that while lowering a component to a core wasm module certainly helps a lot there is still quite a lot of work for a host to do. Here it's under the guise of a bindgen but even just writing a bindgen requires significant effort/maintenance and is not something we can hand-wave away. This is fundamental to a host interacting with a component because somehow core wasm things need to get translated to host things, and this can get significantly complicated in the face of resources, futures, streams, lazy lowering, etc. Basically I don't want to give anyone the impression that this will basically delete 99% of component-model code in Wasmtime or other runtimes, my gut is that it'd be more like 50% in the end.
- Particularly w.r.t. async I don't actually know how a built-in wasm-based runtime could shave off a large chunk of the complexity burden from embedders. Of primary concern here to me is the lack of core wasm stack switching. With stack switching in theory a lot more can be moved to the guest, but without stack switching we're left with JSPI-like approaches which puts quite a lot more on the host. Even still, somehow the host's notion of async needs to be bridged into the wasm concept of async and that will inevitably require a lot of careful design and probably a lot of work on the host.
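To make the first bullet's arithmetic explicit (a trivial sketch; the 8G-per-memory virtual reservation figure comes from the comment above, not from a measurement):

```python
GIB = 1 << 30
# Assumption: each 32-bit linear memory costs ~8 GiB of virtual address space
# (4 GiB addressable plus guard region), per the figure cited above.
RESERVATION_PER_MEMORY = 8 * GIB

def per_component_reservation(n_memories: int) -> int:
    """Virtual address space reserved for one component's linear memories."""
    return n_memories * RESERVATION_PER_MEMORY

# guest memory only: 8 GiB; guest + injected runtime memory: 16 GiB
```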
Overall I think I'm actually relatively skeptical of this approach panning out in the long run. Despite that I do want this endeavor to succeed, however, but my point is that it's going to require significant investment and design to even just evaluate the approach. Personally at least I don't feel like there's a clear way to implement all of this which requires only figuring out some minor details; rather, the unknowns are much larger. In that sense I think it's worthwhile to experiment more here, but to truly feel comfortable about accepting this I'd personally want to see more proof-of-concept style work to flesh out more details about how these fundamentals are going to work and play out in the end.
In that sense I think it's worthwhile to experiment more here, but to truly feel comfortable about accepting this I'd personally want to see more proof-of-concept style work to flesh out more details about how these fundamentals are going to work and play out in the end
Yes, agreed that a PoC is needed before we'll really know whether this is (A) feasible, and (B) worth doing. If that means leaving this PR unmerged until the PoC is done, it's fine with me. Meanwhile, it's already generated some good discussion and serves as something we can point interested folks to.
Anyway, yes, my goal is both high-level consensus and to get into the details as well. I'm aiming to add those by the end of the week, at which point I'll switch this out of draft mode.
I didn't get around to this, but will try to do it early next week.
dicej updated PR #46.
dicej updated PR #46.
I just pushed an update which adds a bunch of detail regarding the proposed APIs.
dicej has marked PR #46 as ready for review.
dicej updated PR #46.
dicej updated PR #46.
dicej updated PR #46.
dicej updated PR #46.
fitzgen submitted PR review.
fitzgen created PR review comment:
I’ll echo Chris’s point here, even though I haven’t seen any disagreement with it: multiple instantiation is a core capability of the CM and we must cover all CM semantics.
I also think the output shouldn’t be a component with a minimal feature subset, because the idea is that we are implementing the CM desugaring for engines that don’t support it, so we shouldn’t assume that they can parse even a subset of it. The output should be a flat list of core modules (including those generated for fused adapters) and a flat list of instantiation and import-export wiring commands. Basically the simplest thing that covers the component model semantics, with no syntax sugar.
@alexcrichton
this change would require that all compoents likely to have at least 2 linear memories (one for the guest, one for the runtime)
Can you clarify why there would need to be a second memory for the runtime? I don’t follow how that would be required.
I agree that it would be a large problem however. I would personally be extremely surprised/concerned if we didn’t use the same number of memories in the desugared core output as were defined and instantiated in the input component.
Can you clarify why there would need to be a second memory for the runtime? I don’t follow how that would be required.
I believe he's referring to this part of the proposal (from the `lower-component` section):
In addition to the generated "fused adapter" code, the output module will
include component model runtime code, separately compiled from Rust source,
which handles, among other things:
- table management for resource and waitable values
- guest-to-guest stream and future I/O
- task and thread bookkeeping
That code will definitely need to allocate, which means it either needs to have its own memory or be able to allocate from another module's memory (e.g. via `cabi_realloc`, but note that we may be getting rid of that once lazy lowering arrives).
That code will definitely need to allocate, which means it either needs to have its own memory or be able to allocate from another module's memory (e.g. via `cabi_realloc`, but note that we may be getting rid of that once lazy lowering arrives).
Also, allocating from the memory of one of the (potentially malicious and/or buggy) modules taken from the input component invites the risks of tampering and information leaks.
The other option to avoid the extra memory is to compile the component runtime into native code and run it in the host instead. The tradeoff there is that it becomes part of the TCB along with all the other host code, but that's probably fine if the component runtime is written in Rust with zero unsafe code. The code would remain runtime-agnostic and thus reusable either way.
dicej updated PR #46.
I talked a little with Alex about this at Wasm I/O and I see now that there are two slightly different use cases and some of us (me) have perhaps been assuming one or the other:
1. Drawing the dividing line between existing core Wasm semantics and new component model semantics as an interface, effectively creating an interface for the host runtime to implement that has one function for each kind of CM intrinsic.
2. Making the interface as small as possible, so that runtimes would have to implement as little as possible to take this thing and get CM for as close to free as possible, even if that means "virtualizing" some intrinsics (e.g. implementing resource table management as a core Wasm module).[^virt]
[^virt]: Taken to the extreme limit, this is basically "just compile Wasmtime to Wasm and run components inside Wasmtime inside of the non-component-model runtime".
I think (2) is not something that Wasmtime could realistically share in its component model implementation because of things like the extra-memory issue.
I think (1) is something that Wasmtime could use in its component model implementation, although to achieve runtime performance on par with today's implementation, this would probably require self-hosting the interface definitions, using unsafe intrinsics to access vmctx data, and inlining. Perhaps if the tool also had a callback where the host compiler was able to either emit a Wasm function call for an intrinsic or some inline Wasm code we could avoid requiring inlining (and its hit to compilation performance) in order to match today's runtime performance.
But I also think that (2) is still a valid use case and additionally could be layered on top of (1).
Does that all make sense?
I'll add a little bit of thinking from that same conversation with Alex and Nick: I think that rather than building a monolithic, somewhat opaque runtime that requires an assortment of random host functionalities, trying to "factor out" core intrinsics or primitives and building the component model on top of them has a lot of pedagogical/explanatory value, which is important if we want the component model to be widely implemented and understood. For example, reifying unforgeable references (resource handles, capabilities) as a primitive provided by host intrinsics has value; so does defining exact mappings from async task primitives to something like stack switching or host intrinsics with equivalent semantics (pure fiber-switching).
Said another way, if the component model were decomposed into a "canonical 1-to-1 mapping to these N host primitives", that is a much more satisfying and convincing argument for a sound fundamental design and for reusability/generality. On the other hand, building a single canonical `libComponentModelRuntime.wasm` that runs in core Wasm is (as Nick said) more like compiling Wasmtime into Wasm, and feels like something close to admitting defeat, in the sense that we are saying things are complex enough that we just need to distribute a reference implementation. It's far better if the mapping is "thin" and the primitives are well-defined and reasonable to implement independently in many engines.
cfallin edited a comment on PR #46, appending:
EDIT: I realize the above is a little bit abstract, but the main point I'm trying to make is that there is a social-signalling aspect to the direction that we choose, and I'd prefer that we try to signal "it's built of reasonable primitives and here is the decomposition" rather than "just ship our opaque blob".
@fitzgen Yes, that makes sense. We could break the first use case you listed down even further:
1a. I want to run components on any runtime supporting core Wasm + fibers. If that runtime doesn't support components, I want to be able to pair it with a library (and maybe a host binding generator) which can parse, link (generating fused adapters on the fly if appropriate), and instantiate a component, deferring to the runtime for core Wasm and fiber operations.
1b. I want to flatten my component into a core module which imports a bunch of component model intrinsic functions and run it as in use case (1) above. In this case, the fused adapters would be generated as part of flattening, but the host library + Wasm runtime would take care of all runtime state.
In (1a), there's no need for `lower-component` and no need to address the question of multiply-instantiated modules. In (1b), `lower-component` still has a role, but (like (1a) and unlike (2)) needs no extra guest memory. I expect that (1b) will also require standardizing additional intrinsics for use in fused adapters (equivalent to the ones Wasmtime's FACT uses now for managing task and thread state during guest-to-guest calls) which aren't part of the component model.
On the face of it (1a) seems simpler, both for users ("I just want to run a component; don't bother me with more tools and more steps") and for us (no need to standardize the intrinsics fused adapters will need beyond those already defined in the component model). I'm wondering who might choose (1b) over (1a), and why.
@cfallin That sounds great; not sure exactly what it would look like, though. @lukewagner might have thoughts.
ttraenkler submitted PR review.
ttraenkler created PR review comment:
https://github.com/WebAssembly/component-model/issues/626 discusses this exact scenario of merging modules into a core wasm module, using multi memory to maintain memory isolation with imports and exports as the only shared surface. I think this could in many cases mean zero runtime cost for crossing the component boundary while maintaining isolation: function calls could be inlined trivially by wasm-opt, and in many cases even memory copies could be avoided altogether: at zero runtime cost for scalars and static memory indices and lengths, and at the cost of a bounds check for dynamic memory indices or lengths. This would be a form of "lazy lowering" if I understood @lukewagner correctly. Some details are left unspecified and I assume I am not alone with this idea, but I thought it would nevertheless be good to point out in case this has not been considered.
Following up with concrete findings from P3 async fusion work in meld:
P3 async components now fuse to valid core modules with multi-memory isolation. The async task primitives (`task.return`, `waitable-set.*`, `context.*`) flow through as host-provided imports, which aligns with the host intrinsic approach proposed here.
One design consideration for the host intrinsic API: after fusion, multiple original component instances share a single core module. Each has its own `task.return` with a different result type (tied to the original export's signature). The host needs to dispatch these correctly — the fused module uses distinct import slots per original `task.return` (e.g., `[task-return]0`, `[task-return]1`), so the host can use the import identity to determine the task context.
For component-model-native runtimes, wrapping the fused output back as a component hits `call_might_be_recursive` when internal async calls cross the now-collapsed instance boundary. The planned `recursive` effect would address this.
I'm currently working on both paths: the core module + host intrinsic path (via synth, an AOT compiler with its own runtime), and a nested component wrapper that preserves instance topology for wasmtime compatibility.
Correction to my earlier comment: the `call_might_be_recursive` issue I described was an artifact of the wrong architecture, not a fundamental limitation. As @lukewagner pointed out, a fused component shouldn't need internal `canon lift`/`canon lower` at all.
The correct approach for async cross-component calls after fusion: the adapter drives the callee's callback loop directly in core wasm — calling `[async-lift]` to start, `waitable-set-poll` (host import) to wait for events, and `[callback]` to drive progress until EXIT. `task.return` can be resolved as an in-module shim for result delivery. This keeps everything in core wasm with no component boundary.
The earlier points about host intrinsic design still hold: after fusion, each original async export's `task.return` has a distinct signature and import slot, so the host (or in-module shim) can dispatch by import identity.
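To illustrate the control flow described above, here is a toy Python model of the adapter driving a callee's callback loop (hypothetical: `async_lift`, `callback`, and the event list stand in for the `[async-lift]`, `[callback]`, and `waitable-set-poll` imports, and the callback codes are simplified, not the real ABI encoding):

```python
from collections import deque

EXIT, WAIT = "exit", "wait"   # simplified callback codes, not the real ABI values

class Callee:
    """Toy async callee that needs two events before it can produce a result."""
    def __init__(self):
        self.progress = 0
        self.result = None
    def async_lift(self, arg):            # stands in for [async-lift]
        self.arg = arg
        return WAIT                        # can't finish synchronously
    def callback(self, event):             # stands in for [callback]
        self.progress += 1
        if self.progress < 2:
            return WAIT
        self.result = self.arg * 2         # the task.return shim would deliver this
        return EXIT

def drive(callee, arg, events):
    """The fused adapter's loop: start the call, then poll for events (the
    waitable-set-poll host import) and feed them to the callback until EXIT."""
    pending = deque(events)
    code = callee.async_lift(arg)
    while code != EXIT:
        code = callee.callback(pending.popleft())
    return callee.result
```

For example, `drive(Callee(), 21, ["io-ready", "io-ready"])` completes after two events and returns `42`.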
alexcrichton commented on PR #46:
Reflecting on this more, reading over the current state of things, and digesting conversations I had at Wasm.io, my current thinking is that it would be best to pare back this RFC to just the `lower-component` tool, with the constraint that `lower-component` will not add any more linear memories than are already present within a component. I think it's also worth explicitly saying that multiply-instantiated components will be supported, and picking a strategy. To the extent that this tool wants to be used in Wasmtime we won't want the "duplicate the module items" approach, so that would necessitate the approach of "generate N modules + metadata".
I realize, however, that this is a bit of a spicy take on this RFC, so I want to expand more on the rationale as well.
Paring down to just `lower-component`
Personally I feel that this RFC is a bit too ambitious about what it's trying to specify at this time. I don't disagree with any of the end goals or means by which we get there, but I feel that there's just too much up in the air to make any real meaningful progress on evaluating/reviewing/etc. To me this feels similar to the arc of the series of debugging RFCs we have for Wasmtime where we started out (in my opinion) a bit too ambitious and further RFCs refined things saying "ok here's what we can more practically achieve in the near-term". While the sort of vision-setting of the entire arc can be valuable I'm not sure that bytecodealliance RFCs are necessarily the best venue by which to do that.
To me the abstraction level of core wasm is really the central part here. Everything else is ultimately a derivative of this abstraction boundary, which is a benefit, but also means that the work is separable and/or can have separate RFCs. For example `host-wit-bindgen`, while I agree it will be necessary, can be specified/implemented entirely in terms of "here's the shape of core module that pops out". The C APIs mentioned here I feel are a bit more out-there in terms of design. While I think we can reasonably work with the core wasm abstraction level, once you go all the way to a C API that feels way more specific and limiting. For example that doesn't handle GC at all, it glosses over multi-return details, assumptions about host runtime implementation are made, etc. While I again feel this is useful as a sort of vision-setting exercise I think it'll be most productive to have an RFC on-the-record for viable work that can be done in a realistic time frame.
Putting all that together I feel that `lower-component` is the juiciest part to get alignment on in this entire RFC. Everything else, while it should be considered, is effectively a direct result of dealing with the output of `lower-component`. Given all the questions/thoughts around `lower-component` as well, that's why I feel that this RFC should be pared down to just `lower-component` with possibly future designs/RFCs for the subsequent tools/APIs.
No extra linear memories
The next part I'm thinking is that we should take on a hard constraint that `lower-component` does not inject linear memories into the output. This would, for example, preclude the concept of a wasm module that is injected which implements more runtime functionality. More-or-less this boils down to "all component model intrinsics end up becoming host function calls". I realize, though, that this is in direct opposition to this RFC as-is, and would remove this part of the RFC:
the output module will include component model runtime code, separately compiled from Rust source, which handles, among other things:
My feelings here are from what we discussed at Wasm.io. I think it will be much more maintainable to be able to explain and document what all these host intrinsics are if they're not a sort of halfway point between what the component model intrinsic is and what the host needs to do. By having everything get routed directly to the host it'll make it much easier to document semantics.
One example of this is the `resource.new` intrinsic. While it's possible that this could be entirely implemented by an auxiliary runtime I think it'll be clearer/easier to have `lower-component`, by default, import a function to do this. Now unlike the component model intrinsic I'm thinking that this would look something like:

```wat
(import "cm-intrinsics" "resource.new.i32" (func (param i32 i32) (result i32)))
```

The extra `i32` parameter here would be documented as "this is the type of the resource being created" where the other `i32` is the raw value provided by the component itself. This feels easier to document/specify as it's largely just referring to preexisting intrinsic definitions.
Furthermore by not injecting linear memories and importing intrinsics this still empowers hosts to self-host some functionality in wasm. There's nothing stopping a host from implementing
`resource.new.i32` with a wasm function, for example. In that sense translating everything to imports is a lowest-common-denominator of an at-least-partially-self-hosted runtime.
Multiple instantiations
I feel similar to what @cfallin and @fitzgen mentioned up-thread about this where we should strive to support all input components in `lower-component`, insofar as I don't think it would be reasonable to reject components that multiply-instantiate sub-components. Between the two implementation strategies of "duplicate everything" and "emit multiple modules" I think only the latter is within scope for Wasmtime. Ideally I'd like to use `lower-component` within Wasmtime directly and that would also ideally come with a similar performance/compilation profile as we have today for components. To that end I think that necessitates the output being multiple modules.
I realize though that this is a significant increase in complexity relative to squashing a component into just a single core module. The good news is that most of the time this won't be necessary and just one core wasm module will continue to pop out. The bad news is that fully compliant hosts will have to handle the case that multiple modules appear.
In the end though I feel that emission of metadata is inevitable anyway. For example there will want to be metadata about how many resource types were created and other miscellaneous things about limits and such. Hosts can probably get away with ignoring most of the metadata most of the time, though.
I understand as well that others can feel differently about what exactly goes into this RFC and the various technical decisions here. So given that I'm curious what others think too on all this!
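As a concrete illustration of the intrinsic shape sketched in the comment above, a host-side implementation of `resource.new.i32` (plus matching rep/drop operations) might look like this hypothetical Python model; the table layout and method names are illustrative, and a real host would also enforce ownership/borrow rules:

```python
class ResourceTable:
    """Host-side table backing the cm-intrinsics imports: handles map to a
    (resource type index, guest representation) pair."""
    def __init__(self):
        self.entries = {}     # handle -> (type_idx, rep)
        self.next_handle = 1  # 0 reserved as a null/invalid handle

    def resource_new_i32(self, type_idx, rep):
        # shape of (import "cm-intrinsics" "resource.new.i32"
        #           (func (param i32 i32) (result i32)))
        handle = self.next_handle
        self.next_handle += 1
        self.entries[handle] = (type_idx, rep)
        return handle

    def resource_rep_i32(self, type_idx, handle):
        ty, rep = self.entries[handle]
        assert ty == type_idx, "handle used at the wrong resource type"
        return rep

    def resource_drop(self, handle):
        # a real host would also run the resource's destructor here
        del self.entries[handle]
```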
ttraenkler commented on PR #46:
Focusing on lowering components seems a clear and incremental actionable first step. :+1:
Multiple instantiations
I feel similar to what @cfallin and @fitzgen mentioned up-thread about this where we should strive to support all input components in
`lower-component`, insofar as I don't think it would be reasonable to reject components that multiply-instantiate sub-components. Between the two implementation strategies of "duplicate everything" and "emit multiple modules" I think only the latter is within scope for Wasmtime. Ideally I'd like to use `lower-component` within Wasmtime directly and that would also ideally come with a similar performance/compilation profile as we have today for components. To that end I think that necessitates the output being multiple modules.
I realize though that this is a significant increase in complexity relative to squashing a component into just a single core module. The good news is that most of the time this won't be necessary and just one core wasm module will continue to pop out. The bad news is that fully compliant hosts will have to handle the case that multiple modules appear.
Instantiating multiple modules would be ideal, but if the constraint is lowering into a single core wasm module, here is a workaround, as an alternative to the options presented, that works with multiple memories today without forcing N copies:
The idea is to rewrite exported functions, and those called by them that touch global state, with an additional module index parameter during the merge. Since memory instructions take the memory index as an immediate in core Wasm today, a workaround is to wrap these in a function that dispatches to the correct memory using a `br_table`.
This
```wat
(func (export "malloc") (param $size i32) (result i32)
  (local $old i32)
  (local.set $old (global.get $heap_end))
  (global.set $heap_end
    (i32.add (global.get $heap_end)
             (call $align_up (local.get $size) (i32.const 8))))
  (local.get $old))
```

becomes
```wat
;; Shared malloc — one copy, dispatches on $instance at runtime
(func $malloc (param $size i32) (param $instance i32) (result i32)
  (local $old i32)
  ;; old = heap_end[$instance] (via br_table)
  (block $done
    (block $b1
      (block $b0
        (br_table $b0 $b1 (local.get $instance)))
      (local.set $old (global.get 0))   ;; instance 0
      (br $done))
    (local.set $old (global.get 1)))    ;; instance 1
  ;; heap_end[$instance] += align_up(size, 8)
  (block $done2
    (block $b1
      (block $b0
        (br_table $b0 $b1 (local.get $instance)))
      (global.set 0 (i32.add (global.get 0)
                             (call $align_up (local.get $size) (i32.const 8))))
      (br $done2))
    (global.set 1 (i32.add (global.get 1)
                           (call $align_up (local.get $size) (i32.const 8)))))
  (local.get $old))
```

The overhead can be eliminated with an optimization pass for hot paths, where inlining would duplicate the code anyway. Where the overhead is negligible, this could avoid N copies altogether by duplicating only memory instructions, not the whole function or its callers (not even every call site of the instruction, if the pattern is wrapped in a helper function), but it's a tradeoff.
It is a workaround, but it can be efficient, and if lowering into a single module is a requirement, or inlining across modules is not possible but the call is performance sensitive, this could provide a solution.
The more elegant solution would of course be a dynamic memory index. Even if instantiating multiple modules is an option for the runtime in question, not all of them would allow inlining calls across modules; can V8, for example?
tschneidereit commented on PR #46:
Furthermore, by not injecting linear memories and importing intrinsics, this still empowers hosts to self-host some functionality in wasm. There's nothing stopping a host from implementing
`resource.new.i32` with a wasm function, for example. In that sense translating everything to imports is a lower-common-denominator of an at-least-partially-self-hosted runtime.
I agree with this, but wonder if maybe a different framing could be that it makes sense to split the RFC itself into two parts? One containing what @alexcrichton is describing, the other one what in the current plan is happening in the injected linear memory. Perhaps that part could even consist of multiple bits that can be used to self-host different parts of the host API, such that embedders can choose how much they want to implement on the host side vs use off-the-shelf.
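As a hedged sketch of what self-hosting one intrinsic could look like (the export name and signature are illustrative, not taken from the RFC): an embedder could satisfy a lowered module's `resource.new.i32` import with a wasm implementation that stores the representation in a side memory and hands out slot indices as handles, instead of providing a native host function.

```wat
;; Illustrative only: a wasm-implemented resource.new.i32 that a host could
;; link in place of a native intrinsic. The rep is stored in a side memory
;; and the slot index is returned as the handle. No deallocation shown.
(module
  (memory 1)
  (global $next (mut i32) (i32.const 0))
  (func (export "resource.new.i32") (param $rep i32) (result i32)
    (local $handle i32)
    (local.set $handle (global.get $next))
    ;; store the component-provided representation at handle * 4
    (i32.store (i32.mul (local.get $handle) (i32.const 4)) (local.get $rep))
    (global.set $next (i32.add (global.get $next) (i32.const 1)))
    (local.get $handle)))
```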
alexcrichton commented on PR #46:
@ttraenkler what you're describing is more-or-less duplicating the entire module though, right? That looks like it's effectively got the same code-size impact where if a core module is instantiated N times there'll be N copies of its machine code. You're right that if all instructions referencing wasm definitions took dynamic immediates it would be somewhat ameliorated, but that then casts doubt on performance, since the dynamic input would likely perform much worse than a static immediate.
Overall, personally, what I'm getting at is that there needs to be a core assumption in the
`lower-component` tool, and among users of the tool, that multiple core modules may be output. I don't personally think there's any viable way around this. It's an accurate representation of what's actually happening and what's desired on behalf of the component. Other models are trading off performance/complexity/etc for the goal of having "just" a single core wasm module, which I personally don't think is viable for a full-fledged runtime (e.g. in my opinion Wasmtime wouldn't use the mode that outputs just a single core wasm module).
@tschneidereit personally, along the lines of keeping things more tractable and easier to reason about, I'd say that the hypothetical at-least-somewhat-self-hosted wasm runtime should be deferred until after
`lower-component` is more fleshed out. I feel that we need more experience with what exactly the imports to this core wasm module are before we specify what the self-hosted version would be. Given a `lower-component` tool it wouldn't be too hard, in theory, to at least experiment with various shapes of a self-hosted runtime and then propose/standardize on the one that feels best. Or, better yet, maybe this is something that wouldn't need an RFC/standardization and could just become a "well known useful tool" or something like that.
Also, another somewhat unrelated thought. One axiom that's not necessarily explicitly spelled out here but I think might be worthwhile to explain and write down -- in my opinion the goal here is to be able to take a WIT world and then enumerate the set of intrinsics, via core wasm imports, that a host must provide (and will expect from a core module) to be able to run any component that adheres to that WIT world. The
`lower-component` tool would then generate core wasm modules that use a subset of these expectations of the host (e.g. not all components call all functions, use all intrinsics, etc). This describes a "tool" of sorts that's not described in this RFC, going from a WIT world to this set of functions, but I don't think that the tool necessarily needs to exist immediately. It'll more-or-less be the `host-wit-bindgen` step, however.
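To make that axiom concrete, here is a hedged sketch (the `"canon"` namespace, the WIT-world name, and all field names are invented for illustration; nothing here is specified by the RFC) of the kind of import surface a lowered component might declare, which a host targeting that world would be expected to be able to satisfy in full:

```wat
;; Illustrative import surface of a lowered component; names hypothetical.
(module
  ;; the component's WIT import, already lowered to core types (ptr, len)
  (import "wit:example/host" "log" (func $log (param i32 i32)))
  ;; canonical-ABI intrinsics the host must supply
  (import "canon" "resource.new.i32" (func $resource_new (param i32) (result i32)))
  (import "canon" "resource.drop.i32" (func $resource_drop (param i32)))
  (memory (export "memory") 1)
  (func (export "run") (param $rep i32) (result i32)
    ;; a given component may use only a subset of the world's intrinsics
    (call $resource_new (local.get $rep))))
```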
ttraenkler commented on PR #46:
@ttraenkler what you're describing is more-or-less duplicating the entire module though, right? That looks like it's effectively got the same code-size impact where if a core module is instantiated N times there'll be N copies of its machine code. You're right that if all instructions referencing wasm definitions took dynamic immediates it would be somewhat ameliorated, but that then lends doubt to performance since the dynamic input would likely perform much worse than a static immediate.
Not necessarily. We can avoid $O(N \times M)$ bloat by abstracting state-access into centralized dispatchers during lowering. Instead of duplicating the entire module logic, we rewrite state-touching instructions to call a
`br_table` wrapper exactly once per type.
This transforms the code-size impact from multiplicative to additive. It also leaves the performance trade-off in the hands of the runtime: the JIT can selectively inline these wrappers on hot paths to recover performance (effectively specializing the immediate), while leaving cold paths as small function calls to preserve a tiny binary footprint.
```wat
;; Centralized 'Virtual Instruction' (Defined once per module)
(func $dispatch_load (param $inst i32) (param $addr i32) (result i32)
  (block $m1
    (block $m0
      (br_table $m0 $m1 (local.get $inst)))
    (return (i32.load 0 (local.get $addr))))  ;; Hardcoded to Memory 0
  (return (i32.load 1 (local.get $addr))))    ;; Hardcoded to Memory 1

;; The Logic (One copy shared by all N instances)
(func $malloc_shared (param $size i32) (param $inst i32) (result i32)
  (local $ptr i32)
  ;; ... complex logic here ...
  ;; Instead of a hardcoded i32.load, we call the dispatcher.
  ;; The logic body is never duplicated.
  (local.set $ptr (call $dispatch_load (local.get $inst) (i32.const 0)))
  ;; ...
  (local.get $ptr))
```

For
`lower-component`, this provides a 'middle gear' that supports a single core module without forcing massive logic duplication in scenarios where binary size is a constraint and peak throughput for every single instruction is not the primary requirement.
Overall, personally, what I'm getting at is that there needs to be a core assumption in the
`lower-component` tool, and among users of the tool, that multiple core modules may be output. I don't personally think there's any viable way around this. It's an accurate representation of what's actually happening and what's desired on behalf of the component. Other models are trading off performance/complexity/etc for the goal of having "just" a single core wasm module, which I personally don't think are viable for a full-fledged runtime (e.g. in my opinion Wasmtime wouldn't use the mode that outputs just a single core wasm module).
Even if runtimes like Wasmtime prefer multiple modules, and I think that is a reasonable default, this approach ensures that "Single Module Lowering" remains a viable and efficient target for the broader ecosystem. For example in environments like V8, where cross-module inlining is currently limited, staying within a single module can actually provide a better optimization boundary for the JIT than multiple modules would.
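As a hedged sketch of what "specializing the immediate" could produce (the pass and the function name `$malloc_inst0` are hypothetical), inlining the dispatcher at a call site known to be instance 0 collapses the dynamic dispatch back into a static memory index:

```wat
;; Hypothetical result of specializing the shared function for $inst = 0:
;; the br_table disappears and the memory index is an immediate again.
(func $malloc_inst0 (param $size i32) (result i32)
  (local $ptr i32)
  ;; ... same logic body as the shared version ...
  (local.set $ptr (i32.load 0 (i32.const 0)))  ;; static memory index 0
  (local.get $ptr))
```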
alexcrichton commented on PR #46:
That's true, yeah, that the set of instructions modifying state is O(1) and the transformation could build a function-per-instruction which internally dispatches. For Wasmtime at least this would not be viable as the inlining you describe won't happen. Additionally I would be pretty surprised if any runtime could realistically get close to the performance of the multiple-instantiation strategy, due to all the checks necessary to even opportunistically inline. Effectively, in my (possibly naive) opinion, I'd say that the performance of that solution is not viable.
this approach ensures that "Single Module Lowering" remains a viable and efficient target for the broader ecosystem
Personally what I'm trying to say is that I don't think this is a viable way to think about the component model. Environments like V8 can already instantiate multiple core modules just fine, and I would be surprised if any production-ready runtime were incapable of instantiating multiple modules. Given that, I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
ttraenkler commented on PR #46:
Given that, I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model via a "bundling" step, in cases where it helps performance or deployment, might make it a better fit for building an ecosystem like JSR or NPM on top of it with fewer dependencies on specific toolchains, so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build an Deno like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a Deno like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR or NPM like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR (Deno's package registry, similar to NPM) like secure and lightweight ecosystem on top of it so I decided to provide the perspective for the broader discussion.
ttraenkler edited a comment on PR #46:
Given that I'm not sure I understand the desire to push so hard on having a single-module output vs accepting that some outputs will have multiple modules.
No question multiple modules are the better default. I present this primarily for completeness as a "size-optimized" alternative to the less desirable "N copies" or "not supporting" scenarios. To help inform the trade-offs, is there a table that tracks runtime support for multi-module instantiation and cross-module inlining?
My personal motivation is being able to guarantee zero-overhead cross-module calls, rather than leaving it to the discretion of a runtime's JIT. Within a single module, this can be ensured by an ahead-of-time optimization pass during the merge (as in my POC). This would allow modules to become much smaller and composable, similar to JS modules. This doesn't necessarily have to happen at the component level, but reducing the overhead of the component model might make it easier to build a JSR like secure and lightweight ecosystem on top of it with less dependencies on toolchains so I decided to provide the perspective for the broader discussion.
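To make the merge step concrete, here is a toy sketch in Python of what a linker-style merge does at the index-space level (hypothetical data structures for illustration, not the actual POC or the wasm binary format): each module's functions are appended into one flat index space, local call indices are rebased, and imports are resolved to direct calls, which is what makes cross-module calls zero-overhead after linking.

```python
def merge(modules):
    """Merge toy "modules" into one flat function index space.

    modules: dict name -> {"funcs": [body, ...], "exports": {name: idx}}
    A body is a list of (op, operand) pairs; ("call", i) refers to
    function i in that module's own index space, and
    ("call_import", name) names an export of another module.
    """
    offsets, exports, count = {}, {}, 0
    # Pass 1: assign each module a base offset in the merged index
    # space and record where each export lands.
    for name, mod in modules.items():
        offsets[name] = count
        for export_name, local_idx in mod["exports"].items():
            exports[export_name] = count + local_idx
        count += len(mod["funcs"])
    # Pass 2: rebase local call indices and turn imports into direct
    # intra-module calls.
    merged = []
    for name, mod in modules.items():
        base = offsets[name]
        for body in mod["funcs"]:
            merged.append([
                ("call", base + arg) if op == "call"
                else ("call", exports[arg]) if op == "call_import"
                else (op, arg)
                for op, arg in body
            ])
    return {"funcs": merged, "exports": exports}

lib = {"funcs": [[("i32.add", None)]], "exports": {"add": 0}}
app = {"funcs": [[("call_import", "add")]], "exports": {"main": 0}}
linked = merge({"lib": lib, "app": app})
# app's cross-module call is now a direct call to merged function 0
```

Once every call is a direct index, an AOT inlining pass (as in the POC) can run across what used to be a module boundary.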
alexcrichton commented on PR #46:
Those are excellent points, yeah; sorry, I was a bit too strong in my wording. Agreed it's good to explore from a trade-off perspective! I'm not aware of a table like the one you're looking for, insofar as which runtimes support multiple modules vs those that don't.
I'm not entirely sure what you mean by smaller and more composable though. A lowered component which generates multiple modules is almost guaranteed to be smaller than any of the techniques described in this thread about generating a single module (even the dispatch idea you have, although there the size difference is much less). Composability I would imagine is a function of whatever system is being embedded into, where for example if it's a module system of some kind (e.g. JS modules) then that's an orthogonal concern where JS glue would be required to wrap the multiple modules (e.g. wire up instantiations and such), but I don't see how composability in general is lost if there's more than one module output.
I also don't think it's quite accurate to characterize this as overhead of the component model. The purpose of emitting multiple modules is to precisely represent this with no overhead. If a component internally instantiates things twice then that's what any conforming runtime must do more-or-less, instantiate something twice. If a component doesn't actually instantiate a module or component twice then the lowering process will continue to produce just a single core wasm module, I'm not envisioning something where multiple modules are arbitrarily generated "just because" for example.
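The point that "instantiate twice" has irreducible operational meaning can be illustrated with a loose Python analogy (closures standing in for instances; not wasm semantics): each instantiation allocates fresh state, and the two resulting instances are observably independent, which a single merged module with one linear memory cannot express without extra machinery.

```python
def instantiate(page_size=65536):
    # Each instantiation allocates fresh linear memory; the returned
    # exports close over this instance's state, mirroring how a wasm
    # runtime gives each instance its own memory and globals.
    memory = bytearray(page_size)

    def store(addr, value):
        memory[addr] = value & 0xFF

    def load(addr):
        return memory[addr]

    return {"store": store, "load": load}

a = instantiate()
b = instantiate()
a["store"](0, 42)  # writes only into instance a's memory
```

If the component only instantiates each module once, this whole question disappears, which is why the single-module output remains the common case.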
ttraenkler commented on PR #46:
> I'm not entirely sure what you mean by smaller and more composable though. A lowered component which generates multiple modules is almost guaranteed to be smaller than any of the techniques described in this thread about generating a single module (even the dispatch idea you have, although there the size difference is much less). Composability I would imagine is a function of whatever system is being embedded into, where for example if it's a module system of some kind (e.g. JS modules) then that's an orthogonal concern where JS glue would be required to wrap the multiple modules (e.g. wire up instantiations and such), but I don't see how composability in general is lost if there's more than one module output.
In terms of binary size, I agree: nothing beats multiple instantiation. However, by "composable," I'm referring to the performance cost of fine-grained boundaries. As we shrink modules to be more modular, cross-module calls become more frequent. On hot paths, such as kernel functions manipulating bulk images, audio, or strings, these boundaries matter.
If a developer publishes a "String Utilities" or "Image Math kernels" component to an OCI registry, they currently have no guarantee that their fine-grained accessors will be optimized/inlined by the end-user's runtime. This potentially discourages using the Component Model for high-performance libraries, as "linking" effectively becomes a non-portable, toolchain-specific performance gamble.
While Wasmtime aims to eliminate cross-module costs, it seems many other runtimes (like V8) don't yet have clear paths or public plans for the optimizations required to make those boundaries zero-cost.
Providing a single-module fallback in lower-component ensures that:
- Producers don't have to worry about how a specific JIT handles boundaries, allowing them to choose the optimal balance of binary size and execution speed for their specific use case.
- Consumers get a "statically linked" guarantee that their hot paths can be optimized AOT during the merge.
Essentially, it makes the Component Model a safe "module substrate" even for fine-grained, performance-sensitive modules, regardless of the sophistication of the target runtime's cross-module inlining.
If components are to become the "currency of exchange" for the ecosystem, they should provide a path for these use cases without forcing developers to adopt a different model to achieve predictable performance. This lowers the buy-in and facilitates sharing across the OCI registry as .wasm binaries which could be a catalyst for kick-starting the Wasm OCI ecosystem.
cfallin commented on PR #46:
@ttraenkler you're focusing on performance of fine-grained composition, which is an excellent goal. However, the solution you propose to actually merge the modules does not seem (to me at least) like it will be very performant. In particular, where you propose

```wat
;; Centralized 'Virtual Instruction' (Defined once per module)
(func $dispatch_load (param $inst i32) (param $addr i32) (result i32) ...)
```

Are you proposing to literally replace every load instruction in the original core module with a call to `dispatch_load(id, addr)`? That's going to be absurdly slow (relative to the original code) on any reasonable engine, I think, even if the `dispatch_load` function is inlined, because its logic (the `br_table` etc.) is still dynamic, and it will inhibit all sorts of optimizations. In essence, you're getting halfway to an interpreter (!): the original code still exists statically, but as a big sequence of calls to a function-per-instruction.

I think that Alex is right here: fundamentally, to retain 1:1 performance expectations, a Wasm function needs to remain a Wasm function, without virtualization. Our options then are to merge all Wasm functions and entities into one module (linker-style) and either be done, if an original module is not instantiated more than once; or else actually have multiple modules at runtime.

Said another way: engines may not cross-module-inline well today, but they also certainly do not monomorphize at all, and your approach requires monomorphization to remove its substantial overhead. The alternative is to "manually monomorphize", aka produce a final artifact with multiple modules.
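As a loose analogy for the virtualization being critiqued here (Python standing in for wasm, with hypothetical names): the shared dispatcher resolves which instance's memory to touch on every single access, while the "monomorphized" variant resolves it once ahead of time, which is what lets an optimizer treat the inner loop as ordinary direct code.

```python
# Two "instances" sharing one merged code body: every memory access
# must first select which instance's memory to touch (the br_table
# analogue), whereas the specialized version accesses its memory
# directly. Illustrative analogy only, not wasm semantics.
memories = [bytearray(16), bytearray(16)]

def dispatch_load(inst, addr):
    # Runtime selection on every single access: this per-access
    # indirection is the optimization barrier discussed above.
    return memories[inst][addr]

def sum_dispatched(inst, n):
    return sum(dispatch_load(inst, i) for i in range(n))

def make_specialized(inst):
    # "Monomorphized" variant: the memory is resolved once, ahead of
    # time, so the inner loop touches it directly.
    mem = memories[inst]
    def sum_direct(n):
        return sum(mem[i] for i in range(n))
    return sum_direct

memories[0][:4] = bytes([1, 2, 3, 4])
```

The two forms compute the same result; the difference is entirely in how much work (and how many missed optimizations) each access carries.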
ttraenkler commented on PR #46:
Let me take a step back and close this point to let the discussion focus on the more fundamental issues, but I thought it was worth mentioning as a potential option for the RFC.
> Are you proposing to literally replace every load instruction in the original core module with a call to `dispatch_load(id, addr)`?

In the extreme case, yes, though in many cases only a fraction of library code is actually on a hot path or even used (and could be DCEd when linked ahead of time). The drafted POC currently replaces instructions directly with `br_table` logic inline, without a function call (built on top of `wasm-encode`/`wasm-parse`), and a later `wasm-opt -O4` pass eliminates the `br_table`. You're absolutely right about the performance floor; in my tests, the `br_table` overhead on Wasmtime without the `wasm-opt` pass was massive (+3000-4000%), whereas V8's speculative optimizations seemed to handle it much better (+100-200%).

However, the intent is to treat this as a link-time transformation. Just as a traditional linker can perform LTO to devirtualize calls, a tool like `wasm-opt` or `wasm-compose` could use this pattern as a baseline and then "manually monomorphize" (inline and specialize) only the hot paths. For the "cold" 90% of the code, we save significant binary size; for the hot 10%, we pay the duplication cost only where it moves the needle.

> Engines may not cross-module-inline well today, but they also certainly do not monomorphize at all, and your approach requires monomorphization to remove its substantial overhead.
I agree. Since engines don't monomorphize at runtime, my proposal essentially moves that responsibility to the producer-side tooling.
I certainly don't suggest this as the default path for high-performance runtimes. I see this as a fallback for when runtime support is lacking or where binary size becomes a constraint (e.g., many module instances sharing a large string or math library). It gives the developer a knob to turn between size and performance, making the trade-off predictable at compile-time rather than leaving it to the heuristics of a specific engine's JIT.
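The hot/cold split described above could be sketched as a simple producer-side planning step (a hypothetical heuristic for illustration, not the POC's actual logic): duplicate only the hot functions per instance, paying size for direct monomorphic code, and leave the cold majority shared behind the dispatcher.

```python
def plan_specialization(funcs, instances, hot_threshold=0.1):
    """Decide which functions to monomorphize at link time.

    funcs: name -> estimated fraction of runtime spent in it
           (profile data or a static heuristic; hypothetical input).
    Hot functions are duplicated once per instance; cold ones stay
    shared and go through the dispatch path.
    """
    specialized, shared = [], []
    for name, heat in funcs.items():
        if heat >= hot_threshold:
            specialized.extend(f"{name}@inst{i}" for i in range(instances))
        else:
            shared.append(name)
    return specialized, shared

plan = plan_specialization({"memcpy": 0.6, "log": 0.01}, instances=2)
```

This is the "knob" between size and performance: raising the threshold shrinks the binary, lowering it approaches full per-instance duplication.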
dicej commented on PR #46:
For context on a use case where single-module output matters: we use meld to fuse components for embedded targets (Cortex-M, KB-range RAM) in automotive. The Component Model gives us typed interface boundaries between components from different suppliers — useful for safety certification. At build time, the fused module is AOT compiled to native code. No JIT, no dynamic linking, no multi-module instantiation available.
For the common case (no multiply-instantiated modules), this aligns with what the RFC already plans — a single core module output. Multi-module for the general case works for us too as a future option. Just wanted to add the embedded perspective since it hasn't come up in the discussion.
Last updated: Apr 12 2026 at 23:10 UTC