@Peter Huene looking at https://github.com/WebAssembly/component-model/blob/main/design/mvp/Binary.md ; do you know if it is intended to have the same section ordering constraints as the core spec (non-custom sections have a fixed order; custom sections can be interleaved anywhere)?
With the component model, there's no constraint on section ordering at all.
index spaces are built up incrementally as a result
so the only requirement is that indexes are valid where they are read (i.e. if they refer to another section, that section must precede it)
ah ok that makes sense
in terms of being able to explode a component into individual sections, the original order would need to be maintained in a manifest of some sort for full fidelity (not unlike layer orders in an OCI)
I'm thinking about which sections we actually care to split out - if it's just custom sections there may be a compact encoding where you squash each consecutive run of non-custom sections into a single (length, hash)
The simplest approach would be to include every section in the manifest, but that would be a lot of entries
perhaps we design it such that both are supported: can list each section if desired but also consecutive sections as a single entry; thus we don't have to bake in the "important" section semantics into the manifest structure itself
That makes sense; a toolchain may know that a particular data section should be split, but that's not something you can really determine from the binary itself
perhaps we just say that each entry is a stream of 1 or more wasm sections (each section header describes the size, so it inherently can be read until the "end" of the current stream), for which the hash is of the entire stream of bytes?
Just confirming: after the header, all content must be sections of the form <type*1> <size (u32, LEB128-encoded)> <content*size>, right?
right
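A minimal sketch of what "read sections until the end of the stream" could look like, assuming the usual 8-byte preamble and a section size encoded as an unsigned LEB128 u32. The helper names (read_u32_leb128, iter_sections) are made up for illustration and are not from any existing tool:

```python
from typing import Iterator, Tuple

WASM_MAGIC = b"\x00asm"


def read_u32_leb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 u32 at buf[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated LEB128")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            break
        shift += 7
        if shift >= 35:  # a u32 never needs more than 5 bytes
            raise ValueError("LEB128 too long for u32")
    if result > 0xFFFFFFFF:
        raise ValueError("LEB128 value does not fit in u32")
    return result, pos


def iter_sections(binary: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (section_id, contents) for each section after the preamble."""
    if len(binary) < 8 or binary[:4] != WASM_MAGIC:
        raise ValueError("not a wasm binary")
    pos = 8  # skip magic (4 bytes) plus version/layer (4 bytes)
    while pos < len(binary):
        section_id = binary[pos]
        size, pos = read_u32_leb128(binary, pos + 1)
        end = pos + size
        if end > len(binary):
            raise ValueError("section runs past end of input")
        yield section_id, binary[pos:end]
        pos = end


# A tiny hand-built module: a custom section named "hi" with payload b"x",
# followed by an empty type section.
example = (
    WASM_MAGIC + b"\x01\x00\x00\x00"
    + b"\x00\x04\x02hix"
    + b"\x01\x01\x00"
)
sections = list(iter_sections(example))
```

Since nothing here depends on section order, the same loop would apply to a component as to a core module.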
It would be useful to know the labels of at least some custom sections; in particular if we are embedding package metadata in one of them the registry indexer would like to be able to identify that
or signatures
yeah definitely should be in such a manifest so that the tools can decide to skip fetching sections not relevant to them (e.g. custom)
so the types, and if custom, names
something like the output of wasm-tools objdump
(conceptually)
yeah i think that'd make sense
Something like:
manifest := version:u8 vec<entry>
entry := vec<section> digest:vec<u8>
section := type:u8 size:u32 label:string
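A rough sketch of how that grammar might be populated, assuming SHA-256 as the digest, assuming each custom section gets its own entry while consecutive non-custom sections are squashed into one, and assuming size counts the bytes of the encoded section (header included). All three are illustrative choices, not decisions from this thread:

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    type: int    # section id byte
    size: int    # assumption: length of the encoded section, header included
    label: str   # e.g. the custom section name; empty if there is none


@dataclass
class Entry:
    sections: List[Section]
    digest: bytes  # assumption: SHA-256 over the entry's whole byte stream


def build_manifest(raw: List[Tuple[int, str, bytes]]) -> List[Entry]:
    """Group (type, label, encoded_bytes) triples into manifest entries."""
    entries: List[Entry] = []
    run_sections: List[Section] = []
    run_bytes = b""

    def flush() -> None:
        nonlocal run_sections, run_bytes
        if run_sections:
            digest = hashlib.sha256(run_bytes).digest()
            entries.append(Entry(run_sections, digest))
            run_sections, run_bytes = [], b""

    for type_, label, encoded in raw:
        section = Section(type_, len(encoded), label)
        if type_ == 0:
            # Custom section: close the current run and emit a standalone entry.
            flush()
            entries.append(Entry([section], hashlib.sha256(encoded).digest()))
        else:
            run_sections.append(section)
            run_bytes += encoded
    flush()
    return entries


manifest = build_manifest([
    (1, "", b"\x01\x01\x00"),
    (3, "", b"\x03\x01\x00"),
    (0, "meta", b"\x00\x06\x04metaX"),
    (10, "", b"\x0a\x01\x00"),
])
```

Listing every section individually is the same structure with one section per entry, so both styles fit the grammar.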
Not sure how to deal with labels for hashing; the hash could include all the section content (including the label) or the label could be separate
Possibly safer to include it in the hashing and treat the section.label as arbitrary entry metadata? :thinking:
is the label going to be anything other than the custom section name or do we think it'll be arbitrary?
I'm thinking about the "component VFS" with files in individual data sections
hmm, so data segments might be defined in the component via a nested module (i.e. there's no name to associate) or (eventually when we support it) imported with a specific name, which would mean it'd appear in an import section along with other importable items
i'm not sure there'd be useful information we could extract for a label there
components don't have data sections either, so i'm not sure what Luke is thinking for this tooling (possibly a nested core module that implements the wasi:filesystem interface?)
so from the manifest perspective it'd appear as a core module section
which goes to the point of whether or not we need to explode nested sections ( :exploding_head:)
yeah, this is exactly where my understanding of this gets very hazy...
glad I'm not the only one
@Luke Wagner some thoughts on the above conversation?
If we added data imports to core modules (a fairly minor core spec change) and added data sections to components (no reason not to, also i expect an easy addition), then that could solve this problem. However, if we want to be able to de-dupe nested components (components nested in components, which i think will be common in a deployment registry, because what were originally external package references in the development registry will have been "bundled"), then I think we'll want to recursively explode components. And once we can recursively explode components, might as well explode modules, and now we don't have the original problem. How to represent recursive-explosion is its own tricky question, though ( :mind_blown: indeed)
Assuming module data imports and component data sections, could a nested module data section be mechanically extracted into a root component data section?
Thinking about this a bit more, here's a scheme that seems pretty simple and has some nice properties:
Define a new section id for an "external" section, where an "external" section is a section that is meant to be replaced in-place by another section whose contents are stored out-of-line. The format of an external section would be:
section-id:byte size:u32 custom-section-name?:name? contents:hash
where:
* The section-id and size are those of the other section that we want to replace the containing external section with, thereby allowing the out-of-line file to contain only the section contents (otherwise we'd have to stash the section-id and size in the file, which will be a problem for static assets)
* custom-section-name is only present if section-id = 0 (i.e., this is a custom section). The reason is, as above, to allow the out-of-line file to simply be the contents of the metadata file.
* hash is a name (a length + UTF-8 string) matching (for the sake of argument, open to alternatives) the hash-expression grammar.
This should work recursively: I start at a root component and then can recursively follow external sections into files containing other components and core modules.
What's neat about this approach is that an external section says precisely what it is meant to be replaced with, but it doesn't say how to find it, leaving that up to a higher level to say how to track down a blob with the requisite content hash. Thus, there can be a separate manifest (say, an OCI Artifact manifest) that lists a set of hashes and the exact format of this manifest doesn't have to be fixed because it's not included in any of these content hashes; the root .wasm is the manifest and a component package release can simply use the content-hash of the root .wasm. The package release's URL would thus be the key for locating the backing storage (say, identifying an OCI Artifact which contained all the requisite hashes).
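To make the substitution concrete, here is a small sketch of encoding one external section and splicing it back in from a content-addressed store. Everything specific is an assumption for illustration: the section ID 0x5E (the value the prototype later in this thread settles on), a "sha256:<hex>" string standing in for the hash-expression grammar, a plain dict as the store, and the function names. It works on a bare stream of sections (no preamble) and does not recurse into nested modules or components:

```python
import hashlib
from typing import Dict, Optional, Tuple

EXTERNAL_ID = 0x5E  # assumption: ID used for external sections


def encode_u32(value: int) -> bytes:
    """Encode an unsigned LEB128 integer."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_u32(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; no overflow checks in this sketch."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_name(text: str) -> bytes:
    raw = text.encode("utf-8")
    return encode_u32(len(raw)) + raw


def read_name(buf: bytes, pos: int) -> Tuple[str, int]:
    length, pos = read_u32(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def hash_expr(contents: bytes) -> str:
    # Assumption: a "sha256:<hex>" string stands in for the hash-expression.
    return "sha256:" + hashlib.sha256(contents).hexdigest()


def make_external(section_id: int, contents: bytes,
                  custom_name: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (encoded external section, hash) for an out-of-line section."""
    prefix = encode_name(custom_name or "") if section_id == 0 else b""
    digest = hash_expr(contents)
    payload = (bytes([section_id]) + encode_u32(len(prefix) + len(contents))
               + prefix + encode_name(digest))
    return bytes([EXTERNAL_ID]) + encode_u32(len(payload)) + payload, digest


def splice(stream: bytes, store: Dict[str, bytes]) -> bytes:
    """Replace every external section in a section stream with its target."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        start = pos
        section_id = stream[pos]
        size, pos = read_u32(stream, pos + 1)
        end = pos + size
        if section_id != EXTERNAL_ID:
            out += stream[start:end]  # ordinary section: copy through
        else:
            real_id = stream[pos]
            real_size, p = read_u32(stream, pos + 1)
            prefix = b""
            if real_id == 0:  # custom sections carry their name inline
                name, p = read_name(stream, p)
                prefix = encode_name(name)
            digest, p = read_name(stream, p)
            contents = store[digest]
            if hash_expr(contents) != digest:
                raise ValueError("content hash mismatch")
            if len(prefix) + len(contents) != real_size:
                raise ValueError("size mismatch")
            out += bytes([real_id]) + encode_u32(real_size) + prefix + contents
        pos = end
    return bytes(out)


external, digest = make_external(0, b"hello", "meta")
restored = splice(b"\x01\x01\x00" + external, {digest: b"hello"})
```

The store lookup is the only place that needs to know where blobs live, which matches the point that the external section says what to substitute but not how to find it.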
@Lann Martin @Peter Huene WDYT?
I like the approach. Would we want to account for the data section "type" field (i.e. "passive") similarly to the custom section name to allow VFS file contents to match external sections contents?
or, brainstorming a more general approach:
section-id:byte size:u32 content-prefix:vec<byte> contents:hash
where content-prefix is inlined just before the external contents
Oh right; great point and yeah, great generalization!
This would require bumping the core wasm binary format version, wouldn't it?
(sorry for slow reply; i need to fix my notification prefs.) i think we have a couple of options of how to frame this, but none should require bumping the "version" since, at the end of the day, we're not breaking any existing modules, just offering new options. one way to frame this is that we're proposing a new section to the module (and component) binary format, just like several existing wasm proposals (e.g., exception-handling). some hosts may not implement this section (for a long time or forever), but that's also fine and could be spec'd as such. but this probably requires us to actually make a proposal to the core wasm CG and maybe it gets shot down. as a fallback, we could frame this as a compression format that is logically decompressed before the core wasm binary format gets decoded, in the same way as core wasm doesn't know anything about gzip/brotli, despite browsers serving core wasm compressed by these
I think it would at least mean that a "split" module would potentially no longer be well formed for existing impls. For example it seems that a data count section would no longer be valid if its data section "disappeared" (from the perspective of an external-section-naive impl). More to the point it would potentially invalidate memory.init instructions, at worst silently giving them a different data segment than intended
Maybe they could be .wazm files :smile:
I personally don't see the need to have modules in a different format; I would imagine a new wasm parser that can be given one of these "manifests" and is capable of fetching/verifying sections as defined in the manifest, but provides a unified wasmparser-like streaming interface where it doesn't matter where the sections come from; consumers of the parser can't tell if it's a single file or reconstituted from multiple streams as it doesn't matter to them at all (the offsets given to the callers would be as if it were a single file too). to me this really isn't that much different from an OCI image, but instead of layering a file system on top of another, it's a single file with sequential sections.
Are you talking about Luke's scheme above (https://bytecodealliance.zulipchat.com/#narrow/stream/352111-warg/topic/Component.20Manifest/near/310668719)?
i am, but more generally i personally consider changes to the core spec to support a registry splitting a module's contents to be a non-starter
OK, sure. So this would be a new file type, maybe with an additional magic header
we're talking for the "manifest" here, right?
We don't really need a manifest under this design; you still need to be able to bundle files but that can be anything from OCI to a tarball as long as it allows some kind of content-addressed lookup
Hashing the root "wasm-like object" recursively protects external sections
It's sort of an "intrusive manifest"
accomplishing this by extending the component/core specs to have a concept of external section or by a completely different file format (i.e. a discrete manifest that, with the right tooling, can produce a component / module based on a manifest)?
i seem to have gotten lost on exactly which of those approaches is being discussed here
I'm talking about the external sections approach
ok i'm on the same page now; that said, I'm quite hesitant to take dependencies on core spec proposals to move this work forward
Right, so one option is to say it is a new binary format that just references the wasm binary encoding spec
that's not really the approach we take for core proposals in the tooling, though; for the various core proposals, including ones like the now-defunct module linking and interface types proposals, we just bake encoding/decoding support in and hide the decoding of it behind a runtime feature flag. i think that approach would work here just fine, it's just up until now we haven't had to touch anything core-related in the implementation of the component model and that's been nice from a maintenance perspective. at any rate, my hesitancy isn't a blocker (it does remind me of Alex being hesitant to having a discrete component binary AST in the tooling way back when)
i do realize we need a core spec proposal for data segment imports anyway for this VFS machinery to work as Luke described (it'd also probably need a non-trapping variant of memory.init to signal EOF of the segment offset, perhaps?)
@Peter Huene @Luke Wagner I put together a prototype to make sure I understood the approach: https://github.com/lann/wasm-splice/
It just uses custom @external-section sections rather than a new section ID
Great prototype Lann; looks good to me! You're probably already thinking this, but I could imagine wanting a flag for wasm-split that says to recursively split out all nested components, modules, passive data segments and custom sections (without me having to explicitly enumerate them).
Agreed that we shouldn't take any dependency on the core wasm changes, so probably Lann was right above in suggesting that we have something in the wasm binary that indicates "this contains expanded sections" and probably an extension change to go along with it. Spitballing here: in the layer field, where 0 means "core module", 1 means "component", we could add 2 to be a core module with external sections and 3 to be a component with external sections. And then the "decoding" for these two new layers could perform the substitution, yielding a layer 0 or 1 binary that is then decoded as normal.
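A small sketch of the layer idea, assuming the 8-byte preamble is the magic followed by a little-endian u16 version and a little-endian u16 layer, and treating bit 1 of layer as the "has external sections" flag (so 2 = core module with external sections, 3 = component with external sections). The function names and the component version value in the example are illustrative only:

```python
import struct
from typing import Tuple

WASM_MAGIC = b"\x00asm"
EXTERNAL_BIT = 0x2  # assumption: layer values 2 and 3 mean "with external sections"
LAYER_KINDS = {0: "core module", 1: "component"}


def describe_preamble(binary: bytes) -> Tuple[str, bool]:
    """Return (kind, has_external_sections) from the 8-byte preamble."""
    magic, _version, layer = struct.unpack_from("<4sHH", binary, 0)
    if magic != WASM_MAGIC:
        raise ValueError("not a wasm binary")
    base_layer = layer & ~EXTERNAL_BIT
    if base_layer not in LAYER_KINDS:
        raise ValueError(f"unknown layer {layer}")
    return LAYER_KINDS[base_layer], bool(layer & EXTERNAL_BIT)


def clear_external_bit(binary: bytes) -> bytes:
    """Rewrite the preamble so the result claims layer 0 or 1 again.

    Only valid once every external section has been substituted.
    """
    magic, version, layer = struct.unpack_from("<4sHH", binary, 0)
    header = struct.pack("<4sHH", magic, version, layer & ~EXTERNAL_BIT)
    return header + binary[8:]


split_core = WASM_MAGIC + b"\x01\x00" + b"\x02\x00"       # version 1, layer 2
split_component = WASM_MAGIC + b"\x0d\x00" + b"\x03\x00"  # example version, layer 3
kind, has_external = describe_preamble(split_core)
```

A decoder that doesn't know about the extra layers would reject these preambles outright, which avoids silently misreading a split binary.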
@Peter Huene one neat thing about this approach is that we don't block on adding data imports to core wasm; we're using these expanded sections to roughly achieve the same effect, with the added bonus that they work symmetrically for custom sections and any other random section that we one day think is duplicative and want to split out. Also, since the contents of the external data segments are known to the containing module at build time, in theory there shouldn't be a need for a dynamic way to probe their size; it could be statically known per-data-segment-index.
As @Peter Huene and I just discussed: the presence of an external section (with a unique section ID) would effectively be that flag that "this contains external sections"; implementations that don't know what to do with that section type will need to fail anyway
either way it's a proposal, so i think one on the core spec or one where we have a different AST format (i.e. different preamble) that's a strict superset of the core spec is mostly semantics; my concerns around these dependencies have waned
i might now lean towards having different AST formats that are strict supersets of modules and components so we don't have to put external sections into the component model proposal itself, much like you both describe above
sorry for the spinning of the wheels here
I don't have a strong opinion either way. Would we prefer using a bit from layer (which would make it part of the component proposal?) or using a different magic prefix?
@Luke Wagner i now see how external sections alone could fully implement a VFS as the VFS implementation will have static knowledge of the data; when the data changes, it needs a new VFS implementation tied to that data (which itself is referenced externally by a digest, so even if the VFS implementation were somehow generic, i.e. without statically-known segment sizes, it would need to change _anyway_)
@Lann Martin i think the magic would remain the same and we can simply bump the layer as luke describes; i like this because it means these external section proposals can be decoupled from either core and component specs and really only implemented by things like registry tooling (or other places where split components/modules might be relevant)
sgtm, thanks for all the careful thought on the option space here. and it seems like we can course-correct if we discover new reasons to prefer one of these other variations we've discussed while keeping the meat of the idea the same
Updated the prototype to set layer |= 2 and use a new section ID 0x5E (for 5ection External). Interesting to note that I couldn't use 0xE5 because wasmparser rejects IDs >= 0x80. I don't see any rationale for that in the specs but there are some related cases in the conformance test suite :shrug:
Sounds great; 0x5E seems unlikely to collide with core wasm any time soon. I expect the wasmparser limitation is trying to be conservative and preserve the option of the byte one day being "upgraded" to a variable-length LEB128, in which case you need the high bit unused.
added an issue for the component-model to include some of what is discussed in this thread: https://github.com/WebAssembly/component-model/issues/138
Last updated: Nov 26 2024 at 02:29 UTC