Stream: warg

Topic: Component Manifest


view this post on Zulip Lann Martin (Nov 16 2022 at 20:33):

@Peter Huene looking at https://github.com/WebAssembly/component-model/blob/main/design/mvp/Binary.md ; do you know if it is intended to have the same section ordering constraints as the core spec (non-custom sections have a fixed order; custom sections can be interleaved anywhere)?


view this post on Zulip Peter Huene (Nov 16 2022 at 20:33):

With the component model, there's no constraint on section ordering at all.

view this post on Zulip Peter Huene (Nov 16 2022 at 20:34):

index spaces are built up incrementally as a result

view this post on Zulip Peter Huene (Nov 16 2022 at 20:34):

so the only requirement is that indexes are valid where they are read (i.e. if they refer to another section, that section must precede it)
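A minimal sketch of that ordering rule, assuming a single index space and a made-up `Section` record (the real component model keeps several index spaces and has its own section encoding); only the "defined before use" check is the point here:

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical, simplified section: how many items it defines
    and which earlier indexes it refers to."""
    name: str
    defines: int = 0
    refs: list = field(default_factory=list)


def validate(sections):
    """Walk sections in file order. A reference is valid only if the index
    it names was already defined by a preceding section."""
    defined = 0  # size of the (single, simplified) index space so far
    for s in sections:
        for r in s.refs:
            if r >= defined:
                raise ValueError(f"{s.name}: index {r} used before it is defined")
        defined += s.defines  # the index space grows incrementally
    return defined
```

With this model, `[types, alias]` validates while the same two sections in the opposite order fail, even though neither order is "fixed" by the format.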

view this post on Zulip Lann Martin (Nov 16 2022 at 20:34):

ah ok that makes sense

view this post on Zulip Peter Huene (Nov 16 2022 at 20:38):

in terms of being able to explode a component into individual sections, the original order would need to be maintained in a manifest of some sort for full fidelity (not unlike layer orders in an OCI)

view this post on Zulip Lann Martin (Nov 16 2022 at 20:40):

I'm thinking about which sections we actually care to split out - if it's just custom sections there may be a compact encoding where you squash each consecutive run of non-custom sections into a single (length, hash)
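A rough sketch of that squashing idea. The choice of SHA-256 and the `(kind, length, hex digest)` tuple shape are assumptions for illustration; the thread does not fix either:

```python
import hashlib
from itertools import groupby

CUSTOM_SECTION_ID = 0  # custom sections have id 0 in the wasm binary format


def squash(sections):
    """sections: list of (section_id, raw_bytes) in file order, where raw_bytes
    is the whole encoded section. Returns (kind, length, sha256-hex) entries."""
    entries = []
    runs = groupby(sections, key=lambda s: s[0] == CUSTOM_SECTION_ID)
    for is_custom, run in runs:
        run = list(run)
        if is_custom:
            # Each custom section keeps its own entry so it can be fetched alone.
            for _, raw in run:
                entries.append(("custom", len(raw), hashlib.sha256(raw).hexdigest()))
        else:
            # A consecutive run of non-custom sections collapses into one entry.
            blob = b"".join(raw for _, raw in run)
            entries.append(("run", len(blob), hashlib.sha256(blob).hexdigest()))
    return entries
```

Because the entries stay in file order, concatenating the blobs behind them reproduces the original section stream.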

view this post on Zulip Lann Martin (Nov 16 2022 at 20:42):

The simplest approach would be to include every section in the manifest, but that would be a lot of entries

view this post on Zulip Peter Huene (Nov 16 2022 at 20:44):

perhaps we design it such that both are supported: can list each section if desired but also consecutive sections as a single entry; thus we don't have to bake in the "important" section semantics into the manifest structure itself

view this post on Zulip Lann Martin (Nov 16 2022 at 20:45):

That makes sense; a toolchain may know that a particular data section should be split, but that's not something you can really determine from the binary itself

view this post on Zulip Peter Huene (Nov 16 2022 at 20:48):

perhaps we just say that each entry is a stream of 1 or more wasm sections (each section header describes the size, so it inherently can be read until the "end" of the current stream), for which the hash is of the entire stream of bytes?

view this post on Zulip Lann Martin (Nov 16 2022 at 20:48):

Just confirming: after the header, all content must be sections of the form <type*1> <size*4> <content*size>, right?

view this post on Zulip Peter Huene (Nov 16 2022 at 20:48):

right
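A small sketch of reading such a stream of sections until it ends. One detail worth noting: in the wasm binary format the section size is a `u32` encoded as unsigned LEB128 (1 to 5 bytes), not a fixed 4-byte field, so the sketch decodes it that way. The function names are made up for this example:

```python
def read_u32_leb128(buf, pos):
    """Decode an unsigned LEB128 value starting at pos; returns (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def iter_sections(stream):
    """Yield (section_id, contents) for a byte stream of back-to-back sections."""
    pos = 0
    while pos < len(stream):
        section_id = stream[pos]
        size, pos = read_u32_leb128(stream, pos + 1)
        end = pos + size
        if end > len(stream):
            raise ValueError("truncated section")
        yield section_id, stream[pos:end]
        pos = end
```

Since every section header carries its own size, the reader needs no outer framing: it simply stops when the stream is exhausted.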

view this post on Zulip Lann Martin (Nov 16 2022 at 20:50):

> perhaps we just say that each entry is a stream of 1 or more wasm sections (each section header describes the size, so it inherently can be read until the "end" of the current stream), for which the hash is of the entire stream of bytes?

It would be useful to know the labels of at least some custom sections; in particular if we are embedding package metadata in one of them the registry indexer would like to be able to identify that

view this post on Zulip Lann Martin (Nov 16 2022 at 20:51):

or signatures

view this post on Zulip Peter Huene (Nov 16 2022 at 20:51):

yeah definitely should be in such a manifest so that the tools can decide to skip fetching sections not relevant to them (e.g. custom)

view this post on Zulip Peter Huene (Nov 16 2022 at 20:51):

so the types, and if custom, names

view this post on Zulip Lann Martin (Nov 16 2022 at 20:53):

something like the output of wasm-tools objdump (conceptually)

view this post on Zulip Peter Huene (Nov 16 2022 at 20:54):

yeah i think that'd make sense

view this post on Zulip Lann Martin (Nov 16 2022 at 21:13):

Something like:

manifest := version:u8  vec<entry>
entry := vec<section>  digest:vec<u8>
section := type:u8  size:u32  label:string
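One way to read that grammar, sketched as Python dataclasses with an encoder. The wire conventions used here (a `vec` as an LEB128 count followed by its items, a `string` as an LEB128 length followed by UTF-8, `u32` as LEB128) are borrowed from the wasm binary format and are an assumption; the message above does not specify them:

```python
from dataclasses import dataclass


def leb(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


@dataclass
class Section:
    type: int   # section id (0 = custom)
    size: int   # size of the section contents in bytes
    label: str  # e.g. the custom section name; empty if there is none

    def encode(self):
        name = self.label.encode("utf-8")
        return bytes([self.type]) + leb(self.size) + leb(len(name)) + name


@dataclass
class Entry:
    sections: list  # the sections covered by this entry, in file order
    digest: bytes   # hash over the entry's whole byte stream

    def encode(self):
        body = b"".join(s.encode() for s in self.sections)
        return leb(len(self.sections)) + body + leb(len(self.digest)) + self.digest


@dataclass
class Manifest:
    version: int
    entries: list

    def encode(self):
        body = b"".join(e.encode() for e in self.entries)
        return bytes([self.version]) + leb(len(self.entries)) + body
```

An entry listing several sections corresponds to the "consecutive sections as a single entry" case discussed earlier; an entry with one section corresponds to listing that section individually.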

view this post on Zulip Lann Martin (Nov 16 2022 at 21:20):

Not sure how to deal with labels for hashing; the hash could include all the section content (including the label) or the label could be separate

view this post on Zulip Lann Martin (Nov 16 2022 at 21:22):

Possibly safer to include it in the hashing and treat the section.label as arbitrary entry metadata? :thinking:

view this post on Zulip Peter Huene (Nov 16 2022 at 21:35):

is the label going to be anything other than the custom section name or do we think it'll be arbitrary?

view this post on Zulip Lann Martin (Nov 16 2022 at 21:36):

I'm thinking about the "component VFS" with files in individual data sections

view this post on Zulip Peter Huene (Nov 16 2022 at 21:38):

hmm, so data segments might be defined in the component via a nested module (i.e. there's no name to associate) or (eventually when we support it) imported with a specific name, which would mean it'd appear in an import section along with other importable items

view this post on Zulip Peter Huene (Nov 16 2022 at 21:38):

i'm not sure there'd be useful information we could extract for a label there

view this post on Zulip Peter Huene (Nov 16 2022 at 21:41):

components don't have data sections either, so i'm not sure what Luke is thinking for this tooling (possibly a nested core module that implements the wasi:filesystem interface?)

view this post on Zulip Peter Huene (Nov 16 2022 at 21:42):

so from the manifest perspective it'd appear as a core module section

view this post on Zulip Peter Huene (Nov 16 2022 at 21:42):

which goes to the point of whether or not we need to explode nested sections ( :exploding_head:)

view this post on Zulip Lann Martin (Nov 16 2022 at 21:45):

yeah, this is exactly where my understanding of this gets very hazy...

view this post on Zulip Lann Martin (Nov 16 2022 at 21:45):

glad I'm not the only one

view this post on Zulip Peter Huene (Nov 16 2022 at 21:46):

@Luke Wagner some thoughts on the above conversation?

view this post on Zulip Luke Wagner (Nov 16 2022 at 23:24):

If we added data imports to core modules (a fairly minor core spec change) and added data sections to components (no reason not to, also i expect an easy addition), then that could solve this problem. However, if we want to be able to de-dupe nested components (components nested in components, which i think will be common in a deployment registry, because what were originally external package references in the development registry will have been "bundled"), then I think we'll want to recursively explode components. And once we can recursively explode components, might as well explode modules, and now we don't have the original problem. How to represent recursive-explosion is its own tricky question, though ( :mind_blown: indeed)

view this post on Zulip Lann Martin (Nov 17 2022 at 14:00):

Assuming module data imports and component data sections, could a nested module data section be mechanically extracted into a root component data section?

view this post on Zulip Luke Wagner (Nov 17 2022 at 18:11):

Thinking about this a bit more, here's a scheme that seems pretty simple and has some nice properties:

Define a new section id for an "external" section, where an "external" section is a section that is meant to be replaced in-place by another section whose contents are stored out-of-line. The format of an external section would be:

section-id:byte size:u32 custom-section-name?:name? contents:hash

where:

* The section-id and size are that of the other section that we want to replace the containing external section with, thereby allowing the out-of-line file to contain only the section contents (otherwise we'd have to stash the section-id and size in the file, which will be a problem for static assets)
* custom-section-name is only present if section-id = 0 (i.e., this is a custom section). The reason is, as above, to allow the out-of-line file to simply be the contents of the metadata file.
* hash is a name (a length + UTF-8 string) matching (for the sake of argument, open to alternatives) the hash-expression grammar.

This should work recursively: I start at a root component and then can recursively follow external sections into files containing other components and core modules.

What's neat about this approach is that an external section says precisely what it is meant to be replaced with, but it doesn't say how to find it, leaving that up to a higher-level to say how to track down a blob with the requisite content hash. Thus, there can be a separate manifest (say, an OCI Artifact manifest) that lists a set of hashes and the exact format of this manifest doesn't have to be fixed because it's not included in any of these content hashes; the root .wasm is the manifest and a component package release can simply use the content-hash of the root .wasm. The package release's URL would thus be the key for locating the backing storage (say, identifying an OCI Artifact which contained all the requisite hashes).
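A sketch of the recursive substitution, with several assumptions labeled: the external section id `0x5E` is the value the prototype later in this thread settles on, `store` is a hypothetical content-addressed lookup (a plain dict here, standing in for an OCI artifact or similar), hashes are written as `"sha256:<hex>"`, and the declared size is checked against the stored blob before any nested expansion:

```python
import hashlib

EXTERNAL_ID = 0x5E  # placeholder id, taken from the prototype mentioned later
PREAMBLE_LEN = 8    # 4-byte magic plus 4 bytes of version/layer


def leb(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_leb(buf, pos):
    """Decode unsigned LEB128 at pos; returns (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def fetch(payload, store):
    """Decode an external-section payload; return (real_id, real_contents)."""
    real_id = payload[0]
    declared_size, p = read_leb(payload, 1)
    name_prefix = b""
    if real_id == 0:  # custom section: its name is stored inline
        n, p = read_leb(payload, p)
        name_prefix = leb(n) + payload[p:p + n]
        p += n
    n, p = read_leb(payload, p)
    digest = payload[p:p + n].decode("utf-8")
    blob = store[digest]  # how the blob is located is up to a higher level
    if "sha256:" + hashlib.sha256(blob).hexdigest() != digest:
        raise ValueError("content hash mismatch")
    contents = name_prefix + blob
    if len(contents) != declared_size:
        raise ValueError("size mismatch")
    if real_id != 0 and blob[:4] == b"\0asm":
        contents = resolve(blob, store)  # nested module/component: recurse
    return real_id, contents


def resolve(binary, store):
    """Return binary with every external section replaced in place.
    The preamble is copied unchanged; rewriting it is out of scope here."""
    out = bytearray(binary[:PREAMBLE_LEN])
    pos = PREAMBLE_LEN
    while pos < len(binary):
        section_id = binary[pos]
        size, pos = read_leb(binary, pos + 1)
        body = binary[pos:pos + size]
        pos += size
        if section_id == EXTERNAL_ID:
            section_id, body = fetch(body, store)
        out += bytes([section_id]) + leb(len(body)) + body
    return bytes(out)
```

Note that `resolve` never learns where blobs live; it only asks `store` for a digest, which mirrors the separation between "what to substitute" and "how to find it" described above.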

@Lann Martin @Peter Huene WDYT?

view this post on Zulip Lann Martin (Nov 17 2022 at 18:26):

I like the approach. Would we want to account for the data section "type" field (i.e. "passive") similarly to the custom section name to allow VFS file contents to match external sections contents?

view this post on Zulip Lann Martin (Nov 17 2022 at 18:29):

or, brainstorming a more general approach:

section-id:byte size:u32 content-prefix:vec<byte> contents:hash

where content-prefix is inlined just before the external contents
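To make the `content-prefix` idea concrete, here is a sketch for the VFS case: a data section holding a single passive segment, where the out-of-line blob is just the raw file. In the core binary format a passive data segment is encoded as the flag `0x01` followed by a byte vector, and the data section itself is a vector of segments, so the prefix is the segment count, the flag, and the byte length. The helper names are made up for this example:

```python
def leb(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


DATA_SECTION_ID = 11


def passive_data_prefix(file_len):
    """Bytes that precede the raw file inside a data section with one passive
    segment: segment count (1), passive flag (0x01), byte-vector length."""
    return leb(1) + b"\x01" + leb(file_len)


def splice(section_id, declared_size, content_prefix, blob):
    """Rebuild the real section: the prefix is inlined just before the blob."""
    contents = content_prefix + blob
    if len(contents) != declared_size:
        raise ValueError("size mismatch")
    return bytes([section_id]) + leb(declared_size) + contents
```

With this split, the blob's content hash matches the hash of the plain file, so the same static asset can be shared across modules regardless of how each one wraps it.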

view this post on Zulip Luke Wagner (Nov 17 2022 at 19:48):

Oh right; great point and yeah, great generalization!

view this post on Zulip Lann Martin (Nov 17 2022 at 20:13):

This would require bumping the core wasm binary format version, wouldn't it?

view this post on Zulip Luke Wagner (Nov 18 2022 at 20:36):

(sorry for slow reply; i need to fix my notification prefs.) i think we have a couple of options of how to frame this, but none should require bumping the "version" since, at the end of the day, we're not breaking any existing modules, just offering new options. one way to frame this is that we're proposing a new section to the module (and component) binary format, just like several existing wasm proposals (e.g., exception-handling). some hosts may not implement this section (for a long time or forever), but that's also fine and could be spec'd as such. but this probably requires us to actually make a proposal to the core wasm CG and maybe it gets shot down. as a fallback, we could frame this as a compression format that is logically decompressed before the core wasm binary format gets decoded, in the same way as core wasm doesn't know anything about gzip/brotli, despite browsers serving core wasm compressed by these

view this post on Zulip Lann Martin (Nov 18 2022 at 21:00):

I think it would at least mean that a "split" module would potentially no longer be well formed for existing impls. For example it seems that a data count section would no longer be valid if its data section "disappeared" (from the perspective of an external-section-naive impl). More to the point it would potentially invalidate memory.init instructions, at worst silently giving them a different data segment than intended

view this post on Zulip Lann Martin (Nov 18 2022 at 21:10):

Maybe they could be .wazm files :smile:

view this post on Zulip Peter Huene (Nov 18 2022 at 21:25):

I personally don't see the need to have modules in a different format; I would imagine a new wasm parser that can be given one of these "manifests" and is capable of fetching/verifying sections as defined in the manifest, but provides a unified wasmparser-like streaming interface where it doesn't matter where the sections come from; consumers of the parser can't tell if it's a single file or reconstituted from multiple streams, as it doesn't matter to them at all (the offsets given to the callers would be as if it were a single file too). to me this really isn't that much different from an OCI image, but instead of layering a file system on top of another, it's a single file with sequential sections.

view this post on Zulip Lann Martin (Nov 18 2022 at 21:28):

Are you talking about Luke's scheme above (https://bytecodealliance.zulipchat.com/#narrow/stream/352111-warg/topic/Component.20Manifest/near/310668719)?

view this post on Zulip Peter Huene (Nov 18 2022 at 21:31):

i am, but more generally i personally consider changes to the core spec to support a registry splitting a module's contents to be a non-starter

view this post on Zulip Lann Martin (Nov 18 2022 at 21:35):

OK, sure. So this would be a new file type, maybe with an additional magic header

view this post on Zulip Peter Huene (Nov 18 2022 at 21:44):

we're talking for the "manifest" here, right?

view this post on Zulip Lann Martin (Nov 18 2022 at 21:46):

We don't really need a manifest under this design; you still need to be able to bundle files but that can be anything from OCI to a tarball as long as it allows some kind of content-addressed lookup

view this post on Zulip Lann Martin (Nov 18 2022 at 21:48):

Hashing the root "wasm-like object" recursively protects external sections

view this post on Zulip Lann Martin (Nov 18 2022 at 21:49):

It's sort of an "intrusive manifest"

view this post on Zulip Peter Huene (Nov 18 2022 at 21:58):

accomplishing this by extending the component/core specs to have a concept of external section or by a completely different file format (i.e., a discrete manifest that, with the right tooling, can produce a component / module)?

view this post on Zulip Peter Huene (Nov 18 2022 at 22:00):

i seem to have gotten lost on exactly which of those approaches is being discussed here

view this post on Zulip Lann Martin (Nov 18 2022 at 22:04):

I'm talking about the external sections approach

view this post on Zulip Peter Huene (Nov 18 2022 at 22:08):

ok i'm on the same page now; that said, I'm quite hesitant to take dependencies on core spec proposals to move this work forward

view this post on Zulip Lann Martin (Nov 18 2022 at 22:12):

Right, so one option is to say it is a new binary format that just references the wasm binary encoding spec

view this post on Zulip Peter Huene (Nov 18 2022 at 22:20):

that's not really the approach we take for core proposals in the tooling, though; for the various core proposals, including ones like the now-defunct module linking and interface types proposals, we just bake encoding/decoding support in and hide the decoding of it behind a runtime feature flag. i think that approach would work here just fine, it's just up until now we haven't had to touch anything core-related in the implementation of the component model and that's been nice from a maintenance perspective. at any rate, my hesitancy isn't a blocker (it does remind me of Alex being hesitant to having a discrete component binary AST in the tooling way back when)

view this post on Zulip Peter Huene (Nov 18 2022 at 22:24):

i do realize we need a core spec proposal for data segment imports anyway for this VFS machinery to work as Luke described (it'd also probably need a non-trapping variant of memory.init to signal EOF of the segment offset, perhaps?)

view this post on Zulip Lann Martin (Nov 21 2022 at 15:44):

@Peter Huene @Luke Wagner I put together a prototype to make sure I understood the approach: https://github.com/lann/wasm-splice/


view this post on Zulip Lann Martin (Nov 21 2022 at 15:46):

It just uses custom @external-section sections rather than a new section ID

view this post on Zulip Luke Wagner (Nov 21 2022 at 20:08):

Great prototype Lann; looks good to me! You're probably already thinking this, but I could imagine wanting a flag for wasm-split that says to recursively split out all nested components, modules, passive data segments and custom sections (without me having to explicitly enumerate them).

Agreed that we shouldn't take any dependency on the core wasm changes, so probably Lann was right above in suggesting that we have something in the wasm binary that indicates "this contains expanded sections" and probably an extension change to go along with it. Spitballing here: in the layer field, where 0 means "core module", 1 means "component", we could add 2 to be a core module with external sections and 3 to be a component with external sections. And then the "decoding" for these two new layers could perform the substitution, yielding a layer 0 or 1 binary that is then decoded as normal.

@Peter Huene one neat thing about this approach is that we don't block on adding data imports to core wasm; we're using these expanded sections to roughly achieve the same effect, with the added bonus that they work symmetrically for custom sections and any other random section that we one day think is duplicative and want to split out. Also, since the contents of the external data segments are known to the containing module at build time, in theory there shouldn't be a need for a dynamic way to probe their size; it could be statically known per-data-segment-index.


view this post on Zulip Lann Martin (Nov 21 2022 at 20:22):

As @Peter Huene and I just discussed: the presence of an external section (with a unique section ID) would effectively be that flag that "this contains external sections"; implementations that don't know what to do with that section type will need to fail anyway

view this post on Zulip Peter Huene (Nov 21 2022 at 20:24):

either way it's a proposal, so i think one on the core spec or one where we have a different AST format (i.e. different preamble) that's a strict superset of the core spec is mostly semantics; my concerns around these dependencies have waned

view this post on Zulip Peter Huene (Nov 21 2022 at 20:26):

i might now lean towards having different AST formats that are strict supersets of the module and component formats so we don't have to put external sections into the component model proposal itself, much like you both describe above

view this post on Zulip Peter Huene (Nov 21 2022 at 20:26):

sorry for the spinning of the wheels here

view this post on Zulip Lann Martin (Nov 21 2022 at 20:29):

I don't have a strong opinion either way. Would we prefer using a bit from layer (which would make it part of the component proposal?) or using a different magic prefix?

view this post on Zulip Peter Huene (Nov 21 2022 at 20:39):

@Luke Wagner i now see how external sections alone could fully implement a VFS as the VFS implementation will have static knowledge of the data; when the data changes, it needs a new VFS implementation tied to that data (which itself is referenced externally by a digest, so even if the VFS implementation were somehow generic, i.e. without statically-known segment sizes, it would need to change _anyway_)

view this post on Zulip Peter Huene (Nov 21 2022 at 20:46):

@Lann Martin i think the magic would remain the same and we can simply bump the layer as luke describes; i like this because it means these external section proposals can be decoupled from either core and component specs and really only implemented by things like registry tooling (or other places where split components/modules might be relevant)

view this post on Zulip Luke Wagner (Nov 21 2022 at 20:57):

sgtm, thanks for all the careful thought on the option space here. and it seems like we can course-correct if we discover new reasons to prefer one of these other variations we've discussed while keeping the meat of the idea the same

view this post on Zulip Lann Martin (Nov 22 2022 at 14:37):

Updated the prototype to set layer |= 2 and use a new section ID 0x5E (for 5ection External). Interesting to note that I couldn't use 0xE5 because wasmparser rejects IDs >= 0x80. I don't see any rationale for that in the specs but there are some related cases in the conformance test suite :shrug:
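A sketch of what `layer |= 2` means at the byte level, assuming the preamble layout from the component model's Binary.md: the 4-byte magic `\0asm`, then a little-endian 16-bit version and a little-endian 16-bit layer (so a core module's `01 00 00 00` reads as version 1, layer 0). The function names are invented for this example and are not taken from the prototype:

```python
import struct

MAGIC = b"\0asm"
EXTERNAL_FLAG = 2  # bit set in `layer` to signal "contains external sections"


def parse_preamble(binary):
    """Return (version, layer) from the 8-byte preamble."""
    if binary[:4] != MAGIC:
        raise ValueError("not a wasm binary")
    return struct.unpack_from("<HH", binary, 4)


def describe(layer):
    kind = "component" if layer & 1 else "core module"
    if layer & EXTERNAL_FLAG:
        kind += " with external sections"
    return kind


def mark_external(binary):
    """Set the external-sections bit in the layer field, leaving the rest as is."""
    version, layer = parse_preamble(binary)
    return MAGIC + struct.pack("<HH", version, layer | EXTERNAL_FLAG) + binary[8:]
```

Under this reading, layers 0 and 1 stay exactly as they are today, and a decoder that only understands those two values rejects 2 and 3 at the preamble rather than partway through the sections.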

view this post on Zulip Luke Wagner (Nov 22 2022 at 16:41):

Sounds great; 0x5E seems unlikely to collide with core wasm any time soon. I expect the wasmparser limitation is trying to be conservative and preserving the optionality of the byte one day in the future being "upgraded" to a variable-length LEB128, in which case you need the high bit unused.
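The high-bit point can be checked with a few lines: any single byte below `0x80` is already its own unsigned LEB128 encoding, while a byte at or above `0x80` would be read as "more bytes follow". The helper name is made up for this illustration:

```python
def read_leb(buf, pos=0):
    """Decode unsigned LEB128 at pos; returns (value, bytes_consumed)."""
    result = shift = 0
    start = pos
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos - start
        shift += 7


# 0x5E has the high bit clear, so it decodes to itself in one byte.
print(read_leb(bytes([0x5E])))        # (94, 1)
# 0xE5 has the high bit set, so as LEB128 it needs a continuation byte.
print(read_leb(bytes([0xE5, 0x01])))  # (229, 2)
```

So an id like `0x5E` would keep its meaning if section ids ever became LEB128 values, whereas `0xE5` as a lone byte would not.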

view this post on Zulip Bailey Hayes (Dec 01 2022 at 01:40):

added an issue for the component-model to include some of what is discussed in this thread: https://github.com/WebAssembly/component-model/issues/138

How should components bundle static assets? We have seen several early incantations of this for WebAssembly modules, e.g. emscripten file systems. Many languages have a concept of embedding data, e...

Last updated: Oct 23 2024 at 20:03 UTC