@Peter Huene (and anyone else): I finally got around to writing up a draft proposal for "split wasm". The biggest open questions for me are how to handle data segments, including whether we want to split all of them or just passives: https://hackmd.io/@lann/HJ_f7geai
I'll look it over on Monday as Fastly has the day off today.
This looks great, thanks for writing it up! Regarding the active/passive segments open question, I think we can split the whole section as one for now.
The active segments would come from the code that virtualizes the file system, while, in theory, the passive segments come from the files being virtualized. So a change to the "virtualizer" would mean a new hash for the section even if the files being virtualized don't change, but I would assume that wouldn't happen very frequently, and it might even be desired if the virtualizer code changes how it interprets the passive segments.
@Peter Huene I think one of the goals of this splitting is for each virtualized file's content to be its own "fragment", which I believe implies splitting out individual data segments, right?
individual data segments, yes, but there's only one section
i don't know how this scheme would apply to items within a section
I'm just trying to parse your comment "split the whole section as one"
meaning the whole of the data section
So split the top-level section and then split out its segments as well?
ok so i see what you mean with wanting to split each passive segment, so we can reuse storage for large, seldom-updated files; do you propose extending this spec with a specific format for the data section?
as i'm reading it right now, the data section is split in its entirety (a single entry vs. a vector of split data segment items)
Oh yes, I rewrote that a couple of times and the heading is poorly worded in its current state
Oh I see the comments now in the "split data section", sorry
my brain sees light gray text and goes "nah, not important"
There, I made it colorful :smile:
thanks :innocent:
Splitting passive data segments is in part to help dedupe data, but we've also discussed it in terms of local development tooling that makes it easier to work with "static assets"
i think we could just opt for all data segments, active or passive; the active ones will rarely change with a stable "virtualizer" implementation, i would hope
it'd also mean fewer things to record if we don't need to differentiate between active and passive in the split format; just keep it a section header of (id, size, digest, len) followed by an array of size len containing (size, digest) for each segment in order
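i.e., roughly this shape (just a sketch; field names and digest width are made up, not from the proposal):

```rust
// Sketch only: a split data section is a header describing the original
// section, followed by one (size, digest) entry per data segment, in order.
struct SplitDataSection {
    id: u8,                      // original section id (11 for the data section)
    size: u32,                   // byte size of the original, unsplit section
    digest: [u8; 32],            // digest of the whole original section
    segments: Vec<SegmentEntry>, // `len` entries, one per segment, in order
}

struct SegmentEntry {
    size: u32,        // byte size of the segment's contents
    digest: [u8; 32], // digest used to look up the out-of-lined contents
}
```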
:thumbs_up: Seems reasonable. I'll write that up.
Can we define the canonical splitting as splitting out all segments, but allow split wasms to inline things that are small if they want?
Yes, good point. You only have to follow the canonical split strictly in order to hash, and that only requires a little bookkeeping in memory
ok so perhaps each entry in the segments array could have a preamble byte that signifies inline or split? with inline being just vec(byte) (it doesn't need its own digest since the bytes are inline) so we don't recreate the entire data section format here?
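e.g., the per-segment entry would become something like (again just a sketch; the preamble byte values are placeholders):

```rust
// Sketch: a preamble byte selects inline vs. split; only the split case
// carries a digest, since inline bytes are present verbatim.
enum SegmentEntry {
    Inline(Vec<u8>),                       // preamble 0x00: raw bytes follow
    Split { size: u32, digest: [u8; 32] }, // preamble 0x01: fetch bytes by digest
}
```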
The notation is a little wonky but I think it gets the basic idea across: https://hackmd.io/AKlx0_jYQoSIqnOKk3lJSg?both#Split-data-section
As an alternative the segment headers could be encoded generically as a seghdr:vec(byte)
, which would reduce required knowledge of the core spec for the splice algo, but not the split algo
One of the reasons I was preferring just a vec(byte)
to core:data
is that the latter requires the splicer to know the format of core:data
to read how many bytes it needs to just copy into the final artifact; it itself doesn't really care if it's a passive or active segment.
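e.g., with an opaque seghdr the splice step can be a dumb copy loop, roughly like this (sketch; type and helper names are made up, and it glosses over the outer section header and LEB128 length prefixes):

```rust
// Sketch of a "dumb" splicer: it rebuilds the data section contents by copying
// the opaque segment header bytes verbatim, then either the inline bytes or the
// blob fetched by digest -- without ever parsing core:data itself.
struct SplitEntry {
    seghdr: Vec<u8>,    // opaque segment header bytes, copied as-is
    contents: Contents, // inline or out-of-line segment contents
}

enum Contents {
    Inline(Vec<u8>),
    Split { size: u32, digest: [u8; 32] },
}

fn splice_segments(
    entries: &[SplitEntry],
    fetch: impl Fn(&[u8; 32]) -> Vec<u8>, // looks up out-of-lined bytes by digest
) -> Vec<u8> {
    let mut out = Vec::new();
    for e in entries {
        out.extend_from_slice(&e.seghdr);
        match &e.contents {
            Contents::Inline(bytes) => out.extend_from_slice(bytes),
            Contents::Split { digest, .. } => out.extend_from_slice(&fetch(digest)),
        }
    }
    out
}
```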
Perhaps this format of a split section could be generalized to a "split item section" where it can represent any section containing items that may be split out?
I can do that, but it only really simplifies the splice algorithm. Canonical split still needs to know how much segment header to copy before it gets to the "real" content
right, the splitter will need to understand the item format to even calculate the hash / determine what can be inlined vs. split, but i was imagining the splicer to be a (somewhat) dumb adapter that sits between the split file format and something like wasmparser
, which will understand the item format and do the validation. i don't have a strong opinion on it.
Yeah I suppose there is some advantage in simplifying splice if the hash calculation is handled by some outer process
I'll make another pass on that section of the proposal tomorrow
that said, i don't think we'll really need a generalized "split item section" if ultimately the data section is the one we do the most splitting on (at least in terms of sections with a vec of items; module and component sections in a component will obviously also support splitting, but those are single-item sections)
I started writing a TODO to myself and just made the change instead
Updated the proposal to use a single "split section" ID that includes the original section ID as its first field. This is more consistent with the new split data segment approach and reduces the impact on the core spec to a single reserved section ID. Also tried to clarify throughout that splitting is mandatory for hash calculation but optional for actual content storage.
If this looks good I'll plan on updating my old prototype to implement this proposal
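For illustration, the layout now looks roughly like this (sketch; the names are mine, not the proposal's):

```rust
// Sketch of the updated layout: one reserved section id covers every split
// section, and the original section id is carried as the first field inside.
struct SplitSection {
    original_id: u8,           // e.g. 11 for the core data section
    original_size: u32,        // byte size of the original, unsplit section
    original_digest: [u8; 32], // digest of the original section contents
    entries: Vec<SplitEntry>,  // per-item entries, each inline or split out
}
```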
nice job! i'm trying to understand the implications of this optimization in datasegmentsplitopt
in which the bytes can either be stored inline or out-of-line, while the hash is always calculated as if they're stored out-of-line. so, iiuc, when i start with a hash (either for the root component stored in a registry or from a typeddigest
), it's not simply the content hash of the blob, it's the content hash of the blob as if the blob had out-of-lined all its data segments. iiuc, what this means is that we need the storage service used to store registry contents to be a general key-value store (because, from the storage service's perspective, our split-content-hash is just an arbitrary string), as opposed to a content-addressable system. if that's right, does that perhaps limit the systems we can use to store split components and rule out pure content-addressed systems?
(fwiw, i was assuming that, to have this maybe-inline-maybe-not optimization, we'd need to specify the exact criteria in the canonical splitting algorithm so that our hashes are always the hashes of the actual bytes.)
Yeah, if you wanted to use a content-addressed store "natively" you would need to stick to the "canonical fully split" form. That does make the inlining option less appealing.
I don't have a good sense of how many data segments we should expect in general. I've seen some pretty heavy fragmentation from wizer+dotnet iirc; if there are a lot of small segments the OCI protocol overhead might eat away at the deduplication savings
I guess that would be an argument in favor of splitting only passive segments
I think the canonical split algorithm could also make the canonical decision of whether to split or not based on byte length (e.g.: data segments don't get split out unless they are bigger than X bytes). if we version the split format up front, then we could have different magic constants and criteria for different versions over time based on experience
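something like this (sketch; the threshold value is purely a placeholder):

```rust
// Sketch: the canonical algorithm's split-or-inline decision depends only on
// the segment's byte length; the threshold is tied to the split format version
// so it can change in later versions without breaking existing hashes.
const SPLIT_THRESHOLD_V1: usize = 4096; // placeholder, not a proposed value

fn canonical_should_split(segment_len: usize) -> bool {
    segment_len > SPLIT_THRESHOLD_V1
}
```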
@Peter Huene was your "I think so" in zoom a pro-splitting stance?
oh sorry, no, that was in response to the question of whether clients would cache the content-addressed data by digest so that it can be reused between multiple split components
but i am pro-splitting in general if that's the question
sorry i missed some context having to chase the dog down the street
thankfully we (i rounded up two neighbors to help) caught her