@Peter Huene (and anyone else): I finally got around to writing up a draft proposal for "split wasm". The biggest open questions for me are how to handle data segments, including whether we want to split all of them or just passives: https://hackmd.io/@lann/HJ_f7geai
I'll look it over on Monday as Fastly has the day off today.
This looks great, thanks for writing it up! Regarding the active/passive segments open question, I think we can split the whole section as one for now.
The active segments would come from the code that virtualizes the file system, while, in theory, the passive segments are from the files being virtualized. So a change to the "virtualizer" would mean a new hash of the section even if the files being virtualized don't change, but I would assume that would happen fairly infrequently, and a new hash might even be desired if the virtualizer code changes how it interprets the passive segments.
@Peter Huene I think one of the goals of this splitting is for each virtualized file's content to be its own "fragment", which I believe implies splitting out individual data segments, right?
individual data segments, yes, but there's only one section
i don't know how this scheme would apply to items within a section
I'm just trying to parse your comment "split the whole section as one"
meaning the whole of the data section
So split the top-level section and then split out its segments as well?
ok so i see what you mean with wanting to split each passive segment, so we can reuse storage of large files that are seldom updated; do you propose extending this spec for a specific format of the data section?
as the way i'm reading it right now, the split data section is split in its entirety (a single entry vs. a vector of split data segment items)
Oh yes, I rewrote that a couple of times and the heading is poorly worded in its current state
Oh I see the comments now in the "split data section", sorry
my brain sees light gray text and goes "nah, not important"
There, I made it colorful :smile:
thanks :innocent:
Splitting passive data segments is in part to help dedupe data, but we've also discussed it in terms of local development tooling that makes it easier to work with "static assets"
i think we could just opt for all data segments, active or passive; the active ones will rarely change with a stable "virtualizer" implementation, i would hope
it'd also mean fewer things to record if we don't need to differentiate between active and passive in the split format; just keep it a section header of (id, size, digest, len) followed by an array of size len containing (size, digest) for each segment in order
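roughly something like this as a Rust-ish sketch (field names and the fixed 32-byte digest are just illustrative, nothing here is specified yet):

```rust
// Rough sketch of the split data section layout described above.
// Names are placeholders; the digest is shown as a fixed 32-byte hash.
struct SplitDataSection {
    id: u8,                      // original core section id (data section = 11)
    size: u32,                   // size of the original, unsplit section in bytes
    digest: [u8; 32],            // presumably the digest of the original section contents
    segments: Vec<SegmentEntry>, // `len` entries, one per data segment, in order
}

struct SegmentEntry {
    size: u32,        // byte length of this segment's data
    digest: [u8; 32], // digest of this segment's data
}
```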
:thumbs_up: Seems reasonable. I'll write that up.
Can we define the canonical splitting as splitting out all segments, but allow split wasms to inline things that are small if they want?
Yes, good point. You only have to follow the canonical split strictly in order to hash, and that only requires a little bookkeeping in memory
ok so perhaps each entry in the segments array could have a preamble byte that signifies inline or split? with inline being just vec(byte) (it doesn't need its own digest since the bytes are inline) so we don't recreate the entire data section format here?
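e.g. something like this (again just a sketch; the names and the discriminant values are made up):

```rust
// Sketch of a per-segment entry with a one-byte inline/split discriminant.
enum SegmentData {
    // preamble byte 0x00: content stored out-of-line, referenced by digest
    Split { size: u32, digest: [u8; 32] },
    // preamble byte 0x01: small content kept inline as a plain vec(byte);
    // no digest needed since the bytes are right here
    Inline(Vec<u8>),
}
```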
The notation is a little wonky but I think it gets the basic idea across: https://hackmd.io/AKlx0_jYQoSIqnOKk3lJSg?both#Split-data-section
As an alternative, the segment headers could be encoded generically as a seghdr:vec(byte), which would reduce required knowledge of the core spec for the splice algo, but not the split algo
One of the reasons I was preferring just a vec(byte) to core:data is that the latter requires the splicer to know the format of core:data to read how many bytes it needs to just copy into the final artifact; the splicer itself doesn't really care if it's a passive or active segment.
Perhaps this format of a split section could be generalized to a "split item section" where it can represent any section containing items that may be split out?
I can do that, but it only really simplifies the splice algorithm. Canonical split still needs to know how much segment header to copy before it gets to the "real" content
right, the splitter will need to understand the item format to even calculate the hash / determine what can be inlined vs. split, but i was imagining the splicer to be a (somewhat) dumb adapter that sits between the split file format and something like wasmparser, which will understand the item format and validate it. i don't have a strong opinion on it.
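roughly like this, reusing the SegmentData sketch from above (the header-as-vec(byte) alternative and the `fetch` callback are assumptions, not anything settled):

```rust
// Sketch of a "dumb" splicer for one segment: it copies the generically
// encoded segment header bytes verbatim, then appends either the inline
// bytes or the out-of-line content fetched by digest, without ever
// parsing core:data itself.
fn splice_segment(
    header: &[u8],                        // seghdr encoded as vec(byte), copied as-is
    data: &SegmentData,                   // inline bytes or an out-of-line digest
    fetch: &dyn Fn(&[u8; 32]) -> Vec<u8>, // content store lookup (placeholder)
    out: &mut Vec<u8>,                    // the data section being reassembled
) {
    out.extend_from_slice(header);
    match data {
        SegmentData::Inline(bytes) => out.extend_from_slice(bytes),
        SegmentData::Split { digest, .. } => out.extend_from_slice(&fetch(digest)),
    }
}
```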
Yeah I suppose there is some advantage in simplifying splice if the hash calculation is handled by some outer process
I'll make another pass on that section of the proposal tomorrow
that said, i don't think we'll really need a generalized "split item section" if ultimately it's just the data section we do the most splitting on (at least in terms of sections with a vec of items; modules and component sections in a component will obviously also support splitting, but those are single item)
I started writing a TODO to myself and just made the change instead
Updated the proposal to use a single "split section" ID that includes the original section ID as its first field. This is more consistent with the new split data segment approach and reduces the impact on the core spec to a single reserved section ID. Also tried to clarify throughout that splitting is mandatory for hash calculation but optional for actual content storage.
If this looks good I'll plan on updating my old prototype to implement this proposal
nice job! i'm trying to understand the implications of this optimization in datasegmentsplitopt, in which the bytes can either be stored inline or out-of-line, while the hash is always calculated as-if it's stored out-of-line. so, iiuc, when i start with a hash (either for the root component stored in a registry or from a typeddigest), it's not simply the content hash of the blob, it's the content hash of the blob as-if the blob had out-of-lined all its data segments. iiuc, what this means is that we need the storage service used to store registry contents to be a general key-value store (because, from the storage service's perspective, our split-content-hash is just an arbitrary string), as opposed to a content-addressable system. if that's right, does that perhaps limit us in the systems we can use to store split components and rule out pure content-addressed systems?
(fwiw, i was assuming that, to have this maybe-inline-maybe-not optimization, we'd need to specify the exact criteria in the canonical splitting algorithm so that our hashes are always the hashes of the actual bytes.)
Yeah, if you wanted to use a content-addressed store "natively" you would need to stick to the "canonical fully split" form. That does make the inlining option less appealing.
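To make the as-if hashing concrete (reusing the SegmentData sketch from above; sha256 and the field order are just example choices, not what the proposal pins down):

```rust
use sha2::{Digest, Sha256};

// Sketch: the "split content hash" hashes every segment as-if it were
// stored out-of-line (size + digest), so inlining a segment's bytes in
// the stored artifact doesn't change the hash. That also means the stored
// bytes no longer hash to the key a content-addressed store would derive
// for them on its own.
fn hash_segments(segments: &[SegmentData]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    for seg in segments {
        let (size, digest): (u32, [u8; 32]) = match seg {
            SegmentData::Split { size, digest } => (*size, *digest),
            // inline bytes get hashed as-if they had been split out
            SegmentData::Inline(bytes) => (bytes.len() as u32, Sha256::digest(bytes).into()),
        };
        hasher.update(size.to_le_bytes());
        hasher.update(digest);
    }
    hasher.finalize().into()
}
```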
I don't have a good sense of how many data segments we should expect in general. I've seen some pretty heavy fragmentation from wizer+dotnet iirc; if there are a lot of small segments the OCI protocol overhead might eat away at the deduplication savings
I guess that would be an argument in favor of splitting only passive segments
I think the canonical split algorithm could also make the canonical decision of whether to split or not based on byte length (e.g.: data segments don't get split out unless they are bigger than X bytes). if we version the split format up front, then we could have different magic constants and criteria for different versions over time based on experience
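e.g. something like this (constants are purely illustrative, nothing decided):

```rust
// Sketch: the canonical split decision as a pure function of the split
// format version and the segment's byte length, so every implementation
// of the canonical algorithm reaches the same answer for the same input.
fn canonically_split(format_version: u32, segment_len: usize) -> bool {
    let threshold = match format_version {
        1 => 4096, // hypothetical v1 cutoff: only split segments bigger than 4 KiB
        _ => 4096, // later versions could tune this based on experience
    };
    segment_len > threshold
}
```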
@Peter Huene was your "I think so" in zoom a pro-splitting stance?
oh sorry, no that was in response to the question if clients would cache the content-addressed data by the digest so that it can be reused between multiple split components
but i am pro-splitting in general if that's the question
sorry i missed some context having to chase the dog down the street
thankfully we (i rounded up two neighbors to help) caught her