Stream: git-wasmtime

Topic: wasmtime / issue #4155 Incremental compilation cache in C...


view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 16:55):

bnjbvr opened issue #4155:

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

Implementation

To make this really efficient, the caching system should be resilient to _non-meaningful_ changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

(more implementations strategies / design choices below)

Alternatives

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 16:55):

bnjbvr labeled issue #4155:

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

Implementation

To make this really efficient, the caching system should be resilient to _non-meaningful_ changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

(more implementations strategies / design choices below)

Alternatives

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 16:55):

bnjbvr labeled issue #4155:

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

Implementation

To make this really efficient, the caching system should be resilient to _non-meaningful_ changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

(more implementations strategies / design choices below)

Alternatives

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 16:55):

bnjbvr assigned issue #4155 (assigned to bnjbvr):

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

Implementation

To make this really efficient, the caching system should be resilient to _non-meaningful_ changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

(more implementations strategies / design choices below)

Alternatives

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 16:57):

bnjbvr commented on issue #4155:

I should add: I've started working on this. I already have something that works for the simplest case, that is, changing a single statement in the source code and recompiling only recompiles the one function that changed. I'm now looking into making it resilient when adding/removing functions internally-defined in the same wasm module (viz. not imported functions yet).

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 17:10):

bjorn3 commented on issue #4155:

Can you hash the entirety of Function as cache key and instead ensure that the Function only changes when there are meaningful changes to the input wasm? For example you could renumber all FuncRef starting from 0 in a deterministic way. So for example a function calling functions 5 and 3 could get FuncRef(0) and FuncRef(1) to reference functions 5 and 3 respectively. If a new function shifts said functions to offset 6 and 4, you would still get FuncRef(0) and FuncRef(1), but a different mapping table. This should even be resilient against a completely different function with the same signature being called as the FuncRef's wouldn't change. For VMContext offsets you could introduce a new GlobalValueData variant containing a constant allowing relocating of VMContext offsets after compilation. For source locations you could tell cranelift-codegen the source locations relative to the start of the function and then add the function offset when actually using them.

Also I think this caching implementation can be kept in a separate crate. There is already a cranelift-codegen feature to enable serde serialization and deserialization of Function, which should be enough.

Probably a bit simpler, but would not benefit non-wasmtime consumers like cg_clif (... which may not need it, because it might benefit from rustc's incremental compilation framework, likely?)

Cg_clif benefits from rustc's incremental compilation framework, but that one only works at the level of codegen units for codegen which is less fine grained than individual functions.

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 17:14):

cfallin commented on issue #4155:

Thanks for writing all of this up, @bnjbvr! I am very excited to see how this works out, and early results already show that it should be a huge improvement for users with tight edit-compile-reload loops.

Two things I wanted to add:

  1. A little more detail on the "relocations" idea that we came up with, both to document it here and to see if others can think of related ideas or any issues. The basic idea is that parts of input (CLIF) to Cranelift may vary slightly in unimportant ways when other parts of a Wasm module are changed. These are "renumbering" kinds of problems: the shape of the CLIF is the same, but particular values change.

    Two particular examples are: SourceLocs (e.g. all the source code for the slightly-edited input Wasm module shifts down by one line) and offsets in Wasmtime's vmctx struct at which imported function structs are loaded (e.g. adding an import shifts all of these after the import down by 16 bytes).

    The basic idea is to make these constants "symbolic" somehow and feed them into the code as fixups after the point at which it's cached. (We called them "relocations" in our discussion but actually I like something like "constant parameters" a bit better now...) Inserting concrete values for constant parameters may involve fixups in the MachCompileResult's metadata (e.g. for line-number info) but also maybe in the machine code itself (e.g. a constant offset for a load).

    The latter case makes things interesting too: e.g. a load may go out-of-range (for example with the 12-bit offset for a reg+imm addressing mode on AArch64). In this case I think we can say that parameter insertion may "fail", and we just fall back to a compilation from scratch. So adding imports might hit a cliff at some point at which all the generated code needs to start using a more verbose access sequence to a huge vmctx, but that's OK.

  2. Regarding the caching API itself: I think I might actually slightly prefer an "inverted control" scheme here where the embedder controls all access to the KV store, rather than try to reuse something from Wasmtime. In particular, if the CLIF lets us get a "cache key", and if the compiler context exposes the two-step process "compile to cacheable object" and "finalize cacheable object given concrete params/actual CLIF/..." then one could write a wrapper around this that does the right thing with a KV store, but it's more flexible.

    The specific reason I favor this is that it keeps Cranelift agnostic to any notions of storage or IO; a compiler really should live as much as possible in the "pure function from data to data" space. And there may be other ways a user would want to use this functionality than we initially imagine and bake into a KV-store trait. Two particular cases that are hard are the C API, as you mention; but also e.g. async (what if the KV store is async? Do large parts of Cranelift become async too?).

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 17:14):

cfallin edited a comment on issue #4155:

Thanks for writing all of this up, @bnjbvr! I am very excited to see how this works out, and early results already show that it should be a huge improvement for users with tight edit-compile-reload loops.

Two things I wanted to add:

  1. A little more detail on the "relocations" idea that we came up with, both to document it here and to see if others can think of related ideas or any issues. The basic idea is that parts of input (CLIF) to Cranelift may vary slightly in unimportant ways when other parts of a Wasm module are changed. These are "renumbering" kinds of problems: the shape of the CLIF is the same, but particular values change.

    Two particular examples are: SourceLocs (e.g. all the source code for the slightly-edited input Wasm module shifts down by one line) and offsets in Wasmtime's vmctx struct at which imported function structs are loaded (e.g. adding an import shifts all of these after the import down by 16 bytes).

    The basic idea is to make these constants "symbolic" somehow and feed them into the code as fixups after the point at which it's cached. (We called them "relocations" in our discussion but actually I like something like "constant parameters" a bit better now...) Inserting concrete values for constant parameters may involve fixups in the MachCompileResult's metadata (e.g. for line-number info) but also maybe in the machine code itself (e.g. a constant offset for a load).

    The latter case makes things interesting too: e.g. a load may go out-of-range (for example with the 12-bit offset for a reg+imm addressing mode on AArch64). In this case I think we can say that parameter insertion may "fail", and we just fall back to a compilation from scratch. So adding imports might hit a cliff at some point at which all the generated code needs to start using a more verbose access sequence to a huge vmctx, but that's OK.

  2. Regarding the caching API itself: I think I might actually slightly prefer an "inverted control" scheme here where the embedder controls all access to the KV store, rather than try to reuse something from Wasmtime. In particular, if the CLIF lets us get a "cache key", and if the compiler context exposes the two-step process "compile to cacheable object" and "finalize cacheable object given concrete params/actual CLIF/..." then one could write a wrapper around this that does the right thing with a KV store, but it's more flexible.

    The specific reason I favor this is that it keeps Cranelift agnostic to any notions of storage or IO; a compiler really should live as much as possible in the "pure function from data to data" space. And there may be other ways a user would want to use this functionality than we initially imagine and bake into a KV-store trait. Two particular cases that are hard are the C API, as you mention; but also e.g. async (what if the KV store is async? Do large parts of Cranelift become async too?).

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2022 at 17:56):

alexcrichton commented on issue #4155:

From an internal design perspective I would recommend following what wasmtime-cache is already doing which is to have a function that looks like:

    pub fn get_data_raw<T, U, E>(
        &self,
        state: &T,
        // NOTE: These are function pointers instead of closures so that they
        // don't accidentally close over something not accounted in the cache.
        compute: fn(&T) -> Result<U, E>,
        serialize: fn(&T, &U) -> Option<Vec<u8>>,
        deserialize: fn(&T, Vec<u8>) -> Option<U>,
    ) -> Result<U, E>
    where
        T: Hash,
    {
        // ...
    }

Here state is explicitly Hash to generate a sha or something like that and the compute, serialize, and deserialize functions are all explicitly function pointers that cannot close over state. This means that the computation of U, the memoization of the compute computation over T, is guaranteed to reasonably be a function of T.

Ideally the wasmtime-cache bits could either be reused for this or somehow refactored to not duplicate this. Historically I have found that stale caches, especially with incremental compilation, are so difficult to debug and diagnose that getting the interface right so that it's practically never broken is extremely important.

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 15:12):

bnjbvr commented on issue #4155:

(Will use ICC for incremental compilation cache from now on :))

Can you hash the entirety of Function as cache key and instead ensure that the Function only changes when there are meaningful changes to the input wasm? For example you could renumber all FuncRef starting from 0 in a deterministic way. So for example a function calling functions 5 and 3 could get FuncRef(0) and FuncRef(1) to reference functions 5 and 3 respectively. If a new function shifts said functions to offset 6 and 4, you would still get FuncRef(0) and FuncRef(1), but a different mapping table. This should even be resilient against a completely different function with the same signature being called as the FuncRef's wouldn't change. For VMContext offsets you could introduce a new GlobalValueData variant containing a constant allowing relocating of VMContext offsets after compilation. For source locations you could tell cranelift-codegen the source locations relative to the start of the function and then add the function offset when actually using them.

I think that's a fine idea, but I wonder if this wouldn't bias the code towards the incremental cache a bit too much, making it non "zero-cost" for users who disabled the incremental compilation cache (ICC) feature? Say, when it comes to source locations, this means that first we'd transform all source locations into relative (from start) ones, then if the ICC isn't used, re-transform them back later into absolute source locations (for the MachRelocs to be correct) -- or carry different meanings for these fields whether the ICC is enabled or not. This back-and-forth wouldn't add any value to users not using the ICC, so I tend to slightly prefer the opposite choice which is to have bespoke "mirror" data structures for the compilation cache (which could contain plain references to Function's fields, when they don't need any change, and only introduce new ones when they do need internal changes).

The strategies you mention here can still be used, though. In fact the relative-from-start source location was what I implemented for that specific problem indeed, and I had a similar idea of a scheme for function refs.

Also I think this caching implementation can be kept in a separate crate. There is already a cranelift-codegen feature to enable serde serialization and deserialization of Function, which should be enough.

That would be really nice if it could be a separate crate, yet I am unclear if it's doable because currently the ICC uses lots of cranelift-codegen internals. Will keep this in mind, and we can reconsider this later when closer to the "final" design and implementation.

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 15:59):

bnjbvr commented on issue #4155:

@alexcrichton @cfallin I am trying to consolidate the two design suggestions, so rephrasing to make sure I've understood clearly:

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 16:06):

alexcrichton commented on issue #4155:

To clarify I'm talking more about methods of caching moreso than using specific code exactly as-is without any modifications. The strategy used by wasmtime-cache I feel is pretty solid and we should be quite deliberate about migrating away from that if we do.

I do think it would be a not-great state to be in to have two caching systems at the end of this, so I think in the long run we don't want two entirely separate systems for caching things that know nothing of each other. In the short-to-mid term though it's fine to develop whatever and refactor things later to share more code.

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 16:22):

bjorn3 commented on issue #4155:

Say, when it comes to source locations, this means that first we'd transform all source locations into relative (from start) ones, then if the ICC isn't used, re-transform them back later into absolute source locations (for the MachRelocs to be correct)

We already need transforming between the output format of cranelift and the serialized format of wasmtime. It is just a cheap addition which can be folded into this transformation. Basically just add the addition here: https://github.com/bytecodealliance/wasmtime/blob/cb13175c429d6b3e63c8122ce6427b4529a2873d/crates/cranelift/src/compiler.rs#L652

so I tend to slightly prefer the opposite choice which is to have bespoke "mirror" data structures for the compilation cache (which could contain plain references to Function's fields, when they don't need any change, and only introduce new ones when they do need internal changes).

That would make the incremental compilation cache depend on the original version of the function, right? I'm afraid that would turn hard to debug bugs like is normally the case for incremental compilation related bugs into straight up nightmares.

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 17:10):

bnjbvr commented on issue #4155:

It is just a cheap addition which can be folded into this transformation. Basically just add the addition here:

I guess it's not "just" that :-). In practice every transitive field that contains a SourceLoc in MachCompileResult should get the same treatment, so that also means tweaking mach_reloc_to_reloc, mach_trap_to_trap, etc. That also means moving the responsibility of doing those fixups from Cranelift to the consumer of the compiled artifact, when the initial intent was to make it as easy as possible for every Cranelift user to have something that works out of the box without requiring them to do extra work. (This argument has limited applicability as the two main users are wasmtime and cg_clif, but still hold in theory at least.)

I do hear your concerns about running into incremental compilation related bugs, though, which would be mitigated by the mix of static typing sanity + fuzzing; any other idea to make it more robust is appreciated here too.

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 17:14):

cfallin commented on issue #4155:

To unify things a bit: I think that we could, and probably should, use the same caching mechanism for both the Wasmtime module-level cache and any Cranelift function-level cache. I was mainly concerned above about flexibility (unsure with a synchronous API like wasmtime-cache: how would one use an async external storage provider, e.g. a key-value store across a network?).

I agree with @alexcrichton that we should design to make it impossible to get cache keys wrong. To that end if we have a "put together the pieces yourself" option, I think that we can still make it completely safe by storing the expected CLIF hash in the cached value as well, and checking that on return. In essence the interface that I think provides maximum flexibility is something like:

This is fully "dynamically safe" in that the cached data is only used if the cache key matches, and the cache key production is internal to Cranelift and opaque to outside. We can then use the type-safety ideas that Ben mentions (wrapper types on fields) to make that CacheKey correct.

This together with the fuzzing to verify correctness of incremental compilation I suspect is enough, and this whole approach grants the flexibility that Ben's case needs.

Then tying this back to Wasmtime: wasmtime-cache could potentially use these building blocks to cache functions as well as whole modules; or an advanced user of Cranelift could use them directly. Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2022 at 17:15):

bjorn3 commented on issue #4155:

I guess it's not "just" that :-). In practice every transitive field that contains a SourceLoc in MachCompileResult should get the same treatment, so that also means tweaking mach_reloc_to_reloc, mach_trap_to_trap, etc.

mach_reloc_to_reloc ignores srcloc and so does mach_trap_to_trap. collect_address_maps seems to be the only that actually reads srcloc.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2022 at 13:41):

bnjbvr commented on issue #4155:

mach_reloc_to_reloc ignores srcloc and so does mach_trap_to_trap. collect_address_maps seems to be the only that actually reads srcloc.

Indeed, I even started a patch to remove those as they're really unused in the code base,... and then I recalled that Spidermonkey would use those (if it integrated with the new backend). Anyhow, I don't still think it would make for a good API to request any cranelift embedder to do this kind of transform themselves, when it can just be done in Cranelift directly, so I'm going to stick with my plan (which is more aligned with what Chris suggests too), incorporating the good ideas from yours :)

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2022 at 13:54):

bjorn3 commented on issue #4155:

Indeed, I even started a patch to remove those as they're really unused in the code base,... and then I recalled that Spidermonkey would use those (if it integrated with the new backend).

mach_reloc_to_reloc and friends are in wasmtime-cranelift so spidermonkey can't use them even if it wanted to. They are the translation between Cranelift's representation and Wasmtime's serializable representation of the compilation results.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2022 at 13:56):

bnjbvr commented on issue #4155:

I was talking about the srcloc fields in MachTrap et al. In fact I've received confirmation from the Spidermonkey folks that they might not need it, so will open a PR to remove these, as an incremental step towards this.

(Sorry all for the noise caused by this sidetracked discussion, and @bjorn3 happy to follow up in Zulip if you'd like to chat more about this.)

view this post on Zulip Wasmtime GitHub notifications bot (Aug 12 2022 at 16:47):

bnjbvr closed issue #4155 (assigned to bnjbvr):

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

Implementation

To make this really efficient, the caching system should be resilient to _non-meaningful_ changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

(more implementations strategies / design choices below)

Alternatives


Last updated: Oct 23 2024 at 20:03 UTC