cfallin opened PR #8870 from cfallin:direct-mapped-indirect-cache
to bytecodealliance:main
:
Currently, the indirect-call cache operates on a basis of one slot per callsite: each
call_indirect
instruction in a module, up to a limit, has its own slot of storage in the VM context struct that caches a called-table-index to called-raw-function-pointer mapping.This is fine but means the storage requirement scales with the module size; hence "up to a limit" above. It also means that each callsite needs to "warm up" separately, whereas we could in theory reuse the resolved index->code pointer mapping for the same index across callsites.
This PR switches instead to a "direct-mapped cache": that is, we have a fixed number of cache slots per table per instance, of user-configurable count, and we look in a slot selected by the called table index (modulo the cache size). As before, if the "tag" (cache key, called table index) matches, we use the "value" (raw code pointer).
The main advantage of this scheme, and my motivation for making the switch, is that the storage size is fixed and quite small, even for arbitrarily-large modules: for example, on a 64-bit platform with 12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with one funcptr table, a 1K-entry cache uses 12KiB per instance. That's much smaller than the total VMFuncRef array size in large modules and should be no problem. My goal in getting to this constant size offset is that turning this on by default eventually will be easier to justify, and that we won't have unexpected perf cliffs for callsites beyond a certain index.
This also means that if one callsite resolves index 23 to some raw code pointer, other callsites that call index 23 also receive a "hit" from that warmup. This could be beneficial when there are many callsites but a relatively smaller pool of called functions (e.g., ICs).
The downside to caching indexed on callee rather than callsite is that if there are a large number of callees, we can expect more cache conflicts and hence misses. (If funcref table indices 1 and 1025 are both frequently called, a 1024-entry direct-mapped cache will thrash.) But I expect with ICs in particular to have a lot of callsites and relatively few (shared) callees.
On Octane-compiled-to-Wasm with my JS AOT compilation tooling using
call_indirect
for all ICs, I see: baseline score (higher is better, proportional to runtime speed) of 2406, score with old one-entry-per-callsite scheme of 2479, score with this scheme of
2509. So it's slightly faster as well, probably due to a combination of the warmup benefit and a smaller cache footprint, even with the more involved logic to compute the slot address. (This also tells me the benefit of this cache is smaller than I had originally measured on a microbenchmark (20%) -- at 5% on all of Octane -- but that's still worth it, IMHO.)(*) Note that slots are not actually contiguous: I did a
struct-of-arrays trick, separating cache tags from cache values, so
that the assembly lowering can use scaling amodes (vmctx + offset + 4*idx
for u32 accesses, and8*idx
for u64 accesses) for more
efficient code.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested wasmtime-fuzz-reviewers for a review on PR #8870.
cfallin requested alexcrichton for a review on PR #8870.
cfallin requested wasmtime-core-reviewers for a review on PR #8870.
cfallin edited PR #8870:
Currently, the indirect-call cache operates on a basis of one slot per callsite: each
call_indirect
instruction in a module, up to a limit, has its own slot of storage in the VM context struct that caches a called-table-index to called-raw-function-pointer mapping.This is fine but means the storage requirement scales with the module size; hence "up to a limit" above. It also means that each callsite needs to "warm up" separately, whereas we could in theory reuse the resolved index->code pointer mapping for the same index across callsites.
This PR switches instead to a "direct-mapped cache": that is, we have a fixed number of cache slots per table per instance, of user-configurable count, and we look in a slot selected by the called table index (modulo the cache size). As before, if the "tag" (cache key, called table index) matches, we use the "value" (raw code pointer).
The main advantage of this scheme, and my motivation for making the switch, is that the storage size is fixed and quite small, even for arbitrarily-large modules: for example, on a 64-bit platform with 12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with one funcptr table, a 1K-entry cache uses 12KiB per instance. That's much smaller than the total VMFuncRef array size in large modules and should be no problem. My goal in getting to this constant size offset is that turning this on by default eventually will be easier to justify, and that we won't have unexpected perf cliffs for callsites beyond a certain index.
This also means that if one callsite resolves index 23 to some raw code pointer, other callsites that call index 23 also receive a "hit" from that warmup. This could be beneficial when there are many callsites but a relatively smaller pool of called functions (e.g., ICs).
The downside to caching indexed on callee rather than callsite is that if there are a large number of callees, we can expect more cache conflicts and hence misses. (If funcref table indices 1 and 1025 are both frequently called, a 1024-entry direct-mapped cache will thrash.) But I expect with ICs in particular to have a lot of callsites and relatively few (shared) callees.
On Octane-compiled-to-Wasm with my JS AOT compilation tooling using
call_indirect
for all ICs, I see: baseline score (higher is better, proportional to runtime speed) of 2406, score with old one-entry-per-callsite scheme of 2479, score with this scheme of
2509. So it's slightly faster as well, probably due to a combination of the warmup benefit and a smaller cache footprint, even with the more involved logic to compute the slot address. (This also tells me the benefit of this cache is smaller than I had originally measured on a microbenchmark (20%) -- at 5% on all of Octane -- but that's still worth it, IMHO.)(*) Note that slots are not actually contiguous: I did a struct-of-arrays trick, separating cache tags from cache values, so that the assembly lowering can use scaling amodes (
vmctx + offset + 4*idx
for u32 accesses, and8*idx
for u64 accesses) for more efficient code.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin edited PR #8870:
Currently, the indirect-call cache operates on a basis of one slot per callsite: each
call_indirect
instruction in a module, up to a limit, has its own slot of storage in the VM context struct that caches a called-table-index to called-raw-function-pointer mapping.This is fine but means the storage requirement scales with the module size; hence "up to a limit" above. It also means that each callsite needs to "warm up" separately, whereas we could in theory reuse the resolved index->code pointer mapping for the same index across callsites.
This PR switches instead to a "direct-mapped cache": that is, we have a fixed number of cache slots per table per instance, of user-configurable count, and we look in a slot selected by the called table index (modulo the cache size). As before, if the "tag" (cache key, called table index) matches, we use the "value" (raw code pointer).
The main advantage of this scheme, and my motivation for making the switch, is that the storage size is fixed and quite small, even for arbitrarily-large modules: for example, on a 64-bit platform with 12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with one funcptr table, a 1K-entry cache uses 12KiB per instance. That's much smaller than the total VMFuncRef array size in large modules and should be no problem. My goal in getting to this constant size offset is that turning this on by default eventually will be easier to justify, and that we won't have unexpected perf cliffs for callsites beyond a certain index.
This also means that if one callsite resolves index 23 to some raw code pointer, other callsites that call index 23 also receive a "hit" from that warmup. This could be beneficial when there are many callsites but a relatively smaller pool of called functions (e.g., ICs).
The downside to caching indexed on callee rather than callsite is that if there are a large number of callees, we can expect more cache conflicts and hence misses. (If funcref table indices 1 and 1025 are both frequently called, a 1024-entry direct-mapped cache will thrash.) But I expect with ICs in particular to have a lot of callsites and relatively few (shared) callees.
On Octane-compiled-to-Wasm with my JS AOT compilation tooling using
call_indirect
for all ICs, I see: baseline score (higher is better, proportional to runtime speed) of 2406, score with old one-entry-per-callsite scheme of 2479, score with this scheme of 2509. So it's slightly faster as well, probably due to a combination of the warmup benefit and a smaller cache footprint, even with the more involved logic to compute the slot address. (This also tells me the benefit of this cache is smaller than I had originally measured on a microbenchmark (20%) -- at 5% on all of Octane -- but that's still worth it, IMHO.)(*) Note that slots are not actually contiguous: I did a struct-of-arrays trick, separating cache tags from cache values, so that the assembly lowering can use scaling amodes (
vmctx + offset + 4*idx
for u32 accesses, and8*idx
for u64 accesses) for more efficient code.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin updated PR #8870.
cfallin updated PR #8870.
cfallin updated PR #8870.
cfallin updated PR #8870.
github-actions[bot] commented on PR #8870:
Subscribe to Label Action
cc @fitzgen
<details>
This issue or pull request has been labeled: "fuzzing", "wasmtime:api", "wasmtime:config", "wasmtime:ref-types"Thus the following users have been cc'd because of the following labels:
- fitzgen: fuzzing, wasmtime:ref-types
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
github-actions[bot] commented on PR #8870:
Label Messager: wasmtime:config
It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:
[ ] If you added a new
Config
method, you wrote extensive documentation for
it.<details>
Our documentation should be of the following form:
```text
Short, simple summary sentence.More details. These details can be multiple paragraphs. There should be
information about not just the method, but its parameters and results as
well.Is this method fallible? If so, when can it return an error?
Can this method panic? If so, when does it panic?
Example
Optional example here.
```</details>
[ ] If you added a new
Config
method, or modified an existing one, you
ensured that this configuration is exercised by the fuzz targets.<details>
For example, if you expose a new strategy for allocating the next instance
slot inside the pooling allocator, you should ensure that at least one of our
fuzz targets exercises that new strategy.Often, all that is required of you is to ensure that there is a knob for this
configuration option in [wasmtime_fuzzing::Config
][fuzzing-config] (or one
of its nestedstruct
s).Rarely, this may require authoring a new fuzz target to specifically test this
configuration. See [our docs on fuzzing][fuzzing-docs] for more details.</details>
[ ] If you are enabling a configuration option by default, make sure that it
has been fuzzed for at least two weeks before turning it on by default.[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html
<details>
To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.
To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.</details>
alexcrichton submitted PR review:
Before I dive too deep into this, I don't think that what's implemented is sufficient. For example this code:
(module (func (export "_start") i32.const 0 call_indirect i32.const 0 call_indirect (result i32) drop ) (func $a) (table 10 10 funcref) (elem (offset (i32.const 1)) func $a) )
runs as:
$ cargo run -q run -O cache-call-indirects=n foo.wat Error: failed to run main module `foo.wat` Caused by: 0: failed to invoke command default 1: error while executing at wasm backtrace: 0: 0x3a - <unknown>!<wasm function 0> 2: wasm trap: uninitialized element $ cargo run -q run -O cache-call-indirects=y foo.wat zsh: segmentation fault cargo run -q run -O cache-call-indirects=y foo.wat
Notably I think that the cache key can't just be the index but must also be the type used as well? I'm not sure how much the load of the VMSharedTypeIndex will affect perf, so I figure it's probably best to figure this out first before diving too deep in review.
alexcrichton submitted PR review:
Before I dive too deep into this, I don't think that what's implemented is sufficient. For example this code:
(module (func (export "_start") i32.const 0 call_indirect i32.const 0 call_indirect (result i32) drop ) (func $a) (table 10 10 funcref) (elem (offset (i32.const 1)) func $a) )
runs as:
$ cargo run -q run -O cache-call-indirects=n foo.wat Error: failed to run main module `foo.wat` Caused by: 0: failed to invoke command default 1: error while executing at wasm backtrace: 0: 0x3a - <unknown>!<wasm function 0> 2: wasm trap: uninitialized element $ cargo run -q run -O cache-call-indirects=y foo.wat zsh: segmentation fault cargo run -q run -O cache-call-indirects=y foo.wat
Notably I think that the cache key can't just be the index but must also be the type used as well? I'm not sure how much the load of the VMSharedTypeIndex will affect perf, so I figure it's probably best to figure this out first before diving too deep in review.
alexcrichton created PR review comment:
This should use
ptr_size
to handle cross-compiles between 32/64-bit (it's somewhere onVMOffsets
, I forget which)
alexcrichton created PR review comment:
Also could this and the above go ahead and use
ishl
instead ofimul_imm
?
alexcrichton commented on PR #8870:
Er I actually flubbed the test a bit and discovered a bug I didn't mean to discover. This is what I intended:
(module (func (export "_start") i32.const 1 ;; not 0 unlike above call_indirect i32.const 1 ;; not 0 unlike above call_indirect (result i32) drop ) (func $a) (table 10 10 funcref) (elem (offset (i32.const 1)) func $a) )
which runs successfully with
-O cache-call-indirects
but fails without it due to a signature mismatch error. This is what I was trying to reproduce above and is my comment about how the cache key needs to take the tag into account as well. Otherwise though I'm not actually sure what the segfault above is, but that might be more benign.
Out of curiosity I was also curious, had I not caught this, if fuzzing would catch it. Forcing differential execution of Wasmtime-against-Wasmtime and module generation with
wasm-smith
this isn't showing up after ~10 minutes of 30 cores fuzzing. This might be something where cache-call-indirects isn't getting a hit on the fuzz modules being configured (does it require the table min == max? I forget...). Basically our fuzz coverage of this may not be that great right now (although oss-fuzz is generally better about coverage than local runs)
cfallin commented on PR #8870:
Notably I think that the cache key can't just be the index but must also be the type used as well? I'm not sure how much the load of the VMSharedTypeIndex will affect perf, so I figure it's probably best to figure this out first before diving too deep in review.
Ah, could you say more why the type is necessary? In my (possibly naive) view, we are choosing the cache slot to look at a different way, but nothing else should change (obviously there is some breakage here which I'll take a look at -- team meeting next two days so slight delay).
alexcrichton commented on PR #8870:
The behavior of
call_indirect
is dependent on the table being used, the runtime index, and the type ascribed oncall_indirect
, but the implementation here only takes two of these into account. The caches are per-table (I think? I didn't dive that deep) and the cache key handles the runtime index. Basically the previous behavior would implicitly take all three into account by having a cache-per-callsite. This implementation has where multiple call sites can share the same cache slot (intentional) but if each call site has different types ascribed to thecall_indirect
then they have to have different behavior.The basic tl;dr; though is that this PR as-is violates wasm-semantics and isn't memory safe. (insofar that you can successfully call a function with the wrong signature)
cfallin commented on PR #8870:
Ah, I completely missed the dynamic signature check -- will have to incorporate that probably by baking it into the cache key somehow. (I don't think we'll need a load on the hot path at all.) Solution TBD, thanks!
fitzgen submitted PR review:
Overall, modulo Alex's point about types, this looks great! I like this a lot better than having a cliff where call sites beyond the cliff don't get caching.
In addition to the point Alex made about types, I think it is worth thinking about (and testing!) the case where we have two tables, both sharing the same cache entries, but where one is a table of untyped
funcref
s and the other is a table of(ref $some_specific_func_type)
.In general, I think it is also worth adding a few combinatorial-y unit tests where we have
cache_size_log2 == 0
(i.e. a single, shared cache entry) and we hammer on that cache entry with a variety of different calls from different sites with the same and different expected types and via different tables with different types offuncref
s.
fitzgen submitted PR review:
Overall, modulo Alex's point about types, this looks great! I like this a lot better than having a cliff where call sites beyond the cliff don't get caching.
In addition to the point Alex made about types, I think it is worth thinking about (and testing!) the case where we have two tables, both sharing the same cache entries, but where one is a table of untyped
funcref
s and the other is a table of(ref $some_specific_func_type)
.In general, I think it is also worth adding a few combinatorial-y unit tests where we have
cache_size_log2 == 0
(i.e. a single, shared cache entry) and we hammer on that cache entry with a variety of different calls from different sites with the same and different expected types and via different tables with different types offuncref
s.
fitzgen created PR review comment:
Might want to clamp this value to a reasonable range?
fitzgen created PR review comment:
Nitpick: can we use the same variable names in the comments' pseudocode and the builder variables? Annoying to translate between them in my head.
cfallin updated PR #8870.
cfallin updated PR #8870.
cfallin commented on PR #8870:
Excellent finds on both -- fixes pushed to (i) tag cache entries with signature ID as well, (ii) check for null code pointer (the feature on
main
also doesn't catch this; not sure why fuzzing hasn't hit it yet). It's late so I'll test perf tomorrow. Hopefully still a win: the key advantage here should be avoiding the dependent cache misses inherent in the two-level pointer chase of theVMFuncRef
, but this is an objective question I can answer concretely soon :-)
cfallin submitted PR review.
cfallin created PR review comment:
Done both!
cfallin submitted PR review.
cfallin created PR review comment:
Done
cfallin submitted PR review.
cfallin created PR review comment:
Ah, it's already clamped in the method on
Config
, so preferred to let fuzzing maximally poke the API surface here (no special limits). I guess we could do something that avoids biasing the randomness (as-is, theArbitrary
range on theu8
maps 0..=255 to 0..=16 with 15/16ths of the space going to 16) but I don't think it's too important to evenly spread effort across sizes that way...
cfallin commented on PR #8870:
On testing the newest version of this PR, that correctly keeps a cache tag for signature ID and also correctly checks for null(*), I'm now seeing a 2.5% slowdown on Octane in AOT JS relative to no caching at all. So it seems this has crossed the instruction-count tradeoff threshold (the benchmarks have quite high IPC so it's about bulk of instruction stream, not cache misses).
Given the smaller delta I'm seeing now with indirect-call caching in full benchmarks, vs. the microbenchmark I was using to guide me at the time, I think I will actually opt to remove indirect-call caching altogether: the null handling in
main
's version is also incorrect (see footnote below) and I'd rather not have the complexity, perf cliff on module size, and vmctx size overhead if it's worth relatively little.(*) I had previously incorrectly remembered the SIGSEGV-for-call-to-null-funcref mechanism as working by catching a call to address 0; actually it works by catching a SIGSEGV on a load from the
VMFuncRef
, so the caching mechanism here would not catch that (and would "safely" crash wasmtime at least, rather than execute type-unsafely, but that's still wrong).
cfallin closed without merge PR #8870.
Last updated: Dec 23 2024 at 13:07 UTC