cfallin opened PR #8509 from cfallin:call-indirect-caching to bytecodealliance:main:
In WebAssembly, an indirect call is somewhat slow because of the indirection required by CFI (control-flow integrity) sandboxing. In particular, a "function pointer" in most source languages compiled to Wasm is represented by an index into a table of funcrefs. The `call_indirect` instruction then has to do the following steps to invoke a function pointer:
- Load the funcref table's base and length values from the vmctx.
- Bounds-check the invoked index against the actual table size; trap if out-of-bounds.
- Spectre mitigation (cmove) on that bounds-check.
- Load the `vmfuncref` from the table given base and index.
- For lazy table init, check if this is a non-initialized funcref pointer, and initialize the entry.
- Load the signature from the funcref struct and compare it against the `call_indirect`'s expected signature; trap if wrong.
- Load the actual code pointer for the callee's Wasm-ABI entry point.
- Load the callee vmctx (which may be different for a cross-module call).
- Put that vmctx in arg 0, our vmctx in arg 1, and invoke the loaded code pointer with an indirect call instruction.
Compare and contrast to the process involved in invoking a native function pointer:
- Invoke the code pointer with an indirect call instruction.
This overhead buys us something -- it is part of the SFI sandbox boundary -- but it is very repetitive and unnecessary work in most cases when indirect function calls are performed repeatedly (such as within an inner loop).
This PR introduces the idea of caching: if we know that the result of all the above checks won't change, then if we use the same index as "the last time" (for some definition), we can skip straight to the "invoke the code pointer" step, with a cached code pointer from that last time.
Concretely, it introduces a two-word struct inlined into the vmctx for each `call_indirect` instruction in the module (up to a limit):
- The last invoked index;
- The code pointer that index corresponded to.
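As a rough sketch (hypothetical names; the real entries are laid out inline in the vmctx by Wasmtime's offsets code rather than as a standalone Rust type), each cache entry amounts to:

```rust
/// Minimal sketch of one per-call-site cache entry (two machine words).
#[repr(C)]
struct CallIndirectCacheEntry {
    /// Table index used by the last call at this `call_indirect` site.
    /// Index 0 doubles as the "no cached value" sentinel when the module
    /// analysis described below shows slot 0 can never hold a callable funcref.
    last_index: usize,
    /// Raw Wasm-ABI code pointer that `last_index` resolved to last time.
    last_code_ptr: *const u8,
}
```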
When compiling the module, we check whether the table could possibly be mutated at a given index once read: for example, via instructions like `table.set`, or because the whole table is exported and thus writable from the outside. We also check whether index 0 is a non-null funcref. If neither of these things is true, then we know we can cache an index-to-code-pointer mapping, and we know we can use index 0 as a sentinel for "no cached value".

We then make use of the struct for each indirect call site and generate code to check if the index matches the cached one; if so, we call the cached pointer; if not, we load the vmfuncref, check the signature, check that the callee vmctx is the same as the caller's (intra-module case), stash the code pointer and index away (fill the cache), and then make the call.
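Building on the cache-entry sketch above, here is a hedged Rust model of that fast/slow-path split (the real implementation emits Cranelift IR during translation; `Funcref`, the table representation, and `invoke` are illustrative stand-ins):

```rust
/// Illustrative stand-in for a table entry's vmfuncref.
struct Funcref {
    signature: u32,      // type index used for the signature check
    code_ptr: *const u8, // callee's Wasm-ABI entry point
    vmctx: usize,        // callee's vmctx (differs for cross-module calls)
}

fn call_indirect_with_cache(
    cache: &mut CallIndirectCacheEntry,
    index: usize,
    caller_vmctx: usize,
    expected_sig: u32,
    table: &[Option<Funcref>],
    invoke: impl Fn(*const u8, usize, usize), // (code_ptr, callee vmctx, caller vmctx)
) {
    // Fast path: same nonzero index as last time. All checks already passed
    // on the previous call, so jump straight to the cached code pointer.
    if index != 0 && index == cache.last_index {
        invoke(cache.last_code_ptr, caller_vmctx, caller_vmctx);
        return;
    }

    // Slow path: bounds check, null/uninitialized check, signature check.
    let funcref = table
        .get(index)
        .and_then(|slot| slot.as_ref())
        .expect("trap: out-of-bounds or null funcref");
    assert_eq!(funcref.signature, expected_sig, "trap: signature mismatch");

    // Fill the cache only for intra-module calls, where the callee's vmctx is
    // the caller's own and the code pointer alone suffices next time.
    if funcref.vmctx == caller_vmctx {
        cache.last_index = index;
        cache.last_code_ptr = funcref.code_ptr;
    }

    invoke(funcref.code_ptr, funcref.vmctx, caller_vmctx);
}
```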
On an in-development branch of SpiderMonkey-in-Wasm with ICs (using indirect calls), this is about a 20% speedup; I haven't yet measured on other benchmarks. It is expected that this might cause an instantiation-time slowdown due to the larger vmctx (but we could use madvise to zero it if needed).
This feature is off by default right now.
cfallin requested fitzgen for a review on PR #8509.
cfallin requested wasmtime-core-reviewers for a review on PR #8509.
cfallin commented on PR #8509:
(I'll run some benchmarks to see what the compile-time impact is like, and how this benefits non-SpiderMonkey workloads; TBD.)
cfallin updated PR #8509.
cfallin commented on PR #8509:
Benchmarks:
- Sightglass measures the following for `spidermonkey.wasm`:

```text
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 1456249767.40 ± 1315409020.02 (confidence = 99%)
  original.so is 1.01x to 1.17x faster than cache-indirect-calls.so!
  [17934023164 18070879867.60 18215783322] cache-indirect-calls.so
  [15377276916 16614630100.20 17840489432] original.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 27069455.80 ± 22219096.46 (confidence = 99%)
  original.so is 1.00x to 1.05x faster than cache-indirect-calls.so!
  [1079208474 1095097604.00 1142114396] cache-indirect-calls.so
  [1035034082 1068028148.20 1094035960] original.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [834480 856949.40 902120] cache-indirect-calls.so
  [831744 860156.60 978994] original.so
```
so compilation time does take a little hit but my system is noisy; somewhere between 1% and 17%, the mean looks to be around 8.7%. Instantiation apparently unaffected. And execution time... slightly slower? (But again, noisy system.)
I tried running the next benchmark in the list (pulldown-cmark) and Sightglass crashed mysteriously, so on to trusty `hyperfine`: no deltas in runtime on spidermonkey or bz2, my two usual suspects.

I suspect we'll see mostly in-the-noise results for most benchmarks because indirect calls are somewhat rare in C-ish code; it will matter mostly when compiling dynamic languages, or something like Java with a lot of virtual calls. The speedup on weval'd SpiderMonkey is real though, so I'd like to at least have this in tree as an option :-)
github-actions[bot] commented on PR #8509:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "wasmtime:api", "wasmtime:config", "winch"Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
github-actions[bot] commented on PR #8509:
Label Messager: wasmtime:config
It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:

- [ ] If you added a new `Config` method, you wrote extensive documentation for it.

<details>
Our documentation should be of the following form:

```text
Short, simple summary sentence.

More details. These details can be multiple paragraphs. There should be
information about not just the method, but its parameters and results as
well.

Is this method fallible? If so, when can it return an error?

Can this method panic? If so, when does it panic?

Example

Optional example here.
```
</details>
- [ ] If you added a new `Config` method, or modified an existing one, you
ensured that this configuration is exercised by the fuzz targets.

<details>
For example, if you expose a new strategy for allocating the next instance
slot inside the pooling allocator, you should ensure that at least one of our
fuzz targets exercises that new strategy.

Often, all that is required of you is to ensure that there is a knob for this
configuration option in [`wasmtime_fuzzing::Config`][fuzzing-config] (or one
of its nested `struct`s).

Rarely, this may require authoring a new fuzz target to specifically test this
configuration. See [our docs on fuzzing][fuzzing-docs] for more details.
</details>

- [ ] If you are enabling a configuration option by default, make sure that it
has been fuzzed for at least two weeks before turning it on by default.

[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html
<details>
To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.
To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.
</details>
fitzgen commented on PR #8509:
FWIW, I did a 100-iteration sightglass run of the default suite and got these results:
```text
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [472878 584750.13 1435913] cache.so
  [471975 647599.28 1762837] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [7422842 8744920.24 28196476] cache.so
  [7439782 9281270.19 32569639] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [165456 245749.33 728069] cache.so
  [161764 251740.72 723740] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [95484875 104221743.11 196171026] cache.so
  [95454637 105918836.28 208434699] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [358282063 404458268.52 568579026] cache.so
  [367557812 399158971.64 551388853] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [194765165 225581880.12 437412507] cache.so
  [197703277 222917937.46 281571195] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [895587497 923280792.22 972838288] cache.so
  [891917856 927697965.55 1121115203] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [8325711261 8502522129.94 8732981273] cache.so
  [8370183347 8530524550.49 9984684780] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [214742 289658.22 800685] cache.so
  [204642 290351.57 854625] main.so
```
I think it is safe to say that this doesn't have an effect on our regular suite, and I expect that the JS speedups you are reporting rely on weval'ing first.
fitzgen submitted PR review:
Overall looks great, but I have a few nitpicks, mostly related to improving things for future code readers. Thanks!
fitzgen created PR review comment:
Care to add a brief comment to this test file about what we are aiming to exercise here, so that future changes that happen to tweak this disassembly know what property to check is preserved in their changes?
fitzgen created PR review comment:
All the rest of our offset getters are using typed indices; can we define a `CallIndirectSiteIndex` (feel free to word smith) type for these? It just takes three lines in `crates/types/src/lib.rs`:
fitzgen created PR review comment:
That's a lot of "index", perhaps we could reword this a little and rename the
index
parameter tocall_site
? I think this is a little more clear:/// Return the offset from `self` of `self.call_indirect_caches[call_site].index`.
(And similar above)
fitzgen created PR review comment:
(ditto for all these tests)
fitzgen created PR review comment:
Could you add some comments here for future readers and possibly factor this scan out into a helper method to help prevent this function from getting even more gigantic?
fitzgen created PR review comment:
This could really use some more detail. As a user, why shouldn't I just enable this all the time? What are the trade offs? What is the default? What does it do? When does it apply? How does it affect memory footprint and pooling allocator integration? Etc...
The automatic github actions comment has a good template in the first check box: https://github.com/bytecodealliance/wasmtime/pull/8509#issuecomment-2084268706
fitzgen created PR review comment:
indireaching? This is a typo and not an intentional contraction, right?
fitzgen created PR review comment:
Ditto
fitzgen created PR review comment:
Could we factor this one out into smaller helpers and add comments describing the code and blocks we are emitting as well? I'm finding it really hard to read this giant block of clif-building code directly. See for example the overview comment and comments for each block we generate here:
fitzgen created PR review comment:
Speaking of that comment, the second item doesn't seem to be satisfied in this PR either: adding support for fuzzing this option. It should be just a couple-line change in these places:
cfallin updated PR #8509.
cfallin updated PR #8509.
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
For sure, added explanations here and below giving the reasons for expecting/not expecting the opt, and the key bits of IR we're asserting.
cfallin submitted PR review.
cfallin created PR review comment:
Updated!
cfallin submitted PR review.
cfallin created PR review comment:
Updated!
cfallin submitted PR review.
cfallin created PR review comment:
Huh, should be "indirect call caching"; somehow my fingers or a misplaced editor command swallowed the middle bits of that, sorry!
cfallin submitted PR review.
cfallin created PR review comment:
Added a similar comment and factored out a little bit of the logic (and reduced the CFG by one block while I was at it) -- hopefully clearer now!
cfallin updated PR #8509.
cfallin commented on PR #8509:
Updated, let me know what you think!
cfallin commented on PR #8509:
(Oh, and forgot to mention: thanks for the thorough benchmarks on this! I've gotta dig into why Sightglass is so finicky for me...)
cfallin updated PR #8509.
github-actions[bot] commented on PR #8509:
Subscribe to Label Action
cc @fitzgen
<details>
This issue or pull request has been labeled: "fuzzing", "wasmtime:ref-types"Thus the following users have been cc'd because of the following labels:
- fitzgen: fuzzing, wasmtime:ref-types
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
fitzgen commented on PR #8509:
Oh also, I mentioned this in the Cranelift meeting, but wanted to put it down here for posterity too: it seems like the limit on the number of caches isn't actually implemented.
cfallin updated PR #8509.
cfallin commented on PR #8509:
Yep, I missed that entirely in this iteration of the patch -- sorry about that! Just updated. I ended up wiring the limit through to a new config option (with docs, CLI flag, fuzzing) so that I could easily test it without a huge test input.
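For illustration, an embedder enabling the feature and its cache-count limit might write something like the following sketch (the exact `Config` method names should be checked against the Wasmtime API docs; the ones below are assumptions based on this discussion):

```rust
use wasmtime::{Config, Engine};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Assumed option names: enable call_indirect caching (off by default)
    // and cap how many per-call-site cache slots are added to the vmctx.
    config.cache_call_indirects(true);
    config.max_call_indirect_cache_slots(1000);
    let _engine = Engine::new(&config)?;
    Ok(())
}
```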
cfallin updated PR #8509.
fitzgen submitted PR review:
:rocket:
fitzgen created PR review comment:
This is great, thank you. I really appreciate having this stuff factored out into bite-sized chunks.
cfallin updated PR #8509.
cfallin commented on PR #8509:
The merge queue did its job (nice!) and found a conflict with #8515 which merged after this PR's CI ran. I pushed a fix which is fairly small but @fitzgen maybe PTAL at least commit just to be sure?
cfallin edited a comment on PR #8509:
The merge queue did its job (nice!) and found a conflict with #8515 which merged after this PR's CI ran. I pushed a fix which is fairly small but @fitzgen maybe PTAL at least the last commit just to be sure?
cfallin commented on PR #8509:
(or @jameysharp feel free to review last commit too as it interacts with your change in #8515!)
fitzgen submitted PR review.
fitzgen created PR review comment:
Oh I just noticed this new index type is in between the definition of VMSharedTypeIndex and the impl block defining methods for it. Can we move it before or after so that those two are together?
fitzgen submitted PR review.
fitzgen created PR review comment:
Is this allowed to return false even when the value is actually zero? Jamey is adding support for the extended const proposal, and it extends const expressions with addition, subtraction, and multiplication, so there will soon be many ways to evaluate to zero. That doesn't mean we can't land this PR now, but we should be clear about the semantics of this method.
cfallin updated PR #8509.
cfallin submitted PR review.
cfallin created PR review comment:
Clarified a bit -- renamed to `provably_nonzero_i32` (inverted from is-possibly-nonzero -- clearer this way IMHO) and commented more. It's always allowed to return `false` (can't prove anything), but if it returns `true`, it must actually be a case we know is always nonzero (currently only an `i32.const` with a nonzero constant).
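As a sketch of the contract described here (illustrative types, not the actual `wasmtime-environ` code), such a helper might look like:

```rust
/// Illustrative stand-in for const-expression operators.
enum ConstOp {
    I32Const(i32),
    Other, // globals, extended-const arithmetic, etc.
}

/// Returns `true` only when the expression provably evaluates to a nonzero
/// i32; returning `false` ("can't prove anything") is always allowed.
fn provably_nonzero_i32(ops: &[ConstOp]) -> bool {
    match ops {
        // Currently only a single `i32.const C` with `C != 0` is recognized.
        [ConstOp::I32Const(c)] => *c != 0,
        // Anything we can't reason about: conservatively return false.
        _ => false,
    }
}
```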
cfallin submitted PR review.
cfallin created PR review comment:
Ah, yeah, didn't notice that; moved down a bit.
cfallin has enabled auto merge for PR #8509.
jameysharp submitted PR review.
jameysharp created PR review comment:
I would suggest naming this with the opposite sense, like `is_definitely_nonzero`, which returns true for `I32Const(x)` if `x != 0`, and may also return true if there are any other circumstances where it can prove this, but otherwise returns false. Then, extended constants or any other strange things will safely return false, meaning that you can't be sure your optimization is valid, and you'll only apply the optimization when it's definitely okay.

Further, when we follow Nick's suggestion for making the `ConstExpr` evaluator try to run even before an `Instance` is available, it will be clear how to invoke that to implement this. If it fails because it needs inputs which aren't known until instantiation, or it succeeds but returns 0, then we should return that the expression is not "definitely nonzero".
jameysharp submitted PR review.
jameysharp created PR review comment:
Hah, raced; I'm glad we're on the same page.
cfallin merged PR #8509.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Could these be moved to `-O` options as "tuning" or "optimization" options? The `WasmOptions` group here is intended for semantic differences in the generated code, and these options shouldn't affect any generated code.

(thanks for adding cli flags though!)
cfallin created PR review comment:
Ah, yeah, for sure -- will do in a followup PR.
cfallin submitted PR review.