github-actions[bot] commented on Issue #1647:
Subscribe to Label Action
cc @bnjbvr
<details>
This issue or pull request has been labeled: "cranelift"Thus the following users have been cc'd because of the following labels:
- bnjbvr: cranelift
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
bjorn3 commented on Issue #1647:
Cool! By the way does this fix the bug where preopt forgets to sign extend
imm
when optimizingv1 = iconst.i8 imm; v2 = icmp sgt v0, v1
tov2 = icmp_imm sgt v0, imm
?
bjorn3 commented on Issue #1647:
Can you add the commit messages introducing a new crate to a top level doc comment in the respective crate?
bjorn3 edited a comment on Issue #1647:
Cool! By the way does this fix the bug where preopt forgets to sign extend
imm
when optimizingv1 = iconst.i8 imm; v2 = icmp sgt v0, v1
tov2 = icmp_imm sgt v0, imm
?Edit: It doesn't. Left a comment at the place it should sign extend.
bjorn3 commented on Issue #1647:
This is very well documented and structured code!
Techcable commented on Issue #1647:
Super cool! :grinning: How does this compare to LuaJIT's FOLD optimization and their perfect-hash system? I know their trace compiler has a simpler IR than crenelift, but what is the motivation of using a fst over perfect hash map? Does it enable more complex matching?
bnjbvr commented on Issue #1647:
This is exciting! A few high-level questions that I think would be important to answer before merging:
- does this replace the existing
simple_preopt
, or is this something that ought to be disabled until it has feature parity with the currentsimple_preopt
? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.- if this is at feature parity and we plan to enable it by default, can you provide performance data, please? (comparisons of before/after for: wallclocks compile run time + total number of executed instructions through Valgrind/perf) If this is a slowdown considering one of these two measures, I would strongly advocate not enabling it by default and keeping the existing system in the meanwhile.
- (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)
froydnj commented on Issue #1647:
* (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)
Something that would be nice to have sorted (pun intended) prior to merge is whether the auto-generated code is the same over multiple compilations of the crate, so
sccache
works correctly.
fitzgen commented on Issue #1647:
@Techcable
How does this compare to LuaJIT's FOLD optimization and their perfect-hash system?
I'm not really familiar with LuaJIT's FOLD optimizations, but reading through that comment, it seems a little less general (can only match three operations at most?). The idea of combining three opcode checks into a single check via perfect hashing is something we could investigate and add as a new
MatchOp
, perhaps.
@bnjbvr, as you know, we talked a bit about this at the Cranelift meeting today, but for posterity I'll put them in a comment again.
does this replace the existing
simple_preopt
, or is this something that ought to be disabled until it has feature parity with the currentsimple_preopt
? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.Yes, this is feature-gated behind the
"enable-peepmatic"
cargo feature right now, and the feature is not enabled by default.can you provide performance data, please?
Performance doesn't quite match the hand-coded peephole optimizer yet. This is one reason why it makes sense to land this off-by-default. This is unsurprising, since I haven't spent time on perf and optimization yet, other than the big picture design.
<details><summary><b>Graphs of wall time, instructions retired, cache misses, and branch misses</b></summary>
The following examples are for running
wasmtime markdown.wasm '# Hello, World!'
wheremarkdown.wasm
is internally usingpulldown-cmark
. This is a 272KiB wasm file.Wall Time
![time](https://user-images.githubusercontent.com/74571/81618958-59905080-939d-11ea-921a-9193334a841f.png)
Instructions Retired
![instructions](https://user-images.githubusercontent.com/74571/81618946-4f6e5200-939d-11ea-8961-097e7d1a078e.png)
Branch Misses
![branch-misses](https://user-images.githubusercontent.com/74571/81618997-762c8880-939d-11ea-93de-c93451f84cf2.png)
Cache Misses
![cache-misses](https://user-images.githubusercontent.com/74571/81619012-7b89d300-939d-11ea-8fe2-98fb6bc55716.png)
</details>
I have many ideas for perf improvements, but I'd like to land this PR first, and then start investigating perf in follow ups. Since peepmatic is not enabled by default, this shouldn't be risky.
do we have any ideas of what the (Rust) compile time difference would be, with this new system?
The vast majority of peepmatic code is not necessary to compile unless you're changing the set of peephole optimizations. This is the motivation for the split between the
peepmatic
crate (the compiler, only run at build time if the"rebuild-peephole-optimizers"
feature is enabled) and thepeepmatic-runtime
crate (just the things needed to use a peepmatic-generated peephole optimizer.<details><summary><b>Timings of Cranelift's compile time</b></summary>
Without Peepmatic
fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen $ cargo clean; time cargo build --quiet real 0m24.207s user 1m13.714s sys 0m4.391s fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen $ echo "// comment" >> src/lib.rs fitzgen@erdos :: (master *) :: ~/wasmtime/cranelift/codegen $ time cargo build --quiet real 0m2.424s user 0m1.962s sys 0m0.559s
With Peepmatic (Not Rebuilding Peephole Optimizers)
fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen $ cargo clean; time cargo build --quiet --features enable-peepmatic real 0m31.580s user 1m44.893s sys 0m6.192s fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen $ echo "// comment" >> src/lib.rs fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen $ time cargo build --quiet --features enable-peepmatic real 0m2.491s user 0m1.988s sys 0m0.604s
With Peepmatic (With Rebuilding Peephole Optimizers)
fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen $ cargo clean; time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers' real 3m35.014s user 20m46.827s sys 1m40.616s fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen $ echo "// comment" >> src/lib.rs fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen $ time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers' real 0m2.649s user 0m2.187s sys 0m0.563s
</details>
Incremental builds are unaffected.
Clean builds without rebuilding the peephole optimizers take a little bit longer (24 -> 31 seconds).
Clean builds with rebuilding the peephole optimizers take ~3.5 minutes. This is mainly due to building and statically linking Z3. We could also shared link the system Z3 to avoid much of this overhead, but this has other problems, namely old Z3s that are missing some exported symbols (e.g. Ubuntu's packaged Z3).
@froydnj
whether the auto-generated code is the same over multiple compilations of the crate, so
sccache
works correctly.(There is currently no generated Rust code, only a generated automaton that is then interpreted. This may change in the future. Sorry to nitpick.)
Yes, builds are deterministic, producing the same automaton bit-for-bit given the same DSL input. CI is checking this, and one of the fuzz targets is also checking this.
fitzgen commented on Issue #1647:
Oh, also, there was a question at the Cranelift meeting about how many optimizations we can expect to get out of Souper.
@jubitaneja harvested candidate left-hand sides from
rustfmt
compiled to Wasm with LLVM optimizations and then ran them through Souper. Souper successfully synthesized 836 optimizations, of which 221 are reducing the whole LHS to a constant.I think we can expect to see roughly similar results, with a couple caveats:
First, she was harvesting LHS candidates from the Wasm, not the clif that the Wasm gets translated into. On the one hand, it isn't clear how many of these synthesized optimizations are subsumed by our existing preopt pass. On the other, these candidates are harvested after LLVM optimizations, and I'm pretty sure LLVM's optimizations largely subsume our preopt pass's, so maybe these are new/unique/missing optimizations?
Second, choosing a corpus of benchmark Wasms to harvest LHSes from is tricky, but I am pretty sure we will have more than a single file in the corpus from more than just a single toolchain. So I'd suspect that we would synthesize even more optimizations than this.
bjorn3 commented on Issue #1647:
It would be nice to also harvested candidate left-hand sides from cg_clif generated clif ir. Maybe add a way for a user to provide it's own set of peephole optimizations to Cranelift?
fitzgen commented on Issue #1647:
It would be nice to also harvested candidate left-hand sides from cg_clif generated clif ir. Maybe add a way for a user to provide it's own set of peephole optimizations to Cranelift?
Yep, this is definitely something we could do in the future.
fitzgen commented on Issue #1647:
Finally got windows CI green, so now all CI is green!
Last updated: Jan 24 2025 at 00:11 UTC