wasmtime / Issue #1647 Introduce peepmatic: a peephole op... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #1647 Introduce peepmatic: a peephole op...

Wasmtime GitHub notifications bot (May 02 2020 at 00:42):

github-actions[bot] commented on Issue #1647:

Subscribe to Label Action

cc @bnjbvr

<details>
This issue or pull request has been labeled: "cranelift"

Thus the following users have been cc'd because of the following labels:

bnjbvr: cranelift

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

Wasmtime GitHub notifications bot (May 02 2020 at 07:23):

bjorn3 commented on Issue #1647:

Cool! By the way does this fix the bug where preopt forgets to sign extend imm when optimizing v1 = iconst.i8 imm; v2 = icmp sgt v0, v1 to v2 = icmp_imm sgt v0, imm?

Wasmtime GitHub notifications bot (May 02 2020 at 11:13):

bjorn3 commented on Issue #1647:

Can you add the commit messages introducing a new crate to a top level doc comment in the respective crate?

Wasmtime GitHub notifications bot (May 02 2020 at 12:35):

bjorn3 edited a comment on Issue #1647:

Cool! By the way does this fix the bug where preopt forgets to sign extend imm when optimizing v1 = iconst.i8 imm; v2 = icmp sgt v0, v1 to v2 = icmp_imm sgt v0, imm?

Edit: It doesn't. Left a comment at the place it should sign extend.

Wasmtime GitHub notifications bot (May 02 2020 at 13:01):

bjorn3 commented on Issue #1647:

This is very well documented and structured code!

Wasmtime GitHub notifications bot (May 09 2020 at 05:59):

Techcable commented on Issue #1647:

Super cool! :grinning: How does this compare to LuaJIT's FOLD optimization and their perfect-hash system? I know their trace compiler has a simpler IR than crenelift, but what is the motivation of using a fst over perfect hash map? Does it enable more complex matching?

Wasmtime GitHub notifications bot (May 11 2020 at 08:03):

bnjbvr commented on Issue #1647:

This is exciting! A few high-level questions that I think would be important to answer before merging:

does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.

if this is at feature parity and we plan to enable it by default, can you provide performance data, please? (comparisons of before/after for: wallclocks compile run time + total number of executed instructions through Valgrind/perf) If this is a slowdown considering one of these two measures, I would strongly advocate not enabling it by default and keeping the existing system in the meanwhile.

(Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)

Wasmtime GitHub notifications bot (May 11 2020 at 11:01):

froydnj commented on Issue #1647:

* (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)
Something that would be nice to have sorted (pun intended) prior to merge is whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

Wasmtime GitHub notifications bot (May 11 2020 at 23:22):

fitzgen commented on Issue #1647:

@Techcable

How does this compare to LuaJIT's FOLD optimization and their perfect-hash system?

I'm not really familiar with LuaJIT's FOLD optimizations, but reading through that comment, it seems a little less general (can only match three operations at most?). The idea of combining three opcode checks into a single check via perfect hashing is something we could investigate and add as a new MatchOp, perhaps.

@bnjbvr, as you know, we talked a bit about this at the Cranelift meeting today, but for posterity I'll put them in a comment again.

does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.

Yes, this is feature-gated behind the "enable-peepmatic" cargo feature right now, and the feature is not enabled by default.

can you provide performance data, please?

Performance doesn't quite match the hand-coded peephole optimizer yet. This is one reason why it makes sense to land this off-by-default. This is unsurprising, since I haven't spent time on perf and optimization yet, other than the big picture design.

<details><summary><b>Graphs of wall time, instructions retired, cache misses, and branch misses</b></summary>

The following examples are for running wasmtime markdown.wasm '# Hello, World!' where markdown.wasm is internally using pulldown-cmark. This is a 272KiB wasm file.

Wall Time

![time](https://user-images.githubusercontent.com/74571/81618958-59905080-939d-11ea-921a-9193334a841f.png)

Instructions Retired

![instructions](https://user-images.githubusercontent.com/74571/81618946-4f6e5200-939d-11ea-8961-097e7d1a078e.png)

Branch Misses

![branch-misses](https://user-images.githubusercontent.com/74571/81618997-762c8880-939d-11ea-93de-c93451f84cf2.png)

Cache Misses

![cache-misses](https://user-images.githubusercontent.com/74571/81619012-7b89d300-939d-11ea-8fe2-98fb6bc55716.png)

</details>

I have many ideas for perf improvements, but I'd like to land this PR first, and then start investigating perf in follow ups. Since peepmatic is not enabled by default, this shouldn't be risky.

do we have any ideas of what the (Rust) compile time difference would be, with this new system?

The vast majority of peepmatic code is not necessary to compile unless you're changing the set of peephole optimizations. This is the motivation for the split between the peepmatic crate (the compiler, only run at build time if the "rebuild-peephole-optimizers" feature is enabled) and the peepmatic-runtime crate (just the things needed to use a peepmatic-generated peephole optimizer.

<details><summary><b>Timings of Cranelift's compile time</b></summary>

Without Peepmatic
fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet

real    0m24.207s
user    1m13.714s
sys     0m4.391s

fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (master *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet

real    0m2.424s
user    0m1.962s
sys     0m0.559s
With Peepmatic (Not Rebuilding Peephole Optimizers)
fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features enable-peepmatic

real    0m31.580s
user    1m44.893s
sys     0m6.192s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features enable-peepmatic

real    0m2.491s
user    0m1.988s
sys     0m0.604s
With Peepmatic (With Rebuilding Peephole Optimizers)
fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    3m35.014s
user    20m46.827s
sys     1m40.616s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    0m2.649s
user    0m2.187s
sys     0m0.563s
</details>

Incremental builds are unaffected.

Clean builds without rebuilding the peephole optimizers take a little bit longer (24 -> 31 seconds).

Clean builds with rebuilding the peephole optimizers take ~3.5 minutes. This is mainly due to building and statically linking Z3. We could also shared link the system Z3 to avoid much of this overhead, but this has other problems, namely old Z3s that are missing some exported symbols (e.g. Ubuntu's packaged Z3).

@froydnj

whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

(There is currently no generated Rust code, only a generated automaton that is then interpreted. This may change in the future. Sorry to nitpick.)

Yes, builds are deterministic, producing the same automaton bit-for-bit given the same DSL input. CI is checking this, and one of the fuzz targets is also checking this.

Wasmtime GitHub notifications bot (May 11 2020 at 23:43):

fitzgen commented on Issue #1647:

Oh, also, there was a question at the Cranelift meeting about how many optimizations we can expect to get out of Souper.

@jubitaneja harvested candidate left-hand sides from rustfmt compiled to Wasm with LLVM optimizations and then ran them through Souper. Souper successfully synthesized 836 optimizations, of which 221 are reducing the whole LHS to a constant.

I think we can expect to see roughly similar results, with a couple caveats:

First, she was harvesting LHS candidates from the Wasm, not the clif that the Wasm gets translated into. On the one hand, it isn't clear how many of these synthesized optimizations are subsumed by our existing preopt pass. On the other, these candidates are harvested after LLVM optimizations, and I'm pretty sure LLVM's optimizations largely subsume our preopt pass's, so maybe these are new/unique/missing optimizations?

Second, choosing a corpus of benchmark Wasms to harvest LHSes from is tricky, but I am pretty sure we will have more than a single file in the corpus from more than just a single toolchain. So I'd suspect that we would synthesize even more optimizations than this.

Wasmtime GitHub notifications bot (May 12 2020 at 09:45):

bjorn3 commented on Issue #1647:

It would be nice to also harvested candidate left-hand sides from cg_clif generated clif ir. Maybe add a way for a user to provide it's own set of peephole optimizations to Cranelift?

Wasmtime GitHub notifications bot (May 12 2020 at 15:36):

fitzgen commented on Issue #1647:

It would be nice to also harvested candidate left-hand sides from cg_clif generated clif ir. Maybe add a way for a user to provide it's own set of peephole optimizations to Cranelift?

Yep, this is definitely something we could do in the future.

Wasmtime GitHub notifications bot (May 12 2020 at 20:28):

fitzgen commented on Issue #1647:

Finally got windows CI green, so now all CI is green!

Last updated: Apr 18 2025 at 20:03 UTC