jameysharp commented on issue #4939:
Also, I haven't tested the performance impact of this patch yet, just its correctness. We should try both the cranelift-fuzzgen input from #4931 and @adambratschikaye's example wasm program from #4923. I can't do that until later though.
Also, we have this new benchmarking infrastructure that we're supposed to be able to trigger from a GitHub comment, right? I don't know how that works yet.
afonso360 commented on issue #4939:
I've been running this in the fuzzer for the past 2 hours, and it hasn't complained so far.
jameysharp commented on issue #4939:
Awesome, thank you @afonso360! I've just pushed some changes to the comments and an additional
debug_assert
but I don't think that invalidates your fuzzing results.I don't know how to explain this more clearly than with these updated comments, so I'm marking this ready for review. I still need to evaluate its performance impact though.
jameysharp commented on issue #4939:
Running wasmtime-cli from
main
on @adambratschikaye's example withcargo build --release --no-default-features --features wasmtime/wat
:$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat 1.55user 0.02system 0:01.57elapsed 100%CPU (0avgtext+0avgdata 39532maxresident)k 0inputs+24outputs (0major+13706minor)pagefaults 0swaps
Under
valgrind --tool=callgrind
that executes 25,270,685,419 instructions, with about 98% of the time inuse_var
, and three-quarters of that hashing keys for thevisited
hash-map.Same test on this branch:
$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat 0.39user 0.02system 0:00.41elapsed 100%CPU (0avgtext+0avgdata 39744maxresident)k 0inputs+24outputs (3major+13731minor)pagefaults 0swaps
Only 2,803,649,669 instructions according to callgrind, but about 80% of that is in the new
update_predecessor_cycle
function.So this branch takes 11% as many instructions and 25% of the user time on that example.
adambratschikaye commented on issue #4939:
Thanks for handling this @jameysharp!
jameysharp commented on issue #4939:
Sure thing, @adambratschikaye; thanks for pointing out the issue!
Okay, I think it was not a coincidence that 75% of the time on
main
was spent in hash-table operations, and my changes cut compile time by 75%. On the test case from #4923, bothuse_var
anddeclare_block_predecessor
are called the same number of times, and in some sense all I did was move the graph traversal from one to the other. So the entire difference is that onmain
this uses aHashSet
, but with my PR it's indexing directly into a vector. Replacing theHashSet
with aSecondaryMap
, without adding any memoization, should have the same effect for this particular test case.I suspect real programs average more than one call to
use_var
per block, so I think this is still a good idea, but yesterday @cfallin suggested there might be a way to speed this check up for blocks that are sealed, so I've been playing with that. Turns out, it was a good idea.When compiling our
pulldown-cmark
benchmark from Sightglass, this saves about 50 million instructions, or about 3% of the total, according to callgrind. So I think this is a measurable improvement to real-world performance.As for the original test case:
$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat 0.03user 0.03system 0:00.06elapsed 104%CPU (0avgtext+0avgdata 39644maxresident)k 0inputs+24outputs (0major+13739minor)pagefaults 0swaps
According to callgrind, this took 554,072,900 instructions.
Taking sealed blocks into account eliminates 98% of the compile time on this test case, both in terms of user time and instruction count, compared with
main
. And at this pointuse_var_nonlocal
plusupdate_predecessor_cycle
take 0.29% of total runtime, so on this test there probably isn't any more performance to extract here.Let's see if this works this time:
/bench_x64
jameysharp commented on issue #4939:
I finally ran Sightglass myself and this is a measurable improvement on real programs:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 49281270.08 ± 21356907.03 (confidence = 99%) no-quadratic-3c0a6c74f.so is 1.09x to 1.22x faster than main-b2d13ebd4.so! [233253012 368287922.00 470539364] main-b2d13ebd4.so [226078864 319006651.92 393048464] no-quadratic-3c0a6c74f.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 995084992.10 ± 398195457.23 (confidence = 99%) no-quadratic-3c0a6c74f.so is 1.07x to 1.17x faster than main-b2d13ebd4.so! [8794947694 8985953112.82 9227174380] main-b2d13ebd4.so [5097575234 7990868120.72 9268837322] no-quadratic-3c0a6c74f.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 7813047.40 ± 7010542.19 (confidence = 99%) no-quadratic-3c0a6c74f.so is 1.00x to 1.09x faster than main-b2d13ebd4.so! [161067634 176419598.60 253954244] main-b2d13ebd4.so [122513560 168606551.20 271954282] no-quadratic-3c0a6c74f.so
Last updated: Dec 23 2024 at 12:05 UTC