Stream: git-wasmtime

Topic: wasmtime / issue #4939 Avoid quadratic behavior in `can_o...


view this post on Zulip Wasmtime GitHub notifications bot (Sep 21 2022 at 18:22):

jameysharp commented on issue #4939:

Also, I haven't tested the performance impact of this patch yet, just its correctness. We should try both the cranelift-fuzzgen input from #4931 and @adambratschikaye's example wasm program from #4923. I can't do that until later though.

Also, we have this new benchmarking infrastructure that we're supposed to be able to trigger from a GitHub comment, right? I don't know how that works yet.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 21 2022 at 21:07):

afonso360 commented on issue #4939:

I've been running this in the fuzzer for the past 2 hours, and it hasn't complained so far.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 21 2022 at 21:24):

jameysharp commented on issue #4939:

Awesome, thank you @afonso360! I've just pushed some changes to the comments and an additional debug_assert but I don't think that invalidates your fuzzing results.

I don't know how to explain this more clearly than with these updated comments, so I'm marking this ready for review. I still need to evaluate its performance impact though.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 21 2022 at 22:30):

jameysharp commented on issue #4939:

Running wasmtime-cli from main on @adambratschikaye's example with cargo build --release --no-default-features --features wasmtime/wat:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
1.55user 0.02system 0:01.57elapsed 100%CPU (0avgtext+0avgdata 39532maxresident)k
0inputs+24outputs (0major+13706minor)pagefaults 0swaps

Under valgrind --tool=callgrind that executes 25,270,685,419 instructions, with about 98% of the time in use_var, and three-quarters of that hashing keys for the visited hash-map.

Same test on this branch:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.39user 0.02system 0:00.41elapsed 100%CPU (0avgtext+0avgdata 39744maxresident)k
0inputs+24outputs (3major+13731minor)pagefaults 0swaps

Only 2,803,649,669 instructions according to callgrind, but about 80% of that is in the new update_predecessor_cycle function.

So this branch takes 11% as many instructions and 25% of the user time on that example.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 22 2022 at 11:08):

adambratschikaye commented on issue #4939:

Thanks for handling this @jameysharp!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 22 2022 at 22:03):

jameysharp commented on issue #4939:

Sure thing, @adambratschikaye; thanks for pointing out the issue!

Okay, I think it was not a coincidence that 75% of the time on main was spent in hash-table operations, and my changes cut compile time by 75%. On the test case from #4923, both use_var and declare_block_predecessor are called the same number of times, and in some sense all I did was move the graph traversal from one to the other. So the entire difference is that on main this uses a HashSet, but with my PR it's indexing directly into a vector. Replacing the HashSet with a SecondaryMap, without adding any memoization, should have the same effect for this particular test case.

I suspect real programs average more than one call to use_var per block, so I think this is still a good idea, but yesterday @cfallin suggested there might be a way to speed this check up for blocks that are sealed, so I've been playing with that. Turns out, it was a good idea.

When compiling our pulldown-cmark benchmark from Sightglass, this saves about 50 million instructions, or about 3% of the total, according to callgrind. So I think this is a measurable improvement to real-world performance.

As for the original test case:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.03user 0.03system 0:00.06elapsed 104%CPU (0avgtext+0avgdata 39644maxresident)k
0inputs+24outputs (0major+13739minor)pagefaults 0swaps

According to callgrind, this took 554,072,900 instructions.

Taking sealed blocks into account eliminates 98% of the compile time on this test case, both in terms of user time and instruction count, compared with main. And at this point use_var_nonlocal plus update_predecessor_cycle take 0.29% of total runtime, so on this test there probably isn't any more performance to extract here.

Let's see if this works this time:
/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Sep 23 2022 at 16:07):

jameysharp commented on issue #4939:

I finally ran Sightglass myself and this is a measurable improvement on real programs:

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 49281270.08 ± 21356907.03 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.09x to 1.22x faster than main-b2d13ebd4.so!

  [233253012 368287922.00 470539364] main-b2d13ebd4.so
  [226078864 319006651.92 393048464] no-quadratic-3c0a6c74f.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 995084992.10 ± 398195457.23 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.07x to 1.17x faster than main-b2d13ebd4.so!

  [8794947694 8985953112.82 9227174380] main-b2d13ebd4.so
  [5097575234 7990868120.72 9268837322] no-quadratic-3c0a6c74f.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 7813047.40 ± 7010542.19 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.00x to 1.09x faster than main-b2d13ebd4.so!

  [161067634 176419598.60 253954244] main-b2d13ebd4.so
  [122513560 168606551.20 271954282] no-quadratic-3c0a6c74f.so


Last updated: Nov 22 2024 at 16:03 UTC