Stream: cranelift

Topic: regalloc3 benchmarks


Erik Rose (Feb 26 2025 at 20:36):

@Amanieu Per our exchange at today's meeting, here are the benchmarks I ran against wasmtime main (79e1b5374710cd66af9f6330b4c2a231e61d9866) vs. https://github.com/bytecodealliance/wasmtime/commit/eb2acdfb4f27795856c7455824b08bb705d60e64, which pulls in your regalloc3 branch of regalloc2 (but not your adjust-split branch). You had expected a 50% slowdown in compilation, but I recalled a 25% one. Some benchmarks were indeed up in the 20s, but, on average, they were 14% slower at the low end of the confidence interval and 17% slower at the high end. I offer this only in case you find it interesting or surprising; I by no means want to slow your work on the allocator, which will benefit us all! :-) Again, I ran these on an ARM chip (M3 Max), which has twice as many GPRs as x64.

regalloc3 all suite compilation only.txt

I didn't change anything in there yet anyway.

fitzgen (he/him) (Feb 26 2025 at 20:39):

I'd suggest focusing on just spidermonkey, bz2, and pulldown-cmark. maybe as a second tier intgemm-simd, meshoptimizer, and libsodium. but all the shootout stuff is just tiny micro benchmarks that aren't really interesting unless you are focused on optimizing the thing they happen to micro bench.

fitzgen (he/him) (Feb 26 2025 at 20:40):

in practice, the shootout benchmarks are going to just be noise unless you're trying to focus on their specific thing

fitzgen (he/him) (Feb 26 2025 at 20:41):

I guess blake3 is fairly interesting

Erik Rose (Feb 26 2025 at 20:42):

So 28-29, 17-20, and 25-27%, respectively.
2nd tier: 25-26, 17-21, and ≈15-20

Doesn't really change the numbers, but sure will save me time benchmarking! Thanks!

fitzgen (he/him) (Feb 26 2025 at 20:44):

right yeah, I'm just trying to help make benchmarking easier and make it easier to evaluate the results

fitzgen (he/him) (Feb 26 2025 at 20:47):

you can also benchmark particular phases; if you want to focus only on compile time, for example, you can do

$ sightglass-cli benchmark --benchmark-phase compilation ...

Erik Rose (Feb 26 2025 at 20:49):

I can also have it spit out JSON and save myself a lot of regexes, but I have yet to remember to do that until the middle of a long run. ;-)

Amanieu (Feb 26 2025 at 22:21):

FYI I run my benchmarks with the shuffling allocator disabled because it makes allocations extremely slow and distorts the runtime. That could be an explanation of why you're seeing different results.

Chris Fallin (Feb 26 2025 at 23:01):

ah, right, I filed https://github.com/bytecodealliance/sightglass/issues/280 a while ago to switch the default but then ran out of spare energy to do it; @Erik Rose that could be an easy PR to make

A recent report by @d-sonuga indicates wildly differing data when measuring the impact of regalloc improvements on compile time depending on whether Sightglass's shuffling allocator is enabled or n...

Amanieu (Feb 26 2025 at 23:01):

It's a one-line change:

diff --git a/crates/bench-api/Cargo.toml b/crates/bench-api/Cargo.toml
index a171b0beed..de288922ba 100644
--- a/crates/bench-api/Cargo.toml
+++ b/crates/bench-api/Cargo.toml
@@ -35,5 +35,5 @@ clap = { workspace = true }
 wat = { workspace = true }

 [features]
-default = ["shuffling-allocator", "wasi-nn"]
+default = ["wasi-nn"]
 wasi-nn = ["wasmtime-wasi-nn"]

Chris Fallin (Feb 26 2025 at 23:03):

I must have been very low on spare energy then :-)

Amanieu (Feb 26 2025 at 23:12):

@Erik Rose Here are the results I am getting on my (admittedly quite perf-noisy) machine:

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 43971560.50 ± 34846300.90 (confidence = 99%)

  3-spill.so is 1.08x to 1.66x faster than 2.so!

  [129186085 162930575.50 216937350] 2.so
  [85216530 118959015.00 141328075] 3-spill.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [95009950 116081567.00 158748170] 2.so
  [80138100 111471111.50 141233575] 3-spill.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [1066633785 1108409337.00 1163300915] 2.so
  [1039408055 1077584459.00 1112557740] 3-spill.so

This was using an older version of regalloc3 which didn't have live range splitting (the 50% cost I mentioned is only with live range splitting). This should be the same version you tested based on your Cargo.lock.
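
The "1.08x to 1.66x faster" range reported for bz2 can be reconstructed from the numbers in that output. The sketch below assumes the range is (fast + Δ ± error) / fast, where Δ is the difference between the two central values. That formula is an inference from the printed figures, not something taken from Sightglass's source, and `speedup_range` is a hypothetical helper name.

```rust
/// Speedup range implied by the difference between two central values
/// (`slow - fast`) and the half-width `err` of its confidence interval,
/// expressed relative to the faster build.
/// Assumption: this mirrors how the reported range is derived.
fn speedup_range(slow: f64, fast: f64, err: f64) -> (f64, f64) {
    let delta = slow - fast;
    ((fast + delta - err) / fast, (fast + delta + err) / fast)
}

fn main() {
    // bz2 figures copied from the output above: central cycle counts for
    // 2.so and 3-spill.so, plus the reported half-width of the interval.
    let (slow, fast, err) = (162_930_575.50_f64, 118_959_015.00_f64, 34_846_300.90_f64);
    let (lo, hi) = speedup_range(slow, fast, err);
    println!("delta = {:.2} +/- {:.2}", slow - fast, err);
    println!("3-spill.so is {:.2}x to {:.2}x faster than 2.so", lo, hi);
}
```

With the bz2 inputs this gives a Δ of 43971560.50 and a range of 1.08x to 1.66x, matching the reported line. The width of that range reflects how noisy the measurements are.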

Amanieu (Feb 26 2025 at 23:14):

However you can see that ra3 is consistently faster than ra2 in terms of compilation speed.

Chris Fallin (Feb 26 2025 at 23:14):

how does runtime perf look in the non-spilling version? (I forget, sorry; it'd be helpful to have an up-to-date summary of all the numbers somewhere)

Amanieu (Feb 26 2025 at 23:17):

I'm currently in the middle of some perf optimizations, but you can find the older results here: #cranelift > regalloc3 progress update @ 💬

Amanieu (Feb 27 2025 at 13:32):

@Erik Rose I've confirmed that the 25% regression you observed is entirely due to the shuffling allocator. regalloc3 doesn't make use of smallvec and instead relies on the caller to reuse the register allocation context for multiple functions (preserving the Vec allocations inside it). Unfortunately Cranelift doesn't use regalloc2::run_with_ctx and therefore cannot take advantage of this.

Amanieu (Feb 27 2025 at 13:32):

However, the difference is only really noticeable with the shuffling allocator enabled, since it makes allocations so slow that compilation takes ~5x longer.
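
The context-reuse pattern described here can be sketched in a few lines. This is a minimal illustration only: `Ctx`, its fields, and `allocate_with_ctx` are hypothetical names, not the regalloc2 or regalloc3 API. The point is that clearing a `Vec` keeps its heap allocation, so a caller that passes the same context for every function avoids going back to the (possibly slow) allocator.

```rust
/// Hypothetical allocator context: owns scratch buffers that outlive a single call.
#[derive(Default)]
struct Ctx {
    live_ranges: Vec<(u32, u32)>,
    assignments: Vec<u8>,
}

/// Hypothetical stand-in for an allocator entry point that takes a context.
/// It clears the buffers, which keeps their capacity, instead of creating new ones.
fn allocate_with_ctx(num_vregs: usize, ctx: &mut Ctx) -> usize {
    ctx.live_ranges.clear();
    ctx.assignments.clear();
    for v in 0..num_vregs {
        ctx.live_ranges.push((v as u32, v as u32 + 1));
        ctx.assignments.push((v % 16) as u8);
    }
    ctx.assignments.len()
}

fn main() {
    let mut ctx = Ctx::default();
    // First function: the buffers grow to fit.
    allocate_with_ctx(1000, &mut ctx);
    let cap = ctx.live_ranges.capacity();
    // Later functions of the same size or smaller reuse the same heap allocations.
    for n in [10, 500, 1000] {
        allocate_with_ctx(n, &mut ctx);
        assert_eq!(ctx.live_ranges.capacity(), cap);
    }
    println!("capacity stayed at {} across calls", cap);
}
```

A caller that builds a fresh context per function pays for those allocations every time, which is the cost that the shuffling allocator magnifies.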

Erik Rose (Feb 27 2025 at 14:40):

Fantastic. Thanks, Amanieu! I will update my checkouts, get off the shuffling allocator, and re-run my benchmarks.

I opened a PR to disable the shuffling allocator as well: https://github.com/bytecodealliance/wasmtime/pull/10300.

The slowness of the shuffling obscured performance signal (by dwarfing it) more than the accidental localities it was meant to avoid. Closes bytecodealliance/sightglass#280. This issue came up in h...

Amanieu (Feb 27 2025 at 14:46):

Note that if you update your checkouts now, the current main branch defaults to enabling live range splitting, which is what causes the 50% slowdown I previously mentioned.

Amanieu (Feb 27 2025 at 14:47):

You can override it by selecting SplitStrategy::Spill in the regalloc3 options. Or just modify your regalloc3 checkout to make that the default temporarily.


Last updated: Feb 27 2025 at 23:03 UTC