regalloc3 benchmarks · cranelift

Erik Rose (Feb 26 2025 at 20:36):

@Amanieu Per our exchange at today's meeting, here are the benchmarks I ran against wasmtime main (79e1b5374710cd66af9f6330b4c2a231e61d9866) vs. https://github.com/bytecodealliance/wasmtime/commit/eb2acdfb4f27795856c7455824b08bb705d60e64, which pulls in your regalloc3 branch of regalloc2 (but not your adjust-split branch). You had expected a 50% slowdown in compilation, but I recalled a 25% one. Some benchmarks were indeed up in the 20s, but, on average, they were 14% slower at the low end of the confidence interval and 17% slower at the high end. I offer this only in case you find it interesting/surprising and by no means want to slow your work on the allocator, which will benefit us all! :-) Again, I ran these on an ARM chip (M3 Max), which has twice as many GPRs as x64.

fitzgen (he/him) (Feb 26 2025 at 20:39):

I'd suggest focusing on just spidermonkey, bz2, and pulldown-cmark. maybe as a second tier intgemm-simd, meshoptimizer, and libsodium. but all the shootout stuff is just tiny micro benchmarks that aren't really interesting unless you are focused on optimizing the thing they happen to micro bench.

fitzgen (he/him) (Feb 26 2025 at 20:40):

in practice, the shootout benchmarks are going to just be noise unless you're trying to focus on their specific thing

fitzgen (he/him) (Feb 26 2025 at 20:41):

Erik Rose (Feb 26 2025 at 20:42):

Doesn't really change the numbers, but sure will save me time benchmarking! Thanks!

fitzgen (he/him) (Feb 26 2025 at 20:44):

right yeah, I'm just trying to help make benchmarking easier and make it easier to evaluate the results

fitzgen (he/him) (Feb 26 2025 at 20:47):

you can also benchmark particular phases as well, if you want to focus only on compile time for example, you can do

$ sightglass-cli benchmark --benchmark-phase compilation ...

Erik Rose (Feb 26 2025 at 20:49):

I can also have it spit out JSON and save myself a lot of regexes, but I have yet to remember to do that until the middle of a long run. ;-)

Amanieu (Feb 26 2025 at 22:21):

FYI I run my benchmarks with the shuffling allocator disabled because it makes allocations extremely slow and distorts the runtime. That could be an explanation of why you're seeing different results.

Chris Fallin (Feb 26 2025 at 23:01):

Consider disabling the shuffling allocator by default · Issue #280 · bytecodealliance/sightglass

A recent report by @d-sonuga indicates wildly differing data when measuring the impact of regalloc improvements on compile time depending on whether Sightglass's shuffling allocator is enabled or n...

Amanieu (Feb 26 2025 at 23:01):

diff --git a/crates/bench-api/Cargo.toml b/crates/bench-api/Cargo.toml
index a171b0beed..de288922ba 100644
--- a/crates/bench-api/Cargo.toml
+++ b/crates/bench-api/Cargo.toml
@@ -35,5 +35,5 @@ clap = { workspace = true }
 wat = { workspace = true }

 [features]
-default = ["shuffling-allocator", "wasi-nn"]
+default = ["wasi-nn"]
 wasi-nn = ["wasmtime-wasi-nn"]

Chris Fallin (Feb 26 2025 at 23:03):

Amanieu (Feb 26 2025 at 23:12):

@Erik Rose Here are the results I am getting on my (admittedly quite perf-noisy) machine:

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 43971560.50 ± 34846300.90 (confidence = 99%)

  3-spill.so is 1.08x to 1.66x faster than 2.so!

  [129186085 162930575.50 216937350] 2.so
  [85216530 118959015.00 141328075] 3-spill.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [95009950 116081567.00 158748170] 2.so
  [80138100 111471111.50 141233575] 3-spill.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [1066633785 1108409337.00 1163300915] 2.so
  [1039408055 1077584459.00 1112557740] 3-spill.so

This was using an older version of regalloc3 which didn't have live range splitting (the 50% cost I mentioned is only with live range splitting). This should be the same version you tested based on your Cargo.lock.
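
For anyone checking the bz2 figures above by hand, here is a minimal sketch of how the "1.08x to 1.66x faster" range appears to follow from the other numbers. It assumes the middle value in each bracket is the central estimate, that Δ is the difference of those two values, and that the range is 1 + (Δ ± half-width) / faster. That formula reproduces the quoted output but is an inference from the numbers, not taken from Sightglass's source, and `speedup_range` is a hypothetical helper name.

```rust
/// Speedup range implied by two central values and a confidence half-width.
///
/// Assumption (not from Sightglass's source): the reported range is
/// 1 + (delta ± half_width) / faster, where delta = slower - faster.
/// Returns (low end, high end) as "times faster" factors.
fn speedup_range(slower: f64, faster: f64, half_width: f64) -> (f64, f64) {
    // Difference of the central values; this is the Δ printed in the report.
    let delta = slower - faster;
    // Pessimistic and optimistic ends of the interval, relative to the faster run.
    let low = 1.0 + (delta - half_width) / faster;
    let high = 1.0 + (delta + half_width) / faster;
    (low, high)
}
```

Calling `speedup_range(162930575.50, 118959015.00, 34846300.90)` with the bz2 values for 2.so and 3-spill.so gives roughly 1.077 and 1.663, which round to the 1.08x and 1.66x in the report. The interval is wide because the half-width (about 34.8M cycles) is a large fraction of Δ (about 44.0M cycles).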

Amanieu (Feb 26 2025 at 23:14):

However you can see that ra3 is consistently faster than ra2 in terms of compilation speed.

Chris Fallin (Feb 26 2025 at 23:14):

how does runtime perf look in the non-spilling version? (I forget, sorry; it'd be helpful to have an up-to-date summary of all the numbers somewhere)

Amanieu (Feb 26 2025 at 23:17):

Amanieu (Feb 27 2025 at 13:32):

@Erik Rose I've confirmed that the 25% regression you observed is entirely due to the shuffling allocator. regalloc3 doesn't make use of smallvec and instead relies on the caller to reuse the register allocation context for multiple functions (preserving the Vec allocations inside it). Unfortunately Cranelift doesn't use regalloc2::run_with_ctx and therefore cannot take advantage of this.
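
The context-reuse pattern described here can be sketched as follows. This is a toy illustration only: `Ctx`, `run`, and this `run_with_ctx` are hypothetical stand-ins with made-up bodies, not the regalloc2 or regalloc3 API. The point is that `Vec::clear` keeps the buffer's capacity, so a caller that holds on to the context pays for the heap allocation once rather than once per function.

```rust
/// Hypothetical stand-in for an allocator context that owns scratch buffers.
/// Illustrative only; not the real regalloc2/regalloc3 types.
#[derive(Default)]
struct Ctx {
    scratch: Vec<u32>,
}

/// Convenience entry point that builds a fresh context on every call,
/// so every call allocates new scratch space.
fn run(input: &[u32]) -> u32 {
    let mut ctx = Ctx::default();
    run_with_ctx(input, &mut ctx)
}

/// Entry point that reuses the caller's context. `clear` drops the old
/// contents but keeps the capacity, so a later call whose input fits in
/// the existing buffer performs no new heap allocation.
fn run_with_ctx(input: &[u32], ctx: &mut Ctx) -> u32 {
    ctx.scratch.clear();
    // Placeholder "work": double each input and store it in the scratch buffer.
    ctx.scratch.extend(input.iter().map(|x| x.wrapping_mul(2)));
    // Placeholder "result": sum of the scratch buffer.
    ctx.scratch.iter().fold(0u32, |acc, x| acc.wrapping_add(*x))
}
```

A caller compiling many functions would create one `Ctx` and pass it to `run_with_ctx` for each function. Under a deliberately slow allocator such as the shuffling allocator, the per-call allocations in the `run` style are what get magnified.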

Amanieu (Feb 27 2025 at 13:32):

However, the difference is only really noticeable with the shuffling allocator enabled, since it makes allocations so slow that compilation takes ~5x longer.

Erik Rose (Feb 27 2025 at 14:40):

Fantastic. Thanks, Amanieu! I will update my checkouts, get off the shuffling allocator, and re-run my benchmarks.

Disable shuffling allocator during benchmarks by default. by erikrose · Pull Request #10300 · bytecodealliance/wasmtime

The slowness of the shuffling obscured performance signal (by dwarfing it) more than the accidental localities it was meant to avoid. Closes bytecodealliance/sightglass#280. This issue came up in h...

Amanieu (Feb 27 2025 at 14:46):

Note that if you update your checkouts now, the current main branch defaults to enabling live range splitting, which is what causes the 50% slowdown I previously mentioned.

Amanieu (Feb 27 2025 at 14:47):

You can override it by selecting SplitStrategy::Spill in the regalloc3 options. Or just modify your regalloc3 checkout to make that the default temporarily.
