@Amanieu Per our exchange at today's meeting, here are the benchmarks I ran against wasmtime main
(79e1b5374710cd66af9f6330b4c2a231e61d9866) vs. https://github.com/bytecodealliance/wasmtime/commit/eb2acdfb4f27795856c7455824b08bb705d60e64, which pulls in your regalloc3 branch of regalloc2 (but not your adjust-split branch). You had expected a 50% slowdown in compilation, but I recalled a 25% one. Some benchmarks were indeed up in the 20s, but on average they were 14% slower at the low end of the confidence interval and 17% slower at the high end. I offer this only in case you find it interesting/surprising and by no means want to slow your work on the allocator, which will benefit us all! :-) Again, I ran these on an ARM chip (M3 Max), which has double the GPRs of x64.
regalloc3 all suite compilation only.txt
I'd suggest focusing on just spidermonkey, bz2, and pulldown-cmark, and maybe, as a second tier, intgemm-simd, meshoptimizer, and libsodium. All the shootout stuff is just tiny microbenchmarks; in practice they're going to be noise unless you're trying to optimize the specific thing they happen to measure.
I guess blake3 is fairly interesting, though.
So 28-29, 17-20, and 25-27%, respectively.
2nd tier: 25-26%, 17-21%, and ≈15-20%
Doesn't really change the numbers, but sure will save me time benchmarking! Thanks!
right yeah, I'm just trying to help make benchmarking easier and make it easier to evaluate the results
you can also benchmark particular phases. If you want to focus only on compile time, for example, you can do:
$ sightglass-cli benchmark --benchmark-phase compilation ...
I can also have it spit out JSON and save myself a lot of regexes, but I have yet to remember to do that until the middle of a long run. ;-)
FYI I run my benchmarks with the shuffling allocator disabled because it makes allocations extremely slow and distorts the runtime. That could be an explanation of why you're seeing different results.
ah, right, I filed https://github.com/bytecodealliance/sightglass/issues/280 a while ago to switch the default but then ran out of spare energy to do it; @Erik Rose that could be an easy PR to make
It's a one-line change:
diff --git a/crates/bench-api/Cargo.toml b/crates/bench-api/Cargo.toml
index a171b0beed..de288922ba 100644
--- a/crates/bench-api/Cargo.toml
+++ b/crates/bench-api/Cargo.toml
@@ -35,5 +35,5 @@ clap = { workspace = true }
wat = { workspace = true }
[features]
-default = ["shuffling-allocator", "wasi-nn"]
+default = ["wasi-nn"]
wasi-nn = ["wasmtime-wasi-nn"]
I must have been very low on spare energy then :-)
@Erik Rose Here are the results I am getting on my (admittedly quite perf-noisy) machine:
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 43971560.50 ± 34846300.90 (confidence = 99%)
3-spill.so is 1.08x to 1.66x faster than 2.so!
[129186085 162930575.50 216937350] 2.so
[85216530 118959015.00 141328075] 3-spill.so
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[95009950 116081567.00 158748170] 2.so
[80138100 111471111.50 141233575] 3-spill.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[1066633785 1108409337.00 1163300915] 2.so
[1039408055 1077584459.00 1112557740] 3-spill.so
This was using an older version of regalloc3 which didn't have live range splitting (the 50% cost I mentioned is only with live range splitting). This should be the same version you tested, based on your Cargo.lock.
However, you can see that ra3 is consistently faster than ra2 in terms of compilation speed.
how does runtime perf look in the non-spilling version? (I forget, sorry; it'd be helpful to have an up-to-date summary of all the numbers somewhere)
I'm currently in the middle of some perf optimizations, but you can find the older results here:
@Erik Rose I've confirmed that the 25% regression you observed is entirely due to the shuffling allocator. regalloc3 doesn't make use of smallvec and instead relies on the caller to reuse the register allocation context for multiple functions (preserving the Vec allocations inside it). Unfortunately Cranelift doesn't use regalloc2::run_with_ctx and therefore cannot take advantage of this.
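To illustrate why context reuse matters, here is a minimal sketch of the pattern in plain Rust. `AllocContext` and `run_with_context` are hypothetical names for illustration only, not the real regalloc2/regalloc3 API: a caller-owned context keeps its `Vec` buffers alive across calls, so later functions reuse the capacity grown by earlier ones instead of going back to the allocator each time.

```rust
// Illustrative sketch only: `AllocContext` and `run_with_context` are
// hypothetical names, not the actual regalloc2/regalloc3 API.
pub struct AllocContext {
    // Scratch buffers that survive between calls.
    live_ranges: Vec<u32>,
}

impl AllocContext {
    pub fn new() -> Self {
        AllocContext { live_ranges: Vec::new() }
    }
}

// "Allocating registers" for one function: `clear` empties the buffer
// but keeps its capacity, so repeated calls stop allocating once the
// buffer has grown to the size of the largest function seen so far.
pub fn run_with_context(ctx: &mut AllocContext, num_vregs: u32) {
    ctx.live_ranges.clear();
    for v in 0..num_vregs {
        ctx.live_ranges.push(v);
    }
}

fn main() {
    let mut ctx = AllocContext::new();
    run_with_context(&mut ctx, 1024);
    let cap_after_first = ctx.live_ranges.capacity();
    // A second, smaller function reuses the existing capacity:
    run_with_context(&mut ctx, 512);
    assert_eq!(ctx.live_ranges.capacity(), cap_after_first);
    println!("capacity retained across calls: {}", cap_after_first);
}
```

A caller that instead constructs a fresh context per function (as Cranelift effectively does without `run_with_ctx`) pays for those allocations every time, which is exactly what a deliberately slow allocator like the shuffling one magnifies.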
However, the difference is only really noticeable with the shuffling allocator enabled, since it makes allocations so slow that compilation takes ~5x longer.
Fantastic. Thanks, Amanieu! I will update my checkouts, get off the shuffling allocator, and re-run my benchmarks.
I opened a PR to disable the shuffling allocator as well: https://github.com/bytecodealliance/wasmtime/pull/10300.
Note that if you update your checkouts now, the current main branch defaults to enabling live range splitting, which is what causes the 50% slowdown I previously mentioned.
You can override it by selecting SplitStrategy::Spill in the regalloc3 options, or just modify your regalloc3 checkout to make that the default temporarily.
Last updated: Feb 27 2025 at 23:03 UTC