cfallin opened issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [ ] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [ ] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [ ] Release regalloc2 crate on crates.io
- [ ] Merge support for regalloc2 into Cranelift
The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)
The nature of the changes to Cranelift is such that we have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built around the regalloc abstractions, so swapping it out has a large effect. Fortunately, though, I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.
Here is a current snapshot of some benchmark results:
| Benchmark | Compilation (wallclock) | Execution (wallclock) |
|-----------|-------------------------|-----------------------|
| blake3-scalar | 25% faster | 28% faster |
| blake3-simd | no diff | no diff |
| meshoptimizer | 19% faster | 17% faster |
| pulldown-cmark | 17% faster | no diff |
| bz2 | 15% faster | no diff |
| SpiderMonkey, fib(30) | 21% faster | 2% faster |
| clang.wasm | 42% faster | N/A |
with full details here:
<details>
<summary>Benchmark methodology and raw output</summary>
As percentage improvement over baseline (old):

Benchmark               Compilation (wallclock)   Execution (wallclock)
blake3-scalar           25% faster                28% faster
blake3-simd             no diff                   no diff
meshoptimizer           19% faster                17% faster
pulldown-cmark          17% faster                no diff
bz2                     15% faster                no diff
SpiderMonkey, fib(30)   21% faster                2% faster
clang.wasm              42% faster                N/A

As ratios (percent improvement above = 100% * (1 - 1/speedup_ratio))

Benchmark               Compilation (wallclock)   Execution (wallclock)
blake3-scalar           1.34x faster              1.38x faster
blake3-simd             no diff                   no diff
meshoptimizer           1.24x faster              1.21x faster
pulldown-cmark          1.21x faster              no diff
bz2                     1.18x faster              no diff
SpiderMonkey, fib(30)   1.26x faster              1.02x faster
clang.wasm              1.71x faster              N/A

Methodology:

- Sightglass with --processes 2 --iterations-per-process 5.
- Last two benchmarks running commandline wasmtime
  - rm -r ~/.cache/wasmtime
  - run `wasmtime run` once to ensure compiled
  - measure runtime 5x, take best of five
  - measure compile time with `wasmtime compile` 5x, take best of five
  - clang.wasm doesn't have a test harness, so is compile-only
- Testing on 12-core / 24-thread Ryzen 3900X, Linux/x86-64

Comparing baseline of Wasmtime fdf063df98ad3839b0e0b78ea55b53b1a296abb0 (from
Mar 16) against my internal regalloc2 branch
9b89942cf62d262ee9ac3e7eab525ea8544a458b (from Mar 17) which last synced with
Wasmtime at eb1b71e31c035ff4250c5013ca0268deb931aa7c (from Feb 24).

Raw output of Sightglass below (instantiation excluded, not interesting).

compilation :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 121531866.00 ± 51042761.18 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!

[478052996 501410277.40 591983000] new.so
[604955098 622942143.40 709527450] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 31981472.00 ± 13432120.92 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!

[125802142 131948268.40 155782325] new.so
[159196645 163929740.40 186715328] old.so

execution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 36931.50 ± 3272.72 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!

[105358 106660.00 110728] new.so
[140608 143591.50 149787] old.so

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 140341.60 ± 12437.21 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!

[400368 405315.60 420774] new.so
[534318 545657.20 569202] old.so

compilation :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.

[112727304 139448014.80 189082604] new.so
[123143218 156732493.40 233512432] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.

[29664800 36696541.20 49758219] new.so
[32405712 41244760.40 61449541] old.so

execution :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.

[400672 739521.80 1042226] new.so
[498142 828791.40 1160786] old.so

execution :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.

[105439 194609.20 274267] new.so
[131088 218099.20 305464] old.so

compilation :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 483775336.20 ± 24646158.96 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!

[2090515508 2113482784.00 2150210240] new.so
[2554359582 2597258120.20 2630111328] old.so

compilation :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 127275628.40 ± 6480546.57 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!

[550127669 556172437.60 565836581] new.so
[672188482 683448066.00 692063546] old.so

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 3386913742.00 ± 454568778.61 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[17786842514 17978520795.40 18352029814] new.so
[20863697992 21365434537.40 22139271504] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 891020039.40 ± 119694835.02 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[4680694128 4731128047.40 4829411387] new.so
[5489883512 5622148086.80 5826025212] old.so

compilation :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 213252595.20 ± 29303757.92 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[1120180378 1148350389.80 1203069094] new.so
[1340768136 1361602985.00 1397014596] old.so

compilation :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 56118120.00 ± 7711578.76 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[294780634 302193792.40 316593182] new.so
[352828441 358311912.40 367631343] old.so

execution :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.

[8257780 8443755.80 8560944] new.so
[8455570 9495162.60 17648568] old.so

execution :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.

[2173072 2222013.50 2252853] new.so
[2225116 2498693.60 4644290] old.so

compilation :: cycles :: benchmarks-next/bz2/benchmark.wasm
Δ = 58684068.80 ± 36909440.37 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!

[498967588 545831464.20 586460840] new.so
[540660276 604515533.00 635005118] old.so

compilation :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
Δ = 15436153.00 ± 9714229.01 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!

[131305387 143637939.40 154329874] new.so
[142264400 159074092.40 167089438] old.so

execution :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.

[25932760 35978222.50 53794238] new.so
[28960083 29737468.90 35137211] old.so

execution :: cycles :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.

[98545894 136719075.20 204420658] new.so
[110059628 113008690.20 133522880] old.so
</details>
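For anyone cross-checking the two summary tables against the raw Sightglass intervals, here is a minimal Python sketch. `percent_improvement` implements the conversion stated in the details (100% * (1 - 1/speedup_ratio)). `speedup_range` is an assumption: it is a formula inferred from the raw numbers above (difference of means plus or minus the reported interval half-width, relative to the new mean), not something taken from Sightglass's documentation, and both function names are hypothetical.

```python
def percent_improvement(speedup_ratio: float) -> float:
    """Convert an 'Nx faster' ratio into a percentage improvement,
    using the formula quoted in the issue: 100% * (1 - 1/speedup_ratio)."""
    return 100.0 * (1.0 - 1.0 / speedup_ratio)


def speedup_range(new_mean: float, old_mean: float, half_width: float):
    """ASSUMPTION: reconstructs the 'new.so is Ax to Bx faster' interval as
    1 + (delta -/+ half_width) / new_mean, where delta = old_mean - new_mean.
    This matches the figures in the raw output but is inferred, not documented."""
    delta = old_mean - new_mean
    low = 1.0 + (delta - half_width) / new_mean
    high = 1.0 + (delta + half_width) / new_mean
    return low, high


# Compilation speedup ratios copied from the "As ratios" table.
compile_ratios = {
    "blake3-scalar": 1.34,
    "meshoptimizer": 1.24,
    "pulldown-cmark": 1.21,
    "bz2": 1.18,
    "SpiderMonkey, fib(30)": 1.26,
    "clang.wasm": 1.71,
}

for name, ratio in compile_ratios.items():
    print(f"{name}: {ratio:.2f}x -> {percent_improvement(ratio):.0f}% faster")

# blake3-scalar compilation (cycles): means and interval from the raw output.
low, high = speedup_range(501410277.40, 622942143.40, 51042761.18)
print(f"blake3-scalar compile: {low:.2f}x to {high:.2f}x faster")
```

The summary tables appear to quote the upper end of each Sightglass interval (for example 1.34x for blake3-scalar compilation), so the percentages are best read as the optimistic bound rather than the mean.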
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [ ] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [ ] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [ ] Release regalloc2 crate on crates.io
- [ ] Merge support for regalloc2 into Cranelift
The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)
The nature of the changes to Cranelift is such that we have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built around the regalloc abstractions, so swapping it out has a large effect. Fortunately, though, I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.
Here is a current snapshot of some benchmark results:
| Benchmark | Compilation (wallclock) | Execution (wallclock) |
|-----------|-------------------------|-----------------------|
| blake3-scalar | 25% faster | 28% faster |
| blake3-simd | no diff | no diff |
| meshoptimizer | 19% faster | 17% faster |
| pulldown-cmark | 17% faster | no diff |
| bz2 | 15% faster | no diff |
| SpiderMonkey, fib(30) | 21% faster | 2% faster |
| clang.wasm | 42% faster | N/A |
with full details here:
<details>
<summary>Benchmark methodology and raw output</summary>
<pre>
As percentage improvement over baseline (old):Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/AAs ratios (percent improvement above = 100% * (1 - 1/speedup_ratio))
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 1.34x faster 1.38x faster
blake3-simd no diff no diff
meshoptimizer 1.24x faster 1.21x faster
pulldown-cmark 1.21x faster no diff
bz2 1.18x faster no diff
SpiderMonkey, 1.26x faster 1.02x faster
fib(30)
clang.wasm 1.71x faster N/AMethodology:
- Sightglass with --processes 2 --iterations-per-process 5.
- Last two benchmarks running commandline wasmtime
- rm -r ~/.cache/wasmtime
- run
wasmtime run
once to ensure compiled- measure runtime 5x, take best of five
- measure compile time with
wasmtime compile
5x, take best of five- clang.wasm doesn't have a test harness, so is compile-only
- Testing on 12-core / 24-thread Ryzen 3900X, Linux/x86-64
Comparing baseline of Wasmtime fdf063df98ad3839b0e0b78ea55b53b1a296abb0 (from
Mar 16) against my internal regalloc2 branch
9b89942cf62d262ee9ac3e7eab525ea8544a458b (from Mar 17) which last synced with
Wasmtime at eb1b71e31c035ff4250c5013ca0268deb931aa7c (from Feb 24).Raw output of Sightglass below (instantiation excluded, not interesting).
compilation :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 121531866.00 ± 51042761.18 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so![478052996 501410277.40 591983000] new.so
[604955098 622942143.40 709527450] old.socompilation :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 31981472.00 ± 13432120.92 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so![125802142 131948268.40 155782325] new.so
[159196645 163929740.40 186715328] old.soexecution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 36931.50 ± 3272.72 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[105358 106660.00 110728] new.so
[140608 143591.50 149787] old.so

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 140341.60 ± 12437.21 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[400368 405315.60 420774] new.so
[534318 545657.20 569202] old.so

compilation :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[112727304 139448014.80 189082604] new.so
[123143218 156732493.40 233512432] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[29664800 36696541.20 49758219] new.so
[32405712 41244760.40 61449541] old.so

execution :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[400672 739521.80 1042226] new.so
[498142 828791.40 1160786] old.so

execution :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[105439 194609.20 274267] new.so
[131088 218099.20 305464] old.so

compilation :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 483775336.20 ± 24646158.96 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[2090515508 2113482784.00 2150210240] new.so
[2554359582 2597258120.20 2630111328] old.so

compilation :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 127275628.40 ± 6480546.57 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[550127669 556172437.60 565836581] new.so
[672188482 683448066.00 692063546] old.so

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 3386913742.00 ± 454568778.61 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[17786842514 17978520795.40 18352029814] new.so
[20863697992 21365434537.40 22139271504] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 891020039.40 ± 119694835.02 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[4680694128 4731128047.40 4829411387] new.so
[5489883512 5622148086.80 5826025212] old.so

compilation :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 213252595.20 ± 29303757.92 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[1120180378 1148350389.80 1203069094] new.so
[1340768136 1361602985.00 1397014596] old.so

compilation :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 56118120.00 ± 7711578.76 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[294780634 302193792.40 316593182] new.so
[352828441 358311912.40 367631343] old.so

execution :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[8257780 8443755.80 8560944] new.so
[8455570 9495162.60 17648568] old.so

execution :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[2173072 2222013.50 2252853] new.so
[2225116 2498693.60 4644290] old.so

compilation :: cycles :: benchmarks-next/bz2/benchmark.wasm
Δ = 58684068.80 ± 36909440.37 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[498967588 545831464.20 586460840] new.so
[540660276 604515533.00 635005118] old.so

compilation :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
Δ = 15436153.00 ± 9714229.01 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[131305387 143637939.40 154329874] new.so
[142264400 159074092.40 167089438] old.so

execution :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[25932760 35978222.50 53794238] new.so
[28960083 29737468.90 35137211] old.so

execution :: cycles :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[98545894 136719075.20 204420658] new.so
[110059628 113008690.20 133522880] old.so
</pre>
</details>
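As a reading aid for the raw output, here is a minimal Python sketch of how the "Nx to Mx faster" ranges appear to relate to the reported Δ and means. It assumes Δ is the difference of the two means (old minus new) and that the ± term is the half-width of its confidence interval; that reading reproduces the figures shown, but it is a reconstruction from the numbers, not Sightglass's actual code. The function names are made up for illustration. The second helper applies the ratio-to-percent conversion stated in the methodology.

```python
def speedup_range(new_mean, delta, half_width):
    """Return (low, high) for "new is Nx faster than old".

    Assumes delta = old_mean - new_mean and that half_width is the
    half-width of its confidence interval, as in "Δ = delta ± half_width".
    """
    low = 1.0 + (delta - half_width) / new_mean
    high = 1.0 + (delta + half_width) / new_mean
    return low, high


def percent_improvement(speedup_ratio):
    """Percent improvement = 100% * (1 - 1/speedup_ratio), per the methodology."""
    return 100.0 * (1.0 - 1.0 / speedup_ratio)


# blake3-scalar, execution :: nanoseconds, values copied from the raw output.
low, high = speedup_range(new_mean=106660.00, delta=36931.50, half_width=3272.72)
print(f"new.so is {low:.2f}x to {high:.2f}x faster than old.so")

# Upper bound of that range expressed as a percentage.
print(f"{percent_improvement(1.38):.0f}% faster")
```

With the blake3-scalar execution numbers this gives 1.32x to 1.38x, and 1.38x converts to the 28% listed in the summary table.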
abrown commented on issue #3942:
(Can we add spidermonkey.wasm and clang.wasm to Sightglass?)
cfallin commented on issue #3942:
> (Can we add spidermonkey.wasm and clang.wasm to Sightglass?)
We could perhaps, yeah, with some hackery (building a toplevel harness mostly). In the SpiderMonkey case we need to add a WASI directory capability and feed in a JS file, and in the clang case we need a way to tell the infra that it's compile-only (I don't know how to run it). For now it's not too bad to run by hand though :-)
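For anyone wanting to repeat the by-hand runs, here is a rough Python sketch of the procedure from the methodology section (clear the cache, then take the best of five timings). The helper names are hypothetical, and the command lines in the comments are placeholders: the SpiderMonkey run in particular needs extra directory and argument flags that are not shown here.

```python
import shutil
import subprocess
import time
from pathlib import Path


def clear_wasmtime_cache():
    """Equivalent of `rm -r ~/.cache/wasmtime`; a missing directory is ignored."""
    shutil.rmtree(Path.home() / ".cache" / "wasmtime", ignore_errors=True)


def best_of(n, action):
    """Call action() n times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


def time_command(argv, runs=5):
    """Best-of-`runs` wall-clock time for an external command."""
    return best_of(
        runs,
        lambda: subprocess.run(argv, check=True, stdout=subprocess.DEVNULL),
    )


# Example usage (placeholder file name, not executed here):
#   clear_wasmtime_cache()
#   compile_s = time_command(["wasmtime", "compile", "clang.wasm"])
#   print(f"best compile time: {compile_s:.3f}s")
```

Taking the minimum rather than the mean follows the "best of five" rule in the methodology, which filters out slow outlier runs.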
cfallin commented on issue #3942:
A little more benchmarking -- taking most of the modules from #911 and compiling with baseline and regalloc2:
| Wasm (SHA256 of module) from #911 | baseline compile (s) | regalloc2 compile (s) |
|---|---|---|
| 0ddff0dac47311846e831cb25df5ec5fcb7c59a4 | 1.201 | 0.262 |
| 256e0360aa2774d6ad1bb5589030b7a944a81c5d | 0.680 | 0.671 |
| 28276a409e576044bea8cdc46068426484bf7b06 | 0.035 | 0.039 |
| 2e746b5b07c0a022415d6c1527815af44daae33e | 0.006 | 0.006 |
| 4286371e64c07f853a5d4de482d658f3c7f2c711 | 0.137 | 0.365 |
| 6ccd889e8a97b9adb2697f9f60477e511ad50be4 | 0.721 | 0.329 |
| 9850b3172ddb705be8caa06599cb92ead3cd251c | 0.509 | 0.645 |
| bdb6099c0073360613f17cc9a7d2380d50f8eb9e | 2.725 | 0.061 |
| bf8490f3bd1f3350a0d4a83670bb1d3d017cf8ef | 0.074 | 0.283 |
| cb46921624763cf50eb826585d224bb3975a4234 | 0.693 | 0.035 |
| d31a6a6de65a08096dc855a17f49499114826a3e | 0.057 | 0.284 |
| d51589b35a521c29420fc140b292383f2ca5fd70 | 3.180 | 0.617 |
| dfafaa30ecd41ab9bece126eec8129b42925a4dd | 1.367 | 1.011 |
In most cases things got faster, sometimes significantly so (3.18s -> 0.61s, 1.2s -> 0.26s, 2.7s -> 0.061s (!)). This tracks with my understanding of some of the bottlenecks I saw in profiling before, and with the effort in regalloc2 to avoid quadratic blowups and nonlinear behavior in general as far as possible. Some of the smaller modules see increases (0.137s -> 0.365s, 0.057s -> 0.284s); I haven't conclusively resolved what's going on in those, but it wouldn't surprise me if this comes from the splitting heuristics being a little more aggressive. In any case nothing immediately jumps out in the profile.
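To make the per-module picture easier to scan, here is a small Python sketch that tallies the table above. The timings are copied from the table (hashes shortened to a prefix for readability); the variable names and the summary format are my own.

```python
# (module hash prefix, baseline compile seconds, regalloc2 compile seconds)
rows = [
    ("0ddff0da", 1.201, 0.262),
    ("256e0360", 0.680, 0.671),
    ("28276a40", 0.035, 0.039),
    ("2e746b5b", 0.006, 0.006),
    ("4286371e", 0.137, 0.365),
    ("6ccd889e", 0.721, 0.329),
    ("9850b317", 0.509, 0.645),
    ("bdb6099c", 2.725, 0.061),
    ("bf8490f3", 0.074, 0.283),
    ("cb469216", 0.693, 0.035),
    ("d31a6a6d", 0.057, 0.284),
    ("d51589b3", 3.180, 0.617),
    ("dfafaa30", 1.367, 1.011),
]

# Classify each module by whether regalloc2 compiled it faster or slower.
faster = [r for r in rows if r[2] < r[1]]
slower = [r for r in rows if r[2] > r[1]]
same = [r for r in rows if r[2] == r[1]]

total_base = sum(r[1] for r in rows)
total_new = sum(r[2] for r in rows)

print(f"faster: {len(faster)}, slower: {len(slower)}, unchanged: {len(same)}")
print(f"total compile time: {total_base:.3f}s -> {total_new:.3f}s")

# Per-module ratio (baseline / regalloc2); values above 1 mean regalloc2 is faster.
for name, base, new in rows:
    print(f"{name}: {base / new:.2f}x")
```

Summed over all thirteen modules, compile time drops from 11.385s to 4.608s, even though five modules individually get slower.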
alexcrichton labeled issue #3942:
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [ ] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [ ] Release regalloc2 crate on crates.io
- [ ] Merge support for regalloc2 into Cranelift
The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)
The nature of the changes to Cranelift are such that we do have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built up around the regalloc abstractions, so swapping it out has a large effect. Fortunately though I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.
Here is a current snapshot of some benchmark results:
| Benchmark | Compilation (wallclock) | Execution (wallclock) |
|---|---|---|
| blake3-scalar | 25% faster | 28% faster |
| blake3-simd | no diff | no diff |
| meshoptimizer | 19% faster | 17% faster |
| pulldown-cmark | 17% faster | no diff |
| bz2 | 15% faster | no diff |
| SpiderMonkey, fib(30) | 21% faster | 2% faster |
| clang.wasm | 42% faster | N/A |
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [x] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [ ] Release regalloc2 crate on crates.io
- [ ] Merge support for regalloc2 into Cranelift
The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)
The nature of the changes to Cranelift are such that we do have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built up around the regalloc abstractions, so swapping it out has a large effect. Fortunately though I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.
Here is a current snapshot of some benchmark results:
| Benchmark | Compilation (wallclock) | Execution (wallclock) |
|---|---|---|
| blake3-scalar | 25% faster | 28% faster |
| blake3-simd | no diff | no diff |
| meshoptimizer | 19% faster | 17% faster |
| pulldown-cmark | 17% faster | no diff |
| bz2 | 15% faster | no diff |
| SpiderMonkey, fib(30) | 21% faster | 2% faster |
| clang.wasm | 42% faster | N/A |
with full details here:
<details>
<summary>Benchmark methodology and raw output</summary>
<pre>
As percentage improvement over baseline (old):

Benchmark        Compilation (wallclock)   Execution (wallclock)
blake3-scalar    25% faster                28% faster
blake3-simd      no diff                   no diff
meshoptimizer    19% faster                17% faster
pulldown-cmark   17% faster                no diff
bz2              15% faster                no diff
SpiderMonkey,    21% faster                2% faster
  fib(30)
clang.wasm       42% faster                N/A

As ratios (percent improvement above = 100% * (1 - 1/speedup_ratio)):

Benchmark        Compilation (wallclock)   Execution (wallclock)
blake3-scalar    1.34x faster              1.38x faster
blake3-simd      no diff                   no diff
meshoptimizer    1.24x faster              1.21x faster
pulldown-cmark   1.21x faster              no diff
bz2              1.18x faster              no diff
SpiderMonkey,    1.26x faster              1.02x faster
  fib(30)
clang.wasm       1.71x faster              N/A

Methodology:

- Sightglass with --processes 2 --iterations-per-process 5.
- Last two benchmarks running commandline wasmtime
- rm -r ~/.cache/wasmtime
- run `wasmtime run` once to ensure compiled
- measure runtime 5x, take best of five
- measure compile time with `wasmtime compile` 5x, take best of five
- clang.wasm doesn't have a test harness, so is compile-only
- Testing on 12-core / 24-thread Ryzen 3900X, Linux/x86-64

Comparing baseline of Wasmtime fdf063df98ad3839b0e0b78ea55b53b1a296abb0 (from
Mar 16) against my internal regalloc2 branch
9b89942cf62d262ee9ac3e7eab525ea8544a458b (from Mar 17) which last synced with
Wasmtime at eb1b71e31c035ff4250c5013ca0268deb931aa7c (from Feb 24).

Raw output of Sightglass below (instantiation excluded, not interesting).

compilation :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 121531866.00 ± 51042761.18 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!
[478052996 501410277.40 591983000] new.so
[604955098 622942143.40 709527450] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 31981472.00 ± 13432120.92 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!
[125802142 131948268.40 155782325] new.so
[159196645 163929740.40 186715328] old.so

execution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 36931.50 ± 3272.72 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[105358 106660.00 110728] new.so
[140608 143591.50 149787] old.so

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 140341.60 ± 12437.21 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[400368 405315.60 420774] new.so
[534318 545657.20 569202] old.so

compilation :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[112727304 139448014.80 189082604] new.so
[123143218 156732493.40 233512432] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[29664800 36696541.20 49758219] new.so
[32405712 41244760.40 61449541] old.so

execution :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[400672 739521.80 1042226] new.so
[498142 828791.40 1160786] old.so

execution :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[105439 194609.20 274267] new.so
[131088 218099.20 305464] old.so

compilation :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 483775336.20 ± 24646158.96 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[2090515508 2113482784.00 2150210240] new.so
[2554359582 2597258120.20 2630111328] old.so

compilation :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 127275628.40 ± 6480546.57 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[550127669 556172437.60 565836581] new.so
[672188482 683448066.00 692063546] old.so

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 3386913742.00 ± 454568778.61 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[17786842514 17978520795.40 18352029814] new.so
[20863697992 21365434537.40 22139271504] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 891020039.40 ± 119694835.02 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[4680694128 4731128047.40 4829411387] new.so
[5489883512 5622148086.80 5826025212] old.so

compilation :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 213252595.20 ± 29303757.92 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[1120180378 1148350389.80 1203069094] new.so
[1340768136 1361602985.00 1397014596] old.so

compilation :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 56118120.00 ± 7711578.76 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[294780634 302193792.40 316593182] new.so
[352828441 358311912.40 367631343] old.so

execution :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[8257780 8443755.80 8560944] new.so
[8455570 9495162.60 17648568] old.so

execution :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[2173072 2222013.50 2252853] new.so
[2225116 2498693.60 4644290] old.so

compilation :: cycles :: benchmarks-next/bz2/benchmark.wasm
Δ = 58684068.80 ± 36909440.37 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[498967588 545831464.20 586460840] new.so
[540660276 604515533.00 635005118] old.so

compilation :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
Δ = 15436153.00 ± 9714229.01 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[131305387 143637939.40 154329874] new.so
[142264400 159074092.40 167089438] old.so

execution :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[25932760 35978222.50 53794238] new.so
[28960083 29737468.90 35137211] old.so

execution :: cycles :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[98545894 136719075.20 204420658] new.so
[110059628 113008690.20 133522880] old.so
</pre>
</details>
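
The benchmark details give the relation between the two summary tables as `100% * (1 - 1/speedup_ratio)`. A minimal Python sketch of that conversion follows, using the compilation ratios reported in the issue. The names `percent_improvement` and `compile_ratios` are illustrative only and are not part of Sightglass or Wasmtime.

```python
def percent_improvement(speedup_ratio: float) -> float:
    """Convert an 'N.NNx faster' ratio into a percentage reduction in time.

    Implements the formula quoted in the issue:
    100% * (1 - 1/speedup_ratio).
    """
    if speedup_ratio <= 0:
        raise ValueError("speedup ratio must be positive")
    return 100.0 * (1.0 - 1.0 / speedup_ratio)


# Compilation speedup ratios copied from the ratio table in the issue.
compile_ratios = {
    "blake3-scalar": 1.34,
    "meshoptimizer": 1.24,
    "pulldown-cmark": 1.21,
    "bz2": 1.18,
    "SpiderMonkey, fib(30)": 1.26,
    "clang.wasm": 1.71,
}

if __name__ == "__main__":
    for name, ratio in compile_ratios.items():
        pct = percent_improvement(ratio)
        print(f"{name}: {ratio:.2f}x faster = {pct:.0f}% faster")
```

Rounded to whole percentages, these values match the compilation column of the percentage table (25%, 19%, 17%, 15%, 21%, 42%).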
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [x] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [x] Release regalloc2 crate on crates.io
- [ ] Merge support for regalloc2 into Cranelift
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [x] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [x] Release regalloc2 crate on crates.io (done)
- [ ] Merge support for regalloc2 into Cranelift
cfallin closed issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [x] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [x] Release regalloc2 crate on crates.io (done)
- [ ] Merge support for regalloc2 into Cranelift
cfallin edited issue #3942:
This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.
The major tasks remaining are:
- [x] Develop regalloc2 as a standalone project
- [x] Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
- [x] Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
- [x] Release regalloc2 crate on crates.io (done)
- [x] Merge support for regalloc2 into Cranelift
The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)
The nature of the changes to Cranelift is such that we do have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built up around the regalloc abstractions, so swapping it out has a large effect. Fortunately, though, I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.
Here is a current snapshot of some benchmark results:
| Benchmark | Compilation (wallclock) | Execution (wallclock) |
|---|---|---|
| blake3-scalar | 25% faster | 28% faster |
| blake3-simd | no diff | no diff |
| meshoptimizer | 19% faster | 17% faster |
| pulldown-cmark | 17% faster | no diff |
| bz2 | 15% faster | no diff |
| SpiderMonkey, fib(30) | 21% faster | 2% faster |
| clang.wasm | 42% faster | N/A |
with full details here:
<details>
<summary>Benchmark methodology and raw output</summary>
<pre>
As percentage improvement over baseline (old):

Benchmark        Compilation (wallclock)   Execution (wallclock)
blake3-scalar    25% faster                28% faster
blake3-simd      no diff                   no diff
meshoptimizer    19% faster                17% faster
pulldown-cmark   17% faster                no diff
bz2              15% faster                no diff
SpiderMonkey,    21% faster                2% faster
  fib(30)
clang.wasm       42% faster                N/A

As ratios (percent improvement above = 100% * (1 - 1/speedup_ratio))

Benchmark        Compilation (wallclock)   Execution (wallclock)
blake3-scalar    1.34x faster              1.38x faster
blake3-simd      no diff                   no diff
meshoptimizer    1.24x faster              1.21x faster
pulldown-cmark   1.21x faster              no diff
bz2              1.18x faster              no diff
SpiderMonkey,    1.26x faster              1.02x faster
  fib(30)
clang.wasm       1.71x faster              N/A

Methodology:

- Sightglass with --processes 2 --iterations-per-process 5.
- Last two benchmarks running commandline wasmtime
- rm -r ~/.cache/wasmtime
- run `wasmtime run` once to ensure compiled
- measure runtime 5x, take best of five
- measure compile time with `wasmtime compile` 5x, take best of five
- clang.wasm doesn't have a test harness, so is compile-only
- Testing on 12-core / 24-thread Ryzen 3900X, Linux/x86-64

Comparing baseline of Wasmtime fdf063df98ad3839b0e0b78ea55b53b1a296abb0 (from
Mar 16) against my internal regalloc2 branch
9b89942cf62d262ee9ac3e7eab525ea8544a458b (from Mar 17) which last synced with
Wasmtime at eb1b71e31c035ff4250c5013ca0268deb931aa7c (from Feb 24).

Raw output of Sightglass below (instantiation excluded, not interesting).
compilation :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 121531866.00 ± 51042761.18 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!
[478052996 501410277.40 591983000] new.so
[604955098 622942143.40 709527450] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 31981472.00 ± 13432120.92 (confidence = 99%)
new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!
[125802142 131948268.40 155782325] new.so
[159196645 163929740.40 186715328] old.so

execution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 36931.50 ± 3272.72 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[105358 106660.00 110728] new.so
[140608 143591.50 149787] old.so

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm
Δ = 140341.60 ± 12437.21 (confidence = 99%)
new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!
[400368 405315.60 420774] new.so
[534318 545657.20 569202] old.so

compilation :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[112727304 139448014.80 189082604] new.so
[123143218 156732493.40 233512432] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[29664800 36696541.20 49758219] new.so
[32405712 41244760.40 61449541] old.so

execution :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[400672 739521.80 1042226] new.so
[498142 828791.40 1160786] old.so

execution :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm
No difference in performance.
[105439 194609.20 274267] new.so
[131088 218099.20 305464] old.so

compilation :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 483775336.20 ± 24646158.96 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[2090515508 2113482784.00 2150210240] new.so
[2554359582 2597258120.20 2630111328] old.so

compilation :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 127275628.40 ± 6480546.57 (confidence = 99%)
new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!
[550127669 556172437.60 565836581] new.so
[672188482 683448066.00 692063546] old.so

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 3386913742.00 ± 454568778.61 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[17786842514 17978520795.40 18352029814] new.so
[20863697992 21365434537.40 22139271504] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm
Δ = 891020039.40 ± 119694835.02 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[4680694128 4731128047.40 4829411387] new.so
[5489883512 5622148086.80 5826025212] old.so

compilation :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 213252595.20 ± 29303757.92 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[1120180378 1148350389.80 1203069094] new.so
[1340768136 1361602985.00 1397014596] old.so

compilation :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
Δ = 56118120.00 ± 7711578.76 (confidence = 99%)
new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!
[294780634 302193792.40 316593182] new.so
[352828441 358311912.40 367631343] old.so

execution :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[8257780 8443755.80 8560944] new.so
[8455570 9495162.60 17648568] old.so

execution :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm
No difference in performance.
[2173072 2222013.50 2252853] new.so
[2225116 2498693.60 4644290] old.so

compilation :: cycles :: benchmarks-next/bz2/benchmark.wasm
Δ = 58684068.80 ± 36909440.37 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[498967588 545831464.20 586460840] new.so
[540660276 604515533.00 635005118] old.so

compilation :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
Δ = 15436153.00 ± 9714229.01 (confidence = 99%)
new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!
[131305387 143637939.40 154329874] new.so
[142264400 159074092.40 167089438] old.so

execution :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[25932760 35978222.50 53794238] new.so
[28960083 29737468.90 35137211] old.so

execution :: cycles :: benchmarks-next/bz2/benchmark.wasm
No difference in performance.
[98545894 136719075.20 204420658] new.so
[110059628 113008690.20 133522880] old.so
</pre>
</details>
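For readers who want to check the summary numbers against the raw output, here is a minimal Rust sketch of the two conversions involved. `percent_improvement` is the formula stated in the details above (100% * (1 - 1/speedup_ratio)). `speedup_interval` is an assumption: it is not taken from Sightglass itself, but dividing Δ ± the confidence half-width by the mean for new.so reproduces the printed "1.14x to 1.34x" style ranges. Both function names are hypothetical.

```rust
/// Convert a speedup ratio (e.g. 1.34 for "1.34x faster") into the
/// percentage improvement used in the summary table:
/// 100% * (1 - 1/speedup_ratio).
fn percent_improvement(speedup_ratio: f64) -> f64 {
    100.0 * (1.0 - 1.0 / speedup_ratio)
}

/// Assumed reconstruction of the "new.so is A to B faster" range:
/// (delta -/+ ci) relative to the mean of new.so, plus one.
/// This matches the raw output above but is not Sightglass's documented method.
fn speedup_interval(delta: f64, ci: f64, new_mean: f64) -> (f64, f64) {
    (1.0 + (delta - ci) / new_mean, 1.0 + (delta + ci) / new_mean)
}

fn main() {
    // Upper-bound compilation ratios from the "As ratios" table.
    for (name, ratio) in [("blake3-scalar", 1.34), ("clang.wasm", 1.71)] {
        println!("{}: {:.0}% faster", name, percent_improvement(ratio));
    }

    // blake3-scalar compilation (cycles): delta, CI half-width, mean of new.so.
    let (lo, hi) = speedup_interval(121531866.00, 51042761.18, 501410277.40);
    println!("blake3-scalar compile: {:.2}x to {:.2}x", lo, hi);
}
```

With these inputs, 1.34x corresponds to roughly 25% and 1.71x to roughly 42%, in line with the table. Note that the summary table appears to use the upper end of each Sightglass range.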
Last updated: Nov 22 2024 at 16:03 UTC