@Amanieu Per our exchange at today's meeting, here are the benchmarks I ran against wasmtime main (79e1b5374710cd66af9f6330b4c2a231e61d9866) vs. https://github.com/bytecodealliance/wasmtime/commit/eb2acdfb4f27795856c7455824b08bb705d60e64, which pulls in your regalloc3 branch of regalloc2 (but not your adjust-split branch). You had expected a 50% slowdown in compilation, but I recalled a 25% one. Some benchmarks were indeed up in the 20s, but on average they were 14% slower at the low end of the CI and 17% slower at the high end. I offer this only in case you find it interesting/surprising and by no means want to slow your work on the allocator, which will benefit us all! :-) Again, I ran these on an ARM chip (M3 Max), which has twice as many GPRs as x64.
regalloc3 all suite compilation only.txt
I'd suggest focusing on just spidermonkey, bz2, and pulldown-cmark. Maybe as a second tier: intgemm-simd, meshoptimizer, and libsodium. But all the shootout stuff is just tiny microbenchmarks that aren't really interesting unless you're focused on optimizing the specific thing they happen to measure.
In practice, the shootout benchmarks are going to be just noise unless you're trying to focus on their specific thing.
I guess blake3 is fairly interesting, though.
So 28-29, 17-20, and 25-27%, respectively.
2nd tier: 25-26%, 17-21%, and ≈15-20%.
Doesn't really change the numbers, but sure will save me time benchmarking! Thanks!
Right, yeah; I'm just trying to help make benchmarking easier and make the results easier to evaluate.
You can also benchmark particular phases. If you want to focus only on compile time, for example, you can do:
$ sightglass-cli benchmark --benchmark-phase compilation ...
I can also have it spit out JSON and save myself a lot of regexes, but I never remember to do that until the middle of a long run. ;-)
FYI, I run my benchmarks with the shuffling allocator disabled because it makes allocations extremely slow and distorts the runtime numbers. That could explain why you're seeing different results.
ah, right, I filed https://github.com/bytecodealliance/sightglass/issues/280 a while ago to switch the default but then ran out of spare energy to do it; @Erik Rose that could be an easy PR to make
It's a one-line change:
diff --git a/crates/bench-api/Cargo.toml b/crates/bench-api/Cargo.toml
index a171b0beed..de288922ba 100644
--- a/crates/bench-api/Cargo.toml
+++ b/crates/bench-api/Cargo.toml
@@ -35,5 +35,5 @@ clap = { workspace = true }
 wat = { workspace = true }
 
 [features]
-default = ["shuffling-allocator", "wasi-nn"]
+default = ["wasi-nn"]
 wasi-nn = ["wasmtime-wasi-nn"]
I must have been very low on spare energy then :-)
@Erik Rose Here are the results I am getting on my (admittedly quite perf-noisy) machine:
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 43971560.50 ± 34846300.90 (confidence = 99%)
3-spill.so is 1.08x to 1.66x faster than 2.so!
[129186085 162930575.50 216937350] 2.so
[85216530 118959015.00 141328075] 3-spill.so
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[95009950 116081567.00 158748170] 2.so
[80138100 111471111.50 141233575] 3-spill.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[1066633785 1108409337.00 1163300915] 2.so
[1039408055 1077584459.00 1112557740] 3-spill.so
This was using an older version of regalloc3 which didn't have live range splitting (the 50% cost I mentioned is only with live range splitting). This should be the same version you tested, based on your Cargo.lock.
However, you can see that ra3 is consistently faster than ra2 in terms of compilation speed.
how does runtime perf look in the non-spilling version? (I forget, sorry; it'd be helpful to have an up-to-date summary of all the numbers somewhere)
I'm currently in the middle of some perf optimizations, but you can find the older results here:
@Erik Rose I've confirmed that the 25% regression you observed is entirely due to the shuffling allocator. regalloc3 doesn't make use of smallvec and instead relies on the caller to reuse the register allocation context for multiple functions (preserving the Vec allocations inside it). Unfortunately Cranelift doesn't use regalloc2::run_with_ctx and therefore cannot take advantage of this.
However, the difference is only really noticeable with the shuffling allocator enabled, since it makes allocations so slow that compilation takes ~5x longer.
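(A minimal sketch of the context-reuse pattern described above, for illustration only; the exact run_with_ctx/Ctx signatures and Ctx being default-constructible are assumptions here, so check regalloc2's docs before relying on them.)
use regalloc2::{run_with_ctx, Ctx, Function, MachineEnv, RegAllocError, RegallocOptions};

fn allocate_all<F: Function>(
    funcs: &[F],
    env: &MachineEnv,
    options: &RegallocOptions,
) -> Result<(), RegAllocError> {
    // One context shared across every function: the Vecs inside it are
    // allocated once and reused instead of being reallocated per function.
    let mut ctx = Ctx::default(); // assumption: Ctx implements Default
    for func in funcs {
        // Assumed call shape: run_with_ctx(&func, &env, &options, &mut ctx).
        let _output = run_with_ctx(func, env, options, &mut ctx)?;
        // ... consume the output for this function here ...
    }
    Ok(())
}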
Fantastic. Thanks, Amanieu! I will update my checkouts, get off the shuffling allocator, and re-run my benchmarks.
I opened a PR to disable the shuffling allocator as well: https://github.com/bytecodealliance/wasmtime/pull/10300.
Note that if you update your checkouts now, the current main branch defaults to enabling live range splitting, which is what causes the 50% slowdown I previously mentioned.
You can override it by selecting SplitStrategy::Spill in the regalloc3 options. Or just modify your regalloc3 checkout to make that the default temporarily.
@Amanieu I'm considering swapping from ra2 to ra3. Do you have any sort of design doc available, similar to ra2's? Mostly I'm just a bit confused about the difference between a bank and a class. In what instance would I need multiple register banks to describe one class?
@Leaves Have you looked at the documentation for the reginfo module? Also there are examples available here.
Yep, that's exactly what I'm looking for, I had entirely missed that comment at the top!
Edit: Ok that comment explained everything I was curious about, thank you much.
@Amanieu Sorry to bug you again: is there a proper way to get the set of all registers used by the allocator, including registers allocated to instructions and those used by edits? I'm currently just iterating over all the instructions using output_insts(..) and recording it as I go, but a succinct method to get this would be nice (if it exists).
@Leaves No, there is no other way. Regalloc3 doesn't track this internally, especially for temporary registers that are only used as scratch registers for moves.
What do you need this information for? If it's for stack frame setup (callee-saved registers), you may want to specify those registers as defined on function entry and used by return instructions. That way they will automatically be spilled to a stack slot by the register allocator.
Yeah, for stack frame setup. I need to know which registers are saved to properly emit push/pop pairs. It's important that these registers are saved in an easy-to-define way for the various UnwindInfo formats; if RA3 is deciding where they are spilled/filled, I won't be able to locate them to emit proper unwind information. And I need to know how many pairs of push/pops there are, so I can know the entire stack frame size, prior to lowering an intrinsic like __AddressOfReturnAddress() which needs to read from RSP+FrameSize.
Where FrameSize = PushedRegisters * 8 + StackLayout.spillslot_area_size.
So, I need to iterate over all the instructions and edits first, find out which non-volatile registers are written in any way, and only then can I lower that intrinsic.
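(A rough sketch of that scan, for illustration only; output_insts(..) is from the messages above, but the register-unit type and the way registers are read off each output instruction are placeholders, not regalloc3's actual API.)
use std::collections::BTreeSet;

// Hypothetical register-unit index, standing in for the allocator's unit type.
type RegUnit = u32;

// Given every register unit written by any output instruction or edit
// (collected by walking output_insts(..)), keep only the non-volatile
// (callee-saved) ones; these are the registers that need PUSH/POP pairs
// in the prologue/epilogue.
fn callee_saved_pushes(
    written_units: impl IntoIterator<Item = RegUnit>,
    callee_saved: &BTreeSet<RegUnit>,
) -> Vec<RegUnit> {
    let saved: BTreeSet<RegUnit> = written_units
        .into_iter()
        .filter(|u| callee_saved.contains(u))
        .collect();
    saved.into_iter().collect()
}

// FrameSize = PushedRegisters * 8 + spillslot_area_size, as in the message,
// so an intrinsic like __AddressOfReturnAddress() can read from RSP+FrameSize.
fn frame_size(pushed_registers: usize, spillslot_area_size: usize) -> usize {
    pushed_registers * 8 + spillslot_area_size
}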
Alternatively, you can let the register allocator handle saving/restoring into regalloc-managed spillslots. Just specify the constraints that the return instruction takes the original values of the callee-saved registers as fixed-register inputs. Then just get the stack frame size from output.stack_layout().
I will not be able to describe these saves with unwind info, though, and for certain targets like Windows that require asynchronous exception support, I need to be very careful with this. Allowing RA3 to arbitrarily determine spill/fill locations for non-volatile registers will make that impossible.
For the unwind information, you can get the location of each value from output.value_locations(). You can use that to get the values of the callee-saved at every point in the function.
That is true; however, there is a limit to the number of locations that a non-volatile register can be saved to with the Windows UnwindInfo spec. They must be saved in the prolog, and the prolog cannot be larger than 256 bytes. So I would need to guarantee that regalloc saves them once there, and only restores from that location at function end.
Right, in that case you probably have to do this yourself. And there is no way other than scanning through the output instructions to find out which registers are used. It's not tracked internally in the allocator, so there isn't any faster way of doing this.
Additionally, PUSH/POP pairs at the beginning/end of the function are smaller in size, even compared to encoding a spill/fill with an 8-bit displacement (MOV [RSP+8BitDisp]).
Got it. I implemented that, and it appears to work well. RA2 required something similar anyway, but there were simpler allocs/edits slices to iterate over.
Hmm now that I think about it, it's actually possible to make this more efficient by tracking it internally. I'll add an API for it.
It would be something along the lines of fn reg_units_used(&self) -> &RegUnitSet.
What would be most helpful would be reg_units_read(&self), reg_units_written(&self). Read alone wouldn't be enough for this case.
Are you open to PRs btw?
That's not as simple as it seems: the initial values of registers are emitted by the entry instruction, so all registers that are used are technically written.
I'm open to PRs but I would prefer if it was discussed first to avoid unnecessary work.
That's right, actually; I do skip the defs created by my NativeRegistersIn instruction that is the first instruction in the function. So that would need to be considered as well :grimacing:
Tracking which registers are used can be done relatively efficiently, essentially meaning "does this register at any point hold a regalloc-managed value".
For anything more than that your best bet is still just scanning the output I think.
I think so now as well after considering the fact that I need to disregard Defs from certain instructions.
I suppose this is really getting away from cranelift/regalloc2 talk now, but if both validators are returning Ok for RA3 and I'm still getting an error about rematerialization (despite always returning None for rematerialization), what direction should I go in for debugging?
The error actually arising is "add_remat called with non-rematerializable value", and it's called on a block parameter (%17) for a small exit block.
block13(%17) freq(1):
; predecessors: block6, block7, block10, block11, block12
inst66: inst
inst67: inst Use(%2):r1 Use(%17):r0
inst68: ret
Can you dump the function and reginfo to text form so I can try it with regalloc3-tool?
Do you mean just converting to a GenericFunction then printing it?
https://pastebin.com/raw/vpSvqK8X
https://pastebin.com/raw/QzJx8aGE
42 messages were moved from this topic to #cranelift > regalloc3 debugging by Amanieu.
Latest benchmark results for regalloc3:
compilation
benchmarks/bz2/benchmark.wasm
cycles
[388409280 394336775.00 420104370] ./ra2.so
[324084635 327185663.00 345267545] ./ra3-spill.so
[361255790 364677054.00 383229420] ./ra3-split.so
benchmarks/pulldown-cmark/benchmark.wasm
cycles
[796059285 799510071.50 807007075] ./ra2.so
[716654015 721313358.50 729701735] ./ra3-spill.so
[782036010 786266075.00 789498185] ./ra3-split.so
benchmarks/spidermonkey/benchmark.wasm
cycles
[20091993285 20178233887.00 20278362790] ./ra2.so
[18244842070 18398756379.50 18586440250] ./ra3-spill.so
[19562045300 19644339190.00 19873169225] ./ra3-split.so
execution
benchmarks/bz2/benchmark.wasm
cycles
[118045165 118595158.50 121545340] ./ra2.so
[112530215 112865336.50 113412705] ./ra3-spill.so
[111467545 111745469.50 112304780] ./ra3-split.so
benchmarks/pulldown-cmark/benchmark.wasm
cycles
[9640925 9705132.50 9788520] ./ra2.so
[8748110 8830787.00 9037630] ./ra3-spill.so
[8934240 9045858.50 9167515] ./ra3-split.so
benchmarks/spidermonkey/benchmark.wasm
cycles
[1136702140 1140818234.50 1153406240] ./ra2.so
[1090508300 1100439718.00 1122041375] ./ra3-spill.so
[1113807730 1119928414.50 1128316140] ./ra3-split.so
At this point I consider regalloc3 basically done and ready for use. I'm currently writing up a design document for it, but other than that it's ready for review.
I second that. After porting my own compiler from RA2 to RA3, fuzzing extensively, and deploying it into production, I am VERY happy with the results. The only change I would make would be relaxing some of the requirements around BB params, but only if it doesn't hurt output code quality. Requiring additional optimization passes to prep for regalloc is somewhat tedious.
@Amanieu thanks for these latest results. As we discussed back in January, to get RA3 into Cranelift, we'll need someone to:
This is big enough that we'll probably want an RFC for the transition as well; and please feel free to schedule time on the Cranelift weekly meeting agenda.
@Chris Fallin Sure! I'm currently finishing up the design document. Once that is done, I will post a PR against an empty branch so that you can review the code. I've also submitted https://github.com/bytecodealliance/regalloc2/pull/230 which adds regalloc3 as a back-end for regalloc2.