regalloc3 is a project I've been working on for the past ~6 months: a new register allocator intended to succeed regalloc2. It is currently still a work in progress (notably, live range splitting is not implemented yet), but it is in a state where I would welcome reviews, comments and suggestions.
From the README, here's a list of API-level changes compared to regalloc2:
- Support for overlapping registers (e.g. S0 / S1 / D0 on AArch32).
- Support for register groups (e.g. CASP register pairs on AArch64, LD4 SIMD vector loads on AArch64).
- Register descriptions are provided through a RegInfo trait.
- Validation of Function and RegInfo implementations.
It also has several algorithm-level improvements, such as better handling of fixed register constraints and reused registers, a new, faster algorithm for computing live ranges, and a faster linear scan allocator for spill slot allocation.
I also have a branch of regalloc2 which adapts the input function and then passes it on to regalloc3.
I've used this to benchmark regalloc2 and regalloc3 using wasmtime. Some benchmark results when compiling pulldown-cmark.wasm from sightglass:
congrats on releasing! looks very cool!
do you have the sightglass results as well? the screenshots of the profiler don't tell us the actual speed up/slow down observed for compilation time and execution time vs regalloc2
The total sample count gives you an idea of how much time is spent in register allocation. The numbers are relatively stable across runs.
Hyperfine results:
Summary
./ra3 compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
1.03 ± 0.03 times faster than ./base compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
3% is nothing to sneeze at, nice!
do you have measurements of runtime (quality of generated code) by any chance?
The increase isn't much because regalloc is only 30% of the total runtime.
The detailed profiles give a better idea of the relative performance of just the allocators.
Sightglass is giving me strange results though, claiming that regalloc2 is ~20% faster.
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 102683574.00 ± 19686025.05 (confidence = 99%)
baseline.so is 1.16x to 1.23x faster than ra3.so!
[506461200 524084844.50 556646895] baseline.so
[609491400 626768418.50 666477735] ra3.so
Execution time isn't great now, but I think that's mostly because I haven't implemented splitting yet:
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 1957518.50 ± 254885.40 (confidence = 99%)
baseline.so is 1.21x to 1.28x faster than ra3.so!
[7821135 7976304.00 8446655] baseline.so
[9711310 9933822.50 10376345] ra3.so
can it handle stuff like where there's 128 integer registers and it needs to allocate register ranges starting at any register of size 12, 64, 37, and 3? or does it still rely on O(n^2) listing all possible allocations ahead of time (huge for arbitrary-sized ranges allocated out of 128 registers, 8128 possible afaict)?
Your second-to-last point is interesting, as that is something I just implemented for regalloc, and it saved quite a bit of time (5%) for larger binaries by minimizing heap allocations.
@Jacob Lifshay Yes! Well, almost. At the moment I have a limit of 8 registers per reg group, but that can easily just be bumped up. You can see an example for RISC-V vectors here.
You do need to list out all the allocatable combinations, but only once in the register definition which never changes.
The way register groups work is approximately based on https://llvm.org/devmtg/2016-11/Slides/Braun-DealingWithRegisterHierarchies.pdf
Wagwan, looks very neat. Looking forward to using this when it is ready to fully replace regalloc2.
Does Cranelift produce pruned SSA? I don't think so…
So it isn’t trying to remove all truly useless phis
Sorry wrong channel
If you are quick enough you can move messages between topics inside a channel using the move messages option.
This is very exciting. Once the code base has stabilized I would love to see a design overview like the one for regalloc2.
One thing I am especially interested in is the handling of spills for Arm (Aarch32/64). Spilling requires additional registers to load the spilled value into and materialize the stack offset. How does regalloc3 deal with this? Are these registers modelled by the allocator or are certain registers set aside for this purpose?
Not currently, but I have some design notes on how to handle that. It does handle the scratch register for stack-to-stack moves though (like regalloc2 does)
It's a bit tricky since, if you need to spill to free up a register, you need to ensure that the slot you spill to doesn't need a scratch register to calculate the spillslot address.
New benchmarks, including a move optimizer post-processing pass that I haven't pushed yet:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 9800371.00 ± 3117907.67 (confidence = 99%)
ra3.so is 1.06x to 1.12x faster than baseline.so!
[103712420 116994400.95 143270540] baseline.so
[93272620 107194029.95 133119665] ra3.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 400260.70 ± 121688.71 (confidence = 99%)
ra3.so is 1.03x to 1.06x faster than baseline.so!
[9414475 9589995.80 10502695] baseline.so
[8977885 9189735.10 13234935] ra3.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 21871418.80 ± 3220053.20 (confidence = 99%)
ra3.so is 1.22x to 1.29x faster than baseline.so!
[97671420 108323316.50 143930605] baseline.so
[75466125 86451897.70 113070615] ra3.so
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 2363113.90 ± 590882.98 (confidence = 99%)
ra3.so is 1.02x to 1.03x faster than baseline.so!
[117038180 118659545.20 133033110] baseline.so
[115063375 116296431.30 122370570] ra3.so
I disabled the shuffling allocator in wasmtime-bench-api (for both ra3.so and baseline.so) because it was distorting the results: the shuffling allocator is extremely slow.
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 104370202.30 ± 27034267.63 (confidence = 99%)
ra3.so is 1.06x to 1.11x faster than baseline.so!
[1248169230 1331614134.20 1492533140] baseline.so
[1172304280 1227243931.90 1351484785] ra3.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 27698273.40 ± 5688050.16 (confidence = 99%)
baseline.so is 1.02x to 1.03x faster than ra3.so!
[1152335590 1165996450.50 1191607235] baseline.so
[1181614875 1193694723.90 1232190190] ra3.so
@Amanieu Have you implemented live range splitting?
Additionally, those benchmarks, are they of the allocator itself running, or the emitted code?
Live range splitting is not implemented yet, I'm currently studying the existing heuristics used in different allocators.
The benchmarks show both execution and compilation.
Oh wow, nice! Seems like there's still some performance to squeeze out AND you're already meeting the baseline!
Also ~15% of the allocator time is spent converting from the regalloc2 format to the regalloc3 format.
Hmmm even more enticing. I will watch your progress with great interest.
Any idea why I might be failing the .expect("missing allocation"); in move_resolver.rs:760 when using the regalloc2 translation layer?
Is this with cranelift?
It is not.
Hmm possibly you are passing invalid input to the register allocator.
It does complete fine with normal regalloc2.
Is this with a debug build? There is a cfg!(debug_assertions) check that validates the input.
It is release
I'll try debug.
Interesting, it is claiming there are predecessors to the entry block, however I also have an assertion of my own prior to register allocation that makes sure there are not. I will investigate further and let you know what I find.
Ah I see...
It seems this may be incorrect here? I'll look more.
My cfg is NOT in a RPO so the entry block is NOT block zero.
Ah that might be it. regalloc3 assumes the entry block is block 0.
Does the translation layer currently do anything to rewrite this?
No, it doesn't change the block ordering.
I suppose it could...
For now, I'll compute an RPO and feed that in, been meaning to do this for a while anyway.
regalloc3 requires that if block A dominates block B, A must come before B in the block order. Since the entry block dominates everything, it must be the first block.
:thumbs_up:
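For reference, a minimal sketch of computing a reverse post-order over a CFG, which satisfies the ordering requirement above. This uses plain indices and successor lists rather than regalloc3's actual Block type, so it is illustrative only:
fn reverse_post_order(entry: usize, succs: &[Vec<usize>]) -> Vec<usize> {
    let mut visited = vec![false; succs.len()];
    let mut post_order = Vec::with_capacity(succs.len());
    // Iterative DFS: each stack entry is (block, index of the next successor to visit).
    let mut stack = vec![(entry, 0)];
    visited[entry] = true;
    while let Some((block, next)) = stack.last().copied() {
        if let Some(&succ) = succs[block].get(next) {
            stack.last_mut().unwrap().1 += 1;
            if !visited[succ] {
                visited[succ] = true;
                stack.push((succ, 0));
            }
        } else {
            stack.pop();
            post_order.push(block);
        }
    }
    // Reversing the post-order guarantees that every reachable block comes
    // after all of its dominators, so the entry block ends up first.
    post_order.reverse();
    post_order
}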
Additionally, I'm a bit curious about the omission of branch_blockparams in the Function trait.
I see the jump params method, but why is there none for conditional branches? Are separate destinations not allowed to have separate parameter lists?
Block parameters are only allowed on blocks with multiple predecessors.
This makes sense if you think about it: if there is only a single predecessor then you can just use the value from the predecessor directly instead of creating a separate value that is just a copy.
I agree with that, having trouble wrapping my mind around the implementation.
If I originally had this:
fn branch_blockparams(&self, block: Block, _insn: Inst, succ_idx: usize) -> &[VReg] {
&self.blocks[block.index()].successor_params[succ_idx]
}
implementation for regalloc2, where is the equivalent or similar requirement in regalloc3?
It's called jump_blockparams.
I make a distinction between jump terminators (which can have blockparams but no operands) and branch terminators (which can have operands but no blockparams).
I see that, but there it is not specified which successor is the target, as is done above with succ_idx.
Because there can only be 1 successor if the successor has multiple predecessors.
Due to the critical edge rule.
Ah... yes that is true!
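For what it's worth, here is a hypothetical sketch of what the corresponding regalloc3-side implementation could look like, mirroring the regalloc2 snippet above. The jump_params field and the exact signature (including the Value type) are assumptions here, not the actual trait definition:
fn jump_blockparams(&self, block: Block) -> &[Value] {
    // A jump terminator has exactly one successor (by the critical edge rule),
    // so unlike branch_blockparams in regalloc2 no succ_idx parameter is needed.
    &self.blocks[block.index()].jump_params
}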
Is there a reason terminating instructions cannot have clobbers? I have a call to a non-returning function (fastfails on Windows) which is the "terminating instruction", and ra3 isn't happy about that. I can just put a UD2 or INT3 after it to quiet it down, but still, what is the rationale behind that requirement?
Specifically this: Terminator with no successors cannot have clobbers
Also I love your error printing, very nice!
Because clobbers have no effect in that context: there's nothing after the instruction.
But it doesn't strictly cause a problem does it?
Well I did want to take advantage of this for some optimizations, but never got around to it.
It's still on my TODO list.
Putting a UD2/nop or something after it will solve it for me, so it's not really a big problem to impose that requirement anyway.
No, the solution is to remove the clobbers since they have no effect.
But that would require making the call aware that it is "special" and doesn't return.
You could always add a check in inst_clobbers to see if this is the last instruction in the block or something.
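Something along these lines, as a rough sketch only; inst_block and block_insts are hypothetical helpers on the embedder's side, and PhysReg stands in for whatever clobber type the Function implementation returns:
fn inst_clobbers(&self, inst: Inst) -> &[PhysReg] {
    // If the non-returning call is the last instruction of its block, report
    // no clobbers: nothing after it can observe them anyway.
    let block = self.inst_block(inst);
    if Some(inst) == self.block_insts(block).last() {
        &[]
    } else {
        &self.insts[inst.index()].clobbers
    }
}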
I suppose I could also relax the check, but I'm not super comfortable about it.
Somewhat interesting, there seem to be some unnecessary moves taking place back into spill slots and registers even though they are not used after the move. Here is an example:
https://i.imgur.com/ecAa5Cp.png
The superfluous move of the value in RAX into a stack slot (at 2C0h), which is never touched again (this block is the last in the function), is seen above^
That's from the lack of splitting. We are immediately spilling right after the definition of a value.
However move optimization is able to optimize the reload away but not the spill.
I should also point out that the output there is not semantically equivalent to the regalloc2 output (I can't figure out why exactly, because that single basic block is quite large), and it causes the binary to exit because a conditional branch is taken in a later function that shouldn't be.
Again though, this is an extreme example: I was compiling chrome.dll on Windows, and this is a very, very large function. I'm not going to do much more with it now as I have other things that need tending to, but I can send you the regalloc2 and regalloc3 output in DMs if you wish to see it yourself, as well as the machine instruction representation that is fed into regalloc.
With debug_assertions enabled it also validates that the result of register allocation is correct.
You can try a release build with debug assertions enabled if debug builds are too slow.
Alternatively if you've already narrowed it down to a single function, you can edit the regalloc2 adapter to dump the text representation of the input function in a way that can be later used with regalloc3-tool:
log::info!(
"Reginfo:\n{}",
regalloc3::debug_utils::DisplayRegInfo(&self.reginfo)
);
log::info!(
"Input function:\n{}",
regalloc3::debug_utils::DisplayFunction(&self.adapter)
);
This is one cool project! Is there a todo.txt anywhere? Would love to help.
@ghostway I do have an internal todo.txt locally, but it's not really suitable for public consumption. I'm taking a break at the moment to work on other things, but I will get back to regalloc3 soon.
Hello! While we are not working on regalloc, @Dimitris Aspetakis and I are working on instruction scheduling inside Cranelift.
We have implemented some heuristics without much benefit, except on some benchmarks from XNNPACK. Currently we are implementing another heuristic, but we get different results on different machines.
We think this is probably caused by register allocation.
Can we somehow count the number of spills?
@Panagiotis Karouzakis a very simple (slightly hacky) way to do this is to do a wasmtime compile, then objdump the resulting .cwasm file, and grep with a regular expression that matches loads/stores relative to the stack pointer
something like mov.*(%rsp).* on x86-64, or ldr.*sp.* + str.*sp.* on aarch64
then count lines in the disassembly that match