regalloc3 is a project I've been working on for the past ~6 months: a new register allocator intended to succeed regalloc2. It is currently still a work in progress (notably, live range splitting is not implemented yet), but it is in a state where I would welcome reviews, comments and suggestions.
From the README, here's a list of API-level changes compared to regalloc2:
- Support for overlapping registers (e.g. S0 / S1 / D0 on AArch32).
- Support for register groups (e.g. CASP register pairs on AArch64, LD4 SIMD vector loads on AArch64).
- Register descriptions are provided through a RegInfo trait.
- Validation of Function and RegInfo implementations.
It also has several algorithm-level improvements, such as better handling of fixed register constraints and reused registers, a new, faster algorithm for computing live ranges, and a faster linear scan allocator for spill slot allocation.
I also have a branch of regalloc2 which adapts the input function and then passes it on to regalloc3.
I've used this to benchmark regalloc2 and regalloc3 using wasmtime. Some benchmark results when compiling pulldown-cmark.wasm from sightglass:
congrats on releasing! looks very cool!
do you have the sightglass results as well? the screenshots of the profiler don't tell us the actual speed up/slow down observed for compilation time and execution time vs regalloc2
The total sample count gives you an idea of how much time is spent in register allocation. The numbers are relatively stable across runs.
Hyperfine results:
Summary
./ra3 compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
1.03 ± 0.03 times faster than ./base compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
3% is nothing to sneeze at, nice!
do you have measurements of runtime (quality of generated code) by any chance?
The increase isn't much because regalloc is only 30% of the total runtime.
The detailed profiles give a better idea of the relative performance of just the allocators.
Sightglass is giving me strange results though, claiming that regalloc2 is ~20% faster.
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 102683574.00 ± 19686025.05 (confidence = 99%)
baseline.so is 1.16x to 1.23x faster than ra3.so!
[506461200 524084844.50 556646895] baseline.so
[609491400 626768418.50 666477735] ra3.so
Execution time isn't great now, but I think that's mostly because I haven't implemented splitting yet:
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 1957518.50 ± 254885.40 (confidence = 99%)
baseline.so is 1.21x to 1.28x faster than ra3.so!
[7821135 7976304.00 8446655] baseline.so
[9711310 9933822.50 10376345] ra3.so
can it handle stuff like where there's 128 integer registers and it needs to allocate register ranges starting at any register of size 12, 64, 37, and 3? or does it still rely on O(n^2) listing all possible allocations ahead of time (huge for arbitrary-sized ranges allocated out of 128 registers, 8128 possible afaict)?
Your second-to-last point is interesting, as that is something I just implemented for regalloc, and it saved quite a bit of time (5%) for larger binaries by minimizing heap allocations.
@Jacob Lifshay Yes! Well, almost. At the moment I have a limit of 8 registers per reg group, but that can easily just be bumped up. You can see an example for RISC-V vectors here.
You do need to list out all the allocatable combinations, but only once in the register definition which never changes.
The way register groups work is approximately based on https://llvm.org/devmtg/2016-11/Slides/Braun-DealingWithRegisterHierarchies.pdf
Wagwan, looks very neat. Looking forward to using this when it is ready to fully replace regalloc2.
Does Cranelift produce pruned SSA? I don't think so…
So it isn’t trying to remove all truly useless phis
Sorry wrong channel
If you are quick enough you can move messages between topics inside a channel using the move messages option.
This is very exciting. Once the code base has stabilized I would love to see a design overview like the one for regalloc2.
One thing I am especially interested in is the handling of spills for Arm (Aarch32/64). Spilling requires additional registers to load the spilled value into and materialize the stack offset. How does regalloc3 deal with this? Are these registers modelled by the allocator or are certain registers set aside for this purpose?
Not currently, but I have some design notes on how to handle that. It does handle the scratch register for stack-to-stack moves though (like regalloc2 does)
It's a bit tricky since, if you need to spill to free up a register, you need to ensure that the slot you spill to doesn't need a scratch register to calculate the spillslot address.
New benchmarks, including a move optimizer post-processing pass that I haven't pushed yet:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 9800371.00 ± 3117907.67 (confidence = 99%)
ra3.so is 1.06x to 1.12x faster than baseline.so!
[103712420 116994400.95 143270540] baseline.so
[93272620 107194029.95 133119665] ra3.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 400260.70 ± 121688.71 (confidence = 99%)
ra3.so is 1.03x to 1.06x faster than baseline.so!
[9414475 9589995.80 10502695] baseline.so
[8977885 9189735.10 13234935] ra3.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 21871418.80 ± 3220053.20 (confidence = 99%)
ra3.so is 1.22x to 1.29x faster than baseline.so!
[97671420 108323316.50 143930605] baseline.so
[75466125 86451897.70 113070615] ra3.so
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 2363113.90 ± 590882.98 (confidence = 99%)
ra3.so is 1.02x to 1.03x faster than baseline.so!
[117038180 118659545.20 133033110] baseline.so
[115063375 116296431.30 122370570] ra3.so
I disabled the shuffling allocator in wasmtime-bench-api (for both ra3.so and baseline.so) because it was distorting the results: the shuffling allocator is extremely slow.
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 104370202.30 ± 27034267.63 (confidence = 99%)
ra3.so is 1.06x to 1.11x faster than baseline.so!
[1248169230 1331614134.20 1492533140] baseline.so
[1172304280 1227243931.90 1351484785] ra3.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 27698273.40 ± 5688050.16 (confidence = 99%)
baseline.so is 1.02x to 1.03x faster than ra3.so!
[1152335590 1165996450.50 1191607235] baseline.so
[1181614875 1193694723.90 1232190190] ra3.so
@Amanieu Have you implemented live range splitting?
Additionally, those benchmarks, are they of the allocator itself running, or the emitted code?
Live range splitting is not implemented yet, I'm currently studying the existing heuristics used in different allocators.
The benchmarks show both execution and compilation.
Oh wow, nice! Seems like there's still some performance to squeeze out AND you're already meeting the baseline!
Also ~15% of the allocator time is spent converting from the regalloc2 format to the regalloc3 format.
Hmmm even more enticing. I will watch your progress with great interest.
Any idea why I might be failing the .expect("missing allocation"); in move_resolver.rs:760 when using the regalloc2 translation layer?
Is this with cranelift?
It is not.
Hmm possibly you are passing invalid input to the register allocator.
It does complete fine with normal regalloc2.
Is this with a debug build? There is a cfg!(debug_assertions) check that validates the input.
It is release
I'll try debug.
Interesting, it is claiming there are predecessors to the entry block, however I also have an assertion of my own prior to register allocation that makes sure there are not. I will investigate further and let you know what I find.
Ah I see...
It seems this may be incorrect here? I'll look more.
My cfg is NOT in a RPO so the entry block is NOT block zero.
Ah that might be it. regalloc3 assumes the entry block is block 0.
Does the translation layer currently do anything to rewrite this?
No, it doesn't change the block ordering.
I suppose it could...
For now, I'll compute an RPO and feed that in, been meaning to do this for a while anyway.
regalloc3 requires that if block A dominates block B, A must come before B in the block order. Since the entry block dominates everything, it must be the first block.
:thumbs_up:
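For reference, a minimal sketch of computing a reverse post-order over a CFG, which satisfies the ordering requirement above. This uses plain indices and successor lists rather than regalloc3's actual Block type, so it is illustrative only:
fn reverse_post_order(entry: usize, succs: &[Vec<usize>]) -> Vec<usize> {
    let mut visited = vec![false; succs.len()];
    let mut post_order = Vec::with_capacity(succs.len());
    // Iterative DFS: each stack entry is (block, index of the next successor to visit).
    let mut stack = vec![(entry, 0)];
    visited[entry] = true;
    while let Some((block, next)) = stack.last().copied() {
        if let Some(&succ) = succs[block].get(next) {
            stack.last_mut().unwrap().1 += 1;
            if !visited[succ] {
                visited[succ] = true;
                stack.push((succ, 0));
            }
        } else {
            stack.pop();
            post_order.push(block);
        }
    }
    // Reversing the post-order guarantees that every reachable block comes
    // after all of its dominators, so the entry block ends up first.
    post_order.reverse();
    post_order
}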
Additionally, I'm a bit curious about the omission of branch_blockparams in the Function trait.
I see the jump params method, but why is there none for conditional branches? Are separate destinations not allowed to have separate parameter lists?
Block parameters are only allowed on blocks with multiple predecessors.
This makes sense if you think about it: if there is only a single predecessor then you can just use the value from the predecessor directly instead of creating a separate value that is just a copy.
I agree with that, having trouble wrapping my mind around the implementation.
If I originally had this:
fn branch_blockparams(&self, block: Block, _insn: Inst, succ_idx: usize) -> &[VReg] {
&self.blocks[block.index()].successor_params[succ_idx]
}
implementation for regalloc2, where is the equivalent or similar requirement in regalloc3?
It's called jump_blockparams.
I make a distinction between jump terminators (which can have blockparams but no operands) and branch terminators (which can have operands but no blockparams).
I see that, but there it is not specified which successor is the target, as is done above with succ_idx.
Because there can only be 1 successor if the successor has multiple predecessors.
Due to the critical edge rule.
Ah... yes that is true!
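For what it's worth, here is a hypothetical sketch of what the corresponding regalloc3-side implementation could look like, mirroring the regalloc2 snippet above. The jump_params field and the exact signature (including the Value type) are assumptions here, not the actual trait definition:
fn jump_blockparams(&self, block: Block) -> &[Value] {
    // A jump terminator has exactly one successor (by the critical edge rule),
    // so unlike branch_blockparams in regalloc2 no succ_idx parameter is needed.
    &self.blocks[block.index()].jump_params
}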
Is there a reason terminating instructions cannot have clobbers? I have a call to a non-returning function (fastfails on Windows) which is the "terminating instruction", and ra3 isn't happy about that. I can just put a UD2 or INT3 after it to quiet it down, but still, what is the rationale behind that requirement?
Specifically this: Terminator with no successors cannot have clobbers
Also I love your error printing, very nice!
Because clobbers have no effect in that context: there's nothing after the instruction.
But it doesn't strictly cause a problem does it?
Well I did want to take advantage of this for some optimizations, but never got around to it.
It's still on my TODO list.
Putting a UD2/nop or something after it will solve it for me, so it's not really a big problem to impose that requirement anyway.
No, the solution is to remove the clobbers since they have no effect.
But that would require making the call aware that it is "special" and doesn't return.
You could always add a check in inst_clobbers to see if this is the last instruction in the block or something.
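Something along these lines, as a rough sketch only; inst_block and block_insts are hypothetical helpers on the embedder's side, and PhysReg stands in for whatever clobber type the Function implementation returns:
fn inst_clobbers(&self, inst: Inst) -> &[PhysReg] {
    // If the non-returning call is the last instruction of its block, report
    // no clobbers: nothing after it can observe them anyway.
    let block = self.inst_block(inst);
    if Some(inst) == self.block_insts(block).last() {
        &[]
    } else {
        &self.insts[inst.index()].clobbers
    }
}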
I suppose I could also relax the check, but I'm not super comfortable about it.
Somewhat interesting, there seem to be some unnecessary moves taking place back into spill slots and registers even though they are not used after the move. Here is an example:
https://i.imgur.com/ecAa5Cp.png
The superfluous move of the value in RAX into a stack slot (at 2C0h), which is never touched again (this block is the last in the function), is seen above^
That's from the lack of splitting. We are immediately spilling right after the definition of a value.
However move optimization is able to optimize the reload away but not the spill.
I should also point out that the output there is not semantically equivalent to the regalloc2 output (I can't figure out why exactly, because that single basic block is quite large), and it causes the binary to exit because a conditional branch is taken in a later function that shouldn't be.
Again though, this is an extreme example: I was compiling chrome.dll on Windows, and this is a very, very large function. I'm not going to do much more with it now as I have other things that need tending to, but I can send you the regalloc2 and regalloc3 output in DMs if you wish to see it yourself, as well as the machine instruction representation that is fed into regalloc.
With debug_assertions enabled it also validates that the result of register allocation is correct.
You can try a release build with debug assertions enabled if debug builds are too slow.
Alternatively if you've already narrowed it down to a single function, you can edit the regalloc2 adapter to dump the text representation of the input function in a way that can be later used with regalloc3-tool:
log::info!(
"Reginfo:\n{}",
regalloc3::debug_utils::DisplayRegInfo(&self.reginfo)
);
log::info!(
"Input function:\n{}",
regalloc3::debug_utils::DisplayFunction(&self.adapter)
);
This is one cool project! Is there a todo.txt anywhere? Would love to help.
@ghostway I do have an internal todo.txt locally, but it's not really suitable for public consumption. I'm taking a break at the moment to work on other things, but I will get back to regalloc3 soon.
Hello! While we are not working on regalloc, @Dimitris Aspetakis and I are working on instruction scheduling inside Cranelift.
We have implemented some heuristics without much benefit, except on some benchmarks from XNNPACK. Currently we are implementing another heuristic, but we get different results on different machines.
We think this is probably caused by register allocation.
Can we somehow count the number of spills?
@Panagiotis Karouzakis a very simple (slightly hacky) way to do this is to do a wasmtime compile, then objdump the resulting .cwasm file, and grep with a regular expression that matches loads/stores relative to the stack pointer
something like mov.*(%rsp).* on x86-64, or ldr.*sp.* + str.*sp.* on aarch64
then count lines in the disassembly that match