Stream: cranelift

Topic: Public release of regalloc3


view this post on Zulip Amanieu (Jul 29 2024 at 19:32):

regalloc3 is a project that I've been working on for the past ~6 months which aims to write a new register allocator to succeed regalloc2. It is currently still a work in progress (notably, live range splitting is not implemented yet) but it is in a state where I would welcome reviews, comments and suggestions.

From the README, here's a list of API-level changes compared to regalloc2:

It also has several algorithmic level improvements, such as better handling of fixed register constraints and reused registers, a new faster algorithm for computing live ranges and a faster linear scan allocator for spill slot allocation.

I also have a branch of regalloc2 which adapts the input function and then passes it on to regalloc3.

I've used this to benchmark regalloc2 and regalloc3 using wasmtime. Some benchmark results when compiling pulldown-cmark.wasm from sightglass:

ra3.png
ra2.png

New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3
A new register allocator. Contribute to Amanieu/regalloc2 development by creating an account on GitHub.

view this post on Zulip fitzgen (he/him) (Jul 29 2024 at 20:48):

congrats on releasing! looks very cool!

I've used this to benchmark regalloc2 and regalloc3 using wasmtime. Some benchmark results when compiling pulldown-cmark.wasm from sightglass:

do you have the sightglass results as well? the screenshots of the profiler don't tell us the actual speed up/slow down observed for compilation time and execution time vs regalloc2

view this post on Zulip Amanieu (Jul 29 2024 at 20:54):

The total samples gives you an idea of how much time is spent in register allocation. The numbers are relatively stable across runs.

view this post on Zulip Amanieu (Jul 29 2024 at 20:55):

Hyperfine results:

Summary
  ./ra3 compile -C parallel-compilation=n  ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
    1.03 ± 0.03 times faster than ./base compile -C parallel-compilation=n  ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm

view this post on Zulip Chris Fallin (Jul 29 2024 at 20:56):

3% is nothing to sneeze at, nice!

view this post on Zulip Chris Fallin (Jul 29 2024 at 20:57):

do you have measurements of runtime (quality of generated code) by any chance?

view this post on Zulip Amanieu (Jul 29 2024 at 20:58):

The increase isn't much because regalloc is only 30% of the total runtime.

view this post on Zulip Amanieu (Jul 29 2024 at 20:59):

The detailed profiles give a better idea of the relative performance of just the allocators.

view this post on Zulip Amanieu (Jul 29 2024 at 21:00):

Sightglass is giving me strange results though, claiming that regalloc2 is ~20% faster.

view this post on Zulip Amanieu (Jul 29 2024 at 21:00):

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 102683574.00 ± 19686025.05 (confidence = 99%)

  baseline.so is 1.16x to 1.23x faster than ra3.so!

  [506461200 524084844.50 556646895] baseline.so
  [609491400 626768418.50 666477735] ra3.so

view this post on Zulip Amanieu (Jul 29 2024 at 21:02):

Execution time isn't great now, but I think that's mostly because I haven't implemented splitting yet:

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1957518.50 ± 254885.40 (confidence = 99%)

  baseline.so is 1.21x to 1.28x faster than ra3.so!

  [7821135 7976304.00 8446655] baseline.so
  [9711310 9933822.50 10376345] ra3.so

view this post on Zulip Jacob Lifshay (Jul 30 2024 at 01:54):

can it handle stuff like where there's 128 integer registers and it needs to allocate register ranges starting at any register of size 12, 64, 37, and 3? or does it still rely on O(n^2) listing all possible allocations ahead of time (huge for arbitrary-sized ranges allocated out of 128 registers, 8128 possible afaict)?

view this post on Zulip Leaves (Jul 30 2024 at 01:54):

Your second to last point is interesting as that is something I just implemented for regalloc and saved quite a bit of time(5%) for larger binaries by minimizing heap allocations.

view this post on Zulip Amanieu (Jul 30 2024 at 06:55):

@Jacob Lifshay Yes! Well, almost. At the moment I have a limit of 8 registers per reg group, but that can easily just be bumped up. You can see an example for RISC-V vectors here.

New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3

view this post on Zulip Amanieu (Jul 30 2024 at 07:00):

You do need to list out all the allocatable combinations, but only once in the register definition which never changes.

view this post on Zulip Amanieu (Jul 30 2024 at 07:24):

The way register groups work is approximately based on https://llvm.org/devmtg/2016-11/Slides/Braun-DealingWithRegisterHierarchies.pdf

view this post on Zulip African Hungarian (Jul 30 2024 at 17:00):

Wagwan, looks very neat looking forward to use this when it is ready to replace regalloc2 fully

view this post on Zulip Leaves (Jul 31 2024 at 06:25):

Does crane lift produce a pruned ssa? I don’t think so…

view this post on Zulip Leaves (Jul 31 2024 at 06:25):

So it isn’t trying to remove all truly useless phis

view this post on Zulip Leaves (Jul 31 2024 at 06:31):

(deleted)

view this post on Zulip Leaves (Jul 31 2024 at 06:31):

(deleted)

view this post on Zulip Leaves (Jul 31 2024 at 06:39):

Sorry wrong channel

view this post on Zulip bjorn3 (Jul 31 2024 at 11:49):

If you are quick enough you can move messages between topics inside a channel using the move messages option.

view this post on Zulip rm (Jul 31 2024 at 15:34):

This is very exciting. Once the code base has stabilized I would love to see a design overview like the one for regalloc2.

One thing I am especially interested in is the handling of spills for Arm (Aarch32/64). Spilling requires additional registers to load the spilled value into and materialize the stack offset. How does regalloc3 deal with this? Are these registers modelled by the allocator or are certain registers set aside for this purpose?

view this post on Zulip Amanieu (Jul 31 2024 at 15:39):

Not currently, but I have some design notes on how to handle that. It does handle the scratch register for stack-to-stack moves though (like regalloc2 does)

view this post on Zulip Amanieu (Jul 31 2024 at 15:51):

It's a bit tricky since, if you need to spill to free up a register, you need to ensure that the slot you spill to doesn't need a scratch register to calculate the spillslot address.

view this post on Zulip Amanieu (Aug 11 2024 at 16:30):

New benchmarks, including a move optimizer post-processing pass that I haven't pushed yet:

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 9800371.00 ± 3117907.67 (confidence = 99%)

  ra3.so is 1.06x to 1.12x faster than baseline.so!

  [103712420 116994400.95 143270540] baseline.so
  [93272620 107194029.95 133119665] ra3.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 400260.70 ± 121688.71 (confidence = 99%)

  ra3.so is 1.03x to 1.06x faster than baseline.so!

  [9414475 9589995.80 10502695] baseline.so
  [8977885 9189735.10 13234935] ra3.so

view this post on Zulip Amanieu (Aug 11 2024 at 16:31):

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 21871418.80 ± 3220053.20 (confidence = 99%)

  ra3.so is 1.22x to 1.29x faster than baseline.so!

  [97671420 108323316.50 143930605] baseline.so
  [75466125 86451897.70 113070615] ra3.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 2363113.90 ± 590882.98 (confidence = 99%)

  ra3.so is 1.02x to 1.03x faster than baseline.so!

  [117038180 118659545.20 133033110] baseline.so
  [115063375 116296431.30 122370570] ra3.so

view this post on Zulip Amanieu (Aug 11 2024 at 16:32):

I disabled the shuffling allocator in wasmtime-bench-api (for both ra3.so and baseline.so) because it was distorting the results: the shuffling allocator is extremely slow.

view this post on Zulip Amanieu (Aug 11 2024 at 16:34):

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 104370202.30 ± 27034267.63 (confidence = 99%)

  ra3.so is 1.06x to 1.11x faster than baseline.so!

  [1248169230 1331614134.20 1492533140] baseline.so
  [1172304280 1227243931.90 1351484785] ra3.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 27698273.40 ± 5688050.16 (confidence = 99%)

  baseline.so is 1.02x to 1.03x faster than ra3.so!

  [1152335590 1165996450.50 1191607235] baseline.so
  [1181614875 1193694723.90 1232190190] ra3.so

view this post on Zulip Leaves (Aug 13 2024 at 23:14):

@Amanieu Have you implemented live range splitting?

view this post on Zulip Leaves (Aug 13 2024 at 23:15):

Additionally, those benchmarks, are they of the allocator itself running, or the emitted code?

view this post on Zulip Amanieu (Aug 13 2024 at 23:18):

Live range splitting is not implemented yet, I'm currently studying the existing heuristics used in different allocators.

view this post on Zulip Amanieu (Aug 13 2024 at 23:18):

The benchmarks show both execution and compilation.

view this post on Zulip Leaves (Aug 13 2024 at 23:32):

Oh wow, nice! Seems like there's still some performance to squeeze out AND you're already meeting the baseline!

view this post on Zulip Amanieu (Aug 13 2024 at 23:57):

Also ~15% of the allocator time is spent converting from the regalloc2 format to the regalloc3 format.

view this post on Zulip Leaves (Aug 14 2024 at 01:52):

Hmmm even more enticing. I will watch your progress with great interest.

view this post on Zulip Leaves (Aug 14 2024 at 21:16):

Any idea why I might be failing the .expect("missing allocation"); in move_resolver.rs:760 when using the regalloc2 translation layer?

view this post on Zulip Amanieu (Aug 14 2024 at 21:18):

Is this with cranelift?

view this post on Zulip Leaves (Aug 14 2024 at 21:19):

It is not.

view this post on Zulip Amanieu (Aug 14 2024 at 21:20):

Hmm possibly you are passing invalid input to the register allocator.

view this post on Zulip Leaves (Aug 14 2024 at 21:21):

It does complete fine with normal regalloc2.

view this post on Zulip Amanieu (Aug 14 2024 at 21:21):

Is this with a debug build? There is a cfg!(debug_assertions) that validates the input.

view this post on Zulip Leaves (Aug 14 2024 at 21:22):

It is release

view this post on Zulip Leaves (Aug 14 2024 at 21:22):

I'll try debug.

view this post on Zulip Leaves (Aug 14 2024 at 21:27):

Interesting, it is claiming there are predecessors to the entry block, however I also have an assertion of my own prior to register allocation that makes sure there are not. I will investigate further and let you know what I find.

view this post on Zulip Leaves (Aug 14 2024 at 21:28):

Ah I see...

view this post on Zulip Leaves (Aug 14 2024 at 21:28):

view this post on Zulip Leaves (Aug 14 2024 at 21:28):

It seems this may be incorrect here? I'll look more.

view this post on Zulip Leaves (Aug 14 2024 at 21:29):

My cfg is NOT in a RPO so the entry block is NOT block zero.

view this post on Zulip Amanieu (Aug 14 2024 at 21:29):

Ah that might be it. regalloc3 assumes the entry block is block 0.

view this post on Zulip Leaves (Aug 14 2024 at 21:29):

Does the translation layer do anything current to rewrite this?

view this post on Zulip Amanieu (Aug 14 2024 at 21:30):

No, it doesn't change the block ordering.

view this post on Zulip Amanieu (Aug 14 2024 at 21:30):

I suppose it could...

view this post on Zulip Leaves (Aug 14 2024 at 21:30):

For now, I'll compute an RPO and feed that in, been meaning to do this for a while anyway.

view this post on Zulip Amanieu (Aug 14 2024 at 21:31):

regalloc3 requires that if block A dominates block B, A must come before B in the block order. Since the entry block dominates everything, it must be the first block.

view this post on Zulip Leaves (Aug 14 2024 at 21:32):

:thumbs_up:

view this post on Zulip Leaves (Aug 14 2024 at 21:33):

Additionally, I'm a bit curious about the omission of branch_blockparams in the Function trait.

I see the jump params method but why none for conditional branches? Are separate destinations not allowed to have separate parameter lists?

view this post on Zulip Amanieu (Aug 14 2024 at 21:34):

Block parameters are only allowed on blocks with multiple predecessors.

view this post on Zulip Amanieu (Aug 14 2024 at 21:34):

This makes sense if you think about it: if there is only a single predecessor then you can just use the value from the predecessor directly instead of creating a separate value that is just a copy.

view this post on Zulip Leaves (Aug 14 2024 at 21:36):

I agree with that, having trouble wrapping my mind around the implementation.

If I originally had this:

fn branch_blockparams(&self, block: Block, _insn: Inst, succ_idx: usize) -> &[VReg] {
        &self.blocks[block.index()].successor_params[succ_idx]
    }

implementation for regalloc2, where is the equivalent or similar requirement in regalloc3?

view this post on Zulip Amanieu (Aug 14 2024 at 21:37):

It's called jump_blockparams

view this post on Zulip Amanieu (Aug 14 2024 at 21:37):

I make a distinction between jump terminators (which can have blockparams but no operands) and branch terminators (which can have operands but no blockparams).

view this post on Zulip Leaves (Aug 14 2024 at 21:38):

I see that, but there it is not specified which successor is the target, as is done above with succ_index.

view this post on Zulip Amanieu (Aug 14 2024 at 21:41):

Because there can only be 1 successor if the successor has multiple predecessors.

view this post on Zulip Amanieu (Aug 14 2024 at 21:42):

Due to the critical edge rule.

view this post on Zulip Leaves (Aug 14 2024 at 21:51):

Ah... yes that is true!

view this post on Zulip Leaves (Aug 14 2024 at 23:05):

Is there a reason terminating instructions cannot have clobbers? I have a call to a non returning function(fastfails on windows) which is the "terminating instruction" and ra3 isn't happy about that. I can just put a UD2 or INT3 after it to quiet it down but still what is the rationale behind that requirement?

view this post on Zulip Leaves (Aug 14 2024 at 23:07):

Specifically this: Terminator with no successors cannot have clobbers

view this post on Zulip Leaves (Aug 14 2024 at 23:07):

Also I love your error printing, very nice!

view this post on Zulip Amanieu (Aug 14 2024 at 23:09):

Because clobbers have no effect in that context: there's nothing after the instruction.

view this post on Zulip Leaves (Aug 14 2024 at 23:09):

But it doesn't strictly cause a problem does it?

view this post on Zulip Amanieu (Aug 14 2024 at 23:10):

Well I did want to take advantage of this for some optimizations, but never got around to it.

view this post on Zulip Amanieu (Aug 14 2024 at 23:10):

It's still on my TODO list.

view this post on Zulip Leaves (Aug 14 2024 at 23:10):

Putting a UD2/nop or something will solve it for me so not really a big problem anyway to impose that requirement.

view this post on Zulip Amanieu (Aug 14 2024 at 23:11):

No, the solution is to remove the clobbers since they have no effect.

view this post on Zulip Leaves (Aug 14 2024 at 23:11):

But that would require making the call aware that it is "special" and doesn't return.

view this post on Zulip Amanieu (Aug 14 2024 at 23:12):

You could always add a check in inst_clobbers to see if this is the last instruction in the block or something.

view this post on Zulip Amanieu (Aug 14 2024 at 23:12):

I suppose I could also relax the check, but I'm not super comfortable about it.

view this post on Zulip Leaves (Aug 16 2024 at 03:45):

Somewhat interesting, there seem to be some unnecessary moves taking place back into spill slots and registers even though they are not used after the move. Here is an example:
https://i.imgur.com/ecAa5Cp.png
The superfluous move of the value in RAX into a stack slot(at 2C0h) which is never again touched(this block is the last in the function) is seen above^

view this post on Zulip Amanieu (Aug 16 2024 at 07:29):

That's from the lack of splitting. We are immediately spilling right after the definition of a value.

view this post on Zulip Amanieu (Aug 16 2024 at 07:31):

However move optimization is able to optimize the reload away but not the spill.

view this post on Zulip Leaves (Aug 16 2024 at 16:14):

I should also point out the output there, and I cannot figure it out why exactly(because that single function basic block is quite large) is not semantically equivalent to the regalloc2 output and causes the binary to exit because a conditional branch is taken in a later function that shouldn't be.

Again though, this is an extreme example, I was compiling chrome.dll on Windows, and this is a very very large function. I'm not going to do much more with it now as I have other things that need tending to, but I can send you the output for regalloc2 and regalloc3 in DMs if you wish to see it yourself, as well as my machine instruction representation that is fed into regalloc.

view this post on Zulip Amanieu (Aug 16 2024 at 17:11):

With debug_assertions enabled it also validates that the result of register allocation is correct.

view this post on Zulip Amanieu (Aug 16 2024 at 17:11):

You can try a release build with debug assertions enabled if debug builds are too slow.

view this post on Zulip Amanieu (Aug 16 2024 at 17:15):

Alternatively if you've already narrowed it down to a single function, you can edit the regalloc2 adapter to dump the text representation of the input function in a way that can be later used with regalloc3-tool:

            log::info!(
                "Reginfo:\n{}",
                regalloc3::debug_utils::DisplayRegInfo(&self.reginfo)
            );
            log::info!(
                "Input function:\n{}",
                regalloc3::debug_utils::DisplayFunction(&self.adapter)
            );

view this post on Zulip ghostway (Aug 31 2024 at 16:10):

This is one cool of a project! Is there a todo.txt anywhere? Would love to help

view this post on Zulip Amanieu (Sep 01 2024 at 22:33):

@ghostway I do have an internal todo.txt locally, but it's not really suitable for public consumption. I'm taking a break at the moment to work on other things, but I will get back to regalloc3 soon.

view this post on Zulip Panagiotis Karouzakis (Sep 05 2024 at 21:14):

Hello! While we are not working on RegAlloc me and @Dimitris Aspetakis are working on Instruction Scheduling inside Cranelift.

We have implemented some heuristics without so much benefit, only on some benchmarks of the XNNPACK. Currently we are implementing another heuristic but we get different results in different machines.

We are thinking that this is probably cause of the Reg Allocation.
Can we somehow count the number of spills?

view this post on Zulip Chris Fallin (Sep 05 2024 at 21:22):

@Panagiotis Karouzakis a very simple (slightly hacky) way to do this is to do a wasmtime compile, then objdump the resulting .cwasm file, and grep with a regular expression that matches loads/stores relative to the stack pointer

view this post on Zulip Chris Fallin (Sep 05 2024 at 21:23):

something like mov.*(%rsp).* on x86-64, or ldr.*sp.* + str.*sp.* on aarch64

view this post on Zulip Chris Fallin (Sep 05 2024 at 21:23):

then count lines in the disassembly that match


Last updated: Nov 22 2024 at 17:03 UTC