register allocating the stack · cranelift

I saw someone mention that you were thinking of register allocating stack slots, you might be interested in a register allocator I'm currently writing in python that does exactly that. it also supports allocating ranges of registers (needed for our SVP64 extensions for PowerPC):
https://salsa.debian.org/Kazan-team/mirrors/bigint-presentation-code (link to mirror cuz it's nicer to browse)

Kazan / mirrors / bigint-presentation-code · GitLab

https://git.libre-soc.org/?p=bigint-presentation-code.git;a=summary

Jacob Lifshay (Jan 05 2023 at 07:47):

Jacob Lifshay (Jan 05 2023 at 07:50):

I'm planning on adding a PowerPC backend to cranelift that supports SVP64, so this is partially experimentation on how to allocate register ranges -- if you have any better algorithm ideas, that'd be appreciated!

Jacob Lifshay (Jan 05 2023 at 07:54):

src/bigint_presentation_code/register_allocator.py · 12d9d07e58b349c05f2c44dc7a8a0a667aa9c63d · Kazan / mirrors / bigint-presentation-code · GitLab

https://git.libre-soc.org/?p=bigint-presentation-code.git;a=summary

Jacob Lifshay (Jan 05 2023 at 08:04):

adding SVP64 support to regalloc2 would require expanding physical register indexes and register class indexes since SVP64 has 128 integer registers, 128 fp registers, 128 condition registers, and a few special-purpose registers, and, assuming we follow what I did above, we will need 64 register classes per register file, since vectors are allocated as ranges of registers and SVP64 supports all vectors from 1 to 64 elements. e.g. f64x53 takes 53 fp registers.

Jacob Lifshay (Jan 05 2023 at 08:05):

so f64x25 is a different register class than f64x16 since it uses a different number of fp registers.

Amanieu (Jan 05 2023 at 16:59):

I actually implement support for fixed stack slots in regalloc2: https://github.com/bytecodealliance/regalloc2/pull/17. The API is a bit of a hack since stack slots are represented as a PReg but it works well enough to let the register allocator handle incoming/outgoing argument in stack slots.

Add support for fixed stack slots by Amanieu · Pull Request #17 · bytecodealliance/regalloc2

This PR adds the ability to mark a PReg as being a fixed stack slot through MachineEnv. This means that: The PReg can be specified using OperandConstraint::FixedReg. This PReg can be selected for ...

Amanieu (Jan 05 2023 at 17:02):

The main reason why regalloc2 only allows a limited set of physical registers is because it tries to encode all of the information about an operand in 32 bits. Expanding this to 64 bits would allow more physical registers, but at a small cost in performance (~3% according to https://github.com/bytecodealliance/regalloc2/issues/5#issuecomment-914481573).

Investigate better bitpacking for Operand and Use · Issue #5 · bytecodealliance/regalloc2

Two core data-structure elements, Operand and Use, are both designed to fit a relatively large amount of information in one u32. This is a performance optimization that we have found to be relative...

Jacob Lifshay (Jan 05 2023 at 18:00):

the register allocator i wrote is a bit more general than that, ssa values each have a set of locations they can be allocated to, some of which can be stack slots.

3% doesn't seem too bad, it's waay better than the register allocator i wrote which takes like 5min on a few hundred instructions -- definitely needs more optimization, though otoh if i was wanting it to be fast i would have written it in rust.

Jamey Sharp (Jan 05 2023 at 18:14):

I'm certain that @Chris Fallin will want to look at this once he's back from vacation in a couple weeks!
I can tell you now though that compilation speed is a major goal for Cranelift: I had to go to extreme lengths to set up a benchmarking environment that could reliably measure 1% changes, because we care about those. And register allocation is the single biggest time-sink in Cranelift, so we have plans to make it significantly faster this year.
I think it's going to be an interesting and fun challenge to figure out how to do fast but decent register allocation for your instruction set!

sightglass/cpu-isolation.md at main · bytecodealliance/sightglass

A benchmark suite and tool to compare different implementations of the same primitives. - sightglass/cpu-isolation.md at main · bytecodealliance/sightglass

Jacob Lifshay (Jan 09 2023 at 19:00):

Jacob Lifshay (Jan 31 2023 at 08:33):

@Chris Fallin earlier ^ it was mentioned that you'd likely be interested in the problem of register allocating for SVP64, mentioning this again in case you missed it since iirc you were on vacation

Jacob Lifshay (Jan 31 2023 at 08:36):

I've started rewriting it in Rust, I got too fed up with Python's slowness and lack of zero-cost abstractions. This time it implements support for more than 1 basic block.

Chris Fallin (Jan 31 2023 at 17:04):

@Jacob Lifshay that's really interesting, thanks! Re: register ranges, I remember briefly considering register-range allocation then running away screaming, as it were, due to the complexity implications in regalloc2's architecture at least. (One could probe a range of registers just as one would an individual register, and decide to evict everything in the way or not, but the forward-progress implications are much harder to reason about.)

I'll be interested to dig into how your allocator handles this; I played with IRC a bit early on in regalloc explorations and it seemed fairly flexible, indeed. (IIRC, JavaScriptCore's optimizing tier uses it too?)

Jacob Lifshay (Jan 31 2023 at 17:11):

Jacob Lifshay (Jan 31 2023 at 17:14):

if you run the tests it produces a series of .dot graphs in subdirectories of the dumped_graphs directory, they may make understanding the algorithm easier

Jacob Lifshay (Jan 31 2023 at 17:19):

nMigen 1.x / nmigen · GitLab

A refreshed Python toolbox for building complex digital hardware

Jacob Lifshay (Jan 31 2023 at 17:21):

Jacob Lifshay (Jan 31 2023 at 17:32):

one of the main issues i was running into was that since I didn't implement spilling or live-range splitting in the python register allocator, i have a pass that splits all ssa variables before register allocation, inserting copies after each def and before each use, unfortunately IRC doesn't merge ssa values with a fixed register assignment, so function arguments are often copied to a new register before being used.

Stream: cranelift

Topic: register allocating the stack

Jacob Lifshay (Jan 05 2023 at 07:44):