I saw someone mention that you were thinking of register allocating stack slots, you might be interested in a register allocator I'm currently writing in python that does exactly that. it also supports allocating ranges of registers (needed for our SVP64 extensions for PowerPC):
https://salsa.debian.org/Kazan-team/mirrors/bigint-presentation-code (link to mirror cuz it's nicer to browse)
it uses a graph-coloring algorithm based on a combination of:
Retargetable Graph-Coloring Register Allocation for Irregular Architectures
and a copy-merging algorithm based on Iterated Register Coalescing, section 5
I'm planning on adding a PowerPC backend to cranelift that supports SVP64, so this is partially experimentation on how to allocate register ranges -- if you have any better algorithm ideas, that'd be appreciated!
the main register allocation function: https://salsa.debian.org/Kazan-team/mirrors/bigint-presentation-code/-/blob/12d9d07e58b349c05f2c44dc7a8a0a667aa9c63d/src/bigint_presentation_code/register_allocator.py#L794
adding SVP64 support to regalloc2 would require expanding physical register indexes and register class indexes since SVP64 has 128 integer registers, 128 fp registers, 128 condition registers, and a few special-purpose registers, and, assuming we follow what I did above, we will need 64 register classes per register file, since vectors are allocated as ranges of registers and SVP64 supports all vectors from 1 to 64 elements. e.g. f64x53 takes 53 fp registers.
so f64x25 is a different register class than f64x16 since it uses a different number of fp registers.
I actually implement support for fixed stack slots in regalloc2: https://github.com/bytecodealliance/regalloc2/pull/17. The API is a bit of a hack since stack slots are represented as a PReg
but it works well enough to let the register allocator handle incoming/outgoing argument in stack slots.
The main reason why regalloc2 only allows a limited set of physical registers is because it tries to encode all of the information about an operand in 32 bits. Expanding this to 64 bits would allow more physical registers, but at a small cost in performance (~3% according to https://github.com/bytecodealliance/regalloc2/issues/5#issuecomment-914481573).
the register allocator i wrote is a bit more general than that, ssa values each have a set of locations they can be allocated to, some of which can be stack slots.
3% doesn't seem too bad, it's waay better than the register allocator i wrote which takes like 5min on a few hundred instructions -- definitely needs more optimization, though otoh if i was wanting it to be fast i would have written it in rust.
I'm certain that @Chris Fallin will want to look at this once he's back from vacation in a couple weeks!
I can tell you now though that compilation speed is a major goal for Cranelift: I had to go to extreme lengths to set up a benchmarking environment that could reliably measure 1% changes, because we care about those. And register allocation is the single biggest time-sink in Cranelift, so we have plans to make it significantly faster this year.
I think it's going to be an interesting and fun challenge to figure out how to do fast but decent register allocation for your instruction set!
posted about this on the libre-soc-dev mailing list: https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-January/005475.html
see also: the tracking bug for the sub-project i'm writing the register allocator as part of: https://bugs.libre-soc.org/show_bug.cgi?id=942
@Chris Fallin earlier ^ it was mentioned that you'd likely be interested in the problem of register allocating for SVP64, mentioning this again in case you missed it since iirc you were on vacation
I've started rewriting it in Rust, I got too fed up with Python's slowness and lack of zero-cost abstractions. This time it implements support for more than 1 basic block.
@Jacob Lifshay that's really interesting, thanks! Re: register ranges, I remember briefly considering register-range allocation then running away screaming, as it were, due to the complexity implications in regalloc2's architecture at least. (One could probe a range of registers just as one would an individual register, and decide to evict everything in the way or not, but the forward-progress implications are much harder to reason about.)
I'll be interested to dig into how your allocator handles this; I played with IRC a bit early on in regalloc explorations and it seemed fairly flexible, indeed. (IIRC, JavaScriptCore's optimizing tier uses it too?)
it mostly handles it by not evicting registers, yes it could use improvements.
if you run the tests it produces a series of .dot graphs in subdirectories of the dumped_graphs directory, they may make understanding the algorithm easier
note nmutil and nmigen need to be installed from git, the versions on pypi are waay out of date. clone urls:
https://git.libre-soc.org/git/nmutil.git
https://git.libre-soc.org/git/bigint-presentation-code.git
https://gitlab.com/nmigen/nmigen
nmigen needs to be installed first, then nmutil, then bigint-presentation-code
Chris Fallin said:
I'll be interested to dig into how your allocator handles this; I played with IRC a bit early on in regalloc explorations and it seemed fairly flexible, indeed. (IIRC, JavaScriptCore's optimizing tier uses it too?)
one of the main issues i was running into was that since I didn't implement spilling or live-range splitting in the python register allocator, i have a pass that splits all ssa variables before register allocation, inserting copies after each def and before each use, unfortunately IRC doesn't merge ssa values with a fixed register assignment, so function arguments are often copied to a new register before being used.
e.g. here r3 (first argument reg on ppc64le) is copied unnecessarily to r10: https://git.libre-soc.org/?p=bigint-presentation-code.git;a=blob;f=src/bigint_presentation_code/_tests/test_toom_cook.py;h=ae8b6c0bb6b087496f2ca81b0453d4097ef58b10;hb=HEAD#l877
(in case you're wondering, yes, i need to add a DCE pass)
Last updated: Oct 23 2024 at 20:03 UTC