Stream: cranelift

Topic: register allocating the stack


view this post on Zulip Jacob Lifshay (Jan 05 2023 at 07:44):

I saw someone mention that you were thinking of register allocating stack slots, you might be interested in a register allocator I'm currently writing in python that does exactly that. it also supports allocating ranges of registers (needed for our SVP64 extensions for PowerPC):
https://salsa.debian.org/Kazan-team/mirrors/bigint-presentation-code (link to mirror cuz it's nicer to browse)

https://git.libre-soc.org/?p=bigint-presentation-code.git;a=summary

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 07:47):

it uses a graph-coloring algorithm based on a combination of:
Retargetable Graph-Coloring Register Allocation for Irregular Architectures
and a copy-merging algorithm based on Iterated Register Coalescing, section 5

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 07:50):

I'm planning on adding a PowerPC backend to cranelift that supports SVP64, so this is partially experimentation on how to allocate register ranges -- if you have any better algorithm ideas, that'd be appreciated!

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 07:54):

the main register allocation function: https://salsa.debian.org/Kazan-team/mirrors/bigint-presentation-code/-/blob/12d9d07e58b349c05f2c44dc7a8a0a667aa9c63d/src/bigint_presentation_code/register_allocator.py#L794

https://git.libre-soc.org/?p=bigint-presentation-code.git;a=summary

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 08:04):

adding SVP64 support to regalloc2 would require expanding physical register indexes and register class indexes since SVP64 has 128 integer registers, 128 fp registers, 128 condition registers, and a few special-purpose registers, and, assuming we follow what I did above, we will need 64 register classes per register file, since vectors are allocated as ranges of registers and SVP64 supports all vectors from 1 to 64 elements. e.g. f64x53 takes 53 fp registers.

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 08:05):

so f64x25 is a different register class than f64x16 since it uses a different number of fp registers.

view this post on Zulip Amanieu (Jan 05 2023 at 16:59):

I actually implement support for fixed stack slots in regalloc2: https://github.com/bytecodealliance/regalloc2/pull/17. The API is a bit of a hack since stack slots are represented as a PReg but it works well enough to let the register allocator handle incoming/outgoing argument in stack slots.

This PR adds the ability to mark a PReg as being a fixed stack slot through MachineEnv. This means that: The PReg can be specified using OperandConstraint::FixedReg. This PReg can be selected for ...

view this post on Zulip Amanieu (Jan 05 2023 at 17:02):

The main reason why regalloc2 only allows a limited set of physical registers is because it tries to encode all of the information about an operand in 32 bits. Expanding this to 64 bits would allow more physical registers, but at a small cost in performance (~3% according to https://github.com/bytecodealliance/regalloc2/issues/5#issuecomment-914481573).

Two core data-structure elements, Operand and Use, are both designed to fit a relatively large amount of information in one u32. This is a performance optimization that we have found to be relative...

view this post on Zulip Jacob Lifshay (Jan 05 2023 at 18:00):

the register allocator i wrote is a bit more general than that, ssa values each have a set of locations they can be allocated to, some of which can be stack slots.

3% doesn't seem too bad, it's waay better than the register allocator i wrote which takes like 5min on a few hundred instructions -- definitely needs more optimization, though otoh if i was wanting it to be fast i would have written it in rust.

view this post on Zulip Jamey Sharp (Jan 05 2023 at 18:14):

I'm certain that @Chris Fallin will want to look at this once he's back from vacation in a couple weeks!
I can tell you now though that compilation speed is a major goal for Cranelift: I had to go to extreme lengths to set up a benchmarking environment that could reliably measure 1% changes, because we care about those. And register allocation is the single biggest time-sink in Cranelift, so we have plans to make it significantly faster this year.
I think it's going to be an interesting and fun challenge to figure out how to do fast but decent register allocation for your instruction set!

A benchmark suite and tool to compare different implementations of the same primitives. - sightglass/cpu-isolation.md at main · bytecodealliance/sightglass

view this post on Zulip Jacob Lifshay (Jan 09 2023 at 19:00):

posted about this on the libre-soc-dev mailing list: https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-January/005475.html
see also: the tracking bug for the sub-project i'm writing the register allocator as part of: https://bugs.libre-soc.org/show_bug.cgi?id=942

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 08:33):

@Chris Fallin earlier ^ it was mentioned that you'd likely be interested in the problem of register allocating for SVP64, mentioning this again in case you missed it since iirc you were on vacation

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 08:36):

I've started rewriting it in Rust, I got too fed up with Python's slowness and lack of zero-cost abstractions. This time it implements support for more than 1 basic block.

view this post on Zulip Chris Fallin (Jan 31 2023 at 17:04):

@Jacob Lifshay that's really interesting, thanks! Re: register ranges, I remember briefly considering register-range allocation then running away screaming, as it were, due to the complexity implications in regalloc2's architecture at least. (One could probe a range of registers just as one would an individual register, and decide to evict everything in the way or not, but the forward-progress implications are much harder to reason about.)

I'll be interested to dig into how your allocator handles this; I played with IRC a bit early on in regalloc explorations and it seemed fairly flexible, indeed. (IIRC, JavaScriptCore's optimizing tier uses it too?)

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:11):

it mostly handles it by not evicting registers, yes it could use improvements.

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:14):

if you run the tests it produces a series of .dot graphs in subdirectories of the dumped_graphs directory, they may make understanding the algorithm easier

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:19):

note nmutil and nmigen need to be installed from git, the versions on pypi are waay out of date. clone urls:
https://git.libre-soc.org/git/nmutil.git
https://git.libre-soc.org/git/bigint-presentation-code.git
https://gitlab.com/nmigen/nmigen

A refreshed Python toolbox for building complex digital hardware

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:21):

nmigen needs to be installed first, then nmutil, then bigint-presentation-code

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:32):

Chris Fallin said:

I'll be interested to dig into how your allocator handles this; I played with IRC a bit early on in regalloc explorations and it seemed fairly flexible, indeed. (IIRC, JavaScriptCore's optimizing tier uses it too?)

one of the main issues i was running into was that since I didn't implement spilling or live-range splitting in the python register allocator, i have a pass that splits all ssa variables before register allocation, inserting copies after each def and before each use, unfortunately IRC doesn't merge ssa values with a fixed register assignment, so function arguments are often copied to a new register before being used.

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:36):

e.g. here r3 (first argument reg on ppc64le) is copied unnecessarily to r10: https://git.libre-soc.org/?p=bigint-presentation-code.git;a=blob;f=src/bigint_presentation_code/_tests/test_toom_cook.py;h=ae8b6c0bb6b087496f2ca81b0453d4097ef58b10;hb=HEAD#l877

view this post on Zulip Jacob Lifshay (Jan 31 2023 at 17:38):

(in case you're wondering, yes, i need to add a DCE pass)


Last updated: Oct 23 2024 at 20:03 UTC