regalloc: overlapping register classes · cranelift

Stream: cranelift

Topic: regalloc: overlapping register classes

Sam Parker (Nov 09 2021 at 14:54):

Hi,
I'm currently thinking about how we can get SVE support in, and I'm wondering how this could work with the current register allocator. My issue is that the bottom 128 bits of an SVE 'Z' register alias a 'V' register and, from my basic understanding, this currently wouldn't be supported. Also, am I right in remembering that another regalloc is in the works?

Amanieu (Nov 09 2021 at 14:57):

The register allocator doesn't need to track 'V' and 'Z' registers separately: if an SVE instruction is using a 'Z' register then the corresponding 'V' register can't be used, and vice-versa.

Chris Fallin (Nov 09 2021 at 16:42):

Hi @Sam Parker -- agree with what Amanieu said in this case; the unit of allocation is just a "V or Z register" since (if I understand correctly) they're 1-to-1 (?). In other cases where non-1-to-1 overlaps are needed, e.g. ARM32 with 64-bit float d0 aliasing s0 and s1, we do need a solution, but haven't had the resources to develop one yet.

Re: new regalloc, yes, that's regalloc2; it doesn't solve this problem either, yet, but conceivably it could, with some work, by adding subunits to the commitment map or something like that. (RA2 is currently stalled on a licensing issue before we can release it, and the compatibility shim that allows Cranelift to use it via the current regalloc.rs API is also pending review still due to limited resources, but I'm hoping to nudge things forward...)
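A minimal sketch of the "subunits in the commitment map" idea, assuming a hypothetical model where each physical register is described by the set of allocation units it occupies. This is not regalloc2's actual data structure; `PhysReg`, `units` and `CommitmentMap` are illustrative names. On AArch64, Vn and Zn occupy the same single unit, so they behave as one allocatable register. On ARM32, dN occupies the units of s(2N) and s(2N+1), which is the non-1-to-1 case Chris describes.

```rust
use std::collections::HashSet;

/// Hypothetical physical register description (one map instance per target).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PhysReg {
    /// AArch64 Vn and Zn: both occupy unit n, so they conflict 1-to-1.
    A64V(u8),
    A64Z(u8),
    /// ARM32 single-precision sN: occupies unit N.
    S(u8),
    /// ARM32 double-precision dN: occupies units 2N and 2N+1 (s2N, s2N+1).
    D(u8),
}

impl PhysReg {
    /// Allocation units covered by this register.
    fn units(self) -> Vec<u32> {
        match self {
            PhysReg::A64V(n) | PhysReg::A64Z(n) | PhysReg::S(n) => vec![n as u32],
            PhysReg::D(n) => vec![2 * n as u32, 2 * n as u32 + 1],
        }
    }
}

/// Tracks which units are currently committed to live values.
#[derive(Default)]
struct CommitmentMap {
    committed: HashSet<u32>,
}

impl CommitmentMap {
    /// True if none of `reg`'s units are committed.
    fn is_free(&self, reg: PhysReg) -> bool {
        reg.units().iter().all(|u| !self.committed.contains(u))
    }

    /// Commits every unit of `reg`; returns false and commits nothing on conflict.
    fn try_commit(&mut self, reg: PhysReg) -> bool {
        if !self.is_free(reg) {
            return false;
        }
        self.committed.extend(reg.units());
        true
    }

    /// Releases every unit of `reg`.
    fn release(&mut self, reg: PhysReg) {
        for u in reg.units() {
            self.committed.remove(&u);
        }
    }
}
```

With this model, committing v13 makes z13 unavailable, and committing s1 blocks d0 but not d1.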

Sam Parker (Nov 10 2021 at 08:47):

Okay, thanks. So how is the aliasing handled, could you point me to an example? My reasoning for thinking this wouldn't work is because currently the AArch64 backend only has two disjoint register classes, and I will need to add a third because the scalable type is wildly different from the existing V128 class. So I need to somehow map 'distinct' registers to each other, which is different from H,S,D and Q 'regs' being the same register.

Chris Fallin (Nov 10 2021 at 16:49):

@Sam Parker is my understanding correct that register Zn aliases Vn (and only that)? If so, I think we can just use the existing register class; allocating, say, v13 means you can either use v13 or z13 at your leisure. Or is the aliasing more complex than that?

Reading this intro I see that there are also predicate registers p0 through p15; those would indeed need a separate register class. I haven't looked at the relevant bits of regalloc.rs in long enough to remember the details, but adding to the RegClass enum and seeing what breaks might give a reasonable indication of how much work is required. Alternatively, depending on how hacky you're feeling, you could just reuse an unused class (say, RegClass::I32) for the predicate registers; the allocator doesn't actually care what the classes are, just that they're different bins of resources.
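A sketch of the "different bins of resources" point, assuming a hypothetical `RegClass` enum and per-class free pools. The variant names echo regalloc.rs, but this is not its definition, and the pool contents are simplified. Borrowing an otherwise-unused class for predicates works because the allocator only ever asks "give me a free register from bin X".

```rust
use std::collections::HashMap;

/// Hypothetical register classes (names echo regalloc.rs; not its real definition).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum RegClass {
    I32,
    I64,
    V128,
}

/// The hack from the discussion: reuse a class AArch64 doesn't otherwise use.
const PRED_CLASS: RegClass = RegClass::I32;

/// The allocator's view: one bin of free register indices per class.
struct Pools {
    free: HashMap<RegClass, Vec<u8>>,
}

impl Pools {
    fn aarch64_with_sve() -> Self {
        let mut free: HashMap<RegClass, Vec<u8>> = HashMap::new();
        // Simplified: real allocatable sets exclude reserved registers.
        free.insert(RegClass::I64, (0..16).collect()); // some of x0-x30
        free.insert(RegClass::V128, (0..32).collect()); // v0-v31 (= z0-z31)
        free.insert(PRED_CLASS, (0..16).collect()); // p0-p15
        Pools { free }
    }

    /// Takes any free register from the given class, if one is left.
    fn alloc(&mut self, class: RegClass) -> Option<u8> {
        self.free.get_mut(&class)?.pop()
    }
}
```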

Chris Fallin (Nov 10 2021 at 16:51):

Ah, and before I forget, the ABI code definitely has some per-reg-class behavior in the prologue/epilogue generation and for the clobber list on calls; that'll need updates too, I think.

Sam Parker (Nov 10 2021 at 16:55):

@Chris Fallin Yes, this is my main concern. I'm assuming we're going to need to make some significant changes around the frame layout, because we'll now have a register whose size is implementation-defined (only known at runtime) to store/restore on the stack. The ABI code doesn't seem to guarantee that accurate type information is passed, so I'm assuming we need to be able to fall back safely.

Chris Fallin (Nov 10 2021 at 16:57):

Ah! OK, I understand better now what's going on. In the current state of things, the Option<Type> should always be a Some, I think, but that's not guaranteed, and in fact won't be the case moving forward: regalloc2 handles all moves in a consolidated way and forgets type info before spill generation (or, more precisely, it can share a register between two different types that live in the same class depending on the path, so it would need a lattice to merge types).
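A sketch of the "fall back safely" idea for sizing spill slots in a shared V/Z class, assuming hypothetical `Ty` and `SpillSize` types (not Cranelift's). When the type is missing, spilling one full vector length is conservative, since any value in the class fits in a Z register. SVE vector lengths are multiples of 128 bits up to 2048 bits, which gives the 256-byte worst case used for frame-size bounds. Note that this only protects against a missing type; a wrong `Some(..)` would still be unsafe, which is one reason separate classes are preferred.

```rust
/// Hypothetical value types that live in the shared V/Z register class.
#[derive(Clone, Copy, Debug)]
enum Ty {
    F32,
    F64,
    V128,
    ScalableVec,
}

/// Architectural maximum SVE vector length: 2048 bits = 256 bytes.
const MAX_SVE_VL_BYTES: u32 = 256;

/// Size of a spill slot for a V/Z-class value.
#[derive(Debug, PartialEq, Eq)]
enum SpillSize {
    /// Byte count known at compile time.
    Fixed(u32),
    /// A multiple of the runtime vector length (VL).
    VectorLengths(u32),
}

fn spill_size(ty: Option<Ty>) -> SpillSize {
    match ty {
        Some(Ty::F32) => SpillSize::Fixed(4),
        Some(Ty::F64) => SpillSize::Fixed(8),
        Some(Ty::V128) => SpillSize::Fixed(16),
        Some(Ty::ScalableVec) => SpillSize::VectorLengths(1),
        // No type info: the register may hold a full Z value, so save all of it.
        None => SpillSize::VectorLengths(1),
    }
}

/// Worst-case byte size of a slot, e.g. for bounding the frame size.
fn upper_bound_bytes(size: &SpillSize) -> u32 {
    match size {
        SpillSize::Fixed(n) => *n,
        SpillSize::VectorLengths(k) => k * MAX_SVE_VL_BYTES,
    }
}
```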

Chris Fallin (Nov 10 2021 at 16:58):

So we need a better answer for this, or else we need separate reg classes; agreed

Chris Fallin (Nov 10 2021 at 16:59):

I think separate reg classes are the cleaner answer at the API/conceptual level (at the boundary between regalloc and lowering), so I'd prefer to go that way rather than try to preserve type info, which can be fragile and is catastrophic if wrong (see e.g. the CVE last April). Aliasing classes should be possible in regalloc2; in regalloc.rs it's dicier, so maybe rely on the type info just for prototyping.

Chris Fallin (Nov 10 2021 at 17:00):

are the regs caller- or callee-saved, out of curiosity?

Anton Kirilov (Nov 10 2021 at 17:04):

The answer is actually complicated.

Anton Kirilov (Nov 10 2021 at 17:05):

It depends on whether the function has a parameter that is either a scalable vector or predicate register or whether it returns one.

Anton Kirilov (Nov 10 2021 at 17:08):

If the answer to both questions is no, then the function can be treated as if it has no awareness of SVE at all, at least with respect to ABI issues.

Anton Kirilov (Nov 10 2021 at 17:10):

That is, only d8 - d15 are callee-saved.

Anton Kirilov (Nov 10 2021 at 17:14):

Otherwise z8 - z23 and p4 - p15 are callee-saved.
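Anton's rule can be written down as a small function. This is a sketch: `CalleeSaved` and `callee_saved` are hypothetical names, the flag means "the signature takes or returns a scalable vector or predicate value", and integer callee-saved registers (x19-x28) are left out. In the SVE case d8-d15 are not listed separately because they are the low bits of z8-z15, which are preserved in full.

```rust
/// Callee-saved FP/SIMD/SVE state for one function.
#[derive(Debug, PartialEq, Eq)]
struct CalleeSaved {
    /// Registers whose low 64 bits (dN) must be preserved.
    d_regs: Vec<u8>,
    /// Z registers that must be preserved in full.
    z_regs: Vec<u8>,
    /// Predicate registers that must be preserved.
    p_regs: Vec<u8>,
}

/// `sve_in_signature`: does the function take or return an SVE vector/predicate?
fn callee_saved(sve_in_signature: bool) -> CalleeSaved {
    if sve_in_signature {
        // z8-z23 in full (this covers d8-d15), plus p4-p15.
        CalleeSaved {
            d_regs: vec![],
            z_regs: (8..=23).collect(),
            p_regs: (4..=15).collect(),
        }
    } else {
        // SVE-unaware convention: only d8-d15.
        CalleeSaved {
            d_regs: (8..=15).collect(),
            z_regs: vec![],
            p_regs: vec![],
        }
    }
}
```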

Anton Kirilov (Nov 10 2021 at 17:16):

There is a further complication: not all predicate registers are equivalent.

Anton Kirilov (Nov 10 2021 at 17:16):

Only the first 8 can be used to control an operation, i.e. as a governing predicate.

Chris Fallin (Nov 10 2021 at 17:18):

This is very interesting indeed! The last constraint sounds like a bit of a challenge; I imagine some lowerings will require a first-8 pred reg and others can take any? So that's two overlapping classes as well

Anton Kirilov (Nov 10 2021 at 17:18):

So, we may actually need 2 register classes for predicate registers - one for governing predicates and another for the rest.
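The two predicate classes overlap: "governing" (p0-p7) is a subset of "any" (p0-p15). A sketch of how an allocator might honour that, using hypothetical names. Preferring p8-p15 for unconstrained values is a heuristic added here, not something proposed in the thread; it keeps the scarcer governing registers available.

```rust
/// Hypothetical constraint on a predicate operand.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PredConstraint {
    /// Governing predicate: only p0-p7 are encodable.
    Governing,
    /// Any predicate register p0-p15.
    Any,
}

fn satisfies(reg: u8, c: PredConstraint) -> bool {
    match c {
        PredConstraint::Governing => reg < 8,
        PredConstraint::Any => reg < 16,
    }
}

/// Picks a free register for constraint `c`: the lowest legal one for
/// governing uses, the highest for unconstrained uses.
fn pick(free: &[u8], c: PredConstraint) -> Option<u8> {
    let legal = free.iter().copied().filter(|&r| satisfies(r, c));
    match c {
        PredConstraint::Governing => legal.min(),
        PredConstraint::Any => legal.max(),
    }
}
```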

Anton Kirilov (Nov 10 2021 at 17:19):

A simpler option is to just forget about the existence of p8 - p15.

Chris Fallin (Nov 10 2021 at 17:19):

that would do for a prototype, yep

Chris Fallin (Nov 10 2021 at 17:19):

next question -- can predicates be spilled relatively cheaply or is that more like a "materialize the flags by transferring to an int reg" operation that's costly?

Anton Kirilov (Nov 10 2021 at 17:20):

No need to do anything like that.

Chris Fallin (Nov 10 2021 at 17:21):

ok, so it sounds like we can just treat this like a separate class; so the hardest part is knowing how wide one needs to spill for vector regs

Anton Kirilov (Nov 10 2021 at 17:22):

The only issue is that there aren't any auto-decrementing (or incrementing) addressing modes.

Chris Fallin (Nov 10 2021 at 17:22):

Ah, good to know; relevant for prologue code I guess

Anton Kirilov (Nov 10 2021 at 17:23):

However, there are instructions that increment or decrement by the size of a scalable vector or predicate register.
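An example of those instructions is ADDVL, which adds a multiple of the current vector length in bytes to a register (immediate range -32 to 31); ADDPL does the same in predicate lengths. Combined with the `#imm, mul vl` addressing form of SVE STR/LDR, a prologue can drop SP once and then address slots without auto-decrement. A sketch that emits such a prologue as text; `save_z_regs` is a hypothetical helper, and the matching epilogue (LDRs followed by a positive ADDVL) is omitted.

```rust
/// Emits a prologue fragment that saves `regs` in a VL-scaled area below SP.
/// Slot i holds regs[i]; slot indices must fit STR's mul-vl range (-256..=255).
fn save_z_regs(regs: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    // ADDVL's immediate is limited to -32..=31 vector lengths, so larger
    // adjustments are split across several instructions.
    let mut remaining = regs.len() as i64;
    while remaining > 0 {
        let step = remaining.min(32);
        out.push(format!("addvl sp, sp, #-{}", step));
        remaining -= step;
    }
    // Each store addresses its slot as a multiple of the vector length.
    for (slot, r) in regs.iter().enumerate() {
        out.push(format!("str z{}, [sp, #{}, mul vl]", r, slot));
    }
    out
}
```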

Chris Fallin (Nov 10 2021 at 17:23):

@Sam Parker so it sounds like the simplest design for a prototype would be (i) one reg class for both Vn and Zn regs, (ii) a separate reg class for Pn regs, (iii) use the type to know how much of a Vn/Zn reg to spill, as that's reliable for now; and (iv) we can figure out overlapping classes in due time. Sound reasonable?
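Points (i) and (ii) of that plan amount to a type-to-class mapping, where the allocated index in the vector class is later printed as vN or zN depending on the instruction. A sketch with hypothetical `Class`/`Ty` enums and a hypothetical `vector_reg_name` helper:

```rust
/// Hypothetical register classes for the prototype.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    Int,
    Vector,
    Pred,
}

/// Hypothetical value types.
#[derive(Clone, Copy, Debug)]
enum Ty {
    I64,
    F64,
    V128,
    ScalableVec,
    ScalablePred,
}

fn class_of(ty: Ty) -> Class {
    match ty {
        Ty::I64 => Class::Int,
        // (i) Vn and Zn share one class; the allocator hands out an index n.
        Ty::F64 | Ty::V128 | Ty::ScalableVec => Class::Vector,
        // (ii) SVE predicates get their own class.
        Ty::ScalablePred => Class::Pred,
    }
}

/// The same allocated index is spelled vN or zN depending on the instruction.
fn vector_reg_name(index: u8, sve_instruction: bool) -> String {
    if sve_instruction {
        format!("z{}", index)
    } else {
        format!("v{}", index)
    }
}
```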

Anton Kirilov (Nov 10 2021 at 17:25):

Coming back to the predicate registers: one feature that would be nice to have in the register allocator is the ability to "spill" a governing predicate register to another predicate register that can't be used as a governing predicate.

Sam Parker (Nov 10 2021 at 17:26):

@Chris Fallin Yes, this sounds like the only option really, and not too bad if we have reliable type info. Thanks!

Chris Fallin (Nov 10 2021 at 17:26):

@Anton Kirilov that's an interesting idea, sort of like a multi-tiered spill -- "this class is best, if not then that class, if not then spill to stack"

Chris Fallin (Nov 10 2021 at 17:26):

needs some thought :-)
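A sketch of the multi-tiered placement Chris describes, assuming hypothetical names and treating each tier as a simple stack of free registers. For Anton's case, tier 0 would be p0-p7 and tier 1 p8-p15; a value that must leave its preferred register goes to the first tier with room, and only then to a fresh stack slot.

```rust
/// Where a value ended up.
#[derive(Debug, PartialEq, Eq)]
enum Location {
    /// A register from the given tier (0 = most preferred).
    Reg { tier: usize, reg: u8 },
    /// A newly allocated stack slot.
    StackSlot(u32),
}

/// Tries each tier in order of preference, then falls back to the stack.
fn place(tiers: &mut [Vec<u8>], next_slot: &mut u32) -> Location {
    for (tier, free) in tiers.iter_mut().enumerate() {
        if let Some(reg) = free.pop() {
            return Location::Reg { tier, reg };
        }
    }
    let slot = *next_slot;
    *next_slot += 1;
    Location::StackSlot(slot)
}
```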

Anton Kirilov (Nov 10 2021 at 17:27):

That's pretty much the use case for the last 8 predicate registers - storage space.

Anton Kirilov (Nov 10 2021 at 17:28):

As I said, we could simply forget about their existence in an initial implementation, but then the cost would be potentially more spilling to the stack.

Amanieu (Nov 10 2021 at 17:54):

If we're going to have multi-tiered spilling then something that can be generally helpful is spilling integer registers to FP registers. This is actually recommended by the ARM CPU optimization guides since int <> fp transfers are faster than int <> mem transfers. I believe this is also the case on x86.
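For the integer case, spilling to an FP register means an FMOV pair instead of STR/LDR. A sketch of the instruction choice, with a hypothetical `spill_reload` helper that returns assembly text. It assumes the chosen dN is free at that point; if it is one of d8-d15 it is callee-saved and carries its own save cost. Whether this is actually faster depends on the core, as the next message notes.

```rust
/// Where a 64-bit integer value gets spilled.
#[derive(Clone, Copy)]
enum IntSpill {
    /// Park it in the low 64 bits of FP/SIMD register dN.
    FpReg(u8),
    /// Store it at a byte offset from SP (offset assumed 8-byte aligned).
    Stack(u32),
}

/// Returns the (spill, reload) instructions for integer register xN.
fn spill_reload(x: u8, to: IntSpill) -> (String, String) {
    match to {
        IntSpill::FpReg(d) => (
            format!("fmov d{}, x{}", d, x),
            format!("fmov x{}, d{}", x, d),
        ),
        IntSpill::Stack(off) => (
            format!("str x{}, [sp, #{}]", x, off),
            format!("ldr x{}, [sp, #{}]", x, off),
        ),
    }
}
```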

Sam Parker (Nov 11 2021 at 10:29):

I have tried this previously on older cores and it didn't help, but that doesn't mean we shouldn't try again.

Last updated: Oct 23 2024 at 20:03 UTC