Hi,
I'm currently thinking about how we can add SVE support, and I'm wondering how this could work with the current register allocator. My issue is that the bottom 128 bits of an SVE 'Z' register alias a 'V' register and, from my basic understanding, this currently wouldn't be supported. Also, do I remember hearing that another regalloc is in the works?
The register allocator doesn't need to track 'V' and 'Z' registers separately: if an SVE instruction is using a 'Z' register then the corresponding 'V' register can't be used, and vice-versa.
Hi @Sam Parker -- agree with what Amanieu said in this case; the unit of allocation is just a "V or Z register" since (if I understand correctly) they're 1-to-1 (?). In other cases where non-1-to-1 overlaps are needed, e.g. ARM32 with 64-bit float d0 aliasing s0 and s1, we do need a solution, but haven't had the resources to develop one yet.
Re: new regalloc, yes, that's regalloc2; it doesn't solve this problem either, yet, but conceivably it could, with some work, by adding subunits to the commitment map or something like that. (RA2 is currently stalled on a licensing issue before we can release it, and the compatibility shim that allows Cranelift to use it via the current regalloc.rs API is also pending review still due to limited resources, but I'm hoping to nudge things forward...)
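To make the "subunits" idea concrete, here is a minimal sketch of tracking overlapping registers by the 32-bit units they occupy, using the ARM32 d0/s0/s1 case mentioned above. The names `FpReg` and `UnitMap` are hypothetical and are not part of regalloc.rs or regalloc2; the only fact assumed is that d<n> overlaps s<2n> and s<2n+1> for d0 through d15.

```rust
/// Hypothetical model of ARM32 VFP aliasing: d<n> overlaps s<2n> and s<2n+1>.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FpReg {
    S(u8), // s0..s31
    D(u8), // d0..d15 (the d registers that alias s registers)
}

impl FpReg {
    /// Bitmask of the 32-bit units this register occupies (bit i stands for s<i>).
    fn units(self) -> u32 {
        match self {
            FpReg::S(n) => {
                assert!(n < 32, "s registers are s0..s31");
                1u32 << n
            }
            FpReg::D(n) => {
                assert!(n < 16, "only d0..d15 alias s registers");
                0b11u32 << (2 * n)
            }
        }
    }
}

/// Tracks which units are currently committed to some value.
#[derive(Debug, Default)]
struct UnitMap {
    busy: u32,
}

impl UnitMap {
    /// Commits `reg` if none of its units are in use; returns whether it succeeded.
    fn try_alloc(&mut self, reg: FpReg) -> bool {
        let units = reg.units();
        if self.busy & units != 0 {
            return false;
        }
        self.busy |= units;
        true
    }

    /// Releases all units occupied by `reg`.
    fn free(&mut self, reg: FpReg) {
        self.busy &= !reg.units();
    }
}
```

With this model, allocating `FpReg::D(0)` makes both `FpReg::S(0)` and `FpReg::S(1)` unavailable until it is freed, while `FpReg::S(2)` stays allocatable.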
Okay, thanks. So how is the aliasing handled, could you point me to an example? My reasoning for thinking this wouldn't work is because currently the AArch64 backend only has two disjoint register classes, and I will need to add a third because the scalable type is wildly different from the existing V128 class. So I need to somehow map 'distinct' registers to each other, which is different from H,S,D and Q 'regs' being the same register.
@Sam Parker is my understanding correct that register Zn aliases Vn (and only that)? If so, I think we can just use the existing register class; allocating, say, v13 means you can either use v13 or z13 at your leisure. Or is the aliasing more complex than that?
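A minimal sketch of that idea, assuming the 1-to-1 aliasing holds: the allocator hands out one hardware index, and the instruction being emitted decides whether to print it as a V or a Z register. `VecReg` and `VecView` are hypothetical names for illustration, not Cranelift types.

```rust
/// Which architectural view of the register an instruction uses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VecView {
    Neon, // fixed 128-bit view, printed as v<n>
    Sve,  // scalable view, printed as z<n>
}

/// One allocated vector register, identified only by its hardware index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct VecReg(u8);

impl VecReg {
    /// Accepts indices 0..=31, matching v0..v31 and z0..z31.
    fn new(index: u8) -> Option<VecReg> {
        if index < 32 {
            Some(VecReg(index))
        } else {
            None
        }
    }

    /// The same allocation can be named through either view.
    fn name(self, view: VecView) -> String {
        match view {
            VecView::Neon => format!("v{}", self.0),
            VecView::Sve => format!("z{}", self.0),
        }
    }
}
```

Because both names come from the same `VecReg`, there is nothing for the allocator to track beyond the index itself.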
Reading this intro I see that there are also predicate registers p0 through p15; those indeed would need a separate register class. I haven't looked at the relevant bits of regalloc.rs in long enough to remember details, but adding to the RegClass enum and seeing what breaks might give a reasonable indication of how much work is required. Alternately, depending on how hacky you're feeling, you could just reuse an unused class (say, RegClass::I32) for the predicate registers; the allocator doesn't actually care what the classes are, just that they're different bins of resources.
Ah, and before I forget, the ABI code definitely has some per-reg-class behavior in the prologue/epilogue generation and for the clobber list on calls; that'll need updates too, I think.
@Chris Fallin Yes, this is my main concern. I'm assuming we're going to need to make some significant changes around the frame layout because we'll now have a register whose size is implementation-defined (known only at runtime) to store/restore on the stack. The ABI doesn't seem to suggest that we can rely on accurate type information being passed, so I'm assuming we need to be able to fall back safely.
Ah! OK, I understand better now what's going on. In the current state of things, the Option<Type> should always be a Some I think, but that's not guaranteed and in fact not the case moving forward as regalloc2 handles all moves in a consolidated way and forgets type info before spill generation (or more precisely, can share a reg between two different types that live in the same class depending on path, so would need a lattice to merge types).
So we need a better answer for this, or else we need separate reg classes; agreed
I think separate reg classes are the cleaner answer at the API/conceptual level (at the boundary between regalloc and lowering) so I think I'd prefer to go that way, rather than try to preserve type info, which can be fragile and is catastrophic if wrong (see e.g. the CVE last April). Aliasing classes should be possible in regalloc2; in regalloc.rs it's dicier, so maybe rely on the type info just for prototyping
are the regs caller- or callee-saved, out of curiosity?
The answer is actually complicated.
It depends on whether the function has a parameter that is either a scalable vector or predicate register or whether it returns one.
If the answer to both questions is no, then the function can be treated as if it has no awareness of SVE at all, at least with respect to ABI issues.
That is, only d8 - d15 are callee-saved.
Otherwise z8 - z23 and p4 - p15 are callee-saved.
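The rule described above can be sketched as a small function over a function signature. This is only an illustration of the decision, covering vector and predicate registers and ignoring integer registers; `ArgKind`, `Signature` and `VectorCalleeSaves` are hypothetical names, not Cranelift's ABI types.

```rust
/// Coarse classification of a parameter or return value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgKind {
    Scalar,
    FixedVector,
    ScalableVector,
    Predicate,
}

/// Hypothetical signature: just the kinds of parameters and return values.
struct Signature {
    params: Vec<ArgKind>,
    returns: Vec<ArgKind>,
}

/// Which vector/predicate registers the callee must preserve.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VectorCalleeSaves {
    LowHalves, // d8..d15 only
    SveFull,   // z8..z23 and p4..p15
}

/// True if any parameter or return value is a scalable vector or a predicate.
fn uses_sve_state(sig: &Signature) -> bool {
    sig.params
        .iter()
        .chain(sig.returns.iter())
        .any(|k| matches!(k, ArgKind::ScalableVector | ArgKind::Predicate))
}

fn vector_callee_saves(sig: &Signature) -> VectorCalleeSaves {
    if uses_sve_state(sig) {
        VectorCalleeSaves::SveFull
    } else {
        VectorCalleeSaves::LowHalves
    }
}

/// Expands the choice into register names, for inspection.
fn callee_saved_names(kind: VectorCalleeSaves) -> Vec<String> {
    match kind {
        VectorCalleeSaves::LowHalves => (8..=15).map(|n| format!("d{}", n)).collect(),
        VectorCalleeSaves::SveFull => (8..=23)
            .map(|n| format!("z{}", n))
            .chain((4..=15).map(|n| format!("p{}", n)))
            .collect(),
    }
}
```

A signature with only scalar or fixed-width vector values yields `LowHalves`; adding a single scalable vector or predicate, as a parameter or as a return value, switches the whole function to `SveFull`.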
There is a further complication - not all predicate registers are equivalent.
Only the first 8 can be used to control an operation, i.e. as a governing predicate.
This is very interesting indeed! The last constraint sounds like a bit of a challenge; I imagine some lowerings will require a first-8 pred reg and others can take any? So that's two overlapping classes as well
So, we may actually need 2 register classes for predicate registers - one for governing predicates and another for the rest.
A simpler option is to just forget about the existence of p8 - p15.
that would do for a prototype, yep
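The two options for predicate registers discussed above (two overlapping classes, or ignoring p8 - p15 in a prototype) can be sketched as follows. `PredClass` and the `ignore_high` flag are hypothetical and only illustrate which registers each class would offer to the allocator.

```rust
/// What a lowering requires of a predicate operand.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PredClass {
    Governing, // must be usable to control an operation: p0..p7
    Any,       // any predicate register: p0..p15
}

/// True if p<index> can be used as a governing predicate.
fn can_govern(index: u8) -> bool {
    index < 8
}

/// Register indices the allocator may choose from for a given class.
/// With `ignore_high` set (the prototype shortcut), p8..p15 are never offered.
fn allocatable_preds(class: PredClass, ignore_high: bool) -> Vec<u8> {
    match (class, ignore_high) {
        (PredClass::Governing, _) | (PredClass::Any, true) => (0..8).collect(),
        (PredClass::Any, false) => (0..16).collect(),
    }
}
```

With `ignore_high` set, the two classes collapse into one and the overlap problem goes away, at the cost of never using half of the predicate file.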
next question -- can predicates be spilled relatively cheaply or is that more like a "materialize the flags by transferring to an int reg" operation that's costly?
No need to do anything like that.
ok, so it sounds like we can just treat this like a separate class; so the hardest part is knowing how wide one needs to spill for vector regs
The only issue is that there aren't any auto-decrementing (or incrementing) addressing modes.
Ah, good to know; relevant for prologue code I guess
However, there are instructions that increment or decrement by the size of a scalable vector or predicate register.
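As a rough illustration of what this could mean for prologue code, the sketch below builds a save sequence for a run of Z registers: one stack adjustment by a multiple of the vector length, followed by stores at vector-length-scaled offsets. It assumes the SVE `addvl` instruction and the `mul vl` addressing form of `str`; the helper `save_z_regs` is hypothetical, it only formats text, and the output has not been checked against an assembler.

```rust
/// Builds an illustrative prologue fragment that saves `count` consecutive
/// Z registers starting at z<first>. Returns assembly-like text lines.
fn save_z_regs(first: u8, count: u8) -> Vec<String> {
    // Assumption: the addvl immediate covers -32..=31, so at most 32 slots
    // can be reserved with a single instruction.
    assert!(count >= 1 && count <= 32, "count must be in 1..=32");
    assert!(first as u32 + count as u32 <= 32, "registers are z0..z31");

    let mut lines = Vec::with_capacity(count as usize + 1);
    // No auto-decrementing store: adjust sp once, by `count` vector lengths.
    lines.push(format!("addvl sp, sp, #-{}", count));
    // Each slot is addressed as a multiple of the (runtime) vector length.
    for i in 0..count {
        lines.push(format!("str z{}, [sp, #{}, mul vl]", first + i, i));
    }
    lines
}
```

The point of the shape is that nothing in the sequence needs the vector length as a compile-time constant; every offset is expressed in units of it.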
@Sam Parker so it sounds like the simplest design for a prototype would be (i) one reg class for both Vn and Zn regs, (ii) separate reg class for Pn regs, (iii) use the type to know how much of a Vn/Zn reg to spill, as that's reliable for now; and (iv) we can figure out overlapping classes in due time. Sounds reasonable?
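Point (iii) could look roughly like the sketch below: the spill width is chosen from the value's type, with the scalable cases depending on a vector length that is only known at runtime. `SpillType` and `spill_size_bytes` are hypothetical names; the sketch assumes an SVE vector length that is a multiple of 128 bits up to 2048 bits, and a predicate register holding one bit per vector byte.

```rust
/// Hypothetical set of types that can live in the vector or predicate classes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SpillType {
    F64,
    V128,
    ScalableVector,
    Predicate,
}

/// SVE vector lengths are multiples of 16 bytes, from 16 up to 256 bytes.
fn is_valid_vl_bytes(vl_bytes: u32) -> bool {
    vl_bytes % 16 == 0 && (16..=256).contains(&vl_bytes)
}

/// Bytes needed to spill a value of type `ty`, given the runtime vector
/// length in bytes. Returns None if the vector length is not a legal one.
fn spill_size_bytes(ty: SpillType, vl_bytes: u32) -> Option<u32> {
    if !is_valid_vl_bytes(vl_bytes) {
        return None;
    }
    Some(match ty {
        SpillType::F64 => 8,
        SpillType::V128 => 16,
        SpillType::ScalableVector => vl_bytes,
        // One predicate bit per vector byte.
        SpillType::Predicate => vl_bytes / 8,
    })
}
```

If the type were missing or wrong, the only safe fallback in this model would be to treat every vector-class value as `ScalableVector`, which is the fragility discussed earlier in the thread.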
Coming back to the predicate registers - one feature that would be nice if the register allocator supports it is if it could "spill" a governing predicate register to another predicate register that can't be a governing predicate.
@Chris Fallin Yes, this sounds like the only option really, and not too bad if we have reliable type info. Thanks!
@Anton Kirilov that's an interesting idea, sort of like a multi-tiered spill -- "this class is best, if not then that class, if not then spill to stack"
needs some thought :-)
That's pretty much the use case for the last 8 predicate registers - storage space.
As I said, we could simply forget about their existence in an initial implementation, but then the cost would be potentially more spilling to the stack.
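A minimal sketch of the multi-tiered idea for predicates: a value evicted from a governing register is parked in p8 - p15 while one is free, and goes to a stack slot only when those are exhausted. `PredFile` and `Home` are hypothetical, and the policy is deliberately naive (no liveness, no cost model); it is not how regalloc.rs or regalloc2 work today.

```rust
/// Where an evicted predicate value ends up.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Home {
    Storage(u8), // one of p8..p15, cheap to move back from
    Stack(u32),  // stack slot index, the last resort
}

/// Free lists for the two tiers of predicate registers.
struct PredFile {
    free_governing: Vec<u8>,
    free_storage: Vec<u8>,
    next_slot: u32,
}

impl PredFile {
    fn new() -> PredFile {
        PredFile {
            // Reversed so that pop() hands out the lowest index first.
            free_governing: (0..8).rev().collect(),
            free_storage: (8..16).rev().collect(),
            next_slot: 0,
        }
    }

    /// Tier 1: a register usable as a governing predicate, if any is free.
    fn alloc_governing(&mut self) -> Option<u8> {
        self.free_governing.pop()
    }

    /// Tiers 2 and 3: prefer a storage-only predicate register, else the stack.
    fn spill_target(&mut self) -> Home {
        match self.free_storage.pop() {
            Some(p) => Home::Storage(p),
            None => {
                let slot = self.next_slot;
                self.next_slot += 1;
                Home::Stack(slot)
            }
        }
    }
}
```

Ignoring p8 - p15 in an initial implementation corresponds to starting with an empty `free_storage` list, in which case every eviction goes straight to the stack.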
If we're going to have multi-tiered spilling then something that can be generally helpful is spilling integer registers to FP registers. This is actually recommended by the ARM CPU optimization guides since int <> fp transfers are faster than int <> mem transfers. I believe this is also the case on x86.
Though I have tried this previously, on older cores, and it didn't help - but that doesn't mean we shouldn't try again.
Last updated: Oct 23 2024 at 20:03 UTC