Hi all,
I've been toying with regalloc2 for some local dev but I've ran into a bit of a snag. I have the following instruction pseudocode that I'd like to register allocate using regalloc2 (I'm aware that it's not SSA, but I run into the same exact problem if I model this using block parameters instead):
block0:
mov r0, 1
jz r0, block1, block2
block1:
mov r1, 10
jmp block3
block2:
mov r1, 20
jmp block3
block3:
exit r1
Now, the odditiy of the architecture that I'd like to compile to is that both jz
and jmp
clobber a set of registers.
The above pseudocode is fed to regalloc2 as follows (format is insn id | insn | operands | clobbers
):
== block 0; preds = [], succs = [Block(1), Block(2)] | params = [] ==
insn 0 | MovImm64(r0, 0) | [Def: v0i reg] | []
insn 1 | Jz(r0, BlockId(1), BlockId(2)) | [Use: v0i reg] | ["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
== block 1; preds = [Block(0)], succs = [Block(3)] | params = [] ==
insn 2 | MovImm64(r3, 10) | [Def: v3i reg] | []
insn 3 | Jmp(BlockId(3)) | [] | ["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
== block 2; preds = [Block(0)], succs = [Block(3)] | params = [] ==
insn 4 | MovImm64(r3, 20) | [Def: v3i reg] | []
insn 5 | Jmp(BlockId(3)) | [] | ["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
== block 3; preds = [Block(1), Block(2)], succs = [] | params = [] ==
insn 6 | Exit(r3) | [Use: v3i fixed(p10i)] | []
RA2 successfully computes a result for this, but it's entirely wrong! It outputs the following InstOrEdit trace:
== block 0 ==
MovImm64(r0, 0) [p0i]
Jz(r0, BlockId(1), BlockId(2)) [p0i]
== block 1 ==
MovImm64(r3, 10) [p10i]
Move { from: p10i, to: p4i }
Jmp(BlockId(3)) []
== block 2 ==
MovImm64(r3, 20) [p10i]
Move { from: p10i, to: p4i }
Jmp(BlockId(3)) []
== block 3 ==
Exit(r3) [p10i]
Here, p0i
is RAX
, p4i
is RSI
, and p10i
is R10
. Effectively, it's output the following assembly:
block0:
mov rax, 1
jz rax, block1, block2
block1:
mov r10, 10
mov rsi, r10
jmp block3
block2:
mov r10, 20
mov rsi, r10
jmp block3
block3:
exit r10
It's assigned the r10
register to the 10/20
constants, because exit
wants its argument to be in that physical register specifically. The main problem here of course is that the jmp
instructions will clobber r10
, so the final call to exit
will contain garbage. RA2 seemingly is aware of this, because it attempts to back up the value into RSI (which isn't clobbered) first. But then it proceeds to never restore r10
with mov r10, rsi
after the branch?
(Additionally, the choice of the initial mov r10, [10/20]
is weird too. Why doesn't it immediately assign RSI
to these cases, so it doesn't have to move around immediately after?)
At first, I thought this was an issue with my input not being in SSA form. However, restructuring this so that block3 has a block parameter and assigning that appropriately results in exactly the same broken code.
I suspect that the issue has something to do with RA2 not properly considering clobbers if the instruction doing the clobbering is a branch/block terminator. To illustrate, I added a simple NopClobber instruction to my ISA that does nothing except clobber. I placed this after each branch (i.e., its the first instruction in any block with a nonzero amount of predecessors). Re-running RA2 on that input results in the following assignment which, while suboptimal, at least is correct:
block0:
mov rax, 1
jz rax, block1, block2
block1:
nop-clobber
mov r10, 10
mov rdi, r10
jmp block3
block2:
nop-clobber
mov r10, 20
mov rdi, r10
jmp block3
block3:
nop-clobber
mov r10, rdi
exit r10
This workaround is fine for now, but it results in a suboptimal allocation and is probably indicative of _some_ issue in RA2. Figured I'd reach out and poke you, to see if this is something that RA2 should/could be handling better.
Ah, what's going on here is that the blockparam assignment causes edge-moves to be generated, and edge-moves go before the predecessor block's jump because there are multiple preds (block1 and block2) for one successor (block3)
In general, branches are special -- there are some restrictions on how they can use parameters as well -- because of the need to insert these moves that logically happen only on the edge actually before the branch over that edge
The bug honestly is that we don't error out when branches have clobbers specified...
To address this by making the allocator more flexible (actually fully supporting clobbers and all kinds of arguments on branches) would be quite nontrivial because it would mean we actually need to split the edge -- something we don't do currently (the CFG is immutable) -- or else require an empty block on every edge, which would be quite expensive
so RA2 was designed with the tradeoff that branches are limited, which worked for the real ISAs we were targeting. Could you say more about why branches have clobbers (maybe there's some other solution we can think of)?
Fair enough. I can emulate clobbers on branches through the nop-clobber trick so losing that isn't too bad for my usecases.
Am I correct in thinking that RA2 does fixed-register choices before it attempts to compensate for clobbers? That'd explain the suboptimal choice for it to first move the constants into r10 (the fixed register for the exit insn), then work around the clobber that it encounters with the swap in/out of rdi.
My branch instructions cause clobbers because the final assembly also does a function call before the branch for instrumentation purposes, and this function call ends up doing the actual branch, instead of the real x86 instruction (you'll notice that the clobber set in my trace above is exactly the same as the set of caller-save registers on Win64). The reason for not having some kind of Insn::InstrumentCall
before the branches is that in the case of Jz
I don't want RA2 to spend efforts on ensuring that the argument isn't clobbered when it doesn't suffer from the clobbering (the clobbering happens after the conditional register has been read)
I suppose the alternative would be to just have the instrumentation back up all registers (i.e. a non-clobbering call) so that I don't have to worry about clobbers in edges at all.
@Thijs Molendijk You can put clobbers on branches if the destination block has more than 1 predecessor. If the destination block only has 1 predecessor then you need to use the nop clobber trick.
Am I correct in thinking that RA2 does fixed-register choices before it attempts to compensate for clobbers?
No, this isn't how it works: RA2 collects all requirements, including fixed-register constraints and clobber conflicts, and handles them all together in the main allocation loop. If you had put the same constraints and clobbers on a regular instruction in straight line code, you'd see it work properly. The only issue is with branches, as I mentioned; branches are special because they can come after an edge move that they logically should come before (so their uses are heavily restricted).
Just to confirm, I'm not allowed to have a block that contains a branching instruction that isn't the final instruction, right? I just ran into some oddities where regallocs were weird/incorrect when I had a block that ended with jz x, a; jmp b
, but I presume I need to model that as two blocks, one direct branch and one implicit fallthrough?
(if that's the case, it'd probably also be good to add an assertion/check for that)
Hm, nevermind. That still results in a bad allocation. Let me reduce it and see if I can find what is causing it
RA2 itself doesn't care about whether the last 1 or last N instructions are branches; it gets its canonical CFG from what you provide re: successor info
Hm okay, I ran into another bug, but I'm not sure where. I have the following input:
== block 0 | pred = [] | succ = [Block(1), Block(2)] ==
insn 0 | MovImm64(r1, 2147483647) | operands=[Def: v1i reg] | clobbers=[]
insn 1 | Syscall(r2, ...) | operands=[Def: v2i fixed(p0i), Use: v1i fixed(p9i)] | clobbers=["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
insn 2 | Jz(r2, BlockId(1)) | operands=[Use@Late: v2i reg, Def: v5i fixed(p3i)] | clobbers=[]
insn 3 | Jmp(BlockId(2)) | operands=[] | clobbers=[]
== block 1 | pred = [Block(0)] | succ = [] ==
insn 4 | NopClobber | operands=[] | clobbers=["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
insn 5 | Exit(r2) | operands=[Use: v2i fixed(p3i)] | clobbers=[]
== block 2 | pred = [Block(0)] | succ = [] ==
insn 6 | NopClobber | operands=[] | clobbers=["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"]
insn 7 | Exit(r2) | operands=[Use: v2i fixed(p3i)] | clobbers=[]
This is roughly equivalent to the following assembly:
block0:
mov r1, 2147483647
syscall r2, [r1]
jz r2, block1
jmp block2
block1:
exit r2
block2:
exit r2
There are only clobbers on non-branch instructions here, so we should be fine w.r.t. avoiding the clobbering issues described earlier. I've automatically inserted a NopClobber in both branches, even if it might not be necessary (since they both only have a single predecessor).
Running RA2 on this errors with "Could not allocate minimal bundle, but the allocation problem should be possible to solve" (https://github.com/bytecodealliance/regalloc2/blob/main/src/ion/process.rs#L1266).
Now, if I drop the RAX clobber from the syscall instruction, it successfully completes the allocation with a proper working allocation:
== block 0 | pred = [] | succ = [Block(1), Block(2)] ==
insn 0 | MovImm64(r1, 2147483647) | operands=[Def: v1i reg] | clobbers=[] | allocs=[p9i]
insn 1 | Syscall(r2, ...) | operands=[Def: v2i fixed(p0i), Use: v1i fixed(p9i)] | clobbers=["RCX", "RDX", "R8", "R9", "R10", "R11"] | allocs=[p0i, p9i]
insn 2 | Jz(r2, BlockId(1)) | operands=[Use@Late: v2i reg, Def: v5i fixed(p3i)] | clobbers=[] | allocs=[p0i, p3i]
Move { from: p0i, to: p4i }
insn 3 | Jmp(BlockId(2)) | operands=[] | clobbers=[] | allocs=[]
== block 1 | pred = [Block(0)] | succ = [] ==
insn 4 | NopClobber | operands=[] | clobbers=["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"] | allocs=[]
Move { from: p4i, to: p3i }
insn 5 | Exit(r2) | operands=[Use: v2i fixed(p3i)] | clobbers=[] | allocs=[p3i]
== block 2 | pred = [Block(0)] | succ = [] ==
insn 6 | NopClobber | operands=[] | clobbers=["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"] | allocs=[]
Move { from: p4i, to: p3i }
insn 7 | Exit(r2) | operands=[Use: v2i fixed(p3i)] | clobbers=[] | allocs=[p3i]
(well, it's not actually working but that's the result of the NopClobber hack; I'm lacking a NopClobber after the Jz if branch is not taken, so RA2 only moves RAX into RDI after Jz, at which point RAX is already clobbered. But that's my fault, not RA2's).
The docs for Program::inst_clobbers say "Note that it is legal for a register to be both a clobber and an actual def (via pinned vreg or via operand constrained to the reg). This is for convenience: e.g., a call instruction might have a constant clobber set determined by the ABI, but some of those clobbered registers are sometimes return value(s).". The behavior above seems to contradict this.
Unfortunately I've run out of spare cycles for support here; I'd encourage you to look at how Cranelift uses RA2 and continue to debug, and I may be able to return later
No worries, thanks for the help. I'll file an issue on the repo if nothing else, perhaps I'll have time myself to look into it deeper.
Last updated: Nov 22 2024 at 16:03 UTC