So I believe that there might be a problem when there are exactly enough PREGS in a class for Defs of an instruction, and one of the Defs reuses an input in the same RegClass.
Take the following example where I have defined the semantics required to represent the x86 instruction SBB, which modifies the CF, PF, AF, ZF, SF and OF flags, but also reads the CF flag and uses it in the internal computation.
Below is my definition of the instruction. FlagIn/RegIn
values are Early-Uses, FlagOut/RegOut
values are LateDefs meaning that their liveness does not overlap.
sbb_r_rs(
Data(),
RegIn(
(left, OperandConstraint::Reg),
(right, OperandConstraint::Any)
),
FlagIn(
(cf_in, OperandConstraint::FixedReg(PREG_CF)) // CF read by the SBB instruction.
),
RegOut(
(dest, OperandConstraint::Reuse(left.index()))
),
FlagOut(
(of, OperandConstraint::FixedReg(PREG_OF)),
(sf, OperandConstraint::FixedReg(PREG_SF)),
(zf, OperandConstraint::FixedReg(PREG_ZF)),
(af, OperandConstraint::FixedReg(PREG_AF)),
(cf, OperandConstraint::Reuse(cf_in.index())), // Notice the reuse of the input of CF because it must also be in PREG_CF
(pf, OperandConstraint::FixedReg(PREG_PF))
)
),
The Flag PRegs are in the Vector RegClass, where the Reg PRegs are in the Integer RegClass.
In the Vector(flag) RegClass has 6 total PRegs: PREG_CF, PREG_PF, PREG_AF, PREF_ZF, PREG_SF, PREF_OF. Notice that this is exactly the number of Defs the instruction has.
This looks like it should work to me, there are 6 Defs in the Vector class, 6 available PRegs in the Vector class, one Use from the Vector class, but if that use is needed after this instruction, it could be spilled.
So, why is it when I run RegAlloc2 over this, I get the error TooManyLiveRegs? I don't believe I should be getting this error, unless? Any input is greatly appreciated!
RA2 doesn't make a guarantee about "exact fit": because register allocation operates on heuristics, it could be the case that it needs more than the strictly optimal number of registers for your program. (Guaranteeing otherwise is equivalent to providing an exact rather than heuristic solution to the graph coloring problem.)
I still feel that modeling individual flags as registers feels... not in line with the way that RA2 is designed to work; I'd encourage you to see if you can find a way to "lift" the handling of flags to a higher level somehow, and generate flag spills/restores (or lazy computation, or ...) from your own instruction-generation logic
To the first: But with the constraints I've given it, it is forced to put the flags where I've specified. It cannot put edits into the middle(before the early and late) stages of my instruction. It must satisfy the constraints coming in, that the in_cf
(or a copy of it) be in PREG_CF, and that PREG_CF will be something different on the way out. If the value of in_cf
lives past this SBB instruction, shouldn't regalloc2 be aware that it cannot store it in a PREG and force a spill of the value, meaning it exists in both a stack slot and PREG_CF going in to the SBB? But really, in_cf
is dead after the instruction, so to me it isn't clear why it cannot be allocated for both in_cf
and [out_]cf
.
To the second: This is ALMOST easy to do... The main issue is that not all instructions touch all flags. One option I pursued was treating the flags as a single register, and merge pseudo-instructions to merge flag outputs from different instructions based on a mask: (Flags1 & Mask1) | (Flags2 & Mask2)
. But I would still end up using RA2 to model spills of this flag register. This merge pseudo-instruction would also always need to be emitted(probably turned into a function call to do the computation) when any computation that only modifies a strict subset of the flags happens.
Yes, you're right, the constraints do admit a solution; you, as a human, can see it. The register allocator is not an exhaustive solver, it's a collection of heuristics and approximation algorithms, so it does not in this case
If you're able to make it do so, PRs welcome!
The problem that several of us have run into when looking into corner-cases like this is: fix the heuristics for one case, another breaks. We seem to be at a fairly stable "local maximum" unfortunately
I'd encourage you to look at how e.g. Valgrind handles flags; iirc it tracks which operands and operator last affected flags, and lazily computes flags when needed
Ok so too many live regs isn't really saying it is impossible. It is just a failure of the heuristics to handle the super edge case I've created. The comments in the source somewhat misleading when saying it is "impossible" :eyes:
context is: impossible given the current assignment of minimal bundles to registers, not impossible under all possible assignments that the constraints admit
:pray: got it, thank you very much!
no problem, best of luck!
So just to see what happened, I added some more pseudo registers to the Vector(flag) RegClass to temporarily move past that issue and see if I could get it to at least allocate. And now I have something interesting happening.
In the prolog, v2v
is defined as an input flag param and is OperandConstraint::Fixed(p0v)
.
In the body, this value of v2v
(in p0v
) is copied into p19v
for storage in the first edit:
the SBB instruction is emitted, which creates a new value of the carry flag v6v
, also fixed in p0v
because of the OperandConstraint::Reuse(2)
, the 2nd operand being SBB's read of the CF value v2v
, with constraint OperandConstraint::Fixed(p0v)
.
At this point after the SBB, p0v
contains the value of the carry flag computed by the SBB.
But now, another edit is emitted which moves from p19v
into p0v
, overwriting the carry flag value computed by the SBB and restoring v2v
into p0v
for seemingly no reason.
In the epilog, the value of v6v
is used, and it is OperandConstraint::Fixed(p0v)
so it is going to read from p0v
. But v6v
won't be in p0v
because of the edit above!
So now I'm quite confused, am I looking at a bug? What is going on? Any ideas are appreciated. I feel like despite the nonsense I'm doing with the flags, those edits still should not exist.
The machine instructions I have that I pass to RA2 are pictured below, their operand slices follow them, none of them have any clobbers.
Prolog:
Inst: reg_params (Def: v0i fixed(p0i)),(Def: v1i fixed(p2i)),
Inst: flag_params (Def: v2v fixed(p0v)),
Body:
Edit: p0v to p19v
Inst: sbb_r_rs (Use: v0i reg),(Use: v1i any),(Use: v2v fixed(p0v)),(Def: v4i reuse(0)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v reuse(2)),(Def: v10v fixed(p2v)),
Edit: p19v to p0v
Epilog:
Inst: reg_outputs (Use: v4i fixed(p0i))
Inst: flag_outputs (Use: v6v fixed(p0v)),(Use: v10v fixed(p2v)),(Use: v7v fixed(p4v)),(Use: v9v fixed(p6v)),(Use: v8v fixed(p7v)),(Use: v5v fixed(p11v))
Inst: ret
Unfortunately I don't have any more spare cycles to read this message in detail and process it or continue this conversation. A few of us nominally maintain RA2 (and I wrote most of it) but all of us are way overloaded with other work right now. PRs always welcome if you're able to help with algorithmic improvements. All the best...
Absolutely no worries at all and I very much appreciate your help so far. I WILL eventually solve this problem, and I will report back when I do :salute:
Looking forward to it!
I'm going to go reading through the RA2 source now. I noticed that if I fix the Integer registers being allocated, it works... The following will work, see the first 3 operands(the only integer operands) are now fixed.
sbb_r_rs (Use: v0i fixed(p0i)),(Use: v1i fixed(p2i)),(Use: v2v fixed(p0v)),(Def: v4i fixed(p0i)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v fixed(p0v)),(Def: v10v fixed(p2v))
However as soon as I leave them to RA2 to decide, notice first 3 operands are reg/any/reuse, not fixed:
sbb_r_rs (Use: v0i reg),(Use: v1i any),(Use: v2v fixed(p0v)),(Def: v4i reuse(0)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v fixed(p0v)),(Def: v10v fixed(p2v)),Clobbers {}
it panics with ".\src\ion\process.rs:1241:17":"Could not allocate minimal bundle, but the allocation problem should be possible to solve"
:eyes:
Looks like I'm tickling: Support instructions that clobber all registers and have non-fixed uses · Issue #145 · bytecodealliance/regalloc2 (github.com) ...
But not exactly, seeing as how the all the clobbered/defs are in a separate class.
Does RA2 only allow for only a single OperandConstraint::Reuse
per instruction? Here is code in liveranges.rs:465 that raises this question:
I believe this is the source of the incorrect edit's above, when two OperandConstraint::Reuse
s were present.
It is supported in that it won't miscompile, but I don't think it handles it well.
That may very well be the issue here.
that could be it indeed; it seems worth an experiment (if you're willing) to switch out the Option
for a Vec
("small set", no need to reach for a hashset) and do contains
later when we check if inputs are reused
Yep, will be experimenting with it today. Also curious as to how cranelift handles x86's XCHG with this, being that it would require two OperandConstraint::Reuse
s to correctly describe. Will look into that as well.
We don't emit xchg, so we don't have to solve that problem fortunately!
(We only need to support the subset of instructions we actually emit, so we can pretend any other instruction we don't need simply doesn't exist)
Well that saves me having to go look, thanks.
I wonder if XCHG could help with circular moves in RA2, when a scratch register is used. when R1 and R2 need to swap. Although I would guess this is a rare occurrence?
I think there's some discussion of that in one of the issues about the move resolver (or at least, I've discussed this with folks). cycles are pretty rare but you could certainly add a stat and see how rare and estimate what impact it might have
internally (in the CPU) moves and xchg's are both handled by move elision at the rename stage and have zero cost in the execution backend, so instruction fetch bandwidth would be the only real effect I suspect (may be a bottleneck in some cases, not always)
Might see some slowdowns if it's a move to a stack slot though, as the cpu will lock the cache for that location. So if lots of stack slot loads are happening around there maybe could be worse.
oh, yeah, mem/reg xchg's are a different beast entirely
Ok so, good news! Changing to allow for proper handling of multiple reuse operands has allowed me to fix my problem.(sort of)
Now, instead of modeling my SBB
like this:
sbb_r_rs(
Data(),
RegIn(
(left, OperandConstraint::Reg),
(right, OperandConstraint::Any)
),
FlagIn(
(cf_in, OperandConstraint::FixedReg(PREG_CF))
),
RegOut(
(dest, OperandConstraint::Reuse(left.index()))
),
FlagOut(
(of, OperandConstraint::FixedReg(PREG_OF)),
(sf, OperandConstraint::FixedReg(PREG_SF)),
(zf, OperandConstraint::FixedReg(PREG_ZF)),
(af, OperandConstraint::FixedReg(PREG_AF)),
(cf, OperandConstraint::FixedReg(PREG_CF)), // Notice this is not reusing, it is just specifying fixed at same register as the in_cf...
(pf, OperandConstraint::FixedReg(PREG_PF))
)
),
I model it like this, which seems to avoid the problem of TooManyLiveRegs
sbb_r_rs(
Data(),
RegIn(
(left, OperandConstraint::Reg),
(right, OperandConstraint::Any)
),
FlagIn(
(cf_in, OperandConstraint::FixedReg(PREG_CF))
),
RegOut(
(dest, OperandConstraint::Reuse(left.index()))
),
FlagOut(
(of, OperandConstraint::FixedReg(PREG_OF)),
(sf, OperandConstraint::FixedReg(PREG_SF)),
(zf, OperandConstraint::FixedReg(PREG_ZF)),
(af, OperandConstraint::FixedReg(PREG_AF)),
(cf, OperandConstraint::Reuse(cf_in.index())), //notice now this is a reuse of the fixed reg cf_in.
(pf, OperandConstraint::FixedReg(PREG_PF))
)
),
And this does now allocate :eyes:
nice! happy to take a PR if you'd like; our standard for testing is running the ion_checker
fuzzer locally for a while (overnight on multiple cores if you're able)
I've never actually contributed to open source before... so not very familiar with the procedure, but I'd be happy to run the tests for a while no problem.
ah! so if you have a git commit with your change, it should be not so bad: create a fork of the repo on GitHub (there's a button for this), add the git remote and push your branch to your fork, then either use the web UI or the link that the git server returns in a message to create a PR
Alright I've got 8 of the ion checker running, I'll let those go for a while. I'll look into creating a PR. :thumbsup:
awesome!
Ok submitted PR!, commenting on #145 too.
Chris Fallin said:
internally (in the CPU) moves and xchg's are both handled by move elision at the rename stage and have zero cost in the execution backend, so instruction fetch bandwidth would be the only real effect I suspect (may be a bottleneck in some cases, not always)
just as an aside, this is not true for even recent chips, e.g. Alder Lake executes xchg in the backend.
Only Zen started renaming xchgs but with worse latency than moves so three moves is the better codegen in general.
Unless to facilitate the 3 moves, you have to spill another value to create a scratch register.(which you will unless you designate a scratch register for regalloc). So only when above a certain number of cyclic moves are happening, such that the equivalent cost of the XCHG instructions is more than the spill load/restore of the value for the scratch register, will the XCHGs be slower I think?
I don‘t know about the design of regalloc2, but if you have a register that‘s currently free, you can just use that. And note that spill+3movs might actually be faster than xchg if it is part of a dependency chain even if you need to restore that value later since that will likely get forwarded.
imo it is just pretty dependent on the actual code (+uarch) which option is faster, most of the time it doesn‘t matter and it doesn‘t happen that often which is why (to the best of my knowledge) most compilers don‘t bother trying to optimize this.
And since CPU vendors optimize for code that‘s already out there it‘s usually best to just stick to what other compilers would generate in similar cases (as I had to learn recently).
Sorry about the long message, I didn’t intend to derail the thread, just note something I myself was surprised about when I first learned that xchg is not renamed.
Last updated: Dec 23 2024 at 12:05 UTC