Stream: cranelift

Topic: RA2 TooManyLiveRegs when I don't think there are.


view this post on Zulip Leaves (Jul 16 2024 at 19:40):

So I believe that there might be a problem when there are exactly enough PREGS in a class for Defs of an instruction, and one of the Defs reuses an input in the same RegClass.

Take the following example where I have defined the semantics required to represent the x86 instruction SBB, which modifies the CF, PF, AF, ZF, SF and OF flags, but also reads the CF flag and uses it in the internal computation.

Below is my definition of the instruction. FlagIn/RegIn values are Early-Uses, FlagOut/RegOut values are LateDefs meaning that their liveness does not overlap.

    sbb_r_rs(
        Data(),
        RegIn(
            (left, OperandConstraint::Reg),
            (right, OperandConstraint::Any)
        ),
        FlagIn(
            (cf_in, OperandConstraint::FixedReg(PREG_CF))     // CF read by the SBB instruction.
        ),
        RegOut(
            (dest, OperandConstraint::Reuse(left.index()))
        ),
        FlagOut(
            (of, OperandConstraint::FixedReg(PREG_OF)),
            (sf, OperandConstraint::FixedReg(PREG_SF)),
            (zf, OperandConstraint::FixedReg(PREG_ZF)),
            (af, OperandConstraint::FixedReg(PREG_AF)),
            (cf, OperandConstraint::Reuse(cf_in.index())),   // Notice the reuse of the input of CF because it must also be in PREG_CF
            (pf, OperandConstraint::FixedReg(PREG_PF))
        )
    ),

The Flag PRegs are in the Vector RegClass, where the Reg PRegs are in the Integer RegClass.

In the Vector(flag) RegClass has 6 total PRegs: PREG_CF, PREG_PF, PREG_AF, PREF_ZF, PREG_SF, PREF_OF. Notice that this is exactly the number of Defs the instruction has.

This looks like it should work to me, there are 6 Defs in the Vector class, 6 available PRegs in the Vector class, one Use from the Vector class, but if that use is needed after this instruction, it could be spilled.

So, why is it when I run RegAlloc2 over this, I get the error TooManyLiveRegs? I don't believe I should be getting this error, unless? Any input is greatly appreciated!

view this post on Zulip Chris Fallin (Jul 16 2024 at 20:17):

RA2 doesn't make a guarantee about "exact fit": because register allocation operates on heuristics, it could be the case that it needs more than the strictly optimal number of registers for your program. (Guaranteeing otherwise is equivalent to providing an exact rather than heuristic solution to the graph coloring problem.)

view this post on Zulip Chris Fallin (Jul 16 2024 at 20:18):

I still feel that modeling individual flags as registers feels... not in line with the way that RA2 is designed to work; I'd encourage you to see if you can find a way to "lift" the handling of flags to a higher level somehow, and generate flag spills/restores (or lazy computation, or ...) from your own instruction-generation logic

view this post on Zulip Leaves (Jul 16 2024 at 21:20):

To the first: But with the constraints I've given it, it is forced to put the flags where I've specified. It cannot put edits into the middle(before the early and late) stages of my instruction. It must satisfy the constraints coming in, that the in_cf(or a copy of it) be in PREG_CF, and that PREG_CF will be something different on the way out. If the value of in_cf lives past this SBB instruction, shouldn't regalloc2 be aware that it cannot store it in a PREG and force a spill of the value, meaning it exists in both a stack slot and PREG_CF going in to the SBB? But really, in_cf is dead after the instruction, so to me it isn't clear why it cannot be allocated for both in_cf and [out_]cf.

To the second: This is ALMOST easy to do... The main issue is that not all instructions touch all flags. One option I pursued was treating the flags as a single register, and merge pseudo-instructions to merge flag outputs from different instructions based on a mask: (Flags1 & Mask1) | (Flags2 & Mask2). But I would still end up using RA2 to model spills of this flag register. This merge pseudo-instruction would also always need to be emitted(probably turned into a function call to do the computation) when any computation that only modifies a strict subset of the flags happens.

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:22):

Yes, you're right, the constraints do admit a solution; you, as a human, can see it. The register allocator is not an exhaustive solver, it's a collection of heuristics and approximation algorithms, so it does not in this case

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:22):

If you're able to make it do so, PRs welcome!

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:23):

The problem that several of us have run into when looking into corner-cases like this is: fix the heuristics for one case, another breaks. We seem to be at a fairly stable "local maximum" unfortunately

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:24):

I'd encourage you to look at how e.g. Valgrind handles flags; iirc it tracks which operands and operator last affected flags, and lazily computes flags when needed

view this post on Zulip Leaves (Jul 16 2024 at 21:38):

Ok so too many live regs isn't really saying it is impossible. It is just a failure of the heuristics to handle the super edge case I've created. The comments in the source somewhat misleading when saying it is "impossible" :eyes:

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:40):

context is: impossible given the current assignment of minimal bundles to registers, not impossible under all possible assignments that the constraints admit

view this post on Zulip Leaves (Jul 16 2024 at 21:40):

:pray: got it, thank you very much!

view this post on Zulip Chris Fallin (Jul 16 2024 at 21:40):

no problem, best of luck!

view this post on Zulip Leaves (Jul 16 2024 at 22:56):

So just to see what happened, I added some more pseudo registers to the Vector(flag) RegClass to temporarily move past that issue and see if I could get it to at least allocate. And now I have something interesting happening.

In the prolog, v2v is defined as an input flag param and is OperandConstraint::Fixed(p0v).

In the body, this value of v2v(in p0v) is copied into p19v for storage in the first edit:
the SBB instruction is emitted, which creates a new value of the carry flag v6v, also fixed in p0v because of the OperandConstraint::Reuse(2), the 2nd operand being SBB's read of the CF value v2v, with constraint OperandConstraint::Fixed(p0v).
At this point after the SBB, p0v contains the value of the carry flag computed by the SBB.
But now, another edit is emitted which moves from p19v into p0v, overwriting the carry flag value computed by the SBB and restoring v2v into p0v for seemingly no reason.

In the epilog, the value of v6v is used, and it is OperandConstraint::Fixed(p0v) so it is going to read from p0v. But v6v won't be in p0v because of the edit above!

So now I'm quite confused, am I looking at a bug? What is going on? Any ideas are appreciated. I feel like despite the nonsense I'm doing with the flags, those edits still should not exist.

The machine instructions I have that I pass to RA2 are pictured below, their operand slices follow them, none of them have any clobbers.

Prolog:
  Inst: reg_params (Def: v0i fixed(p0i)),(Def: v1i fixed(p2i)),
  Inst: flag_params (Def: v2v fixed(p0v)),

Body:
  Edit: p0v to p19v
  Inst: sbb_r_rs (Use: v0i reg),(Use: v1i any),(Use: v2v fixed(p0v)),(Def: v4i reuse(0)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v reuse(2)),(Def: v10v fixed(p2v)),
  Edit: p19v to p0v

Epilog:
  Inst: reg_outputs (Use: v4i fixed(p0i))
  Inst: flag_outputs (Use: v6v fixed(p0v)),(Use: v10v fixed(p2v)),(Use: v7v fixed(p4v)),(Use: v9v fixed(p6v)),(Use: v8v fixed(p7v)),(Use: v5v fixed(p11v))
  Inst: ret

view this post on Zulip Chris Fallin (Jul 16 2024 at 23:06):

Unfortunately I don't have any more spare cycles to read this message in detail and process it or continue this conversation. A few of us nominally maintain RA2 (and I wrote most of it) but all of us are way overloaded with other work right now. PRs always welcome if you're able to help with algorithmic improvements. All the best...

view this post on Zulip Leaves (Jul 16 2024 at 23:08):

Absolutely no worries at all and I very much appreciate your help so far. I WILL eventually solve this problem, and I will report back when I do :salute:

view this post on Zulip Chris Fallin (Jul 16 2024 at 23:08):

Looking forward to it!

view this post on Zulip Leaves (Jul 16 2024 at 23:29):

I'm going to go reading through the RA2 source now. I noticed that if I fix the Integer registers being allocated, it works... The following will work, see the first 3 operands(the only integer operands) are now fixed.

sbb_r_rs (Use: v0i fixed(p0i)),(Use: v1i fixed(p2i)),(Use: v2v fixed(p0v)),(Def: v4i fixed(p0i)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v fixed(p0v)),(Def: v10v fixed(p2v))

However as soon as I leave them to RA2 to decide, notice first 3 operands are reg/any/reuse, not fixed:

sbb_r_rs (Use: v0i reg),(Use: v1i any),(Use: v2v fixed(p0v)),(Def: v4i reuse(0)),(Def: v5v fixed(p11v)),(Def: v8v fixed(p7v)),(Def: v9v fixed(p6v)),(Def: v7v fixed(p4v)),(Def: v6v fixed(p0v)),(Def: v10v fixed(p2v)),Clobbers {}

it panics with ".\src\ion\process.rs:1241:17":"Could not allocate minimal bundle, but the allocation problem should be possible to solve"

:eyes:

view this post on Zulip Leaves (Jul 17 2024 at 02:27):

Looks like I'm tickling: Support instructions that clobber all registers and have non-fixed uses · Issue #145 · bytecodealliance/regalloc2 (github.com) ...

There are currently special cases for when an instruction uses a fixed register and clobbers it, but when an instruction uses an unconstrained register and clobbers all registers there is no specia...

view this post on Zulip Leaves (Jul 17 2024 at 02:32):

But not exactly, seeing as how the all the clobbered/defs are in a separate class.

view this post on Zulip Leaves (Jul 17 2024 at 23:40):

Does RA2 only allow for only a single OperandConstraint::Reuse per instruction? Here is code in liveranges.rs:465 that raises this question:

view this post on Zulip Leaves (Jul 17 2024 at 23:42):

I believe this is the source of the incorrect edit's above, when two OperandConstraint::Reuses were present.

view this post on Zulip Amanieu (Jul 18 2024 at 09:41):

It is supported in that it won't miscompile, but I don't think it handles it well.

view this post on Zulip Amanieu (Jul 18 2024 at 09:42):

That may very well be the issue here.

view this post on Zulip Chris Fallin (Jul 18 2024 at 15:20):

that could be it indeed; it seems worth an experiment (if you're willing) to switch out the Option for a Vec ("small set", no need to reach for a hashset) and do contains later when we check if inputs are reused

view this post on Zulip Leaves (Jul 18 2024 at 16:35):

Yep, will be experimenting with it today. Also curious as to how cranelift handles x86's XCHG with this, being that it would require two OperandConstraint::Reuses to correctly describe. Will look into that as well.

view this post on Zulip Chris Fallin (Jul 18 2024 at 16:38):

We don't emit xchg, so we don't have to solve that problem fortunately!

view this post on Zulip Chris Fallin (Jul 18 2024 at 16:38):

(We only need to support the subset of instructions we actually emit, so we can pretend any other instruction we don't need simply doesn't exist)

view this post on Zulip Leaves (Jul 18 2024 at 16:38):

Well that saves me having to go look, thanks.

view this post on Zulip Leaves (Jul 18 2024 at 16:39):

I wonder if XCHG could help with circular moves in RA2, when a scratch register is used. when R1 and R2 need to swap. Although I would guess this is a rare occurrence?

view this post on Zulip Chris Fallin (Jul 18 2024 at 16:41):

I think there's some discussion of that in one of the issues about the move resolver (or at least, I've discussed this with folks). cycles are pretty rare but you could certainly add a stat and see how rare and estimate what impact it might have

view this post on Zulip Chris Fallin (Jul 18 2024 at 16:42):

internally (in the CPU) moves and xchg's are both handled by move elision at the rename stage and have zero cost in the execution backend, so instruction fetch bandwidth would be the only real effect I suspect (may be a bottleneck in some cases, not always)

view this post on Zulip Leaves (Jul 18 2024 at 16:43):

Might see some slowdowns if it's a move to a stack slot though, as the cpu will lock the cache for that location. So if lots of stack slot loads are happening around there maybe could be worse.

view this post on Zulip Chris Fallin (Jul 18 2024 at 16:44):

oh, yeah, mem/reg xchg's are a different beast entirely

view this post on Zulip Leaves (Jul 18 2024 at 18:18):

Ok so, good news! Changing to allow for proper handling of multiple reuse operands has allowed me to fix my problem.(sort of)

Now, instead of modeling my SBB like this:

    sbb_r_rs(
        Data(),
        RegIn(
            (left, OperandConstraint::Reg),
            (right, OperandConstraint::Any)
        ),
        FlagIn(
            (cf_in, OperandConstraint::FixedReg(PREG_CF))
        ),
        RegOut(
            (dest, OperandConstraint::Reuse(left.index()))
        ),
        FlagOut(
            (of, OperandConstraint::FixedReg(PREG_OF)),
            (sf, OperandConstraint::FixedReg(PREG_SF)),
            (zf, OperandConstraint::FixedReg(PREG_ZF)),
            (af, OperandConstraint::FixedReg(PREG_AF)),
            (cf, OperandConstraint::FixedReg(PREG_CF)),  // Notice this is not reusing, it is just specifying fixed at same register as the in_cf...
            (pf, OperandConstraint::FixedReg(PREG_PF))
        )
    ),

I model it like this, which seems to avoid the problem of TooManyLiveRegs

    sbb_r_rs(
        Data(),
        RegIn(
            (left, OperandConstraint::Reg),
            (right, OperandConstraint::Any)
        ),
        FlagIn(
            (cf_in, OperandConstraint::FixedReg(PREG_CF))
        ),
        RegOut(
            (dest, OperandConstraint::Reuse(left.index()))
        ),
        FlagOut(
            (of, OperandConstraint::FixedReg(PREG_OF)),
            (sf, OperandConstraint::FixedReg(PREG_SF)),
            (zf, OperandConstraint::FixedReg(PREG_ZF)),
            (af, OperandConstraint::FixedReg(PREG_AF)),
            (cf, OperandConstraint::Reuse(cf_in.index())),   //notice now this is a reuse of the fixed reg cf_in.
            (pf, OperandConstraint::FixedReg(PREG_PF))
        )
    ),

view this post on Zulip Leaves (Jul 18 2024 at 18:18):

And this does now allocate :eyes:

view this post on Zulip Chris Fallin (Jul 18 2024 at 18:20):

nice! happy to take a PR if you'd like; our standard for testing is running the ion_checker fuzzer locally for a while (overnight on multiple cores if you're able)

view this post on Zulip Leaves (Jul 18 2024 at 18:21):

I've never actually contributed to open source before... so not very familiar with the procedure, but I'd be happy to run the tests for a while no problem.

view this post on Zulip Chris Fallin (Jul 18 2024 at 18:24):

ah! so if you have a git commit with your change, it should be not so bad: create a fork of the repo on GitHub (there's a button for this), add the git remote and push your branch to your fork, then either use the web UI or the link that the git server returns in a message to create a PR

view this post on Zulip Leaves (Jul 18 2024 at 22:48):

Alright I've got 8 of the ion checker running, I'll let those go for a while. I'll look into creating a PR. :thumbsup:

view this post on Zulip Chris Fallin (Jul 18 2024 at 23:05):

awesome!

view this post on Zulip Leaves (Jul 19 2024 at 16:07):

Ok submitted PR!, commenting on #145 too.

view this post on Zulip T0b1 (Jul 21 2024 at 15:09):

Chris Fallin said:

internally (in the CPU) moves and xchg's are both handled by move elision at the rename stage and have zero cost in the execution backend, so instruction fetch bandwidth would be the only real effect I suspect (may be a bottleneck in some cases, not always)

just as an aside, this is not true for even recent chips, e.g. Alder Lake executes xchg in the backend.
Only Zen started renaming xchgs but with worse latency than moves so three moves is the better codegen in general.

view this post on Zulip Leaves (Jul 21 2024 at 19:28):

Unless to facilitate the 3 moves, you have to spill another value to create a scratch register.(which you will unless you designate a scratch register for regalloc). So only when above a certain number of cyclic moves are happening, such that the equivalent cost of the XCHG instructions is more than the spill load/restore of the value for the scratch register, will the XCHGs be slower I think?

view this post on Zulip T0b1 (Jul 22 2024 at 02:59):

I don‘t know about the design of regalloc2, but if you have a register that‘s currently free, you can just use that. And note that spill+3movs might actually be faster than xchg if it is part of a dependency chain even if you need to restore that value later since that will likely get forwarded.
imo it is just pretty dependent on the actual code (+uarch) which option is faster, most of the time it doesn‘t matter and it doesn‘t happen that often which is why (to the best of my knowledge) most compilers don‘t bother trying to optimize this.
And since CPU vendors optimize for code that‘s already out there it‘s usually best to just stick to what other compilers would generate in similar cases (as I had to learn recently).

Sorry about the long message, I didn’t intend to derail the thread, just note something I myself was surprised about when I first learned that xchg is not renamed.


Last updated: Oct 23 2024 at 20:03 UTC