github-actions[bot] commented on issue #3785:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
uweigand commented on issue #3785:
Rebased now that both prerequisite PRs were merged. This is probably still not quite ready to merge, but I'd certainly appreciate any comments!
uweigand commented on issue #3785:
Hi @cfallin, this doesn't implement the
ABISig
refactor yet, it's just a rebase to current mainline.However, this shows the regalloc issue I was mentioning earlier today. You'll notice the patch changes the
reftypes.clif
file, which shows the regalloc regression I've been seeing. Note the one additional move in (new) line 80, and the extra use of call-clobbered register%r10
.
uweigand edited a comment on issue #3785:
Hi @cfallin, this doesn't implement the
ABISig
refactor yet, it's just a rebase to current mainline.However, this shows the regalloc issue I was mentioning earlier today. You'll notice the patch changes the
reftypes.clif
file, which shows the regalloc regression I've been seeing. Note the one additional move in (new) line 80, and the extra use of call-saved register%r10
.
cfallin commented on issue #3785:
@uweigand I spent a few hours digging into exactly why the extra move occurs here, to see if there is anything actionable. To summarize, here is what's going on:
- There is a liverange from the definition of the ref (initially a move-from-preg to pick up the arg value in r3) until its return at the bottom of the function.
- When building this liverange, we add uses that constrain to preg r3, and later that constrain to stack at the safepoint.
- We recognize that this is a conflict, so we split.
- The split nominally happens just before the conflict -- inst4 at the safepoint -- so we get a new LR and bundle starting at inst4 until return at inst12, and the original LR at inst1..inst4.
- But the "spill bundle" mechanism that tries to have a common fallback location between LRs-with-actual-uses chops off the bit of the first bundle after the def, so we get a bundle at inst1..inst1 with preg constraint r3, a backing spill bundle, and a bundle at inst4..inst12 with constraint to stack.
- The "second chance" logic tries to assign a register to every spill bundle before giving up and giving it a stack slot. In this case the spill bundle gets r10, a callee-saved register, because of the call at inst3. So we have the value in r3 initially, then r10 across the call, then on the stack for the explicit stack safepoint.
So how could we address this?
- We could avoid chopping off the trailing bits of LRs when splitting; but this turns out to be a really useful thing in most cases, because it shrinks the post-split pieces to just around their uses and significantly reduces contention. I measured high-single-digits perf improvements from this when I was initially tuning things.
- We could somehow hint that the spill bundle can be a stack location. But this requires looking "down" the split tree from the spillbundle at root to all its pieces, and seeing that they have explicit stack constraints. And it also requires some flow-specific knowledge: that we're flowing into a place with a stack constraint, rather than out of one. And probably would need some weighting mechanism.
- We could let stack-constrained uses live directly on the spillbundle, and then constrain the spillbundle to just a stack slot (no second-chance register choice). This is probably the best option; it doesn't touch the heuristics at all for non-reftype cases. But it would require some careful thought, because currently spillbundles don't carry uses at all.
So there's a plausible path forward (last option above), but it's not an easy fix.
In general, there is going to be "noise" like this with any shift in complex heuristics; I do apologize, and wish we could always avoid regressions like this, but it's also an impossible request (not saying you are making it, but stating this out loud since it's the topic at hand) to always avoid regressions. In general the best we can do is measure and hill-climb toward better performance in aggregate. In any case, I'll file an issue over in regalloc2 to note the above idea, and at some point I might have a slice of time (it would probably take ~a few days to a week) to get to it.
cfallin commented on issue #3785:
OK, so given all the above, I got nerd-sniped for the rest of the afternoon and did bytecodealliance/regalloc2#49. It seems to address the issue above (in fact the code is a little shorter than with regalloc.rs).
uweigand commented on issue #3785:
Hi @cfallin, thanks for the detailed analysis and the quick fix! I can confirm that with regalloc2 0.1.3 the code generated for the reftypes.clif test case is improved, both without this PR and with this PR applied. So this regalloc change looks like a clear improvement to me.
However, I'm still wondering why we are seeing any regalloc differences from this PR - both with and without your regalloc2 change applied. Without the regalloc2 change, we're seeing the regression shown in the PR. But even with the regalloc2 change, we are seeing a change in generated code (not a regression, but still a change):
7,8c7,8 < " bras %r1, 12 ; data %f + 0 ; lg %r5, 0(%r1)", < " basr %r14, %r5", --- > " bras %r1, 12 ; data %f + 0 ; lg %r4, 0(%r1)", > " basr %r14, %r4", 12,13c12,13 < " llcr %r5, %r2", < " chi %r5, 0", --- > " llcr %r3, %r2", > " chi %r3, 0", 28,29c28,29 < " la %r4, 160(%r15)", < " lg %r4, 0(%r4)", --- > " la %r5, 160(%r15)", > " lg %r4, 0(%r5)",
This seems strange given that the code regalloc sees before and after this PR is nearly identical.
Before this PR:Entry block: 0 v131 := v146 v134 := v140 v135 := v139 Block 0: (original IR block: block0) (successor: Block 1) (successor: Block 3) (instruction range: 0 .. 11) Inst 0: lgr %v128, %r2 Inst 1: lgr %v129, %r3 Inst 2: lgr %r2, %v128 Inst 3: bras %r1, 12 ; data %f + 0 ; lg %v147, 0(%r1) Inst 4: basr %r14, %v147 Inst 5: lr %v130, %r2 Inst 6: la %v146, 0(%r15) Inst 7: stg %v128, 0(%v131) Inst 8: llcr %v145, %v130 Inst 9: chi %v145, 0 Inst 10: jgnlh label1 ; jg label3
After this PR:
Entry block: 0 v130 := v148 v131 := v146 v134 := v140 v135 := v139 Block 0: (original IR block: block0) (successor: Block 1) (successor: Block 3) (instruction range: 0 .. 11) Inst 0: lgr %v128, %r2 Inst 1: lgr %v129, %r3 Inst 2: lgr %r2, %v128 Inst 3: bras %r1, 12 ; data %f + 0 ; lg %v147, 0(%r1) Inst 4: basr %r14, %v147 Inst 5: lr %v148, %r2 Inst 6: la %v146, 0(%r15) Inst 7: stg %v128, 0(%v131) Inst 8: llcr %v145, %v130 Inst 9: chi %v145, 0 Inst 10: jgnlh label1 ; jg label3
The only difference is that before the PR, insts 5 and 8 use the same vreg (v130), while after the PR, they use two different vregs (v130 and v148), which are marked as aliases.
If aliased vregs are indeed treated identically by regalloc, why does this change still appear to make a difference?
cfallin commented on issue #3785:
Ah, so I think what is happening is that the aliasing rewrites toward the new vreg, not the old one:
v130
is made an alias ofv148
, so all instances ofv130
becomev148
as seen by regalloc.This means that the order of vregs seen by regalloc2 is not quite the same, so allocation decisions may be made in a slightly different order.
This aligns with what is seen in your diff above, I think: the instruction sequence is exactly the same, but the specific register numbers are different in a few cases.
uweigand commented on issue #3785:
This means that the order of vregs seen by regalloc2 is not quite the same, so allocation decisions may be made in a slightly different order.
I see, that does indeed look like it explains the difference. I think with the regalloc2 patch we should be all good then. Thanks again!
uweigand commented on issue #3785:
Hi @cfallin, this latest version now has the
ABISig
refactoring and a bunch of other general interface cleanups.As far as I am concerned, this should now be ready to be merged, so I'd appreciate another review - any comments welcome!
uweigand commented on issue #3785:
Updated to resolve merge conflicts due to the
uses_defs_clobbers
change.
Last updated: Jan 24 2025 at 00:11 UTC