alexcrichton opened PR #6022 from x64-refactor-extend
to main
:
This commit refactors the
{s,u}extend
instruction lowerings for the x64 backend to share a bit more code and move heavy lifting toextend_to_gpr
which is called from a number of other locations as well. Additionally special cases forpextr{b,w}
are added to this as more special-cases where the zero-extend can be elided due to the automatic zero-extensions of the original instructions.
jameysharp submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
has_type
does a lot of work if you use it here. Inlower
,src
is anInst
, but here it's aValue
. As a result, first there's an implicit conversion to get the instruction that defined this value, thenhas_type
gets the first result of that instruction. But in all of these cases, that's the sameValue
that we started with. So let's usevalue_type
directly to bypass all that.And while the existing comment was true, that we can't do this with an internal extractor today, I think we can do it with an internal constructor instead.
(decl pure partial zeroes_upper (Inst) Unit) (rule (zeroes_upper (iadd _ _)) unit) (rule (zeroes_upper (isub _ _)) unit) ... (rule 1 (extend_to_gpr src @ (value_type $I32) $I64 (ExtendKind.Zero)) (if (zeroes_upper src)) src)
I don't like the name I picked for
zeroes_upper
so please replace that, but I think the overall structure should work.
jameysharp created PR review comment:
In addition to the comments about
uextend
that apply tosextend
, I'd also like to renumber the rule priorities here. Maybe just match the priorities used inuextend
: the$I128
and$I64
cases don't specify a priority and default to 0, and thefits_in_32
case is at priority -1.
jameysharp created PR review comment:
Pre-existing, but would you mind changing
(and val (value_type ty))
toval @ (value_type ty)
while you're here?
jameysharp created PR review comment:
This seems like a nice optimization for
extractlane
but I'm concerned that it may be fragile. Correct me if I'm missing something, but I think it's relying on the lowering rules forextractlane
to produce specific instructions, without having any clear connection between the two sets of rules. Not only are they far removed from each other, but this rule checks that the result typefits_in_16
, while that rule checks that the argument type ismulti_lane 8
ormulti_lane 16
.At minimum I think each place needs a comment referring to the other. Ideally it'd be great to connect them in some way that's harder to get wrong, though I haven't come up with a good way to do that.
jameysharp created PR review comment:
I guess this comment should include
I64
now.Thanks to #6018, I now know that you don't need to check that the source of an extend fits in 64 bits if its destination is
I128
. The CLIF verifier checks that the source is narrower than the destination. So this could just match on(uextend src)
. I'm not fully decided on whether there's value in keeping the more verbose pattern as documentation, but I'm inclined to think that it's more confusing than helpful.Similarly, the
src_ty
constraints in the other twouextend
rules can be removed.This also all applies to
sextend
below.In addition, I can't select the next few lines to comment on them but I'd prefer the
$I64
case to be at the same priority as$I128
. For the moment the(fits_in_32 dst_ty)
case still has to be at a different priority (either higher or lower, doesn't matter much) but I'm hoping to fix that someday.
alexcrichton updated PR #6022 from x64-refactor-extend
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Oh nice suggestion! IIRC the desire was to reuse the predicate "does this zero the upper bits" in other instructions, but I at least don't personally remember the context of if there's other locations to use this predicate. In any case I've refactored it out to avoid some extra boilerplate.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
True!
One way this could be tackled is to just go ahead and emit the instruction right here in the helper. For the case of
extractlane
that's pretty easy since it's a single instruction and theextractlane
rules themselves aren't fancy. That's still somewhat brittle though as it doesn't handle evolution ofextractlane
over time (if at all).Perhapes more worrisome, though, is the preexisting cases for dropping a
uextend
. The pattern matches for 32-bit I think have the same trait where they don't necessarily guarantee a particular instruction is being used, we're just relying on things being implemented in a particular way. That part I don't know how best to solve because I definitely wouldn't want to inline all theiadd
rules here, for example.Do you think it's worthwhile to comment all of these though? While perhaps exhaustive it may be getting a bit out of hand since a good handful of the binops span multiple terminals of rules and if the purpose of the comment is to say "please always use the
sub
intruction here or something else that zeros the upper 32-bits foruextend
elision" then it's easy to miss that if it's only located at the top of the rules.I suppose an even crazier strategy would be to postprocess the instructions after lowering, but that naturally has no existing infrastructure and is almost certainly overkill for this one use case.
jameysharp submitted PR review.
jameysharp created PR review comment:
Oh, I guess I assumed that we could pretty well count on any implementation of e.g. 32-bit add having this behavior on x86. I thought of
pextrb
as a little more magical because it zero-extends to a narrower width.I'm not saying this is a good plan, but I guess another approach would be to scatter the individual
extend_to_gpr
andvalue32_zeros_upper32
rules out to place them next to thelower
rules that they rely on. Actually, I don't entirely hate that.We don't necessarily have to solve this in this PR, I'm just a little nervous about it long-term. I am curious if @cfallin or others have any other suggestions for this.
cfallin submitted PR review.
cfallin created PR review comment:
Yeah, unfortunately this is the status-quo for
iadd
and friends too. Placing rules that are intertwined in this way (producer that zeroes upper bits, consumer that relies on it) next to each other is not a bad idea; or at least, we could add comments nothing the implicit extra invariant everywhere it is either enforced or relied upon.At various points we've talked about "demanded-bits" and "known-bits" optimizations that could solve this sort of issue at the lowering-framework level: for example, in the upward scan during isel, we could record "requires zero-extended value" when we see a uextend and otherwise pass through the source register, then generate the extend after the producer if any consumer needed it. That needs a bit more thought to refine the right abstractions though.
We could also optimize VCode; at various points we've talked about post-lowering GVN, cprop, etc., and eliding redundant uextends could be a part of that. But again, much more plumbing needed.
Finally, this is exactly the sort of invariant that the verification project encodes in annotations, and checks on both ends. So I'm optimistic that eventually we'll be in a situation where it's actually enforced, and we can sleep easy at night. In the meantime IMHO it's probably enough to rely on documentation via comments and/or placing intertwined rules together.
cfallin edited PR review comment.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I should note that the
pextrb
documentation I'm pretty sure indicates that it zero-extends to the whole 64-bit register, but I'm not certain enough to add the rule just yet and otherwise trying to write manual tests for this is a bit of a pain. I'm also mostly interested in this from a wasm perspective wherei8x16.extract_lane_u
for example unconditionally does auextend
toi32
to obey wasm semantics but theuextend
can be elided in all cases for the x86 instruction (and probably other platforms too which I should look into).One point of hesitation I would have to scatter the
(rule ...)
s to various places would be that priorities are needed for the(rule ...)
annotations and it's easy enough to reason about priorities when they're written in one place but having to group them from lots of different places could make things difficult.Given though that the main motivation is the wasm lowering here, I could also just add this one specific rule for
uextend
around the place thatextractlane
is codegen'd since they'd both use the singlepextr{b,w}
instruction.I'm still not sure how best to resolve this, but if you're mainly worried about the
extractlane
cases being the odd ones out, how about I add comments to the(rule)
with extractlane here and comments on the lowering ofextractlane
as well pointing to each other?
jameysharp submitted PR review.
jameysharp created PR review comment:
I think adding comments just for extractlane is plenty for this PR. Thanks!
I agree with your concerns about scattering rules around that rely on priorities. I think we can deal with that, and also I think all our options are bad in different ways, which is why I said I don't hate that option. But in any case it doesn't have to happen right now.
alexcrichton updated PR #6022 from x64-refactor-extend
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ok I have added some more words to help connect the dots here :+1:
jameysharp submitted PR review.
jameysharp created PR review comment:
Perfect, thank you!
jameysharp submitted PR review.
alexcrichton merged PR #6022.
Last updated: Jan 24 2025 at 00:11 UTC