Stream: git-wasmtime

Topic: wasmtime / PR #6022 x64: Refactor and add extractlane spe...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 14 2023 at 23:58):

alexcrichton opened PR #6022 from x64-refactor-extend to main:

This commit refactors the {s,u}extend instruction lowerings for the x64 backend to share a bit more code and move heavy lifting to extend_to_gpr which is called from a number of other locations as well. Additionally special cases for pextr{b,w} are added to this as more special-cases where the zero-extend can be elided due to the automatic zero-extensions of the original instructions.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp created PR review comment:

has_type does a lot of work if you use it here. In lower, src is an Inst, but here it's a Value. As a result, first there's an implicit conversion to get the instruction that defined this value, then has_type gets the first result of that instruction. But in all of these cases, that's the same Value that we started with. So let's use value_type directly to bypass all that.

And while the existing comment was true, that we can't do this with an internal extractor today, I think we can do it with an internal constructor instead.

(decl pure partial zeroes_upper (Inst) Unit)
(rule (zeroes_upper (iadd _ _)) unit)
(rule (zeroes_upper (isub _ _)) unit)
...

(rule 1 (extend_to_gpr src @ (value_type $I32) $I64 (ExtendKind.Zero))
        (if (zeroes_upper src))
        src)

I don't like the name I picked for zeroes_upper so please replace that, but I think the overall structure should work.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp created PR review comment:

In addition to the comments about uextend that apply to sextend, I'd also like to renumber the rule priorities here. Maybe just match the priorities used in uextend: the $I128 and $I64 cases don't specify a priority and default to 0, and the fits_in_32 case is at priority -1.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp created PR review comment:

Pre-existing, but would you mind changing (and val (value_type ty)) to val @ (value_type ty) while you're here?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp created PR review comment:

This seems like a nice optimization for extractlane but I'm concerned that it may be fragile. Correct me if I'm missing something, but I think it's relying on the lowering rules for extractlane to produce specific instructions, without having any clear connection between the two sets of rules. Not only are they far removed from each other, but this rule checks that the result type fits_in_16, while that rule checks that the argument type is multi_lane 8 or multi_lane 16.

At minimum I think each place needs a comment referring to the other. Ideally it'd be great to connect them in some way that's harder to get wrong, though I haven't come up with a good way to do that.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 19:29):

jameysharp created PR review comment:

I guess this comment should include I64 now.

Thanks to #6018, I now know that you don't need to check that the source of an extend fits in 64 bits if its destination is I128. The CLIF verifier checks that the source is narrower than the destination. So this could just match on (uextend src). I'm not fully decided on whether there's value in keeping the more verbose pattern as documentation, but I'm inclined to think that it's more confusing than helpful.

Similarly, the src_ty constraints in the other two uextend rules can be removed.

This also all applies to sextend below.

In addition, I can't select the next few lines to comment on them but I'd prefer the $I64 case to be at the same priority as $I128. For the moment the (fits_in_32 dst_ty) case still has to be at a different priority (either higher or lower, doesn't matter much) but I'm hoping to fix that someday.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 23:26):

alexcrichton updated PR #6022 from x64-refactor-extend to main.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 23:27):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 23:27):

alexcrichton created PR review comment:

Oh nice suggestion! IIRC the desire was to reuse the predicate "does this zero the upper bits" in other instructions, but I at least don't personally remember the context of if there's other locations to use this predicate. In any case I've refactored it out to avoid some extra boilerplate.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 23:36):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 23:36):

alexcrichton created PR review comment:

True!

One way this could be tackled is to just go ahead and emit the instruction right here in the helper. For the case of extractlane that's pretty easy since it's a single instruction and the extractlane rules themselves aren't fancy. That's still somewhat brittle though as it doesn't handle evolution of extractlane over time (if at all).

Perhapes more worrisome, though, is the preexisting cases for dropping a uextend. The pattern matches for 32-bit I think have the same trait where they don't necessarily guarantee a particular instruction is being used, we're just relying on things being implemented in a particular way. That part I don't know how best to solve because I definitely wouldn't want to inline all the iadd rules here, for example.

Do you think it's worthwhile to comment all of these though? While perhaps exhaustive it may be getting a bit out of hand since a good handful of the binops span multiple terminals of rules and if the purpose of the comment is to say "please always use the sub intruction here or something else that zeros the upper 32-bits for uextend elision" then it's easy to miss that if it's only located at the top of the rules.

I suppose an even crazier strategy would be to postprocess the instructions after lowering, but that naturally has no existing infrastructure and is almost certainly overkill for this one use case.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 01:20):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 01:20):

jameysharp created PR review comment:

Oh, I guess I assumed that we could pretty well count on any implementation of e.g. 32-bit add having this behavior on x86. I thought of pextrb as a little more magical because it zero-extends to a narrower width.

I'm not saying this is a good plan, but I guess another approach would be to scatter the individual extend_to_gpr and value32_zeros_upper32 rules out to place them next to the lower rules that they rely on. Actually, I don't entirely hate that.

We don't necessarily have to solve this in this PR, I'm just a little nervous about it long-term. I am curious if @cfallin or others have any other suggestions for this.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 01:35):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 01:35):

cfallin created PR review comment:

Yeah, unfortunately this is the status-quo for iadd and friends too. Placing rules that are intertwined in this way (producer that zeroes upper bits, consumer that relies on it) next to each other is not a bad idea; or at least, we could add comments nothing the implicit extra invariant everywhere it is either enforced or relied upon.

At various points we've talked about "demanded-bits" and "known-bits" optimizations that could solve this sort of issue at the lowering-framework level: for example, in the upward scan during isel, we could record "requires zero-extended value" when we see a uextend and otherwise pass through the source register, then generate the extend after the producer if any consumer needed it. That needs a bit more thought to refine the right abstractions though.

We could also optimize VCode; at various points we've talked about post-lowering GVN, cprop, etc., and eliding redundant uextends could be a part of that. But again, much more plumbing needed.

Finally, this is exactly the sort of invariant that the verification project encodes in annotations, and checks on both ends. So I'm optimistic that eventually we'll be in a situation where it's actually enforced, and we can sleep easy at night. In the meantime IMHO it's probably enough to rely on documentation via comments and/or placing intertwined rules together.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 01:36):

cfallin edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:20):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:20):

alexcrichton created PR review comment:

I should note that the pextrb documentation I'm pretty sure indicates that it zero-extends to the whole 64-bit register, but I'm not certain enough to add the rule just yet and otherwise trying to write manual tests for this is a bit of a pain. I'm also mostly interested in this from a wasm perspective where i8x16.extract_lane_u for example unconditionally does a uextend to i32 to obey wasm semantics but the uextend can be elided in all cases for the x86 instruction (and probably other platforms too which I should look into).

One point of hesitation I would have to scatter the (rule ...)s to various places would be that priorities are needed for the (rule ...) annotations and it's easy enough to reason about priorities when they're written in one place but having to group them from lots of different places could make things difficult.

Given though that the main motivation is the wasm lowering here, I could also just add this one specific rule for uextend around the place that extractlane is codegen'd since they'd both use the single pextr{b,w} instruction.

I'm still not sure how best to resolve this, but if you're mainly worried about the extractlane cases being the odd ones out, how about I add comments to the (rule) with extractlane here and comments on the lowering of extractlane as well pointing to each other?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:28):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:28):

jameysharp created PR review comment:

I think adding comments just for extractlane is plenty for this PR. Thanks!

I agree with your concerns about scattering rules around that rely on priorities. I think we can deal with that, and also I think all our options are bad in different ways, which is why I said I don't hate that option. But in any case it doesn't have to happen right now.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:51):

alexcrichton updated PR #6022 from x64-refactor-extend to main.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:51):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 18:51):

alexcrichton created PR review comment:

Ok I have added some more words to help connect the dots here :+1:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 19:16):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 19:16):

jameysharp created PR review comment:

Perfect, thank you!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 16 2023 at 19:17):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2023 at 02:17):

alexcrichton merged PR #6022.


Last updated: Oct 23 2024 at 20:03 UTC