Stream: git-wasmtime

Topic: wasmtime / PR #5930 x64: Add a smattering of lowerings fo...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 04 2023 at 22:18):

alexcrichton opened PR #5930 from x64-shuffles to main:

I took the meshoptimizer benchmark, compiled it to wasm, and ran it through both Wasmtime and v8. One major improvement for Wasmtime was to use specialized shuffle instructions instead of the generic fallback which requires two pshufd with rip-relative immediates plus a por vs a single instruction in many cases. I trawled through the v8 disassembly and found a number of shuffle-related instructions and translated all of their lowerings to Wasmtime. On one specific benchmark in that program this PR improves the throughput by 30%, and another by 5%.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2023 at 20:49):

alexcrichton requested abrown for a review on PR #5930.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2023 at 22:35):

alexcrichton updated PR #5930 from x64-shuffles to main.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

Wish the matching logic could be visible here in ISLE but that doesn't really change how I feel about this PR.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

        // fits in 32-bits.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

I started to wonder how many comparisons need to be made before one can emit a shuffle lowering: I think in the end it is worth it to spend a bit more time finding the right instruction here but I wonder if there is quicker way to do so. (Just wondering, no action neeeded).

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

        // When selecting from the right-hand-side offset, subtract these all by 4 which

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

I think this could use some documentation; it's not exactly clear why only bytes[0] matters here (little-endianness-related, it would seem). This is a foundational block to all of the helpers above and it would be nice to see clearly when we expect the Nones from this function to be the ones propagated outwards.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 19:20):

abrown created PR review comment:

And I get that it gets rather hairy in the helpers...

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:14):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:14):

alexcrichton created PR review comment:

I agree! I originally was wondering how bad it would be to put this all in ISLE but I ended up concluding that it would end up with even more extractors and not necessarily simplify the overall scheme, so I opted to leave the rules in ISLE but the extraction of immediates in Rust (for better or worse)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:15):

alexcrichton created PR review comment:

Same! My thinking though was that there's really not all that many shuffle instructions in a program so if shuffle is 100x slower to translate than load it's probably not the end of the world. That being said I do think that these can be slightly optimized where instead of having an extractor-per-immediate we could have one extractor that returns an enum of all the possible shuffle instructions that could be used which is then pattern-matched on in ISLE. I'm not sure if it would be that much faster but it would prevent bouncing back and forth between generated code and isle.rs if it becomes an issue.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:15):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:26):

alexcrichton updated PR #5930 from x64-shuffles to main.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:27):

alexcrichton has enabled auto merge for PR #5930.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:46):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 22:46):

abrown created PR review comment:

I was going to mention that V8 does this "match once, then emit" through separate opcodes like you mention below:

Oh I meant to take a look at that but never got around to it. Interestingly though it looks like the x64 backend for v8 has individual opcodes for all the punpck* variants so it seems like they probably translate shuffles to different opcodes earlier in the pipeline than what Cranelift is doing in the backend here.

I like the enum idea for some point in the future.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:21):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:21):

alexcrichton created PR review comment:

Somewhere in v8 though something is translating from the wasm shuffle to each individual kX64* instruction though, right? I would naively expect that the translation there is on the same order-of-magnitude of complexity/time/etc for what these ISLE lowerings are generating.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:40):

alexcrichton merged PR #5930.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:45):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:45):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:45):

jameysharp created PR review comment:

I suppose there could be a similar version of this rule if all lanes specify to broadcast the first byte of the _second_ operand. That's, what, if every byte in the mask is 0x10?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2023 at 23:45):

jameysharp created PR review comment:

Since ISLE will combine all the u128_from_immediate calls in the rules at priority 6 below, many of these rules are really cheap to evaluate together. This isn't as bad as it looks. :laughing:

The enum approach would let ISLE do the same collapsing for a bunch more of these rules, so yeah, that should be good. Also, all the rules that would then use the same extractor but match on different returned enum variants should then be placed at the same priority if there aren't other rules at priorities in between.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 10 2023 at 00:12):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 10 2023 at 00:12):

alexcrichton created PR review comment:

True! I've been going back and forth a bit on how exhaustive all these rules should be. For example this rule:

(rule 6 (lower (shuffle a b (u128_from_immediate 0x1f1e_0f0e_1d1c_0d0c_1b1a_0b0a_1918_0908)))
      (x64_punpckhwd a b))

could also match the u128 with the two 64-bit halves swapped to generate a punpckhwd b a instruction.

I added a bunch of lhs/rhs shuffles here for things like pshufd and shufps but I'm also second guessing myself doing that since I didn't actually see any such shuffles in the wild and I was just adding them here on a whim.

Some of this I've been feeling would be better implemented as a egraph-style optimization. For example if all the shuffles lanes are <16 then the b operand could be dropped. If they're all >=16 then they could all be updated and the a operand could be dropped. For the punpckhwd case above it's theoretically possible where the LSB of the shuffle mask could be one of the lowest numbered bytes. You can see though that by this point it feels much more complicated than a simple rule so I also haven't tried to do this yet (plus from an egraph-perspective it's not really "simpler" to replace a shuffle with a shuffle)

In the end I figured I'd stick to doing the obvious lowerings for each instruction a platform has along with ensuring that some various simd binaries I've been seeing all have all their shuffles special-cased. We can always add more here in the future, and I was mainly hoping that by that point it's more obivous how to solve the problem of "could this shuffle be special-cased if the operands were swapped?"


Last updated: Dec 23 2024 at 12:05 UTC