alexcrichton opened PR #5795 from better-avx-supoprt
to main
:
I was poking around Spidermonkey's wasm backend and saw that the various assembler functions used are all
v*
-prefixed which look like they're intended for use with AVX instructions. I looked at Cranelift and it currently doesn't have support for many AVX-based instructions, so I figured I'd take a crack at it!The support added here is a bit of a mishmash when viewed alone, but my general goal was to take a single instruction from the SIMD proposal for WebAssembly and migrate all of its component instructions to AVX. I, by random chance, picked a pretty complicated instruction of
f32x4.min
. This wasm instruction is implemented on x64 with 4 unique SSE instructions and ended up being a pretty good candidate.Further digging about AVX-vs-SSE shows that there should be two major benefits to using AVX over SSE:
Primarily AVX instructions largely use a three-operand form where two input registers are operated with and an output register is also specified. This is in contrast to SSE's predominant one-register-is-input-but-also-output pattern. This should help free up the register allocator a bit and additionally remove the need for movement between registers.
As #4767 notes the memory-based operations of VEX-encoded instructions (aka AVX instructions) do not have strict alignment requirements which means we would be able to sink loads and stores into individual instructions instead of having separate instructions.
So I set out on my journey to implement the instructions used by
f32x4.min
. The first few were fairly easy. The machinst backends are already of the shape "take these inputs and compute the output" where the x86 requirement of a register being both input and output is postprocessed in. This means that theinst.isle
creation helpers for SSE instructions were already of the correct form to use AVX. I chose to add newrule
branches for the instruction creation helpers, for examplex64_andnps
. The newrule
conditionally only runs if AVX is enabled and emits an AVX instruction instead of an SSE instruction for achieving the same goal. This means that no lowerings of clif instructions were modified, instead just new instructions are being generated.The VEX encoding was previously not heavily used in Cranelift. The only current user are the FMA-style instructions that Cranelift has at this time. These FMA instructions have one extra operand than
vandnps
, for example, so I split the existingXmmRmRVex
into a few more variants to fit the shape of the instructions that needed generating forf32x4.min
. This was accompanied then with more AVX opcode definitions, more emission support, etc.Upon implementing all of this it turned out that the test suite was failing on my machine due to the memory-operand encodings of VEX instructions not being supported. I didn't explicitly add those in myself but some preexisting RIP-relative addressing was leaking into the new instructions with existing tests. I opted to go ahead and fill out the memory addressing modes of VEX encoding to get the tests passing again.
All-in-all this PR adds new instructions to the x64 backend for a number of AVX instructions, updates 5 existing instruction producers to use AVX instructions conditionally, implements VEX memory operands, and adds some simple tests for the new output of
f32x4.min
. The existing runtest forf32x.min
caught a few intermediate bugs along the way and I additionally added a plaintarget x86_64
to that runtest to ensure that it executes with and without AVX to test the various lowerings. I'll also note that this, and future support, should be well-fuzzed through Wasmtime's fuzzing which may explicitly disable AVX support despite the machine having access to AVX, so non-AVX lowerings should be well-tested into the future.It's also worth mentioning that I am not an AVX or VEX or x64 expert. Implementing the memory operand part for VEX was the hardest part of this PR and while I think it should be good someone else should definitely double-check me. Additionally I haven't added many instructions to the x64 backend yet so I may have missed obvious places to tests or such, so am happy to follow-up with anything to be more thorough if necessary.
Finally I should note that this is just the tip of the iceberg when it comes to AVX. My hope is to get some of the idioms sorted out to make it easier for future PRs to add one-off instruction lowerings or such.
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
afonso360 submitted PR review.
afonso360 submitted PR review.
afonso360 created PR review comment:
This (and the other
fma
) decl's can be updated now that we support XmmMem right?
afonso360 submitted PR review.
afonso360 submitted PR review.
afonso360 created PR review comment:
I would have expected that in this case the load could have been merged into
vorps
, any idea why that didn't happen?
afonso360 submitted PR review.
afonso360 submitted PR review.
afonso360 created PR review comment:
;; Helper for creating `MInst.XmmRmiRVex` instructions.
afonso360 created PR review comment:
I don't think we can update this to
XmmMem
sinceminps
requires alignment, but its unfortunate since it also prevents loads sinking into the AVX version.Is there another way around it? I suspect we are going to run into more cases like this in the future.
My thoughts would be to do something like:
(decl load_me_maybe (XmmMem) Xmm) (decl x64_minps (Xmm XmmMem) Xmm) (rule 0 (x64_minps x y) (xmm_rm_r (SseOpcode.Minps) x (load_me_maybe y)))
But I'm not sure how good an idea that is, and we definitely don't need to do it in this PR.
afonso360 created PR review comment:
It looks like for loads to be sunk we need a bunch of additional stuff. Which for some reason I had the idea wasn't necessary before.
So that probably explains why this particular case didn't get sunk.
bor
doesen't have a special case for sinking loads for floats (but does have it for integers).
afonso360 submitted PR review.
alexcrichton updated PR #5795 from better-avx-supoprt
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
It should yeah! As you've seen though I think more intrusive handling is going to be required to get fused memory loads/stores into these operations as well. For now I think it might be best to defer switching everything over to
XmmMem
for a future PR to holistically try to handle everything in one go. The main reason I added memory encodings for VEX instructions was just because a test case ended up requiring it in Wasmtime which I wasn't expecting!
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah yeah I think we'll have to add all the
sinkable_load
business to the rules for all the lowerings here. That shouldn't be too too bad as it's basically the same as all the other arithmetic instructions.
afonso360 submitted PR review.
afonso360 created PR review comment:
Yeah I agree, It's probably better to do it all in a future PR! :+1:
alexcrichton submitted PR review.
alexcrichton created PR review comment:
One idea I had was that in the
sinkable_load
extractor a helper functionis_mergeable_load
is called which currently saysfalse
for all SIMD-based loads if they aren't flagged as aligned. I was thinking that if we have complete AVX support we might be able to make that function conditional where if AVX is enabled then it allows unaligned loads but if not it disallows unaligned loads. That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.
afonso360 submitted PR review.
afonso360 created PR review comment:
Similar to the above comments, we should probably do load sinking in a separate PR since it is a bit more advanced.
afonso360 edited PR review comment.
afonso360 edited PR review comment.
fitzgen submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
Is this an actual TODO or is that arm just unreachable given the set of AVX instructions we support right now?
fitzgen created PR review comment:
Nitpick: doc comment is slightly outdated now.
fitzgen created PR review comment:
That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.
We could pass the opcode to
is_mergeable_load
as well and have an allow list for AVX instructions that don't barf on unaligned loads.
elliottt submitted PR review.
elliottt created PR review comment:
I like that idea! We've already got a separate path for sinking loads for sse operations (
put_in_xmm_mem
) which would be a great place to add this information. The only downside I can see with changing the signature of that function is that we'd no longer be able to use it as an implicit conversion. I think that might actually be a great change though, as it would force us to be a bit more thoughtful about where we're allowing loads to be sunk for sse instructions.
alexcrichton updated PR #5795 from better-avx-supoprt
to main
.
alexcrichton updated PR #5795 from better-avx-supoprt
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah just an unreachable arm, I've updated the panic invocation to indicate that
afonso360 submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Alternate idea, for future consideration: an instruction that takes a R/M arg but requires alignment could encode that in the type, via e.g. an
XmmMemAligned
arg. (If we currently have a case where we use the sameMInst
enum variant for an alignment-required opcode and not, we should split into separate variants.) Then if we made a pass over all instructions to ensure we got the types right, we can be much more permissive with the automatic conversions everywhere, without having to worry about continued vigilance with explicitput_in_xmm_mem
, etc.
alexcrichton merged PR #5795.
Last updated: Dec 23 2024 at 12:05 UTC