abrown opened PR #2397 from packed-shifts to main:
This change first modifies `SyntheticAmode` to allow it to refer to constants: this is helpful because `SyntheticAmode` and `Amode` are used throughout the x64 backend to refer to memory locations, so constants can now fit into that scheme. (In a future PR I will modify `vconst` to simply lower to an `Inst::XmmUnaryRmR`, which accepts a `SyntheticAmode`, instead of using the special `Inst::XmmLoadConst` variant.)

With that in place, I added all of the lowerings for the vector versions of `ishl`, `sshr`, and `ushr`. There are several interesting parts to this:
- vector shifts can accept an immediate shift amount, so this must be handled in each case; if the shift amount is dynamic, it must be moved to an XMM register
- the x86 ISA does not provide `i8x16` shifts, so these are lowered either using a mask table (`ishl` and `ushr`) or using packing/unpacking (`sshr`)
- the equivalent of `sshr.i64x2` is also missing in the x86 ISA, so an "extract, scalar shift, insert" scheme is used (as it was in the old backend); both of these workarounds are sketched below

The issues above are documented in detail in the code.
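To make the last two bullets concrete, here is a plain-Rust model of the lane arithmetic (an editor's illustration with hypothetical function names, not the backend's actual instruction-emission code):

```rust
/// `ushr.i8x16`: x86 has no 8-bit-lane shift, so shift at 16-bit granularity
/// (as PSRLW would) and mask off the bits that migrate across byte
/// boundaries; the per-byte mask is what a mask table would supply.
fn ushr_i8x16(lanes: [u8; 16], amount: u32) -> [u8; 16] {
    let amount = amount % 8; // shift amount modulo the lane width
    let keep = 0xffu8 >> amount; // per-byte mask for this shift amount
    let mut out = [0u8; 16];
    for i in 0..8 {
        // Shift one 16-bit lane right...
        let wide = u16::from_le_bytes([lanes[2 * i], lanes[2 * i + 1]]) >> amount;
        let [lo, hi] = wide.to_le_bytes();
        // ...then drop the high-byte bits that slid into the low byte.
        out[2 * i] = lo & keep;
        out[2 * i + 1] = hi & keep; // a no-op for the high byte, but harmless
    }
    out
}

/// `sshr.i64x2`: x86 (pre-AVX-512) has no arithmetic 64-bit-lane shift, so
/// each lane is extracted, shifted with a scalar arithmetic shift, and
/// inserted back.
fn sshr_i64x2(lanes: [i64; 2], amount: u32) -> [i64; 2] {
    let amount = amount % 64;
    lanes.map(|lane| lane >> amount) // `>>` on i64 is arithmetic, like SAR
}

fn main() {
    assert_eq!(ushr_i8x16([0x81; 16], 1), [0x40; 16]);
    assert_eq!(sshr_i64x2([-2, 8], 1), [-1, 4]);
}
```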
abrown requested cfallin, bnjbvr and jlb6740 for a review on PR #2397.
abrown edited PR #2397 from packed-shifts to main.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Perhaps `assert!((ty == types::F32 && lane < 4) || (ty == types::F64 && lane < 1))` above to ensure this is the case?
cfallin created PR Review Comment:
Could we add some documentation here describing how these tables were computed? I think I get the general idea -- for the "low half" 8 bits of each 16-bit shifted-right value, we need to mask out the bits that migrated from the high half to the low half -- but, e.g., why are the first three values of the shift-by-1 case (second row) 0xff, rather than the first, third, fifth, ...?
cfallin created PR Review Comment:
s/then/them/
cfallin created PR Review Comment:
This is somewhat surprising to me; I would have expected the definition of `ushr` to be that it masks the shift amount first, so e.g. a 33-bit shift of a 32-bit lane is the same as a 1-bit shift. The definition in `codegen/meta/src/shared/instructions.rs` seems to indicate this as well (`shift-amounts := y mod B`). Does this mean we need an explicit mask first?
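To make the cited semantics concrete, here is a one-line model (an editor's illustration, not code from the PR) of the `y mod B` rule for a single 32-bit lane:

```rust
/// CLIF `ushr` semantics as defined in instructions.rs: the shift amount is
/// taken modulo the lane width B, so shifting a 32-bit lane by 33 behaves
/// like shifting by 1.
fn clif_ushr_lane32(x: u32, y: u32) -> u32 {
    x >> (y % 32)
}

fn main() {
    assert_eq!(clif_ushr_lane32(0x8000_0000, 33), 0x4000_0000); // same as x >> 1
}
```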
jlb6740 submitted PR Review.
jlb6740 created PR Review Comment:
I think this is a good catch. `ushr_i32x4` lowers to SHR (a logical shift right, i.e., an unsigned divide by a power of two), which masks the count to 5 bits... meaning the shift is 1-bit and not 33-bit. @cfallin What do you mean by an explicit mask first... is it just the file tests that need updating?
abrown submitted PR Review.
abrown created PR Review Comment:
Added some documentation; to make things simpler, I mask both lanes, though this is not strictly necessary, and what you are seeing is `rustfmt` pushing some of the mask onto the next line.
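For readers following along, here is a hypothetical reconstruction of how such a table could be computed (an assumption consistent with the discussion above, not necessarily the PR's actual code): row 0, the shift-by-0 mask, is sixteen 0xff bytes, which is why wrapped 0xff values can appear on the second source line.

```rust
/// Hypothetical `ushr.i8x16` mask table: row `amt` keeps the low (8 - amt)
/// bits of every byte after an `amt`-bit right shift done at 16-bit
/// granularity.
fn ushr_i8x16_mask_table() -> [[u8; 16]; 8] {
    let mut table = [[0u8; 16]; 8];
    for (amt, row) in table.iter_mut().enumerate() {
        row.fill(0xffu8 >> amt); // amt = 0: all 0xff; amt = 1: all 0x7f; ...
    }
    table
}

fn main() {
    assert_eq!(ushr_i8x16_mask_table()[1], [0x7f; 16]);
}
```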
abrown submitted PR Review.
abrown created PR Review Comment:
Huh, yeah. Good thing I added that test :big_smile:; I think I remember adding it to the old backend to remind me of this issue. `code_translator.rs` already does this masking, so the extra masking was overkill, but the old backend was legalizing `ushr` to `x86_psrl` (and the like), so the test was correct for those ISA-specific instructions. When I refactored the test to use `ushr`, it no longer matched the Cranelift semantics, as you observe. I don't think we should mask twice, once in CLIF and again in the x64 backend, and I don't feel too strongly about which solution is best:
- avoid masking in `code_translator.rs` and mask instead in the backend, leaving the CLIF semantics as-is
- leave the masking in `code_translator.rs` and change the CLIF semantics

Since either change might involve changes to other backends and since this currently passes the Wasm spec tests, I would be in favor of creating an issue and resolving it in a separate PR. What do you think?
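Whichever layer owns it, both options reduce to applying the `y mod B` reduction exactly once before the hardware shift; a minimal sketch (hypothetical helper, tied to neither file):

```rust
/// Reduce a dynamic shift amount modulo the lane width B. B is a power of
/// two, so `y mod B` is a single bitwise AND.
fn mask_shift_amount(lane_bits: u32, amount: u32) -> u32 {
    debug_assert!(lane_bits.is_power_of_two());
    amount & (lane_bits - 1)
}

fn main() {
    assert_eq!(mask_shift_amount(32, 33), 1); // the 33-vs-1 case from this thread
}
```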
abrown edited PR Review Comment.
abrown updated PR #2397 from packed-shifts to main.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, interesting; so this is mainly an issue for (i) CLIF compile tests and (ii) potentially other Cranelift embedders. For the latter's sake, I think we should err on the side of implementing the CLIF semantics as currently defined, i.e., doing the masking in the backends. Happy to have this become a followup issue/PR though. Thanks for digging into it!
cfallin submitted PR Review.
abrown submitted PR Review.
abrown created PR Review Comment:
Sure, I opened https://github.com/bytecodealliance/wasmtime/issues/2409.
abrown merged PR #2397.