alexcrichton opened PR #9853 from alexcrichton:pulley-simd-compare to bytecodealliance:main:
More wast tests passing.
alexcrichton requested cfallin for a review on PR #9853.
alexcrichton requested wasmtime-compiler-reviewers for a review on PR #9853.
alexcrichton requested dicej for a review on PR #9853.
alexcrichton requested wasmtime-core-reviewers for a review on PR #9853.
alexcrichton requested wasmtime-default-reviewers for a review on PR #9853.
cfallin submitted PR review:
Thanks! Looks right to me; I checked over most of this for copy-pastos but am additionally trusting the runtests as a backstop. Item of stray curiosity below but nothing to block on.
cfallin created PR review comment:
Stray curiosity: did you happen to look if LLVM can autovectorize this? It sure would be neat to have vector op implementations bottom out in native vector instructions when Pulley runs on a SIMD-capable host...
(No worries if not, it's not the main goal, but if it inspires anything then all the better)
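To make the question concrete, here is a minimal sketch of the kind of scalar, lane-by-lane code an interpreter might use for two of the opcodes discussed in this thread. The function names mirror the opcode names, but the signatures and array-based register representation are assumptions for illustration, not Pulley's actual implementation. Fixed-trip-count loops over small arrays like these are the shape that LLVM's vectorizers can often turn into a single native SIMD instruction.

```rust
/// Hypothetical stand-in for a 128-bit vector register viewed as four i32 lanes.
/// Adds the lanes pairwise; Wasm integer addition wraps on overflow.
pub fn vaddi32x4(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

/// Hypothetical lane-wise equality on sixteen u8 lanes.
/// Each output lane is all ones (0xff) where the inputs match and zero otherwise.
pub fn veq8x16(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if a[i] == b[i] { 0xff } else { 0x00 };
    }
    out
}
```

Whether a given loop is actually vectorized depends on the target features and optimization level, so inspecting the disassembly is the reliable way to check.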
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Heh I've been double-checking this along the way for most of the simd opcodes. The good news is yes! LLVM does a pretty good job at auto-vectorizing all these methods.
For example vaddi32x4 looks like this:
0000000000000000 <_ZN97_$LT$pulley_interpreter..interp..Interpreter$u20$as$u20$pulley_interpreter..decode..OpVisitor$GT$9vaddi32x417h24bc21fe57c19519E>:
   0: 48 8b 07             mov    (%rdi),%rax
   3: 89 f1                mov    %esi,%ecx
   5: 40 0f b6 d6          movzbl %sil,%edx
   9: c1 ee 04             shr    $0x4,%esi
   c: 81 e6 f0 0f 00 00    and    $0xff0,%esi
  12: c1 e9 0c             shr    $0xc,%ecx
  15: 81 e1 f0 0f 00 00    and    $0xff0,%ecx
  1b: c1 e2 04             shl    $0x4,%edx
  1e: c5 f9 6f 04 30       vmovdqa (%rax,%rsi,1),%xmm0
  23: c5 f9 fe 04 08       vpaddd (%rax,%rcx,1),%xmm0,%xmm0
  28: c5 f9 7f 04 10       vmovdqa %xmm0,(%rax,%rdx,1)
  2d: 31 c0                xor    %eax,%eax
  2f: c3                   ret
and the method here looks like this:
0000000000000000 <_ZN105_$LT$pulley_interpreter..interp..Interpreter$u20$as$u20$pulley_interpreter..decode..ExtendedOpVisitor$GT$7veq8x1617h73aa3ce30a2d51abE>:
   0: 48 8b 07             mov    (%rdi),%rax
   3: 89 f1                mov    %esi,%ecx
   5: 40 0f b6 d6          movzbl %sil,%edx
   9: c1 ee 04             shr    $0x4,%esi
   c: 81 e6 f0 0f 00 00    and    $0xff0,%esi
  12: c1 e9 0c             shr    $0xc,%ecx
  15: 81 e1 f0 0f 00 00    and    $0xff0,%ecx
  1b: c1 e2 04             shl    $0x4,%edx
  1e: c5 f9 6f 04 30       vmovdqa (%rax,%rsi,1),%xmm0
  23: c5 f9 74 04 08       vpcmpeqb (%rax,%rcx,1),%xmm0,%xmm0
  28: c5 f9 7f 04 10       vmovdqa %xmm0,(%rax,%rdx,1)
  2d: 31 c0                xor    %eax,%eax
  2f: c3                   ret
Most of the complexity here is decoding BinaryOperands<VReg> where it's three 5-bit values packed into a 16-bit value, but otherwise it's pretty optimal in terms of lowering.
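The packing described above (three 5-bit register indices in one 16-bit value) can be sketched roughly as follows. The `PackedOperands` type, the `pack`/`unpack` helpers, and the field order (destination in the lowest bits) are hypothetical and chosen only for illustration; the real `BinaryOperands<VReg>` encoding may differ.

```rust
/// Hypothetical model of three 5-bit register indices packed into a u16.
/// The field order (dst in the lowest bits) is an assumption for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PackedOperands {
    pub dst: u8,
    pub src1: u8,
    pub src2: u8,
}

const FIELD_BITS: u32 = 5;
const FIELD_MASK: u16 = (1 << FIELD_BITS) - 1; // 0b1_1111

/// Packs the three indices into the low 15 bits; the top bit stays zero.
/// Indices are masked to 5 bits, so values above 31 are truncated.
pub fn pack(ops: PackedOperands) -> u16 {
    (ops.dst as u16 & FIELD_MASK)
        | ((ops.src1 as u16 & FIELD_MASK) << FIELD_BITS)
        | ((ops.src2 as u16 & FIELD_MASK) << (2 * FIELD_BITS))
}

/// Recovers the three indices with a shift and a mask per field.
pub fn unpack(bits: u16) -> PackedOperands {
    PackedOperands {
        dst: (bits & FIELD_MASK) as u8,
        src1: ((bits >> FIELD_BITS) & FIELD_MASK) as u8,
        src2: ((bits >> (2 * FIELD_BITS)) & FIELD_MASK) as u8,
    }
}
```

Each field costs one shift and one mask to extract, which is the same kind of shift-and-mask work visible at the top of the listings before the vector load.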
cfallin submitted PR review.
cfallin created PR review comment:
Nice, that's great!
cfallin merged PR #9853.
Last updated: Jan 24 2025 at 00:11 UTC