jlb6740 opened PR #3095 from fix-x64-ext_mul_i8x16
to main
:
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
jlb6740 has marked PR #3095 as ready for review.
jlb6740 assigned PR #3095 to alexcrichton.
jlb6740 assigned PR #3095 to alexcrichton.
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
This may be my lack of knowledge about the x64 backend, but the above instruction
Inst::gen_move(tmp_reg, lhs, output_ty)
has the destination as the first parameter (
tmp_reg
I believe) and the parameters to the instruction as the second (lhs
).Does that mean that this should be
dst
as the first parameter since that should be the destination? (or doesxmm_rm_r
behave differently in that the destination is the second parameter?)
alexcrichton created PR review comment:
Looking at the pattern matching and such, it looks like the codegen here is actually the same as-if the specialized pattern matching in this function weren't present? Basically it looks like this is generating
palignr; pmovsxbw
for each lhs/rhs, and then it'spmullw
for the result. That looks like the same codegen as if the component operations ofi16x8.extmul_high_i8x16_s
were each translated individually.If that's the case is it necessary to have a specialized pattern-match here for x64? E.g. were the wasm instructions primarily for other platforms which have a native instruction for the operation?
alexcrichton created PR review comment:
Also, how come a temp register is used? It seems like the goal of
palignr
is to get the high 8 values into the low 8 values, and then thepmovsxbw
will sign-extend the low values into 16-bit values, but looking at the intructions it seems like that can all be done with the same register?
alexcrichton created PR review comment:
Hm although given
xmm_rm_r_imm
it looks like the destination isn't the first argument, so I may just not know the backend here in this regard.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
This may be my lack of knowledge about the x64 backend, but the above instruction
Inst::gen_move(tmp_reg, lhs, output_ty)
has the destination as the first parameter (
tmp_reg
I believe) and the parameters to the instruction as the second (lhs
).Does that mean that this should be
dst
as the first parameter since that should be the destination? (or doesxmm_rm_r
behave differently in that the destination is the second parameter?)Yes .. good observation. gen_move is unique in this way by having the first operand as the destination. Not sure the history behind that but it is something that perhaps should be refactored to avoid confusion.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Looking at the pattern matching and such, it looks like the codegen here is actually the same as-if the specialized pattern matching in this function weren't present? Basically it looks like this is generating
palignr; pmovsxbw
for each lhs/rhs, and then it'spmullw
for the result. That looks like the same codegen as if the component operations ofi16x8.extmul_high_i8x16_s
were each translated individually.If that's the case is it necessary to have a specialized pattern-match here for x64? E.g. were the wasm instructions primarily for other platforms which have a native instruction for the operation?
@alexcrichton Another good observation. Yes, obviously this instruction is just the combination of a 3 instruction sequence written as one. For arm these 3 instructions can be lowered efficiently more easily with this 3 instruction combination for all 3 variants, i.e. (i16x8.extmul_high_i8x16_s, i32x4.extmul_high_i16x8_s, i64x2.extmul_high_i32x4_s) so the lowering is optimal but for x64 more optimal lowering is only available for a few of the variants. Namely i16x8.extmul_high_i8x16_s and i16x8.extmul_high_i8x16_u does not have a more optimal lowering so I end up just inlining the codegen here. In fact the mistake was simply a copy paste error in that inlining, not seeing a way to print the final codegen to triple-check, and then not having adequate to testing to catch the mistake. For the other variants the specialized codegen I think saves an instruction or two (though I haven't counted explicitly).
jlb6740 updated PR #3095 (assigned to alexcrichton) from fix-x64-ext_mul_i8x16
to main
.
alexcrichton created PR review comment:
Ah ok! Do you think it'd be possible to structure the code such that it falls-through to the normal lowerings for non-specialized implementations, but specialized implementaitons (the other widths > 8) still have pattern-matched codegen? Either way seems fine to me though
alexcrichton submitted PR review.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Hi @alexcrichton, I think so but I am not quite confident how to write this ... particular when there are two input registers here. What you gain is the normal benefit of code reuse but the lowering should be no more efficient than what you see here and may introduce a few more jumps. If you are ok with either way I'd prefer to just merge this fix now and refactor after a discussion with @cfallin.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Oh sure yeah, I think it's fine to land this now and refactor later (although I think some tests here would still be good)
cfallin submitted PR review.
cfallin created PR review comment:
FWIW I'm fine with this minor code duplication for now; a bigger refactor later when we have a more systematic way of writing lowering pattern-matching seems reasonable.
cfallin submitted PR review.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Hi @alexcrichton, did you have in many any particular type of test in mind? There is spec tests, file tests, unit tests, and in the future fuzz testing but for proving correctness I think the spec tests have been what we've used. The unit tests help confirm encodings as long as we are confirming with as or some other tool.
jlb6740 deleted PR review comment.
jlb6740 merged PR #3095.
Last updated: Jan 24 2025 at 00:11 UTC