jlb6740 opened PR #2982 from implement_fcvt_low_from_unit
to main
:
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Please, use
32-bit integers
instead ofdoubleword integers
.You should also update the
fcvt_low_from_sint
description.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@akirilov-arm good catch. Thanks!
jlb6740 edited PR review comment.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
bjorn3 created PR review comment:
Can you copy the docs from
fcvt_low_from_sint
?
bjorn3 submitted PR review.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 requested akirilov-arm for a review on PR #2982.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Is there any specific reason to add a new IR operation? Correct me if I am wrong, but this is equivalent to
uwiden_low
+fcvt_from_uint
, which should be straightforward to pattern-match in the backend (if it was a sequence of, say, 10 instructions, then a new IR operation would be perfectly understandable).I realize that this has already been done for
fcvt_low_from_sint
, so I guess the same question applies to it.
akirilov-arm created PR review comment:
Yes, exactly, the term "doubleword" means "64-bit" in the Arm architecture.
jlb6740 edited PR review comment.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Hi @akirilov-arm .. good question. I did the lowering for fcvt_low_from_sint but don't remember the reasoning. Likely I simply did not realize swiden_low could be used. Brings up a question though of if the goal is to minimize the number of instructions here by reusing and mapping many-to-one as much as possible which will have the consequence of a more generic definition of the instruction. Or do we instead want to be more specific in our instruction name and definition and closely tie to the mapped wasm instruction. You mention how involved the pattern matching would get if instructions are shared may be a decider. I'll push another patch removing these new instructions and instead attempt to lower using uwiden_low and swiden_low.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@akirilov-arm I remember now, we were mimicking the implementation of F32x4ConvertI32x4S but couldn't use or didn't want to use that same instruction. I think uwiden_low was never a thought. Rereading your question, I am not sure what you mean by uwiden_low+fcvt_from_uint as I was initially thinking you just asking why I didn't reuse uwiden_low? Can you explain what you mean by uwiden_low+fcvt_from_uint w.r.t the instruction definition in instruction.rs and code_translator.rs? The patch I just pushed should address all other comments except this one.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
What I mean is that no changes would be necessary to
instructions.rs
. As forcode_translator.rs
, it would need to do something like:Operator::F64x2ConvertLowI32x4U => { let a = pop1_with_bitcast(state, I32X4, builder); let widened_a = builder.ins().uwiden_low(a); state.push1(builder.ins().fcvt_from_uint(F64X2, widened_a)); }
Then if a backend needs to do anything special for that combination, it will match the pattern.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
I think you missed a line in the description here.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Hi @akirilov-arm ok .. I am following the other instructions as an example and don't see the others doing this. What does it mean to call two instruction builders here? I tried the above and it's going to want something implemented for uwiden_low and fcvt_from_uint. Is this what you are suggesting?
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
That's one way to do it, but it's probably going to result in suboptimal code generation. You can check the handling of
Operator::V128Load8Splat
- it's a really similar case, i.e. we generate 2 separate IR operations (load
+splat
) for 1 Wasm instruction, and then the backends pattern-match the combination to generate 1 final machine instruction.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@akirilov-arm I see the implementation in code_translator.rs, but can you point to the backend lowering of Operator::V128Load8Splat that is using pattern matching?
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Sure, here's the AArch64 implementation, and in the x64 backend it seems to be in
input_to_reg_mem()
, which is used by theSplat
implementation.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@akirilov-arm I see a mergable_load call that I assume doesn't apply here? Is it really best to try and merge instructions instead of just mapping it explicitly 1:1 to the wasm instruction like it is now? I don't know that I follow exactly, but I think following your suggestion new logic would be needed in input_to_reg_mem that would get executed for every instruction calling this function. I guess the logic would check if the instruction does a widen and convert? Seems to be less straight forward? What's the advantage of avoiding the 1:1 mapping?
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
I haven't worked on the x64 backend, so I can't make specific implementation suggestions - my guess is that it would be similar to the AArch64 code, i.e. in the implementation of
fcvt_from_uint
you would check if the input isuwiden_low
and act accordingly (input_to_reg_mem()
doesn't seem like the right place). My point is that matching is possible and in this case easy because we need to match only one input, not a whole expression tree.I am working on enabling the
i16x8.q15mulr_sat_s
andi32x4.trunc_sat_f64x2_s_zero
operations, which also happen to be expressible in terms of existing IR operations. However, they result in much more complicated patterns, which are no longer easy to match, so in those cases I would not make the same suggestion.The main advantage of avoiding the 1:1 mapping is that it reduces the work necessary for all backends to get basic support for the operation - once they implement 2 other operations, which they are going to do anyway because it is required by the Wasm SIMD specification, they get
f64x2.convert_low_i32x4_u
for free (and the same would apply tof64x2.convert_low_i32x4_s
if it is changed similarly). Also, the pattern might occur organically in the Wasm code and would be handled properly. There is probably an argument to be made that enums should be kept as lean as possible, and that other applications that try to generate or manipulate Cranelift IR (say, another compiler frontend or a peephole optimizer) would have an easier job because they will have fewer things to consider, but these are lesser concerns.With all that said, I am definitely not categorically opposed to adding an IR operation, so perhaps it is best to hear a third opinion.
cc @abrown @bnjbvr @bjorn3 @cfallin @uweigand
cfallin submitted PR review.
cfallin created PR review comment:
@jlb6740 we had some discussion on this topic last year -- see #2278 and #2376.
I think in general, we want to try to limit new instructions we add at the CLIF level to those that are truly necessary; while combo-instructions make it easier to plumb through the new thing and get exactly the semantics you want, they impose ongoing maintenance cost and cost on new backend implementations, as the IR-level instruction set becomes more complex. Also, the operator-combos for which there are efficient lowerings may be different on different architectures; we then end up with the union of all combo-instructions, and while some backends will have efficient lowerings for the ones that were purpose-built for that backend, the other other backends will have to add new logic that could have come from the combination of simpler ops automatically, as @akirilov-arm says. In general, pattern-matching is a good way to implement better lowerings for some combos without taking on this cost, I think.
In this particular case, the general helpers (e.g.
input_to_reg_mem()
) don't need to change at all; instead there would be some logic in the lowering case forfcvt_from_uint
that looks for auwiden_low
. Thematches_input()
helper should be useful here.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
@cfallin It's probably reasonable to consider a "complexity bound" on the necessary matching, though, as I said - for instance,
i32x4.trunc_sat_f64x2_s_zero
is equivalent to:v1 = fcvt_to_sint_sat.i64x2 v0 v2 = vconst.i64x2 0 v3 = snarrow v1, v2
or:
v1 = fcvt_to_sint_sat.i64x2 v0 v2 = iconst.i64 0 v3 = splat.i64x2 v2 v4 = snarrow v1, v3
(and maybe I am forgetting another straightforward way to generate a vector of zeros)
I would lean towards introducing a new IR operation in that case.
cfallin submitted PR review.
cfallin created PR review comment:
Yes, definitely, the matching has some dynamic cost (in compile time and IR memory usage), so there's a tradeoff. When it's a simple A+B combo op as here, it seems reasonable to me to pattern-match the composition, but we should take it on a case-by-case basis. I'd be curious in your examples whether the lowering can give a more optimized instruction sequence for those 3/4 ops together than what would fall out of the simple lowering, but we can save such discussion for a future PR :-)
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@akirilov-arm @cfallin Hi .. with some offline help from @cfallin I was able to understand what this should look like in lowering. I used the existing algorithm but will investigate doing it another way that uses fewer instructions. I did not try to refactor any other lowerings such as f64x2.convert_low_i32x4_s since this PR is about implementing f64x2.convert_low_i32x4_u and I figure we want to at least get the others finished before optimizing and refactoring previous instructions, but I do plan to refactor f64x2.convert_low_i32x4_s with a different PR if not just for symmetry. Let me know if there is anything else needed for this PR.
jlb6740 edited PR review comment.
jlb6740 updated PR #2982 from implement_fcvt_low_from_unit
to main
.
jlb6740 requested akirilov-arm for a review on PR #2982.
akirilov-arm submitted PR review.
jlb6740 merged PR #2982.
Last updated: Jan 24 2025 at 00:11 UTC