Stream: git-wasmtime

Topic: wasmtime / PR #2982 Implements f64x2.convert_low_i32x4_u ...


view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 15:38):

jlb6740 opened PR #2982 from implement_fcvt_low_from_unit to main:

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 15:54):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 16:01):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 16:01):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 16:01):

akirilov-arm created PR review comment:

Please, use 32-bit integers instead of doubleword integers.

You should also update the fcvt_low_from_sint description.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:00):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:00):

jlb6740 created PR review comment:

@akirilov-arm good catch. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:14):

jlb6740 edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:16):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:53):

bjorn3 created PR review comment:

Can you copy the docs from fcvt_low_from_sint?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:53):

bjorn3 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 18:59):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 19:56):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 14 2021 at 22:23):

jlb6740 requested akirilov-arm for a review on PR #2982.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 15 2021 at 13:13):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 15 2021 at 13:13):

akirilov-arm created PR review comment:

Is there any specific reason to add a new IR operation? Correct me if I am wrong, but this is equivalent to uwiden_low + fcvt_from_uint, which should be straightforward to pattern-match in the backend (if it was a sequence of, say, 10 instructions, then a new IR operation would be perfectly understandable).

I realize that this has already been done for fcvt_low_from_sint, so I guess the same question applies to it.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 15 2021 at 13:13):

akirilov-arm created PR review comment:

Yes, exactly, the term "doubleword" means "64-bit" in the Arm architecture.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 15 2021 at 22:00):

jlb6740 edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 00:54):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 00:54):

jlb6740 created PR review comment:

Hi @akirilov-arm .. good question. I did the lowering for fcvt_low_from_sint but don't remember the reasoning. Likely I simply did not realize swiden_low could be used. Brings up a question though of if the goal is to minimize the number of instructions here by reusing and mapping many-to-one as much as possible which will have the consequence of a more generic definition of the instruction. Or do we instead want to be more specific in our instruction name and definition and closely tie to the mapped wasm instruction. You mention how involved the pattern matching would get if instructions are shared may be a decider. I'll push another patch removing these new instructions and instead attempt to lower using uwiden_low and swiden_low.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 01:38):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 01:38):

jlb6740 created PR review comment:

@akirilov-arm I remember now, we were mimicking the implementation of F32x4ConvertI32x4S but couldn't use or didn't want to use that same instruction. I think uwiden_low was never a thought. Rereading your question, I am not sure what you mean by uwiden_low+fcvt_from_uint as I was initially thinking you just asking why I didn't reuse uwiden_low? Can you explain what you mean by uwiden_low+fcvt_from_uint w.r.t the instruction definition in instruction.rs and code_translator.rs? The patch I just pushed should address all other comments except this one.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 01:38):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 16:15):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 16:15):

akirilov-arm created PR review comment:

What I mean is that no changes would be necessary to instructions.rs. As for code_translator.rs, it would need to do something like:

Operator::F64x2ConvertLowI32x4U => {
  let a = pop1_with_bitcast(state, I32X4, builder);
  let widened_a = builder.ins().uwiden_low(a);

  state.push1(builder.ins().fcvt_from_uint(F64X2, widened_a));
}

Then if a backend needs to do anything special for that combination, it will match the pattern.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 16:18):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2021 at 16:18):

akirilov-arm created PR review comment:

I think you missed a line in the description here.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2021 at 02:46):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2021 at 02:46):

jlb6740 created PR review comment:

Hi @akirilov-arm ok .. I am following the other instructions as an example and don't see the others doing this. What does it mean to call two instruction builders here? I tried the above and it's going to want something implemented for uwiden_low and fcvt_from_uint. Is this what you are suggesting?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2021 at 14:51):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2021 at 14:51):

akirilov-arm created PR review comment:

That's one way to do it, but it's probably going to result in suboptimal code generation. You can check the handling of Operator::V128Load8Splat - it's a really similar case, i.e. we generate 2 separate IR operations (load + splat) for 1 Wasm instruction, and then the backends pattern-match the combination to generate 1 final machine instruction.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 08:12):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 08:12):

jlb6740 created PR review comment:

@akirilov-arm I see the implementation in code_translator.rs, but can you point to the backend lowering of Operator::V128Load8Splat that is using pattern matching?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 12:39):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 12:39):

akirilov-arm created PR review comment:

Sure, here's the AArch64 implementation, and in the x64 backend it seems to be in input_to_reg_mem(), which is used by the Splat implementation.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 22:10):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 23 2021 at 22:10):

jlb6740 created PR review comment:

@akirilov-arm I see a mergable_load call that I assume doesn't apply here? Is it really best to try and merge instructions instead of just mapping it explicitly 1:1 to the wasm instruction like it is now? I don't know that I follow exactly, but I think following your suggestion new logic would be needed in input_to_reg_mem that would get executed for every instruction calling this function. I guess the logic would check if the instruction does a widen and convert? Seems to be less straight forward? What's the advantage of avoiding the 1:1 mapping?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:02):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:02):

akirilov-arm created PR review comment:

I haven't worked on the x64 backend, so I can't make specific implementation suggestions - my guess is that it would be similar to the AArch64 code, i.e. in the implementation of fcvt_from_uint you would check if the input is uwiden_low and act accordingly (input_to_reg_mem() doesn't seem like the right place). My point is that matching is possible and in this case easy because we need to match only one input, not a whole expression tree.

I am working on enabling the i16x8.q15mulr_sat_s and i32x4.trunc_sat_f64x2_s_zero operations, which also happen to be expressible in terms of existing IR operations. However, they result in much more complicated patterns, which are no longer easy to match, so in those cases I would not make the same suggestion.

The main advantage of avoiding the 1:1 mapping is that it reduces the work necessary for all backends to get basic support for the operation - once they implement 2 other operations, which they are going to do anyway because it is required by the Wasm SIMD specification, they get f64x2.convert_low_i32x4_u for free (and the same would apply to f64x2.convert_low_i32x4_s if it is changed similarly). Also, the pattern might occur organically in the Wasm code and would be handled properly. There is probably an argument to be made that enums should be kept as lean as possible, and that other applications that try to generate or manipulate Cranelift IR (say, another compiler frontend or a peephole optimizer) would have an easier job because they will have fewer things to consider, but these are lesser concerns.

With all that said, I am definitely not categorically opposed to adding an IR operation, so perhaps it is best to hear a third opinion.

cc @abrown @bnjbvr @bjorn3 @cfallin @uweigand

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:16):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:16):

cfallin created PR review comment:

@jlb6740 we had some discussion on this topic last year -- see #2278 and #2376.

I think in general, we want to try to limit new instructions we add at the CLIF level to those that are truly necessary; while combo-instructions make it easier to plumb through the new thing and get exactly the semantics you want, they impose ongoing maintenance cost and cost on new backend implementations, as the IR-level instruction set becomes more complex. Also, the operator-combos for which there are efficient lowerings may be different on different architectures; we then end up with the union of all combo-instructions, and while some backends will have efficient lowerings for the ones that were purpose-built for that backend, the other other backends will have to add new logic that could have come from the combination of simpler ops automatically, as @akirilov-arm says. In general, pattern-matching is a good way to implement better lowerings for some combos without taking on this cost, I think.

In this particular case, the general helpers (e.g. input_to_reg_mem()) don't need to change at all; instead there would be some logic in the lowering case for fcvt_from_uint that looks for a uwiden_low. The matches_input() helper should be useful here.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:52):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 00:52):

akirilov-arm created PR review comment:

@cfallin It's probably reasonable to consider a "complexity bound" on the necessary matching, though, as I said - for instance, i32x4.trunc_sat_f64x2_s_zero is equivalent to:

v1 = fcvt_to_sint_sat.i64x2 v0
v2 = vconst.i64x2 0
v3 = snarrow v1, v2

or:

v1 = fcvt_to_sint_sat.i64x2 v0
v2 = iconst.i64 0
v3 = splat.i64x2 v2
v4 = snarrow v1, v3

(and maybe I am forgetting another straightforward way to generate a vector of zeros)
I would lean towards introducing a new IR operation in that case.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 01:08):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 24 2021 at 01:08):

cfallin created PR review comment:

Yes, definitely, the matching has some dynamic cost (in compile time and IR memory usage), so there's a tradeoff. When it's a simple A+B combo op as here, it seems reasonable to me to pattern-match the composition, but we should take it on a case-by-case basis. I'd be curious in your examples whether the lowering can give a more optimized instruction sequence for those 3/4 ops together than what would fall out of the simple lowering, but we can save such discussion for a future PR :-)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:15):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:37):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:43):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:51):

jlb6740 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:51):

jlb6740 created PR review comment:

@akirilov-arm @cfallin Hi .. with some offline help from @cfallin I was able to understand what this should look like in lowering. I used the existing algorithm but will investigate doing it another way that uses fewer instructions. I did not try to refactor any other lowerings such as f64x2.convert_low_i32x4_s since this PR is about implementing f64x2.convert_low_i32x4_u and I figure we want to at least get the others finished before optimizing and refactoring previous instructions, but I do plan to refactor f64x2.convert_low_i32x4_s with a different PR if not just for symmetry. Let me know if there is anything else needed for this PR.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 19:52):

jlb6740 edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 08 2021 at 20:26):

jlb6740 updated PR #2982 from implement_fcvt_low_from_unit to main.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 09 2021 at 02:13):

jlb6740 requested akirilov-arm for a review on PR #2982.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 09 2021 at 13:56):

akirilov-arm submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 09 2021 at 17:39):

jlb6740 merged PR #2982.


Last updated: Jan 24 2025 at 00:11 UTC