Stream: git-wasmtime

Topic: wasmtime / issue #10906 Cranelift: Inconsistent results f...


Wasmtime GitHub notifications bot (Jun 03 2025 at 13:31):

akldc opened issue #10906:

.clif Test Case

test optimize
    set opt_level=none
    set preserve_frame_pointers=true
    set enable_multi_ret_implicit_sret=true
    target x86_64

function %main() -> i16x8 fast {
    const0 = 0x560af419d25ee4ab70b6b8ba64146998

block0:
    v3 = iconst.i8 66
    v4 = iconst.i16 0x6342
    v9 = f64const 0x1.3ebf9685e6a16p-2
    v10 = vconst.i8x16 const0
    v11 = vconst.i16x8 const0
    v12 = vconst.i32x4 const0
    v13 = vconst.f32x4 const0
    v14 = vconst.i64x2 const0
    v15 = vconst.f64x2 const0
    jump block2

block2:
    v23 = scalar_to_vector.i16x8 v4
    return v23
}

; print: %main()

Steps to Reproduce

clif-util run -v ./test1.clif

Results

%main() -> 0x00000000000000000000000000006342

Then add v9 as an additional return value:

test optimize
    set opt_level=none
    set preserve_frame_pointers=true
    set enable_multi_ret_implicit_sret=true
    target x86_64

function %main() -> f64, i16x8 fast {
    const0 = 0x560af419d25ee4ab70b6b8ba64146998

block0:
    v3 = iconst.i8 66
    v4 = iconst.i16 0x6342
    v9 = f64const 0x1.3ebf9685e6a16p-2
    v10 = vconst.i8x16 const0
    v11 = vconst.i16x8 const0
    v12 = vconst.i32x4 const0
    v13 = vconst.f32x4 const0
    v14 = vconst.i64x2 const0
    v15 = vconst.f64x2 const0
    jump block2

block2:
    v23 = scalar_to_vector.i16x8 v4
    return v9, v23
}

; print: %main()

Sometimes the result is [0x1.3ebf9685e6a16p-2, 0x00000000000000000000000000006342],
and other times it’s [0x1.3ebf9685e6a16p-2, 0x000000000000000100000000ffff6342].
Why does this inconsistency happen?
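
To see which lanes differ between the two reported results, it can help to split each 128-bit value into its eight 16-bit lanes. The following is a minimal Python sketch; the helper name `lanes_i16` is made up for illustration and is not part of Cranelift.

```python
def lanes_i16(value: int) -> list:
    """Split a 128-bit integer into eight 16-bit lanes, lane 0 first."""
    return [(value >> (16 * i)) & 0xFFFF for i in range(8)]

# The two results reported above for the i16x8 return value.
expected = 0x00000000000000000000000000006342
observed = 0x000000000000000100000000ffff6342

for name, value in (("expected", expected), ("observed", observed)):
    print(name, [f"{lane:#06x}" for lane in lanes_i16(value)])

# Indices of lanes whose contents differ between the two runs.
differing = [
    i
    for i, (a, b) in enumerate(zip(lanes_i16(expected), lanes_i16(observed)))
    if a != b
]
print("differing lanes:", differing)  # differing lanes: [1, 4]
```

Lane 0 holds 0x6342 in both results; the differences are confined to lanes 1 and 4.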

Wasmtime GitHub notifications bot (Jun 03 2025 at 13:31):

akldc added the bug label to Issue #10906.

Wasmtime GitHub notifications bot (Jun 03 2025 at 13:31):

akldc added the cranelift label to Issue #10906.

Wasmtime GitHub notifications bot (Jun 04 2025 at 21:12):

cfallin commented on issue #10906:

cc @abrown -- looks like a potential nondeterminism issue with a SIMD instruction on x86 -- want to take a look?

Wasmtime GitHub notifications bot (Jun 05 2025 at 23:54):

abrown commented on issue #10906:

I looked at this today. The first example compiles to:

$ cargo run -p cranelift-tools -- compile --target x86_64 before.clif --output before.s
$ objdump -d before.s
Disassembly of section .text:

0000000000000000 <%main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   b8 42 63 00 00          mov    $0x6342,%eax
   9:   66 0f c4 c0 00          pinsrw $0x0,%eax,%xmm0
   e:   48 89 ec                mov    %rbp,%rsp
  11:   5d                      pop    %rbp
  12:   c3                      ret

With the added return value:

$ cargo run -p cranelift-tools -- compile --target x86_64 after.clif --output after.s
$ objdump -d after.s
Disassembly of section .text:

0000000000000000 <%main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   ba 42 63 00 00          mov    $0x6342,%edx
   9:   48 b9 16 6a 5e 68 f9    movabs $0x3fd3ebf9685e6a16,%rcx
  10:   eb d3 3f
  13:   66 48 0f 6e c1          movq   %rcx,%xmm0
  18:   66 0f c4 ca 00          pinsrw $0x0,%edx,%xmm1
  1d:   48 89 ec                mov    %rbp,%rsp
  20:   5d                      pop    %rbp
  21:   c3                      ret
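
As a cross-check on the second disassembly, the movabs immediate should be the IEEE 754 bit pattern of the f64const used for v9, and the instruction bytes should carry it in little-endian order. A short Python sketch using only the standard library:

```python
import struct

# The f64const from the test case, written as a C99-style hex float.
v9 = float.fromhex("0x1.3ebf9685e6a16p-2")

raw = struct.pack("<d", v9)            # 8 bytes, little-endian
bits = int.from_bytes(raw, "little")   # the same bytes as a 64-bit integer

print(hex(bits))     # 0x3fd3ebf9685e6a16
print(raw.hex(" "))  # 16 6a 5e 68 f9 eb d3 3f
```

Both values line up with the listing: the immediate in the movabs and the eight bytes that follow the `48 b9` opcode prefix.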

Wasmtime GitHub notifications bot (Jun 05 2025 at 23:59):

abrown commented on issue #10906:

The ISLE chain appears to be:

(rule (lower (scalar_to_vector src @ (value_type ty)))
      (bitcast_gpr_to_xmm (ty_bits ty) src))
(rule (bitcast_gpr_to_xmm 16 src)
      (x64_pinsrw (xmm_uninit_value) src 0))
(rule 0 (x64_pinsrw src1 src2 lane) (x64_pinsrw_a src1 src2 lane))

That uninitialized XMM could have some extra bits in it?
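
One way to picture how stale bits could leak through: PINSRW overwrites a single 16-bit lane and leaves the rest of the destination register untouched. The Python sketch below models that on 128-bit integers. The `stale` value is a hypothetical leftover register content, chosen so that the output matches the second result from the issue; the actual prior contents of the register are not known from this thread.

```python
MASK128 = (1 << 128) - 1

def pinsrw(dst: int, src: int, lane: int) -> int:
    """Model of x86 PINSRW: replace one 16-bit lane of a 128-bit value."""
    shift = 16 * lane
    cleared = dst & ~(0xFFFF << shift) & MASK128
    return cleared | ((src & 0xFFFF) << shift)

# Hypothetical leftover contents of the destination register (an assumption).
stale = 0x000000000000000100000000ffff0000
zeroed = 0

print(f"{pinsrw(zeroed, 0x6342, 0):#034x}")  # 0x00000000000000000000000000006342
print(f"{pinsrw(stale, 0x6342, 0):#034x}")   # 0x000000000000000100000000ffff6342
```

With a zeroed destination the result is the value from the first run; with leftover bits in the upper lanes, those bits survive the insert.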

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:01):

abrown edited a comment on issue #10906:

That uninitialized XMM could have some extra bits in it? Seems like it should have been zeroed out.

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:05):

cfallin commented on issue #10906:

(rule (bitcast_gpr_to_xmm 16 src)
      (x64_pinsrw (xmm_uninit_value) src 0))

That seems like the issue, unless I'm misunderstanding -- bitcast_gpr_to_xmm should zero the upper lanes, but this is explicitly opting into uninitialized/existing bits in those lanes. It seems we've had this since #9045.

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:29):

abrown commented on issue #10906:

Do we need an x64_pxor of a temp_writable_xmm there, then? (I'm looking around for something like zero_xmm but not finding it.)
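
A sketch of the idea being floated here, in Python rather than ISLE: XOR-ing a register with itself gives zero regardless of its prior contents, so inserting into that zeroed value makes the result independent of whatever the register held before. The function names below are illustrative models, not Cranelift helpers, and this is not necessarily how the actual fix is implemented.

```python
import random

MASK128 = (1 << 128) - 1

def pxor(a: int, b: int) -> int:
    """Model of x86 PXOR on 128-bit values."""
    return (a ^ b) & MASK128

def pinsrw(dst: int, src: int, lane: int) -> int:
    """Model of x86 PINSRW: replace one 16-bit lane of a 128-bit value."""
    shift = 16 * lane
    cleared = dst & ~(0xFFFF << shift) & MASK128
    return cleared | ((src & 0xFFFF) << shift)

def scalar_to_vector_i16_zeroed(src: int, stale: int) -> int:
    """Zero the register first, then insert the scalar into lane 0."""
    zero = pxor(stale, stale)  # always 0, whatever `stale` was
    return pinsrw(zero, src, 0)

# Try many arbitrary "leftover" register contents (hypothetical inputs).
random.seed(0)
results = {
    scalar_to_vector_i16_zeroed(0x6342, random.getrandbits(128))
    for _ in range(1000)
}
print([hex(r) for r in results])  # ['0x6342']
```

Every run collapses to the same value, which is the property the uninitialized version lacks.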

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:41):

alexcrichton commented on issue #10906:

IMO bitcast_gpr_to_xmm is probably fine, insofar as the name alone implies to me that it has the expected behavior. But scalar_to_vector is documented as zeroing all upper lanes, so the bug lies in implementing scalar_to_vector with bitcast_gpr_to_xmm. The lowering of scalar_to_vector for ty_scalar_float also looks wrong because it doesn't zero the upper lanes, so I think the scalar_to_vector lowering may just need some care to fix a few cases. WebAssembly only uses the lowering where the source is a 32-bit or 64-bit integer loaded from memory, which is why I don't think this has come up before.
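
The contract described here (lane 0 holds the scalar, every other lane is zero) can be written as a small check. This is a Python sketch with a hypothetical helper name, applied to the two results from the original report:

```python
def upper_lanes_are_zero(vector: int, lane_bits: int) -> bool:
    """True if every bit above lane 0 of a 128-bit vector is zero."""
    return (vector >> lane_bits) == 0

# The two i16x8 results reported in the issue.
first_result = 0x00000000000000000000000000006342
second_result = 0x000000000000000100000000ffff6342

print(upper_lanes_are_zero(first_result, 16))   # True
print(upper_lanes_are_zero(second_result, 16))  # False
```

The same check would apply to the float lowerings mentioned above by passing a lane width of 32 or 64.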

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:42):

alexcrichton commented on issue #10906:

That said, I think it would be reasonable to document bitcast_gpr_to_xmm as zeroing all the other bits (it certainly helps avoid creating false dependencies). Nevertheless, I think scalar_to_vector for floats still needs improving.

Wasmtime GitHub notifications bot (Jun 06 2025 at 00:49):

cfallin commented on issue #10906:

Yeah, I suppose it depends on which way we define it. I suppose I was reading "..._to_xmm" as meaning "to 128 bits" but one could just as well think about this the same way we think about narrow values in GPRs. I'm fine going either way.

Wasmtime GitHub notifications bot (Jun 06 2025 at 02:52):

abrown commented on issue #10906:

I hadn't read these latest comments prior to #10949; sounds like maybe I should alter the solution there?

Wasmtime GitHub notifications bot (Jun 06 2025 at 14:33):

alexcrichton commented on issue #10906:

Personally I think that's a reasonable fix, but mind leaving a comment on the bitcast_gpr_to_xmm helper that it's defined as zeroing all the upper bits? (which is then why it's suitable for scalar_to_vector)

Wasmtime GitHub notifications bot (Jun 06 2025 at 17:35):

fitzgen closed issue #10906.


Last updated: Dec 06 2025 at 06:05 UTC