akldc opened issue #10906:
.clif Test Case

    test optimize
    set opt_level=none
    set preserve_frame_pointers=true
    set enable_multi_ret_implicit_sret=true
    target x86_64

    function %main() -> i16x8 fast {
        const0 = 0x560af419d25ee4ab70b6b8ba64146998

    block0:
        v3 = iconst.i8 66
        v4 = iconst.i16 0x6342
        v9 = f64const 0x1.3ebf9685e6a16p-2
        v10 = vconst.i8x16 const0
        v11 = vconst.i16x8 const0
        v12 = vconst.i32x4 const0
        v13 = vconst.f32x4 const0
        v14 = vconst.i64x2 const0
        v15 = vconst.f64x2 const0
        jump block2

    block2:
        v23 = scalar_to_vector.i16x8 v4
        return v23
    }

    ; print: %main()

Steps to Reproduce

    clif-util run -v ./test1.clif

Results

    %main() -> 0x00000000000000000000000000006342

Then add a return value v9.

    test optimize
    set opt_level=none
    set preserve_frame_pointers=true
    set enable_multi_ret_implicit_sret=true
    target x86_64

    function %main() -> f64, i16x8 fast {
        const0 = 0x560af419d25ee4ab70b6b8ba64146998

    block0:
        v3 = iconst.i8 66
        v4 = iconst.i16 0x6342
        v9 = f64const 0x1.3ebf9685e6a16p-2
        v10 = vconst.i8x16 const0
        v11 = vconst.i16x8 const0
        v12 = vconst.i32x4 const0
        v13 = vconst.f32x4 const0
        v14 = vconst.i64x2 const0
        v15 = vconst.f64x2 const0
        jump block2

    block2:
        v23 = scalar_to_vector.i16x8 v4
        return v9, v23
    }

    ; print: %main()

Sometimes the result is

    [0x1.3ebf9685e6a16p-2, 0x00000000000000000000000000006342]

and other times it's

    [0x1.3ebf9685e6a16p-2, 0x000000000000000100000000ffff6342]

Why does this inconsistency happen?
akldc added the bug label to Issue #10906.
akldc added the cranelift label to Issue #10906.
cfallin commented on issue #10906:
cc @abrown -- looks like a potential nondeterminism issue with a SIMD instruction on x86 -- want to take a look?
abrown commented on issue #10906:
I looked at this today. The first example compiles to:
    $ cargo run -p cranelift-tools -- compile --target x86_64 before.clif --output before.s
    $ objdump -d before.s

    Disassembly of section .text:

    0000000000000000 <%main>:
       0: 55                       push   %rbp
       1: 48 89 e5                 mov    %rsp,%rbp
       4: b8 42 63 00 00           mov    $0x6342,%eax
       9: 66 0f c4 c0 00           pinsrw $0x0,%eax,%xmm0
       e: 48 89 ec                 mov    %rbp,%rsp
      11: 5d                       pop    %rbp
      12: c3                       ret

With the added return value:

    $ cargo run -p cranelift-tools -- compile --target x86_64 after.clif --output after.s
    $ objdump -d after.s

    Disassembly of section .text:

    0000000000000000 <%main>:
       0: 55                       push   %rbp
       1: 48 89 e5                 mov    %rsp,%rbp
       4: ba 42 63 00 00           mov    $0x6342,%edx
       9: 48 b9 16 6a 5e 68 f9     movabs $0x3fd3ebf9685e6a16,%rcx
      10: eb d3 3f
      13: 66 48 0f 6e c1           movq   %rcx,%xmm0
      18: 66 0f c4 ca 00           pinsrw $0x0,%edx,%xmm1
      1d: 48 89 ec                 mov    %rbp,%rsp
      20: 5d                       pop    %rbp
      21: c3                       ret
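For readers following the second listing: the `movabs` immediate is the raw IEEE 754 bit pattern of the `f64const` from the test case, which `movq` then places in `%xmm0`, so the vector result is built in `%xmm1` instead. The sketch below decodes that immediate to confirm the correspondence. The helper name `decode_f64` is made up for this illustration and is not part of Cranelift.

```rust
/// Split an IEEE 754 double into (sign, unbiased exponent, 52-bit mantissa).
/// Hypothetical helper for illustration only.
fn decode_f64(bits: u64) -> (u64, i64, u64) {
    let sign = bits >> 63;
    // The exponent field is 11 bits wide and biased by 1023.
    let exponent = ((bits >> 52) & 0x7ff) as i64 - 1023;
    let mantissa = bits & ((1u64 << 52) - 1);
    (sign, exponent, mantissa)
}

fn main() {
    // Immediate loaded by `movabs` in the listing with two return values.
    let (sign, exponent, mantissa) = decode_f64(0x3fd3_ebf9_685e_6a16);
    // Matches 0x1.3ebf9685e6a16p-2: positive, exponent -2, same mantissa digits.
    println!("sign={sign} exponent={exponent} mantissa={mantissa:#x}");
}
```
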
abrown commented on issue #10906:
The ISLE chain appears to be:
    (rule (lower (scalar_to_vector src @ (value_type ty)))
          (bitcast_gpr_to_xmm (ty_bits ty) src))

    (rule (bitcast_gpr_to_xmm 16 src)
          (x64_pinsrw (xmm_uninit_value) src 0))

    (rule 0 (x64_pinsrw src1 src2 lane)
          (x64_pinsrw_a src1 src2 lane))

That uninitialized XMM could have some extra bits in it?
abrown edited a comment on issue #10906:
The ISLE chain appears to be:
    (rule (lower (scalar_to_vector src @ (value_type ty)))
          (bitcast_gpr_to_xmm (ty_bits ty) src))

    (rule (bitcast_gpr_to_xmm 16 src)
          (x64_pinsrw (xmm_uninit_value) src 0))

    (rule 0 (x64_pinsrw src1 src2 lane)
          (x64_pinsrw_a src1 src2 lane))

That uninitialized XMM could have some extra bits in it? Seems like it should have been zeroed out.
cfallin commented on issue #10906:
    (rule (bitcast_gpr_to_xmm 16 src)
          (x64_pinsrw (xmm_uninit_value) src 0))

That seems like the issue, unless I'm misunderstanding -- bitcast_gpr_to_xmm should zero the upper lanes, but this is explicitly opting into uninitialized/existing bits in those lanes. It seems we've had this since #9045.
abrown commented on issue #10906:
Do we need an x64_pxor of a temp_writable_xmm there, then? (I'm looking around for something like zero_xmm but not finding it).
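To make the suspected mechanism concrete, here is a small software model of `pinsrw` operating on a 128-bit register image. It is only a sketch: the function names are invented for this illustration, and the stale register contents are an assumed value chosen so that the output lines up with the inconsistent result in the report. The real contents of the register at that point are not known from this thread.

```rust
/// Model of x86 `pinsrw`: overwrite one 16-bit lane of a 128-bit register
/// and leave every other bit of `dst` unchanged.
fn pinsrw(dst: u128, src: u32, lane: u32) -> u128 {
    let shift = 16 * (lane & 7);
    let mask = 0xffff_u128 << shift;
    (dst & !mask) | (((src & 0xffff) as u128) << shift)
}

/// Models inserting into an uninitialized register: the upper lanes keep
/// whatever `stale` happened to hold. Hypothetical name.
fn scalar_to_vector_uninit(stale: u128, src: u32) -> u128 {
    pinsrw(stale, src, 0)
}

/// Models the suggested shape of a fix: start from an all-zero register
/// (what `pxor reg, reg` produces), then insert. Hypothetical name.
fn scalar_to_vector_zeroed(src: u32) -> u128 {
    pinsrw(0, src, 0)
}

fn main() {
    // Assumed leftover register contents, picked for illustration only.
    let stale: u128 = 0x0000_0000_0000_0001_0000_0000_ffff_0000;
    // prints 0x000000000000000100000000ffff6342
    println!("{:#034x}", scalar_to_vector_uninit(stale, 0x6342));
    // prints 0x00000000000000000000000000006342
    println!("{:#034x}", scalar_to_vector_zeroed(0x6342));
}
```

In this model the zeroed variant no longer depends on prior register contents, which is the property the nondeterministic test output is missing.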
alexcrichton commented on issue #10906:
IMO bitcast_gpr_to_xmm is probably fine insofar that the name alone implies to me that it's got the expected behavior, but scalar_to_vector is documented as zeroing all upper lanes and so the bug lies in implementing scalar_to_vector with bitcast_gpr_to_xmm. The lowering of scalar_to_vector for ty_scalar_float also looks wrong because it's not zeroing the upper lanes, so I think that the scalar_to_vector instruction lowering may just need some love and care to fix some cases. WebAssembly only uses the lowering where the source is a 32-bit or 64-bit integer loaded from memory which is why I don't think this has come up before.
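The float case can be pictured the same way. The sketch below assumes a scalar `f32` lives in the low 32 bits of a 128-bit register whose upper bits are unspecified, and shows the masking step that the documented "zero all upper lanes" behavior implies. The helper name and the garbage value are assumptions made for illustration, not Cranelift code.

```rust
/// Keep only the low `bits` bits of a 128-bit register image and zero the rest.
/// Hypothetical helper modeling "zero all upper lanes".
fn zero_upper_lanes(reg: u128, bits: u32) -> u128 {
    assert!(bits > 0 && bits < 128);
    reg & ((1u128 << bits) - 1)
}

fn main() {
    let x: f32 = 1.5;
    // Assumed register image: `x` in lane 0 with leftover garbage above it.
    let reg = (0xdead_beef_u128 << 96) | x.to_bits() as u128;
    // Reinterpreting `reg` as f32x4 without masking would expose the garbage;
    // masking leaves only lane 0 populated.
    let v = zero_upper_lanes(reg, 32);
    assert_eq!(v, x.to_bits() as u128);
    println!("{v:#034x}");
}
```
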
alexcrichton commented on issue #10906:
Although that being said I think it would be reasonable to document bitcast_gpr_to_xmm as zeroing all the other bits (it certainly helps to avoid creating false dependencies). Nevertheless I think scalar_to_vector for floats still needs improving.
cfallin commented on issue #10906:
Yeah, I suppose it depends on which way we define it. I suppose I was reading "..._to_xmm" as meaning "to 128 bits" but one could just as well think about this the same way we think about narrow values in GPRs. I'm fine going either way.
abrown commented on issue #10906:
I hadn't read these latest comments prior to #10949; sounds like maybe I should alter the solution there?
alexcrichton commented on issue #10906:
Personally I think that's a reasonable fix, but mind leaving a comment on the bitcast_gpr_to_xmm helper that it's defined as zeroing all the upper bits? (which is then why it's suitable for scalar_to_vector)
fitzgen closed issue #10906.
Last updated: Dec 06 2025 at 06:05 UTC