Stream: git-wasmtime

Topic: wasmtime / issue #12033 Support load sinking across notra...


Wasmtime GitHub notifications bot (Nov 15 2025 at 13:23):

bjorn3 opened issue #12033:

Feature

For example, when subtracting two mathematical vectors with 3 elements each, like so:

function u0:0(i64 sret, i64, i64) system_v {
block0(v0: i64, v1: i64, v2: i64):
    v3 = load.f64 notrap v1
    v4 = load.f64 notrap v2
    v6 = load.f64 notrap v1+8
    v7 = load.f64 notrap v2+8
    v9 = load.f64 notrap v1+16
    v10 = load.f64 notrap v2+16
    v5 = fsub v3, v4
    store notrap v5, v0
    v8 = fsub v6, v7
    store notrap v8, v0+8
    v11 = fsub v9, v10
    store notrap v11, v0+16
    return
}

Cranelift generates 6 load instructions followed by 3 pairs of sub + store:

0000000000000000 <sub>:
   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp
   4:   f2 0f 10 3e             movsd  xmm7,QWORD PTR [rsi]
   8:   f2 0f 10 2a             movsd  xmm5,QWORD PTR [rdx]
   c:   f2 0f 10 46 08          movsd  xmm0,QWORD PTR [rsi+0x8]
  11:   f2 0f 10 72 08          movsd  xmm6,QWORD PTR [rdx+0x8]
  16:   f2 0f 10 4e 10          movsd  xmm1,QWORD PTR [rsi+0x10]
  1b:   f2 0f 10 52 10          movsd  xmm2,QWORD PTR [rdx+0x10]
  20:   f2 0f 5c fd             subsd  xmm7,xmm5
  24:   f2 0f 11 3f             movsd  QWORD PTR [rdi],xmm7
  28:   f2 0f 5c c6             subsd  xmm0,xmm6
  2c:   f2 0f 11 47 08          movsd  QWORD PTR [rdi+0x8],xmm0
  31:   f2 0f 5c ca             subsd  xmm1,xmm2
  35:   f2 0f 11 4f 10          movsd  QWORD PTR [rdi+0x10],xmm1
  3a:   48 89 f8                mov    rax,rdi
  3d:   48 89 ec                mov    rsp,rbp
  40:   5d                      pop    rbp
  41:   c3                      ret

while LLVM is able to sink half of the loads into the subsd instructions themselves, even at -O0:

  sub:
        mov     rax, rdi
        movsd   xmm2, qword ptr [rsi]
        subsd   xmm2, qword ptr [rdx]
        movsd   xmm1, qword ptr [rsi + 8]
        subsd   xmm1, qword ptr [rdx + 8]
        movsd   xmm0, qword ptr [rsi + 16]
        subsd   xmm0, qword ptr [rdx + 16]
        movsd   qword ptr [rdi], xmm2
        movsd   qword ptr [rdi + 8], xmm1
        movsd   qword ptr [rdi + 16], xmm0
        ret

Cranelift is not entirely incapable of load sinking, as seen with a dot product, where it does sink a single load:

function u0:0(i64, i64) -> f64 system_v {
block0(v0: i64, v1: i64):
    v3 = load.f64 notrap v0
    v4 = load.f64 notrap v1
    v6 = load.f64 notrap v0+8
    v7 = load.f64 notrap v1+8
    v10 = load.f64 notrap v0+16
    v11 = load.f64 notrap v1+16
    v5 = fmul v3, v4
    v8 = fmul v6, v7
    v9 = fadd v5, v8
    v12 = fmul v10, v11
    v13 = fadd v9, v12
    return v13
}

0000000000000000 <dot>:
   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp
   4:   f2 0f 10 07             movsd  xmm0,QWORD PTR [rdi]
   8:   f2 0f 10 2e             movsd  xmm5,QWORD PTR [rsi]
   c:   f2 0f 10 4f 08          movsd  xmm1,QWORD PTR [rdi+0x8]
  11:   f2 0f 10 76 08          movsd  xmm6,QWORD PTR [rsi+0x8]
  16:   f2 0f 10 57 10          movsd  xmm2,QWORD PTR [rdi+0x10]
  1b:   f2 0f 59 c5             mulsd  xmm0,xmm5
  1f:   f2 0f 59 ce             mulsd  xmm1,xmm6
  23:   f2 0f 58 c1             addsd  xmm0,xmm1
  27:   f2 0f 59 56 10          mulsd  xmm2,QWORD PTR [rsi+0x10]
  2c:   f2 0f 58 c2             addsd  xmm0,xmm2
  30:   48 89 ec                mov    rsp,rbp
  33:   5d                      pop    rbp
  34:   c3                      ret

but again, even LLVM at -O0 sinks all 3 possible loads:

dot:
        mov     qword ptr [rsp - 8], rdi
        movsd   xmm0, qword ptr [rdi]
        mulsd   xmm0, qword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 8]
        mulsd   xmm1, qword ptr [rsi + 8]
        addsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rdi + 16]
        mulsd   xmm1, qword ptr [rsi + 16]
        addsd   xmm0, xmm1
        ret

These examples are taken from https://github.com/ebobby/simple-raytracer/blob/496b6164b9f16250f99b91327da8f01acc1e3534/src/vector.rs compiled with both cg_clif (-Copt-level=3) and cg_llvm (-Copt-level=0).
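
For orientation, here is a minimal sketch of the kind of Rust source that could lower to the two CLIF functions above. The type and method names (`Vector3`, `sub`, `dot`) are assumptions made for illustration; the linked vector.rs may be organized differently.

```rust
/// Hypothetical 3-component vector; the real type in simple-raytracer may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vector3 {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

impl Vector3 {
    /// Component-wise subtraction: six f64 loads, three subtractions and
    /// three stores into the returned value.
    pub fn sub(&self, other: &Vector3) -> Vector3 {
        Vector3 {
            x: self.x - other.x,
            y: self.y - other.y,
            z: self.z - other.z,
        }
    }

    /// Dot product: six f64 loads, three multiplications and two additions.
    pub fn dot(&self, other: &Vector3) -> f64 {
        self.x * other.x + self.y * other.y + self.z * other.z
    }
}
```

In both methods each subtraction or multiplication has one operand that is used exactly once and comes straight from memory, which is the situation where an x64 backend can fold the load into `subsd` or `mulsd`.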

Benefit

Improves runtime performance.

Implementation

I think this is caused by get_value_as_source_or_const considering loads as having side-effects even when they are notrap.

Wasmtime GitHub notifications bot (Nov 15 2025 at 13:24):

bjorn3 edited issue #12033:

Wasmtime GitHub notifications bot (Nov 26 2025 at 17:28):

alexcrichton added the cranelift:area:x64 label to Issue #12033.

Wasmtime GitHub notifications bot (Dec 03 2025 at 19:17):

cfallin commented on issue #12033:

Thanks for filing this!

The main difficulty with doing this today is that our instruction-coloring pass computes colors once and advances the color at every load and store; in essence, this is like building a spine of dependency edges between all adjacent memory ops to keep them all in the same order.

The desired optimization output does actually have the ops in the same order, but we need to update instruction colors as we sink to see that. That's maybe possible with a little more complexity but we'd have to think carefully about it.
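
As a rough illustration of the coloring idea described above, the following toy model assigns each instruction an entry color equal to the number of memory ops before it, and lets a load merge into its consumer only when no other memory op sits between them. This is a simplified sketch, not Cranelift's implementation: the `Kind`, `Inst`, `entry_colors` and `sinkable_loads` names are invented here, every load is assumed to have a single use, only the consumer's last operand is considered (as with `subsd xmm, m64`), and the color updates during sinking mentioned in the comment are not modeled.

```rust
/// Toy instruction kinds; `Load` and `Store` are the memory operations.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Load,
    Store,
    Pure,
}

/// One instruction: its kind plus the indices of the instructions it consumes.
struct Inst {
    kind: Kind,
    args: Vec<usize>,
}

fn inst(kind: Kind, args: &[usize]) -> Inst {
    Inst { kind, args: args.to_vec() }
}

/// Entry color of each instruction: the number of memory ops that precede it.
fn entry_colors(insts: &[Inst]) -> Vec<u32> {
    let mut color = 0;
    insts
        .iter()
        .map(|i| {
            let entry = color;
            if i.kind != Kind::Pure {
                color += 1; // every load and store starts a new color
            }
            entry
        })
        .collect()
}

/// Loads that could merge into their consumer: the load must be the
/// consumer's last operand and no other memory op may sit between the two.
fn sinkable_loads(insts: &[Inst]) -> Vec<usize> {
    let colors = entry_colors(insts);
    let mut out = Vec::new();
    for (idx, i) in insts.iter().enumerate() {
        if i.kind != Kind::Pure {
            continue;
        }
        if let Some(&src) = i.args.last() {
            if insts[src].kind == Kind::Load && colors[src] + 1 == colors[idx] {
                out.push(src);
            }
        }
    }
    out
}

/// The dot-product function from the issue: six loads, then the arithmetic.
fn dot_example() -> Vec<Inst> {
    use Kind::*;
    vec![
        inst(Load, &[]),     // 0: v3
        inst(Load, &[]),     // 1: v4
        inst(Load, &[]),     // 2: v6
        inst(Load, &[]),     // 3: v7
        inst(Load, &[]),     // 4: v10
        inst(Load, &[]),     // 5: v11
        inst(Pure, &[0, 1]), // 6: v5 = fmul v3, v4
        inst(Pure, &[2, 3]), // 7: v8 = fmul v6, v7
        inst(Pure, &[6, 7]), // 8: v9 = fadd v5, v8
        inst(Pure, &[4, 5]), // 9: v12 = fmul v10, v11
        inst(Pure, &[8, 9]), // 10: v13 = fadd v9, v12
    ]
}

/// The vector-subtraction function from the issue.
fn sub_example() -> Vec<Inst> {
    use Kind::*;
    vec![
        inst(Load, &[]),     // 0: v3
        inst(Load, &[]),     // 1: v4
        inst(Load, &[]),     // 2: v6
        inst(Load, &[]),     // 3: v7
        inst(Load, &[]),     // 4: v9
        inst(Load, &[]),     // 5: v10
        inst(Pure, &[0, 1]), // 6: v5 = fsub v3, v4
        inst(Store, &[6]),   // 7: store v5
        inst(Pure, &[2, 3]), // 8: v8 = fsub v6, v7
        inst(Store, &[8]),   // 9: store v8
        inst(Pure, &[4, 5]), // 10: v11 = fsub v9, v10
        inst(Store, &[10]),  // 11: store v11
    ]
}
```

Under these assumptions `sinkable_loads(&dot_example())` reports only the last load (v11), and `sinkable_loads(&sub_example())` reports none, which lines up with the Cranelift disassembly in the issue: every other load has at least one later memory op between it and its consumer.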

(Note that coloring happens separately from alias analysis and works to keep all side-effects in order; if we were to unify the two, it would probably be best done via Nick's proposal in #10427)

Wasmtime GitHub notifications bot (Dec 03 2025 at 19:26):

bjorn3 commented on issue #12033:

When only notrap loads are involved, it doesn't matter whether the order of the loads stays the same. It only matters that their order doesn't change relative to stores and to loads that may trap.

Wasmtime GitHub notifications bot (Dec 03 2025 at 19:28):

cfallin commented on issue #12033:

Yes, that's what makes the transform possible, I agree. My description is how lowering works today, so we will need to update the instruction coloring algorithm as noted.

Wasmtime GitHub notifications bot (Dec 03 2025 at 19:29):

cfallin commented on issue #12033:

(The reason that notrap loads still participate in coloring is that they should not be moved across stores, and moving across stores is governed today by coloring, not alias analysis.)
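
To make the constraint discussed in this exchange concrete, here is a toy sketch of a relaxed rule in which only stores and possibly-trapping loads start a new color, so notrap loads may be reordered among themselves but never cross a store. This is a hypothetical model for illustration only, not a proposed patch: the `Kind`, `Inst`, `bumps_color` and `sinkable_loads` names are invented, every load is assumed to have a single use, and only the consumer's last operand is considered for merging.

```rust
/// Toy instruction kinds. `notrap` mirrors the CLIF flag of the same name.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Load { notrap: bool },
    Store,
    Pure,
}

/// One instruction: its kind plus the indices of the instructions it consumes.
struct Inst {
    kind: Kind,
    args: Vec<usize>,
}

fn inst(kind: Kind, args: &[usize]) -> Inst {
    Inst { kind, args: args.to_vec() }
}

/// Hypothetical relaxed rule: only stores and possibly-trapping loads
/// start a new color; notrap loads stay in the current one.
fn bumps_color(kind: Kind) -> bool {
    matches!(kind, Kind::Store | Kind::Load { notrap: false })
}

/// Loads that could merge into their consumer under the relaxed rule.
fn sinkable_loads(insts: &[Inst]) -> Vec<usize> {
    // Color on entry to each instruction and color right after it.
    let mut color = 0u32;
    let mut entry = Vec::new();
    let mut exit = Vec::new();
    for i in insts {
        entry.push(color);
        if bumps_color(i.kind) {
            color += 1;
        }
        exit.push(color);
    }
    let mut out = Vec::new();
    for (idx, i) in insts.iter().enumerate() {
        if i.kind != Kind::Pure {
            continue;
        }
        if let Some(&src) = i.args.last() {
            let is_load = matches!(insts[src].kind, Kind::Load { .. });
            // Same color means no store or trapping load lies in between.
            if is_load && exit[src] == entry[idx] {
                out.push(src);
            }
        }
    }
    out
}

/// Six notrap loads (indices 0..=5) followed by `rest`.
fn after_six_loads(rest: Vec<Inst>) -> Vec<Inst> {
    let mut insts: Vec<Inst> = (0..6)
        .map(|_| inst(Kind::Load { notrap: true }, &[]))
        .collect();
    insts.extend(rest);
    insts
}

/// Arithmetic of the dot-product function from the issue.
fn dot_example() -> Vec<Inst> {
    use Kind::Pure;
    after_six_loads(vec![
        inst(Pure, &[0, 1]), // 6: v5 = fmul v3, v4
        inst(Pure, &[2, 3]), // 7: v8 = fmul v6, v7
        inst(Pure, &[6, 7]), // 8: v9 = fadd v5, v8
        inst(Pure, &[4, 5]), // 9: v12 = fmul v10, v11
        inst(Pure, &[8, 9]), // 10: v13 = fadd v9, v12
    ])
}

/// Arithmetic and stores of the vector-subtraction function from the issue.
fn sub_example() -> Vec<Inst> {
    use Kind::{Pure, Store};
    after_six_loads(vec![
        inst(Pure, &[0, 1]), // 6: v5 = fsub v3, v4
        inst(Store, &[6]),   // 7: store v5
        inst(Pure, &[2, 3]), // 8: v8 = fsub v6, v7
        inst(Store, &[8]),   // 9: store v8
        inst(Pure, &[4, 5]), // 10: v11 = fsub v9, v10
        inst(Store, &[10]),  // 11: store v11
    ])
}
```

In this toy model the dot product gets all three second-operand loads (v4, v7, v11) merged, matching the LLVM output in the issue. For the subtraction only v4 merges, because in the CLIF as written the later `fsub`s come after a store, and merging v7 or v10 into them would move a load across that store.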


Last updated: Dec 06 2025 at 06:05 UTC