bjorn3 opened issue #12033:
Feature
For example for subtracting 2 mathematical vectors with 3 elements like so:

    function u0:0(i64 sret, i64, i64) system_v {
    block0(v0: i64, v1: i64, v2: i64):
        v3 = load.f64 notrap v1
        v4 = load.f64 notrap v2
        v6 = load.f64 notrap v1+8
        v7 = load.f64 notrap v2+8
        v9 = load.f64 notrap v1+16
        v10 = load.f64 notrap v2+16
        v5 = fsub v3, v4
        store notrap v5, v0
        v8 = fsub v6, v7
        store notrap v8, v0+8
        v11 = fsub v9, v10
        store notrap v11, v0+16
        return
    }

6 load instructions will be generated followed by 3 pairs of sub + store:

    0000000000000000 <sub>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   f2 0f 10 3e             movsd  xmm7,QWORD PTR [rsi]
       8:   f2 0f 10 2a             movsd  xmm5,QWORD PTR [rdx]
       c:   f2 0f 10 46 08          movsd  xmm0,QWORD PTR [rsi+0x8]
      11:   f2 0f 10 72 08          movsd  xmm6,QWORD PTR [rdx+0x8]
      16:   f2 0f 10 4e 10          movsd  xmm1,QWORD PTR [rsi+0x10]
      1b:   f2 0f 10 52 10          movsd  xmm2,QWORD PTR [rdx+0x10]
      20:   f2 0f 5c fd             subsd  xmm7,xmm5
      24:   f2 0f 11 3f             movsd  QWORD PTR [rdi],xmm7
      28:   f2 0f 5c c6             subsd  xmm0,xmm6
      2c:   f2 0f 11 47 08          movsd  QWORD PTR [rdi+0x8],xmm0
      31:   f2 0f 5c ca             subsd  xmm1,xmm2
      35:   f2 0f 11 4f 10          movsd  QWORD PTR [rdi+0x10],xmm1
      3a:   48 89 f8                mov    rax,rdi
      3d:   48 89 ec                mov    rsp,rbp
      40:   5d                      pop    rbp
      41:   c3                      ret

while LLVM is able to sink half the loads into the subsd instructions themself even with -O0:

    sub:
        mov     rax, rdi
        movsd   xmm2, qword ptr [rsi]
        subsd   xmm2, qword ptr [rdx]
        movsd   xmm1, qword ptr [rsi + 8]
        subsd   xmm1, qword ptr [rdx + 8]
        movsd   xmm0, qword ptr [rsi + 16]
        subsd   xmm0, qword ptr [rdx + 16]
        movsd   qword ptr [rdi], xmm2
        movsd   qword ptr [rdi + 8], xmm1
        movsd   qword ptr [rdi + 16], xmm0
        ret

Cranelift is not entirely incapable of load sinking as seen for a dot product where it does load sink a single load:

    function u0:0(i64, i64) -> f64 system_v {
    block0(v0: i64, v1: i64):
        v3 = load.f64 notrap v0
        v4 = load.f64 notrap v1
        v6 = load.f64 notrap v0+8
        v7 = load.f64 notrap v1+8
        v10 = load.f64 notrap v0+16
        v11 = load.f64 notrap v1+16
        v5 = fmul v3, v4
        v8 = fmul v6, v7
        v9 = fadd v5, v8
        v12 = fmul v10, v11
        v13 = fadd v9, v12
        return v13
    }

    0000000000000000 <dot>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   f2 0f 10 07             movsd  xmm0,QWORD PTR [rdi]
       8:   f2 0f 10 2e             movsd  xmm5,QWORD PTR [rsi]
       c:   f2 0f 10 4f 08          movsd  xmm1,QWORD PTR [rdi+0x8]
      11:   f2 0f 10 76 08          movsd  xmm6,QWORD PTR [rsi+0x8]
      16:   f2 0f 10 57 10          movsd  xmm2,QWORD PTR [rdi+0x10]
      1b:   f2 0f 59 c5             mulsd  xmm0,xmm5
      1f:   f2 0f 59 ce             mulsd  xmm1,xmm6
      23:   f2 0f 58 c1             addsd  xmm0,xmm1
      27:   f2 0f 59 56 10          mulsd  xmm2,QWORD PTR [rsi+0x10]
      2c:   f2 0f 58 c2             addsd  xmm0,xmm2
      30:   48 89 ec                mov    rsp,rbp
      33:   5d                      pop    rbp
      34:   c3                      ret

but again even LLVM -O0 will load sink all 3 possible loads:

    dot:
        mov     qword ptr [rsp - 8], rdi
        movsd   xmm0, qword ptr [rdi]
        mulsd   xmm0, qword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 8]
        mulsd   xmm1, qword ptr [rsi + 8]
        addsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rdi + 16]
        mulsd   xmm1, qword ptr [rsi + 16]
        addsd   xmm0, xmm1
        ret

These examples are taken from https://github.com/ebobby/simple-raytracer/blob/496b6164b9f16250f99b91327da8f01acc1e3534/src/vector.rs compiled with both cg_clif (-Copt-level=3) and cg_llvm (-Copt-level=0).
Benefit
Improves runtime performance.
Implementation
I think this is caused by get_value_as_source_or_const considering loads as having side-effects even when they are notrap.
Alternatives
TODO: What are the alternative implementation approaches or alternative ways to
solve the problem that this feature would solve? How do these alternatives
compare to this proposal?
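For context, the sub and dot functions in the issue are the kind of code produced by a small vector type like the one sketched here. This is an illustrative assumption, not a copy of the linked vector.rs: the type name Vector3, its field names, and the repr(C) attribute are hypothetical and chosen only so that the three f64 fields sit at offsets 0, 8 and 16, as in the IR.

```rust
use std::ops::Sub;

/// Hypothetical 3-element vector. `repr(C)` fixes the field order so the
/// fields sit at byte offsets 0, 8 and 16, like the loads in the IR.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
pub struct Vector3 {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

impl Sub for Vector3 {
    type Output = Vector3;

    /// Element-wise subtraction: three loads per operand, three fsub,
    /// three stores into the result.
    fn sub(self, rhs: Vector3) -> Vector3 {
        Vector3 {
            x: self.x - rhs.x,
            y: self.y - rhs.y,
            z: self.z - rhs.z,
        }
    }
}

impl Vector3 {
    /// Dot product, evaluated as (x*x' + y*y') + z*z', the same shape as
    /// the fmul/fadd chain in the second IR listing.
    pub fn dot(&self, other: &Vector3) -> f64 {
        self.x * other.x + self.y * other.y + self.z * other.z
    }
}
```

In both functions every right-hand operand is loaded from memory and used exactly once, which is what makes it a candidate for folding into the arithmetic instruction.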
alexcrichton added the cranelift:area:x64 label to Issue #12033.
cfallin commented on issue #12033:
Thanks for filing this!
The main difficulty with doing this today is that our instruction-coloring pass computes colors once, and updates colors on all loads and stores; in essence this is like building a spine of dependency edges between all adjacent memory ops to keep them all in the same order.
The desired optimization output does actually have the ops in the same order, but we need to update instruction colors as we sink to see that. That's maybe possible with a little more complexity but we'd have to think carefully about it.
(Note that coloring happens separately from alias analysis and works to keep all side-effects in order; if we were to unify the two, it would probably best be done via Nick's proposal in #10427)
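The coloring behaviour described in this comment can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Cranelift's actual lowering code: the Inst enum and the functions entry_colors, count_sinkable, vector_sub and dot_product are hypothetical names, every load and store is assumed to bump the color, each load is assumed to have a single use, and only the right-hand operand is considered for merging.

```rust
/// Toy IR with just enough detail to model the two functions in the issue.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Inst {
    /// Load that defines value `dst`.
    Load { dst: u32 },
    /// Store (operands do not matter for coloring).
    Store,
    /// Pure arithmetic whose right-hand operand is value `rhs`.
    Arith { rhs: u32 },
}

/// Color on entry to each instruction. The color is computed once and is
/// bumped after every load and store, so memory ops form an ordered chain.
pub fn entry_colors(insts: &[Inst]) -> Vec<u32> {
    let mut color = 0;
    let mut colors = Vec::with_capacity(insts.len());
    for inst in insts {
        colors.push(color);
        if matches!(inst, Inst::Load { .. } | Inst::Store) {
            color += 1;
        }
    }
    colors
}

/// Count arithmetic ops whose right-hand load could be merged: the color
/// just after the load must equal the color on entry to the user, meaning
/// no other load or store sits between them.
pub fn count_sinkable(insts: &[Inst]) -> usize {
    let colors = entry_colors(insts);
    let mut count = 0;
    for (use_idx, inst) in insts.iter().enumerate() {
        if let Inst::Arith { rhs } = *inst {
            let def = insts.iter().position(|d| *d == Inst::Load { dst: rhs });
            if let Some(def_idx) = def {
                if colors[def_idx] + 1 == colors[use_idx] {
                    count += 1;
                }
            }
        }
    }
    count
}

/// The vector subtraction from the issue (value numbers match the IR).
pub fn vector_sub() -> Vec<Inst> {
    let load = |dst| Inst::Load { dst };
    vec![
        load(3), load(4), load(6), load(7), load(9), load(10),
        Inst::Arith { rhs: 4 }, Inst::Store,  // v5 = fsub v3, v4
        Inst::Arith { rhs: 7 }, Inst::Store,  // v8 = fsub v6, v7
        Inst::Arith { rhs: 10 }, Inst::Store, // v11 = fsub v9, v10
    ]
}

/// The dot product from the issue.
pub fn dot_product() -> Vec<Inst> {
    let load = |dst| Inst::Load { dst };
    vec![
        load(3), load(4), load(6), load(7), load(10), load(11),
        Inst::Arith { rhs: 4 },  // v5 = fmul v3, v4
        Inst::Arith { rhs: 7 },  // v8 = fmul v6, v7
        Inst::Arith { rhs: 8 },  // v9 = fadd v5, v8
        Inst::Arith { rhs: 11 }, // v12 = fmul v10, v11
        Inst::Arith { rhs: 12 }, // v13 = fadd v9, v12
    ]
}
```

In this model count_sinkable(&vector_sub()) is 0 and count_sinkable(&dot_product()) is 1, which lines up with the disassembly in the issue: only the load of v11 is directly followed by pure instructions up to its user.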
bjorn3 commented on issue #12033:
When only notrap loads are involved, it doesn't matter whether the order of the loads stays the same. It only matters that their order doesn't change relative to loads that may trap and to stores.
cfallin commented on issue #12033:
Yes, that's what makes the transform possible, I agree. My description is how lowering works today, so we will need to update the instruction coloring algorithm as noted.
cfallin commented on issue #12033:
(The reason that notrap loads still participate in coloring is that they should not be moved across stores; and moving across stores is governed today by coloring, not alias analysis)
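The rule discussed in the last few comments (notrap loads may be reordered among themselves, but not across stores or loads that may trap) can be sketched as a barrier check. Again this is a toy model under assumptions, not a proposed patch: Inst, is_barrier, count_sinkable_relaxed, vector_sub and dot_product are hypothetical names, each load is assumed to have a single use, only the right-hand operand is considered, and the merged instruction is assumed to stay at the position of the arithmetic op.

```rust
/// Toy IR; loads carry a `notrap` flag like the loads in the issue.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Inst {
    Load { dst: u32, notrap: bool },
    Store,
    /// Pure arithmetic whose right-hand operand is value `rhs`.
    Arith { rhs: u32 },
}

/// An instruction that a notrap load must not be moved across:
/// any store, and any load that may trap.
fn is_barrier(inst: &Inst) -> bool {
    match inst {
        Inst::Store => true,
        Inst::Load { notrap, .. } => !*notrap,
        Inst::Arith { .. } => false,
    }
}

/// Count arithmetic ops whose right-hand operand is a notrap load with no
/// barrier between the load and its user.
pub fn count_sinkable_relaxed(insts: &[Inst]) -> usize {
    let mut count = 0;
    for (use_idx, inst) in insts.iter().enumerate() {
        if let Inst::Arith { rhs } = *inst {
            let def = insts[..use_idx].iter().position(
                |d| matches!(*d, Inst::Load { dst, notrap: true } if dst == rhs),
            );
            if let Some(def_idx) = def {
                if !insts[def_idx + 1..use_idx].iter().any(is_barrier) {
                    count += 1;
                }
            }
        }
    }
    count
}

/// The vector subtraction from the issue (value numbers match the IR).
pub fn vector_sub() -> Vec<Inst> {
    let load = |dst| Inst::Load { dst, notrap: true };
    vec![
        load(3), load(4), load(6), load(7), load(9), load(10),
        Inst::Arith { rhs: 4 }, Inst::Store,  // v5 = fsub v3, v4
        Inst::Arith { rhs: 7 }, Inst::Store,  // v8 = fsub v6, v7
        Inst::Arith { rhs: 10 }, Inst::Store, // v11 = fsub v9, v10
    ]
}

/// The dot product from the issue.
pub fn dot_product() -> Vec<Inst> {
    let load = |dst| Inst::Load { dst, notrap: true };
    vec![
        load(3), load(4), load(6), load(7), load(10), load(11),
        Inst::Arith { rhs: 4 },  // v5 = fmul v3, v4
        Inst::Arith { rhs: 7 },  // v8 = fmul v6, v7
        Inst::Arith { rhs: 8 },  // v9 = fadd v5, v8
        Inst::Arith { rhs: 11 }, // v12 = fmul v10, v11
        Inst::Arith { rhs: 12 }, // v13 = fadd v9, v12
    ]
}
```

Under this relaxed rule the model finds 3 sinkable loads for dot_product(), matching the LLVM output. For vector_sub() it finds only 1, because the stores to v0 and v0+8 sit between the later loads and their fsub users. That is the store ordering constraint from the last comment: in this model, reaching all 3 would also need the fsub to be placed ahead of the preceding store, or knowledge that the store does not alias the load.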
Last updated: Dec 06 2025 at 06:05 UTC