stack_addr + load/store merging · cranelift

I noticed that on arm64 a stack_addr + load/store gets merged into a single instruction. This doesn't seem to happen on x86, however. Where is this implemented for arm64?

bjorn3 (Sep 19 2025 at 12:50):

function %stack_store_small(i64) {
ss0 = explicit_slot 8

block0(v0: i64):
  stack_store.i64 v0, ss0
  return
}

; VCode:
;   stp fp, lr, [sp, #-16]!
;   mov fp, sp
;   sub sp, sp, #16
; block0:
;   mov x2, sp
;   str x0, [x2]
;   add sp, sp, #16
;   ldp fp, lr, [sp], #16
;   ret

function %stack_store_small(i64) {
ss0 = explicit_slot 8

block0(v0: i64):
  stack_store.i64 v0, ss0
  return
}

; VCode:
;   pushq %rbp
;   movq %rsp, %rbp
;   subq $0x10, %rsp
; block0:
;   leaq <offset:1>+(%rsp), %rax
;   movq %rdi, (%rax)
;   addq $0x10, %rsp
;   movq %rbp, %rsp
;   popq %rbp
;   retq

bjorn3 (Sep 19 2025 at 12:51):

bjorn3 (Sep 19 2025 at 14:12):

[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
  Time (mean ± σ):      2.690 s ±  0.043 s    [User: 2.686 s, System: 0.004 s]
  Range (min … max):    2.663 s …  2.807 s    10 runs

Benchmark 2: ./raytracer_cg_clif
  Time (mean ± σ):      3.702 s ±  0.021 s    [User: 3.697 s, System: 0.005 s]
  Range (min … max):    3.674 s …  3.738 s    10 runs

Benchmark 3: ./raytracer_cg_clif_opt
  Time (mean ± σ):      2.286 s ±  0.011 s    [User: 2.282 s, System: 0.004 s]
  Range (min … max):    2.268 s …  2.306 s    10 runs

Summary
  ./raytracer_cg_clif_opt ran
    1.18 ± 0.02 times faster than ./raytracer_cg_llvm
    1.62 ± 0.01 times faster than ./raytracer_cg_clif

[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
  Time (mean ± σ):      2.668 s ±  0.010 s    [User: 2.663 s, System: 0.004 s]
  Range (min … max):    2.660 s …  2.687 s    10 runs

Benchmark 2: ./raytracer_cg_clif
  Time (mean ± σ):      2.552 s ±  0.016 s    [User: 2.547 s, System: 0.004 s]
  Range (min … max):    2.539 s …  2.588 s    10 runs

Benchmark 3: ./raytracer_cg_clif_opt
  Time (mean ± σ):      2.151 s ±  0.022 s    [User: 2.147 s, System: 0.004 s]
  Range (min … max):    2.130 s …  2.206 s    10 runs

Summary
  ./raytracer_cg_clif_opt ran
    1.19 ± 0.01 times faster than ./raytracer_cg_clif
    1.24 ± 0.01 times faster than ./raytracer_cg_llvm
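
The summary lines in the output above are ratios of the mean times, so they can be rechecked by hand. A small sketch that recomputes them and also compares the cg_clif mean across the two runs. The log does not say what changed between the runs, so the cross-run comparison is an assumption about how to read them; the `speedup` helper is made up for this example.

```rust
/// How many times faster `fast_s` is than `slow_s` (ratio of mean times in seconds).
/// Hypothetical helper, not part of any benchmarking tool.
fn speedup(slow_s: f64, fast_s: f64) -> f64 {
    slow_s / fast_s
}

fn main() {
    // Mean times in seconds, copied from the two benchmark runs in the log.
    let run1 = [("cg_llvm", 2.690), ("cg_clif", 3.702), ("cg_clif_opt", 2.286)];
    let run2 = [("cg_llvm", 2.668), ("cg_clif", 2.552), ("cg_clif_opt", 2.151)];

    for (label, run) in [("run 1", run1), ("run 2", run2)] {
        // The fastest mean in this run is the baseline for the summary ratios.
        let fastest = run.iter().map(|&(_, t)| t).fold(f64::INFINITY, f64::min);
        for (name, t) in run {
            println!("{label}: {name} takes {:.2}x the fastest mean", speedup(t, fastest));
        }
    }

    // Assumption: the two runs are directly comparable for the same binary name.
    println!("cg_clif, run 1 vs run 2: {:.2}x", speedup(3.702, 2.552));
}
```

The uncertainty (the ± part) is left out here, so this only reproduces the central ratios.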

bjorn3 (Sep 22 2025 at 13:12):

Cranelift: use SP-offset amodes for `stack_addr`+load/store. by cfallin · Pull Request #11727 · bytecodealliance/wasmtime

We provide stack_load / stack_store / stack_addr instructions in Cranelift to operate on stack slots, and the first two are legalized to a stack_addr plus an ordinary load or store instruction. We c...
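
To make the idea of folding a stack_addr into the following store concrete, here is a toy sketch in Rust. It is not Cranelift's implementation: the `Inst` enum, the register numbers and `fold_stack_stores` are all hypothetical and only model the pattern "compute SP + offset into a register, then store through that register" being rewritten into one SP-relative store.

```rust
// Toy model only: these types are hypothetical and are NOT Cranelift's IR,
// VCode, or lowering rules.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// dst = SP + offset (the role played by `leaq` / `mov x2, sp` in the VCode).
    StackAddr { dst: u32, offset: i32 },
    /// Store `src` to the address held in register `base`.
    Store { src: u32, base: u32 },
    /// Store `src` directly to [SP + offset].
    StoreSp { src: u32, offset: i32 },
}

/// Fold a `StackAddr` into a directly following `Store` that uses its result
/// as the base address.
/// Assumption: the address register has no other use. A real compiler has to
/// verify that before dropping the `StackAddr`.
fn fold_stack_stores(insts: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::with_capacity(insts.len());
    let mut i = 0;
    while i < insts.len() {
        if let (Inst::StackAddr { dst, offset }, Some(Inst::Store { src, base })) =
            (&insts[i], insts.get(i + 1))
        {
            // Only fold when the store goes through the computed address and
            // does not also store that address register as its value.
            if dst == base && src != dst {
                out.push(Inst::StoreSp { src: *src, offset: *offset });
                i += 2;
                continue;
            }
        }
        out.push(insts[i].clone());
        i += 1;
    }
    out
}

fn main() {
    let before = vec![
        Inst::StackAddr { dst: 100, offset: 0 },
        Inst::Store { src: 0, base: 100 },
    ];
    let after = fold_stack_stores(&before);
    println!("before: {:?}", before);
    println!("after:  {:?}", after);
}
```

The sketch only looks at adjacent instructions and only at stores; loads would follow the same shape.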

Chris Fallin (Sep 22 2025 at 13:48):

Ah, I vaguely recalled seeing something about this but couldn’t find an issue — happy this will solve your problem too!

bjorn3 (Sep 22 2025 at 14:05):

Optimize `stack_store` and `stack_load` · Issue #1064 · bytecodealliance/wasmtime

Currently, stack_store and stack_load are legalized into stack_addr followed by plain store and load, producing code like this:

  lea rax, qword ptr [rsp + 8]
  mov qword ptr [rax], rdi

We really want ...
