Stream: cranelift

Topic: stack_addr + load/store merging


bjorn3 (Sep 19 2025 at 12:49):

I noticed that on arm64 a stack_addr + load/store gets merged into a single instruction. This doesn't seem to happen on x86 however. Where is this implemented for arm64?

bjorn3 (Sep 19 2025 at 12:50):

function %stack_store_small(i64) {
ss0 = explicit_slot 8

block0(v0: i64):
  stack_store.i64 v0, ss0
  return
}

; VCode:
;   stp fp, lr, [sp, #-16]!
;   mov fp, sp
;   sub sp, sp, #16
; block0:
;   mov x2, sp
;   str x0, [x2]
;   add sp, sp, #16
;   ldp fp, lr, [sp], #16
;   ret

vs

function %stack_store_small(i64) {
ss0 = explicit_slot 8

block0(v0: i64):
  stack_store.i64 v0, ss0
  return
}

; VCode:
;   pushq %rbp
;   movq %rsp, %rbp
;   subq $0x10, %rsp
; block0:
;   leaq <offset:1>+(%rsp), %rax
;   movq %rdi, (%rax)
;   addq $0x10, %rsp
;   movq %rbp, %rsp
;   popq %rbp
;   retq

bjorn3 (Sep 19 2025 at 12:51):

Never mind, I misread the arm64 asm. It doesn't happen there either.

bjorn3 (Sep 19 2025 at 14:12):

Got something kinda working. Before:

[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
  Time (mean ± σ):      2.690 s ±  0.043 s    [User: 2.686 s, System: 0.004 s]
  Range (min … max):    2.663 s …  2.807 s    10 runs

Benchmark 2: ./raytracer_cg_clif
  Time (mean ± σ):      3.702 s ±  0.021 s    [User: 3.697 s, System: 0.005 s]
  Range (min … max):    3.674 s …  3.738 s    10 runs

Benchmark 3: ./raytracer_cg_clif_opt
  Time (mean ± σ):      2.286 s ±  0.011 s    [User: 2.282 s, System: 0.004 s]
  Range (min … max):    2.268 s …  2.306 s    10 runs

Summary
  ./raytracer_cg_clif_opt ran
    1.18 ± 0.02 times faster than ./raytracer_cg_llvm
    1.62 ± 0.01 times faster than ./raytracer_cg_clif

After:

[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
  Time (mean ± σ):      2.668 s ±  0.010 s    [User: 2.663 s, System: 0.004 s]
  Range (min … max):    2.660 s …  2.687 s    10 runs

Benchmark 2: ./raytracer_cg_clif
  Time (mean ± σ):      2.552 s ±  0.016 s    [User: 2.547 s, System: 0.004 s]
  Range (min … max):    2.539 s …  2.588 s    10 runs

Benchmark 3: ./raytracer_cg_clif_opt
  Time (mean ± σ):      2.151 s ±  0.022 s    [User: 2.147 s, System: 0.004 s]
  Range (min … max):    2.130 s …  2.206 s    10 runs

Summary
  ./raytracer_cg_clif_opt ran
    1.19 ± 0.01 times faster than ./raytracer_cg_clif
    1.24 ± 0.01 times faster than ./raytracer_cg_llvm
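
For reference, the before/after means above work out to roughly a 1.45x speedup (about 31% less time) for raytracer_cg_clif and about 1.06x (about 6% less time) for raytracer_cg_clif_opt. The small Rust sketch below just redoes that arithmetic from the reported means; the `speedup` helper is a name made up for this note. The two hyperfine sessions are separate runs (the LLVM baseline itself moved from 2.690 s to 2.668 s), so treat the figures as approximate.

```rust
/// Hypothetical helper: returns (speedup factor, percent reduction in runtime).
fn speedup(before_s: f64, after_s: f64) -> (f64, f64) {
    (before_s / after_s, (1.0 - after_s / before_s) * 100.0)
}

fn main() {
    // Mean wall-clock times in seconds, copied from the hyperfine output above.
    let runs = [
        ("raytracer_cg_clif", 3.702, 2.552),
        ("raytracer_cg_clif_opt", 2.286, 2.151),
    ];
    for (name, before, after) in runs {
        let (factor, pct) = speedup(before, after);
        println!("{name}: {factor:.2}x faster, {pct:.1}% less time");
    }
}
```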

bjorn3 (Sep 22 2025 at 13:12):

https://github.com/bytecodealliance/wasmtime/pull/11727 by @Chris Fallin has a better implementation with the same perf benefits.

We provide stack_load / stack_store / stack_addr instructions in Cranelift to operate on stack slots, and the first two are legalized to a stack_addr plus an ordinary load or store instruction. We c...

Chris Fallin (Sep 22 2025 at 13:48):

Ah, I vaguely recalled seeing something about this but couldn’t find an issue — happy this will solve your problem too!

bjorn3 (Sep 22 2025 at 14:05):

Looks like there was an ancient issue for this too: https://github.com/bytecodealliance/wasmtime/issues/1064

Currently, stack_store and stack_load are legalized into stack_addr followed by plain store and load, producing code like this:

  lea rax, qword ptr [rsp + 8]
  mov qword ptr [rax], rdi

We really want ...
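
To make the idea in this thread concrete, here is a toy Rust sketch of folding a stack_addr into the store that consumes it, so that one slot-relative store is emitted instead of an address computation plus a store. This is not Cranelift's implementation and does not use its API: the `Inst` enum, its variants, and `merge_stack_stores` are all hypothetical, and the sketch assumes straight-line code where each value number is defined once.

```rust
use std::collections::{HashMap, HashSet};

// Toy model only: hypothetical types, not Cranelift's IR or API.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// dst = address of stack slot `slot` plus `offset`
    StackAddr { dst: u32, slot: u32, offset: i32 },
    /// store value `src` to the address held in value `addr`
    Store { src: u32, addr: u32 },
    /// store value `src` directly to `slot + offset` (one instruction)
    StoreToSlot { src: u32, slot: u32, offset: i32 },
}

/// Rewrites `StackAddr` + `Store` pairs into `StoreToSlot`, and drops a
/// `StackAddr` whose result is never stored as a value (does not escape).
fn merge_stack_stores(insts: &[Inst]) -> Vec<Inst> {
    // Pass 1: record which values are slot addresses, and which values
    // are used as stored data (such an address must stay materialized).
    let mut slot_addrs: HashMap<u32, (u32, i32)> = HashMap::new();
    let mut escaping: HashSet<u32> = HashSet::new();
    for inst in insts {
        match *inst {
            Inst::StackAddr { dst, slot, offset } => {
                slot_addrs.insert(dst, (slot, offset));
            }
            Inst::Store { src, .. } | Inst::StoreToSlot { src, .. } => {
                escaping.insert(src);
            }
        }
    }

    // Pass 2: fold addresses into stores and skip dead address computations.
    let mut out = Vec::with_capacity(insts.len());
    for inst in insts {
        match *inst {
            Inst::StackAddr { dst, .. } if !escaping.contains(&dst) => {}
            Inst::Store { src, addr } => match slot_addrs.get(&addr) {
                Some(&(slot, offset)) => out.push(Inst::StoreToSlot { src, slot, offset }),
                None => out.push(inst.clone()),
            },
            _ => out.push(inst.clone()),
        }
    }
    out
}

fn main() {
    // Models the legalized form of `stack_store.i64 v0, ss0`:
    //   v1 = stack_addr ss0
    //   store v0, v1
    let before = vec![
        Inst::StackAddr { dst: 1, slot: 0, offset: 0 },
        Inst::Store { src: 0, addr: 1 },
    ];
    println!("{:?}", merge_stack_stores(&before));
}
```

In this model the two-instruction sequence collapses to a single `StoreToSlot`, which corresponds to wanting one store with a stack-relative address rather than the lea/mov pair quoted above. The stack_addr is kept whenever its result is itself stored somewhere, since the address is then needed as a value.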

Last updated: Dec 06 2025 at 07:03 UTC