I noticed that on arm64 a stack_addr + load/store gets merged into a single instruction. This doesn't seem to happen on x86, however. Where is this implemented for arm64?
function %stack_store_small(i64) {
ss0 = explicit_slot 8
block0(v0: i64):
stack_store.i64 v0, ss0
return
}
; VCode:
; stp fp, lr, [sp, #-16]!
; mov fp, sp
; sub sp, sp, #16
; block0:
; mov x2, sp
; str x0, [x2]
; add sp, sp, #16
; ldp fp, lr, [sp], #16
; ret
vs. x86-64:
function %stack_store_small(i64) {
ss0 = explicit_slot 8
block0(v0: i64):
stack_store.i64 v0, ss0
return
}
; VCode:
; pushq %rbp
; movq %rsp, %rbp
; subq $0x10, %rsp
; block0:
; leaq <offset:1>+(%rsp), %rax
; movq %rdi, (%rax)
; addq $0x10, %rsp
; movq %rbp, %rsp
; popq %rbp
; retq
Never mind, I read the arm64 asm wrong. It doesn't happen there either.
Got something kinda working. Before:
[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
Time (mean ± σ): 2.690 s ± 0.043 s [User: 2.686 s, System: 0.004 s]
Range (min … max): 2.663 s … 2.807 s 10 runs
Benchmark 2: ./raytracer_cg_clif
Time (mean ± σ): 3.702 s ± 0.021 s [User: 3.697 s, System: 0.005 s]
Range (min … max): 3.674 s … 3.738 s 10 runs
Benchmark 3: ./raytracer_cg_clif_opt
Time (mean ± σ): 2.286 s ± 0.011 s [User: 2.282 s, System: 0.004 s]
Range (min … max): 2.268 s … 2.306 s 10 runs
Summary
./raytracer_cg_clif_opt ran
1.18 ± 0.02 times faster than ./raytracer_cg_llvm
1.62 ± 0.01 times faster than ./raytracer_cg_clif
After:
[BENCH RUN] ebobby/simple-raytracer
Benchmark 1: ./raytracer_cg_llvm
Time (mean ± σ): 2.668 s ± 0.010 s [User: 2.663 s, System: 0.004 s]
Range (min … max): 2.660 s … 2.687 s 10 runs
Benchmark 2: ./raytracer_cg_clif
Time (mean ± σ): 2.552 s ± 0.016 s [User: 2.547 s, System: 0.004 s]
Range (min … max): 2.539 s … 2.588 s 10 runs
Benchmark 3: ./raytracer_cg_clif_opt
Time (mean ± σ): 2.151 s ± 0.022 s [User: 2.147 s, System: 0.004 s]
Range (min … max): 2.130 s … 2.206 s 10 runs
Summary
./raytracer_cg_clif_opt ran
1.19 ± 0.01 times faster than ./raytracer_cg_clif
1.24 ± 0.01 times faster than ./raytracer_cg_llvm
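The summaries above compare binaries within a single run, so the effect of the change itself is easier to see by dividing each "before" mean by the matching "after" mean. A minimal sketch of that arithmetic, using only the mean times reported above (the dictionary keys and the `speedups` helper are names made up for illustration):

```python
# Mean wall-clock times in seconds, copied from the benchmark output above.
before = {"cg_llvm": 2.690, "cg_clif": 3.702, "cg_clif_opt": 2.286}
after = {"cg_llvm": 2.668, "cg_clif": 2.552, "cg_clif_opt": 2.151}


def speedups(before, after):
    """Return before/after ratios; a value above 1.0 means the run got faster."""
    return {name: before[name] / after[name] for name in before}


result = speedups(before, after)
for name, ratio in result.items():
    print(f"{name}: {ratio:.2f}x")
```

By this measure the unoptimized cg_clif build improves by roughly 1.45x and the optimized build by roughly 1.06x, while the LLVM baseline stays essentially flat, which suggests the difference comes from the change and not from noise between runs.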
https://github.com/bytecodealliance/wasmtime/pull/11727 by @Chris Fallin has a better implementation with the same perf benefits.
Ah, I vaguely recalled seeing something about this but couldn’t find an issue — happy this will solve your problem too!
Looks like there was an ancient issue for this too: https://github.com/bytecodealliance/wasmtime/issues/1064
Last updated: Dec 06 2025 at 07:03 UTC