pnodet opened PR #13055 from pnodet:aarch64-opt2-lr-only to bytecodealliance:main.
pnodet edited PR #13055:
This updates AArch64 prologue/epilogue generation to use an LR-only linkage frame for a narrow set of simple regular-call functions. Instead of always doing `stp fp, lr` + `mov fp, sp` and restoring both registers, we now use `str`/`ldr lr` with the same 16-byte stack adjustment when it’s safe to do so.
The optimization is intentionally conservative. It does not apply when frame pointers are required, when unwind info is enabled, when return-address signing is enabled, or when the frame layout needs full FP-based setup. In those cases we keep the existing FP/LR path unchanged.
The goal is to trim unnecessary frame setup/teardown work in the common eligible cases while preserving ABI alignment and keeping behavior identical outside that narrow window.
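The eligibility conditions listed in the description can be read as a single predicate. The following is a minimal sketch of that reading, not the PR's actual code: the `FrameInfo` struct, its field names, and `can_use_lr_only_frame` are hypothetical and do not correspond to Cranelift's real types or settings.

```rust
/// Hypothetical summary of the properties the description mentions.
/// These names are illustrative only; they are not Cranelift's API.
#[derive(Clone, Copy, Debug, Default)]
struct FrameInfo {
    /// The function uses the regular call path.
    is_regular_call: bool,
    /// Frame pointers are required for this function.
    frame_pointers_required: bool,
    /// Unwind info is enabled.
    unwind_info_enabled: bool,
    /// Return-address signing is enabled.
    return_address_signing: bool,
    /// The frame layout needs the full FP-based setup.
    needs_fp_based_setup: bool,
}

/// Returns true only when none of the disqualifying conditions hold.
/// Anything else stays on the existing FP/LR path.
fn can_use_lr_only_frame(info: &FrameInfo) -> bool {
    info.is_regular_call
        && !info.frame_pointers_required
        && !info.unwind_info_enabled
        && !info.return_address_signing
        && !info.needs_fp_based_setup
}
```

Written this way, the check is conservative by construction: any single disqualifying flag sends the function back to the unchanged FP/LR prologue.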
pnodet edited PR #13055:
This updates AArch64 prologue/epilogue generation to use an LR-only linkage frame for a narrow set of simple regular-call functions. Instead of always doing `stp fp, lr` + `mov fp, sp` and restoring both registers, we now use `str`/`ldr lr` with the same 16-byte stack adjustment when it’s safe to do so.
The optimization is intentionally conservative. It does not apply when frame pointers are required, when unwind info is enabled, when return-address signing is enabled, or when the frame layout needs full FP-based setup. In those cases we keep the existing FP/LR path unchanged.
The goal is to trim unnecessary frame setup/teardown work in the common eligible cases while preserving ABI alignment and keeping behavior identical outside that narrow window.
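To make the difference between the two frames concrete, here is a small sketch that lists the instruction sequences side by side. The `LinkageFrame` enum and both functions are hypothetical, and the exact addressing modes (pre-indexed store, post-indexed load with a 16-byte adjustment) are an assumption based on the description, not taken from the PR's diff.

```rust
/// Which linkage frame a function uses (illustrative names, not Cranelift's).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LinkageFrame {
    /// Existing path: save fp and lr as a pair, then set up fp.
    FpLr,
    /// New path: save only lr, keeping the same 16-byte adjustment.
    LrOnly,
}

/// Assumed prologue sequence for each frame kind.
fn prologue(frame: LinkageFrame) -> Vec<&'static str> {
    match frame {
        LinkageFrame::FpLr => vec!["stp fp, lr, [sp, #-16]!", "mov fp, sp"],
        // sp still moves by 16 bytes so the stack stays 16-byte aligned.
        LinkageFrame::LrOnly => vec!["str lr, [sp, #-16]!"],
    }
}

/// Assumed epilogue sequence for each frame kind.
fn epilogue(frame: LinkageFrame) -> Vec<&'static str> {
    match frame {
        LinkageFrame::FpLr => vec!["ldp fp, lr, [sp], #16", "ret"],
        LinkageFrame::LrOnly => vec!["ldr lr, [sp], #16", "ret"],
    }
}
```

Under these assumptions the LR-only prologue is one instruction shorter because it drops `mov fp, sp`, while the epilogue keeps the same instruction count and only swaps the pair load for a single load.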
github-actions[bot] added the label cranelift on PR #13055.
github-actions[bot] added the label cranelift:area:aarch64 on PR #13055.
cfallin commented on PR #13055:
@pnodet I see this is still a draft but could you clarify what performance changes, if any, you've measured with this change?
Naively at least, I would expect the store-pair of fp/lr and the single store of lr, both to a 16-byte slot on the stack, to have almost equal performance on modern CPUs -- the hardware does the store in a single action in either case (single store-buffer slot, single instruction issue), just with a different datapath width. There may be different execution ports, small differences in ILP-heavy workloads, etc. Have you measured a speedup with this?
Last updated: May 03 2026 at 23:15 UTC