I have noticed that Cranelift's AArch64 backend currently stores all 128 bits of each callee-saved SIMD & FP register when generating a function prologue. However, the Procedure Call Standard for the Arm 64-bit Architecture requires the callee to preserve only the bottom 64 bits of each of these registers; if the whole SIMD register is live across a call, it is the caller's responsibility to preserve its value. Naturally, that suggests an optimization, but is there a deeper reason for the current behaviour? If the implementation is modified to save only 64 bits, will there be any correctness issues because someone else has different expectations?
@Anton Kirilov nope, there's no deeper reason here; we just weren't aware that the upper 64 bits are always caller-save. That's good news as it's a perf improvement on the table! Happy to take a patch, or I can add it to The List and get to it at some point. Thanks!
(Now that I've said that, though, one thing I should double-check is whether the calling convention in SpiderMonkey on aarch64 differs on this point; it's mostly the system calling convention, except when it's not)
FYI here's the relevant quote from the PCS:
> Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved [7]; it is the responsibility of the caller to preserve larger values.
Actually, there's another possible improvement - we can use store pair instructions, as in the GPR case; similarly, load pair instructions in the epilogue code.
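A rough sketch of how both changes would look in a generated prologue (register choice and stack offsets are illustrative, not Cranelift's actual frame layout):

```asm
// Current: save all 128 bits of each callee-saved vector register.
str q8, [sp, #16]
str q9, [sp, #32]

// With both changes: only the low 64 bits (the D views) need to be
// preserved, and one store-pair instruction covers two registers.
stp d8, d9, [sp, #16]
```

The epilogue would mirror this with `ldp` of the D registers.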
Yep, that would be an improvement as well
Either @Joey Gouly or I can handle it at some point, that's fine.
There's the question of whether saving/restoring 64 bits would actually make any difference in practice. If the load/store units have a 128-bit wide path to/from the L1D$, then I guess the answer would be "no". If that path is only 64 bits wide and used twice for a 128-bit transaction, then "yes"; but that would seem to imply that all SIMD loads/stores on the machine would be similarly burdened, which strikes me as a bit unlikely, especially for any mid-range implementation or above.
So I have to say... I wouldn't be surprised if fixing this made no measurable difference.
I am not sure I understand you - if the LSUs have a 128-bit wide path to the L1D$, then after the fix they will be able to save 2 registers per cycle (a pair of D registers) instead of 1 (a single Q register), so the saves will take half the time (ignoring out-of-order execution).
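To make the arithmetic concrete (assuming one L1D$ access per cycle; offsets illustrative): there are eight callee-saved vector registers, v8-v15, so on a 128-bit path:

```asm
// Before: one q-form store per register -> 8 accesses
str q8, [sp, #16]
str q9, [sp, #32]
// ... q10-q15 likewise

// After: each stp of two D registers is a single 128-bit
// access -> 4 accesses
stp d8,  d9,  [sp, #16]
stp d10, d11, [sp, #32]
stp d12, d13, [sp, #48]
stp d14, d15, [sp, #64]
```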
Mhm, I hadn't thought of that. Good point.
Anyway, this change would probably be made simultaneously with switching to store pair instructions; the latter definitely reduces code size.
Some of the earlier AArch64 CPUs only have a 64-bit path to the L1D$: https://github.com/bytecodealliance/regalloc.rs/issues/14#issuecomment-575648053
Do you mean the Cortex-A55, Cortex-A57, or Cortex-A72?
Yes, all 3 of those only have 64-bit paths to the L1D$.
The A55 has asymmetric load and store bandwidths - 64 bits/cycle for loads and 128 bits/cycle for stores. This is stated in the optimization guide and also in the Technical Reference Manual.
The A57 and A72 optimization guides don't really say anything.
Neither do the TRMs (A57, A72).
It's implied by the throughput numbers for "Store vector reg, unscaled immed, Q-form".
You can tell that it is splitting the store into 2 uops, since it has a 2-cycle latency and a throughput of 0.5 instructions per cycle.
Though it seems that only stores are limited to 64 bits; loads can do 128 bits in a single cycle.
Last updated: Oct 23 2024 at 20:03 UTC