I have noticed that Cranelift's AArch64 backend currently stores all 128 bits of each callee-saved SIMD & FP register when generating a function prologue. However, the Procedure Call Standard for the Arm 64-bit Architecture requires the callee to preserve only the bottom 64 bits of each of these registers; if the whole SIMD register is live across a call, it is the caller's responsibility to preserve its value. Naturally, that suggests an optimization, but is there a deeper reason for the current behaviour? If the implementation is modified to save only 64 bits, will there be any correctness issues because someone else has different expectations?
@Anton Kirilov nope, there's no deeper reason here; we just weren't aware that the upper 64 bits are always caller-save. That's good news as it's a perf improvement on the table! Happy to take a patch, or I can add it to The List and get to it at some point. Thanks!
(Now that I've said that, though, one thing I should double-check is whether the calling convention in SpiderMonkey on aarch64 differs on this point; it's mostly the system calling convention, except when it's not)
FYI here's the relevant quote from the PCS:
> Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved [7]; it is the responsibility of the caller to preserve larger values.
Actually, there's another possible improvement - we can use store pair instructions, as in the GPR case; similarly, load pair instructions in the epilogue code.
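A rough sketch of how both changes would look in a generated prologue (register choice and stack offsets are illustrative, not Cranelift's actual frame layout):

```asm
// Current: save all 128 bits of each callee-saved vector register.
str q8, [sp, #16]
str q9, [sp, #32]

// With both changes: only the low 64 bits (the D views) need to be
// preserved, and one store-pair instruction covers two registers.
stp d8, d9, [sp, #16]
```

The epilogue would mirror this with `ldp` of the D registers.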
Yep, that would be an improvement as well
Either @Joey Gouly or I can handle it at some point, that's fine.
There's the question of whether saving/restoring 64 bits would actually make any difference in practice. If the load/store units have a 128-bit wide path to/from the L1D$, then I guess the answer would be "no". If that path is only 64 bits wide and used twice for a 128-bit transaction, then "yes"; but that would seem to imply that all SIMD loads/stores on the machine would be similarly burdened, which strikes me as a bit unlikely, especially for any mid-range implementation or above.
So I have to say... I wouldn't be surprised if fixing this made no measurable difference.
I am not sure I understand you - if the LSUs have a 128-bit wide path to the L1D$, then after the fix they will be able to save 2 registers per cycle (a pair of D registers) instead of 1 (a single Q register), so the saves will take half the time (ignoring out-of-order execution).
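To make the arithmetic concrete (assuming one L1D$ access per cycle; offsets illustrative): there are eight callee-saved vector registers, v8-v15, so on a 128-bit path:

```asm
// Before: one q-form store per register -> 8 accesses
str q8, [sp, #16]
str q9, [sp, #32]
// ... q10-q15 likewise

// After: each stp of two D registers is a single 128-bit
// access -> 4 accesses
stp d8,  d9,  [sp, #16]
stp d10, d11, [sp, #32]
stp d12, d13, [sp, #48]
stp d14, d15, [sp, #64]
```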
Mhm, I hadn't thought of that. Good point.
Anyway, this change would probably be made simultaneously with switching to store pair instructions; the latter definitely reduces code size.
Some of the earlier AArch64 CPUs only have a 64-bit path to the L1D$: https://github.com/bytecodealliance/regalloc.rs/issues/14#issuecomment-575648053
Do you mean the Cortex-A55, Cortex-A57, or Cortex-A72?
Yes, all 3 of those only have 64-bit paths to the L1D$.
The A55 has asymmetric load and store bandwidths - 64 bits/cycle for loads and 128 bits/cycle for stores. This is stated in the optimization guide and also in the Technical Reference Manual.
The A57 and A72 optimization guides don't really say anything.
Neither do the TRMs (A57, A72).
It's implied by the throughput numbers for "Store vector reg, unscaled immed, Q-form".
You can tell that it is splitting the store into 2 uops, since it has a 2-cycle latency and a throughput of 0.5 instructions per cycle.
Though it seems that only stores are limited to 64 bits; loads can do 128 bits in a single cycle.
Last updated: Oct 23 2024 at 20:03 UTC