Stream: cranelift

Topic: AArch64 callee-saved FP registers


view this post on Zulip Anton Kirilov (Jun 04 2020 at 23:33):

I have noticed that currently Cranelift's AArch64 backend stores all 128 bits of each callee-saved SIMD & FP register when generating a function prologue. However, the Procedure Call Standard for the Arm 64-bit Architecture requires the callee to preserve only the bottom 64 bits of each of those registers; if the whole SIMD register is live across a call, then it is the responsibility of the caller to preserve its value. Naturally, that suggests an optimization, but is there a deeper reason for the current behaviour? If the implementation is modified to save only 64 bits, will there be any correctness issues because someone else has different expectations?


view this post on Zulip Chris Fallin (Jun 04 2020 at 23:44):

@Anton Kirilov nope, there's no deeper reason here; we just weren't aware that the upper 64 bits are always caller-save. That's good news as it's a perf improvement on the table! Happy to take a patch, or I can add it to The List and get to it at some point. Thanks!

view this post on Zulip Chris Fallin (Jun 04 2020 at 23:45):

(Now that I've said that, though, one thing I should double-check is whether the calling convention in SpiderMonkey on aarch64 differs on this point; it's mostly the system calling convention, except when it's not)

view this post on Zulip Anton Kirilov (Jun 04 2020 at 23:55):

FYI here's the relevant quote from the PCS:

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved [7]; it is the responsibility of the caller to preserve larger values.

Actually, there's another possible improvement - we can use store pair instructions, as in the GPR case, and load pair instructions in the epilogue.
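
To make the two improvements concrete, here is a sketch of what the vector-register-save portion of a prologue might look like before and after. The stack offsets are illustrative only, not what Cranelift actually emits:

```asm
// Current: eight 128-bit stores, one per callee-saved vector register.
str q8,  [sp, #16]
str q9,  [sp, #32]
str q10, [sp, #48]
str q11, [sp, #64]
str q12, [sp, #80]
str q13, [sp, #96]
str q14, [sp, #112]
str q15, [sp, #128]

// With both changes: only the architecturally callee-saved bottom
// 64 bits (d8-d15) are stored, paired two at a time with stp.
stp d8,  d9,  [sp, #16]
stp d10, d11, [sp, #32]
stp d12, d13, [sp, #48]
stp d14, d15, [sp, #64]
```

This halves both the bytes written and the instruction count; the epilogue would symmetrically use ldp of d registers.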


view this post on Zulip Chris Fallin (Jun 05 2020 at 00:03):

Yep, that would be an improvement as well

view this post on Zulip Anton Kirilov (Jun 05 2020 at 10:54):

Either @Joey Gouly or I can handle it at some point; that's fine.

view this post on Zulip Julian Seward (Jun 05 2020 at 12:00):

There's the question of whether saving/restoring 64 bits would actually make any difference in practice. If the load/store units have a 128-bit-wide path to/from the L1 data cache, then I guess the answer would be "no". If that path is only 64 bits wide and used twice for a 128-bit transaction, then "yes"; but that would seem to imply that all SIMD loads/stores on the machine would be similarly burdened, which strikes me as a bit unlikely, especially for any mid-range implementation or above.

view this post on Zulip Julian Seward (Jun 05 2020 at 12:01):

So I have to say .. I wouldn't be surprised if fixing this made no measurable difference.

view this post on Zulip Anton Kirilov (Jun 05 2020 at 12:27):

I am not sure I understand you - if the LSUs have a 128-bit-wide path to the L1D$, then after the fix they will be able to save 2 registers per cycle instead of 1, so the save sequence will take half the time (ignoring out-of-order execution).
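
A back-of-the-envelope version of that argument, as a sketch. It assumes a 128-bit/cycle store path to the L1D$, ignores out-of-order execution, and counts only the stores for the 8 callee-saved vector registers (v8-v15):

```python
# Hypothetical cycle estimate for saving v8-v15 in a prologue,
# assuming a 128-bit/cycle store path to the L1 data cache.
STORE_PATH_BITS = 128
NUM_REGS = 8

# Current behaviour: one full 128-bit q-register store per cycle.
cycles_q = NUM_REGS * 128 // STORE_PATH_BITS

# After the fix: 64-bit d-register halves, two registers (one stp,
# 128 bits total) retiring per cycle.
cycles_d = NUM_REGS * 64 // STORE_PATH_BITS

print(cycles_q, cycles_d)  # 8 4
```

So, under these assumptions, the prologue's vector-register saves go from 8 cycles to 4.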

view this post on Zulip Julian Seward (Jun 05 2020 at 12:30):

Mhm, I hadn't thought of that. Good point.

view this post on Zulip Anton Kirilov (Jun 05 2020 at 12:32):

Anyway, this change would probably be made simultaneously with switching to store pair instructions; the latter definitely reduces code size.

view this post on Zulip Amanieu (Jun 05 2020 at 12:42):

Some of the earlier AArch64 CPUs only have a 64-bit path to the L1D$: https://github.com/bytecodealliance/regalloc.rs/issues/14#issuecomment-575648053


view this post on Zulip Anton Kirilov (Jun 05 2020 at 12:53):

Do you mean either Cortex-A55, Cortex-A57, or Cortex-A72?

view this post on Zulip Amanieu (Jun 05 2020 at 13:00):

Yes, all 3 of those only have 64-bit paths to the L1D$.

view this post on Zulip Anton Kirilov (Jun 05 2020 at 13:00):

The A55 has asymmetric load and store bandwidths - 64 bits/cycle for loads and 128 bits/cycle for stores. This is stated in the optimization guide and also in the Technical Reference Manual.

view this post on Zulip Anton Kirilov (Jun 05 2020 at 13:13):

The A57 and A72 optimization guides don't really say anything.

view this post on Zulip Anton Kirilov (Jun 05 2020 at 13:19):

Neither do the TRMs (A57, A72).

view this post on Zulip Amanieu (Jun 05 2020 at 15:13):

It's implied by the throughput numbers for "Store vector reg, unscaled immed, Q-form".

view this post on Zulip Amanieu (Jun 05 2020 at 15:13):

You can tell that it is splitting the store into 2 uops, since it has a 2-cycle latency and a throughput of 0.5 instructions per cycle.
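
That inference can be checked with a quick calculation: a reciprocal throughput of 0.5 for a 128-bit store means the effective store bandwidth is 64 bits/cycle, consistent with the store being cracked into two 64-bit uops on a 64-bit store path. A sketch:

```python
# If a Q-form (128-bit) store retires at 0.5 instructions/cycle,
# the effective store bandwidth is 128 * 0.5 = 64 bits/cycle -
# i.e. the hardware is issuing two 64-bit uops per store.
q_store_bits = 128
throughput_per_cycle = 0.5
effective_bw = q_store_bits * throughput_per_cycle
print(int(effective_bw))  # 64
```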

view this post on Zulip Amanieu (Jun 05 2020 at 15:15):

Though it seems that only stores are limited to 64 bits; loads can do 128 bits in a single cycle.


Last updated: Oct 23 2024 at 20:03 UTC