bjorn3 commented on Issue #2228:
The default call conv for aarch64 is systemv, not aapcs. There are some differences between the two in float handling, at least on arm32. For example https://github.com/rust-lang/rust/blob/85fbf49ce0e2274d0acf798f6e703747674feec3/compiler/rustc_target/src/abi/call/arm.rs#L91. I don't know whether this is also the case on aarch64 and, if so, whether this is one such example.
github-actions[bot] commented on Issue #2228:
Subscribe to Label Action
cc @bnjbvr
<details>
This issue or pull request has been labeled: "cranelift"
Thus the following users have been cc'd because of the following labels:
- bnjbvr: cranelift
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
</details>
akirilov-arm commented on Issue #2228:
AArch32 is a completely different case, in which there are many ABI variants indeed.
cfallin commented on Issue #2228:
Thanks @akirilov-arm -- it seems you're right that we could save half of our stack space and memory traffic for FP/vec clobber-saves. This should be a straightforward change in the ABI code; I want to verify first that this won't break SpiderMonkey (I think not, as every register is caller-save, IIRC) then I'll create a PR.
akirilov-arm commented on Issue #2228:
@cfallin I might have misunderstood you, but let me rephrase: There are 2 issues, one on the caller side and one on the callee side. You are talking about the latter, but I am more concerned about the former. Consider the test case in the PR: the compiler is stashing the vectors (and they are full, 128-bit vectors) before the call to %g1 in v8, v9, and v10, respectively, which is not going to work in the general case, assuming AAPCS64 compliance.
For comparison, here is what GCC and LLVM are doing in an equivalent situation.
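To make the caller-side problem concrete, here is a minimal sketch of the AAPCS64 rule involved: a callee must preserve v8-v15, but only their low 64 bits, and every other vector register may be overwritten entirely. The type and function names (`VecPreservation`, `vec_preservation`, `survives_call`) are hypothetical and are not Cranelift APIs.

```rust
/// What a callee guarantees about a vector register under AAPCS64
/// (simplified model for illustration only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum VecPreservation {
    /// The callee may overwrite the whole register (v0-v7, v16-v31).
    None,
    /// The callee preserves only bits 0..64 (v8-v15).
    Low64,
}

/// Classify vector register `v<vreg>`.
pub fn vec_preservation(vreg: u8) -> VecPreservation {
    assert!(vreg < 32, "AArch64 has vector registers v0-v31");
    if (8..=15).contains(&vreg) {
        VecPreservation::Low64
    } else {
        VecPreservation::None
    }
}

/// Does a value occupying the low `live_bits` bits of `v<vreg>`
/// survive a call to an AAPCS64-compliant callee?
pub fn survives_call(vreg: u8, live_bits: u32) -> bool {
    match vec_preservation(vreg) {
        VecPreservation::Low64 => live_bits <= 64,
        VecPreservation::None => false,
    }
}
```

Under this model an f64 kept in v8 across a call is safe, while a full 128-bit vector kept in v8 is not, which is the situation in the test case.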
cfallin commented on Issue #2228:
Ah, right, this is a bigger issue with respect to callsites; I had missed that half of it, sorry. So because we don't reason about half-clobbers currently (or overlapping registers), we need to treat all vector registers as clobbers on the caller side. I'll create a patch for this first thing Monday.
cfallin commented on Issue #2228:
So on further consideration (and after doing the one-line patch suggested and studying its effect), the solution is actually more complex than I had suggested above:
- We need to consider the "half caller-saves" as caller-saves when we generate calls, or else we incorrectly use registers that will be clobbered, as you say.
- At the same time, we need to consider the "half caller-saves" as callee-saves when we generate prologues and epilogues, because if we modify the low half we really do need to save them per the ABI.
The problem is how these interact: any function call at all now forces a bunch of clobber saves/restores in the caller's prologue and epilogue, because the half-clobbers from the call become full-clobbers (due to the imprecision of not reasoning about half-registers).
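The interaction can be illustrated with a toy model, assuming each instruction is reduced to the vector registers it writes and a call is treated as writing all of v0-v31 because half-registers cannot be expressed. `Inst`, `written_regs`, and `naive_prologue_saves` are hypothetical names, not the real Cranelift or regalloc.rs data structures.

```rust
use std::collections::BTreeSet;

/// Hypothetical, much-simplified instruction.
pub enum Inst {
    /// Ordinary instruction writing the listed vector registers.
    Op { defs: Vec<u8> },
    /// A call: without half-register reasoning it is modeled
    /// as fully clobbering v0-v31.
    Call,
}

/// Vector registers written by one instruction.
fn written_regs(inst: &Inst) -> Vec<u8> {
    match inst {
        Inst::Op { defs } => defs.clone(),
        Inst::Call => (0..32).collect(),
    }
}

/// Registers the prologue must save: every written register that
/// is (conservatively) callee-saved, i.e. v8-v15.
pub fn naive_prologue_saves(body: &[Inst]) -> BTreeSet<u8> {
    body.iter()
        .flat_map(written_regs)
        .filter(|r| (8..=15).contains(r))
        .collect()
}
```

In this model a body that contains a single `Inst::Call` and never touches v8-v15 itself still ends up with all eight registers in its save set.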
In essence, we need a way to "pass through" the half-clobbers from callsite to prologue/epilogue, as long as callee and caller have the same ABI; only true clobbers from the rest of the function body should alter the prologue/epilogue.
I can think of a few hacks, e.g. add the full-clobbers at callsites (as defs) to avoid regalloc using the registers, but compute our own clobber-set by scanning over every instruction except callsites; but this feels very error-prone and fragile compared to simply doing the right thing and reasoning through overlapping registers.
@julian-seward1, thoughts on this re: regalloc and how difficult it would be to support half-defs / half-clobbers?
akirilov-arm commented on Issue #2228:
@cfallin I realized that I should expand my changes and add test cases with f32 and f64 values, so do not merge anything, please.
> In essence, we need a way to "pass through" the half-clobbers from callsite to prologue/epilogue, as long as callee and caller have the same ABI...
Do you mean that the caller lowering should pass somehow the clobber information to the callee lowering? I thought that lowering passes for different functions were more or less independent, and indeed happened in different worker threads...
Unfortunately I can't say anything about the rest of your comments because I am missing some information, in particular your quick fix and its effects (expanding the test cases should help with the latter).
cfallin commented on Issue #2228:
> Do you mean that the caller lowering should pass somehow the clobber information to the callee lowering? I thought that lowering passes for different functions were more or less independent, and indeed happened in different worker threads...
No, what I mean is more conceptual: if the callee clobbers some set of registers C, then by calling it, the caller clobbers at least C as well (that's what I meant by "pass through"; it's something that intrinsically happens, not something we are doing; sorry, it was unclear). So if the caller and callee have the same ABI, then I think we can effectively hack around this limitation by:
- Adopting conservative definitions of caller- and callee-save registers at callsite generation and prologue generation respectively, so the half-and-half vector registers in question are both caller- and callee-save, BUT
- Ignoring the call instructions' defs/clobbers when computing the saved-clobbers list for prologue generation, because (if same ABI) anything it clobbers, we are also allowed to clobber without saving.
So this will require a small change to regalloc.rs, namely a single bit per instruction to indicate "exclude mods/defs from clobbers". @julian-seward1 and @bnjbvr, thoughts on this?
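A rough sketch of how such a per-instruction bit could work, under stated assumptions: `InstInfo`, `exclude_from_clobbers`, `call_inst`, and `saved_clobbers` are hypothetical names, not the actual regalloc.rs interface, and the calling-convention enum is a stand-in.

```rust
use std::collections::BTreeSet;

/// Stand-in calling conventions (assumption for illustration).
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum CallConv {
    SystemV,
    Other,
}

/// Hypothetical per-instruction summary handed to the allocator.
pub struct InstInfo {
    /// Vector registers defined (fully clobbered) by the instruction.
    pub defs: Vec<u8>,
    /// The "single bit": skip these defs when building the
    /// prologue's saved-clobbers list.
    pub exclude_from_clobbers: bool,
}

/// A call still defines v0-v31, so no value is kept live in them
/// across it, but the bit is set only when caller and callee
/// share an ABI.
pub fn call_inst(caller: CallConv, callee: CallConv) -> InstInfo {
    InstInfo {
        defs: (0..32).collect(),
        exclude_from_clobbers: caller == callee,
    }
}

/// Callee-saved registers (v8-v15, conservatively in full) written
/// by instructions that do not carry the exclusion bit.
pub fn saved_clobbers(body: &[InstInfo]) -> BTreeSet<u8> {
    body.iter()
        .filter(|inst| !inst.exclude_from_clobbers)
        .flat_map(|inst| inst.defs.iter().copied())
        .filter(|r| (8..=15).contains(r))
        .collect()
}
```

With a same-ABI call the prologue saves only what the rest of the body writes; a call to a callee with a different ABI keeps the conservative behaviour and forces saves of v8-v15.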
Last updated: Dec 23 2024 at 13:07 UTC