alexcrichton opened PR #1573 from aarch64-stack-limit
to master
:
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets thestack_limit
argument purpose as well as
a function's globalstack_limit
directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
alexcrichton edited PR #1573 from aarch64-stack-limit
to master
:
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets thestack_limit
argument purpose as well as
a function's globalstack_limit
directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
Closes #1569
alexcrichton updated PR #1573 from aarch64-stack-limit
to master
:
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets thestack_limit
argument purpose as well as
a function's globalstack_limit
directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
Closes #1569
bnjbvr requested cfallin for a review on PR #1573.
bjorn3 created PR Review Comment:
// documentation in x86/abi.rs for why this is here. The general idea is
bjorn3 submitted PR Review.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm. I'm not wild about having this GlobalValue expansion implementation local to the stack-limit code, but upon looking at
aarch64/lower.rs
, it seems that in all other cases the legalization pass transforms actualGlobalValue
ops into core ops as here.Perhaps we could at least move
generate_gv
toaarch64/lower.rs
(passing in the vmctx reg as an arg)? This could eventually be used if we shift from the current encodings-based legalization framework to something simpler/more purpose-built as well.
cfallin created PR Review Comment:
Factor this pattern out (also occurs above) -- take
f
andsig
, and return reg for vmctx plusVec<Inst>
?To allocate a new temp as an
into_reg
to holdstack_limit
, we'll need to finagle the plumbing a bit. Vreg temps are allocated by theLowerCtx
, which we've kept separate from the ABI traits, and using this will be tricky because we can't have generic methods on theABIBody
trait. The ordering is slightly tricky too: theABIBody
is created before lowering starts, so that's too early to alloc a temp, butgen_prologue
happens late (we lower instructions in reverse order) so we can't do it there either.Suggestion: perhaps add an
init
method on theABIBody
trait that takes a "just in case" temp? (vregs are basically free, aside from creating sparsity in the ID space, so this is fine.) Then this could setstack_limit
for use bygen_prologue
and any calls.
cfallin created PR Review Comment:
Eek, I'm not sure if it's safe to use
spilltmp
here -- or at least it's not clear it will interact well with anything we might write in the future. The basic invariant is thatspilltmp
should only be used within a sequence of instructions generated as one unit (the original use was to compute large stackframe offsets for regalloc spills/reloads, hence the name). We should probably take aninto_reg
as an argument ofgenerate_gv
here, and then by implication, togen_stack_limit
too.(I should've added more documentation about this, sorry -- if you wouldn't mind adding a comment on
spilltmp_reg()
to this effect, that'd be very helpful, thanks!)
cfallin created PR Review Comment:
Are we worried about >4K struct offsets? If we're making this into a more generic thing anyway, we could handle the overflow case here too -- basically just use
Inst::load_constant(rd, value)
giving aVec<Inst>
(and we can use the spilltemp reg for the constant).
cfallin created PR Review Comment:
This little register dance hopefully goes away if we use a proper vreg for
stack_limit
...
cfallin created PR Review Comment:
Actually, one more thought: after reading through the rest of the patch, it seems that
stack_limit
is not used beyond the prologue. Given this, I think it would be fine to usespilltmp
withingen_prologue
itself, and then carefully validate that we don't keep it live across anything else that could implicitly use it. This includes any virtual registers at all: the register allocator is liable to pop in and do a surprise spill with a big offset somewhere, wiping outspilltmp
.
cfallin created PR Review Comment:
This should have a test case in the emission tests (
test_aarch64_binemit()
below), with the golden machine code validated as the comment notes (using real aarch64 assembler). Thanks!
alexcrichton submitted PR Review.
alexcrichton created PR Review Comment:
Oh nice,
Inst::load_constant
works great! While I suspect that >4K offsets aren't really that common it seems easy enough to support, so I've gone ahead and added it in.
alexcrichton submitted PR Review.
alexcrichton created PR Review Comment:
So FWIW this is definitely an "unfortunate" special context. This is "duplicated" with the x86 backend as well. The reason for this is that prologue/epilogue happens so late (way after legalization and even after register allocation) that it's difficult to share code. The hope is that the number of global values we'd need to process wouldn't be that large and/or it'd be easy to extend them over time.
In that sense I suspect that the global value legalization pass is unlikely to go away (it's architecture independent too!). That being said I'm happy to move this code around. Given that do you think it'd still have a better home in
lower.rs
?(I can definitely do a better job about documenting all this too)
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, OK. I still think that we should move this to
lower.rs
, but other refactoring and cleanup can come later -- don't want to hold this up for it!
alexcrichton updated PR #1573 from aarch64-stack-limit
to master
:
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets thestack_limit
argument purpose as well as
a function's globalstack_limit
directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
Closes #1569
cfallin submitted PR Review.
cfallin created PR Review Comment:
nit: s/caler/caller/
cfallin submitted PR Review.
alexcrichton updated PR #1573 from aarch64-stack-limit
to master
:
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets thestack_limit
argument purpose as well as
a function's globalstack_limit
directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
Closes #1569
alexcrichton merged PR #1573.
Last updated: Jan 24 2025 at 00:11 UTC