jyn514 commented on Issue #1105:
How hard would this be to implement? I'm willing to take a shot at it.
tschneidereit commented on Issue #1105:
@bnjbvr, @cfallin, @julian-seward1, can you comment on this?
bnjbvr commented on Issue #1105:
I'm assuming the question arises in the context of the new backend.
From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space. If that's true, it should be somewhat easy to implement (add amount to SP, adjust the "nominal SP" offset, make sure to deallocate in the return paths).
If one can pass a dynamic input value that's the amount to allocate, it is likely to be much trickier, because we need to be able to track precisely the running SP value within the function's body: that's what the nominal SP offset does in a static manner. It should be implementable, but it might require using a register for this purpose.
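To make the static case concrete, here is a minimal sketch in Rust of what tracking the offset "in a static manner" could look like. The SpTracker type and its method names are hypothetical, made up for illustration, and are not Cranelift's actual code; the sketch only assumes a downward-growing stack.

```rust
/// Hypothetical model (not Cranelift's real implementation) of statically
/// tracking how far real SP sits below "nominal SP" during code generation.
#[derive(Debug, Default)]
struct SpTracker {
    /// Bytes by which real SP is currently below nominal SP.
    nominal_sp_offset: i64,
}

impl SpTracker {
    /// Record that the generated code moved real SP down by a
    /// compile-time-known number of bytes (a static allocation).
    fn adjust_sp_down(&mut self, bytes: i64) {
        self.nominal_sp_offset += bytes;
    }

    /// Record that the generated code moved real SP back up (deallocation).
    fn adjust_sp_up(&mut self, bytes: i64) {
        self.nominal_sp_offset -= bytes;
    }

    /// Offset from real SP that reaches a slot located `slot_offset`
    /// bytes above nominal SP.
    fn real_sp_offset(&self, slot_offset: i64) -> i64 {
        slot_offset + self.nominal_sp_offset
    }
}
```

With a dynamic amount there is no constant to pass to adjust_sp_down, which is why the compile-time bookkeeping stops being enough and a register would be needed instead.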
bjorn3 commented on Issue #1105:
> From looking at LLVM's docs, it seems that alloca always takes a static (= known at compile time) amount of stack space.
No, it also allows a dynamic input. It is just that a static input is equivalent to using stack slots in Cranelift.
cfallin edited a comment on Issue #1105:
It's definitely possible to implement this with the new backends. It interacts with the way we address stack slots and spill slots; at least on aarch64, we address function arguments with fp, which stays at the top of the stack frame (invariant to any allocas), but we address stack/spill slots with offsets from sp, because positive offsets are cheaper on aarch64. We track "nominal SP" as an offset from real SP (statically during codegen), so we can continue to access this storage while we've temporarily pushed args to set up for a call (EDIT: or, with alloca support, after we've decremented real SP to allocate storage).
The most straightforward approach would probably be to (i) detect when an alloca (or just a dynamic alloca) is present; then if so, (ii) allocate a separate scratch register in the prologue and copy nominal-SP to that; then (iii) access all stack and spill slots relative to that register. We lose a register in that case but I think that's unavoidable unless we revert to negative offsets from FP (which has a higher cost -- a few percent degradation at least, because it forces add instructions to synthesize addresses when offset more than -0x80, IIRC).
Happy to point out the bits that would need to change in more detail if you would like!
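Steps (i) to (iii) can be sketched roughly as follows in Rust. Everything here (BaseReg, FrameInfo, slot_address) is a hypothetical name chosen for illustration and does not reflect the real backend code; it only shows the choice of base register for a stack or spill slot.

```rust
/// Which register a stack/spill slot access is based on.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BaseReg {
    /// Real SP, corrected by the statically tracked nominal-SP offset.
    Sp,
    /// Scratch register holding a copy of nominal SP, set up in the prologue.
    FrameBase,
}

/// Hypothetical per-function summary computed before emitting code (step i).
struct FrameInfo {
    has_dynamic_alloca: bool,
}

/// Pick the base register and displacement for a slot that lives
/// `slot_offset` bytes above nominal SP.
fn slot_address(frame: &FrameInfo, slot_offset: i64, nominal_sp_offset: i64) -> (BaseReg, i64) {
    if frame.has_dynamic_alloca {
        // Steps (ii)/(iii): real SP moves by amounts unknown at compile
        // time, so address slots from the copied nominal SP instead.
        // The static offset is not meaningful here and is ignored.
        (BaseReg::FrameBase, slot_offset)
    } else {
        // No dynamic alloca: real SP plus the static correction is enough,
        // and no extra register is reserved.
        (BaseReg::Sp, slot_offset + nominal_sp_offset)
    }
}
```

In both branches the displacement stays non-negative, which matches the preference above for positive offsets; the cost of the dynamic case is the one reserved register.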
peterhuene edited a comment on Issue #1105:
Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size. This would free RBP to be used as a GPR for functions that do not have dynamic stack allocations or as the "nominal-SP" register for functions that have dynamic stack allocations.
In fact, on Windows x64, a "frame pointer" is supposed to be exactly what you describe the "nominal-SP" register as: a register pointing at the base (or somewhere inside) of the "static" part of the local frame and used to reference args/locals (and CSRs for unwind) by positive offset. For that ABI, a frame pointer is therefore generally only established for frames calling alloca.
Right now the x64 prologue/epilogue instructions relating to the establishment of a traditional frame pointer are simply wasted instructions on Windows.
bjorn3 commented on Issue #1105:
> Related to this, at least for the x86-64 ABIs, I would like to see Cranelift stop using RBP as a "traditional" frame pointer as both DWARF and Windows unwind information encode enough information to properly describe frame layout without having to establish a frame pointer for frames of static size.
This should be an option in my opinion. Using DWARF unwinding for perf profiles as opposed to frame pointers results in much bigger perf.data files and slower perf report, as it requires capturing a big chunk of the stack and then performing the unwinding offline. Online unwinding using DWARF tables is simply too slow.
peterhuene commented on Issue #1105:
> This should be an option in my opinion.
Definitely, but I think omitting a traditional frame pointer should be default for these ABIs, at least for optimized compilations. An option to opt-in when they are legitimately needed (like in the case of a tool relying on them for fast stack walks) makes sense to me.
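As a rough sketch of the policy being discussed (in Rust, with hypothetical names; the force_frame_pointers setting in particular is invented for illustration and is not an existing Cranelift flag), the decision about RBP could look like this:

```rust
/// What RBP ends up being used for in a given function.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RbpUse {
    /// Classic frame pointer chain, e.g. for fast stack walks by a profiler.
    TraditionalFramePointer,
    /// Base register for the static part of a frame that also has
    /// dynamic stack allocations (the "nominal-SP" register).
    DynamicFrameBase,
    /// Free for the register allocator to use like any other GPR.
    GeneralPurpose,
}

/// Hypothetical compilation settings.
struct Settings {
    /// Opt-in for users who need traditional frame pointers.
    force_frame_pointers: bool,
}

/// Decide how RBP is used for one function.
fn rbp_use(settings: &Settings, has_dynamic_alloca: bool) -> RbpUse {
    if settings.force_frame_pointers {
        RbpUse::TraditionalFramePointer
    } else if has_dynamic_alloca {
        RbpUse::DynamicFrameBase
    } else {
        RbpUse::GeneralPurpose
    }
}
```

The default path (no opt-in, no dynamic allocation) is the one that frees RBP, which is the behaviour argued for above for optimized compilations.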
cfallin commented on Issue #1105:
@peterhuene that's a good point -- could you create a separate issue for that? I definitely agree that -fomit-frame-pointer optimizations are something we should look into at some point.
peterhuene commented on Issue #1105:
I opened #1149 a while back specific to Windows. Should we create a more general "omit frame pointers when permitted" issue?
cfallin commented on Issue #1105:
Sure, I think it makes sense to track with a separate issue; it's a distinct thing that we'd want to do on any platform when we're allowed to (by ABIs and by debug requirements).
peterhuene commented on Issue #1105:
I've opened #2073.
Last updated: Dec 23 2024 at 12:05 UTC