peterhuene edited Issue #1201:
- What is the feature or code improvement you would like to do in Cranelift?
Improve the code generation for Windows x64 ABI (fastcall) to reduce function size by omitting frame pointers, using the caller-provided shadow space as spill slots (for optimized compilation), and potentially omitting prologues/epilogues entirely (for "leaf" functions).
- What is the value of adding this in Cranelift?
Reduced code generation size and improved performance on optimized compilations when targeting Windows.
- **Do you have an implementation plan, and/or ideas for data structures or algorithms to use?**
Windows x64 ABI has strict requirements for function prologues and epilogues. This enables the OS to consistently walk and unwind the stack during exception handling.
Because of these strict requirements, a frame pointer is rarely needed for the purpose of unwinding and is only required for a frame doing a dynamic allocation (i.e. `alloca`). However, keeping the frame pointer may mean smaller instruction sizes based on the displacement from a frame pointer vs. the current stack pointer.
Additionally, an explicit stack frame is not necessary at all for "leaf" functions. We can omit prologue, epilogue, and unwind information generation entirely for functions that don't:
- Have any stack allocation.
- Call another function (i.e. "leaf").
- Have a need to save non-volatile registers (i.e. limited only to modifying volatile registers).
I therefore propose the following changes:
- Favor omitting the frame pointer unless there's a call to dynamically allocate stack space (if ever supported) or such that the size cost of the function outweighs the benefit of having an additional GPR. Note: the current implementation expects that the pushing of the previous frame pointer will realign the stack prior to the call to `layout_stack`, so omitting the frame pointer will impact this assumption.
- For unoptimized compilation, home register-passed function arguments into the caller-provided shadow space so debuggers can find the arguments even without debug information.
- For optimized compilation, consider the caller-provided shadow space to be scratch and treat it as "preallocated" spill slots for the current frame. However, it looks like this may violate some layout assumptions in `layout_stack` and require more consideration.
- Detect if the function is leaf (based on the definition above) and skip prologue and epilogue generation.
- Skip allocation of shadow space if a function does not call.
The result should be smaller function code generation on Windows, especially in the case of leaf functions.
- **Have you considered alternative implementations? If so, how are they better or worse than your proposal?**
I have not considered alternative implementations.
This issue was motivated by #1199.
peterhuene edited Issue #1201:
- What is the feature or code improvement you would like to do in Cranelift?
Improve the code generation for Windows x64 ABI (fastcall) to reduce function size by omitting frame pointers, using the caller-provided shadow space as spill slots (for optimized compilation), and potentially omitting prologues/epilogues entirely (for "leaf" functions).
- What is the value of adding this in Cranelift?
Reduced code generation size and improved performance on optimized compilations when targeting Windows.
- **Do you have an implementation plan, and/or ideas for data structures or algorithms to use?**
Windows x64 ABI has strict requirements for function prologues and epilogues. This enables the OS to consistently walk and unwind the stack during exception handling.
Because of these strict requirements, a frame pointer is rarely needed for the purpose of unwinding and is only required for a frame doing a dynamic allocation (i.e. `alloca`). However, keeping the frame pointer may mean smaller instruction sizes based on the displacement from a frame pointer vs. the current stack pointer.
Additionally, an explicit stack frame is not necessary at all for "leaf" functions. We can omit prologue, epilogue, and unwind information generation entirely for functions that don't:
- Have any stack allocation.
- Call another function (i.e. "leaf").
- Have a need to save non-volatile registers (i.e. limited only to modifying volatile registers).
I therefore propose the following changes:
- Favor omitting the frame pointer unless there's a call to dynamically allocate stack space (if ever supported) or such that the omission of the frame pointer results in an increase of the function's size (having an additional GPR is probably not worth it). Note: the current implementation expects that the pushing of the previous frame pointer will realign the stack prior to the call to `layout_stack`, so omitting the frame pointer will impact this assumption.
- For unoptimized compilation, home register-passed function arguments into the caller-provided shadow space so debuggers can find the arguments even without debug information.
- For optimized compilation, consider the caller-provided shadow space to be scratch and treat it as "preallocated" spill slots for the current frame. However, it looks like this may violate some layout assumptions in `layout_stack` and require more consideration.
- Detect if the function is leaf (based on the definition above) and skip prologue and epilogue generation.
- Skip allocation of shadow space if a function does not call.
The result should be smaller function code generation on Windows, especially in the case of leaf functions.
- **Have you considered alternative implementations? If so, how are they better or worse than your proposal?**
I have not considered alternative implementations.
This issue was motivated by #1199.
peterhuene edited Issue #1201:
- What is the feature or code improvement you would like to do in Cranelift?
Improve the code generation for Windows x64 ABI (fastcall) to reduce function size by omitting frame pointers, using the caller-provided shadow space as spill slots (for optimized compilation), and potentially omitting prologues/epilogues entirely (for "leaf" functions).
- What is the value of adding this in Cranelift?
Reduced code generation size and improved performance on optimized compilations when targeting Windows.
- **Do you have an implementation plan, and/or ideas for data structures or algorithms to use?**
Windows x64 ABI has strict requirements for function prologues and epilogues. This enables the OS to consistently walk and unwind the stack during exception handling.
Because of these strict requirements, a frame pointer is rarely needed for the purpose of unwinding and is only required for a frame doing a dynamic allocation (i.e. `alloca`). However, keeping the frame pointer may mean smaller instruction sizes based on the displacement from a frame pointer vs. the current stack pointer.
Additionally, an explicit stack frame is not necessary at all for "leaf" functions. We can omit prologue, epilogue, and unwind information generation entirely for functions that don't:
- Have any stack allocation.
- Call another function (i.e. "leaf").
- Have a need to save non-volatile registers (i.e. limited only to modifying volatile registers).
I therefore propose the following changes:
- Favor omitting the frame pointer unless there's a call to dynamically allocate stack space (if ever supported) or such that the omission of the frame pointer results in an increase of the function's size to an unsatisfactory degree. Note: the current implementation expects that the pushing of the previous frame pointer will realign the stack prior to the call to `layout_stack`, so omitting the frame pointer will impact this assumption.
- For unoptimized compilation, home register-passed function arguments into the caller-provided shadow space so debuggers can find the arguments even without debug information.
- For optimized compilation, consider the caller-provided shadow space to be scratch and treat it as "preallocated" spill slots for the current frame. However, it looks like this may violate some layout assumptions in `layout_stack` and require more consideration.
- Detect if the function is leaf (based on the definition above) and skip prologue and epilogue generation.
- Skip allocation of shadow space if a function does not call.
The result should be smaller function code generation on Windows, especially in the case of leaf functions.
- **Have you considered alternative implementations? If so, how are they better or worse than your proposal?**
I have not considered alternative implementations.
This issue was motivated by #1199.
peterhuene edited Issue #1201:
- What is the feature or code improvement you would like to do in Cranelift?
Improve the code generation for Windows x64 ABI (a.k.a. "fastcall") by omitting frame pointers when possible, using the caller-provided shadow space as spill slots for optimized compilation, and omitting prologues/epilogues entirely for true "leaf" functions.
- What is the value of adding this in Cranelift?
Reduced code generation size and improved performance on optimized compilations when targeting Windows.
- **Do you have an implementation plan, and/or ideas for data structures or algorithms to use?**
Windows x64 ABI has strict requirements for function prologues and epilogues. This enables the OS to consistently walk and unwind the stack during exception handling.
Because of these strict requirements, a frame pointer is rarely needed for the purpose of unwinding and is only required for a frame doing a dynamic allocation (i.e. `alloca`). However, omitting the frame pointer might actually increase instruction sizes based on the displacement from a frame pointer vs. the current stack pointer. This should be taken into account when deciding if a frame pointer should be omitted.
Additionally, an explicit stack frame is not necessary at all for "leaf" functions. We can omit prologue, epilogue, and unwind information generation entirely for functions that don't:
- Have any stack allocation.
- Call another function (i.e. "leaf").
- Have a need to save non-volatile registers (i.e. limited only to modifying volatile registers).
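The three conditions above amount to a simple predicate. The following is a minimal sketch only: `FunctionSummary` and its fields are hypothetical names invented for illustration and are not Cranelift types.

```rust
/// Hypothetical per-function summary (an assumption for this sketch,
/// not part of Cranelift's API).
#[derive(Debug, Clone, Copy)]
pub struct FunctionSummary {
    /// Bytes of explicit stack slots and spills the function needs.
    pub stack_bytes: u32,
    /// Whether the function contains any call instruction.
    pub makes_calls: bool,
    /// Whether the function writes any non-volatile (callee-saved) register.
    pub clobbers_nonvolatile: bool,
}

/// Returns true when the function meets all three conditions listed above,
/// so prologue, epilogue, and unwind information could all be skipped.
pub fn can_omit_frame(f: &FunctionSummary) -> bool {
    f.stack_bytes == 0 && !f.makes_calls && !f.clobbers_nonvolatile
}
```

A function that fails any one of the checks, for example one that only saves a single non-volatile register, would still need a prologue and unwind information.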
I therefore propose the following changes:
- Favor omitting the frame pointer unless there's a call to dynamically allocate stack space (if ever supported) or such that the omission of the frame pointer results in an increase of the function's size to an unsatisfactory degree. Note: the current implementation expects that the pushing of the previous frame pointer will realign the stack prior to the call to `layout_stack`, so omitting the frame pointer will impact this assumption.
- For unoptimized compilation, home register-passed function arguments into the caller-provided shadow space so debuggers can find the arguments even without debug information.
- For optimized compilation, consider the caller-provided shadow space to be scratch and treat it as "preallocated" spill slots for the current frame. However, it looks like this may violate some layout assumptions in `layout_stack` and require more consideration.
- Detect if the function is leaf (based on the definition above) and skip prologue and epilogue generation.
- Skip allocation of shadow space if a function does not call.
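To illustrate the size trade-off behind the first bullet, here is a rough sketch of how the addressing bytes of a stack access differ between RSP-relative and RBP-relative forms. It relies on the x86-64 encoding rules that RSP as a base always needs a SIB byte and that RBP as a base has no displacement-free form. The helper names, the 5-byte setup cost (`push rbp; mov rbp, rsp; pop rbp`), and the assumption that RBP points at the top of the local area are all illustrative choices, not what Cranelift does.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Base {
    Rsp,
    Rbp,
}

/// Bytes of addressing information (ModRM, optional SIB, displacement)
/// for `[base + disp]`. Opcode and prefix bytes are not counted.
pub fn addressing_bytes(base: Base, disp: i32) -> u32 {
    let modrm = 1;
    // RSP as a base register always requires a SIB byte.
    let sib = if base == Base::Rsp { 1 } else { 0 };
    let disp_bytes = match (base, disp) {
        (Base::Rsp, 0) => 0,  // [rsp] needs no displacement
        (_, -128..=127) => 1, // disp8; [rbp] must be encoded as [rbp+0]
        _ => 4,               // disp32
    };
    modrm + sib + disp_bytes
}

/// Hypothetical heuristic: does keeping a frame pointer shrink the function?
/// `rsp_offsets` lists the RSP-relative offset of every stack access; the
/// same slot is assumed to sit at `offset - frame_size` relative to RBP.
pub fn frame_pointer_saves_bytes(frame_size: i32, rsp_offsets: &[i32]) -> bool {
    // Assumed cost of `push rbp; mov rbp, rsp; pop rbp`.
    const FP_SETUP_BYTES: u32 = 5;
    let via_rsp: u32 = rsp_offsets
        .iter()
        .map(|&o| addressing_bytes(Base::Rsp, o))
        .sum();
    let via_rbp: u32 = rsp_offsets
        .iter()
        .map(|&o| addressing_bytes(Base::Rbp, o - frame_size))
        .sum();
    via_rbp + FP_SETUP_BYTES < via_rsp
}
```

Under these assumptions, a frame pointer only pays for itself once a function has enough stack accesses to amortize the setup cost, which is consistent with favoring omission by default.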
The result should be smaller function code generation on Windows, especially in the case of leaf functions.
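As a rough sketch of how the shadow-space proposals could feed into a frame size computation, the function below derives the amount to subtract from RSP after any register pushes. The 32-byte shadow space, 16-byte stack alignment, and 8-byte return address come from the Windows x64 calling convention; `FrameInputs` and `stack_adjustment` are hypothetical names, and this is not how `layout_stack` currently works.

```rust
const SHADOW_SPACE_BYTES: u32 = 32; // four 8-byte home slots
const STACK_ALIGN: u32 = 16;
const WORD_BYTES: u32 = 8;

/// Hypothetical inputs to frame layout (an assumption for this sketch).
pub struct FrameInputs {
    pub spill_bytes: u32,
    pub saved_gprs: u32,
    pub makes_calls: bool,
    pub use_frame_pointer: bool,
    pub optimized: bool,
}

/// Bytes to subtract from RSP after pushing saved registers.
pub fn stack_adjustment(f: &FrameInputs) -> u32 {
    // Optimized builds treat the caller-provided shadow space as
    // preallocated spill slots; unoptimized builds reserve it for
    // homing register arguments.
    let spills_in_frame = if f.optimized {
        f.spill_bytes.saturating_sub(SHADOW_SPACE_BYTES)
    } else {
        f.spill_bytes
    };
    // Only functions that call need to provide shadow space for callees.
    let outgoing = if f.makes_calls { SHADOW_SPACE_BYTES } else { 0 };
    let pushed = WORD_BYTES * (f.saved_gprs + u32::from(f.use_frame_pointer));
    let body = spills_in_frame + outgoing;
    if body == 0 && pushed == 0 {
        return 0; // nothing to allocate or save: no frame needed
    }
    // On entry the return address has already been pushed, so RSP is
    // 8 bytes off 16-byte alignment; pad so it is aligned afterwards.
    let unaligned = WORD_BYTES + pushed + body;
    let padding = (STACK_ALIGN - unaligned % STACK_ALIGN) % STACK_ALIGN;
    body + padding
}
```

Note how the padding depends on whether the frame pointer is pushed: a calling function with no spills needs a 40-byte adjustment without a frame pointer but only 32 bytes with one, which is the realignment assumption the note about `layout_stack` refers to.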
- **Have you considered alternative implementations? If so, how are they better or worse than your proposal?**
I have not considered alternative implementations.
This issue was motivated by #1199.
alexcrichton transferred Issue #1201.
Last updated: Nov 22 2024 at 17:03 UTC