fitzgen added the cranelift label to Issue #10421.
fitzgen added the cranelift:area:debug label to Issue #10421.
fitzgen added the cranelift:goal:native-ABI label to Issue #10421.
fitzgen opened issue #10421:
Right Now
We emit a single frame description entry (FDE) per function, covering the function's entire PC range.
For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.
Given that, in order to find the CFA for a PC in the middle of a function, an
.eh_frame
-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluatesO(n)
CFI instructions.[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.
Example
Here is the Wasm function
f(x) = x + 1
compiled by Wasmtime and Cranelift:0000000000000000 <wasm[0]::function[0]>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 8d 42 01 leal 0x1(%rdx), %eax 7: 48 89 ec movq %rbp, %rsp a: 5d popq %rbp b: c3 retq
Here is the FDE that covers that function (lightly edited output of
dwarfdump --eh-frame
):FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c Format: DWARF32 DW_CFA_advance_loc: 1 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 DW_CFA_advance_loc: 3 DW_CFA_def_cfa_register: RBP 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
In this example, finding the CFA at function offset
0x004
and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000
and0x001
).(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)
(Aside 2: function offsets
0x001
and0x004
happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing theO(n)
CFI instruction evaluation.)Proposal: Generate Multiple FDEs per Function
If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then
.eh_frame
-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing theO(n)
evaluation with anO(1)
evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001 Format: DWARF32 DW_CFA_nop: 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004 Format: DWARF32 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c Format: DWARF32 DW_CFA_def_cfa_register: RBP 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be
DW_CFGA_def_cfa_register: RBP
and the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frame
allows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the.eh_frame
section, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs,
.eh_frame
-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final.eh_frame
emission scheme.
fitzgen edited issue #10421:
Right Now
We emit a single frame description entry (FDE) per function, covering the function's entire PC range.
For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.
Given that, in order to find the CFA for a PC in the middle of a function, an
.eh_frame
-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluatesO(n)
CFI instructions.[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.
Example
Here is the Wasm function
f(x) = x + 1
compiled by Wasmtime and Cranelift:0000000000000000 <wasm[0]::function[0]>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 8d 42 01 leal 0x1(%rdx), %eax 7: 48 89 ec movq %rbp, %rsp a: 5d popq %rbp b: c3 retq
Here is the FDE that covers that function (lightly edited output of
dwarfdump --eh-frame
):FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c Format: DWARF32 DW_CFA_advance_loc: 1 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 DW_CFA_advance_loc: 3 DW_CFA_def_cfa_register: RBP 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
In this example, finding the CFA at function offset
0x004
and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000
and0x001
).(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)
(Aside 2: function offsetsEdit: nevermind, I misread "rsp" as "rbp", they are in fact two different CFAs.0x001
and0x004
happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing theO(n)
CFI instruction evaluation.)Proposal: Generate Multiple FDEs per Function
If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then
.eh_frame
-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing theO(n)
evaluation with anO(1)
evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001 Format: DWARF32 DW_CFA_nop: 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004 Format: DWARF32 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c Format: DWARF32 DW_CFA_offset: RBP -16 DW_CFA_def_cfa_offset: +16 DW_CFA_def_cfa_register: RBP 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be
DW_CFGA_def_cfa_register: RBP
and the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frame
allows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the.eh_frame
section, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs,
.eh_frame
-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final.eh_frame
emission scheme.
bjorn3 commented on issue #10421:
Wouldn't this cause the
.eh_frame_hdr
section to triple in size, making finding the right FDE slower?
fitzgen commented on issue #10421:
Yes, but (a) the
.eh_frame_hdr
section is binary-searched through and moving somen
from aO(n)
toO(log n)
process seems like a win and (b) I think that the size that the.eh_frame_hdr
grows will not be larger (and will probably smaller?) than the size that the.eh_frame
shrinks due to the additional deduplication.
Last updated: Apr 18 2025 at 21:03 UTC