fitzgen added the cranelift label to Issue #10421.
fitzgen added the cranelift:area:debug label to Issue #10421.
fitzgen added the cranelift:goal:native-ABI label to Issue #10421.
fitzgen opened issue #10421:
Right Now
We emit a single frame description entry (FDE) per function, covering the function's entire PC range.
For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.
Given that, in order to find the CFA for a PC in the middle of a function, an
.eh_frame-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluatesO(n)CFI instructions.[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.
Example
Here is the Wasm function
f(x) = x + 1compiled by Wasmtime and Cranelift:0000000000000000 <wasm[0]::function[0]>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 8d 42 01 leal 0x1(%rdx), %eax 7: 48 89 ec movq %rbp, %rsp a: 5d popq %rbp b: c3 retqHere is the FDE that covers that function (lightly edited output of
dwarfdump --eh-frame):FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c Format: DWARF32 DW_CFA_advance_loc: 1 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 DW_CFA_advance_loc: 3 DW_CFA_def_cfa_register: RBP 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]In this example, finding the CFA at function offset
0x004and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000and0x001).(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)
(Aside 2: function offsets
0x001and0x004happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing theO(n)CFI instruction evaluation.)Proposal: Generate Multiple FDEs per Function
If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then
.eh_frame-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing theO(n)evaluation with anO(1)evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001 Format: DWARF32 DW_CFA_nop: 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004 Format: DWARF32 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c Format: DWARF32 DW_CFA_def_cfa_register: RBP 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be
DW_CFGA_def_cfa_register: RBPand the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frameallows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the.eh_framesection, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs,
.eh_frame-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final.eh_frameemission scheme.
fitzgen edited issue #10421:
Right Now
We emit a single frame description entry (FDE) per function, covering the function's entire PC range.
For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.
Given that, in order to find the CFA for a PC in the middle of a function, an
.eh_frame-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluatesO(n)CFI instructions.[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.
Example
Here is the Wasm function
f(x) = x + 1compiled by Wasmtime and Cranelift:0000000000000000 <wasm[0]::function[0]>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 8d 42 01 leal 0x1(%rdx), %eax 7: 48 89 ec movq %rbp, %rsp a: 5d popq %rbp b: c3 retqHere is the FDE that covers that function (lightly edited output of
dwarfdump --eh-frame):FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c Format: DWARF32 DW_CFA_advance_loc: 1 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 DW_CFA_advance_loc: 3 DW_CFA_def_cfa_register: RBP 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]In this example, finding the CFA at function offset
0x004and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000and0x001).(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)
(Aside 2: function offsetsEdit: nevermind, I misread "rsp" as "rbp", they are in fact two different CFAs.0x001and0x004happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing theO(n)CFI instruction evaluation.)Proposal: Generate Multiple FDEs per Function
If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then
.eh_frame-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing theO(n)evaluation with anO(1)evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001 Format: DWARF32 DW_CFA_nop: 0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004 Format: DWARF32 DW_CFA_def_cfa_offset: +16 DW_CFA_offset: RBP -16 0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8] FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c Format: DWARF32 DW_CFA_offset: RBP -16 DW_CFA_def_cfa_offset: +16 DW_CFA_def_cfa_register: RBP 0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be
DW_CFGA_def_cfa_register: RBPand the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frameallows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the.eh_framesection, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs,
.eh_frame-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final.eh_frameemission scheme.
bjorn3 commented on issue #10421:
Wouldn't this cause the
.eh_frame_hdrsection to triple in size, making finding the right FDE slower?
fitzgen commented on issue #10421:
Yes, but (a) the
.eh_frame_hdrsection is binary-searched through and moving somenfrom aO(n)toO(log n)process seems like a win and (b) I think that the size that the.eh_frame_hdrgrows will not be larger (and will probably smaller?) than the size that the.eh_frameshrinks due to the additional deduplication.
Last updated: Dec 13 2025 at 21:03 UTC