wasmtime / issue #10421 Cranelift: Improve the FDEs we ge... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #10421 Cranelift: Improve the FDEs we ge...

Wasmtime GitHub notifications bot (Mar 19 2025 at 18:34):

fitzgen added the cranelift label to Issue #10421.

Wasmtime GitHub notifications bot (Mar 19 2025 at 18:34):

fitzgen added the cranelift:area:debug label to Issue #10421.

Wasmtime GitHub notifications bot (Mar 19 2025 at 18:34):

fitzgen added the cranelift:goal:native-ABI label to Issue #10421.

Wasmtime GitHub notifications bot (Mar 19 2025 at 18:34):

fitzgen opened issue #10421:

Right Now

We emit a single frame description entry (FDE) per function, covering the function's entire PC range.

For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.

Given that, in order to find the CFA for a PC in the middle of a function, an .eh_frame-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluates O(n) CFI instructions.

[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.

Example

Here is the Wasm function f(x) = x + 1 compiled by Wasmtime and Cranelift:
0000000000000000 <wasm[0]::function[0]>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: 8d 42 01                      leal    0x1(%rdx), %eax
       7: 48 89 ec                      movq    %rbp, %rsp
       a: 5d                            popq    %rbp
       b: c3                            retq
Here is the FDE that covers that function (lightly edited output of dwarfdump --eh-frame):
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c
  Format:       DWARF32
  DW_CFA_advance_loc: 1
  DW_CFA_def_cfa_offset: +16
  DW_CFA_offset: RBP -16
  DW_CFA_advance_loc: 3
  DW_CFA_def_cfa_register: RBP

  0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8]
  0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8]
  0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
In this example, finding the CFA at function offset 0x004 and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000 and 0x001).

(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)

(Aside 2: function offsets 0x001 and 0x004 happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing the O(n) CFI instruction evaluation.)

Proposal: Generate Multiple FDEs per Function

If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then .eh_frame-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing the O(n) evaluation with an O(1) evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.

As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001
  Format:       DWARF32
  DW_CFA_nop:

  0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8]

FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004
  Format:       DWARF32
  DW_CFA_def_cfa_offset: +16
  DW_CFA_offset: RBP -16

  0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8]

FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c
  Format:       DWARF32
  DW_CFA_def_cfa_register: RBP

  0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be DW_CFGA_def_cfa_register: RBP and the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frame allows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the .eh_frame section, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.

[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs, .eh_frame-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final .eh_frame emission scheme.

Wasmtime GitHub notifications bot (Mar 19 2025 at 18:45):

fitzgen edited issue #10421:

Right Now

We emit a single frame description entry (FDE) per function, covering the function's entire PC range.

For every range of code within that function that has a different offset to the canonical frame address (CFA), we emit call frame infromation (CFI) instructions which advance the logical CFI machine's PC to this code range and then describe the new way to get the CFA as a delta from the previous code range's method.

Given that, in order to find the CFA for a PC in the middle of a function, an .eh_frame-using stack walker has to first do an evaluation of all the CFI instructions for the prologue and anything leading up to its current PC before it knows how to find the CFA for its current location.[^caching] This evaluates O(n) CFI instructions.

[^caching]: The stack walker could cache or preprocess these results, but that is a bit of an aside. It implies additional time and/or space overheads that might make sense for, e.g., a profiler that will frequently walk stacks, but wouldn't make sense for the system's C++ unwinder which might rarely, if ever, actually walk and unwind the stack. Even if some stack walkers won't benefit from the FDE improvements described in this issue, there will always be other stack walkers with different trade offs that will benefit.

Example

Here is the Wasm function f(x) = x + 1 compiled by Wasmtime and Cranelift:
0000000000000000 <wasm[0]::function[0]>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: 8d 42 01                      leal    0x1(%rdx), %eax
       7: 48 89 ec                      movq    %rbp, %rsp
       a: 5d                            popq    %rbp
       b: c3                            retq
Here is the FDE that covers that function (lightly edited output of dwarfdump --eh-frame):
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff00c
  Format:       DWARF32
  DW_CFA_advance_loc: 1
  DW_CFA_def_cfa_offset: +16
  DW_CFA_offset: RBP -16
  DW_CFA_advance_loc: 3
  DW_CFA_def_cfa_register: RBP

  0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8]
  0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8]
  0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
In this example, finding the CFA at function offset 0x004 and onward (i.e. after the function prologue) requires evaluating the CFI instructions to find the CFA at all earlier function offsets (0x000 and 0x001).

(Aside 1: we don't properly encode the CFA for the epilogue, but probably should for completeness. Will file a separate issue about this.)

(Aside 2: function offsets 0x001 and 0x004 happen to define identical CFAs with different approaches. We would ideally recognize and optimize this somewhere, but writing a CFI instruction optimizer is maybe more than we want to ever bite off, it isn't that important compared to removing the O(n) CFI instruction evaluation.) Edit: nevermind, I misread "rsp" as "rbp", they are in fact two different CFAs.

Proposal: Generate Multiple FDEs per Function

If we instead emit multiple FDEs per function, one FDE per range of code that has its own CFA, then .eh_frame-using stack walkers can evaluate only the CFI instructions that define their target PC's CFA, replacing the O(n) evaluation with an O(1) evaluation. That is, DWARF defines evaluating CFI instructions as decompressing a PC-range-to-CFA-recovery table, and each FDE can represent multiple rows in that table; right now, our FDEs do encode multiple rows, and I'm proposing that we instead emit more, smaller FDEs that each encode only a single table row.

As a first step, I'd expect the previous example's FDE to be exploded into something like this:
FDE cie=00000000 pc=fffffffffffff000...fffffffffffff001
  Format:       DWARF32
  DW_CFA_nop:

  0xfffffffffffff000: CFA=RSP+8: RIP=[CFA-8]

FDE cie=00000000 pc=fffffffffffff001...fffffffffffff004
  Format:       DWARF32
  DW_CFA_def_cfa_offset: +16
  DW_CFA_offset: RBP -16

  0xfffffffffffff001: CFA=RSP+16: RBP=[CFA-16], RIP=[CFA-8]

FDE cie=00000000 pc=fffffffffffff004...fffffffffffff00c
  Format:       DWARF32
  DW_CFA_offset: RBP -16
  DW_CFA_def_cfa_offset: +16
  DW_CFA_def_cfa_register: RBP

  0xfffffffffffff004: CFA=RBP+16: RBP=[CFA-16], RIP=[CFA-8]
As a next step, we could do even better because (at least with frame pointers enabled, like in this example) we only ever emit a small number of different CFI instruction sequences in our FDEs. Almost all FDEs will just be DW_CFGA_def_cfa_register: RBP and the minority that aren't exactly that (e.g. the FDEs for function proglogues) are also going to emit one of a common small handful of different sequences. DWARF/.eh_frame allows us to deduplicate shared prefixes of CFI instructions in common information entries (CIEs). Every FDE has an associated CIE that defines the initial CFI instructions to run, defining the initial CFI machine state, before evaluating the FDE's CFI instructions. We could define an additional CIE for each of our handful of CFI instruction sequences, and then our FDE wouldn't need additional CFI instructions, it could just inherit the CFI machine state from its CIE.[^cie-caching] This would deduplicate all our CFI instruction sequences in the .eh_frame section, recovering (and probably improving upon) any binary size compression that we might have lost during the introduction of multiple FDEs per function.

[^cie-caching]: And because (unlike FDEs) there tend to be only a few CIEs, .eh_frame-using stack walkers that won't do complex pre-processing and fine-grained caching of FDE lookups and evaluation results will often cache the results of evaluating each CIE, effectively caching the evaluation of all CFI instructions that will ever need to be evaluated for our FDEs under this final .eh_frame emission scheme.

Wasmtime GitHub notifications bot (Mar 19 2025 at 19:11):

bjorn3 commented on issue #10421:

Wouldn't this cause the .eh_frame_hdr section to triple in size, making finding the right FDE slower?

Wasmtime GitHub notifications bot (Mar 19 2025 at 20:53):

fitzgen commented on issue #10421:

Yes, but (a) the .eh_frame_hdr section is binary-searched through and moving some n from a O(n) to O(log n) process seems like a win and (b) I think that the size that the .eh_frame_hdr grows will not be larger (and will probably smaller?) than the size that the .eh_frame shrinks due to the additional deduplication.

Last updated: Apr 18 2025 at 21:03 UTC