fitzgen opened PR #4431 from fast-stack-walking
to main
:
Why do we want Wasm stack walking to be fast? Because we capture stacks whenever
there is a trap and traps actually happen fairly frequently with short-lived
programs and WASI'sexit
.Previously, we would rely on generating the system unwind info (e.g.
.eh_frame
) and using the system unwinder (via thebacktrace
crate) to walk
the full stack and filter out any non-Wasm stack frames. This can,
unfortunately, be slow for two primary reasons:
The system unwinder is doing
O(all-kinds-of-frames)
work rather than
O(wasm-frames)
work.System unwind info and the system unwinder need to be much more general than
a purpose-built stack walker for Wasm needs to be. It has to handle any kind of
stack frame that any compiler might emit where as our Wasm frames are emitted by
Cranelift and always have frame pointers. This translates into implementation
complexity and general overhead. There can also be unnecessary-for-our-use-cases
global synchronization and locks involved, further slowing down stack walking in
the presence of multiple threads trying to capture stacks in parallel.This commit introduces a purpose-built stack walker for traversing just our Wasm
frames. To find all the sequences of Wasm-to-Wasm stack frames, and ignore
non-Wasm stack frames, we keep a linked list of(entry stack pointer, exit frame pointer)
pairs. This linked list is maintained via Wasm-to-host and
host-to-Wasm trampolines. Within a sequence of Wasm-to-Wasm calls, we can use
frame pointers (which Cranelift preserves) to find the next older Wasm frame on
the stack, and we keep doing this until we reach the entry stack pointer,
meaning that the next older frame will be a host frame.The trampolines need to avoid a couple stumbling blocks. First, they need to be
compiled ahead of time, since we may not have access to a compiler at
runtime (e.g. if thecranelift
feature is disabled) but still want to be able
to call functions that have already been compiled and get stack traces for those
functions. Usually this means we would compile the appropriate trampolines
insideModule::new
and the compiled module object would hold the
trampolines. However, we also need to support calling host functions that are
wrapped intowasmtime::Func
s and there doesn't exist any ahead-of-time
compiled module object to hold the appropriate trampolines:// Define a host function. let func_type = wasmtime::FuncType::new( vec![wasmtime::ValType::I32], vec![wasmtime::ValType::I32], ); let func = Func::new(&mut store, func_type, |_, params, results| { // ... Ok(()) }); // Call that host function. let mut results = vec![wasmtime::Val::I32(0)]; func.call(&[wasmtime::Val::I32(0)], &mut results)?;
Therefore, we define one host-to-Wasm trampoline and one Wasm-to-host trampoline
in assembly that work for all Wasm and host function signatures. These
trampolines are careful to only use volatile registers, avoid touching any
register that is an argument in the calling convention ABI, and tail call to the
target callee function. This allows forwarding any set of arguments and any
returns to and from the callee, while also allowing us to maintain our linked
list of Wasm stack and frame pointers before transferring control to the
callee. These trampolines are not used in Wasm-to-Wasm calls, only when crossing
the host-Wasm boundary, so they do not impose overhead on regular calls. (And if
using one trampoline for all host-Wasm boundary crossing ever breaks branch
prediction enough in the CPU to become any kind of bottleneck, we can do fun
things like have multiple copies of the same trampoline and choose a random copy
for each function, sharding the functions across branch predictor entries.)Finally, this commit also ends the use of a synthetic
Module
and allocating a
stubbed outVMContext
for host functions. Instead, we define a
VMHostFuncContext
with its own magic value, similar toVMComponentContext
,
specifically for host functions.<h2>Benchmarks</h2>
<h3>Traps and Stack Traces</h3>
Large improvements to taking stack traces on traps, ranging from shaving off 64%
to 99.95% of the time it used to take.<details>
multi-threaded-traps/0 time: [2.5686 us 2.5808 us 2.5934 us] thrpt: [0.0000 elem/s 0.0000 elem/s 0.0000 elem/s] change: time: [-85.419% -85.153% -84.869%] (p = 0.00 < 0.05) thrpt: [+560.90% +573.56% +585.84%] Performance has improved. Found 8 outliers among 100 measurements (8.00%) 4 (4.00%) high mild 4 (4.00%) high severe multi-threaded-traps/1 time: [2.9021 us 2.9167 us 2.9322 us] thrpt: [341.04 Kelem/s 342.86 Kelem/s 344.58 Kelem/s] change: time: [-91.455% -91.294% -91.096%] (p = 0.00 < 0.05) thrpt: [+1023.1% +1048.6% +1070.3%] Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) high mild 5 (5.00%) high severe multi-threaded-traps/2 time: [2.9996 us 3.0145 us 3.0295 us] thrpt: [660.18 Kelem/s 663.47 Kelem/s 666.76 Kelem/s] change: time: [-94.040% -93.910% -93.762%] (p = 0.00 < 0.05) thrpt: [+1503.1% +1542.0% +1578.0%] Performance has improved. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high severe multi-threaded-traps/4 time: [5.5768 us 5.6052 us 5.6364 us] thrpt: [709.68 Kelem/s 713.63 Kelem/s 717.25 Kelem/s] change: time: [-93.193% -93.121% -93.052%] (p = 0.00 < 0.05) thrpt: [+1339.2% +1353.6% +1369.1%] Performance has improved. multi-threaded-traps/8 time: [8.6408 us 9.1212 us 9.5438 us] thrpt: [838.24 Kelem/s 877.08 Kelem/s 925.84 Kelem/s] change: time: [-94.754% -94.473% -94.202%] (p = 0.00 < 0.05) thrpt: [+1624.7% +1709.2% +1806.1%] Performance has improved. multi-threaded-traps/16 time: [10.152 us 10.840 us 11.545 us] thrpt: [1.3858 Melem/s 1.4760 Melem/s 1.5761 Melem/s] change: time: [-97.042% -96.823% -96.577%] (p = 0.00 < 0.05) thrpt: [+2821.5% +3048.1% +3281.1%] Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild many-modules-registered-traps/1 time: [2.6278 us 2.6361 us 2.6447 us] thrpt: [378.11 Kelem/s 379.35 Kelem/s 380.55 Kelem/s] change: time: [-85.311% -85.108% -84.909%] (p = 0.00 < 0.05) thrpt: [+562.65% +571.51% +580.76%] Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe many-modules-registered-traps/8 time: [2.6294 us 2.6460 us 2.6623 us] thrpt: [3.0049 Melem/s 3.0235 Melem/s 3.0425 Melem/s] change: time: [-85.895% -85.485% -85.022%] (p = 0.00 < 0.05) thrpt: [+567.63% +588.95% +608.95%] Performance has improved. Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe many-modules-registered-traps/64 time: [2.6218 us 2.6329 us 2.6452 us] thrpt: [24.195 Melem/s 24.308 Melem/s 24.411 Melem/s] change: time: [-93.629% -93.551% -93.470%] (p = 0.00 < 0.05) thrpt: [+1431.4% +1450.6% +1469.5%] Performance has improved. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild many-modules-registered-traps/512 time: [2.6569 us 2.6737 us 2.6923 us] thrpt: [190.17 Melem/s 191.50 Melem/s 192.71 Melem/s] change: time: [-99.277% -99.268% -99.260%] (p = 0.00 < 0.05) thrpt: [+13417% +13566% +13731%] Performance has improved. Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild many-modules-registered-traps/4096 time: [2.7258 us 2.7390 us 2.7535 us] thrpt: [1.4876 Gelem/s 1.4955 Gelem/s 1.5027 Gelem/s] change: time: [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05) thrpt: [+221417% +223380% +224881%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe many-stack-frames-traps/1 time: [1.4658 us 1.4719 us 1.4784 us] thrpt: [676.39 Kelem/s 679.38 Kelem/s 682.21 Kelem/s] change: time: [-90.368% -89.947% -89.586%] (p = 0.00 < 0.05) thrpt: [+860.23% +894.72% +938.21%] Performance has improved. Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe many-stack-frames-traps/8 time: [2.4772 us 2.4870 us 2.4973 us] thrpt: [3.2034 [message truncated]
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I think technically this should be aligned up to the pointer size?
alexcrichton created PR review comment:
It might be worth pointing out that this is 64-bit only
alexcrichton created PR review comment:
You might want to add this constant to the asserts below as well
alexcrichton created PR review comment:
In the end this'll probably want to get folded into
get_pc
above to avoid duplicating thecfg_if!
alexcrichton created PR review comment:
We'll actually need to replace this check with some other boolean in a
store
since the non-recursive behavior here was necessary to implement the currently desired semantics (stack limit set only once) and we just piggy-backed on the canary being set.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen created PR review comment:
Any suggestions of how to implement this check? Do we have anything already that we can piggy back on, the way we did with the stack canary, or do we need a new thing?
fitzgen submitted PR review.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Sorry I meant to answer this earlier and forgot. I think
stack_limit
could be suitable for this in theVMRuntimeLimits
. If it'susize::MAX
then it should be reset, otherwise it should be not tampered with.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
alexcrichton updated PR #4431 from fast-stack-walking
to main
.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Use either
X16
orX17
, otherwise the code becomes incompatible with BTI.
akirilov-arm created PR review comment:
As line 56.
akirilov-arm created PR review comment:
As line 73.
akirilov-arm created PR review comment:
Similarly to line 56, end with
.cfi_endproc
.
akirilov-arm created PR review comment:
As line 75.
akirilov-arm created PR review comment:
As line 56.
akirilov-arm created PR review comment:
As line 75.
fitzgen submitted PR review.
fitzgen created PR review comment:
All the CFI stuff will be fixed with using the macros in the
wasmtime-fibers
crate.
fitzgen submitted PR review.
fitzgen created PR review comment:
Specifically for the tail call, or only use these two registers in the whole trampoline? The first is very doable, the second a bit harder.
fitzgen submitted PR review.
fitzgen created PR review comment:
FWIW, you didn't leave a comment on line 56 so I'm not totally sure what you are referring to here. Given the context of the other comment referencing line 56, I assume this has to do with CFI pragmas? These will be handled once I start using the
wasmtime-fibers
crate's macros.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Just the tail call - the indirect branch is what would cause an issue.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Please, start this function with:
.cfi_startproc hint #34 // bti c
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Sorry, I must have accidentally deleted it; I have added it now.
akirilov-arm submitted PR review.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
pepyakin submitted PR review.
pepyakin created PR review comment:
The signature seems to be
[] → []
now
pepyakin submitted PR review.
pepyakin created PR review comment:
(In advance I want to apologize if this is not the right place to ask this)
I was working on this, and there I was relying on the ability to dump the value of a pinned register (r15 on x86_64) to
VMRuntimeLimits
when leaving the wasm code. I'm not sure how to do the same with the tail call though.Do you have any recommendations about that? Should I just introduce my own set of trampolines?
Thanks
pepyakin submitted PR review.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen has marked PR #4431 as ready for review.
fitzgen requested alexcrichton for a review on PR #4431.
fitzgen requested akirilov-arm for a review on PR #4431.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen submitted PR review.
fitzgen created PR review comment:
Good eye! Fixed.
fitzgen submitted PR review.
fitzgen created PR review comment:
I think you can do that in these existing trampolines. Depending on the overhead, we might do it unconditionally, even if slacked fuel wasn't enabled, or we might have a different version of the Wasm-to-host trampoline for when that feature is enabled (although this risks combinatorial blow up if we have to keep adding more trampolines for more features).
pepyakin submitted PR review.
pepyakin created PR review comment:
Yeah, that was my first impression as well, but later I realized that the r15, the fuel counter, has to be dumped after the wasm returns. The tail call makes the wasm return directly to the host, skipping the trampoline on the exit path.
If we replaced the
jmp
with a call, we would be able to place the r15 dumping code after the call, but as you mention, that would mess up stack arguments.
pepyakin submitted PR review.
pepyakin created PR review comment:
Now, giving it a little thought, it is also unclear what "my own set of trampolines" would do, since they will have to deal with the same problem: call the wasm callee with the arguments that were passed by the host caller.
Perhaps something along the lines of would work:
In the trampoline, stash the original return address (pointing to the host) somewhere and replace it with another new address. That new address would point to a "trampoline continuation". That continuation would dump r15 into vmctx, unstash the original return address and then return.
There has to be a better way though...
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Could this be expanded with a little bit about how we do stack walking today? Mostly just mentioning that frame pointers are required in wasm code and we use custom asm trampolines when entering and exiting wasm to ensure we record the boundaries of contiguous wasm stacks.
alexcrichton created PR review comment:
Could this grow some docs to explain which trampoline sets these fields and why they're expected to be valid here?
alexcrichton created PR review comment:
On second thought, do we actually need this file at all? The
VMOffsets
stuff is only related to "the compiler needs to know about offsets for the target" but I don't think any compilation we do actually modfiies aVMHostFuncContext
at all, right? We could change the test assertions for the offsets to using the real structs I think?
alexcrichton created PR review comment:
I forget why this condition is necessary now, could you add some docs to this?
alexcrichton created PR review comment:
Perhaps actually remove the condition here so other platforms can add conditions as appropriate? (if they don't already fall back to this case)
alexcrichton created PR review comment:
I mentioned this before, but personally I think it would be best to cut down on the
cfg_if!
here. I'd rather not have a duplicated set within each function. Could this instead move to the architecture-specific modules in thewasmtime-runtime
crate?
alexcrichton created PR review comment:
A note to fill out this TODO before landing
alexcrichton created PR review comment:
Could this grow some documentation to explain what
initial_pc_and_fp
are?
alexcrichton created PR review comment:
Perhaps delegate to
set_callee
above?
alexcrichton created PR review comment:
note to fill this out a bit more
alexcrichton created PR review comment:
Do you think it's worth going ahead and adding one of these assembly trampolines for all libcalls? Otherwise we could run the risk of if a gc happens during an otherwise-unrelated libcall that could incorrectly generate a backtrace I think?
Also do you think there's a way for us to guarantee that the asm wrapper is used once-per-libcall? Otherwise we might accidentally forget to use this macro.
alexcrichton created PR review comment:
Could you add some notes to the docs here about the unsafety and how
T
must be carefully selected?
alexcrichton created PR review comment:
I think this may actually be able to be
bti c
instead ofhint #34
since I presume LLVM's internal assembler is a bit more modern than random ancient binutils assemblers.
alexcrichton created PR review comment:
We should probably ping Chris or some embark folks to double-check that this PR passes tests on macOS when CI is otherwise green here and ready to go to double-check this.
alexcrichton created PR review comment:
I think we've resolved this "TODO"
alexcrichton created PR review comment:
Could you add some words to why this is done?
alexcrichton created PR review comment:
bump here on the TODO (and the note I had below about reusing the stack limit as the "canary" here)
alexcrichton created PR review comment:
Also, for each of these small helpers, could you add some architecture-specific documentation as to why the function body is the way it is?
pepyakin submitted PR review.
pepyakin created PR review comment:
Is this still needed or is it a leftover?
pepyakin created PR review comment:
Do you think this could use the same test as for
wasm_to_host_trampoline_offsets_tests
, just for documentation purposes?
pepyakin created PR review comment:
FWIW, it does not compile on my machine:
error[E0609]: no field `__opaque_fp` on type `__darwin_arm_thread_state64` --> crates/runtime/src/traphandlers/unix.rs:223:40 | 223 | (*cx.uc_mcontext).__ss.__opaque_fp as usize, | ^^^^^^^^^^^ unknown field | = note: available fields are: `__x`, `__fp`, `__lr`, `__sp`, `__pc` ... and 2 others
if changed to
__fp
I am getting linking errors related to libcalls= note: Undefined symbols for architecture arm64: "__wasmtime_out_of_gas", referenced from: _out_of_gas in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_out_of_gas::_$u7b$$u7b$closure$u7d$$u7d$::h82a1f6dfe40ded68, ___wasmtime_out_of_gas ) "__wasmtime_memory_atomic_wait64", referenced from: _memory_atomic_wait64 in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: ___wasmtime_memory_atomic_wait64, wasmtime_runtime::libcalls::__wasmtime_memory_atomic_wait64::_$u7b$$u7b$closure$u7d$$u7d$::h0572dbcd971fc0ae ) "__wasmtime_table_copy", referenced from: _table_copy in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_table_copy::_$u7b$$u7b$closure$u7d$$u7d$::h8a12719fb9aadf1b, ___wasmtime_table_copy ) "__wasmtime_memory_atomic_notify", referenced from: _memory_atomic_notify in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_memory_atomic_notify::_$u7b$$u7b$closure$u7d$$u7d$::h6e6f4ca809ea38f9, ___wasmtime_memory_atomic_notify ) "__wasmtime_memory_init", referenced from: _memory_init in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_memory_init::_$u7b$$u7b$closure$u7d$$u7d$::hda18b620453f6e32, ___wasmtime_memory_init ) "__wasmtime_memory_fill", referenced from: _memory_fill in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_memory_fill::_$u7b$$u7b$closure$u7d$$u7d$::h16f10cadc9476e42, ___wasmtime_memory_fill ) "__wasmtime_memory_atomic_wait32", referenced from: _memory_atomic_wait32 in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_memory_atomic_wait32::_$u7b$$u7b$closure$u7d$$u7d$::h560d4d674b202c95, ___wasmtime_memory_atomic_wait32 ) "__wasmtime_table_init", referenced from: _table_init in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_table_init::_$u7b$$u7b$closure$u7d$$u7d$::he5e1a78b56db45b4, ___wasmtime_table_init ) "__wasmtime_table_fill", referenced from: _table_fill in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_table_fill::_$u7b$$u7b$closure$u7d$$u7d$::hf8b7ce102ac3de7a, ___wasmtime_table_fill ) "__wasmtime_memory_copy", referenced from: _memory_copy in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: ___wasmtime_memory_copy, wasmtime_runtime::libcalls::__wasmtime_memory_copy::_$u7b$$u7b$closure$u7d$$u7d$::h89a592f83dabef04 ) "__wasmtime_table_grow", referenced from: _table_grow in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_table_grow::_$u7b$$u7b$closure$u7d$$u7d$::hdaa1831af98473b5, ___wasmtime_table_grow ) "__wasmtime_new_epoch", referenced from: _new_epoch in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_new_epoch::_$u7b$$u7b$closure$u7d$$u7d$::h1a92330ea38f0813, ___wasmtime_new_epoch ) "__wasmtime_memory32_grow_impl", referenced from: _memory32_grow in libwasmtime_runtime-dbb96764d3b18906.rlib(wasmtime_runtime-dbb96764d3b18906.44jpaurkbhy60k2w.rcgu.o) (maybe you meant: wasmtime_runtime::libcalls::__wasmtime_memory32_grow_impl::_$u7b$$u7b$closure$u7d$$u7d$::hd067c3bee9e773e8, ___wasmtime_memory32_grow_impl ) ld: symbol(s) not found for architecture arm64 clang: error: linker command failed with exit code 1 (use -v to see invocation)
If I build without
posix-signals-on-macos
I'd get this:error[E0308]: mismatched types --> crates/runtime/src/traphandlers/macos.rs:388:33 | 388 | state.capture_backtrace(wasm_pc); | ^^^^^^^ expected enum `std::option::Option`, found *-ptr | = note: expected enum `std::option::Option<(usize, usize)>` found raw pointer `*const u8`
pepyakin submitted PR review.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen submitted PR review.
fitzgen created PR review comment:
Hi @uweigand, would you mind suggesting wording for some comments for these functions similar to those in the
x86_64.rs
file just below? I'm not familiar enough to add the docs that Alex requested in review myself.
fitzgen submitted PR review.
fitzgen created PR review comment:
Hi @akirilov-arm, would you mind suggesting wording for some comments for these functions similar to those in the
x86_64.rs
file just below? I'm not familiar enough to add the docs that Alex requested in review myself. Thanks!
fitzgen edited PR review comment.
fitzgen submitted PR review.
fitzgen created PR review comment:
Split everything out into modules, but haven't added non-x64 docs yet (pinged folks for help with that).
uweigand created PR review comment:
get_next_older_pc_from_fp:
The next older PC can be found in register %r14 at function entry, which was saved into slot 14 of the register save area pointed to by 'FP' (the backchain pointer).
get_next_older_fp_from_fp:
The next older 'FP' (backchain pointer) was saved in the slot pointed to by the current 'FP'.
reached_entry_sp:
The 'FP' (backchain pointer) holds the value of the stack pointer at function entry. If this equals the value the stack pointer had when we first entered a Wasm function, we are done.
uweigand submitted PR review.
uweigand submitted PR review.
uweigand created PR review comment:
Not much to say about the stack alignment except that the ABI guarantees 8-byte alignment at all times, and we cannot in general expect anything more.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen submitted PR review.
fitzgen created PR review comment:
Thanks!
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen submitted PR review.
fitzgen created PR review comment:
We generally write failing test cases to
test.wa{t,sm}
for fuzz tests and smoke tests like this so it is easier to debug. They are only written if the test fails, so won't otherwise spew stuff to the file system during normal operation.
fitzgen created PR review comment:
I intentionally chose to have two tests so that if they evolve in different directions in the future we can update each respective test and not accidentally leave one trampoline's offsets untested.
fitzgen submitted PR review.
akirilov-arm created PR review comment:
Technically there is no alignment requirement for the frame pointer - in fact, if my reading of the AAPCS64 is correct (
The location of the frame record within a stack frame is not specified.
), then it could even be completely unaligned (or at the very least aligned on 8 bytes)! However, it is probably still useful to leave this check as is.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
I don't there is any need for a comment here - the assertion says everything that needs to be stated.
akirilov-arm edited PR review comment.
akirilov-arm created PR review comment:
Here is a direct quote from section 6.2.3 of the Procedure Call Standard for the Arm® 64-bit Architecture (AAPCS64):
Conforming code shall construct a linked list of stack-frames. Each frame shall link to the frame of its caller by means of a frame record of two 64-bit values on the stack (independent of the data model). The frame record for the innermost frame (belonging to the most recent routine invocation) shall be pointed to by the frame pointer register (FP). The lowest addressed double-word shall point to the previous frame record and the highest addressed double-word shall contain the value passed in LR on entry to the current function. If code uses the pointer signing extension to sign return addresses, the value in LR must be signed before storing it in the frame record. The end of the frame record chain is indicated by the address zero in the address for the previous frame. The location of the frame record within a stack frame is not specified.
It can be trimmed to:
Each frame shall link to the frame of its caller by means of a frame record of two 64-bit values on the stack [...] The frame record for the innermost frame [...] shall be pointed to by the frame pointer register (FP). The lowest addressed double-word shall point to the previous frame record and the highest addressed double-word shall contain the value passed in LR on entry to the current function.
I think it covers both this function and
get_next_older_fp_from_fp()
.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Maybe add a comment stating that we do not support big-endian AArch64 (also on line 46)?
akirilov-arm edited PR review comment.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
I added comments for each function.
fitzgen submitted PR review.
fitzgen created PR review comment:
Thanks, I saw, just haven't gotten around to adding them yet. Appreciated!
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen requested alexcrichton for a review on PR #4431.
alexcrichton created PR review comment:
This condition feels a bit funny to me so I'm wondering if we can add more strict assertions around this perhaps. The bug with components is that the pc/fp aren't set here on exit back into the host, and I was hoping that an assertion would trip somewhere in this code to indicate such.
I know that we have a -1 sentinel to get encoded for a host->host call but otherwise I think we should always have this set and we can assert as much? (or... something like that)
alexcrichton created PR review comment:
Is this still necessary with the
.iter()
method below?
alexcrichton created PR review comment:
Could this assert that the
fp
has some relation tofirst_wasm_sp
? (e.g. one of thearch::*
-specific functions I think should returntrue
?
alexcrichton created PR review comment:
Personally I feel that this macro is still too complicated for its own good, but I sent you https://github.com/fitzgen/wasmtime/pull/7 which is what I was thinking of how to simplify it.
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Is this still necessary? Ideally this'd stay private to this module to avoid having to reason about the rest of the crate
alexcrichton created PR review comment:
Could this use
VMOpaqueContext::from_vm_host_func_context
?
alexcrichton created PR review comment:
To clarify, is this something you plan to take on or is this something I should plan to look into?
alexcrichton created PR review comment:
This was marked as resolved, but inside the resolved comment is an indication that this isn't the right name?
Locally I get:
$ AR=true CC=true cargo check --target aarch64-apple-darwin -p wasmtime-runtime --features posix-signals-on-macos Checking wasmtime-runtime v0.40.0 (/home/acrichto/code/wasmtime/crates/runtime) error[E0609]: no field `__opaque_fp` on type `__darwin_arm_thread_state64` --> crates/runtime/src/traphandlers/unix.rs:223:40 | 223 | (*cx.uc_mcontext).__ss.__opaque_fp as usize, | ^^^^^^^^^^^ unknown field | = note: available fields are: `__x`, `__fp`, `__lr`, `__sp`, `__pc` ... and 2 others For more information about this error, try `rustc --explain E0609`. error: could not compile `wasmtime-runtime` due to previous error
alexcrichton created PR review comment:
This also of relates to the
if state.old_last_wasm_exit_pc == 0 {
below in thefor state in state.iter()
loop. Mostly I'm just curious if you can add some extra degree of assertions about perhaps some other fields or something like that.Is it possible to assert
pc != 0
here? Or otherwise make this conditional on the entry sp where if the entry sp is -1 then that's a host->hostCallThreadState
which has no frames, but otherwise some wasm should have been executed to yield frames.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen submitted PR review.
fitzgen created PR review comment:
Ah seems something got lost in a rebase.
fitzgen submitted PR review.
fitzgen created PR review comment:
I can look into the cranelift side of things but will probably need to ask you lots of questions about the component trampolines themselves.
fitzgen created PR review comment:
The
backtrace
module has two calls totls::with(...)
. I guess we could dopub(super::backtrace)
or whatever the syntax is (I think I've only ever usedpub(crate)
andpub(super)
before). Or we could make thebacktrace
module a submodule oftraphandlers
.
fitzgen submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
It is used in a debug assertion inside
Backtrace::trace_with_trap_state
. Maybe I will make thebacktrace
module a submodule oftraphandlers
, since it is the only thing that wants access to these otherwise-private fields.
alexcrichton created PR review comment:
Ah ok, no need to do
pub(super::backtrace)
, but your other suggestion of movingbacktrace
as a submodule sounds reasonable to me.
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ok sounds good to me. Given cranelift "intrinsics" for getting the fp/return-pc I think it'll be relatively easy to fold in the necessary support into the component trampolines.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
One small detail -
word
is a loaded term in the Arm architecture, and in this context certainly wrong, since it means 32 bits (the AAPCS64 also provides the same definition in section 5.1); up to you whether you use the architecture-neutral64 bits
or the correctdouble-word
.
akirilov-arm created PR review comment:
Ditto.
cfallin submitted PR review.
cfallin created PR review comment:
FWIW these terms are loaded in x86-land as well: "word" means 16 bits (still to this day, e.g. with a
.word
directive in GNU as assembler), "double word" or "dword" means 32 bits, and "quad word" or "qword" means 64 bits. These names appear at least in assembly but also e.g. in Windows APIs, and differ between architectures as we see here, and so I try to avoid them altogether and just say "64 bits" / "u64" / "8 bytes above FP" / ... as appropriate for context.
akirilov-arm created PR review comment:
Personally I am fine with using the loaded terms in bits of code that are obviously architecture-specific such as this file, but have no preference otherwise, other than using the correct term, of course. Note that we are providing a direct quote from the AAPCS64 that uses both kinds of terms.
akirilov-arm submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
I don't think so? Because the host has no guarantees of preserving frame pointers and the FP is the host's FP in this situation, so I don't think we can assert any sort of relation here.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
akirilov-arm submitted PR review.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen updated PR #4431 from fast-stack-walking
to main
.
fitzgen merged PR #4431.
Last updated: Jan 24 2025 at 00:11 UTC