cfallin opened PR #11930 from cfallin:wasmtime-debug-signal-inject-calls to bytecodealliance:main:
(Stacked on top of #11921)
This repurposes the code from #11826 to "inject calls": when in a signal
handler, we can update the register state to redirect execution upon
signal-handler return to a special hand-written trampoline, and this
trampoline can save all registers and enter the host, just as if a
hostcall had occurred.
As before, this is Linux-only in its current draft. I need to add macOS and Windows support, still. Putting this up to show how a few loose ends in #11921 get used.
cfallin commented on PR #11930:
I'll note for brainstorming purposes that the current problem in front of me is how to rework macOS' Mach ports-based signal handler to work with this. To recap a bit what the requirements on each platform are, and how this "call injection" works:
- All three of our main platforms (Linux, macOS, Windows) give us the ability to catch traps and edit register state before resuming.
- Linux lets us do this in a "signal context" where we really shouldn't do much of anything if we can help it -- no allocation, etc. We run on a sigaltstack and we can push more to the guest stack if we want.
- Windows lets us do this in a vectored exception handler, where we run on the guest's stack and cannot push anything to the stack.
- macOS lets us do this from a separate thread reading exceptions from a Mach port, where we can do anything a normal thread can do, except we don't have the guest's TLS because we're a separate thread.
The basic need is to inject enough state into the register context, along with redirecting PC, that a stub can take control, find the Store state, invoke any debug event handler, then restore all context and return to the guest if it's a resumable trap (which this PR doesn't have, but we will have in a few more PRs for breakpoints).
One can see how this is a little tricky. The approach I've taken that is at least Windows and Linux-compatible is to update only registers, not the stack (because Windows); inject args into the registers; save off the original register values and PC to the `VMStoreContext` (which we have via TLS in the signal handler); then in the trampoline, save all regs to the stack, and copy the original values of the injected registers back from the store to the stack save-frame.
macOS inverts most of the "can do" and "can't do" bits: we can push to the stack (unlike Windows) but we can't read TLS, so we have nowhere to save state that we clobber when redirecting other than to push it to the stack. So probably the best we can do is to push the original register values to the guest stack ourselves from the exception handler thread.
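For readers following along, the register-only scheme described above for Linux and Windows can be sketched roughly as follows. This is a minimal model, not the code in this PR: `RegContext`, `SavedInjectionState`, `inject_call`, and `restore_frame` are hypothetical names, the register set is cut down to a PC and two argument registers, and a real handler would edit the platform's signal or exception context rather than a plain struct.

```rust
/// Hypothetical, heavily simplified register context. A real handler would
/// edit the OS-provided context (e.g. the signal `ucontext`) instead.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct RegContext {
    pub pc: u64,
    pub arg0: u64,
    pub arg1: u64,
}

/// Hypothetical stand-in for the slot in the store context (reached via TLS)
/// that holds the values clobbered by the injection.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct SavedInjectionState {
    pub pc: u64,
    pub arg0: u64,
    pub arg1: u64,
}

/// Runs in the trap handler. It only edits registers and never touches the
/// guest stack, which is the constraint the Windows handler imposes.
pub fn inject_call(
    ctx: &mut RegContext,
    saved: &mut SavedInjectionState,
    stub_pc: u64,
    arg0: u64,
    arg1: u64,
) {
    // Save off everything we are about to clobber.
    saved.pc = ctx.pc;
    saved.arg0 = ctx.arg0;
    saved.arg1 = ctx.arg1;
    // Redirect execution to the stub and pass it its arguments.
    ctx.pc = stub_pc;
    ctx.arg0 = arg0;
    ctx.arg1 = arg1;
}

/// Runs in the stub after it has dumped all live registers to its save
/// frame: overwrite the injected values in that frame with the originals so
/// that restoring the frame resumes the guest exactly where it trapped.
pub fn restore_frame(frame: &mut RegContext, saved: &SavedInjectionState) {
    frame.pc = saved.pc;
    frame.arg0 = saved.arg0;
    frame.arg1 = saved.arg1;
}
```

In this model the macOS variant would differ only in where the saved values live: the exception-handler thread would push them to the guest stack instead of writing them to a TLS-reachable slot.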
Of course this means that we need a slightly different stub for macOS (for x86-64 and aarch64 both); and we'll need a slightly different stub for Windows/x86-64 too because of fastcall when we call the host code.
One more thing about the riscv64 stub: it saves all of the V-extension state, because vector registers are separate from float registers, but unlike our other three architectures, we don't unconditionally assume that vector registers are present. So technically to run with V disabled with debugging enabled, if we care about that, we need an alternate riscv64 stub too that elides that bit. Note that we need to care because we have to save everything, not just the ABI callee-saves, because we're "interrupting" with no regalloc cooperation.
All of this to say: I am starting to think that the efficiency advantage of "trap-based implicit hostcalls", with all that entails (breakpoints that are just break instructions we can patch in), may not be worth the complexity and maintenance burden. The alternative is to go with hostcall-based-traps universally. (We still do need the wonky raw `*mut dyn VMStore` for the Pulley case, because Pulley does seem to unconditionally rely on interpreter traps on at least the divide instruction.)
Partly that would make me sad, but on the other hand, it would make me quite happy too: it would mean that we are one PR away from breakpoints if we go with the bitmask scheme, or two if we still patch in a call (self-modifying code but not trapping).
I'm happy to go either way, and these stubs were quite fun to write, but with my "not impossible to maintain" hat on, I think I know the better answer...
(cc @alexcrichton and @fitzgen for thoughts)
cfallin commented on PR #11930:
Quick napkin math on efficiency if we abandon call injection on signals:
- Execution efficiency takes about a 1.5-2x hit, mainly on explicit bounds-checking. (This is on top of the ~2.5x for debug instrumentation.) That's tolerable if not great.
- Instead of a two-byte (`ud2` on x86-64) or four-byte (`brk` on aarch64) breakpoint, we can do patchable calls in five bytes (`call` + riprel32 on x86-64) or four bytes (`bl` + PCrel26 on aarch64). The key here is to define a new callconv that is all callee-save, and use our normal trampoline machinery to emit a trampoline for this. The call at the CLIF level would take `vmctx`. We'd use a new opcode `patchable_call` and the only difference from a normal call opcode would be that emission would place the byte range of the instruction and the instruction bytes themselves in metadata, and an equivalent length nop in code. This should have fairly small perf impact (fetch bandwidth but nothing more for nops; and vmctx will already likely be in the first-arg reg so no additional moves).
- For single-stepping, rather than the trap-on-null-load trick to enable all breakpoints that I described in yesterday's Cranelift weekly, I think I would go with the "enable all in func on entry to func" approach; and func entry/exit hostcalls themselves can be guarded by a flag to minimize that overhead.
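To make the patchable-call idea above concrete, here is a rough sketch of the x86-64 case: the code holds a five-byte nop, the metadata records the site offset plus the real `call` bytes, and enabling a breakpoint copies those bytes in. This is an illustration under assumptions, not the proposed implementation: `PatchSite`, `encode_call`, `set_enabled`, and `set_all_enabled` are hypothetical names, and a real version would also have to deal with code-memory permissions and, on aarch64, instruction-cache maintenance.

```rust
/// Five-byte x86-64 nop (`nopl 0x0(%rax,%rax,1)`), the same length as
/// `call rel32`.
pub const NOP5: [u8; 5] = [0x0f, 0x1f, 0x44, 0x00, 0x00];

/// Hypothetical metadata record for one patchable call site: where it is in
/// the code, and the call bytes that the emitter left out.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PatchSite {
    pub offset: usize,
    pub call_bytes: [u8; 5],
}

/// Encode `call rel32` (opcode 0xE8). The displacement is relative to the
/// end of the five-byte instruction. Returns `None` if it does not fit.
pub fn encode_call(site_offset: usize, target_offset: usize) -> Option<[u8; 5]> {
    let next = site_offset as i64 + 5;
    let rel = target_offset as i64 - next;
    if rel < i32::MIN as i64 || rel > i32::MAX as i64 {
        return None;
    }
    let d = (rel as i32).to_le_bytes();
    Some([0xe8, d[0], d[1], d[2], d[3]])
}

/// Patch one site: the call when enabled, the nop when disabled.
pub fn set_enabled(code: &mut [u8], site: &PatchSite, enabled: bool) {
    let bytes = if enabled { site.call_bytes } else { NOP5 };
    code[site.offset..site.offset + 5].copy_from_slice(&bytes);
}

/// "Enable all in func": flip every site belonging to one function at once,
/// as the single-stepping approach above would do on function entry.
pub fn set_all_enabled(code: &mut [u8], sites: &[PatchSite], enabled: bool) {
    for site in sites {
        set_enabled(code, site, enabled);
    }
}
```

Because the nop and the call have the same length, toggling a site never shifts any surrounding code, so offsets recorded elsewhere in the metadata stay valid.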
The upshot of all that is that it's much more portable and easier to reason about, and the latter at least is in short supply otherwise with everything else we're adding for debugging. One could see this as "hostcalls everywhere" as in debug RFC v1, except with SMC to avoid overhead until patched in.
github-actions[bot] commented on PR #11930:
Subscribe to Label Action
cc @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "pulley", "wasmtime:api"
Thus the following users have been cc'd because of the following labels:
- fitzgen: pulley
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin commented on PR #11930:
Closing as this is pushed to "post-MVP debugging" due to all the above complexities; will keep the branch around for mining for the good bits later as needed.
cfallin closed without merge PR #11930.
Last updated: Dec 06 2025 at 06:05 UTC