cfallin opened PR #44 from cfallin:debugging-v2 to bytecodealliance:main:
This RFC describes updates to our plan-of-record for debugging in Wasmtime, as I plan to start work soon. It includes several recent realizations I've had that I believe could lead to a better overall set of tradeoffs and implementation choices.
Thanks to @fitzgen and @arjunr2 for discussions on some parts of this. Any bad ideas contained herein remain my own fault only :-)
programmerjake submitted PR review.
programmerjake created PR review comment:
that we memcpy out, and memcpy back in, and all continues as we had
cfallin updated PR #44.
cfallin created PR review comment:
Thanks!
cfallin submitted PR review.
tschneidereit created PR review comment:
For byte-granularity, we'd still have to identify whether the correct byte is set, not one of the others, right? And once we have to do that, can we maybe reduce the size of the shadow memory? An obvious step would be to turn it into a bitmap, reducing its size to 1/8. But once we have to do more than just a nonzero check, maybe we can go further? E.g. we could have the shadow memory (which at this point I guess is a map) be more coarse-grained and only contain, say, 1-bit-per-u64. We'd then check that bit and do a 2nd-level check, e.g. based on a map lookup, iff the bit is set. This could either happen in a hostcall, or even in a guest function to reduce overhead.
I guess there could be pathological cases where a tight loop sets bytes within the same unit of granularity very frequently, while almost never setting the watched byte. Even then, this scheme doesn't seem prohibitively expensive. And I would think that by far the more common case will be that the absolutely overwhelming number of writes will be to locations not covered by one of these bits.
This would probably still be the case if we go to even more coarse-grained regions, like 256 or 512, but it seems like this could easily be made configurable.
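For concreteness, here is a minimal sketch (my own illustration, with invented names, not an actual Wasmtime API) of the two-level scheme described above: a coarse first-level bitmap that guards a precise second-level lookup, so the common case of a store to unwatched memory costs only one bit test.

```rust
use std::collections::HashMap;

/// Hypothetical two-level watchpoint structure: a coarse bitmap (one bit per
/// GRANULE-byte region of linear memory) guards a slower, exact lookup.
const GRANULE: usize = 64; // bytes of linear memory covered by one coarse bit

struct Watchpoints {
    /// First level: one bit per GRANULE bytes of linear memory, packed into u64s.
    coarse: Vec<u64>,
    /// Second level: exact watched (start, len) ranges, keyed by coarse-bit index.
    exact: HashMap<usize, Vec<(usize, usize)>>,
}

impl Watchpoints {
    /// Fast path: the check guarding every store. Is the coarse bit covering
    /// this address set at all?
    #[inline]
    fn maybe_watched(&self, addr: usize) -> bool {
        let bit = addr / GRANULE;
        (self.coarse[bit / 64] >> (bit % 64)) & 1 != 0
    }

    /// Slow path, reached only when the coarse bit is set: did this store
    /// actually overlap a watched range? (For simplicity the sketch ignores
    /// stores that straddle a GRANULE boundary.)
    #[cold]
    fn hits_watchpoint(&self, addr: usize, len: usize) -> bool {
        let bit = addr / GRANULE;
        self.exact.get(&bit).map_or(false, |ranges| {
            ranges
                .iter()
                .any(|&(start, rlen)| addr < start + rlen && start < addr + len)
        })
    }
}
```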
tschneidereit created PR review comment:
To double-check: this would also enable suspend-and-resume when using the pooling allocator, as long as an instance is resumed into the same set of memory & stack slots, right? If so, that seems like a very nice additional property.
tschneidereit submitted PR review:
I left a few inline comments and questions.
Besides those, an overall question* I have is: how does the complexity and performance of this approach to tracking values compare with the effort required to make what we currently have more robust? We've had @singleaccretion, @philipc, and others contribute excellent improvements to our current support pretty recently, and @philipc in particular mentioned a month ago that he's working on improvements.
I guess the key argument for doing something different is that, as you say, it's hard to make all optimizations robust in terms of retaining good DWARF info. But it seems like that doesn't need to be a goal? Instead, we could do what one usually does, and require compiling without (all) optimizations for debugging.
I guess to some degree my question boils down to "should we really do something entirely different from what most everyone else does here, throwing away the partially-working thing we already have in the process?"
*: This question would apply to the original RFC as well, I suppose. I didn't have the bandwidth to really think it through when that was discussed, however.
programmerjake submitted PR review.
programmerjake created PR review comment:
we could have something like the following trie compiled to cranelift IR:
```rust
pub type LeafElement = u8;
pub const LEAF_BITS: usize = 512;
pub const LEAF_SIZE: usize = LEAF_BITS / LeafElement::BITS as usize;
pub type Leaf = [LeafElement; LEAF_SIZE];

#[inline]
pub fn hit_breakpoint(pc: usize, tree: &[Option<&mut Leaf>]) -> bool {
    let Some(leaf) = &tree[pc / LEAF_BITS] else {
        return false;
    };
    hit_breakpoint_slow(pc, leaf)
}

#[cold]
fn hit_breakpoint_slow(pc: usize, leaf: &Leaf) -> bool {
    let bit_index = pc % LEAF_BITS;
    let leaf_element = leaf[bit_index / LeafElement::BITS as usize];
    let leaf_bit = leaf_element & (1 << (bit_index % LeafElement::BITS as usize));
    leaf_bit != 0
}
```
SingleAccretion submitted PR review:
I guess the key argument for doing something different is that, as you say, it's hard to make all optimizations robust in terms of retaining good DWARF info. But it seems like that doesn't need to be a goal?
+1. In the optimizing compilers I have experience with, good debugging implies running unoptimized code. Optimized code debug info is available, but its main use case is core dumps (e. g. from CI), where "something" ("anything") is much better than "nothing".
The value tracking approach in this RFC (at a high level) seems like largely the same thing as unoptimized code (load/store all WASM locals), though more robust since we get correct WASM state by construction instead of it being a byproduct of no optimizations.
However, how does this approach satisfy the core dump use case (where we would be handling code without any of the WASM stores/loads)? This implies keeping the current value label code.
the current value label code
Has some issues even when not optimized (due to the SSA conversion and regalloc). They could be solved by construction by lowering all
`local.get` | `local.set` to loads/stores from a stack slot - just like this RFC proposes, but as a general `opt-level=0` strategy. Historically this may have been untenable due to some LLVM deficiencies resulting in WASM debug code having a huge number of locals (thousands for even relatively simple methods), but that has been partially fixed recently.
SingleAccretion created PR review comment:
do we want to support setting locals to new values when stepping through a program?
Is that a choice that can be made? The ability to change WASM local values is required for changing values in the debugger (including implicitly as part of expression evaluation). This is something the current DWARF translation supports, it would be a regression to disallow it.
SingleAccretion created PR review comment:
after each step/breakpoint
Is that an assumption that the WASM is exposed only on stepping boundaries? I don't see why that would be true (considering e. g. trapping instructions).
tschneidereit commented on PR #44:
Another aspect I forgot to mention: making the current approach robust (on non/less optimized codegen) would have the advantage of also helping cg_clif and other non-wasm embedders, whereas it seems like if we go down the route described here, we would all but certainly end up removing what we have right now, because maintaining both approaches would be too costly.
The approach this PR seems to suggest would mean that just attaching gdb to the wasmtime host process would no longer work (as e.g. gdb will attempt to set breakpoints at specific native instructions based on the source info, rather than ask Wasmtime to set them on a wasm instruction); rather, we would have to add a gdbstub to Wasmtime, which then makes it impossible to step through host code unless we build an entire replacement for gdbserver.
cfallin submitted PR review.
cfallin created PR review comment:
For byte-granularity, we'd still have to identify whether the correct byte is set, not one of the others, right?
Yes, that's right, the slow path in the hostcall would check "did the byte actually change" before transferring control to the debugger. That's actually already the case even if the granularity of the watchpoint matches or exceeds that of the store in question, because it's always possible that the store "changed" the value to the already-existing value; easy enough to check.
(two-level scheme with coarse first level, possibly a bitmap)
Yep, that's a good idea!
cfallin submitted PR review.
cfallin created PR review comment:
That is kind of technically true, said with a slight wince, but practically not, I think -- the issue is that in general, any pointer reachable from vmctx can escape into the compiled code (e.g., the VM limits struct, or an imported table descriptor, or ...); so the "within the lifetime of the originating Store" is more or less necessary to imply that we have the same reachable universe at the same addresses. So the "technically true" bit is that if one can ensure that all the other data structures are also allocated at the same addresses, not just the memory and stack slots, then it works; but that implies control over the system allocator which we generally don't have. Alternately, if one were to adopt a restriction that everything reachable from `vmctx` is also allocated into slots in the pooling allocator, and then ensured that exactly the same slots in all index spaces were used...
cfallin edited PR review comment.
So I see I've opened a slight Pandora's box by bringing more of the original RFC's approach into question than I had hoped :-) Thanks all for the comments so far -- let me try to give my perspective on "why not just improve existing gdb support".
Starting with discussions a few years ago we had the concept of "debugging levels" -- what does the debugger see, and what is underneath the debugger -- and we generally seemed to agree that attaching gdb to the Wasmtime process was suboptimal for most developer use-cases because it mixes host code, Wasmtime runtime implementation code, and guest code. There are times when one does want that, namely when developing host code or runtime code; so the existing gdb/DWARF support should not go away. But a Wasm VM-level view, and then a guest language view on top of that, are (I think) the much more commonly-desired kinds of debugging.
That thinking led to a sort of SpiderMonkey-style design instead: the VM implementation is underneath the debugging abstract view, and implements a host API and debug interface. I'll note that the original RFC covers this and I'm not trying to change it here -- it proposes using DAP (not gdbstub) to expose this interface externally, and a notion of debug adapters to translate from the Wasm VM semantic level to source-language semantics.
I think that overall structure is nice from an "abstractions presented to the Wasm developer" point of view, but really importantly also, it provides a clean base for time-travel debugging. I simply don't know how to build time-travel if we're mixing host code into the "debuggee" abstraction.
So to the point "unless we build an entire replacement for gdbserver." -- yes, that's the plan in the first RFC (using DAP instead of the gdbstub protocol), unchanged here, and I think it's both necessary and not too bad!
So then there's the more scoped question of: how do we get state out of the code, and can we leverage Cranelift's support for debug value labels and improve it? (Note that this is beneath the level of DWARF and the DWARF translator; it's purely about communicating "Wasm local N" down to a register level.) To that, I'd say that I've tried over the years to make this robust but it's definitely an uphill battle -- I would agree with SingleAccretion's statement that the mechanism "Has some issues even when not optimized (due to the SSA conversion and regalloc). They could be solved by construction by lowering all local.get | local.set to loads/stores from a stack slot - just like this RFC proposes".

As one particular example of the kind of issue that arises, dead code is removed implicitly by our lowering architecture, even if we have no optimizations enabled, and regalloc will not store dead values either; so we will have gaps in the state view ("optimized away", etc). As another example, attempting to recover values out of registers implies a fairly substantial change to our trampolines and stack-walking code to save register state on Wasm exit, and restore register state with unwind info as we walk up the stack to view frames.

So to the extent that we fix this by construction, we're already at this RFC's proposed approach; the only questions that remain are around the "rest of the owl" (attach gdb to host or provide a debug API that sees the Wasm abstraction).
Then to
I guess the key argument for doing something different is that, as you say, it's hard to make all optimizations robust in terms of retaining good DWARF info. But it seems like that doesn't need to be a goal? Instead, we could do what one usually does, and require compiling without (all) optimizations for debugging.
I guess to some degree my question boils down to "should we really do something entirely different from what most everyone else does here, throwing away the partially-working thing we already have in the process?"
My thoughts would basically be:
- Compiling without any opts is still not totally robust at recovering 100% of state, for structural reasons, and involves substantial complexity to do so still;
- One way to see our general approach is that we're asking "What Would SpiderMonkey Do?" and building that, up to the debugger API, except that rather than base it on the baseline tier, we base it on Cranelift with explicit value stores to memory (the one bit I'm hoping to change with this RFC!);
- I don't want us to throw away the existing debug value-label support, as it doesn't cost us too much in maintenance right now, and is useful for debugging when one wants to see host code or runtime code too.
cfallin submitted PR review.
cfallin created PR review comment:
Yes, at boundaries between Wasm bytecode operators; the store-state-to-stackslot approach here should work fine in the case that, say, a Wasm load/store traps, because the state will be as-of the program point just before that opcode, I think.
cfallin created PR review comment:
Well, to the extent that it's a performance/functionality tradeoff, we could certainly expose it to the user, I think. Time-travel debugging also has to forbid it, for determinism, so we need a "read-only mode" in any case.
cfallin submitted PR review.
SingleAccretion submitted PR review.
SingleAccretion created PR review comment:
Yes, at boundaries between Wasm bytecode operators.
Ah, I see. Perhaps that bears more explicit mentioning as I read that on first pass as "we will parse
`.debug_line` and only consider `is_stmt` entries as points where the state is exposed [optimizing it out elsewhere]".
cfallin created PR review comment:
Also regarding the current level of support with gdb-attach-to-host-process debugging of Wasm: gdb will indeed know how to update values when we tell it about their locations, but regalloc actually doesn't guarantee that a value lives only in a single place (allowing "overlapping live-ranges" is important for some edge cases), and lowering (even without opts) can also merge across instructions, so e.g. if we set a local to a constant and add it to another value, we may get an `add reg, imm` instruction at the machine-code level, and debugger updates won't change that value. So to the extent that it is "supported" today, it's broken, IMHO.
cfallin submitted PR review.
SingleAccretion commented on PR #44:
By the way, a simple way to state the difference between this RFC and what @tschneidereit hinted at above (as far as value tracking is concerned) is the choice of "where do we spill the debug state (WASM locals)".
1) On the [WASM | other input]-to-CLIF boundary (make it a frontend concern).
2) On the CLIF-to-native-code boundary. I suspect that for full fidelity this would require adding some IR features à la LLVM intrinsics (you can see a glimpse of this today with how the WASM translator adds a phantom use for `vmctx` so that it's always live). (I actually don't mind either way, since I am only interested in the WASM debugging support part.)
Notably, LLVM implements it both ways (allocas + dbg.value). After this RFC, Cranelift will be in a similar state, though perhaps less 'general' (the special debug spill slot + value labels).
tschneidereit commented on PR #44:
Thank you @cfallin for the detailed answers, very much appreciated! And my apologies for re-opening discussions that really were concluded when the previous RFC was accepted—I want to explicitly acknowledge that that's on me for not having dug in at the time.
To be clear: I never meant to question whether we should have our own debug server: we obviously want to have that. Besides not wanting to incidentally have to debug wasmtime while debugging guest code, it's simply not always possible to attach a native debugger to the process. I was hoping we'd be able to do so based on the same infrastructure for deriving source-level locations/data from generated code, instead of having two separate mechanisms.
The complexities you're describing around retaining good debug info even for non-optimizing code make sense to me. I keep being concerned about maintaining support for both approaches in the long run nevertheless. But maybe core dumps are a persistent motivation to keep both approaches supported in the long-term, anyway?
If that's not the case because there's some other way to support those for Wasm, my default expectation would be for the DWARF support to eventually fall by the wayside because for Wasm the use case won't be strong enough. If you're very confident that that won't be the case because you truly can't see a realistic scenario in which the maintenance burden would become significant enough, then I guess that concern wouldn't be well-founded.
bjorn3 commented on PR #44:
using DAP instead of the gdbstub protocol
FWIW I still think using DAP directly instead of going through gdb or lldb using a gdbstub will lead to a worse experience once you go beyond basic debugger functionality. It forces Wasmtime to do all DWARF handling, with special cases for every possible guest language you may want to debug (because DWARF is not expressive enough to represent every debugging-relevant detail of all programming languages), and you are unable to reuse the gdb or lldb pretty-printer python scripts for the target language (so e.g. inspecting `Vec`, `HashMap`, and `BTreeMap` in Rust will require special cases, and any custom pretty printers a crate may ship with don't work either). You'd have to reimplement more advanced things like temporary breakpoints (tbreak), watchpoints, conditional breakpoints, dynamic printf (dprintf), and implement your own expression parser (or, to be precise, one for each guest language) for the print command. Presumably you won't pull in a Python dependency as a plugin interface either. Additionally there is no CLI interface for DAP, so anyone who runs their debugger in the CLI (like me) can't use the debugger integration.

But I guess I won't change your mind.
Besides not wanting to incidentally have to debug wasmtime while debugging guest code, it's simply not always possible to attach a native debugger to the process.
Is that because of the ptrace syscall being blocked? Or is that because blocking other wasm instances would be bad? If the latter, provided that you get the instance to debug running on a separate thread, you should be able to use the gdb non-stop mode to avoid suspending any other threads while you are debugging the wasm instance. I never used it though.
bjorn3 edited a comment on PR #44:
Besides not wanting to incidentally have to debug wasmtime while debugging guest code, it's simply not always possible to attach a native debugger to the process.
Is that because of the ptrace syscall being blocked? Or is that because blocking execution of other wasm instances would be bad? If the latter, provided that you get the instance to debug running on a separate thread, you should be able to use the gdb non-stop mode to avoid suspending any other threads while you are debugging the wasm instance. I never used it though.
cfallin commented on PR #44:
FWIW I still think using DAP directly instead of going through gdb or lldb using a gdbstub will lead to a worse experience once you go beyond basic debugger functionality. [ ... ] But I guess I won't change your mind.
This is actually the part I'm most dubious and flexible about from the original RFC, after the Winch->Cranelift change -- I plan to build a host API first then the DAP server on top of it, and so there'd be nothing stopping a gdbstub server implementation existing there either, technically. As I recall, there were concerns about the GDB protocol itself not supporting some of the Wasm VM abstractions needed, but that may have changed recently, and I didn't feel strongly about it then (I didn't author the original RFC as I had just come back from leave but I did sign off on it, for full transparency). I chose not to challenge this part of the original RFC to keep scope limited here...
But maybe core dumps are a persistent motivation to keep both approaches supported in the long-term, anyway?
Core dumps are definitely the most interesting open question (I should have addressed this in the present RFC too; apologies). I think the most straightforward plan is still to use the value-label mechanism to recover values on a best-effort basis; or if the user has turned on debug support, then recover values from the mechanism described in this RFC. In essence, the new situation can be seen as: take the Winch-or-Cranelift tradeoff that would have existed after the original RFC, but replace "Winch" with "Cranelift with special value stores added".
By the way, a simple way to state the difference between this RFC and what @tschneidereit hinted at above (as far as value tracking is concerned) is the choice of "where do we spill the debug state (WASM locals)".
1. On the [WASM | other input]-to-CLIF boundary (make it a frontend concern).
2. On the CLIF-to-native-code boundary. I suspect that for full fidelity this would require adding some IR features à la LLVM intrinsics (you can see a glimpse of this today with how the WASM translator adds a phantom use for `vmctx` so that it's always live). (I actually don't mind either way, since I am only interested in the WASM debugging support part.)
Notably, LLVM implements it both ways (allocas + dbg.value). After this RFC, Cranelift will be in a similar state, though perhaps less 'general' (the special debug spill slot + value labels).
Yes, I think this is a good way to frame it, thanks! And this then points to our general trend of moving logic into the frontend as a way to manage complexity -- the explicit stores are I think a much more tractable path than trying to get full fidelity via new intrinsics. In the limit, those intrinsics have to be placed on every value at every step point, and at that point the question is just, do we store the values in one place or let them stay where they are (`Any` regalloc constraints) and describe those locations. Then the complexity and potential brittleness of recovering register state with full fidelity, plus the much higher metadata overhead (vs. "locals and operand stack are at this offset" once per function), seems to tip again toward the frontend-based approach.

(That phantom use on `vmctx` was one of my slightly weak attempts to make debugging via host gdb better -- it at least allows Wasm memories to be always reachable via the DWARF expression describing the memory descriptor load, and thus provides a way to always translate shadow-stack source language variable locations, but it's still pretty incomplete!)
cfallin edited a comment on PR #44:
By the way, a simple way to state the difference between this RFC and what @tschneidereit hinted at above (as far as value tracking is concerned) is the choice of "where do we spill the debug state (WASM locals)".
1. On the [WASM | other input]-to-CLIF boundary (make it a frontend concern).
2. On the CLIF-to-native-code boundary. I suspect that for full fidelity this would require adding some IR features à la LLVM intrinsics (you can see a glimpse of this today with how the WASM translator adds a phantom use for `vmctx` so that it's always live). (I actually don't mind either way, since I am only interested in the WASM debugging support part.)
Notably, LLVM implements it both ways (allocas + dbg.value). After this RFC, Cranelift will be in a similar state, though perhaps less 'general' (the special debug spill slot + value labels).
Yes, I think this is a good way to frame it, thanks! And this then points to our general trend of moving logic into the frontend as a way to manage complexity -- the explicit stores are I think a much more tractable path than trying to get full fidelity via new intrinsics. In the limit, those intrinsics have to be placed on every value at every step point, just like the stores, and at that point the question is just, do we store the values in one place or let them stay where they are (`Any` regalloc constraints) and describe those locations. Then the complexity and potential brittleness of recovering register state with full fidelity, plus the much higher metadata overhead (vs. "locals and operand stack are at this offset" once per function), seems to tip again toward the frontend-based approach.

(That phantom use on `vmctx` was one of my slightly weak attempts to make debugging via host gdb better -- it at least allows Wasm memories to be always reachable via the DWARF expression describing the memory descriptor load, and thus provides a way to always translate shadow-stack source language variable locations, but it's still pretty incomplete!)
As far as I understand originally LLVM only supported debuginfo on values stored in memory using
`llvm.dbg.declare`, but it has since gained support for debuginfo on SSA values using `llvm.dbg.value` for better runtime performance. In both cases, from the point this intrinsic is called, the source-level variable gets a new memory address (`dbg.declare`) / value (`dbg.value`) assigned.
fitzgen created PR review comment:
Agreed that this can and should be well-factored away from the translation of each individual opcode and the `FuncEnvironment`. If we couldn't keep opcode translation as-is, I would consider that disqualifying. Luckily, I have no doubts here.

That said, and this is fairly nitpick-y, but I don't think we should build this functionality into `cranelift_frontend::Variable` directly (or even into the `cranelift_frontend` crate at all), which is what I am understanding you are proposing in the last sentence here. I think that `cranelift_frontend::Variable` should keep the single responsibility it has now (translation of variables with multiple assignments into singly-assigned SSA values) and not gain any new responsibilities. Instead, we should hook into the existing `{before,after}_translate_operator` hooks to explicitly spill precise Wasm state, something like this:

```rust
impl FuncEnvironment<'_> {
    pub fn before_translate_operator(
        &mut self,
        op: &Operator,
        _operand_types: Option<&[WasmValType]>,
        builder: &mut FunctionBuilder,
        stack: &FuncTranslationStacks,
    ) -> WasmResult<()> {
        if self.tunables.consume_fuel {
            self.fuel_before_op(op, builder, state.reachable());
        }

        // NEW!!!
        if self.should_preserve_precise_wasm_state() {
            self.spill_precise_wasm_state(builder, stack);
        }

        Ok(())
    }

    fn spill_precise_wasm_state(
        &self,
        builder: &mut FunctionBuilder,
        stack: &FuncTranslationStacks,
    ) {
        self.spill_precise_wasm_locals(builder);
        self.spill_precise_wasm_stack(builder, stack);
    }

    // Spill all Wasm locals into the stack slot.
    //
    // TODO: make this even better by introducing a `Locals` type that
    // keeps track of which locals are dirty? Control flow joins complicate
    // things a little bit, but maybe not too much. Doesn't have to be in
    // the V1 implementation...
    fn spill_precise_wasm_locals(&self, builder: &mut FunctionBuilder) {
        let slot = self.precise_wasm_state_stack_slot();
        for i in 0..=self.max_local_index() {
            let val = builder.use_var(Variable::from_u32(i));
            let offset = todo!();
            builder.ins().stack_store(slot, val, offset);
        }
    }

    // Spill all values from the Wasm operand stack into the stack slot.
    //
    // TODO: make this even better by keeping track of the index below which
    // all the values are the same as last time, and can be skipped.
    fn spill_precise_wasm_stack(
        &self,
        builder: &mut FunctionBuilder,
        stack: &FuncTranslationStacks,
    ) {
        let slot = self.precise_wasm_state_stack_slot();
        for val in stack.operands() {
            let offset = todo!();
            builder.ins().stack_store(slot, val, offset);
        }
    }
}
```

The above sketch nicely layers concerns and avoids introducing new responsibilities on lower-level things (a microcosm of the overall approach!)
fitzgen created PR review comment:
The Firefox DevTools did not model the debugger server and debuggee as a pair of async coroutines, and instead had multiple levels of event loop running directly on top of the debuggee and its thread/stack. This led to constant annoyance in the form of unexpected re-entrancy and concurrency. I think it was ultimately a mistake in the design/architecture of the devtools; not fatal, but regularly frustrating. To some degree, that stuff is fundamental: evaluating an expression for the user might trigger a new breakpoint, which must be handled before returning the result of the original evaluation to the user. But if we model everything as async coroutines, then we are at least being honest that it can happen and are not setting ourselves up for surprises.
This also reminds me of Alex's recent work on making VM internals use `async` wherever we can suspend (e.g. due to an async limiter or async memory growth or what not); and Alex has found and fixed a lot of bugs because we didn't realize some call could be async.

Both of these experiences tell me that explicit async modeling is definitely the right choice here.
fitzgen created PR review comment:
This is a little bit of an aside, and I've stated this elsewhere, but will repeat it here for posterity: I agree that starting with components is the 100% correct choice as far as incrementalism, value-to-effort ratio, and rapidity goes, however extending this overall scheme to core Wasm is actually not very onerous, and we shouldn't talk about it like it is, IMO.
All we need is a way to track the region of memory that has potentially been modified by the host, and record the values of those regions in the trace just before resuming to Wasm. Luckily, there are only a handful of ways for the host to mutate a Wasm memory:
1. `wasmtime::Memory::write`: This is the ideal case because we know exactly the region of memory being mutated (see the sketch after this comment).
2. `wasmtime::Memory::data_mut`: This gives the whole memory out as mutable. We could either mark the full memory region as mutated, and record the full memory in the trace, or we could simply not support recording the effects of this API call and panic in this method during recording. (Or we could simply ignore it and get likely-incorrect traces, or we could add additional user APIs for explicitly marking regions of memory as dirty and make it their responsibility, or we could do the virtual memory stuff. None of these are very attractive options, IMO.)
3. `wasmtime::Memory::data_and_store_mut`: Same as (2).
4. `wasmtime::Memory::data_ptr`: Same concerns as (2).

I personally think it would be a fine tradeoff to say that we only support recording memory side effects via (1) and make the other APIs panic when recording is enabled on the store. (1) is our "narrow waist" and makes tracking dirty memory regions no more difficult than with the equivalent "narrow waist" we are using for components via `wasmtime::runtime::component::func::LowerContext::get`.

And we would need to do similar for any exported tables, but there we only have to instrument `wasmtime::Table::{set,fill,copy}` for dirty-region tracking, and we have easy access to the exact potentially-mutated region for all those methods. No issues with the whole table technically being potentially modified but in practice that being highly unlikely.

Finally, for completeness, we would need to track `NaN` bit patterns, relaxed SIMD results, and table/memory growth success/failure, but all these things actually need to be in the trace for components as well, since their non-determinism inside the component can result in non-determinism at the component boundary as well. Of course, it is fine for a V0 to skip this stuff, or force deterministic relaxed SIMD, for example, but a complete and correct implementation of record/replay (for either components or core Wasm!) must instrument all of them.
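For illustration only, here is a minimal sketch (invented names, not the actual Wasmtime API) of the dirty-region bookkeeping described in item (1) above: the `Memory::write`-style narrow waist records exactly which ranges the host touched, and those ranges are snapshotted into the trace just before resuming Wasm.

```rust
/// Hypothetical dirty-region tracker for host-side memory writes during
/// recording; the real hook points would live inside Wasmtime, this only
/// illustrates the bookkeeping.
#[derive(Default)]
struct DirtyRegions {
    /// (offset, len) ranges of linear memory written by the host.
    regions: Vec<(usize, usize)>,
}

impl DirtyRegions {
    /// Called from the `Memory::write`-equivalent: record exactly the range
    /// the host mutated.
    fn record_write(&mut self, offset: usize, len: usize) {
        self.regions.push((offset, len));
    }

    /// Called just before resuming Wasm: snapshot the bytes of each dirty
    /// region into the trace and clear the set.
    fn flush_into_trace(&mut self, memory: &[u8], trace: &mut Vec<(usize, Vec<u8>)>) {
        for (offset, len) in self.regions.drain(..) {
            trace.push((offset, memory[offset..offset + len].to_vec()));
        }
    }
}
```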
fitzgen submitted PR review:
Thanks for writing this up, Chris!
Perhaps obvious, but I will state it explicitly anyways: I am in favor of this tweak to our debugging plans.
I do have a handful of comments and nitpicks on some of the finer details below, but none of this is anything that should hold up the RFC or really needs to be debated or discussed in depth. Mostly I only mention anything because the RFC already touched on it, and I would have been perfectly fine leaving these details for individual PRs or whatever, but since it was brought up already... I hope I don't come off too nitpicky!
fitzgen created PR review comment:
:100: :100: :100:
And, for example, if we ever (in the far future) want to do tiering and on-stack-replacement, I think we would want to do it in a similar way, where we explicitly construct clif to rematerialize Wasm state on the bail-out path, rather than introduce a new kind of metadata (like SpiderMonkey's notes thing that I forget exactly what it is called) that needs to be interpreted by the runtime.
This approach of creating CLIF that explicitly has our desired semantics lets us keep Cranelift smaller and simpler, reduces the integration surface area between compiled code and the VM, and reuses infrastructure we have to have anyways for existing features/functionality (i.e. correct optimizations and a working compiler pipeline for CLIF) when building new features/functionality (debugging/OSR/etc.). It does shift some new responsibility to the Wasm-to-CLIF frontend for these new features, but in my experience that code is pretty straightforward, easy to extend, and has historically had fewer bugs than, say, Cranelift's midend or the integration between Cranelift and the VM with stack maps and what not. That last point is super relevant because those alternative components are some of the places where the new-feature complexity would otherwise shift to, if not the Wasm-to-CLIF frontend.
Ultimately, I strongly believe that the explicit-CLIF approach detailed in this RFC (and, as noted, is already the direction we have been migrating towards, e.g. with user stack maps) reduces overall system complexity.
fitzgen created PR review comment:
:100:
Fast (optimized) recording without requiring full (deoptimized, slow) state preservation until replay time is a must.
fitzgen created PR review comment:
SpiderMonkey's low-level debugger API was similar in this regard, and I feel it worked well. FWIW, these "exit events" were called "completion values".
fitzgen commented on PR #44:
@bjorn3
FWIW I still think using DAP directly instead of going through gdb or lldb using a gdbstub will lead to a worse experience once you go beyond basic debugger functionality.
@cfallin
This is actually the part I'm most dubious and flexible about from the original RFC, after the Winch->Cranelift change -- I plan to build a host API first then the DAP server on top of it, and so there'd be nothing stopping a gdbstub server implementation existing there either, technically.
To summarize and reiterate previous discussions, here are the main reasons we didn't originally choose the GDB protocol:
It excludes debugging interpreted-in-Wasm programs. Today, we have many users that write Wasm programs in Python or JS that happen to be implemented via compiling the Python/JS interpreter to Wasm and running the user's program inside the interpreter. Users want to debug their Python/JS program, of course, not the interpreter. With debug adapter components, language runtimes have the flexibility to implement their own debug adapter component specific to themselves, enabling source-level debugging of their interpreted programs, and without requiring that the interpreted program have more capabilities (e.g. sockets) when supporting debugging versus when not.
Wasm as a language has a bunch of things that don't fit neatly into the GDB protocol and its idea of an abstract ISA:
- it is a Harvard architecture with a managed stack and structured control flow (for example, IIUC, the GDB protocol wants stack walking logic to be done on the client and implemented via inspecting raw stack memory the same as any other kind of memory access on the server, and that is fundamentally not possible with Wasm's managed stack),
- it has multiple different kinds of "registers" (value-stack operands, locals, and globals) and there are different numbers of those "registers" not just depending on the program but also which frame you are within and what PC you are at within a frame,
- it has managed, unforgeable references whose bit patterns cannot be inspected and have different valid operations depending on the kind of reference,
- GC structs and arrays as core language primitives, whose raw layout, field offsets, alignment, and bits cannot be inspected, and must be accessed only via `struct.get` and similar,
- many different memories (`p` is a pointer, but a pointer into which memory?),
- tables that are kind of like memories but for managed references,
- and probably more things I am forgetting off the top of my head.
None of the above has materially changed since the last RFC.
Things would certainly be easier if we had fewer constraints, didn't want to support debugging interpreted programs, and only cared about debugging the shape of Wasm module that LLVM emits today. In that hypothetical world, the GDB protocol would probably be the right choice. But that isn't the world we live in.
It forces Wasmtime to do all DWARF handling
There is nothing stopping us from compiling parts of
`lldb`, for example, to Wasm and using that as the kernel of a debugging component, such that it handles all DWARF interpretation.
cfallin commented on PR #44:
Re: gdb protocol -- I absolutely agree that we should not abandon DAP and debug adapters generally as the path to allow flexible guest debugging; I was imagining more that, as a concession to allow use of gdb/lldb where desired, and given the subset of full Wasm semantics that can be encoded in the protocol, one could hypothetically also build a gdbstub server on top of the underlying host debugger API. Basically, if that's a real user need, then I think it's "just" a question of prioritization/time allocation or having the right person willing to build such a component, not a matter of technical impossibility.
(And separately, @bjorn3, not to expose too much "hypothetical ideas in my brain" planning, but I've been thinking for a bit that it would be an enjoyable side-project to build a command-line DAP client, given a similar personal alignment toward preferring CLI debugging over IDE-integrated approaches. So, please rest assured that to the degree I am building the tooling I myself want to use, usability from inside my tmux inside an ssh session is a high priority.)
That said -- I'm trying to thread the needle between opening too much Pandora's box by responding to discussion points, vs. securing just the one-bit change (Winch to Cranelift) that I think we need to most practically start work; so further discussion about that topic can be declared out-of-scope if that makes this RFC easier :-)
fitzgen commented on PR #44:
@cfallin
Core dumps are definitely the most interesting open question (I should have addressed this in the present RFC too; apologies). I think the most straightforward plan is still to use the value-label mechanism to recover values on a best-effort basis; or if the user has turned on debug support, then recover values from the mechanism described in this RFC.
+1 to keeping the current best-effort value recovery for core dumps when precise-Wasm-state preservation (as described by this RFC) is not enabled. My hunch is that the overhead of maintaining precise Wasm state would be too high for the core-dump-on-prod-and-debug-locally use case.
But perhaps this is all moot if recording for recording is fast enough such that record-on-prod-and-replay-locally can fully subsume that use case? That said, core dumps with best-effort recovery are zero-overhead today though, and zero is a hard number to match...
fitzgen edited a comment on PR #44:
@cfallin
Core dumps are definitely the most interesting open question (I should have addressed this in the present RFC too; apologies). I think the most straightforward plan is still to use the value-label mechanism to recover values on a best-effort basis; or if the user has turned on debug support, then recover values from the mechanism described in this RFC.
+1 to keeping the current best-effort value recovery for core dumps when precise-Wasm-state preservation (as described by this RFC) is not enabled. My hunch is that the overhead of maintaining precise Wasm state (even only before Wasm instructions that can trap) would be too high for the core-dump-on-prod-and-debug-locally use case.
But perhaps this is all moot if recording for recording is fast enough such that record-on-prod-and-replay-locally can fully subsume that use case? That said, core dumps with best-effort recovery are zero-overhead today though, and zero is a hard number to match...
fitzgen edited a comment on PR #44:
@cfallin
Core dumps are definitely the most interesting open question (I should have addressed this in the present RFC too; apologies). I think the most straightforward plan is still to use the value-label mechanism to recover values on a best-effort basis; or if the user has turned on debug support, then recover values from the mechanism described in this RFC.
+1 to keeping the current best-effort value recovery for core dumps when precise-Wasm-state preservation (as described by this RFC) is not enabled. My hunch is that the overhead of maintaining precise Wasm state (even only before Wasm instructions that can trap) would be too high for the core-dump-on-prod-and-debug-locally use case.
But perhaps this is all moot if recording for record/replay is fast enough such that record-on-prod-and-replay-locally can fully subsume that use case? That said, core dumps with best-effort recovery are zero-overhead today though, and zero is a hard number to match...
(And separately, @bjorn3, not to expose too much "hypothetical ideas in my brain" planning, but I've been thinking for a bit that it would be an enjoyable side-project to build a command-line DAP client, given a similar personal alignment toward preferring CLI debugging over IDE-integrated approaches. So, please rest assured that to the degree I am building the tooling I myself want to use, usability from inside my tmux inside an ssh session is a high priority.)
FWIW, such a thing could be very useful for Wasmtime's integration tests.
cfallin submitted PR review.
cfallin created PR review comment:
Oh, yes, all of this will be implemented in `wasmtime-cranelift`, not at the Cranelift level; so the raw variable-to-SSA translation is unchanged, but we wrap its use to have a place to generate the state updates for locals. Stack pushes on the `Stacks` tracking struct are likewise where I was thinking we could centrally insert the updates.

I think that the optimization of only generating stores for changed state, not all state at all points, is important; I guess the main question is whether we track a dirty-set and flush it with stores, or just do the stores at the right times. All of this is getting deeper into implementation and is probably appropriate for PR discussion when we get there, I imagine!
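As a rough illustration of the dirty-set option mentioned above (hypothetical names, not actual wasmtime-cranelift code): the translator could mark a local dirty on each `local.set`/`local.tee` and, at each step point, emit stores only for the dirty ones.

```rust
/// Hypothetical compile-time tracking of which Wasm locals changed since the
/// last spill point, so only those get a store emitted.
struct DirtyLocals {
    dirty: Vec<bool>,
}

impl DirtyLocals {
    fn new(num_locals: usize) -> Self {
        // Everything is "dirty" at function entry so the first spill is complete.
        DirtyLocals { dirty: vec![true; num_locals] }
    }

    /// Called from the `local.set` / `local.tee` translation path.
    fn mark_dirty(&mut self, local: usize) {
        self.dirty[local] = true;
    }

    /// Called at each step point: returns the locals that need a store and
    /// clears the set. At control-flow joins the caller would have to
    /// conservatively re-mark locals (e.g. take the union of predecessors).
    fn drain_dirty(&mut self) -> Vec<usize> {
        let dirty_locals = self
            .dirty
            .iter()
            .enumerate()
            .filter_map(|(i, &d)| if d { Some(i) } else { None })
            .collect();
        self.dirty.iter_mut().for_each(|d| *d = false);
        dirty_locals
    }
}
```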
cfallin submitted PR review.
cfallin created PR review comment:
That's great real-world validation -- thanks!
cfallin submitted PR review.
cfallin created PR review comment:
Good point -- I wrote this section out quickly but you're right that it's incomplete, and indeed we can support core Wasm tracing as well. If I recall correctly from looking through the wasip1 implementation (or perhaps the wiggle glue it uses?) there was extensive use of `data_mut()`, so I concluded at the time (late 2021, experiments in... various kinds of wasmtime isolation) that we'd need more work to denote the semantics more strictly to track memory movement. That's the origin of the "components solve this" sort of thinking here. Anyway, I'll update this section, thanks!
alexcrichton commented on PR #44:
As the high-level bit of this RFC, using Cranelift instead of Winch, I think this provides a good outline of how to achieve that and I agree that it resolves the outlined motivations (and agree it's a good idea). This to me feels similar to how one gets good debug information from LLVM: you turn off optimizations and spill everything into stack slots.
My thinking, though, is that while it's desirable to use Cranelift to implement debugging we'll want to perhaps change various implementation details instead of having it exactly as-is in this RFC. There's already quite a lot of discussion here and I'm already a bit lost so I'm going to list out some thoughts of mine but please feel free to disregard/ignore anything that's already been discussed here.
DWARF
To me this RFC proposes something similar to what native compilers do for `-g -O0`, which is to turn off all optimizations and generate pretty slow code where everything travels through the stack quite a lot. Given that, I don't think that this is necessarily a major pivot away from what we do today. For example the current wasm-to-native DWARF will still benefit from improvements and in theory its job should get much simpler because we'll have new properties such as:

- Locals/wasm-stack values now live in a stack frame at a defined offset. This means that there's no hunting around for which register they reside in, or not having anything at all because an `i32.const` was folded into instructions.
- Today even if optimizations are all disabled we still have lowering rules that will fire which will combine multiple wasm-level instructions into a machine instruction. This means that the wasm-to-native DWARF translation has to handle the fact that many wasm opcodes can map to a single machine instruction, thus making things like single-stepping pretty difficult. If we had a load/store around every wasm opcode that Cranelift can't optimize away, that would prevent much of this fusion.
Personally I also see DWARF solving a slightly different problem than the wasm debugging that is the goal here. DWARF is all we have today so it's kind of used for "everything", but it feels most appropriate when you want to debug the host and the guest at the same time (e.g. "attach gdb to wasmtime-the-process"). For many users there's no desire to do that and there's no need to debug Wasmtime at all, so I also see DWARF-of-today as a pretty big mismatch from where we actually want to end up.
With all that in mind I'd see this RFC as not changing our original direction of "let's implement wasm-native debugging". That itself may still use DWARF, just wasm-DWARF and not native-DWARF. Today with our current wasm-to-native DWARF we basically have no means of doing this and never will because constructs like wasm locals and the wasm stack just aren't represented in CLIF at all. While we could add tons and tons of metadata, this feels like a more straightforward design where we represent them as actual runtime data structures.
Breakpoints
One concern I have with the design outlined here is how breakpoints are going to work. I feel that we should have some sort of goal where if you enable debugging you only increase memory consumption by N% or something like that. I'm worried that the watchpoint/breakpoint design here would have an N that's significantly higher than might be workable for many embeddings. Given that I think it'd be worthwhile pursuing other alternatives.
For example nop-slides and native breakpoint instructions are dismissed here and in the Winch RFC, I believe, due to the complexity of having multiple instances concurrently using the same underlying module. While I agree that this is a complicated problem we don't want to solve, I would imagine that we could also solve this by having a private mapping used for debugging. For example when you instantiate a module in debugging mode the text section is always a new private mapping rather than using the original `Module`'s mapping. We could additionally make this CoW itself where it wouldn't actually cost resident memory until you set a breakpoint. Then the idea of setting a breakpoint is "make the page read/write, insert an instruction, then make the page read/execute", which I think should generally "just work"? (assuming appropriate locations to insert the breakpoints)

Single-step
Regardless of whether we go the route in this RFC or my thinking above, I'm not actually sure how we would implement single-stepping. My assumption is that debuggers decide to "resume with single step" in an otherwise paused debugging session. I'm assuming that "turn on single step mode" should be relatively cheap as a result. If we go for the bitmap design we'd have to pave the entire bitmap with 1s before we start execution. Then upon resumption we'd have to pave it all back with the original breakpoint set (which would then also need to be stored elsewhere). With the native-breakpoint-instruction idea we'd have to enable every breakpoint simultaneously which would involve paving over the entire text section. Both of these feel not great to me...
AFAIK native debuggers work by enabling special processor "do a single step" modes. Ideally we'd do something like that where we'd just single-step until the next wasm instruction and inspect the pc after all steps. I don't know much of the implementation in this space but I'd assume that such an implementation would be fraught with portability concerns given that we'd be using OS-level APIs to implement all this most likely.
Cranelift and Wasm PCs
One point I was thinking of reading this RFC was that while storing locals and the wasm stack in a native stack slot works to reify those values, there's still the other Cranelift problem of code motion which can make setting breakpoints difficult for example. One example is what I mentioned above about how we can fold multiple wasm instructions into a single machine instruction, even when Cranelift optimizations are disabled. Given this I think it'd be worth gaming out a bit how exactly the mapping between wasm and native pcs is going to work.
For example let's say I want to set a breakpoint on an `i32.add` instruction. In theory we don't actually have any idea where the `iadd` in Cranelift is going to show up, or whether it's folded somewhere else. What we do know is that there's going to be a `store` instruction of the result of the add onto the wasm-stack representation, though. Are the stores of the results going to be the canonical location of where to set a breakpoint? How would this handle other wasm opcodes like `i32.store` which only load values from the stack? There's also the question of control flow instructions such as `br`. With the branch-chomping behavior of `MachBuffer` would we need to forcibly disable all that to guarantee some codegen for all branching instructions?

Basically while I agree that "put things in a stack slot" solves the "what value does this wasm local have" problem, I think there's still the problem of "how are we guaranteed that there's a piece of native code to set a breakpoint on for all wasm instructions". I don't think this is insurmountable, but I don't think it's "for free", as opposed to Winch for example which probably is basically guaranteed to do a bunch of codegen per wasm op.
Only store vs also-load from the native stack
This RFC currently specifies that the native stack will effectively contain a shadow copy of all wasm locals and the wasm operand stack. This seems quite viable to me, but what I might also propose is an extension to this. I think it would be relatively low-cost to instrument wasm translation to implement one of three modes:
- One mode is what we have today, nothing is on the stack. This is not suitable for high-fidelity debugging but would still support opportunistic wasm-to-native DWARF translation to recover some values with gdb-the-wasmtime-process.
- The next tier would be what this RFC specifies, where values are all recorded on the native stack but the true values of something live in registers. This means that at any point in the program we have an accurate view of wasm locals and the wasm operand stack to support high-fidelity debugging. What this wouldn't support though is setting values to something else. Personally I'm not sold on the "after a breakpoint reload all the values" strategy outlined in this RFC as I feel that would create quite a lot of codegen bloat. I still feel that this mode might be useful as a "not the highest overhead mode" perhaps, and we would also have ...
- The highest-overhead mode would both load and store values from the native stack. This means that the source of truth for values exclusively lives on the stack. Setting values from a debugger would then be trivial for example.
All of these modes would in theory be pretty easy to emit from wasm translation by hooking into local manipulation and wasm stack manipulation.
Representation on the native stack
One minor point is that this RFC says it'd use the `ValRaw` representation for values on the native stack for locals and the operand stack. Personally I think we should avoid `ValRaw` given how inefficient it is. For an i32, for example, 75% of the bytes are wasted space. If wasm ever gets a longer vector type this number will only go up as well. Given that, I'd hope we could at least store locals in exact-size stack slots (e.g. one stack slot for all i32 locals, one for all i64 locals, and so on). The operand stack might still need to use `ValRaw` unless we go really hard on compilation metadata tables, however.
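For illustration, a small sketch (hypothetical, invented names) of the per-type packing suggested above: assign each local an offset in a tightly packed debug spill area, so an i32 costs 4 bytes rather than a 16-byte `ValRaw` slot.

```rust
/// Hypothetical Wasm value types, used only to size the packed spill area.
#[derive(Clone, Copy, PartialEq, Eq)]
enum WasmTy { I32, I64, F32, F64, V128 }

fn byte_size(ty: WasmTy) -> u32 {
    match ty {
        WasmTy::I32 | WasmTy::F32 => 4,
        WasmTy::I64 | WasmTy::F64 => 8,
        WasmTy::V128 => 16,
    }
}

/// Assign each local an offset in a single debug spill area, grouping values
/// by size (largest first, so no padding is needed between groups).
/// Returns the per-local offsets and the total area size.
fn layout_locals(local_types: &[WasmTy]) -> (Vec<u32>, u32) {
    let mut offsets = vec![0; local_types.len()];
    let mut next = 0u32;
    for &group_size in &[16u32, 8, 4] {
        for (i, &ty) in local_types.iter().enumerate() {
            if byte_size(ty) == group_size {
                offsets[i] = next;
                next += group_size;
            }
        }
    }
    (offsets, next)
}
```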
cfallin commented on PR #44:
Thanks @alexcrichton for the comments!
With all that in mind I'd see this RFC as not changing our original direction of "let's implement wasm-native debugging". That itself may still use DWARF, just wasm-DWARF and not native-DWARF. Today with our current wasm-to-native DWARF we basically have no means of doing this and never will because constructs like wasm locals and the wasm stack just aren't represented in CLIF at all. While we could add tons and tons of metadata, this feels like a more straightforward design where we represent them as actual runtime data structures.
Yes, exactly -- this RFC is a delta-update to the implementation strategy, and it's not proposing a shift away from the overall idea of leaning into our status as a VM. That makes life easier in a lot of ways -- we don't have to reverse-engineer machine code as a native debugger does; we can change it as needed to produce the state and do the things that we require.
One concern I have with the design outlined here is how breakpoints are going to work. I feel that we should have some sort of goal where if you enable debugging you only increase memory consumption by N% or something like that. I'm worried that the watchpoint/breakpoint design here would have an N that's significantly higher than might be workable for many embeddings. Given that I think it'd be worthwhile pursuing other alternatives.
It's worth noting that the status-quo in the original RFC that reached consensus was: hostcall at every step point, unconditionally (!). In that sense, this RFC's breakpoint strategy can be seen as an incremental optimization on that: guard the call with a bitmask check. This should result in a substantial increase in runtime performance (or rather, "recovery of acceptable performance" -- to the point that I suspect we would have independently realized we needed another approach even without an RFC discussion here), and a small-to-moderate code size increase I suspect. I'm basing that on my sketch of the guard, which is: load, bit-test, branch; with the load amortized over up to 64 step-points. So (after amortization) a little over two instructions per step-point (8 bytes on aarch64, 5 bytes on x86-64 (3 for TEST and 2 for short Jcc)).
That said I'm pragmatic (that's the entire reason for this RFC actually) and if it ends up being significant still, we should definitely play with other options. If it's workable I think it's preferable because of its relative simplicity.
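As a rough Rust equivalent of the guard shape sketched above (illustrative only, not actual generated code): load a 64-bit word of the per-instance breakpoint bitmap, test this step point's bit, and only take the cold hostcall path when it is set; the load itself can be shared by up to 64 consecutive step points.

```rust
/// Rust rendering of the per-step-point guard: `bitmap` is the per-instance
/// breakpoint bitmap and `step_point` this site's index into it.
#[inline]
fn guard_step_point(bitmap: &[u64], step_point: usize, breakpoint_hit: impl FnOnce()) {
    let word = bitmap[step_point / 64]; // load, amortizable across ~64 step points
    if word & (1u64 << (step_point % 64)) != 0 {
        breakpoint_hit(); // cold path: hostcall into the debugger runtime
    }
}
```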
For example nop-slides and native breakpoint instructions are dismissed here and in the Winch RFC I believe due to the complexity of having multiple instances concurrently using the same underlying module. While I agree that this is a complicated problem we don't want to solve, I would imagine that we could also solve this by having a private mapping used for debugging. For example when you instantiate a module in debugging mode the text section is always a new private mapping rather than using the original Module's mapping. We could additionally make this CoW itself where it wouldn't actually cost resident memory until you set a breakpoint. Then the idea of setting a breakpoint is "make the page read/write, insert an instruction, then make the page read/execute" I think should generally "just work"? (assuming appropriate locations to insert the breakpoints)
The private mapping approach is a neat twist and avoids the race conditions otherwise inherent in dynamic patching. That said, this is still a little tricky at the signal-handler level (we actually need to save all register context and have a new resumption trampoline that restores it, not just PC/SP/FP) and generally it depends more on system details (I look forward to getting this to work on Windows, and macOS...) vs the everything-in-our-compiled-logic approach.
Regardless of whether we go the route in this RFC or my thinking above, I'm not actually sure how we would implement single-stepping. [ ... ] pave the entire bitmap with 1s before we start execution.
This one is actually a little simpler, I think: we can know control flow ("successor step-points") from metadata, so if we're currently at step-point P and we want to single-step, we disable the breakpoint at P (or decref it -- maybe the user also had a breakpoint set) and set a breakpoint at P'. The common case is one successor (straight-line code); conditional branches have two; `br_table`s and indirect calls are the trickiest, where the former has a dynamic but hopefully-not-too-large list of bits to set and the latter is probably best handled with a special "every function entry" bit that we can emit a separate check for.

One point I was thinking of while reading this RFC was that while storing locals and the wasm stack in a native stack slot works to reify those values, there's still the other Cranelift problem of code motion, which can make setting breakpoints difficult. One example is what I mentioned above about how we can fold multiple wasm instructions into a single machine instruction, even when Cranelift optimizations are disabled. Given this I think it'd be worth gaming out a bit how exactly the mapping between wasm and native PCs is going to work.
This one is actually pretty simple too: we have a "possible breakpoint" pseudo-instruction that does the check and calls the runtime if breakpoint set. We emit one of these before every Wasm opcode is translated. This instruction is a side-effect; it doesn't get moved, and other side effects (notably, the stores that update Wasm VM state) don't get moved across it, either.
(If we go with dynamic patching, that instruction is instead a few bytes of NOP that get overwritten with an INT3 or BRK or whatever.)
This all follows from the "write the behavior we want in IR, don't reverse-engineer it from machine code" philosophy: there's no worry about compiler optimizations altering any sequencing. We write the IR, and modulo compiler bugs, any possible compiler must execute that IR's side-effects in order. All the optimizations are a separable concern.
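As a rough sketch of that philosophy (all types and method names here are hypothetical stand-ins, not Wasmtime's actual translator or Cranelift's API), the translation loop simply interleaves the opaque step-point instruction with ordinary translation:

```rust
// Hypothetical translator shape: every Wasm opcode is preceded by an opaque
// "possible breakpoint" instruction. Because that instruction and the stores
// that reify Wasm VM state (locals, operand stack) are all side effects,
// later optimization passes may not reorder real work across a step-point.
struct WasmOp;
struct ClifBuilder;

impl ClifBuilder {
    /// Emits the opaque, immovable step-point check (or a patchable NOP if
    /// we went the dynamic-patching route instead).
    fn emit_possible_breakpoint(&mut self, _step_point: usize) {}
    /// Ordinary translation of one Wasm opcode, including the stores that
    /// spill locals/operand-stack values to their known slots.
    fn translate_op(&mut self, _op: &WasmOp) {}
}

fn translate_function(ops: &[WasmOp], builder: &mut ClifBuilder) {
    for (step_point, op) in ops.iter().enumerate() {
        builder.emit_possible_breakpoint(step_point);
        builder.translate_op(op);
    }
}
```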
Only store vs also-load from the native stack [ ... three modes ]
Yep, these three modes correspond to what I described here:
The main design question appears to be: is the state a strict output, i.e., read-only when the program is paused, or is it an output and input, i.e., read-write? In other words, do we want to support setting locals to new values when stepping through a program? [ ... ] Perhaps surprisingly, either option is implementable: if we want to support read-write state, we load new values [ ... ]
I agree we'll want both kinds of debugging instrumentation (as well as "no instrumentation").
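For concreteness, those three levels could be spelled as a per-module compilation option along these lines (naming purely illustrative, not an actual Wasmtime config):

```rust
/// Hypothetical naming for the three instrumentation levels discussed above.
enum DebugInstrumentation {
    /// No extra stores or checks: today's behavior.
    None,
    /// Spill locals and operand-stack values to known slots so a paused
    /// debugger can read them (state is a strict output).
    ReadOnly,
    /// Additionally reload from those slots after each step-point so values
    /// written by the debugger take effect when execution resumes.
    ReadWrite,
}
```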
alexcrichton commented on PR #44:
In lieu of discussing finer details here, one high-level point I'd like to get agreement on is what's being decided and what's not being decided. For example:
- Seems like we want to decide "let's try to use Cranelift for debugging instead of just Winch".
- To me I don't think we should decide "this is how breakpoints are going to work".
To me this is important because it's how I implicitly viewed the previous RFC. I viewed it as deciding "let's use Winch" and everything else about details to me was "we'll try this but it's just an idea at this point and we don't know if this is the perfect approach". Instead of me implicitly viewing these RFCs this way I'd like to get some more formal consensus about this.
I'm imagining a hypothetical situation where someone does work on a particular aspect of this, gets relatively far, and then later it's realized that maybe it wasn't the right approach and we should start from a different point. In this situation I don't want to empower anyone to "wield" this RFC saying that this is what we previously decided so therefore it must be done that way. Rather I think it's better to admit at this time that we as a project are unlikely to be equipped at this point to render judgement on what's the best approach for all the possible questions here.
Others may feel differently about this though! I think that's fine, but I want to make this explicit. Are there technical points in this RFC that are intended to be "final" after we merge this? Are all points beyond "use Cranelift" considered provisional?
programmerjake commented on PR #44:
If we go for the bitmap design we'd have to pave the entire bitmap with 1s before we start execution.
if we go for the 2-level trie design I suggested, we can just have another top-level tree node (leaves can be adjusted to be larger to not waste too much memory on the top-level tree nodes):
```rust
let all_ones_leaf = [LeafElement::MAX; LEAF_SIZE];
let all_ones_tree = vec![&all_ones_leaf; required_size];
// maybe can be optimized further with mmap shenanigans since every page is identical
```

That way all you do to single-step is pass a different tree pointer and run the code, no massive memset or memcpy needed.
In lieu of discussing finer details here, one high-level point I'd like to get agreement on is what's being decided and what's not being decided. For example:
- Seems like we want to decide "let's try to use Cranelift for debugging instead of just Winch".
- To me I don't think we should decide "this is how breakpoints are going to work".

To me this is important because it's how I implicitly viewed the previous RFC. I viewed it as deciding "let's use Winch" and everything else about details to me was "we'll try this but it's just an idea at this point and we don't know if this is the perfect approach". Instead of me implicitly viewing these RFCs this way I'd like to get some more formal consensus about this.
Yep, definitely. I tried to get ahead of this a little bit with this section:
This RFC, concretely, proposes one RFC-level decision that requires a new consensus: to update our intended approach to debugging support in Wasmtime to start from Cranelift-compiled code, with instrumentation to implement the debug state and other mechanisms.
These mechanisms are sketched below, but some details may change in small-to-medium ways as we experiment with a real implementation and work through all of their interactions. The sketches are provided to substantiate the new direction's feasibility; the goal of this RFC is only to build consensus on the one most-significant bit.
So everything beyond the MSB is more or less "here's why it could be practical" (e.g.: a lot of the earlier thinking assumed that anything built on Cranelift would be infeasible, so the sketched mechanisms are meant as evidence that there are viable approaches that would be simple and robust). I guess consensus on this RFC would mean that people agree the evidence suggests that building on Cranelift would be practical.
Also: I think part of the resistance I've felt in trying to get started on debugging is that the original RFC did seem to pin down a number of details, which need another RFC to undo or update; or at least I read it that way, and starting to land a bunch of PRs that say "nah, sorry, decided to do breakpoints a totally different way" would feel somewhat antisocial to me. I have deeper thoughts I could give in some other venue about what the right level of "waterfall" design is, but I strongly agree that I want to leave flexibility for experimentation and pathfinding as we actually get into the nitty-gritty. The main thing I want here is to see that folks are generally on-board with building on top of Cranelift, so that we don't get unhappy surprises when I start sending PRs.
alexcrichton commented on PR #44:
:+1: that all sounds good to me and aligns with my thinking as well
I realize I wrote a lot above but I'd be curious for your thoughts specifically on the "Cranelift and Wasm PCs" section. The only outstanding unknown is a hypothetical route to guaranteeing that, in the face of lowering/`MachBuffer` optimizations, we have a "thing" to place a breakpoint on for all wasm instructions in the original program. Is the idea that some non-optimizable CLIF would be generated? For example an immovable explicit breakpoint check, an immovable "rewrite this nop" instruction, or something like that?

(Also, I'm just asking this to prod at this part of the design a bit more, since the "what value is this" half of debugging seems solvable with Cranelift, and this is the "how to step between wasm PCs" half, which seemed less clear to me at first.)
I realize I wrote a lot above but I'd be curious for your thoughts specifically on the "Cranelift and Wasm PCs" section. The only outstanding unknown is a hypothetical route to guaranteeing that, in the face of lowering/`MachBuffer` optimizations, we have a "thing" to place a breakpoint on for all wasm instructions in the original program. Is the idea that some non-optimizable CLIF would be generated? For example an immovable explicit breakpoint check, an immovable "rewrite this nop" instruction, or something like that?

Yep, that's the part I replied to above (search for `"possible breakpoint" pseudo-instruction`), if I'm not mistaken. The tl;dr is that a step-point/breakpoint is itself a side-effect (we can make it an opaque instruction that is respected by e.g. alias analysis as having arbitrary effects) -- so nothing will move around it.

I realize it's a very subtle mentality shift, but once one realizes that we don't care about the actual executed machine code, only its effects -- and updating the debugger's view of the world and transferring control to the debugger is an effect -- it's super-freeing from any concern about whether the compiler optimizations matter. Cranelift might indeed precompute the entire result of the program and lower it into machine code that is just a series of updates to debugger state and calls to the debugger (in the "read-only" mode at least); and that's fine! We still get what we want, i.e., control at each step with a view of variable values.
alexcrichton commented on PR #44:
Oh sorry I mistakenly put that in my mental bucket of "things to handle and restore state after resuming from a breakpoint" for some reason. I've been context switching a bit too much today...
it's super-freeing from any concern about whether the compiler optimizations matter
This is an excellent point!
programmerjake edited a comment on PR #44:
If we go for the bitmap design we'd have to pave the entire bitmap with 1s before we start execution.
if we go for the 2-level trie design I suggested, we can just have another top-level tree node (leaves can be adjusted to be larger to not waste too much memory on the top-level tree nodes):
```rust
let all_ones_leaf = [LeafElement::MAX; LEAF_SIZE];
let all_ones_tree = vec![&all_ones_leaf; required_size];
// maybe can be optimized further with mmap shenanigans since every page is identical
```

That way all you do to single-step is pass a different tree pointer and run the code, no massive memset or memcpy needed.
this also has the benefit of the inlined cranelift code for the breakpoint check not needing additional instructions for this special case.
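A minimal sketch of the lookup that the inlined check would perform against such a two-level structure (leaf size and names are illustrative assumptions):

```rust
const LEAF_SIZE: usize = 1024; // 64-bit words per leaf; illustrative choice
type Leaf = [u64; LEAF_SIZE];

/// One load picks the leaf out of the top-level table; the rest is the same
/// bit-test-and-branch as the flat-bitmap scheme. Single-stepping swaps in a
/// top-level table whose every entry points at a shared all-ones leaf.
fn step_point_is_armed(top_level: &[&Leaf], step_point: usize) -> bool {
    let leaf = top_level[step_point / (LEAF_SIZE * 64)];
    let word = leaf[(step_point / 64) % LEAF_SIZE];
    (word & (1 << (step_point % 64))) != 0
}
```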
tschneidereit commented on PR #44:
I'm not yet sure if I'll be able to make it to today's wasmtime call, so wanted to follow up here just in case: after letting things percolate for a while, I'm fully on board with the direction here. It's very clear that we'll not want/be able to give up on DWARF support anytime soon, while at the same time it makes sense to me that the approach described here will give us something more robust for Wasm debugging.
Everything else is details, and I agree with Alex and Chris that this kind of RFC is best treated as leaving those open to iterative refinement later on.
cfallin updated PR #44.
I've updated the text here based on feedback and discussions -- thanks all!
One thing I'm explicitly still not addressing is how the "top half" of the debugging stack works: this RFC discusses only the VM implementation strategy, up to the host API that permits setting break/watchpoints, stepping, continuing, observing/updating state, etc.
In the Wasmtime meeting last week, we had a good discussion about this top half, with the conclusion that the original debugging RFC's vision for debug adapter components may also be a bit too ambitious for an MVP, and there will be a separate RFC to update that part of the plan. Properly scoped, this RFC is thus about the question "should we build on Cranelift instead?" (with supporting details as a plausible plan showing feasibility).
Given that discussion has quieted down above, I'd like to make a
Motion to Finalize
Disposition: Merge
More on the RFC process here.
tschneidereit submitted PR review.
alexcrichton submitted PR review.
Since we now have signoffs per
Once any stakeholder from a different group has signed off, the RFC will move into a 10 calendar day final comment period (FCP)
we are moving into the
Final Comment Period
and if no objections or other issues are raised, this RFC should merge on 2025-09-25.
fitzgen submitted PR review.
The final comment period has passed with no objections, so this RFC now merges -- thanks all for the discussion and feedback!
I'll have followup PRs to Wasmtime to start to introduce this new approach to debugging soon.
cfallin merged PR #44.
cfallin edited PR #44:
This RFC describes updates to our plan-of-record for debugging in Wasmtime, as I plan to start work soon. It includes several recent realizations I've had that I believe could lead to a better overall set of tradeoffs and implementation choices.
Thanks to @fitzgen and @arjunr2 for discussions on some parts of this. Any bad ideas contained herein remain my own fault only :-)