cfallin requested alexcrichton for a review on PR #11895.
cfallin opened PR #11895 from cfallin:wasmtime-debug-callbacks to bytecodealliance:main:
This PR adds a notion of "debug events", and a mechanism in Wasmtime to associate a "debug handler" with a store such that the handler is invoked as-if it were an async hostcall on each event. The async handler owns the store while its future exists, so the whole "world" (within the store) is frozen and the handler can examine any state it likes with a
StoreContextMut.Note that this callback-based scheme is a compromise: eventually, we would like to have a native async API that produces a stream of events, as sketched in #11826 and in [this branch]. However, the async approach implemented naively (that is, with manual fiber suspends and with state passed on the store) suffers from unsoundness in the presence of dropped futures. Alex, Nick and I discussed this extensively and agreed that the
Accessormechanism is the right way to allow for a debugger to have "timesliced"/"shared" access to a store (only when polled/when an event is delivered), but we will defer that for now, because it requires additional work (mainly, converting existing async yield points in the runtime to "give up" the store with therun_concurrentmechanism). I'll file a followup issue to track that. The idea is that we can eventually build that when ready, but the API we provide to a debugger component can remain unchanged; only this plumbing and the glue to the debugger component will be reworked.With this scheme based on callbacks, we expect that one should be able to implement a debugger using async channels to communicate with the callback. The idea is that there would be a protocol where the callback sends a debug event to the debugger main loop elsewhere in the executor (e.g., over a Tokio channel or other async channel mechanism), and when the debugger wants to allow execution to continue, it sends a "continue" message back. In the meantime, while the world is paused, the debugger can send messages to the callback to query the
StoreContextMutit has and read out state. This indirection/proxying of Store access is necessary for soundness: again, teleporting the Store out may look like it almost works ("it is like a mutable reborrow on a hostcall") except in the presence of dropped futures with sandwiched Wasm->host->Wasm situations.This PR implements debug events for a few cases that can be caught directly in the runtime, e.g., exceptions and traps raised just before re-entry to Wasm. Other kinds of traps, such as those normally implemented by host signals, require additional work (as in #11826) to implement "hostcall injection" on signal reception; and breakpoints will be built on top of that. The point of this PR is only to get the initial plumbing in place for events.
Thanks to Alex for help on working how how to keep this from requiring
T: Sendthroughout Wasmtime![this branch]: https://github.com/cfallin/wasmtime/tree/wasmtime-debug-async
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested wasmtime-core-reviewers for a review on PR #11895.
cfallin updated PR #11895.
cfallin updated PR #11895.
fitzgen created PR review comment:
I don't think we ever use the term "activation" in our public docs today, just in the internal unwinding and
CallThreadStatedocs. That isn't to say we shouldn't use it here, but we probably want to define what it means in parens or something.
fitzgen created PR review comment:
Nice, IIUC we don't even need out own
unsafeblock whose correctness is justified viaDebugHandler: Sendbecause everything falls out from the existingblock_on, right?
fitzgen created PR review comment:
Nitpick: the docs should have a
## Panicssection and note that this panics when async is not enabled.And should this also be
cfg'd onfeature = "async"? Becausefeature = "debug"does not imply that right now (should it?)
fitzgen submitted PR review:
LGTM! Will hold off on adding to the merge queue until @alexcrichton has a chance to look at this.
fitzgen created PR review comment:
Should we add an
Arc<anyhow::Error>field here (and updateTrapReason::Useraccordingly) so that the debugger can inspect the host error? Not something we have to do now, in this PR, ofc. And if we are concerned about the extra allocation inTrapReaason::User, even when debugging is disabled, we could potentially do aCow-esque thing where it is a regularanyhow::Erroruntil it needs to be shared, at which point it gets put into anArc.
alexcrichton submitted PR review:
Would it make sense to also start something like a mask of events for a debug handler? For example if a
HostcallErrorisn't desired then that could get masked out while other events would still be processed.
alexcrichton created PR review comment:
Could this be
-> impl Futureinstead ofBox? That'd enable implementing this trait withasync fn ...
alexcrichton created PR review comment:
We'll definitely want to handle the result of
block_on. Ifblock_onreturns anErrthen that means that the fiber needs to exit ASAP (e.g. a dropped future, forcible cleanup, etc), meaning that if this is ignored then wasm might just continue on its merry way while the caller expects it to exit ASAP.Basically the
Resulthere should be plumbed to the caller.
alexcrichton created PR review comment:
One possible alternative: use
&selfhere but haveCloneas a supertrait ofDebugHandler. That way we don't requireArchere (e.g. if the implementor is zero-sized) and the implementation internally would clone-out-of-the-store to avoid borrowing the store to be able to pass it to the context today (basically as what already happens here)
alexcrichton created PR review comment:
Could the other states be listed out in an exhaustive match with a comment why they don't have debug events?
alexcrichton created PR review comment:
Two events? Wouldn't
UncaughtExceptionalways imply the formerThrownException?
alexcrichton created PR review comment:
Well, while you're technically true we're still lying. The stack frame above this has
&mut VMStoreon the stack and that's notSendand it's just invisible from rustc.
alexcrichton created PR review comment:
This is a subtle, but load bearing,
whereclause. Could the function have a Rust-level comment (e.g.//, not///) indicating why? Basically explaining how a&mut dyn VMStoreis gonna be on the stack and that is required to beSend, so therefore this must beSend.
alexcrichton created PR review comment:
To add to this, should this also
assert!that thedebug_guestis enabled?
alexcrichton created PR review comment:
Or, alternatively, this doesn't look like it needs to be owned, so this could be
HostcallError(&'a anyhow::Error)maybe?
fitzgen submitted PR review.
fitzgen created PR review comment:
I was assuming that the borrowing probably wouldn't work out, but if it does then yeah let's do that
fitzgen submitted PR review.
fitzgen created PR review comment:
To be sure I understand you correctly: you're effectively saying "push the arc into the
T: DebugHandler, if necessary"? If so, I like that :+1:
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Indeed!
fitzgen submitted PR review.
fitzgen created PR review comment:
Oh also: shouldn't this future (eventually?) result in a
ContinuationCommandor whatever, whereenum ContinuationCommand { // Normal continue. Resume(Vec<Val>), // Exceptional continue. ResumeThrow(Vec<Val>), // Forcibly return from the current frame. ResumeFinish(Vec<Val>), // Terminate execution, raising a trap. Terminate(anyhow::Error), // or `wasmtime::Trap`? }?
Some variants might be invalid to return when handling some events, and we don't need to implement all variants immediately, but this is the place we would want to do this kind of thing, right?
cfallin submitted PR review.
cfallin created PR review comment:
Either that or a mutation API on the store to alter the return-point/state of the current activation; but honestly I do like the functional style here better.
Right now since we don't support mutating the execution at all, only observing, I don't think we need anything here since there is only one option (
Resumewith no alternative values specified) implied by return, no?
cfallin submitted PR review.
cfallin created PR review comment:
I did spend a bit of time in lifetime hell yesterday trying to get
&'a anyhow::Errorto work; it was really not gelling well with the future's lifetime so I gave up. I can try a bit more but I do like theArcalternative as well from the external perspective at least.
fitzgen submitted PR review.
fitzgen created PR review comment:
That sounds right to me, and also I guess
ContinuationCommand::Resumeshouldn't have a payload, only the other variants should.I also like the "functional style" better than some roundabout mechanism to stash data on the side temporarily to pass things from a callee to a caller.
cfallin updated PR #11895.
cfallin submitted PR review.
cfallin created PR review comment:
Updated in 08f1ba4a0d, along with a mention of activations in the doc-comment for
Store::debug_framesas well.
cfallin updated PR #11895.
cfallin submitted PR review.
cfallin created PR review comment:
Addressed in 08f1ba4a0dffdb1f4686d88c3be28b8f625202e6 and 28c63d58d7ce65f1b6871954735a11a291e1ab45;
asyncis now a feature dependency ofdebug, assert was added, and added panic conditions for the two required settings.
cfallin submitted PR review.
cfallin created PR review comment:
Right; it's at least not a new kind of unsafety relative to the other places that we have fiber suspends from hostcalls, AFAIK.
cfallin submitted PR review.
cfallin created PR review comment:
Unfortunately no, because then the trait is not dyn-compatible -- or at least I don't know how to work around this:
error[E0038]: the trait `debug::DebugHandler` is not dyn compatible --> crates/wasmtime/src/runtime/store.rs:275:42 | 275 | pub(crate) debug_handler: Option<Arc<dyn DebugHandler<Data = T>>>, | ^^^^^^^^^^^^^^^^^^^^^^^^^^ `debug::DebugHandler` is not dyn compatible | note: for a trait to be dyn compatible it needs to allow building a vtable for more information, visit <https://doc.rust-lang.org/reference/items/traits.html#dyn-compatibility> --> crates/wasmtime/src/runtime/debug.rs:512:10 | 502 | pub trait DebugHandler: Send + Sync + 'static { | ------------ this trait is not dyn compatible... ... 512 | ) -> impl Future<Output = ()> + Send + 'a; | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...because method `handle` references an `impl Trait` type in its return type = help: consider moving `handle` to another trait
cfallin submitted PR review.
cfallin created PR review comment:
Unfortunately this seems to run afoul of dyn compatibility as well... happy to see it refactored in followup if there is some clever trait-constraint trickery that can work around this. I see
dyn_cloneand similar things but it seems that may not be worth it for a single use-case like this...
cfallin created PR review comment:
My thought process that led here was: I initially wanted events on caught and uncaught exceptions; but we don't yet have the mechanisms to give a view of state at the catch-point, only the throw-point, so the former became "thrown exception"; but then the exception is thrown whether it's caught in Wasm or uncaught.
I suppose I could rename
ThrownExceptiontoThrownExceptionToWasmor something similar, but then an observer who wants to see all exception throws needs to watch two different events.I do agree it's a little awkward to see two distinct hooks at one static point; maybe we just document what
ThrownExceptionmeans and invoke (only) that here...
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Cool -- I'll build this out once we have the other return options.
cfallin submitted PR review.
cfallin created PR review comment:
Concretely, we're inside
CallThreadState::unwindwhen we invoke this method; I guess we transmute anErrinto a host unwind then? It's a little awkward as it overrides an existing unwind but I guess we can make that work.
cfallin updated PR #11895.
cfallin submitted PR review.
cfallin created PR review comment:
OK, handled in 7e9d38686a.
cfallin submitted PR review.
cfallin created PR review comment:
Good idea -- handled in 7e9d38686a.
cfallin submitted PR review.
cfallin created PR review comment:
OK, renamed to
CaughtExceptionThrownandUncaughtExceptionThrownin 7e9d38686a.
cfallin updated PR #11895.
cfallin submitted PR review.
cfallin created PR review comment:
Yep definitely -- added in 50bb08e40e.
cfallin updated PR #11895.
cfallin commented on PR #11895:
Would it make sense to also start something like a mask of events for a debug handler? For example if a HostcallError isn't desired then that could get masked out while other events would still be processed.
Maybe eventually, yeah -- that's a good idea. I might defer that to the point that we're building out the top half to see how we want to use it.
cfallin submitted PR review.
cfallin created PR review comment:
In particular it is always some variant of
error: lifetime may not live long enough --> crates/wasmtime/src/runtime/store.rs:2726:52 | 2723 | fn block_on_debug_handler(&mut self, event: crate::DebugEvent<'_>) -> anyhow::Result<()> { | ----- has type `debug::DebugEvent<'1>` ... 2726 | StoreContextMut(self).block_on(|store| Pin::from(handler.handle(store, event))) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returning this value requires that `'1` must outlive `'static`even with
507 /// Handle a debug event. 508 fn handle<'a>( 509 self: Arc<Self>, 510 store: StoreContextMut<'a, Self::Data>, 511 event: DebugEvent<'a>, 512 ) -> Box<dyn Future<Output = ()> + Send + 'a>;that should tie the lifetime inside the event to the store. Something to do with the anonymous lifetime inference with
block_on; it feels like it should be possible but I don't have enough constraint-fu right now to make it work, sorry!
cfallin submitted PR review.
cfallin created PR review comment:
The
Arcvariant of this is pretty messy too: it requires taking apart the unwind-state to temporarily take ownership, then reconstructing it later. I think I'm going to defer this as well since the error will come out of the invocation immediately after the debug event anyway. If we can make the borrowing work I'm happy to see a refactor in a followup though!
cfallin updated PR #11895.
cfallin commented on PR #11895:
OK, updated based on feedback -- final look before I merge? Thanks!
fitzgen submitted PR review:
LGTM, thanks!
alexcrichton commented on PR #11895:
With respect to clones, object-safety, lifetimes, and
async fn, it's all possible. That at least compiles in thewasmtimecrate, although I didn't go through tests yet. Would you prefer the design in this PR? Or ok if I have a follow-up to improve the API?
cfallin commented on PR #11895:
Ah, neat, an internal adapter that gets monomorphized in the top layer where we have T -- I'll incorporate that, thanks!
cfallin updated PR #11895.
cfallin commented on PR #11895:
All of that worked fine, thanks! I pulled in your patch and then added some more comments to the trait to indicate that it should be cheap to clone, i.e., is recommended to be an
Arcunder the covers or something similar.
cfallin has enabled auto merge for PR #11895.
cfallin updated PR #11895.
cfallin commented on PR #11895:
I pushed a change to avoid this failure where execution in Pulley did not emit a debug event for a divide-by-zero; this is because the trap-handling path is ever so slightly different in Pulley. I'll address in a followup (we're very close and it'd be honestly easier to keep Pulley working throughout this process than to exclude it in config); the
cfgto limit to Cranelift-native architectures is in 97336c19963fda33b229dcfbe33f9b880725fc4f.
cfallin has enabled auto merge for PR #11895.
cfallin merged PR #11895.
Last updated: Dec 06 2025 at 06:05 UTC