cfallin requested alexcrichton for a review on PR #12183.
cfallin opened PR #12183 from cfallin:debugger-crate to bytecodealliance:main:
The debug support in Wasmtime so far is structured around async callbacks that occur at certain kinds of events, like breakpoints. This is a suitable foundation, but it makes a top-level debugger awkward to implement: such a debugger likely has an event loop dealing with user commands (via a UI or a protocol connection) and expects to perform actions such as "run until next breakpoint".

This PR introduces a new crate that wraps a `Store` in a `Debugger`. This wrapper embodies an inner async body that can perform whatever actions it likes on the `Store` that is passed back in. This inner body is spawned as an async task. The debugger wrapper registers its own `DebugHandler` callback that communicates with the outside world via bidirectional command/response queues.

On the "outside", the `Debugger` presents an interface suitable for inserting into a debug protocol server or UI: an async method that runs until the next event and returns that event, and a method that permits querying or modifying the store whenever the `run` method is not executing. The latter operates by sending a closure over the queue, because the `Store` must continue to be owned by the async task that is (still) running and suspended in async callbacks.

Right now, this is exercised only via a few unit tests, but the intent is to next build up the "top half" of the debugger using this abstraction, e.g. by running a gdbstub protocol server (likely as a Wasm component in a "debug-main WIT world" -- RFC needed for this).

Also, when we eventually move debugging over to native use of `run_concurrent`, this paradigm should remain mostly unchanged at this level of API: there can still be an object that has an async method that runs and yields the next event, and there can still be a method that takes a closure that can operate (within its scope only) on the `Store`.

A few warts that I could use feedback on:

- Cancellation safety is weird. Fibers panic when dropped before execution of their body completes, and this seems to mean that we can't allow a `Debugger` to drop early (or, at least, that we can't allow the `tokio::test` unit test that owns the runtime running the async task to finish before the debugged body completes!). If there is a better way to handle cancellation safety here, I'm all ears.
- It's not clear to me whether the boxed-closure-and-`Any` approach to providing access to the `Store` is the best we can do, but I suspect it is.
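To make the shape of this interface concrete, here is a minimal sketch of what the wrapper could look like; the type and method names (`Debugger`, `run`, `with_store`) and all signatures are illustrative assumptions rather than the crate's actual API, and the queues and spawned task are elided behind `unimplemented!()`:

```rust
use anyhow::Result;
use wasmtime::Store;

/// Placeholder for whatever event type the debug callbacks surface
/// (breakpoint hit, body complete, ...).
pub enum Event {
    Breakpoint,
    Complete,
}

/// Hypothetical wrapper: holds the command/response queues; the `Store<T>`
/// itself lives inside a spawned async task running the inner body.
pub struct Debugger<T> {
    _marker: std::marker::PhantomData<T>,
}

impl<T: Send + 'static> Debugger<T> {
    /// Resume the debugged body and return the next event it produces.
    pub async fn run(&mut self) -> Result<Event> {
        unimplemented!("send a resume command, await the next event")
    }

    /// Query or modify the store while `run` is not executing. The closure
    /// is sent over the queue because the `Store` must stay owned by the
    /// suspended async task.
    pub async fn with_store<R, F>(&mut self, _f: F) -> Result<R>
    where
        F: FnOnce(&mut Store<T>) -> R + Send + 'static,
        R: Send + 'static,
    {
        unimplemented!("box the closure, send it, downcast the boxed-Any reply")
    }
}
```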
cfallin requested wasmtime-core-reviewers for a review on PR #12183.
cfallin requested wasmtime-default-reviewers for a review on PR #12183.
cfallin updated PR #12183.
cfallin updated PR #12183.
fitzgen submitted PR review:
Have to context switch, but want to get my comments thus far posted.
fitzgen created PR review comment:
Could be good to have an ASCII diagram of the state machine here. Assuming we start in the `Paused` state, effectively always beginning at a breakpoint just before the program's first instantiation, I'm thinking something like

```
    |
    |
    V
  Paused <---------> Queried
   ^ |
   | V
  Running
    |
    |
    V
 Complete
```

But now that I think about it some more, it seems like complete and paused are actually the same for "reactor" style programs? That is, those that are a bag of exported functions that can be called in any order, rather than those defining just a `main`. This is because you start out in the initial paused state, call into some exported function and do stuff, and then that can complete, but you can always call back in again and do more stuff and debug these subsequent entries into the program as well.

This also makes me think of something that I don't think we've discussed thus far in our debugging conversations: debugging the start function and instantiation in general. For example, what if a data segment's initialization traps due to an out-of-bounds memory access? We want to support debugging that, but I don't think we raise a debugging trap event for that yet, do we? (Obviously, we don't need to address all that in this PR, but I think this state machine should reflect the actual states we want to support in the fullness of time, IMO.) So given that, I think we actually want something like

```
        |
        V
  Uninstantiated <---.
     ^ |             |
     | |  (debugger says terminate)
     | V             |
   Running           |
     ^ |             |
     | V             |
   Paused -----------'
     ^ |
     | V
   Queried
```

where uninstantiated to running involves instantiation, and returning from `main` or any other export that we called into as the entry point puts us back at paused (since we can inspect memories and such in this state, but can't in the uninstantiated state).
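As a side note on the shape being proposed, the states in that second diagram could be written down as something like the following enum; this is purely illustrative and not code from the PR:

```rust
/// Illustrative only: the lifecycle sketched in the diagram above.
#[allow(dead_code)]
enum DebuggerState {
    /// No instance exists yet; memories and globals cannot be inspected.
    Uninstantiated,
    /// Instantiation, `main`, or some other entry-point export is executing.
    Running,
    /// Stopped at a breakpoint or after an entry point returned; the store
    /// can be inspected, resumed, or terminated back to `Uninstantiated`.
    Paused,
    /// A query/modification of the store is being serviced.
    Queried,
}
```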
fitzgen created PR review comment:
Maybe name this `Paused`, because the name is a bit ambiguous with `Complete` otherwise (a program stops when it is finished, does it not?)
cfallin submitted PR review.
cfallin created PR review comment:
Happy to add a state diagram!
One point of clarification regarding "before instantiation" or distinctions between different lifecycle states of a Wasm instance/component: this interface actually has no concept of instantiation. The "inner body" can be anything: it could create a bunch of instances, it could continue to call into an already-existing instance, etc. It inverts control so that the code on the outside gets events when the callback occurs, that's it; and the body doesn't run until the first `run()`.

So, yes, we could add debug events for out-of-bounds accesses by data segments. Any breakpoints in the start function should already "just work", though, in the sense that it'll be Wasm execution within the scope of the inner body.
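To illustrate that inversion of control (with a hypothetical `Debugger::new` signature, and assuming `store`, `linker`, and `module` are already in scope), the inner body is just an arbitrary async closure that is handed the store back:

```rust
// Hypothetical usage sketch; the real constructor and its signature may differ.
let mut debugger = Debugger::new(store, |mut store| async move {
    // The body decides everything about instantiation: one instance, many,
    // or calls into instances created elsewhere.
    let instance = linker.instantiate_async(&mut store, &module).await?;
    let start = instance.get_typed_func::<(), ()>(&mut store, "_start")?;
    start.call_async(&mut store, ()).await?;
    anyhow::Ok(())
});
// None of the above runs until the first `debugger.run().await`.
```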
alexcrichton submitted PR review:
> this seems to mean that we can't allow a `Debugger` to drop early
This is definitely a bug in Wasmtime that we need to fix. Can you add a failing test or gist one that I can help poke around?
alexcrichton created PR review comment:
You can drop `debug`/`std`/`async` here since that'll come in through the normal dependency
alexcrichton created PR review comment:
should this be omitted?
alexcrichton created PR review comment:
This should be define-able with `async fn handle(...)` and you don't need to worry about `Send`
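For reference, the suggestion is that an `async fn` in the impl desugars to the anonymous future for you; a generic illustration follows (the actual `DebugHandler` trait and its `handle` signature in Wasmtime may differ):

```rust
use anyhow::Result;

// Stand-in trait; not Wasmtime's actual `DebugHandler` definition.
trait Handler {
    async fn handle(&self, event: u32) -> Result<()>;
}

struct LoggingHandler;

impl Handler for LoggingHandler {
    // Writing this as `async fn` means no hand-rolled
    // `Pin<Box<dyn Future<Output = Result<()>> + Send>>` return type.
    async fn handle(&self, event: u32) -> Result<()> {
        println!("debug event: {event}");
        Ok(())
    }
}
```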
cfallin commented on PR #12183:
> This is definitely a bug in Wasmtime that we need to fix. Can you add a failing test or gist one that I can help poke around?
Sure thing -- with this commit and `cargo test -p wasmtime-internal-debugger -- basic_debugger` I see:

```
thread 'test::basic_debugger' (1129985) panicked at /home/cfallin/work/wasmtime/crates/wasmtime/src/runtime/fiber.rs:199:70:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
cfallin commented on PR #12183:
(It's entirely possible that I'm holding some part of the async machinery wrong, either in this PR or in earlier work to add the async debugger hook -- happy to have your opinions on that if so!)
alexcrichton commented on PR #12183:
That'll get fixed by this:
```diff
diff --git a/crates/wasmtime/src/runtime/fiber.rs b/crates/wasmtime/src/runtime/fiber.rs
index 68dce6884f..1ab5120ef3 100644
--- a/crates/wasmtime/src/runtime/fiber.rs
+++ b/crates/wasmtime/src/runtime/fiber.rs
@@ -358,7 +358,7 @@ impl StoreOpaque {
     }
 
     /// Returns whether `block_on` will succeed or panic.
-    #[cfg(feature = "call-hook")]
+    #[cfg(any(feature = "call-hook", feature = "debug"))]
     pub(crate) fn can_block(&mut self) -> bool {
         self.fiber_async_state_mut().current_future_cx.is_some()
     }
diff --git a/crates/wasmtime/src/runtime/store.rs b/crates/wasmtime/src/runtime/store.rs
index 0ce69fc6c3..ac6929d376 100644
--- a/crates/wasmtime/src/runtime/store.rs
+++ b/crates/wasmtime/src/runtime/store.rs
@@ -2839,6 +2839,9 @@ unsafe impl<T> VMStore for StoreInner<T> {
     #[cfg(feature = "debug")]
     fn block_on_debug_handler(&mut self, event: crate::DebugEvent<'_>) -> anyhow::Result<()> {
         if let Some(handler) = self.debug_handler.take() {
+            if !self.can_block() {
+                bail!("couldn't grab async_cx for debug handler")
+            }
             log::trace!("about to raise debug event {event:?}");
             StoreContextMut(self).with_blocking(|store, cx| {
                 cx.block_on(Pin::from(handler.handle(store, event)).as_mut())
```

tl;dr: when a store is dropped, all fibers are resumed with a trap that says "please die immediately", but there are some circumstances in internal code where, during the process of returning an error from everything, something async is attempted. The async attempt fails because the blocking context (rightfully) wasn't configured.

This happened before with call hooks, for example, where we would try to invoke the call hook when the fiber is exiting saying "hey, a trap happened". The fix here is to just skip debug events/call hooks/etc. during this situation, since if the store is being dropped that's not interesting anyway.
alexcrichton edited a comment on PR #12183:
(if you wouldn't mind this'd be a good thing to add a test for too)
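A test along those lines could look roughly like the following; the `Debugger` constructor, body, and `run` method are hypothetical stand-ins for the crate's real API, and a real test would arrange for the body to be suspended at a debug event at the time of the drop:

```rust
// Hypothetical sketch; real API names and setup may differ.
#[tokio::test]
async fn drop_debugger_before_body_completes() -> anyhow::Result<()> {
    let mut config = wasmtime::Config::new();
    config.async_support(true);
    let engine = wasmtime::Engine::new(&config)?;
    let store = wasmtime::Store::new(&engine, ());

    let mut debugger = Debugger::new(store, |_store| async move {
        // Body that would keep producing debug events if resumed.
        anyhow::Ok(())
    });

    // Run until the first event, then drop the debugger (and the store the
    // inner task owns) without letting the body complete; this must not
    // panic in the fiber machinery.
    let _event = debugger.run().await?;
    drop(debugger);
    Ok(())
}
```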
cfallin updated PR #12183.
cfallin submitted PR review.
cfallin created PR review comment:
Added in cce7e85d8c.
cfallin submitted PR review.
cfallin created PR review comment:
Ah, good point -- fixed in fac011ac15ccdef503841b073b1886215bd43bb2.
cfallin submitted PR review.
cfallin created PR review comment:
Good point, thanks -- fixed in fac011ac15ccdef503841b073b1886215bd43bb2.
cfallin submitted PR review.
cfallin created PR review comment:
Ah, yes, that was a copy-paste leftover -- fixed in fac011ac15ccdef503841b073b1886215bd43bb2.
cfallin submitted PR review.
cfallin created PR review comment:
Yes, good point. Fixed in fac011ac15ccdef503841b073b1886215bd43bb2.
cfallin updated PR #12183.
cfallin updated PR #12183.
cfallin commented on PR #12183:
Awesome, thanks @alexcrichton for that fix! Updated and added a test with an early drop of debugger/store.
alexcrichton submitted PR review:
:+1: from me (I defer to you/Nick to possibly await his approval too)
cfallin commented on PR #12183:
Thanks! I'll go ahead and merge and we can always tweak anything we need as we build on top of this.
cfallin commented on PR #12183:
The CI failure on i686 (i.e., pulley32) is due to a problem with `PreserveAll` interacting poorly with Pulley's call1/2/3/4 optimization; #12213 fixes it (tested locally). I'll merge that here once that PR lands.
cfallin updated PR #12183.
cfallin has enabled auto merge for PR #12183.
alexcrichton commented on PR #12183:
The CI error there is:
```
Running: `"curl" "--user-agent" "bytecodealliance/wasmtime auto-publish script" "https://crates.io/api/v1/crates/wasmtime-internal-debugger/owners"`
thread 'main' (2599) panicked at scripts/publish.rs:623:17:
crate wasmtime-internal-debugger is not owned by wasmtime-publish, please run:
    cargo owner -a wasmtime-publish wasmtime-internal-debugger
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

Notably, we verify that all crates are owned by `wasmtime-publish` and already published to crates.io -- this ensures the auto-publish later succeeds. Can you publish a stub crate and invite the `wasmtime-publish` owner?
cfallin commented on PR #12183:
Yep, I've just published the crate and invited `wasmtime-publish` -- let me know when ready to re-attempt landing.
cfallin commented on PR #12183:
I'm not sure I understand the failure now -- I added `wasmtime-publish` and it shows as an owner; I removed myself as an owner; it's still failing. @alexcrichton any ideas?
alexcrichton commented on PR #12183:
Oh, that's wasmtime-publish-the-GitHub-team, not wasmtime-publish-the-GitHub-user. If you run the command it printed out, it should be resolved.
cfallin commented on PR #12183:
Hmm. Unfortunately I removed myself from the crate after `wasmtime-publish` accepted the invite, so I'm not able to make any more permissions changes.

Do you have credentials for that user so you can log in and invite the team? Or have we lost the crate name forever?
(FWIW, I want to reiterate my suggestion from earlier that we have a runbook for all of this now that we have such a complex publishing and permissions scheme, because this was all very ambiguous and undocumented to me)
alexcrichton commented on PR #12183:
Yeah, the BA has credentials for wasmtime-publish in 1Password (that's how I accepted the invite in the first place). I added the team. Agreed this can be documented better (like pretty much everything else infrastructure-related), and it's something I hope to handle in January with the move to trusted publishing.
cfallin updated PR #12183.
cfallin has enabled auto merge for PR #12183.
cfallin commented on PR #12183:
Thanks!
cfallin merged PR #12183.