wasmtime / PR #11921 Debug: add some infrastructure for c... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #11921 Debug: add some infrastructure for c...

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin opened PR #11921 from cfallin:wasmtime-debug-callbacks-pulley to bytecodealliance:main:

This is a followup to #11895 where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass, and brings Pulley back to parity with native Cranelift in debug support currently.

This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps on a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap"; so I had to start to build out the actual trap-handling mechanisms. In any case, this will all be needed for followup work soon that will handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.

This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:

A raw NonNull<dyn VMStore> in the CallThreadState, with a long comment about provenance and mut-borrow exclusivity. This is needed right now to allow the interpreter to invoke the debug event handler, but will soon be needed when injecting hostcalls on signals, because a signal context also has no state available from the Wasm code other than what is in TLS. Hence, we need a way to get the store back from the Wasm when we do something that is "morally a hostcall" at a trapping instruction.

I do believe this is sound, or at least close to it if not (please scrutinize carefully!); the basic idea is that the Wasm acts as an opaque blob in the middle, and the pointer comes out of it one way or another (the normal way, as the first arg to a hostcall, or the weird way, via TLS and the CallThreadState during a trap). Exclusive ownership is still clear at any given point and only one &mut ever exists in the current frame at a time. That said, I haven't tested with miri yet.

This does require careful thought about the Wasm compilation, too; we need the moral equivalent of a &mut self reborrow as-if we were making a hostcall on each trapping instruction. It turns out that we already treat them as memory-fence instructions, so nothing loaded from the store can be moved or cached across them, and I've added a comment now about how this is load-bearing.

Updates to CallThreadState's "exit state", normally set by the exit trampoline, that we now also set when we invoke a debug event handler during a trap context[^1] so that Store::debug_frames properly sees the current activation. This is a little more awkward than it could be because we store the tramopline FP, not last Wasm FP, and there is no trampoline frame in this case, so I've added a flag and some conditionals. I'm happy to refactor instead to go (back) to storing the last Wasm FP instead, with the extra load in the exit trampoline to compute that.

A whole bunch of plumbing, creating a large but mechanical diff, in the code translator to actually add debug tags on all traps and calls to raise. It turns out that once I got all of the above working in Pulley, the test disagreed about current Wasm PC between native and Pulley, and Pulley was right; native was getting it wrong because the ud2 was sunk to the bottom in a cold block and, without tags, we scanned backward to pick up the last Wasm PC in the function. This new plumbing and addition of tags in all the appropriate places fixes that.

[^1]: I keep saying "during a trap context" here, but to avoid any
signal-safety scares, note that when this is done for native
signals in a followup PR, we will inject a hostcall by modifying
stack/register state and returning from the actual signal context,
so it really is as-if we did a hostcall from a trapping
instruction.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested wasmtime-compiler-reviewers for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested alexcrichton for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested wasmtime-core-reviewers for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:27):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:29):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:29):

cfallin edited PR #11921:

This is a followup to #11895 where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass, and brings Pulley back to parity with native Cranelift in debug support currently.

This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps on a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap"; so I had to start to build out the actual trap-handling mechanisms. In any case, this will all be needed for followup work soon that will handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.

This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:

A raw NonNull<dyn VMStore> in the CallThreadState, with a long comment about provenance and mut-borrow exclusivity. This is needed right now to allow the interpreter to invoke the debug event handler, but will soon be needed when injecting hostcalls on signals, because a signal context also has no state available from the Wasm code other than what is in TLS. Hence, we need a way to get the store back from the Wasm when we do something that is "morally a hostcall" at a trapping instruction.

I do believe this is sound, or at least close to it if not (please scrutinize carefully!); the basic idea is that the Wasm acts as an opaque blob in the middle, and the pointer comes out of it one way or another (the normal way, as the first arg to a hostcall, or the weird way, via TLS and the CallThreadState during a trap). Exclusive ownership is still clear at any given point and only one &mut ever exists in the current frame at a time. That said, I haven't tested with miri yet.

This does require careful thought about the Wasm compilation, too; we need the moral equivalent of a &mut self reborrow as-if we were making a hostcall on each trapping instruction. It turns out that we already treat them as memory-fence instructions, so nothing loaded from the store can be moved or cached across them, and I've added a comment now about how this is load-bearing.

Updates to CallThreadState's "exit state", normally set by the exit trampoline, that we now also set when we invoke a debug event handler during a trap context[^1] so that Store::debug_frames properly sees the current activation. This is a little more awkward than it could be because we store the trampoline FP, not last Wasm FP, and there is no trampoline frame in this case, so I've added a flag and some conditionals. I'm happy to refactor instead to go (back) to storing the last Wasm FP instead, with the extra load in the exit trampoline to compute that.

A whole bunch of plumbing, creating a large but mechanical diff, in the code translator to actually add debug tags on all traps and calls to raise. It turns out that once I got all of the above working in Pulley, the test disagreed about current Wasm PC between native and Pulley, and Pulley was right; native was getting it wrong because the raise libcall was sunk to the bottom in a cold block and, without tags, we scanned backward to pick up the last Wasm PC in the function. This new plumbing and addition of tags in all the appropriate places fixes that.

[^1]: I keep saying "during a trap context" here, but to avoid any
signal-safety scares, note that when this is done for native
signals in a followup PR, we will inject a hostcall by modifying
stack/register state and returning from the actual signal context,
so it really is as-if we did a hostcall from a trapping
instruction.

Wasmtime GitHub notifications bot (Oct 23 2025 at 03:35):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

Personally I'm a bit wary about how this is an ever-expanding structure with a huge matrix of fields all customized for various purposes. Especially with a whole litany of last_* fields for all sorts of purposes it's kind of hard to keep it all straight.

Especially when I think these fields aren't actually modified from compiled code at all, would it be possible to store this elsewhere and/or infer it from control flow and/or other state? For example instead of last_wasm_exit_trap_fp could this store directly in last_wasm_exit_trampoline_fp? And instead of storing last_wasm_exit_was_trap could that be carried around as a paramter in relevant locations instead?

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

It looks like the only use of this is the interpreter bits, and for that the trap functionality already has a VMContext in scope. Would it be possible to use that VMContext already there instead of storing this here?

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

To confirm, you see no way this field could be stored in FuncEnvironment somehow? This is a lot of plumbing...

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

One of my worries is that as this state grows and grows more for all sorts of one-off things it becomes more and more difficult to ensure that everything is always kept in sync with everything else. I find it's more robust to reuse existing state and/or minimize state where possible to ensure that everything is exercised more often, where possible at least

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:20):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:20):

cfallin created PR review comment:

We could, potentially; I don't know why it was originally designed this way. I'm happy to put together a separate refactor PR to move stacks under env (at worst there is some borrowing-related reason I haven't seen and I'll run into a roadblock there).

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:26):

cfallin created PR review comment:

We could get to it through that VMContext and then VMStoreContext, but then two things:

We'd need to add an equivalent field (a backpointer) to the VMStoreContext to get to the dyn VMStore; so we wouldn't be removing this field, only moving it; and

We need this in followup work (my next PR) outside of an interpreter context, when we inject calls into the runtime on signals and we don't have a vmctx (we know nothing about the register state we've walked in on during the signal).

In the latter case we can still get the VMStoreContext from the CallThreadState from TLS; so the question is just, do we want to add a layer of indirection and put the NonNull<dyn VMStore> there instead (but on the flip side, not have to set it on every Wasm entry). Happy to go either way...

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:26):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:31):

cfallin created PR review comment:

Unfortunately not, as far as I can tell:

The basic problem is that we have a Store (or StoreContextMut) exposed to the debug callback. That Store fundamentally has to know about the last activation so it can provide a view into those frames (via debug_frames). The way that we view the stack depends on exactly how the host entry occurred, and a trap is a new kind of host entry -- previously we didn't actually enter all that far into host code, only far enough to do the raise and return back out on the other side. There's no direct call-parameter dataflow we can add this onto; we aren't calling into the debug handler with some special thing that already has interpreted the debug frames, we're giving it the full Store. In particular, a trapping context means that we interpret PC differently: it points to the trapping instruction, not the instruction-after as in calls. So this logically belongs with the "end of the last activation" state: it's required to interpret the exit PC/FP. Also, because there's no trampoline, reusing last_wasm_exit_trampoline_fp is the wrong choice I think.

I also share your unease with the ugliness here, and I think the one possible cleanup I see is to refactor back to storing the last Wasm FP, not trampoline FP. What do you think?

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:31):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:32):

cfallin edited PR review comment.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:35):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:35):

cfallin created PR review comment:

Also, regarding storing elsewhere: because it's part of the "last activation" state, it needs to be saved/restored along with other parts of the VMStoreContext when, for example, CallThreadStates are pushed/popped on the activation chain when fibers suspend. That all works with immutable borrows today and directly accesses the unsafe cells in this location; that's why I kept it here and not somewhere else.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:47):

cfallin created PR review comment:

OK, I've gone down this road a bit, but I don't think it's any better: the diff is huge (+1446 -938 so far, and I'm only ~a third done), and moreover it doesn't actually improve the plumbing situation here: there are still plenty of places where neither stacks nor env was available where we fundamentally need one or the other to get the debug tags right.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:47):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:51):

cfallin created PR review comment:

(To give a more principled argument why: we now need some metadata everywhere we could have a trap. We currently have a bunch of places low in the abstraction stack where we only have the Cranelift-level function builder, and some raw SSA values, and we're generating logic that could trap. Previously this was fine because we just added a static trap code indicating the kind of trap, and SourceLocs already track where we are. Now we need more metadata, e.g. for the stack shape. So fundamentally we need to carry that somehow and we have no Wasmtime-specific place to hang it now.)

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:51):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:27):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:28):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:28):

cfallin created PR review comment:

I did the refactor to remove "trampoline FP" (in normal case) and "trap FP" (in trap case) bifurcation in favor of one FP in 7fd1f60e7c. We still need the was_trap state (and to save/restore it); there's no way to avoid that bit as far as I can tell.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:47):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Thinking more about this, this isn't sound due to the active use of TLS when an async function is invoked (e.g. handling a debug event). That might involve a fiber/thread switch which means that resumption on another thread would break the tls::with block.

Somehow, I'm not sure how best how, but the invocation of the debug handler needs to be outside of tls::with. This is also why I tried really hard recently to get all async invocations to "the libcall calls it and nothing else" and this is effectively a regression from that where we're back to the state of "deep in the bowels of the runtime all of a sudden there's a fiber switch".

Basically I think this might be worth rethinking how the async interface is implemented in the runtime. For example something should have prevented this construction in the first place via some sort of "need to be in async context but not safely in an async context".

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Also, thinking a bit more on this, I think we should explicitly document debugging support as not tier 1 (either 2 or 3, probably 3 while it's in-progress) and possibly consider disabling it by default instead of enabling it by default. Given the pretty large amount of new unsafe and new primitives being written for it I wouldn't be confident enough myself to guarantee it's "likely CVE-free" until we've had a good deal more time to think about the unsafe primitives, fuzz things, etc.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Well, if the goal is to put this field here, then that leads to other questions for me:

Much of the rest of the state in CallThreadState exists because it doesn't have a pointer to Store, so if one was added then it seems like all the state here should be removed in favor of accessing it through the store.

Have you run this through the Miri tests? I think that this may be miri-unsound at a first glance (but I've got about a 70% hit rate on predicting this so not great). The dyn VMStore is passed to CallThreadState::new which gets a raw pointer from that, but immediately later the store is accessed store.0.executor() which I believe effectively invalidates the previously acquired mutable pointer.

In general I don't really understand the provenance comment here. My understanding is that the invariant we need to uphold is that (a) when we enter wasm we start with a safe VMStorereference and (b) when in wasm all usage of VMStore is derived from either that raw pointers or other raw pointers originally derived. I don't know why that would limit usage of the store to just the trap handler or why Cranelift's optimizations or such would come into play here.

Put another way, if this field exists then I think other store-related fields should all be deleted to avoid having state to keep in sync. Taking this provenance comment as-is means that it's not actually possible to do that because the field can only be used during traps, but I don't think that's accurate which means that I don't think the comment here is accurate.

For Miri-things I think ideally a CallThreadState would have a lifetime parameter which is a phantom borrow of the &'a mut dyn VMStore which logically ensures that while the CallThreadState is active all usage of the store must be derived from the state itself rather than via external borrows.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Well the current state here is that we pass in three different "context" style arguments which are basically just plumbed everywhere. I don't really know how best to address it, but it feels not great to pass so many context-style arguments all over the place. I'm not disputing that FuncTranslationStacks is needed, I'm mostly wondering why it's not possible to reduce all the boilerplate.

we have no Wasmtime-specific place to hang it now

How come this isn't FuncEnvironment? Isn't that Wasmtime-specific?

On one hand I don't want to necessarily add more work to this PR and it's ok to address some othe time. On the other hand though it also feels likely that the next time this happens we'll have 4 context arguments to pass everywhere because for one reason or another it won't fit into the other contexts. Personally I feel like it's worth investigating why so many context arguments are needed here and whether something can be changed to adjust this (as a separate PR, but not an indefinitely-delayed PR to happen in the future).

For example one thing to try would indeed be a fourth context which has pointers/references to all 3 other contexts. That fourth context is then the only context passed everywhere and methods/etc are all updated to be on the fourth context. This to me would decrease the cognitive load where there's still just a single context to actually worry about and the only minor detail is that the constituent parts of said context come from different locations which necessesitates a few wrapper structures.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:30):

cfallin created PR review comment:

So the basic issue is that FuncEnvironment is not plumbed everywhere; that is why I mentioned that we need some new plumbing in any case. Many of the places that didn't have stacks also did not have the environment.

I plumbed through the bit that we need (stacks) but if we put stacks in the environment, we now need (i) a huge diff to do that refactor, and (ii) to add the required plumbing anyway. That's all I'm trying to say: I don't think there's a nice clean small diff we can have in this PR, it has to have a lot of churn to get the information we need.

I agree refactoring this all would be nice. Ideally not in this PR!

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:30):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:14):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:14):

cfallin created PR review comment:

I haven't figured out how to run this in miri yet (the furthest I've gotten is the blanket prohibition on Cranelift-compiling code with the error that results in me putting #[cfg_attr(miri, ignore)] on all my tests, but I haven't put any time into looking at how you've built miri testing elsewhere). I should definitely do that.

That said, a few thoughts:

I agree a broader refactor to have just a dyn VMStore pop out the other side of CallThreadState would actually be ideal, then make this well-tested, well-vetted, and standardize on it. In general it seems like we've plumbed through just enough state for the things that we do need when we get control without a vmctx; doing this right once would be best.

I tried to limit this to "during traps" to make the distinction about whether we have a vmctx passed in via a normal hostcall route or not. Basically, there is always one way that the root of ownership of the store passes back into the host: either via TLS -> CallThreadState -> VMStoreContext -> raw VMStore pointer (trap route) or via vmctx -> Instance -> store pointer (Instance::enter_store_from_wasm route for libcalls).

Technically I don't see a reason why we couldn't always use the TLS route, except for the obvious one, performance (and then the fiber-suspend one which you raise below, about which more thoughts below...).

All of this hinges on my understanding of provenance which is that (true mut borrow) -> (multiple raw pointers) -> (true mut borrow derived from one of those raw pointers) is safe, i.e., you can pick one of multiple dataflow paths to get your raw pointer, and the others are just invalidated and must not be used. That should be what's going on here.

On Cranelift optimizations: we're accustomed to thinking about Wasm as opaque, and considering only the provenance story on the Rust side, but it's important to think about what provenance means in a general compiler sense and ensuring we're not committing the same sins on the other side. In particular, if we have an implicit hostcall on trap, that's as-if we create a new &mut Store, so just as provenance exists as a concern on the Rust side because language semantics around aliasing are load-bearing for optimizations, we have alias analysis and optimizations on the Cranelift side too. My point there is that we must ensure we have no cached state loaded from the store across any trapping instruction if we do this implicit hostcall thing; that would be exactly equivalent to an exclusive-mut-borrow violation on the Rust side. The reason this is discussed at all in this context is that we are taking the &mut dyn VMStore now and using it, so we're creating this limitation on the Cranelift side. One depends on the other.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:16):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:16):

cfallin created PR review comment:

Hmm. I agree we can address this by pulling the state out of the tls::with somehow -- i.e., the raw store pointer. We're not doing any store ownership yielding if we fiber-suspend so there's nothing unsafe with keeping that around across suspends, I think (correct me if I'm wrong?), so I just need to restructure my closure-based helpers?

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:29):

cfallin commented on PR #11921:

Thanks for all your comments so far, @alexcrichton! A few overall thoughts:

In parallel to this, over in #11930, I have started to realize that injecting implicit hostcalls during traps may not be worth the complexity given variations in signal-handling quirks between platforms (got it working on Linux but each of macOS and Windows need different design bits). If that goes away and we require hostcall-based traps everywhere, then the only remaining "debugger gets control outside of a libcall" situation is Pulley's use of trapping instructions (e.v. my div-by-zero test which this PR enables), and there I'd prefer to refactor to make Pulley use true hostcall-based traps in those cases too.

So if we go that way, we avoid the need to re-enter the runtime without a vmctx passed to us the normal way; and all of these ownership/provenance questions go away.

That said: I realize I'm fighting a somewhat uphill battle on store provenance, but I do suspect it might be a need elsewhere in the future, for example if we do want to enable signal-based traps in debugging for performance later. At a high level the ownership handoff is dynamically valid; and the overall scheme should be compatible with stacked borrows (one mut borrow -> multiple raw pointers -> one mut borrow derived from one of them). I'm happy to keep talking about this and try for a refactor where we shift to one raw store pointer in the CallThreadState and derive other things (unsafe-intrinsics' T pointer, the VMStoreContext, ...) from that. That would give us a framework with fewer limitations for future work. The trick is finding the right abstraction for a safe API, I think...

That also said, it occurs to me that the Accessor mechanism, with its TLS mechanism already vetted, could be an alternate foundation for all of this. (We already know it's the eventual alternate foundation for all of this, but I mean even in a callback world.) Basically, transfer ownership of the debug handler if any to the CallThreadState on entry and take back on exit; then that's all we need to invoke the handler, and it can keep its locally captured Access.

I don't necessarily want to create a dependency on all of the refactoring you are planning to do around that, though...

Wasmtime GitHub notifications bot (Nov 01 2025 at 17:35):

alexcrichton commented on PR #11921:

Ok I now actually got a chance to sit down and properly read your last comment @cfallin. My read on it though is that when our discussions from the CG meeting are mixed in it's likely that most of this PR is moot since the store will come through libcalls primarily. Does that match your understanding though? Or will we still want most of this to land?

Wasmtime GitHub notifications bot (Nov 01 2025 at 18:19):

cfallin closed without merge PR #11921.

Wasmtime GitHub notifications bot (Nov 01 2025 at 18:19):

cfallin commented on PR #11921:

Yes, I think we've subsumed the need for this (and also call-injection for now); and in fact I have a WIP branch to address Pulley support the other way, by doing the equivalent of libcall-based traps (which runs into some complications due to unimplemented opcodes that I'm working through). I'll close this and I was planning on writing up a "meta-issue" for the current thinking on path to debug MVP. Thanks!

Last updated: Feb 24 2026 at 05:28 UTC