Stream: git-wasmtime

Topic: wasmtime / PR #11921 Debug: add some infrastructure for c...


Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin opened PR #11921 from cfallin:wasmtime-debug-callbacks-pulley to bytecodealliance:main:

This is a followup to #11895, where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass and brings Pulley back to parity with native Cranelift's current debug support.

This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps on a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap"; so I had to start building out the actual trap-handling mechanisms. In any case, this will all be needed for followup work that will soon handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.

This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:

[^1]: I keep saying "during a trap context" here, but to avoid any
signal-safety scares, note that when this is done for native
signals in a followup PR, we will inject a hostcall by modifying
stack/register state and returning from the actual signal context,
so it really is as-if we did a hostcall from a trapping
instruction.


Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested wasmtime-compiler-reviewers for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested alexcrichton for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:23):

cfallin requested wasmtime-core-reviewers for a review on PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:27):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:29):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 01:29):

cfallin edited PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 03:35):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

Personally I'm a bit wary about how this is an ever-expanding structure with a huge matrix of fields, all customized for various purposes. Especially with a whole litany of last_* fields for all sorts of purposes, it's kind of hard to keep it all straight.

Especially since I think these fields aren't actually modified from compiled code at all, would it be possible to store this elsewhere and/or infer it from control flow and/or other state? For example, instead of last_wasm_exit_trap_fp, could this be stored directly in last_wasm_exit_trampoline_fp? And instead of storing last_wasm_exit_was_trap, could that be carried around as a parameter in relevant locations instead?
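
(A minimal sketch of the consolidation being suggested here; all names are placeholders mirroring the discussion, not Wasmtime's actual definitions:)

```rust
// Hypothetical sketch only; these are not Wasmtime's real types.

/// Before: one field per purpose, each of which must be kept in sync by
/// every entry/exit path.
struct ExitStateBefore {
    last_wasm_exit_trampoline_fp: usize,
    last_wasm_exit_trap_fp: usize,
    last_wasm_exit_was_trap: bool,
}

/// How the most recent exit from Wasm happened; carried as a parameter
/// where needed instead of stored.
enum ExitKind {
    Call,
    Trap,
}

/// After: a single FP field regardless of how the exit happened.
struct ExitStateAfter {
    last_wasm_exit_fp: usize,
}

fn record_exit(state: &mut ExitStateAfter, fp: usize, kind: ExitKind) {
    state.last_wasm_exit_fp = fp;
    // `kind` flows through the call graph to the places that interpret
    // the saved FP/PC, rather than living in persistent state.
    let _ = kind;
}
```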

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

It looks like the only use of this is the interpreter bits, and for that, the trap functionality already has a VMContext in scope. Would it be possible to use the VMContext already there instead of storing this here?

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

To confirm, you see no way this field could be stored in FuncEnvironment somehow? This is a lot of plumbing...

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:11):

alexcrichton created PR review comment:

One of my worries is that as this state grows more and more for all sorts of one-off things, it becomes more and more difficult to ensure that everything is always kept in sync with everything else. I find it's more robust to reuse existing state and/or minimize state where possible, so that everything is exercised more often.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:20):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:20):

cfallin created PR review comment:

We could, potentially; I don't know why it was originally designed this way. I'm happy to put together a separate refactor PR to move stacks under env (at worst there is some borrowing-related reason I haven't seen and I'll run into a roadblock there).

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:26):

cfallin created PR review comment:

We could get to it through that VMContext and then VMStoreContext, but then two things:

In the latter case we can still get the VMStoreContext from the CallThreadState in TLS; so the question is just: do we want to add a layer of indirection and put the NonNull<dyn VMStore> there instead (but, on the flip side, not have to set it on every Wasm entry)? Happy to go either way...
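
(A hedged sketch of the two access paths being weighed; `CallThreadState`, `VMStoreContext`, and the TLS accessor below are illustrative stand-ins, not the real Wasmtime definitions or signatures:)

```rust
use std::cell::RefCell;
use std::ptr::NonNull;

struct VMStoreContext {
    // last-exit state, stack limits, ...
}

struct CallThreadState {
    // Option A: only the store-context pointer lives here; it is already
    // reachable from TLS with no extra per-entry bookkeeping.
    vm_store_context: NonNull<VMStoreContext>,
    // Option B would additionally stash a `NonNull<dyn VMStore>`-style
    // pointer: one more layer of indirection, but set once per Wasm entry.
}

thread_local! {
    static ACTIVATION: RefCell<Option<NonNull<CallThreadState>>> =
        RefCell::new(None);
}

/// Option A in miniature: recover the VMStoreContext from the activation
/// recorded in TLS rather than storing a second pointer for it.
fn store_context_from_tls() -> Option<NonNull<VMStoreContext>> {
    ACTIVATION.with(|a| {
        a.borrow()
            .map(|cts| unsafe { cts.as_ref().vm_store_context })
    })
}
```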

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:26):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:31):

cfallin created PR review comment:

Unfortunately not, as far as I can tell:

The basic problem is that we have a Store (or StoreContextMut) exposed to the debug callback. That Store fundamentally has to know about the last activation so it can provide a view into those frames (via debug_frames). The way that we view the stack depends on exactly how the host entry occurred, and a trap is a new kind of host entry -- previously we didn't actually enter all that far into host code, only far enough to do the raise and return back out on the other side. There's no direct call-parameter dataflow we can add this onto; we aren't calling into the debug handler with some special thing that already has interpreted the debug frames, we're giving it the full Store. In particular, a trapping context means that we interpret PC differently: it points to the trapping instruction, not the instruction-after as in calls. So this logically belongs with the "end of the last activation" state: it's required to interpret the exit PC/FP. Also, because there's no trampoline, reusing last_wasm_exit_trampoline_fp is the wrong choice I think.

I also share your unease with the ugliness here, and I think the one possible cleanup I see is to refactor back to storing the last Wasm FP, not trampoline FP. What do you think?
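
(To make the PC point above concrete, a tiny illustrative sketch; the names are hypothetical, not Wasmtime's:)

```rust
/// How the most recent entry into the host occurred; a trap is the new
/// kind of host entry discussed above.
enum HostEntryKind {
    /// A call out of Wasm: the saved PC is a return address, i.e. it
    /// points at the instruction after the call.
    Call,
    /// A trap: the saved PC points at the trapping instruction itself.
    Trap,
}

/// Map the saved exit PC to an address suitable for frame/debug lookups.
fn frame_lookup_pc(exit_pc: usize, kind: HostEntryKind) -> usize {
    match kind {
        // Step back by one so the lookup lands inside the call
        // instruction rather than at the start of the next one.
        HostEntryKind::Call => exit_pc - 1,
        // The PC already names the faulting instruction.
        HostEntryKind::Trap => exit_pc,
    }
}
```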

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:31):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:32):

cfallin edited PR review comment.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:35):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 18:35):

cfallin created PR review comment:

Also, regarding storing this elsewhere: because it's part of the "last activation" state, it needs to be saved/restored along with the other parts of the VMStoreContext when, for example, CallThreadStates are pushed/popped on the activation chain as fibers suspend. That all works with immutable borrows today and directly accesses the unsafe cells in this location; that's why I kept it here and not somewhere else.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:47):

cfallin created PR review comment:

OK, I've gone down this road a bit, but I don't think it's any better: the diff is huge (+1446 -938 so far, and I'm only ~a third done), and moreover it doesn't actually improve the plumbing situation here: there are still plenty of places where neither stacks nor env was available but we fundamentally need one or the other to get the debug tags right.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:47):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:51):

cfallin created PR review comment:

(To give a more principled argument for why: we now need some metadata everywhere we could have a trap. We currently have a bunch of places low in the abstraction stack where we have only the Cranelift-level function builder and some raw SSA values, and we're generating logic that could trap. Previously this was fine because we just added a static trap code indicating the kind of trap, and SourceLocs already track where we are. Now we need more metadata, e.g. for the stack shape. So fundamentally we need to carry that somehow, and we have no Wasmtime-specific place to hang it now.)

Wasmtime GitHub notifications bot (Oct 23 2025 at 19:51):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:27):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:28):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:28):

cfallin created PR review comment:

I did the refactor to remove the "trampoline FP" (in the normal case) / "trap FP" (in the trap case) bifurcation in favor of one FP in 7fd1f60e7c. We still need the was_trap state (and to save/restore it); there's no way to avoid that bit as far as I can tell.

Wasmtime GitHub notifications bot (Oct 23 2025 at 20:47):

cfallin updated PR #11921.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Thinking more about this, this isn't sound, due to the active use of TLS when an async function is invoked (e.g. handling a debug event). That might involve a fiber/thread switch, which means that resumption on another thread would break the tls::with block.

Somehow -- I'm not sure how best -- the invocation of the debug handler needs to be outside of tls::with. This is also why I tried really hard recently to get all async invocations to "the libcall calls it and nothing else", and this is effectively a regression from that, where we're back to the state of "deep in the bowels of the runtime, all of a sudden there's a fiber switch".

Basically I think this might be worth rethinking how the async interface is implemented in the runtime. For example, something should have prevented this construction in the first place via some sort of "needs to be in an async context, but is not safely in an async context" distinction.
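
(A sketch of the hazard being described, under the assumption that a fiber suspended inside the handler can resume on a different OS thread; all names here are invented:)

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the per-thread activation state kept in TLS.
    static ACTIVATION: Cell<usize> = Cell::new(0);
}

async fn debug_event_handler() {
    // May suspend the current fiber; resumption can happen on a
    // different OS thread with different TLS contents.
}

async fn unsound_shape() {
    // BUG (illustrative): TLS is consulted, then the async handler runs
    // while that TLS-derived context is still logically "held".
    let activation = ACTIVATION.with(|a| a.get());
    debug_event_handler().await;
    // After the await we may be on another thread, so re-reading TLS here
    // can observe unrelated state and `activation` no longer matches it.
    assert_eq!(activation, ACTIVATION.with(|a| a.get())); // may fail
}

async fn sound_shape() {
    // The shape being asked for: finish all TLS access first, then invoke
    // the handler outside of the TLS-scoped block, passing anything it
    // needs as explicit arguments.
    let activation = ACTIVATION.with(|a| a.get());
    let _ = activation;
    debug_event_handler().await;
}
```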

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Also, thinking a bit more on this, I think we should explicitly document debugging support as not tier 1 (either 2 or 3, probably 3 while it's in progress) and possibly consider disabling it by default instead of enabling it by default. Given the pretty large amount of new unsafe code and new primitives being written for it, I wouldn't be confident enough myself to guarantee it's "likely CVE-free" until we've had a good deal more time to think about the unsafe primitives, fuzz things, etc.

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Well, if the goal is to put this field here, then that leads to other questions for me:

Put another way: if this field exists, then I think the other store-related fields should all be deleted, to avoid having state to keep in sync. Taking this provenance comment as-is means that it's not actually possible to do that, because the field can only be used during traps; but I don't think that's accurate, which means that I don't think the comment here is accurate.

For Miri things, I think ideally CallThreadState would have a lifetime parameter which is a phantom borrow of the &'a mut dyn VMStore, which logically ensures that while the CallThreadState is active, all usage of the store must be derived from the state itself rather than via external borrows.
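
(One possible shape for that idea, as a sketch; `VMStore` and `CallThreadState` here are illustrative stand-ins, not the actual implementation:)

```rust
use std::marker::PhantomData;
use std::ptr::NonNull;

trait VMStore {}

/// While a `CallThreadState<'a>` is alive it logically holds the
/// `&'a mut dyn VMStore`, so all store access must flow through it; the
/// borrow checker then rules out concurrent external borrows.
struct CallThreadState<'a> {
    store: NonNull<dyn VMStore + 'a>,
    _borrow: PhantomData<&'a mut dyn VMStore>,
}

impl<'a> CallThreadState<'a> {
    fn new(store: &'a mut (dyn VMStore + 'a)) -> Self {
        CallThreadState {
            store: NonNull::from(store),
            _borrow: PhantomData,
        }
    }

    /// The only sanctioned way to touch the store while this state is
    /// active.
    fn with_store<R>(&mut self, f: impl FnOnce(&mut (dyn VMStore + 'a)) -> R) -> R {
        // Safety: `store` came from a unique `&'a mut` borrow at
        // construction, and `&mut self` prevents aliasing re-entry
        // through this state.
        f(unsafe { self.store.as_mut() })
    }
}
```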

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:16):

alexcrichton created PR review comment:

Well, the current state here is that we pass in three different "context"-style arguments which are basically just plumbed everywhere. I don't really know how best to address it, but it feels not great to pass so many context-style arguments all over the place. I'm not disputing that FuncTranslationStacks is needed; I'm mostly wondering why it's not possible to reduce all the boilerplate.

> we have no Wasmtime-specific place to hang it now

How come this isn't FuncEnvironment? Isn't that Wasmtime-specific?


On one hand I don't want to necessarily add more work to this PR, and it's ok to address some other time. On the other hand, it also feels likely that the next time this happens we'll have 4 context arguments to pass everywhere, because for one reason or another it won't fit into the other contexts. Personally I feel like it's worth investigating why so many context arguments are needed here and whether something can be changed to adjust this (as a separate PR, but not an indefinitely-delayed PR to happen in the future).

For example, one thing to try would indeed be a fourth context which has pointers/references to all 3 other contexts. That fourth context is then the only context passed everywhere, and methods/etc. are all updated to be on the fourth context. This to me would decrease the cognitive load: there's still just a single context to actually worry about, and the only minor detail is that the constituent parts of said context come from different locations, which necessitates a few wrapper structures.
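
(A sketch of that "fourth context" shape; the three placeholder structs stand in for the existing contexts and are not the real types:)

```rust
// Invented placeholders for the three contexts discussed above.
struct FuncEnvironment {
    // Wasmtime-specific translation state
}
struct FunctionBuilder {
    // Cranelift-level IR builder
}
struct FuncTranslationStacks {
    // operand/control stacks
}

/// The single context handed around everywhere; methods live here, and
/// the constituent parts are borrowed from their separate owners once at
/// the top of translation.
struct TranslationCx<'a> {
    env: &'a mut FuncEnvironment,
    builder: &'a mut FunctionBuilder,
    stacks: &'a mut FuncTranslationStacks,
}

impl<'a> TranslationCx<'a> {
    fn translate_op(&mut self) {
        // Each helper sees one context instead of three parallel
        // arguments; the parts are reborrowed as needed.
        let (_env, _builder, _stacks) =
            (&mut *self.env, &mut *self.builder, &mut *self.stacks);
    }
}
```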

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:30):

cfallin created PR review comment:

So the basic issue is that FuncEnvironment is not plumbed everywhere; that is why I mentioned that we need some new plumbing in any case. Many of the places that didn't have stacks also did not have the environment.

I plumbed through the bit that we need (stacks) but if we put stacks in the environment, we now need (i) a huge diff to do that refactor, and (ii) to add the required plumbing anyway. That's all I'm trying to say: I don't think there's a nice clean small diff we can have in this PR, it has to have a lot of churn to get the information we need.

I agree refactoring this all would be nice. Ideally not in this PR!

Wasmtime GitHub notifications bot (Oct 24 2025 at 21:30):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:14):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:14):

cfallin created PR review comment:

I haven't figured out how to run this in Miri yet (the furthest I've gotten is the blanket prohibition on Cranelift-compiled code, with the resulting error leading me to put #[cfg_attr(miri, ignore)] on all my tests; I haven't put any time into looking at how you've built Miri testing elsewhere). I should definitely do that.

That said, a few thoughts:

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:16):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:16):

cfallin created PR review comment:

Hmm. I agree we can address this by pulling the state out of the tls::with somehow -- i.e., the raw store pointer. We're not doing any store-ownership yielding if we fiber-suspend, so there's nothing unsafe about keeping that around across suspends, I think (correct me if I'm wrong?); so I just need to restructure my closure-based helpers?
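
(A sketch of that restructuring, with invented names: the raw pointer is extracted under the TLS accessor, and the possibly-suspending handler runs outside it:)

```rust
use std::cell::Cell;
use std::ptr;

struct Store; // stand-in for the real store type

thread_local! {
    // Stand-in for the TLS slot that today is only consulted inside
    // closure-based helpers.
    static CURRENT_STORE: Cell<*mut Store> = Cell::new(ptr::null_mut());
}

fn on_debug_event(run_handler: impl FnOnce(*mut Store)) {
    // Step 1: only the pointer extraction happens under the TLS accessor.
    let store: *mut Store = CURRENT_STORE.with(|s| s.get());
    // Step 2: the handler, which may fiber-suspend, runs outside any
    // TLS-scoped closure. Keeping the raw pointer across a suspend is the
    // part claimed safe above, since store ownership isn't yielded while
    // suspended.
    run_handler(store);
}
```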

Wasmtime GitHub notifications bot (Oct 24 2025 at 22:29):

cfallin commented on PR #11921:

Thanks for all your comments so far, @alexcrichton! A few overall thoughts:

Wasmtime GitHub notifications bot (Nov 01 2025 at 17:35):

alexcrichton commented on PR #11921:

OK, I now actually got a chance to sit down and properly read your last comment, @cfallin. My read on it, though, is that once our discussions from the CG meeting are mixed in, most of this PR is likely moot, since the store will come through libcalls primarily. Does that match your understanding? Or will we still want most of this to land?

Wasmtime GitHub notifications bot (Nov 01 2025 at 18:19):

cfallin closed without merge PR #11921.

Wasmtime GitHub notifications bot (Nov 01 2025 at 18:19):

cfallin commented on PR #11921:

Yes, I think we've subsumed the need for this (and also call-injection, for now); in fact, I have a WIP branch to address Pulley support the other way, by doing the equivalent of libcall-based traps (which runs into some complications due to unimplemented opcodes that I'm working through). I'll close this; I was planning on writing up a "meta-issue" for the current thinking on the path to a debug MVP. Thanks!

