cfallin opened PR #11921 from cfallin:wasmtime-debug-callbacks-pulley to bytecodealliance:main:
This is a followup to #11895 where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass, and brings Pulley back to parity with native Cranelift in debug support currently.
This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps on a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap"; so I had to start to build out the actual trap-handling mechanisms. In any case, this will all be needed for followup work soon that will handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.
This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:
A raw
NonNull<dyn VMStore>in theCallThreadState, with a long comment about provenance and mut-borrow exclusivity. This is needed right now to allow the interpreter to invoke the debug event handler, but will soon be needed when injecting hostcalls on signals, because a signal context also has no state available from the Wasm code other than what is in TLS. Hence, we need a way to get the store back from the Wasm when we do something that is "morally a hostcall" at a trapping instruction.I do believe this is sound, or at least close to it if not (please scrutinize carefully!); the basic idea is that the Wasm acts as an opaque blob in the middle, and the pointer comes out of it one way or another (the normal way, as the first arg to a hostcall, or the weird way, via TLS and the CallThreadState during a trap). Exclusive ownership is still clear at any given point and only one
&mutever exists in the current frame at a time. That said, I haven't tested with miri yet.This does require careful thought about the Wasm compilation, too; we need the moral equivalent of a
&mut selfreborrow as-if we were making a hostcall on each trapping instruction. It turns out that we already treat them as memory-fence instructions, so nothing loaded from the store can be moved or cached across them, and I've added a comment now about how this is load-bearing.Updates to
CallThreadState's "exit state", normally set by the exit trampoline, that we now also set when we invoke a debug event handler during a trap context[^1] so thatStore::debug_framesproperly sees the current activation. This is a little more awkward than it could be because we store the tramopline FP, not last Wasm FP, and there is no trampoline frame in this case, so I've added a flag and some conditionals. I'm happy to refactor instead to go (back) to storing the last Wasm FP instead, with the extra load in the exit trampoline to compute that.A whole bunch of plumbing, creating a large but mechanical diff, in the code translator to actually add debug tags on all traps and calls to
raise. It turns out that once I got all of the above working in Pulley, the test disagreed about current Wasm PC between native and Pulley, and Pulley was right; native was getting it wrong because theud2was sunk to the bottom in a cold block and, without tags, we scanned backward to pick up the last Wasm PC in the function. This new plumbing and addition of tags in all the appropriate places fixes that.[^1]: I keep saying "during a trap context" here, but to avoid any
signal-safety scares, note that when this is done for native
signals in a followup PR, we will inject a hostcall by modifying
stack/register state and returning from the actual signal context,
so it really is as-if we did a hostcall from a trapping
instruction.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested wasmtime-compiler-reviewers for a review on PR #11921.
cfallin requested alexcrichton for a review on PR #11921.
cfallin requested wasmtime-core-reviewers for a review on PR #11921.
cfallin updated PR #11921.
cfallin updated PR #11921.
cfallin edited PR #11921:
This is a followup to #11895 where I had disabled a test that failed to emit a debug event for a hostcall-generated trap on a divide-by-zero in Pulley. This PR allows that test to pass, and brings Pulley back to parity with native Cranelift in debug support currently.
This was a bit of a "start to pull the thread and the entire finished mechanism materializes" PR; happy to consider ways to split it up if needed. In short, disabling signal-based traps on a Pulley configuration still relies on Pulley opcodes (e.g., divide) actually trapping, in a way that looks more like a "native ISA trap"; so I had to start to build out the actual trap-handling mechanisms. In any case, this will all be needed for followup work soon that will handle traps on native platforms (redirecting from signals by injecting calls), so this is not a distraction.
This PR includes, ranked in decreasing order of "may scare other Wasmtime maintainers" score:
A raw
NonNull<dyn VMStore>in theCallThreadState, with a long comment about provenance and mut-borrow exclusivity. This is needed right now to allow the interpreter to invoke the debug event handler, but will soon be needed when injecting hostcalls on signals, because a signal context also has no state available from the Wasm code other than what is in TLS. Hence, we need a way to get the store back from the Wasm when we do something that is "morally a hostcall" at a trapping instruction.I do believe this is sound, or at least close to it if not (please scrutinize carefully!); the basic idea is that the Wasm acts as an opaque blob in the middle, and the pointer comes out of it one way or another (the normal way, as the first arg to a hostcall, or the weird way, via TLS and the CallThreadState during a trap). Exclusive ownership is still clear at any given point and only one
&mutever exists in the current frame at a time. That said, I haven't tested with miri yet.This does require careful thought about the Wasm compilation, too; we need the moral equivalent of a
&mut selfreborrow as-if we were making a hostcall on each trapping instruction. It turns out that we already treat them as memory-fence instructions, so nothing loaded from the store can be moved or cached across them, and I've added a comment now about how this is load-bearing.Updates to
CallThreadState's "exit state", normally set by the exit trampoline, that we now also set when we invoke a debug event handler during a trap context[^1] so thatStore::debug_framesproperly sees the current activation. This is a little more awkward than it could be because we store the trampoline FP, not last Wasm FP, and there is no trampoline frame in this case, so I've added a flag and some conditionals. I'm happy to refactor instead to go (back) to storing the last Wasm FP instead, with the extra load in the exit trampoline to compute that.A whole bunch of plumbing, creating a large but mechanical diff, in the code translator to actually add debug tags on all traps and calls to
raise. It turns out that once I got all of the above working in Pulley, the test disagreed about current Wasm PC between native and Pulley, and Pulley was right; native was getting it wrong because theraiselibcall was sunk to the bottom in a cold block and, without tags, we scanned backward to pick up the last Wasm PC in the function. This new plumbing and addition of tags in all the appropriate places fixes that.[^1]: I keep saying "during a trap context" here, but to avoid any
signal-safety scares, note that when this is done for native
signals in a followup PR, we will inject a hostcall by modifying
stack/register state and returning from the actual signal context,
so it really is as-if we did a hostcall from a trapping
instruction.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin updated PR #11921.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Personally I'm a bit wary about how this is an ever-expanding structure with a huge matrix of fields all customized for various purposes. Especially with a whole litany of
last_*fields for all sorts of purposes it's kind of hard to keep it all straight.Especially when I think these fields aren't actually modified from compiled code at all, would it be possible to store this elsewhere and/or infer it from control flow and/or other state? For example instead of
last_wasm_exit_trap_fpcould this store directly inlast_wasm_exit_trampoline_fp? And instead of storinglast_wasm_exit_was_trapcould that be carried around as a paramter in relevant locations instead?
alexcrichton created PR review comment:
It looks like the only use of this is the interpreter bits, and for that the
trapfunctionality already has aVMContextin scope. Would it be possible to use thatVMContextalready there instead of storing this here?
alexcrichton created PR review comment:
To confirm, you see no way this field could be stored in
FuncEnvironmentsomehow? This is a lot of plumbing...
alexcrichton created PR review comment:
One of my worries is that as this state grows and grows more for all sorts of one-off things it becomes more and more difficult to ensure that everything is always kept in sync with everything else. I find it's more robust to reuse existing state and/or minimize state where possible to ensure that everything is exercised more often, where possible at least
cfallin submitted PR review.
cfallin created PR review comment:
We could, potentially; I don't know why it was originally designed this way. I'm happy to put together a separate refactor PR to move
stacksunderenv(at worst there is some borrowing-related reason I haven't seen and I'll run into a roadblock there).
cfallin created PR review comment:
We could get to it through that
VMContextand thenVMStoreContext, but then two things:
- We'd need to add an equivalent field (a backpointer) to the
VMStoreContextto get to thedyn VMStore; so we wouldn't be removing this field, only moving it; and- We need this in followup work (my next PR) outside of an interpreter context, when we inject calls into the runtime on signals and we don't have a vmctx (we know nothing about the register state we've walked in on during the signal).
In the latter case we can still get the
VMStoreContextfrom theCallThreadStatefrom TLS; so the question is just, do we want to add a layer of indirection and put theNonNull<dyn VMStore>there instead (but on the flip side, not have to set it on every Wasm entry). Happy to go either way...
cfallin submitted PR review.
cfallin created PR review comment:
Unfortunately not, as far as I can tell:
The basic problem is that we have a
Store(orStoreContextMut) exposed to the debug callback. That Store fundamentally has to know about the last activation so it can provide a view into those frames (viadebug_frames). The way that we view the stack depends on exactly how the host entry occurred, and a trap is a new kind of host entry -- previously we didn't actually enter all that far into host code, only far enough to do theraiseand return back out on the other side. There's no direct call-parameter dataflow we can add this onto; we aren't calling into the debug handler with some special thing that already has interpreted the debug frames, we're giving it the full Store. In particular, a trapping context means that we interpret PC differently: it points to the trapping instruction, not the instruction-after as in calls. So this logically belongs with the "end of the last activation" state: it's required to interpret the exit PC/FP. Also, because there's no trampoline, reusinglast_wasm_exit_trampoline_fpis the wrong choice I think.I also share your unease with the ugliness here, and I think the one possible cleanup I see is to refactor back to storing the last Wasm FP, not trampoline FP. What do you think?
cfallin submitted PR review.
cfallin edited PR review comment.
cfallin submitted PR review.
cfallin created PR review comment:
Also, regarding storing elsewhere: because it's part of the "last activation" state, it needs to be saved/restored along with other parts of the
VMStoreContextwhen, for example,CallThreadStates are pushed/popped on the activation chain when fibers suspend. That all works with immutable borrows today and directly accesses the unsafe cells in this location; that's why I kept it here and not somewhere else.
cfallin created PR review comment:
OK, I've gone down this road a bit, but I don't think it's any better: the diff is huge (+1446 -938 so far, and I'm only ~a third done), and moreover it doesn't actually improve the plumbing situation here: there are still plenty of places where neither
stacksnorenvwas available where we fundamentally need one or the other to get the debug tags right.
cfallin submitted PR review.
cfallin created PR review comment:
(To give a more principled argument why: we now need some metadata everywhere we could have a trap. We currently have a bunch of places low in the abstraction stack where we only have the Cranelift-level function builder, and some raw SSA values, and we're generating logic that could trap. Previously this was fine because we just added a static trap code indicating the kind of trap, and SourceLocs already track where we are. Now we need more metadata, e.g. for the stack shape. So fundamentally we need to carry that somehow and we have no Wasmtime-specific place to hang it now.)
cfallin submitted PR review.
cfallin updated PR #11921.
cfallin submitted PR review.
cfallin created PR review comment:
I did the refactor to remove "trampoline FP" (in normal case) and "trap FP" (in trap case) bifurcation in favor of one FP in 7fd1f60e7c. We still need the
was_trapstate (and to save/restore it); there's no way to avoid that bit as far as I can tell.
cfallin updated PR #11921.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Thinking more about this, this isn't sound due to the active use of TLS when an async function is invoked (e.g. handling a debug event). That might involve a fiber/thread switch which means that resumption on another thread would break the
tls::withblock.Somehow, I'm not sure how best how, but the invocation of the debug handler needs to be outside of
tls::with. This is also why I tried really hard recently to get all async invocations to "the libcall calls it and nothing else" and this is effectively a regression from that where we're back to the state of "deep in the bowels of the runtime all of a sudden there's a fiber switch".Basically I think this might be worth rethinking how the async interface is implemented in the runtime. For example something should have prevented this construction in the first place via some sort of "need to be in
asynccontext but not safely in anasynccontext".
alexcrichton created PR review comment:
Also, thinking a bit more on this, I think we should explicitly document debugging support as not tier 1 (either 2 or 3, probably 3 while it's in-progress) and possibly consider disabling it by default instead of enabling it by default. Given the pretty large amount of new
unsafeand new primitives being written for it I wouldn't be confident enough myself to guarantee it's "likely CVE-free" until we've had a good deal more time to think about the unsafe primitives, fuzz things, etc.
alexcrichton created PR review comment:
Well, if the goal is to put this field here, then that leads to other questions for me:
- Much of the rest of the state in
CallThreadStateexists because it doesn't have a pointer toStore, so if one was added then it seems like all the state here should be removed in favor of accessing it through the store.- Have you run this through the Miri tests? I think that this may be miri-unsound at a first glance (but I've got about a 70% hit rate on predicting this so not great). The
dyn VMStoreis passed toCallThreadState::newwhich gets a raw pointer from that, but immediately later the store is accessedstore.0.executor()which I believe effectively invalidates the previously acquired mutable pointer.- In general I don't really understand the provenance comment here. My understanding is that the invariant we need to uphold is that (a) when we enter wasm we start with a safe
VMStorereference and (b) when in wasm all usage ofVMStoreis derived from either that raw pointers or other raw pointers originally derived. I don't know why that would limit usage of the store to just the trap handler or why Cranelift's optimizations or such would come into play here.Put another way, if this field exists then I think other store-related fields should all be deleted to avoid having state to keep in sync. Taking this provenance comment as-is means that it's not actually possible to do that because the field can only be used during traps, but I don't think that's accurate which means that I don't think the comment here is accurate.
For Miri-things I think ideally a
CallThreadStatewould have a lifetime parameter which is a phantom borrow of the&'a mut dyn VMStorewhich logically ensures that while theCallThreadStateis active all usage of the store must be derived from the state itself rather than via external borrows.
alexcrichton created PR review comment:
Well the current state here is that we pass in three different "context" style arguments which are basically just plumbed everywhere. I don't really know how best to address it, but it feels not great to pass so many context-style arguments all over the place. I'm not disputing that
FuncTranslationStacksis needed, I'm mostly wondering why it's not possible to reduce all the boilerplate.we have no Wasmtime-specific place to hang it now
How come this isn't
FuncEnvironment? Isn't that Wasmtime-specific?
On one hand I don't want to necessarily add more work to this PR and it's ok to address some othe time. On the other hand though it also feels likely that the next time this happens we'll have 4 context arguments to pass everywhere because for one reason or another it won't fit into the other contexts. Personally I feel like it's worth investigating why so many context arguments are needed here and whether something can be changed to adjust this (as a separate PR, but not an indefinitely-delayed PR to happen in the future).
For example one thing to try would indeed be a fourth context which has pointers/references to all 3 other contexts. That fourth context is then the only context passed everywhere and methods/etc are all updated to be on the fourth context. This to me would decrease the cognitive load where there's still just a single context to actually worry about and the only minor detail is that the constituent parts of said context come from different locations which necessesitates a few wrapper structures.
cfallin created PR review comment:
So the basic issue is that
FuncEnvironmentis not plumbed everywhere; that is why I mentioned that we need some new plumbing in any case. Many of the places that didn't havestacksalso did not have the environment.I plumbed through the bit that we need (
stacks) but if we putstacksin the environment, we now need (i) a huge diff to do that refactor, and (ii) to add the required plumbing anyway. That's all I'm trying to say: I don't think there's a nice clean small diff we can have in this PR, it has to have a lot of churn to get the information we need.I agree refactoring this all would be nice. Ideally not in this PR!
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
I haven't figured out how to run this in miri yet (the furthest I've gotten is the blanket prohibition on Cranelift-compiling code with the error that results in me putting
#[cfg_attr(miri, ignore)]on all my tests, but I haven't put any time into looking at how you've built miri testing elsewhere). I should definitely do that.That said, a few thoughts:
I agree a broader refactor to have just a
dyn VMStorepop out the other side ofCallThreadStatewould actually be ideal, then make this well-tested, well-vetted, and standardize on it. In general it seems like we've plumbed through just enough state for the things that we do need when we get control without a vmctx; doing this right once would be best.I tried to limit this to "during traps" to make the distinction about whether we have a
vmctxpassed in via a normal hostcall route or not. Basically, there is always one way that the root of ownership of the store passes back into the host: either via TLS -> CallThreadState -> VMStoreContext -> raw VMStore pointer (trap route) or via vmctx ->Instance-> store pointer (Instance::enter_store_from_wasmroute for libcalls).Technically I don't see a reason why we couldn't always use the TLS route, except for the obvious one, performance (and then the fiber-suspend one which you raise below, about which more thoughts below...).
All of this hinges on my understanding of provenance which is that (true mut borrow) -> (multiple raw pointers) -> (true mut borrow derived from one of those raw pointers) is safe, i.e., you can pick one of multiple dataflow paths to get your raw pointer, and the others are just invalidated and must not be used. That should be what's going on here.
On Cranelift optimizations: we're accustomed to thinking about Wasm as opaque, and considering only the provenance story on the Rust side, but it's important to think about what provenance means in a general compiler sense and ensuring we're not committing the same sins on the other side. In particular, if we have an implicit hostcall on trap, that's as-if we create a new
&mut Store, so just as provenance exists as a concern on the Rust side because language semantics around aliasing are load-bearing for optimizations, we have alias analysis and optimizations on the Cranelift side too. My point there is that we must ensure we have no cached state loaded from the store across any trapping instruction if we do this implicit hostcall thing; that would be exactly equivalent to an exclusive-mut-borrow violation on the Rust side. The reason this is discussed at all in this context is that we are taking the&mut dyn VMStorenow and using it, so we're creating this limitation on the Cranelift side. One depends on the other.
cfallin submitted PR review.
cfallin created PR review comment:
Hmm. I agree we can address this by pulling the state out of the
tls::withsomehow -- i.e., the raw store pointer. We're not doing any store ownership yielding if we fiber-suspend so there's nothing unsafe with keeping that around across suspends, I think (correct me if I'm wrong?), so I just need to restructure my closure-based helpers?
cfallin commented on PR #11921:
Thanks for all your comments so far, @alexcrichton! A few overall thoughts:
In parallel to this, over in #11930, I have started to realize that injecting implicit hostcalls during traps may not be worth the complexity given variations in signal-handling quirks between platforms (got it working on Linux but each of macOS and Windows need different design bits). If that goes away and we require hostcall-based traps everywhere, then the only remaining "debugger gets control outside of a libcall" situation is Pulley's use of trapping instructions (e.v. my div-by-zero test which this PR enables), and there I'd prefer to refactor to make Pulley use true hostcall-based traps in those cases too.
So if we go that way, we avoid the need to re-enter the runtime without a
vmctxpassed to us the normal way; and all of these ownership/provenance questions go away.That said: I realize I'm fighting a somewhat uphill battle on store provenance, but I do suspect it might be a need elsewhere in the future, for example if we do want to enable signal-based traps in debugging for performance later. At a high level the ownership handoff is dynamically valid; and the overall scheme should be compatible with stacked borrows (one mut borrow -> multiple raw pointers -> one mut borrow derived from one of them). I'm happy to keep talking about this and try for a refactor where we shift to one raw store pointer in the
CallThreadStateand derive other things (unsafe-intrinsics'Tpointer, theVMStoreContext, ...) from that. That would give us a framework with fewer limitations for future work. The trick is finding the right abstraction for a safe API, I think...That also said, it occurs to me that the
Accessormechanism, with its TLS mechanism already vetted, could be an alternate foundation for all of this. (We already know it's the eventual alternate foundation for all of this, but I mean even in a callback world.) Basically, transfer ownership of the debug handler if any to the CallThreadState on entry and take back on exit; then that's all we need to invoke the handler, and it can keep its locally capturedAccess.I don't necessarily want to create a dependency on all of the refactoring you are planning to do around that, though...
alexcrichton commented on PR #11921:
Ok I now actually got a chance to sit down and properly read your last comment @cfallin. My read on it though is that when our discussions from the CG meeting are mixed in it's likely that most of this PR is moot since the store will come through libcalls primarily. Does that match your understanding though? Or will we still want most of this to land?
cfallin closed without merge PR #11921.
cfallin commented on PR #11921:
Yes, I think we've subsumed the need for this (and also call-injection for now); and in fact I have a WIP branch to address Pulley support the other way, by doing the equivalent of libcall-based traps (which runs into some complications due to unimplemented opcodes that I'm working through). I'll close this and I was planning on writing up a "meta-issue" for the current thinking on path to debug MVP. Thanks!
Last updated: Dec 06 2025 at 07:03 UTC