fitzgen commented on issue #1749:
Brief summary of an evolution of this idea that came up from discussion between @cfallin, @aturon, and myself:
- Instead of removing read permissions from a well-known page, Wasmtime removes read and write permissions from the Wasm instance's stack(s). This will cause the Wasm guest to fault the next time it touches the stack.
- We don't need to emit any additional loads in function prologues (as long as the function touches the stack at any point, i.e. is not a leaf function) or loop headers (as long as the loop body touches the stack at some point).
- Otherwise, we emit dead loads from the stack in the necessary places (function prologue and loop headers). Note that this is still an improvement over what was described in the OP, which required chained loads through the vmctx.
- We need to be careful about when we are touching the Wasm stack from host code, for example when capturing a Wasm backtrace, and set a flag somewhere that means "we are running host code" that the signal handler can see and use to cooperate with the host code. In general there are a lot of potential TOCTOU bugs here between checking any flags and unmapping the stack.
- If we unmap the stack and host code execution trips the signal handler, then in the signal handler we can set an interrupt-requested-during-host-code flag in the vmctx or somewhere, remap the stack, and then setcontext to the trapping host code and continue. Every call/return from host code to Wasm would check whether an interrupt was requested during the host code's execution and raise a trap or whatever as appropriate.
Compared to the current epoch system, this gets us faster Wasm execution at the cost of making interruption more expensive. Given that the performance trade-offs are different, it may make sense to support both, rather than replace epochs with this proposed virtual memory system.
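As a rough sketch of the mechanism described above (hypothetical types and names, not Wasmtime's actual API): the interrupting thread flips the protection on the guest stack mapping with mprotect, and host code (or the signal-handler path) restores it before the stack is used again.

```rust
use std::io;

/// Hypothetical handle to a guest stack mapping: `base` is page-aligned and
/// `len` is a multiple of the page size.
struct GuestStack {
    base: *mut libc::c_void,
    len: usize,
}

impl GuestStack {
    /// Request an interrupt: revoke read/write access so that the next time
    /// the guest touches its stack it faults and lands in the signal handler.
    fn request_interrupt(&self) -> io::Result<()> {
        // SAFETY: `base`/`len` describe a live mapping owned by this handle.
        if unsafe { libc::mprotect(self.base, self.len, libc::PROT_NONE) } == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }

    /// Remap the stack (before running host code, or from the handler path)
    /// so execution can continue.
    fn clear_interrupt(&self) -> io::Result<()> {
        let prot = libc::PROT_READ | libc::PROT_WRITE;
        if unsafe { libc::mprotect(self.base, self.len, prot) } == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}
```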
posborne commented on issue #1749:
This question probably stems from a degree of naivety, but I'm wondering why, with appropriate guards in place, it would be preferred to have a trap be caused by accessing stack memory over just having a signal be raised on the thread executing guest code (e.g. using tkill on linux).
In either case, I think we need some synchronization around guest entry/exit to indicate "we are running host code", which would seem like it _might_ be enough to guard against inadvertently having the signal handler go off while in host code (but maybe not).
If that approach was feasible, it seems like it would potentially have a couple benefits:
- We're able to preempt at points outside of when the guest code is touching the stack. I can't think of specific cases where this would be problematic, but I'm far from an expert here and could see this being where problems show up.
- This would remove the need to insert dead loads or similar in leaf code that doesn't touch the stack.
This definitely seems very promising regardless given that I think most production use cases have some need for preemption to enforce timeouts and/or interact better with async schedulers, etc. I'm also curious to see if guest profiling accuracy is able to benefit from this for some workloads.
I briefly reviewed the trap handling code but am still filling in a lot of knowledge gaps.
fitzgen commented on issue #1749:
I'm wondering why, with appropriate guards in place, it would be preferred to have a trap be caused by accessing stack memory over just having a signal be raised on the thread executing guest code (e.g. using tkill on linux).
The tricky bits here are:
- Handling when VM or host code is running, rather than Wasm. We could probably use the same technique described in https://github.com/bytecodealliance/wasmtime/issues/1749#issuecomment-2542264565 though.
- A Wasm guest isn't pinned to a particular thread, and can migrate across different threads between
polls, so we don't automatically know which thread totkill. We could grab the thread id whenever we call into wasm, but that might add unacceptable overhead to our host-to-wasm call path. I don't have a sense of the overheads involved here.But if these hurdles can be overcome, then this sounds pretty promising, since we won't even need compiler changes to implement this, just runtime changes.
cfallin commented on issue #1749:
@posborne that's a great question. In general we have shied away from signal-based solutions because (i) they're not as portable (to Windows at least?), and (ii) signal-handler code is (as I'm sure you know!) very limited in what it can do, and has to be careful about race conditions. That said, the underlying mechanism to catch a page fault per this issue's idea is to catch a SIGSEGV, so perhaps it's not so different -- the main thing is that in the case of a munmap-based approach, we have a deterministic location where we know this trap will occur, while with a signal, it may come in odd places in trampolines, etc.
I suppose the main question to answer would be: do we need signal-masking at any point to avoid race conditions when entering and leaving Wasm code? If so, that probably tips the balance away, as we want close to zero cost in the common (non-interrupting) case and syscalls on entry/exit would be fatal to that.
cfallin commented on issue #1749:
Ah -- thread migration is an interesting case -- one needs some answer for the TOCTOU problem (I look up the tid running a guest; guest calls to host and host does an async suspend, and async executor thread goes off and does something completely different; then I send signal). I suppose signal handler registration is global to the ambient environment, so we would still catch the signal, and we would need to examine TLS and see that we're not active in Wasm (and just ignore the signal probably).
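A hypothetical sketch (Linux-only; all names invented) of the bookkeeping such a scheme would need: record the executing thread's tid and an "in wasm" flag around each guest entry so the interrupter can target tkill, with the caveat that the window between reading these flags and the signal actually landing is exactly the TOCTOU gap discussed here.

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};

/// Hypothetical per-instance execution state shared with the interrupting thread.
struct ExecState {
    in_wasm: AtomicBool, // true only while guest code is on the stack
    tid: AtomicI32,      // tid of the thread last seen running the guest, 0 if none
}

impl ExecState {
    fn enter_wasm(&self) {
        self.tid.store(unsafe { libc::gettid() }, Ordering::SeqCst);
        self.in_wasm.store(true, Ordering::SeqCst);
    }

    fn exit_wasm(&self) {
        self.in_wasm.store(false, Ordering::SeqCst);
        self.tid.store(0, Ordering::SeqCst);
    }

    /// Best-effort interruption: signal the thread we last saw running the guest.
    /// Racy by nature: the guest may already have returned to the host, or the
    /// future may have migrated to another thread, by the time the signal lands,
    /// so the handler must still check the in-wasm state itself.
    fn request_interrupt(&self, signo: libc::c_int) {
        let tid = self.tid.load(Ordering::SeqCst);
        if tid != 0 && self.in_wasm.load(Ordering::SeqCst) {
            unsafe {
                libc::syscall(libc::SYS_tkill, tid, signo);
            }
        }
    }
}
```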
alexcrichton commented on issue #1749:
One thing I can note is that our current signal handler is definitely not async-signal-safe. The signal handler calls test_if_trap, which calls lookup_code, which acquires a global rwlock. This is "mostly ok" for signals today where none of them are asynchronous, and it'd only be problematic if we segfaulted in the host while holding the write lock (which is unlikely). With async signals though this is naturally unsafe since we could get a signal during the write lock for the thread holding the write lock.
I'd also second @cfallin's point about TOCTOU; historically I've never known how to safely construct a list of threads at any one point in time that need interruption. Coupled with the possible overhead of scheduling interruptions and synchronization around interruption, it didn't seem the most appealing to me.
posborne commented on issue #1749:
Thanks for the responses, the reasoning seems solid to me, especially given that the host/guest transition is itself going to be hit pretty heavily in a lot of workloads. I did a little bit of benchmarking with epoch interrupts enabled/disabled and the savings looked very promising (was seeing between ~2-15% improvement depending on the benchmark with many being >10% improvement). Given that we should still see an improvement even if we need to emit dead loads in loop headers I'm still anticipating we could see a very respectable improvement from the optimization based on munmap.
Definitely don't want to add any significant overhead to the host/guest boundary since that's probably more relevant to a lot of real-world workloads than what we typically end up hitting with most synthetic benchmarks primarily focused on guest execution.
erikrose commented on issue #1749:
Would someone assign this to me, please? Thanks!
abrown assigned erikrose to issue #1749.
erikrose commented on issue #1749:
Here is my work in progress. I am approaching it incrementally, first using the simpler "interrupt page" (per Store, i.e. per thread) method and then, once that works end to end, replacing it with a protected stack. It will be interesting to measure the difference between them.
So far, I've added a new epoch-interruption-via-mmu option and got dead loads from the interrupt page generating in function prologues and loop headers. I'm now working on a way to trigger an interrupt (by protecting a page). Finally I'll see about the signal handler.
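Not the shape of the actual patch, but a hypothetical illustration of the per-Store interrupt page variant: the runtime maps one readable page whose address the generated dead loads use, and arms it with mprotect when an interrupt is requested (names invented; a 4 KiB page size is assumed for brevity).

```rust
use std::{io, ptr};

/// Hypothetical per-Store interrupt page. Generated code performs a dead load
/// from `addr()` in function prologues and loop headers; arming the page makes
/// the next such load fault.
struct InterruptPage(*mut libc::c_void);

const PAGE_SIZE: usize = 4096; // real code would query sysconf(_SC_PAGESIZE)

impl InterruptPage {
    fn new() -> io::Result<Self> {
        let page = unsafe {
            libc::mmap(
                ptr::null_mut(),
                PAGE_SIZE,
                libc::PROT_READ,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        if page == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        Ok(Self(page))
    }

    /// Address compiled code loads from; stored where the vmctx can reach it.
    fn addr(&self) -> *const u8 {
        self.0 as *const u8
    }

    /// Arm: the next dead load from this page raises SIGSEGV.
    fn arm(&self) -> io::Result<()> {
        if unsafe { libc::mprotect(self.0, PAGE_SIZE, libc::PROT_NONE) } == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }

    /// Disarm: make the page readable again before resuming guest code.
    fn disarm(&self) -> io::Result<()> {
        if unsafe { libc::mprotect(self.0, PAGE_SIZE, libc::PROT_READ) } == -1 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}
```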
alexcrichton commented on issue #1749:
We had some more discussion about this in a meeting at some point a few moons ago and forgot to update here, but @erikrose before you go too far down the implementation road I wanted to try to recall what we were talking about long ago and write it down here too.
The main concrete point I remember is that signal handlers are extremely tricky and their viability requires carefully balancing various other concerns too. As-is, Wasmtime uses a sigaltstack for signals, which means that it's not possible to implement interrupt checks with virtual memory tricks. An epoch change requires stack-switching off the wasm stack, for example, and that's not possible to do from the shared resource of the sigaltstack in a signal handler. That would, for example, try to later resume to frames on the stack which have possibly been clobbered by other signals.
A possible fix for this is to remove the sigaltstack, but that removes any sort of possible indicator when a stack overflow happens (e.g. as opposed to the "you overflowed the stack" message that's printed out by the Rust standard library). This is a relatively invasive fix though because I've found such a message to be extremely useful historically. This additionally requires total knowledge of the signal handler in the process, so this would likely need to be an unsafe and off-by-default option, because otherwise Wasmtime is designed to play nicely with other signal handlers in the process.
Overall I'd recommend considering the signal handler side of things first rather than last because it's often the trickiest part of the system. Where possible I'd recommend doing anything nontrivial in signal handlers altogether. For example for debugging Chris is working on a Cranelift extension for breakpoints and something like that might be suitable here too. The ideal signal handler, IMO, resumes back to the wasm code, possibly with updated state, so that way the signal handler is entirely out of the picture. For example ideally the interrupt check could be a special CLIF instruction or something like that where metadata indicates how to resume it (e.g. a special stub, special ABI, etc). That way when an interrupt happens due to a fault the wasm code would be resumed, at a different location, and the new location would be the one that actually does the suspension.
erikrose commented on issue #1749:
Thanks, Alex. I'll take some time to meditate over this.
Where possible I'd recommend doing anything nontrivial in signal handlers altogether.
Can I assume you meant "recommend against" there?
alexcrichton commented on issue #1749:
Oh oops, indeed! We've basically had a trend so far of every time we do something nontrivial in a signal handler we end up having to walk it back as much as we can. For example historically we did a longjmp out of a signal handler but that was messy for a whole host of reasons so now signal handlers all return one way or another. Part of this is also the portability of what we do in signal handlers, but I realize that's also less of a concern here
cfallin commented on issue #1749:
To add a bit: in #11930 I worked out how to actually do the dance of "call injection" where, from a signal handler, one modifies the interrupted/paused state to return to some injected trampoline that can then do something else. You'd need something like this to make the unmapped-stack mechanism work: in the handler, remap the stack and update state so that it's as-if a yield hostcall were invoked.
I ended up turning away from that approach for now because of complexity. It's along two main dimensions: first, platform support -- I got this working for Linux but macOS and Windows have mutually incompatible capabilities and requirements during the signal-handling phase (see the bullet points in this comment) that together with ABI variations mean we need something like ~8 assembly stubs and we have to be sure all are tested. Then there are gnarly pointer-provenance questions about getting a vmctx (which I think you need for yield?) for which I think I had a valid answer under tree-borrows, but not stack-borrows. In the end all of this wasn't worth it for debugging -- I'm going to actually patch in calls instead. The "ownership of store / pointer provenance" question is actually the biggest issue with any signal-based scheme IMHO -- we go to somewhat extreme lengths to use TLS and keep raw pointers to only the bits needed for trap handling, but the deeper one gets into the runtime's data structures and need to access them during a signal, the trickier the story becomes.
erikrose commented on issue #1749:
@alexcrichton @cfallin Thank you for your detailed writeups! The way this is slowly agglomerating in my head is something like…
- Keep the sigaltstack.
- My dead loads at the tops of functions and loops become uses of a new "touch-interrupt-page" CLIF instruction instead, which spits out something like this:

    Attempt a load from interrupt page.
    Jump to CONTINUE.
    Yield task.
    CONTINUE:
    …rest of normal wasm code

The signal handler jimmies its return address to land at Yield task instead of the Jump to CONTINUE. This lets the signal handler exit relatively normally, i.e. it's not around anymore when it comes time to yield. The yielding is then done as if it was part of the ordinary flow of wasm code.
Note that, while I've read Chris's linked ticket, I haven't gone through his code yet. Also, I need to understand how yielding is currently implemented (and thus what I'd need to get ahold of to do a yield).
I imagine, as Chris enumerates, that the devil is in the details, though I hope this particular devil can be bargained with. Am I on a somewhat reasonable track, do you think? Thank you again!
cfallin commented on issue #1749:
I'll re-emphasize my takeaway from my adventures linked above, which is that I think an approach that works with signal frames like this has a significant maintenance burden that is hard to justify without really compelling performance numbers.
Some proof-point questions to resolve early before investing a lot more time are:
- What state do you need to do the "epoch yield" hostcall?
- Does this state include anything related to the fiber currently on the stack? (I think the answer is "yes")
- Who owns that fiber? (The answer is somewhere between "the current store" and "random stack frames on-stack", neither of which is accessible from a signal context)
- Can you provide a UB-free ownership transfer story whereby the interrupt at a random point in compiled code safely gets a borrow of those fibers and can do a yield? Can you show this works in miri?
Basically this is the "pointer provenance" question I mention above -- not just a "devil is in the details" to be optimistically charged toward but IMHO a fundamental design issue that prevents this kind of trick from working without a big refactor.
If I were attempting to make this work my first step would be to figure out a way to keep the current store in TLS and be able to get it without the vmctx passed into hostcalls when host code takes control; and make miri happy while doing that. That then enables the rest of this to be possible. The trampoline plumbing and register-state-swapping is comparatively easy (and fun!) -- see my trampolines above.
While we're discussing tradeoffs, I'm curious: what is the current delta you're observing between epoch-based performance and performance with the dead accesses to stack?
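For the first step suggested above, a minimal sketch (invented types; the real Store ownership and provenance story is exactly what needs the careful design and the miri checking): publish a raw pointer to the active store in a thread-local around each wasm activation, so an injected yield stub running on the same thread can find it without vmctx.

```rust
use std::cell::Cell;
use std::ptr;

// Hypothetical stand-in for the engine's store type.
struct Store { /* epoch deadline, fibers, etc. live here in reality */ }

thread_local! {
    // Store driving the current wasm activation on this thread, or null.
    static ACTIVE_STORE: Cell<*mut Store> = Cell::new(ptr::null_mut());
}

/// RAII guard: publish the store for the duration of a wasm activation,
/// restoring the previous value (activations can nest through hostcalls).
struct ActivationGuard {
    prev: *mut Store,
}

impl ActivationGuard {
    fn enter(store: &mut Store) -> Self {
        let prev = ACTIVE_STORE.with(|s| s.replace(store as *mut Store));
        ActivationGuard { prev }
    }
}

impl Drop for ActivationGuard {
    fn drop(&mut self) {
        ACTIVE_STORE.with(|s| s.set(self.prev));
    }
}

/// Called from an injected yield stub: borrow the active store, if any.
/// This is the provenance-sensitive part: the stub must not run while another
/// `&mut Store` borrow is live, which is the open design question above.
unsafe fn with_active_store<R>(f: impl FnOnce(&mut Store) -> R) -> Option<R> {
    let ptr = ACTIVE_STORE.with(|s| s.get());
    if ptr.is_null() {
        None
    } else {
        Some(f(&mut *ptr))
    }
}
```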
alexcrichton commented on issue #1749:
Given enough Cranelift intrinsics I think this wouldn't involve changing any provenance things, which might be worth exploring. For example, doing something similar to the debug handling you're doing, Chris:
- We could have a CLIF instruction which looks like a call with the "patchable" ABI which actually does a load
- If that load segfaults, the signal handler uses module metadata, which it already has, to go from "pc to patchable thing"
- The signal handler uses this metadata to redirect the pc to the known stub using the "patchable" ABI
- ABI-wise we can guarantee that vmctx is in a specific register (e.g. first argument) and we can also guarantee the return address is in a specific register (e.g. second argument)
- This stub would "push" a call frame, then do the normal new_epoch libcall.
I'm not sure how best/feasible it would be to model this all in Cranelift, but that would at least keep the signal handler "clean" (no new state, it returns as usual today) as well as not dealing with provenance issues (no new state in the signal handler, so it's all as-is). There are open questions though, not the least of which is that the current patchable ABI probably wouldn't work for this, since something needs to be clobbered as the return address needs to be stored somewhere. And there's the fact that the ABI needs to set up the call frame on behalf of the caller too.
That's a pretty significant Cranelift extension (to me at least), so I'd agree it'd be good to get some numbers on the possible delta first.
alexcrichton commented on issue #1749:
Expanding on that thought a bit...
- Assuming we can change the patchable ABI to clobber some registers
- Assuming we can add a CLIF instruction that looks like a call but codegens a load
In theory that's all the changes to CLIF necessary. Semantics-wise it's as if the function is called every time according to CLIF, but in actuality it's codegen'd as a load that sometimes "calls" via a signal handler.
The signal handler would resume at an inline assembly stub which would push a call frame then jump to the Cranelift-generated function that has the patchable ABI and does the yield. So the purpose of the inline assembly would be "pretend a call actually got executed" and it would understand the patchable ABI to know that the vmctx is in a particular register, the return address is in a particular register, and probably the address of the "do the yield" patchable-ABI-CLIF-generated-function is in a register too.
Whether that's reasonable to burn 3 registers on every loop is perhaps another question, but that's at least a rough idea to scope down the theoretical Cranelift changes to something more tractable.
posborne commented on issue #1749:
This was a smaller sightglass bench run with just the default suite comparing epoch disabled, enabled in v39, and enabled with Erik's changes (@ 1d94a21e2bd2a52126f6fcad6f29b49c6d3e00e3). At least in its present form, the improvement seems modest/mixed and there's still lots of gap room compared with epoch disabled.
Relatively small sample size and all that -- I can do a more complete run with additional iterations if desired.
<img width="1836" height="422" alt="Image" src="https://github.com/user-attachments/assets/09b46aca-0a4b-4e09-ac50-ca59f9d2168a" />
posborne edited a comment on issue #1749:
EDIT: I missed that there was a new config flag to enable the alternate MMU behavior. Will get new numbers shortly and update.
cfallin commented on issue #1749:
@posborne thanks for posting that data!
So if I'm reading the table correctly, assuming that we want some interruption mechanism, we are choosing between the last two columns; middle column is status-quo today (vanilla epochs) and right column is the instrumentation required for VM tricks. Right column is ~0.5% faster for bz2, ~4.5% slower for pulldown-cmark, and ~1% faster for SpiderMonkey. Is that right?
If that's the case, it doesn't seem worth the big complexity jump for 1% on SpiderMonkey, and the slowdown on pulldown is pretty concerning. Did y'all work out any explanations for this effect? E.g. perf counters showing that the stores are more expensive than the load-mostly-constant-data-compare-and-branch of epochs? (And remind me: why do we need stores rather than loads, if we're going to fully unmap the stack to interrupt?)
posborne edited a comment on issue #1749:
EDIT: I missed that there was a new config flag to enable the alternate MMU behavior. Will get new numbers shortly and update. Numbers here are updated, though check my work on the flags @erikrose
This was a smaller sightglass bench run with just the default suite comparing epoch disabled, enabled, and enabled via mmu on the same commit (@ 1d94a21e2bd2a52126f6fcad6f29b49c6d3e00e3).
Relatively small sample size and all that -- I can do a more complete run with additional iterations if desired.
<img width="2640" height="404" alt="Image" src="https://github.com/user-attachments/assets/32a54454-e1e7-42f7-b618-9f125951050a" />
posborne commented on issue #1749:
I updated my previous comments as I was not using the correct flag to do things via MMU; the delta there does seem much more promising, coming very close to baseline performance, if I gathered results correctly.
posborne commented on issue #1749:
@cfallin I had made a mistake with the earlier numbers: I had missed that @erikrose had added a new flag to hit the code path he added. In addition, I think v39 -> main must have some unrelated improvements that helped the pulldown-cmark benchmark.
I've updated my previously posted snapshot but could use confirmation from @erikrose that correct usage doesn't require specifying both -Wepoch-interruption=y and -Wepoch-interruption-via-mmu=y.
In the earlier example and this one, the baseline displayed is with any form of epoch interrupts disabled to show the absolute best case, with the differences displayed being the statistically significant variation from the baseline.
cfallin commented on issue #1749:
Ah, OK, that's much different data now, and more compelling!
I agree with Alex that a special instruction with register constraints such that we can effectively make the signal a call would be much easier to implement. I think we'd want to scope this feature to Linux as well -- as discussed in the other threads about call injection on signals for debugging, call injection needs to take a slightly different shape for Linux, Windows, and macOS due to mutually incompatible constraints (no sigaltstack on Windows, different thread on macOS).
erikrose commented on issue #1749:
@posborne and I talked off to the side, but, to confirm in-thread: all you need is epoch_interruption_via_mmu. Atm, it and the existing epoch_interruption are orthogonal, but that's not the final state of things, as there should be no reason to use both at once.
Thanks for doing the benchmarks, Paul! On our prod loads, we measured a difference of 15% back in January, so it's exciting (and darn convenient) to see that hold up locally.
erikrose edited a comment on issue #1749:
@posborne and I talked off to the side, but, to confirm in-thread: all you need is epoch_interruption_via_mmu. Atm, it and the existing epoch_interruption are orthogonal, but that's not the final state of things, as there should be no reason to use both at once.
Thanks for doing the benchmarks, Paul! On our prod loads, we measured a difference of 15% back in January (between deadline-based epochs and nothing), so it's exciting (and darn convenient) to see that hold up locally.
erikrose commented on issue #1749:
@cfallin Quoting from https://github.com/bytecodealliance/wasmtime/pull/12101#issuecomment-3598788048 to continue discussion on a live ticket and tie these two together…
@erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further.
Sure thing! Are you thinking to have the embedder call some method on the Store that un-NOPs patchable yield calls?
So I guess the instruction one would actually want for the epochs case is patchable_call_or_load %epoch_yield(v0), v1 with v1 being the page that gets unmapped to cause a SIGSEGV, then in that context do the patch (and there we do have the Store). Happy to sketch how that should look if @erikrose is curious.
I will gratefully accept any sketches you care to render. The high-level view you give seems clear. I certainly have the vmctx and the interrupt page address around when I codegen my current dead loads. I don't yet see how, within the signal handler, to get a ptr to the instructions to un-NOP. On second thought, if we follow ucontext_t.uc_link->uc_stack to find the return address, that's pretty much it, isn't it? I'm not clear on where the signal handler gets the instructions to patch in.
Looking forward to chatting at tomorrow's Cranelift meeting!
cfallin commented on issue #1749:
To summarize what I think is a good approach with the help of a small Cranelift extension, based on discussion in the Cranelift meeting this morning:
- We should have an instruction dead_load_with_context that performs a "dead load" (does not def the result, but does have the side-effect of reading memory), and carries "context", that is, another SSA value placed in a fixed register so that a signal handler knows where to look. That fixed register should not be clobbered, and should live in the first argument register (rdi), probably: this allows compiled code from Wasmtime to, in the common case, use this for vmctx and keep vmctx in the first arg reg almost all of the time. The idea is that we'll need to get ownership of the vmctx and thus Store (via Store::enter_host_from_wasm) in order to do the yield, and absent any other information, we can't get this from the arbitrary register state at the faulting load.
- We should build some signal handler magic that, like my earlier PRs that did "call injection" on signals, overwrites the return address of a signal when we know the signal comes from an epoch-check load (we can add to the module metadata, produced alongside e.g. trap codes, to know this). We will need to save the original return address. Note from my comments on OS platform details around call-injection that we can't simulate a callframe by directly pushing onto the stack; we'll need to stash the true return address somewhere else. That will necessitate probably another fixed reg constraint on the dead_load_with_context to act as a scratch.
- Then on a SIGSEGV at this magic load, we update PC (rip) to point to a stub (see my earlier PRs for examples) that saves all context and invokes an actual hostcall with the vmctx value (similar to the existing epoch-yield hostcall).
The concrete pieces you'll need to build are:
- the Cranelift instruction
- the unmapped-on-interrupt page and pointer to it in vmctx
- new metadata emitted by the Cranelift instruction, and put into compiled-artifact tables, to indicate that this is an interruption-point load
- special logic in the signal handler that, when seeing such PC, updates state to redirect to the stub, saving the original PC
- that stub, saving all register state and invoking a hostcall with the recovered vmctx
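To put the signal-handler piece of this plan into a concrete (hypothetical, Linux/x86-64-only) shape: is_interrupt_point_load, epoch_yield_stub, and the choice of r11 as the scratch register are all invented for illustration; registering the handler and coexisting with Wasmtime's existing trap handling is the part that needs the care discussed earlier in the thread.

```rust
use libc::{c_int, c_void, siginfo_t, ucontext_t, REG_R11, REG_RIP};

// Hypothetical lookup into the compiled-artifact metadata: is this pc one of
// the interruption-point dead loads?
fn is_interrupt_point_load(_pc: usize) -> bool {
    false // placeholder
}

extern "C" {
    // Hypothetical assembly stub that saves register state and invokes the
    // epoch-yield hostcall with the vmctx recovered from its fixed register.
    fn epoch_yield_stub();
}

unsafe extern "C" fn on_sigsegv(_sig: c_int, _info: *mut siginfo_t, ctx: *mut c_void) {
    let uctx = &mut *(ctx as *mut ucontext_t);
    let gregs = &mut uctx.uc_mcontext.gregs;

    let pc = gregs[REG_RIP as usize] as usize;
    if is_interrupt_point_load(pc) {
        // vmctx already sits in its fixed register (e.g. rdi) by construction.
        // Stash the original pc in the scratch register reserved by the
        // dead_load_with_context instruction...
        gregs[REG_R11 as usize] = pc as i64;
        // ...and resume at the stub instead of re-executing the faulting load.
        gregs[REG_RIP as usize] = epoch_yield_stub as usize as i64;
        return;
    }
    // Otherwise fall through to the existing wasm trap handling (not shown).
}
```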
erikrose commented on issue #1749:
@cfallin Clear as a bell; thank you!