Stream: git-wasmtime

Topic: wasmtime / PR #12587 Add `Global::value_ptr()` for cross-...


view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:19):

AlbertMarashi opened PR #12587 from AlbertMarashi:patch-1 to bytecodealliance:main:

Summary

Motivation

There is currently no way to signal a running WebAssembly instance from another thread using a wasm-visible global. The existing Global::set() requires &mut Store,
which is exclusively held by the thread executing the module. This creates a deadlock in the API: you can't mutate the global until execution finishes, but execution won't
finish until the global is mutated.

The concrete use case is cooperative suspend/resume (checkpoint/restore). A WASM transform inserts global.get $flag; br_if $suspend checks at loop heads and after
call sites. The host signals suspension by writing 1 to that flag from a control thread. The module sees the flag, saves its locals to a shadow stack, and returns. On
resume, the host clears the flag and re-calls the function, which restores locals and continues from where it left off.
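[Editor's note: the suspend/resume pattern described above can be sketched as a host-language analog in plain Rust, with an atomic standing in for the wasm global, a `Vec` standing in for the shadow stack, and `sum_to`/`Outcome` as illustrative names, not anything from the PR itself:]

```rust
use std::sync::atomic::{AtomicI32, Ordering};

// The wasm-visible "suspending" flag, modeled here as a host-side atomic.
static SUSPENDING: AtomicI32 = AtomicI32::new(0);

enum Outcome {
    Done(u64),
    Suspended, // live state was saved to the shadow stack
}

// What the instrumented function does: check the flag at each loop head;
// on suspend, spill live locals (here the loop counter and accumulator)
// to a shadow stack and return. On re-call, restore them and continue.
fn sum_to(n: u64, shadow: &mut Vec<u64>) -> Outcome {
    let (mut i, mut acc) = if shadow.len() == 2 {
        let acc = shadow.pop().unwrap();
        let i = shadow.pop().unwrap();
        (i, acc) // resume: reload locals from the shadow stack
    } else {
        (0, 0) // fresh call
    };
    while i < n {
        // The `global.get $flag; br_if $suspend` check at the loop head.
        if SUSPENDING.load(Ordering::Relaxed) != 0 {
            shadow.push(i);
            shadow.push(acc);
            return Outcome::Suspended;
        }
        acc += i;
        i += 1;
    }
    Outcome::Done(acc)
}
```

The host signals by storing 1 to `SUSPENDING`, observes `Outcome::Suspended`, later clears the flag and re-calls `sum_to` with the same shadow stack to finish.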

Wasmtime's existing interruption mechanisms don't solve this:

What's needed is a way for the module itself to observe a host-written flag and choose to suspend, preserving its own stack. That requires the host to write to a global
that the running module can read — without holding &mut Store.

Design

value_ptr() is deliberately narrow:

Example

let flag = instance.get_global(&mut store, "suspending").unwrap();
let ptr = unsafe { flag.value_ptr(&store) } as *mut i32;

// Move store to worker thread
std::thread::spawn(move || {
    run.call(&mut store, ()).unwrap();
});

// Signal from control thread
unsafe { std::ptr::write_volatile(ptr, 1); }

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:19):

AlbertMarashi requested fitzgen for a review on PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:19):

AlbertMarashi requested wasmtime-core-reviewers for a review on PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:23):

AlbertMarashi edited PR #12587:

Summary

Motivation

There is currently no way to signal a running WebAssembly instance from another thread using a wasm-visible global. The existing Global::set() requires &mut Store, which is exclusively held by the thread executing the module. This creates a deadlock in the API: you can't mutate the global until execution finishes, but execution won't finish until the global is mutated.

The concrete use case is cooperative suspend/resume (checkpoint/restore). A WASM transform inserts global.get $flag; br_if $suspend checks at loop heads and after call sites. The host signals suspension by writing 1 to that flag from a control thread. The module sees the flag, saves its locals to a shadow stack, and returns. On resume, the host clears the flag and re-calls the function, which restores locals and continues from where it left off.

Wasmtime's existing interruption mechanisms don't solve this:

What's needed is a way for the module itself to observe a host-written flag and choose to suspend, preserving its own stack. That requires the host to write to a global that the running module can read — without holding &mut Store.

Design

value_ptr() is deliberately narrow:

Example

let flag = instance.get_global(&mut store, "suspending").unwrap();
let ptr = unsafe { flag.value_ptr(&store) } as *mut i32;

// Move store to worker thread
std::thread::spawn(move || {
    run.call(&mut store, ()).unwrap();
});

// Signal from control thread
unsafe { std::ptr::write_volatile(ptr, 1); }

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:28):

AlbertMarashi updated PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 06:33):

AlbertMarashi updated PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 07:51):

bjorn3 commented on PR #12587:

I don't think this is sound. Globals are accessed through non-atomic loads and stores and thus a concurrent modification is UB.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 08:45):

AlbertMarashi commented on PR #12587:

I don't think this is sound. Globals are accessed through non-atomic loads and stores and thus a concurrent modification is UB.

Aren't all writes to i32s atomic?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 09:12):

bjorn3 commented on PR #12587:

No. In Rust, a store that is not explicitly marked as atomic, concurrent with any other load or store (atomic or not) on the same memory, is a data race, which is UB. And a store that is marked as atomic can only be done concurrently on the same memory with other atomic operations.

The compiler is allowed, for example, to assume that if it duplicates a non-atomic load without any operation in between on the same thread that could modify the memory, both loads will result in the same value. If another thread were allowed to do a store between both operations, that would be UB. Similarly, the compiler is allowed to fold successive non-atomic loads without stores in between. So, for example, while(!done) {} may optimize to an infinite loop: it is a non-atomic load, so no concurrent modification is possible without UB.

This is unlike wasm, where data races in linear memory are not UB when that linear memory is marked as shared (if not marked as shared, the wasm runtime is required to deny concurrent accesses). However, globals are not stored in linear memory, and current wasm versions do not support marking globals as shared.
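[Editor's note: the distinction bjorn3 draws can be shown in safe Rust. The sound version of a cross-thread flag uses atomic loads and stores throughout, so the compiler must re-read the flag on every iteration; this is a minimal stdlib sketch, not wasmtime code:]

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Data-race-free cross-thread signaling: every access to the flag is an
// atomic load or store, so the compiler may not cache or fold the reads
// the way it can for a plain (non-atomic) `bool`.
fn wait_for_signal(flag: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !flag.load(Ordering::Acquire) {
            std::hint::spin_loop(); // the flag is re-read every iteration
        }
    })
}

fn send_signal(flag: &AtomicBool) {
    flag.store(true, Ordering::Release); // atomic store: no data race
}
```

A `while !done {}` spin over a plain `bool` written by another thread is the UB case described above; the atomic version is the only form the compiler is required to treat as observable from other threads.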

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 09:13):

bjorn3 edited a comment on PR #12587:

No. In Rust, a store that is not explicitly marked as atomic, concurrent with any other load or store (atomic or not) on the same memory, is a data race, which is UB. And a store that is marked as atomic can only be done concurrently on the same memory with other atomic operations.

The compiler is allowed, for example, to assume that if it duplicates a non-atomic load without any operation in between on the same thread that could modify the memory, both loads will result in the same value. If another thread were allowed to do a store between both operations, that would be UB. Similarly, the compiler is allowed to fold successive non-atomic loads without stores in between. So, for example, while(!done) {} may optimize to an infinite loop: it is a non-atomic load, so no concurrent modification is possible without UB.

This is unlike wasm, where data races in linear memory are not UB when that linear memory is marked as shared (if not marked as shared, the wasm runtime is required to deny concurrent accesses). However, globals are not stored in linear memory, and current wasm versions do not support marking globals as shared. Also, for example, the Pulley interpreter unconditionally uses non-atomic accesses in Rust code for globals and is thus subject to the Rust data race rules.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 09:55):

AlbertMarashi commented on PR #12587:

Interesting...

So, why don't we have global atomics in that case?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 09:57):

AlbertMarashi edited a comment on PR #12587:

Interesting... I didn't know that.

So, why don't we have global atomics in that case?

However, judging by how globals are currently implemented in wasmtime, it appears that these globals are loaded from their memory address each time, no?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 10:03):

AlbertMarashi commented on PR #12587:

As it stands, there appears to be no way to communicate with code running inside a module once it has started, except by doing an unsafe data mutation at a given index in the module's memory from another thread.

This likely has the same types of issues as you describe, so I am not particularly sure what the right approach is here. It might be valuable for me to have a look at how the increment_epoch functionality in the engine allows a running module to be stopped from another thread. I will report back.
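[Editor's note: the epoch mechanism referenced here boils down to one shared atomic counter that compiled code compares against a per-store deadline at loop heads. A simplified, pure-Rust analog (not wasmtime's actual implementation; `increment_epoch` and `epoch_check` are illustrative names) might look like:]

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One engine-wide epoch counter, shared across threads.
static EPOCH: AtomicU64 = AtomicU64::new(0);

/// Host side: callable from any thread (e.g. a timer thread).
fn increment_epoch() {
    EPOCH.fetch_add(1, Ordering::AcqRel);
}

/// Guest side: the check injected at loop heads and function entries.
/// Returns true if execution should yield back to the host.
fn epoch_check(deadline: u64) -> bool {
    EPOCH.load(Ordering::Acquire) > deadline
}
```

Because every access to the counter is atomic, the host can bump it from another thread without the data-race UB discussed above, which is why this works where a plain wasm global does not.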

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 10:13):

bjorn3 commented on PR #12587:

So, why don't we have global atomics in that case?

Because wasm doesn't yet support sharing an instance between threads. https://github.com/webAssembly/shared-everything-threads is still a draft.

However, judging by how globals are currently implemented in wasmtime, it appears that these globals are loaded from their memory address each time, no?

Using non-atomic loads, so the compiler is allowed to assume that the global won't change. Maybe Cranelift won't miscompile it, but for Pulley we are at the mercy of the compiler that compiled the interpreter, which does consider it UB: https://github.com/bytecodealliance/wasmtime/blob/92f1829e394f1921a5872f068286073b95d242fe/pulley/src/interp.rs#L1202 It would be possible to allow concurrent accesses to globals, but it would require auditing all places where globals are accessed to ensure atomic accesses are used. And it would technically be an extension of the wasm specification, which, as I understand it, Wasmtime prefers not to do.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 10:20):

AlbertMarashi commented on PR #12587:

So, it appears that wasmtime currently uses these Atomic numbers to track the epoch

https://github.com/bytecodealliance/wasmtime/blob/1adc57ea2768f78074879082fdd1ffe4fd50df62/crates/wasmtime/src/engine.rs#L820-L822

And here

https://github.com/bytecodealliance/wasmtime/blob/92f1829e394f1921a5872f068286073b95d242fe/crates/wasmtime/src/runtime/vm/instance.rs#L500-L504

So, I guess essentially what I am asking for, is to have an external API to do what wasmtime is currently already doing internally for epoch-based suspensions, as I am working on a crate that will allow for snapshotting instances and arbitrary resumability.

Would this require our globals to become atomic? A new global atomic type? etc.

What are your thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 13:03):

github-actions[bot] added the label wasmtime:api on PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 13 2026 at 15:27):

cfallin commented on PR #12587:

@AlbertMarashi bjorn3 is correct in all of the above: what you have built here is fundamentally at odds with the thread safety of core Wasmtime. A Store uniquely owns all storage used by the Wasm instance while running. We cannot provide the API as written in this PR, because it is incorrect.

Two ways out that I can think of:

Taking a step back, though: what are you actually trying to achieve? When you say suspend/resume do you mean that the Wasm guest saves its state somewhere and returns? Is the goal to timeslice multiple Wasm invocations into a single instance?

If so, you will be interested in the new async component model features (see e.g. Store::run_concurrent and Func::call_concurrent), as well as the in-progress component model cooperative threading work. That work has done the hard part of thinking through state ownership handoffs that you're plowing through/disregarding here.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 14 2026 at 07:09):

AlbertMarashi commented on PR #12587:

Taking a step back, though: what are you actually trying to achieve? When you say suspend/resume do you mean that the Wasm guest saves its state somewhere and returns? Is the goal to timeslice multiple Wasm invocations into a single instance?

If so, you will be interested in the new async component model features (see e.g. Store::run_concurrent and Func::call_concurrent), as well as the in-progress component model cooperative threading work. That work has done the hard part of thinking through state ownership handoffs that you're plowing through/disregarding here.

Can you tell me more about this?

How does it address things that may never yield back to the host, such as infinite loops or exponential function bombs?


To clarify, what I am attempting to achieve is near native-speed transformation of WASM modules to support arbitrary suspend + snapshot + resume capabilities for WASM instances.

The business use case is a generic cloud execution platform that supports "durable" persistent functions/instances that have the ability for modules to suspend their execution for unbounded amounts of time.

This requires us to have a way to serialize all of the module state into a data file somewhere, to be later resumed by our orchestrator when a new request or event is triggered.

This requires us to be able to:

  1. Augment WASM code (or compiled output) to inject fast-check test + bnz-like instructions inside of their code at ideally either function call/loop boundaries, or even at the instruction level, if possible.
  2. Snapshot, and serialize the instance state once suspended, which requires the ability for us to serialize the stack trace, linear memory, module code, and other resources and objects.
  3. To reload and deserialize the instance state in a consistent and deterministic manner, ready to continue executing the code that we effectively left off at.

Note: Not all resources of course will be able to be fully serialized. (e.g. TCP connections / websocket connection) - however we intend for our host to maintain these types of connections in the background whilst the module is "sleeping" until a new packet/event comes in.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 14 2026 at 22:41):

alexcrichton commented on PR #12587:

In addition to the Wasmtime-specific thread-safety properties, I think that this problem can also be viewed from a spec-level of "this isn't possible to do in wasm right now".

Let's say that a wasm module has a big call stack which bottoms out in an infinite loop. If I understand this PR correctly, @AlbertMarashi, what you're thinking of doing is that this infinite loop would be instrumented (externally, via a wasm->wasm transformation) to have a global.get at the loop header which spills state and then returns back down the stack, triggering everything else to spill state too. Semantically what this would look like to wasm, however, is that the value of the global changes between iterations of the loop without wasm doing anything (e.g. no calls to the host, no mutations of the global, nothing). My understanding is that this is a violation of WebAssembly semantics and would be spec-noncompliant behavior were Wasmtime to allow it.

In theory the best-fitting feature here you want is a shared global. This is part of the shared-everything-threads proposal that bjorn3 mentioned and it is not yet implemented in Wasmtime. This would require atomic loads/stores to the global and would correctly model the ability for external actors to mutate the global during wasm execution. The next-best-fitting feature is what @cfallin mentioned using shared memory. This is a somewhat heavyweight feature to use here since you'd need a full 64k of memory just for this one global, but it would work because the wasm would read a byte in memory to see if it should return and the host would mutate that byte when it wanted to inject a yield.

Can you tell me more about this?

How does it address things that may never yield back to the host, such as infinite loops or exponential function bombs?

Wasmtime's support for async-invoking wasm is documented here. In short with either epochs or fuel we force wasm to time-slice itself during infinite loops and exponential function bombs. It works very similarly to what you're thinking, we inject checks in loop headers and function headers.

What Wasmtime doesn't support, however, is mutation of the store while WebAssembly is suspended or time-sliced. Wasmtime also doesn't support serializing this state to get resumed later on.


All that's to say: I think your best path forward right now is the same wasm->wasm transformation you have today to inject instrumentation. Instead of using a global to signal "please spill and suspend" you would instead use a shared memory and some byte within that shared memory. That should all work on Wasmtime as-is today and require no Wasmtime modifications.
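[Editor's note: the suggested shared-memory approach might be sketched as follows, with a plain buffer of atomics standing in for the wasm shared linear memory. `SharedFlag` and its methods are illustrative stand-ins; a real version would use wasmtime's shared memory support and an atomic load in the instrumented guest code:]

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

// One byte at a fixed offset inside a "shared memory", pollable by the
// guest at loop heads and writable by the host from any thread.
struct SharedFlag {
    memory: Arc<[AtomicU8]>,
    offset: usize,
}

impl SharedFlag {
    fn new(size: usize, offset: usize) -> Self {
        let memory: Arc<[AtomicU8]> = (0..size).map(|_| AtomicU8::new(0)).collect();
        SharedFlag { memory, offset }
    }

    /// Host side, any thread: request that the guest spill and suspend.
    fn request_suspend(&self) {
        self.memory[self.offset].store(1, Ordering::Release);
    }

    /// Guest side: the injected check at loop heads and after call sites.
    fn should_suspend(&self) -> bool {
        self.memory[self.offset].load(Ordering::Acquire) != 0
    }
}
```

Because shared memories are the one place current wasm permits concurrent access, this gets the cross-thread flag the PR wanted without any runtime modification.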

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2026 at 11:19):

AlbertMarashi commented on PR #12587:

That makes sense.

Should this issue remain open in case a future proposal for atomic globals is implemented in WASM?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2026 at 17:45):

alexcrichton closed without merge PR #12587.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2026 at 17:45):

alexcrichton commented on PR #12587:

I'll close this in favor of https://github.com/bytecodealliance/wasmtime/issues/9466 which is loosely the tracking issue for shared-everything-threads which would include atomic globals.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 17 2026 at 18:32):

fitzgen commented on PR #12587:

All that's to say: I think your best path forward right now is the same wasm->wasm transformation you have today to inject instrumentation. Instead of using a global to signal "please spill and suspend" you would instead use a shared memory and some byte within that shared memory. That should all work on Wasmtime as-is today and require no Wasmtime modifications.

Agreed, although I would say that you could potentially represent the interrupt check as a call to an imported component function and use compile-time builtins to lower that to the tight code you are chasing. We would need to add unsafe intrinsics for atomic loads, but that seems pretty reasonable to me.


The final thing I would add is that your Wasm-to-Wasm transformation will need to unwind and rewind the stack (if you aren't relying on your Wasm programs cooperating with this interruption and state-saving themselves), which is basically the same thing that binaryen's asyncify is doing, so it is worth looking into their implementation for inspiration, if not something you can reuse or fork. Also look into continuation-passing style transforms, which enable similar things. Both are going to add execution overhead compared to the original, uninstrumented Wasm program, however. That is pretty inescapable.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 19 2026 at 01:25):

AlbertMarashi commented on PR #12587:

which is basically the same thing that binaryen's asyncify is doing

That's correct; in fact, that's pretty much the approach I was trying to implement before running into this roadblock.

Both are going to add execution overheads compared to the original, uninstrumented Wasm program, however. That is pretty inescapable

This is largely true; however, I did come up with a potential solution, which I am currently experimenting with, that provides a zero-cost approach to suspendability/resumability, with support for universal serializable instance snapshots via cross-thread signalling/interrupts.

The idea behind that is essentially this:

  1. Leave the compiled code as-is.
  2. At compile time, track compilation information and program metadata to construct side tables that provide us the necessary information to perform arbitrary resumability at any WASM PC.
  3. For native code, this would involve tracking live variable state and logic to convert register state at native PCs into a more universal VM-like program stack.
  4. When an interrupt occurs (e.g. either by page table fault, or by means of thread kill/interrupt), we'd receive the register state of the running instance. From here, we map register state/locals into a stack-based representation of state.
  5. Next, we could snapshot the module, serialize or persist it, or alternatively proceed to the next step (Resume)
  6. Resume would be the inverse of suspending, and would involve reading the program's stack to reload live values back into their respective registers as expected by the native code, and proceed instance execution from where we left off (in ideally the same identical state as before)

This approach keeps the hot code execution path free from code augmentations and checks, and relies on the CPU's native capabilities to stop program execution. The cold path (suspend) would occur at far less frequent intervals (e.g. time-based instance scheduling), so the cost of suspending, serializing and resuming modules is amortized over the main execution time.

The only extra cost would be the memory/disk requirements to store the dense side tables that give our engine the necessary information to map registers to stacks at arbitrary PC points - which may potentially double the generated code size (although all of this could live in on-disk/mmap'd code files, given how infrequent and random the accesses would be).


I've dropped this wasmtime zulip conversation chat log here for anyone that finds this in the future.

#general > WASM Snapshotting and Resumability

view this post on Zulip Wasmtime GitHub notifications bot (Feb 19 2026 at 04:26):

cfallin commented on PR #12587:

@AlbertMarashi you're correct that having the ability to map from native machine state to Wasm VM state (locals and operand stack), and then from Wasm VM state back to native machine code, would let you take a portable snapshot.

However it's an enormous undertaking from a compiler and runtime perspective:

I think you could approach this in a somewhat tractable way by starting from what I did for debug instrumentation, and going the other way for resume -- always loading values back from the stackslots after every interruption point; together with something to ensure that you never have other live values across these points. That's more or less what I describe in the debugging RFC v2 here, around "As a final interesting note: if we ever want to implement on-stack replacement (OSR) ...". But note that the instrumentation approach is decidedly not zero-overhead: it is something like a 2x slowdown. So it's great for debugging, but not something you'd deploy transparently.

tl;dr: "construct some side tables so I can simply read out and reconstitute the native register state" has these "side tables" doing a lot of heavy lifting in a way that requires fundamental changes to the compiler, because you need a fully accurate bijection from Wasm state to register state, with no register left out and all native address dependencies accounted for.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 20 2026 at 04:11):

AlbertMarashi commented on PR #12587:

@cfallin thank you very much for your detailed thoughts and perspectives, they provide an invaluable perspective to many of the same questions and challenges I've pondered, and I am largely on the same page as you with many things.

One note:

instrumentation approach is decidedly not zero-overhead

In my experiments (a compute-heavy recursive fibonacci function), I manually hand-augmented some of the binary code; the idea was to effectively add a test + bnz check at each suspend point.

Thanks to CPU branch prediction, the cost of these checks was measured and estimated to be around ~3%-15%. The cost in code size would also be in a similar range (a test + bnz at each natural suspend/branch point, e.g. loops and function calls).

However, my instrumentation-free approach would be 0% in the hot path, and there would only be a memory/disk and compile cost associated with the dense side tables that would be needed to support conversion of program state into a canonical/universal stack representation

That being said, it is, without a doubt, no easy undertaking (major compiler changes, etc. would all be involved).

I may share my results in the future as I proceed with my own WASM runtime engine implementation and approach :laughing:

view this post on Zulip Wasmtime GitHub notifications bot (Feb 20 2026 at 04:24):

cfallin commented on PR #12587:

The instrumentation overhead I was referring to is the overhead of storing Wasm VM-level values to stackslots, to allow for a perfect bijection between native code-level state and Wasm VM state. The 2x overhead is an actual measured number from my implementation that is in-tree under the guest-debug compile flag.

If you assume that you can have the overhead of only a test+branch, then you run into all of the issues I mentioned: I don't think you'll be able to construct a perfect bijection to allow state resume without essentially rewriting all of Cranelift's mid-end and back-end to account for "recovery instructions". I'd encourage you to read about OSR and how it's implemented in e.g. JavaScript JITs: the problem of extracting VM-level state from one version of a compiled function and injecting it into another (presumably higher-tier-compiled) version of the function is essentially the same as what you're trying to solve. I don't want to discourage you so much as give you an accurate view of the work involved: this is probably an engineer-year of effort to get right, for someone who already knows Cranelift. Happy to see what you work out, of course -- best of luck.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 20 2026 at 08:55):

AlbertMarashi commented on PR #12587:

Ah yes, I get you now, and yes, thank you a lot.

The 2x overhead

Yes, I suspect that this will be the only cost consideration with my approach - however, I think that this extra memory cost might not be as real as one might imagine, if we instead memory map the "side table"/bijection data to disk, given how infrequently and sparsely it might actually be accessed.

The approach that I currently opted into is to actually just use a custom stack implementation that matches more closely to something like a universal/canonical WASM stack as opposed to using a more native approach.

My first goal in my JIT engine is to make instructions operate through stack-based (memory read/write) transformations similar to what an interpreted WASM VM might do, so that I don't have to worry a whole lot about register allocations - with my hope being that the CPU's L1 cache would reduce a lot of the costs associated to memory reads and writes.

I expect this would likely involve a significant slowdown, but honestly, for proof-of-concept and business-need requirements, something between a 2-5x slowdown is totally acceptable for the benefits that snapshotting and persistence offer me.

Additional optimizations could be made in my code to fuse various kinds of WASM instruction operations together for the varying archs.

Anyways, thanks again for your feedback, I will share my results down the line if successful.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 20 2026 at 08:55):

AlbertMarashi edited a comment on PR #12587:

Ah yes, I get you now, and yes, thank you a lot.

The 2x overhead

Yes, I suspect that this will be the only cost consideration with my approach - however, I think that this extra memory cost might not be as real as one might imagine, if we instead memory map the "side table"/bijection data to disk, given how infrequently and sparsely it might actually be accessed.

The approach that I currently opted into is to actually just use a custom stack implementation that matches more closely to something like a universal/canonical WASM stack as opposed to using a more native approach.

My first goal in my JIT engine is to make instructions operate through stack-based (memory read/write) transformations similar to what an interpreted WASM VM might do, so that I don't have to worry a whole lot about register allocations - with my hope being that the CPU's L1 cache would reduce a lot of the costs associated to memory reads and writes.

I expect this would likely result in a significant slowdown, but honestly, for proof-of-concept and business-need requirements, something between a 2-5x slowdown is totally acceptable for the benefits that snapshotting and persistence offer me.

Additional optimizations could be made in my code to fuse various kinds of WASM instruction operations together for the varying archs.

Anyways, thanks again for your feedback, I will share my results down the line if successful.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 20 2026 at 08:56):

AlbertMarashi edited a comment on PR #12587:

Ah yes, I get you now, and yes, thank you a lot.

The 2x overhead

Yes, I suspect that this will be the only cost consideration with my approach - however, I think that this extra memory cost might not be as real as one might imagine, if we instead memory map the "side table"/bijection data to disk, given how infrequently and sparsely it might actually be accessed.

The approach that I currently opted into is to actually just use a custom stack implementation that matches more closely to something like a universal/canonical WASM stack as opposed to using a more native approach.

My first goal in my JIT engine is to make instructions operate through stack-based (memory read/write) transformations similar to what an interpreted WASM VM might do, so that I don't have to worry a whole lot about register allocations - with my hope being that the CPU's L1 cache would reduce a lot of the costs associated to memory reads and writes.

I expect this would likely result in a significant slowdown, but honestly, for proof-of-concept and business-need requirements, something between a 2-5x slowdown is totally acceptable for the benefits that snapshotting and persistence offer me.

Additional optimizations could be made in my code to fuse various kinds of WASM instruction operations together for the varying archs.

Anyways, thanks again for your feedback, I will share my results down the line if successful.


Last updated: Feb 24 2026 at 04:36 UTC