Stream: git-wasmtime

Topic: wasmtime / issue #12069 Handle OOM in the runtime


view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 20:07):

fitzgen opened issue #12069:

We have been discussing this in a couple recent Wasmtime meetings[^0] and on Zulip and I figured it was time to centralize discussion in a tracking issue.

[^0]: See https://github.com/bytecodealliance/meetings/blob/main/wasmtime/2025/wasmtime-10-23.md and https://github.com/bytecodealliance/meetings/blob/main/wasmtime/2025/wasmtime-11-20.md

What does handling OOM mean in this case? It means turning allocation failure into an Err(...) return and ultimately propagating that up to the Wasmtime embedder. It may even involve poisoning various data structures, maybe up to a whole store, if necessary, but we haven't fully fleshed out the details yet. That will happen in discussions on this issue and in various PRs during implementation.
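For concreteness, here is a minimal sketch of what "allocation failure becomes an Err(...)" can look like in Rust, using the standard library's fallible Vec::try_reserve API. The helper name and error plumbing are purely illustrative — this is not Wasmtime's actual internal code, just the general shape of the technique:

```rust
use std::collections::TryReserveError;

// Hypothetical helper (not a real Wasmtime API): grow a table
// fallibly instead of aborting the process on OOM.
fn push_entry(table: &mut Vec<usize>, entry: usize) -> Result<(), TryReserveError> {
    // try_reserve returns Err on allocation failure instead of
    // calling handle_alloc_error (which aborts).
    table.try_reserve(1)?;
    table.push(entry); // cannot allocate: capacity was reserved above
    Ok(())
}

fn main() {
    let mut table = Vec::new();
    push_entry(&mut table, 0xdead_beef).expect("allocation failed");
    assert_eq!(table.len(), 1);
}
```

An embedder-facing API built this way would surface the TryReserveError (or some wrapper around it) from store- and instance-creation methods rather than aborting.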

Various, unordered sketches of things that will be involved:

We will initially focus on supporting the following code paths:

Basically, everything that is supported in our no-std/pulley builds now: a basic runtime without the compiler, that can only run pre-compiled Wasm. We will not initially support async or the pooling allocator either, for example. I have vague ideas about how we might be able to refactor the pooling allocator for greater flexibility and enable its use in no-std / no-virtual-memory environments, but that is a bit orthogonal.

Eventually we will want to support async Wasm, yielding on out-of-fuel, ..., and the component model's async functionality. That is going to be a larger project on top of this already large project, so I'm going to delay talking about how we will cross that bridge until we get closer to it.

In practice, I expect that we will start with the OOM testing/fuzzing, create something very simple that fails immediately, and land that as "expected to fail". Then we can get that passing, which will be quite a bunch of work for this first iteration. Then we can remove the failure expectation. Then we can do a little bit more stuff inside the OOM testing/fuzzing and reveal new places we need to fix, and then we can fix those. We can continue this process until things are starting to look more and more complete. At some point we will add the clippy lints, initially to smaller modules and eventually to bigger regions of code. But the testing can be the forcing function for what area of code we add OOM handling to each step of the way.
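The issue does not name the specific clippy lints that would enforce this; as a hedged sketch of one existing mechanism, clippy's `disallowed_methods` lint can ban infallible-allocation calls and be enabled module by module, which matches the "smaller modules first" rollout described above:

```rust
// Sketch only: the issue does not specify which lints will be used.
// Clippy's `disallowed_methods` lint reads a clippy.toml such as:
//
//   disallowed-methods = [
//       { path = "std::vec::Vec::push", reason = "reserve fallibly with try_reserve first" },
//   ]
//
// and can then be turned on for a given module or crate:
#![warn(clippy::disallowed_methods)]

fn main() {}
```

This is a configuration sketch; the real lint set (and whether custom lints are needed beyond what clippy ships) is exactly what the incremental process above would determine.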

The best way to dynamically test/fuzz OOM handling that I know of is the approach taken by SpiderMonkey's oomTest() helper: run a piece of code (potentially written by humans or generated by a fuzzer) with a special allocator that will return null on the first allocation made and check that the code didn't fail to handle the OOM, then run that code again but failing on the second allocation, then the third, etc... up to your time/compute budget. Starting by building this infrastructure is my rough plan. I've done a little bit of digging for other approaches to ensuring that your OOM-handling is correct, and I haven't really found anything, just people arguing about whether you should even check for null returns from malloc or not, which is not very helpful. That is to say, if anyone has any other ideas or knows of any other prior art here, I'd love to hear about it!
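A self-contained Rust sketch of the oomTest() idea described above: a global allocator that fails the N'th allocation, driven in a loop so each injected failure point is exercised once. All names here (FailingAlloc, build_buffer, the counters) are illustrative, not SpiderMonkey's or Wasmtime's actual infrastructure, and the code under test must use fallible allocation (try_reserve) so the injected null surfaces as an Err rather than an abort:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Allocator that returns null on the FAIL_AT'th allocation.
struct FailingAlloc;

static COUNTER: AtomicUsize = AtomicUsize::new(0);
static FAIL_AT: AtomicUsize = AtomicUsize::new(usize::MAX);

unsafe impl GlobalAlloc for FailingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let n = COUNTER.fetch_add(1, Ordering::SeqCst);
        if n == FAIL_AT.load(Ordering::SeqCst) {
            return std::ptr::null_mut(); // simulate OOM
        }
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static ALLOC: FailingAlloc = FailingAlloc;

// Code under test: uses fallible allocation, so an injected OOM
// comes back as Err instead of aborting the process.
fn build_buffer(len: usize) -> Result<Vec<u8>, std::collections::TryReserveError> {
    let mut v = Vec::new();
    v.try_reserve(len)?;
    v.resize(len, 0);
    Ok(v)
}

fn main() {
    // Fail the 0th allocation, then the 1st, etc., up to a small budget.
    for i in 0..8 {
        COUNTER.store(0, Ordering::SeqCst);
        FAIL_AT.store(i, Ordering::SeqCst);
        // Either Ok (the failing point was never reached) or a clean
        // Err; the property being tested is that we never abort.
        let _ = build_buffer(1024);
    }
    FAIL_AT.store(usize::MAX, Ordering::SeqCst);
    assert!(build_buffer(16).is_ok());
    println!("no aborts across injected OOM points");
}
```

A real harness would run an entire Wasm load-and-execute cycle (rather than one helper) under each injected failure point, and would also check that poisoned state is reported correctly afterward.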

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 20:07):

fitzgen added the wasmtime:platform-support label to Issue #12069.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 20:58):

bjorn3 commented on issue #12069:

I suspect you are going to have a hard time ensuring that all dependencies can also handle OOM. For example, for parsing ELF files you are using the object crate, which allocates. Symbolizing backtraces allocates a debuginfo cache. async-trait allocates boxes for all futures. Serde allocates (and I don't think it is possible to support handling OOM there without making an API-breaking change).

The unstable std::alloc::set_alloc_error_hook would almost certainly be much easier to use. It is unstable pending a decision on whether unwinding out of it should be allowed, and whether we should even allow running user code in handle_alloc_error at all. (The argument against is that there may be unsafe code that is unsound under the extra reentrancy this would allow.)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 20:59):

bjorn3 edited a comment on issue #12069:

I suspect you are going to have a hard time ensuring that all dependencies can also handle OOM. For example, for parsing ELF files you are using the object crate, which allocates. Symbolizing backtraces allocates a debuginfo cache. async-trait allocates boxes for all futures. Serde allocates when deserializing common types like Vec (and I don't think it is possible to support handling OOM there without making an API-breaking change).

The unstable std::alloc::set_alloc_error_hook would almost certainly be much easier to use. It is unstable pending a decision on whether unwinding out of it should be allowed, and whether we should even allow running user code in handle_alloc_error at all. (The argument against is that there may be unsafe code that is unsound under the extra reentrancy this would allow.)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 21:07):

cfallin commented on issue #12069:

@bjorn3 thanks for the thoughts; we have thought pretty extensively about this, and we think the unwinding-based approach is probably not viable in our no-std embedding (see the Zulip thread that Nick linked for more). Nick covered serde above; we can turn off symbolization in our embedding; boxed futures are going to be a big hassle when we get to async, but in principle there should be ways to define enough of our own versions of dependencies to make this work.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 21:19):

alexcrichton commented on issue #12069:

Personally I view this direction for Wasmtime as a high-level decision point for the project. On one hand we have a proposal like @fitzgen has laid out, where the pros are "highly portable" and the cons are "big changes, unknown impact on maintainability". On the other hand we could go the route of something like set_alloc_error_hook and/or unwinding, which inverts the pros/cons here a bit -- e.g. few changes but much less portable.

In my opinion it's worthwhile to shoot for the more portable version of things. We can all agree it's going to be a big chunk of work, but I believe the portability of such a solution is highly attractive. I'm also somewhat confident that we can thread the needle on the maintainability and runtime cost of this solution, so I'm less worried about that.

If, however, the strategy of "check for OOM everywhere" doesn't work out, then we can go back to the drawing board and perhaps think harder about unwinding, but in the meantime I'd like to pull on the design thread @fitzgen is proposing here and see how far it gets us.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 21:36):

bjorn3 commented on issue #12069:

I'm afraid you will have to basically forgo about half of all Wasmtime dependencies and write replacements for them. (I wrote this before I did the inventory below; turns out I was pretty much spot on: 20 allocating vs 18 non-allocating.)

With cargo check --no-default-features --features "runtime async component-model component-model-async stack-switching", I'm getting the following runtime dependencies (checked means likely unconditionally allocates), excluding Wasmtime-controlled ones:

view this post on Zulip Wasmtime GitHub notifications bot (Nov 21 2025 at 21:46):

cfallin commented on issue #12069:

@bjorn3 thanks for this analysis. Most of these we have answers to already, I think.

Basically: yes, it will be expensive. But in the context in which we (Nick and I, at least) are employed to use Wasmtime, we need to be OOM-friendly to continue using Wasmtime at all, and we likely cannot make unwinding work easily (nor would we want its complexity on the critical path of a possibly-frequent failure mode in high-traffic scenarios), so it will take more than "this might be a lot of work" to convince us otherwise.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 25 2025 at 03:18):

Soto-J commented on issue #12069:

Hi! I’m interested in contributing to this effort and would love to help with the OOM-handling work.
If there are any recommended starting points or areas that would be most helpful for a new contributor, I’m happy to take them on.


Last updated: Dec 06 2025 at 06:05 UTC