wasmtime / PR #12860 Fix exception ref-count leak in non-... · git-wasmtime

Wasmtime GitHub notifications bot (Mar 27 2026 at 23:17):

cfallin opened PR #12860 from cfallin:exception-leak to bytecodealliance:main:

When a Wasm throw instruction executes, the throw_ref libcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops.

Two fixes:

1. Remove the unnecessary clone_gc_ref() in throw_ref. The throw/throw_ref instructions consume the exnref operand, so ownership transfers naturally to pending_exception without cloning.

2. In create_catch_block, emit a drop_gc_ref call for non-ref catches (Catch::One, Catch::All) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented.
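
As a rough illustration of the imbalance this description claims (not Wasmtime's actual code), the sketch below uses `std::rc::Rc` as a stand-in for a reference-counted GC ref. `Exn`, `Store`, `throw_cloning`, `throw_moving`, `catch`, and `refs_left_after_round_trip` are all hypothetical names invented for the example; only `pending_exception` echoes a name from the text above.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a reference-counted exception object.
struct Exn {
    payload: i32,
}

/// Hypothetical stand-in for the store slot holding an in-flight exception.
#[derive(Default)]
struct Store {
    pending_exception: Option<Rc<Exn>>,
}

/// Clone-on-throw variant: the count goes up by one at the throw site.
fn throw_cloning(store: &mut Store, exn: &Rc<Exn>) {
    store.pending_exception = Some(Rc::clone(exn));
}

/// Ownership-transfer variant: the throw consumes its operand, no clone.
fn throw_moving(store: &mut Store, exn: Rc<Exn>) {
    store.pending_exception = Some(exn);
}

/// A non-ref catch: extract the fields, then release the exception.
fn catch(store: &mut Store) -> i32 {
    let exn = store.pending_exception.take().expect("no pending exception");
    exn.payload
    // `exn` is dropped here, which decrements the count by one.
}

/// Runs one throw/catch round trip and reports how many strong references
/// to the exception remain afterwards (0 means it was freed).
fn refs_left_after_round_trip(clone_on_throw: bool) -> usize {
    let mut store = Store::default();
    let exn = Rc::new(Exn { payload: 7 });
    let observer = Rc::downgrade(&exn);
    if clone_on_throw {
        throw_cloning(&mut store, &exn);
        // Model a consumed operand whose count is never decremented.
        std::mem::forget(exn);
    } else {
        throw_moving(&mut store, exn);
    }
    assert_eq!(catch(&mut store), 7);
    observer.strong_count()
}
```

In this model the cloning variant leaves one reference behind after every round trip, while the moving variant frees the object as soon as the catch finishes with it.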

Also adds Store::gc_heap_size() / StoreContext::gc_heap_size() accessors and a throw_catch_many_times integration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.

Wasmtime GitHub notifications bot (Mar 27 2026 at 23:17):

cfallin requested wasmtime-core-reviewers for a review on PR #12860.

Wasmtime GitHub notifications bot (Mar 27 2026 at 23:17):

cfallin requested wasmtime-compiler-reviewers for a review on PR #12860.

Wasmtime GitHub notifications bot (Mar 27 2026 at 23:17):

cfallin requested fitzgen for a review on PR #12860.

Wasmtime GitHub notifications bot (Mar 28 2026 at 03:34):

github-actions[bot] added the label wasmtime:api on PR #12860.

Wasmtime GitHub notifications bot (Mar 28 2026 at 15:00):

cfallin updated PR #12860.

Wasmtime GitHub notifications bot (Mar 28 2026 at 15:17):

cfallin updated PR #12860.

Wasmtime GitHub notifications bot (Mar 30 2026 at 12:06):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Mar 30 2026 at 12:06):

fitzgen created PR review comment:

I don't quite understand. The GC ref should logically be held alive by the Wasm stack (in the over-approx table) and then get dropped upon the next GC. Why wasn't that happening? Are we missing a call to expose_gc_ref_to_wasm in the throw code? (Probably)

If so, then we should do that instead.

Wasmtime GitHub notifications bot (Mar 30 2026 at 12:06):

fitzgen created PR review comment:

A couple things:

- If we add a new collector that requires some kind of explicit drop here, we will silently repeat this bug because of the wildcard. We shouldn't have wildcards for this kind of thing.

- Even better would be to move this into a trait method on the GC compiler, as the dual of GcCompiler::alloc_exn.

Wasmtime GitHub notifications bot (Mar 30 2026 at 17:15):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Mar 30 2026 at 17:15):

cfallin created PR review comment:

We do call expose_gc_ref_to_wasm here when we manually convert to a wasmtime::ValRaw but you're right, I think it's missing in the throw path. I'll dig deeper -- thanks!

Wasmtime GitHub notifications bot (Mar 30 2026 at 19:41):

cfallin commented on PR #12860:

So digging into this a bit more I see that we always expose_gc_ref_to_wasm in the gc_alloc_raw libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

I had short-circuited the "deferred" part of DRC here with explicit drops which I agree is the wrong approach now that I dig in more. But it should also be the case that a C program using SjLj with exceptions should not grow the GC heap to its max size (4GiB say?) before ever collecting -- that seems like an objectively bad heuristic choice.

Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?
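
A minimal sketch of the watermark idea proposed above, assuming the trigger is simply "allocated bytes reached twice what survived the last collection". `CollectionWatermark` and its methods are hypothetical names, and the `min_threshold` floor is an added assumption (without some floor, an empty heap would collect on every allocation); none of this is existing Wasmtime API.

```rust
/// Hypothetical collection trigger: collect once the heap has doubled
/// relative to what survived the previous collection.
struct CollectionWatermark {
    /// Allocated-byte count at which the next collection is triggered.
    threshold: usize,
    /// Assumed floor so an almost-empty heap does not collect constantly.
    min_threshold: usize,
}

impl CollectionWatermark {
    fn new(min_threshold: usize) -> Self {
        Self {
            threshold: min_threshold,
            min_threshold,
        }
    }

    /// True when the current allocated size has reached the watermark.
    fn should_collect(&self, allocated_bytes: usize) -> bool {
        allocated_bytes >= self.threshold
    }

    /// Re-arm the watermark at twice the size that survived the collection,
    /// but never below the floor.
    fn record_collection(&mut self, live_bytes: usize) {
        self.threshold = live_bytes.saturating_mul(2).max(self.min_threshold);
    }
}
```

With a throw/catch loop that keeps at most one object live, `record_collection` keeps resetting the threshold to the floor, so the working set stays small; a program with a genuinely large live set sees the threshold double with it.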

Wasmtime GitHub notifications bot (Mar 30 2026 at 20:34):

fitzgen commented on PR #12860:

> So digging into this a bit more I see that we always expose_gc_ref_to_wasm in the gc_alloc_raw libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

https://github.com/bytecodealliance/wasmtime/blob/5d52f56cf589e48e9e7a277140c1ea8b5aa577d7/crates/wasmtime/src/runtime/vm/libcalls.rs#L1740-L1753

I'm pretty sure that adding a call to expose_gc_ref_to_wasm there will fix the leak.

> The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Instead, we should fix the heuristics and make it so that exnrefs Just Work the way we want them to, and that falls out automatically from updating those heuristics. That, and then also the other stuff in https://github.com/bytecodealliance/wasmtime/issues/11256, in the fullness of time.

In the meantime, you can update the test to use a resource limiter, custom memory creator, or the pooling allocator to constrain the max size of memories (and therefore also the GC heap) and it should still exercise the leak-checking even without the grow-or-collect heuristics fixed. The pooling allocator is probably the easiest.

Wasmtime GitHub notifications bot (Mar 30 2026 at 20:48):

cfallin updated PR #12860.

Wasmtime GitHub notifications bot (Mar 30 2026 at 20:48):

cfallin commented on PR #12860:

> It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

I'm a bit confused: the code that you link is a libcall that takes a ref from Wasm and pulls it into the host (to save on the store). Did you mean to link somewhere else?

Separately: writing a variant of this test, that allocates a GC struct and immediately drops it in a loop, causes the GC heap to grow without collection up to its size limit in the same way. That seems to confirm to me that this is not exn-specific but rather a general GC growth heuristic problem?

> Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Yes, agreed; the next paragraph of my earlier comment suggests a growth heuristic (for the whole GC heap, not just for exnref-specific cases), exactly as you're saying. I would imagine it could/should become the default growth heuristic. What do you think of the proposed heuristic?

Wasmtime GitHub notifications bot (Mar 30 2026 at 20:56):

cfallin edited a comment on PR #12860:

> It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

I'm a bit confused: the code that you link is a libcall that takes a ref from Wasm and pulls it into the host (to save on the store). Did you mean to link somewhere else?

Separately: writing a variant of this test, that allocates a GC struct and immediately drops it in a loop, causes the GC heap to grow without collection up to its size limit in the same way. That seems to confirm to me that this is not exn-specific but rather a general GC growth heuristic problem?

> Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Yes, agreed; the next paragraph of my earlier comment suggests a growth heuristic (for the whole GC heap, not just for exnref-specific cases), exactly as you're saying. (In other words: I am definitely not proposing an exceptions-specific change.) I would imagine it could/should become the default growth heuristic. What do you think of the proposed heuristic?

Wasmtime GitHub notifications bot (Mar 30 2026 at 20:59):

fitzgen commented on PR #12860:

Ah sorry I got confused because I had assumed that this was not about the heuristics and was something specific to the interaction between exnref and libcalls, and I was trying not to complicate the discussion by getting into multiple things at once. That backfired :-p

> Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?

Yes, this is roughly what I've been imagining, although there is the minor wrinkle of growing in units of pages, but wanting to do the accounting at a finer-grained level so that the single-page GC heap use case works correctly.

To be precise, what I've had in my head is this:

- After each collection we record the live set size somewhere -- maybe GcStore?

- When deciding whether to GC or grow the heap, we look at the last live set size, and if it is less than or equal to half our GC heap size, then we do a collection. Otherwise, we grow.

- The initial live set size is zero, which causes us to always GC before growing, which should preserve the single-page GC heap use case, because if there is only ever one exnref live then our live set should always end up being below half the GC heap size.

This should also avoid doing a bunch of GCs for each ~power of two on the way up to 64KiB, which is just wasting cycles.

How does that sound?
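
The rule spelled out above could be sketched roughly as follows. `Action`, `GrowOrCollect`, and the method names are hypothetical, chosen only for this example; the thread suggests GcStore as one possible home for the recorded live-set size but does not settle on an API.

```rust
/// What to do when the GC heap cannot satisfy an allocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Collect,
    Grow,
}

/// Hypothetical holder for the live-set size seen at the last collection.
struct GrowOrCollect {
    last_live_bytes: usize,
}

impl GrowOrCollect {
    /// The initial live-set size is zero, so the first decision is
    /// always to collect before growing.
    fn new() -> Self {
        Self { last_live_bytes: 0 }
    }

    /// Collect if the last live set fit in half the current heap;
    /// otherwise grow.
    fn decide(&self, heap_bytes: usize) -> Action {
        if self.last_live_bytes <= heap_bytes / 2 {
            Action::Collect
        } else {
            Action::Grow
        }
    }

    /// Record the live-set size measured after a collection.
    fn record_collection(&mut self, live_bytes: usize) {
        self.last_live_bytes = live_bytes;
    }
}
```

For a single 64 KiB page with one small object live, `decide` keeps returning `Collect`; once the live set exceeds half the heap it returns `Grow`. What should happen when a collection frees too little for the pending allocation is not covered by the rule as stated, so the sketch leaves it out.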

Wasmtime GitHub notifications bot (Mar 30 2026 at 21:00):

fitzgen commented on PR #12860:

> After each collection we record the live set size somewhere -- maybe GcStore?

And because refcounting operates on anti-matter rather than matter, this does unfortunately mean that the DRC collector will need to always have a running count of allocated bytes, rather than being able to compute it just at GC time like the copying collector will be able to. Ah well, not a big deal.
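
One way to picture that running count, as a sketch only: a counter bumped on every allocation and decremented whenever an object's count drops to zero and it is freed. `LiveByteCounter` and its methods are hypothetical names, not part of the DRC collector.

```rust
/// Hypothetical running tally of live bytes for a refcounting collector,
/// which only ever observes objects as they die.
#[derive(Default)]
struct LiveByteCounter {
    live_bytes: usize,
}

impl LiveByteCounter {
    /// Called for every successful allocation.
    fn on_alloc(&mut self, size: usize) {
        self.live_bytes += size;
    }

    /// Called when an object is freed after its count reaches zero.
    fn on_free(&mut self, size: usize) {
        self.live_bytes = self
            .live_bytes
            .checked_sub(size)
            .expect("freed more bytes than were allocated");
    }

    /// The live-set size to record after a collection.
    fn live_bytes(&self) -> usize {
        self.live_bytes
    }
}
```

A copying collector, by contrast, could derive the same number at GC time from the amount it copied, which is the asymmetry the comment points at.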

Wasmtime GitHub notifications bot (Mar 30 2026 at 21:04):

cfallin commented on PR #12860:

That does sound reasonable, and I'm happy to work on that -- thanks! Closing this in the meantime (I'll bring back both the sjlj exception test and struct-alloc-then-drop test once we have said heuristic and assert they both stay within one page).

Wasmtime GitHub notifications bot (Mar 30 2026 at 21:04):

cfallin closed without merge PR #12860.

Last updated: Jun 01 2026 at 09:49 UTC