Stream: git-wasmtime

Topic: wasmtime / issue #13403 DRC: per-GC sweep cost grows unbo...


view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 09:56):

harryzz opened issue #13403:


Summary

On a long-running Kotlin/Wasm + WasmGC guest (~9 MB/s steady allocation
rate driven by suspendCoroutine), per-Store::gc cost grows
super-linearly over time. After enough sustained use, a single sweep
exceeds 5 s of host-thread time, triggering Android's input-dispatch
ANR. We patched wasmtime to log the over-approximation list size N,
free-block count F, and sweep duration at every gc; **the trajectory
is reproducible and the scaling is clear**: sweep duration roughly
doubles every few minutes of typical UI interaction, driven primarily
by N's linear growth.

The root cause is architectural: nothing in wasmtime auto-triggers
gc, so N (over-approximated-stack-roots list) grows monotonically
between sweeps, while sweep itself is O(N) with poor cache locality.
Embedders that allocate at a steady rate and let N grow eventually
hit unworkable pause times.

This is distinct from #9701 / PR #10401. That fix was about
transitive ref-decrementing (correctness). We're running on top of it,
on 44.0.1. The remaining issue is pure scaling cost in the sweep.

Environment

Reproducer

Minimal: https://codeberg.org/harryzz/wart-leak-repro

A bare suspendCoroutine loop driven by per-frame host calls — no
Compose, no UI. Reproduces the underlying allocation pattern at ~9 MB/s
WasmGC-heap growth, OOM in ~6:37 on a Pixel 2 XL with no manual
Store::gc triggers.

The full Compose-UI guest (which we actually run) drives a much larger
retained live set per gc (Composition tree, Snapshot records,
LaunchedEffect jobs, Recomposer invalidation list). The
super-linear-growth symptom is visible there but not in the minimal
repro, because the linear-scan cost is dominated by fragmentation,
which Compose's diverse object sizes produce far better than a tight
loop.

Root cause — sweep is O(N) and N grows unbounded between gcs

crates/wasmtime/src/runtime/vm/gc/enabled/drc.rs, lines 124-128:

/// The head of the over-approximated-stack-roots list.
///
/// Note that this is exposed directly to compiled Wasm code through the
/// vmctx, so must not move.
over_approximated_stack_roots: Box<Option<VMGcRef>>,

The over-approximation set is a singly-linked list threaded through
the next_over_approximated_stack_root field in each object's
VMDrcHeader (line 645). sweep (lines 524-618) walks the entire
list every gc:

So |list| at sweep time = (refs currently on the precise wasm stack)

Cost per gc: O(N) linked-list walk, plus N × O(log F) for the
deallocation phase, plus PR #10401's transitive dec-ref. Both N and
F grow over time → super-linear total cost.

There is no automatic scheduling — wasmtime's grow_or_collect_gc_heap
fallback path (the only auto-trigger) only fires on memory.grow
failure. An embedder that returns Ok(true) from
ResourceLimiter::memory_growing never sees a gc unless it calls
Store::gc itself. Embedders that do return false on threshold
also get the cascade-on-recomposition behavior described below.

Secondary — FreeList::first_fit is O(F), but not the dominant cost

crates/wasmtime/src/runtime/vm/gc/enabled/free_list.rs, lines 180-195:

fn first_fit(&mut self, alloc_size: u32) -> Option<(u32, u32)> {
    let (&block_index, &block_len) = self
        .free_block_index_to_len
        .iter()                                  // ← walks BTreeMap in key order
        .find(|(_idx, len)| **len >= alloc_size)?;
    ...
}

The free-block map is keyed by block index (address), not by size.
iter().find(...) walks entries in address order returning the
lowest-address block that fits. O(F) per allocation.

We initially thought this was the dominant cost. It isn't — see the
"What we tried" section. In our workload, frame-level allocation cost
is dwarfed by Compose's render/layout/draw work; switching to a size-
indexed O(log F) lookup didn't move the steady-state frame mean. The
sweep itself is the load-bearing cost, and replacing first_fit
without also addressing the sweep makes things worse (see below).

Direct measurements from a patched wasmtime build

I patched 44.0.1 to log N (over-approximation list size), F
(free-block count), sweep duration, and freed bytes at every sweep.
Three consecutive sweeps during a 45-min soak with active
scrolling/tapping in a Material3 LazyColumn:

11:33:56  sweep 1: dur=478 ms,   N=1,223,135, F:137,269
11:41:52  sweep 2: dur=1248 ms,  N=2,333,452, F:9,96667,394
12:18:20  sweep 3: dur=3000 ms,  N=4,585,421, F:29,53867,394
Sweep 1 Sweep 2 (+8 min) Sweep 3 (+37 min)
N 1.22M 2.33M 4.59M
F_after 37,269 67,394 67,394
Sweep duration 478 ms 1,248 ms 3,000 ms
Per-entry sweep cost 0.39 μs 0.53 μs 0.65 μs
Bytes reclaimed 56 MB 116 MB 249 MB

All three kept counts were 0 — sweeps fired from a host-side
Store::gc(None) invoked between frames, so the precise-stack-roots
set was empty at sweep time.

Notable observations:

  1. F jumps to 37K on the very first sweep. Sweep 1 deallocates
    1.2M scattered objects; the BTreeMap merge-with-neighbors logic in
    FreeList::dealloc only coalesces immediately address-adjacent
    blocks. With live retained objects sandwiching dead ones, ~37K of
    the freed blocks survive as discrete entries. From that point on,
    every guest allocation pays an O(37K) first_fit walk.

  2. F partially reverses between sweeps (37K → 10K before sweep
    2; 67K → 30K before sweep 3) — the allocator does merge some
    blocks as it satisfies new allocations. But each new sweep then
    re-bloats F. Net direction is monotonically up over the soak.

  3. Per-entry sweep cost is rising monotonically (0.39 → 0.53 →
    0.65 μs/entry, +67% over 45 min). At N=4.6M the over-approximation
    linked list's 40-byte headers occupy ~180 MB of GC heap scattered
    across distant addresses. Walking them is bandwidth-bound; the
    working set has clearly blown L1/L2 and is likely thrashing L3 too.

  4. Sweep duration grows faster than N — N grows ~2× per
    inter-sweep interval, but duration grew ~3× from sweep 2 to 3 even
    though that gap was 4× longer than sweep 1 to 2. The two effects
    (more entries to walk, each entry costs more) compound.

Extrapolating the trend: at the observed growth rate, sweep durations
push past 5 s within an hour or two. A single sweep at that scale, or
a small cascade of them inside one render_frame, exceeds Android's
5 s input-dispatch ANR threshold. We've observed this in practice in
multiple soaks (one ANR at ~48 min on the unpatched build, another at
~10 min on a heavier-interaction soak with the instrumentation
adding small per-sweep overhead). Typical ANR cascade pattern in
logcat:

last frame log
+0.0 s   wart-profile: frame N t+X ms  win[N=60 ... ]
+0.0 s   InputDispatcher: "spent 2005 ms processing MotionEvent"
+1.0 s   InputDispatcher: "spent 2350 ms processing MotionEvent"
+4.0 s   InputDispatcher: "spent 2247 ms processing MotionEvent"
+5.0 s   ANR  "Waited 5000 ms for MotionEvent"

Mean steady-state frame time stays flat at ~16 ms outside of
sweep/cascade windows — the problem isn't the normal allocation path,
it's the sweep + the post-sweep allocation cost driven by elevated F.

What we tried that did NOT work

We patched FreeList to add a parallel size-indexed
BTreeSet<(len, index)>, replacing the O(F) linear scan in first_fit
with an O(log F) range((alloc_size, 0)..).next() lookup. All 12
existing FreeList unit tests pass. On device, this **made things
worse**:

Unpatched With FreeList fix Δ
Sweep 1 duration 472 ms 823 ms +74%
Sweep 2 duration 1,222 ms 2,010 ms +64%
Sweep 3 duration (ANR'd before) 4,314 ms new
Per-frame mean (steady) ~16 ms ~16 ms no change
Guest alloc rate 15K refs/sec 19K refs/sec +27%

Two reasons it failed:

  1. first_fit was not the bottleneck. Frame-mean cost is dominated
    by Compose's render/layout/draw work; the alloc-path microseconds
    we shaved off didn't manifest in any observable metric.

  2. Faster allocs made the leak worse. With cheaper allocs, the
    Kotlin/Wasm guest churned through suspendCoroutine cycles faster,
    so N accumulated faster between sweeps. Each subsequent sweep had
    more to walk and more to deallocate. The per-dealloc overhead
    from maintaining a second data structure (one extra BTreeSet op per
    add_block / remove_block / update_block_len) then compounded
    over N+ deallocations per sweep.

The lesson: any allocator-side optimization must be paired with a
sweep-side fix, otherwise it accelerates the pathology. Posting this
finding in case it's useful — happy to share the patch diff if you want
to evaluate it for other workloads where alloc cost is the
bottleneck.

What would help

In order of e
[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 09:56):

harryzz added the bug label to Issue #13403.

view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 18:27):

fitzgen commented on issue #13403:

Hello, I appreciate the bug report and will strive to investigate it soon, but can you please avoid dumping AI-generated walls of text going forward? By not reviewing the LLM output and summarizing it / including just the necessary information yourself first before reporting, you are thrusting that work and effort upon Wasmtime's maintainers, which breaks the open-source social contract. Read the Bytecode Alliance AI tool usage policy for more details. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 18:29):

fitzgen added the wasm-proposal:gc label to Issue #13403.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 22:22):

fitzgen commented on issue #13403:

@harryzz can you provide the exact Wasmtime invocation you used with wart-leak-repro.wasm to reproduce the increasing sweep times? I ran the following and it never allocated enough to trigger a GC:

wasmtime run -Wgc,function-references,exceptions -Ccollector=drc -Wunknown-imports-default=y wart-leak-repro.wasm

view this post on Zulip Wasmtime GitHub notifications bot (May 20 2026 at 06:37):

harryzz commented on issue #13403:

@fitzgen first sorry for AI generated post. To the question: standalone can't run, need host function.
I have attached new self driven wasm file. Also repo with kotlin code updated if you need for reference.

regards,
Zahari

P.S. there short answer from AI if you need :)

<!-- Failed to upload "wart-leak.repro.tgz" -->

<!-- Failed to upload "wart-leak.repro.tgz" -->

▎ Sorry, that's on me — the .wasm I attached doesn't reproduce standalone. It was a WASI component whose suspend loop was driven by an external render-frame export, so a plain wasmtime run calls main() once,
▎ which parks at the first suspendCoroutine and returns — nothing accumulates.

▎ I've replaced it with a self-driving build (attached: wart-leak-repro.wasm) — main() now runs the suspend/resume cycle itself in an unbounded loop, no host imports, no component model. Your exact invocation
▎ reproduces it:

▎ wasmtime run -Wgc,function-references,exceptions -Ccollector=drc wart-leak-repro.wasm

▎ RSS climbs ~100 MB/s as garbage accumulates with no sweep, plateaus at the 4 GB wasm32 ceiling within ~50 s, then enters the steady-state sweep cycle (the tick # counter keeps advancing past the plateau).
▎ With 01-instrumentation.patch applied you get the per-sweep (N, F, dur) trajectory. The ~70-line source is Main.kt — pure suspendCoroutine, zero dependencies.

wart-leak-repro.tar.gz

view this post on Zulip Wasmtime GitHub notifications bot (May 20 2026 at 19:40):

fitzgen commented on issue #13403:

@harryzz can you verify whether https://github.com/bytecodealliance/wasmtime/pull/13422 fixes the excessively long sweeps for you? FWIW, I wasn't able to repro your sweep times after ten minutes of running the test case on my laptop, but you mention Android in this issue so I assume your hardware is pretty different from mine.

view this post on Zulip Wasmtime GitHub notifications bot (May 20 2026 at 22:09):

harryzz commented on issue #13403:

@fitzgen verified #13422 both on linux x86 (with reproduce test) and Android armv8 (heavy load: compose material3 kotlin library). It works just fine, but a new problem has appeared - a brief description from the AI:

   Tested #13422 on the Android device. It fixes the sweep-growth  sweeps stay flat at ~1 ms. But the ANR returned via a different path: the inline force_gc fires very frequently on our Kotlin/Wasm+Compose
   guest, and each GC's root scan (trace_vmctx_roots  RegisteredType::root/from_parts, per GC-typed global) is expensive enough that the cumulative cost blocks the render thread > 5 s. ANR stack: force_gc 
   do_gc  trace_vmctx_roots  Global::trace_root  RegisteredType::root. So the trigger frequency × per-GC root-scan cost is the new bottleneck  the trigger may be too aggressive for a guest with many GC
   globals, or trace_vmctx_roots is too costly to run at that rate.

hope this helps. If more information is needed, I can provide it.

view this post on Zulip Wasmtime GitHub notifications bot (May 21 2026 at 18:39):

fitzgen commented on issue #13403:

@harryzz can you please file a new issue for that without the verbatim AI text (which is referencing a bunch of stuff unrelated to Wasmtime, eg I don't even know what an ANR is, it isn't a Wasmtime thing, and other stuff implicit in your use case that we don't have context for outside of your project) and instead provide (briefly) a standalone test case, expected results, and actual results, similar to this template? Thanks.

view this post on Zulip Wasmtime GitHub notifications bot (May 21 2026 at 19:09):

fitzgen closed issue #13403:


Summary

On a long-running Kotlin/Wasm + WasmGC guest (~9 MB/s steady allocation
rate driven by suspendCoroutine), per-Store::gc cost grows
super-linearly over time. After enough sustained use, a single sweep
exceeds 5 s of host-thread time, triggering Android's input-dispatch
ANR. We patched wasmtime to log the over-approximation list size N,
free-block count F, and sweep duration at every gc; **the trajectory
is reproducible and the scaling is clear**: sweep duration roughly
doubles every few minutes of typical UI interaction, driven primarily
by N's linear growth.

The root cause is architectural: nothing in wasmtime auto-triggers
gc, so N (over-approximated-stack-roots list) grows monotonically
between sweeps, while sweep itself is O(N) with poor cache locality.
Embedders that allocate at a steady rate and let N grow eventually
hit unworkable pause times.

This is distinct from #9701 / PR #10401. That fix was about
transitive ref-decrementing (correctness). We're running on top of it,
on 44.0.1. The remaining issue is pure scaling cost in the sweep.

Environment

Reproducer

Minimal: https://codeberg.org/harryzz/wart-leak-repro

A bare suspendCoroutine loop driven by per-frame host calls — no
Compose, no UI. Reproduces the underlying allocation pattern at ~9 MB/s
WasmGC-heap growth, OOM in ~6:37 on a Pixel 2 XL with no manual
Store::gc triggers.

The full Compose-UI guest (which we actually run) drives a much larger
retained live set per gc (Composition tree, Snapshot records,
LaunchedEffect jobs, Recomposer invalidation list). The
super-linear-growth symptom is visible there but not in the minimal
repro, because the linear-scan cost is dominated by fragmentation,
which Compose's diverse object sizes produce far better than a tight
loop.

Root cause — sweep is O(N) and N grows unbounded between gcs

crates/wasmtime/src/runtime/vm/gc/enabled/drc.rs, lines 124-128:

/// The head of the over-approximated-stack-roots list.
///
/// Note that this is exposed directly to compiled Wasm code through the
/// vmctx, so must not move.
over_approximated_stack_roots: Box<Option<VMGcRef>>,

The over-approximation set is a singly-linked list threaded through
the next_over_approximated_stack_root field in each object's
VMDrcHeader (line 645). sweep (lines 524-618) walks the entire
list every gc:

So |list| at sweep time = (refs currently on the precise wasm stack)

Cost per gc: O(N) linked-list walk, plus N × O(log F) for the
deallocation phase, plus PR #10401's transitive dec-ref. Both N and
F grow over time → super-linear total cost.

There is no automatic scheduling — wasmtime's grow_or_collect_gc_heap
fallback path (the only auto-trigger) only fires on memory.grow
failure. An embedder that returns Ok(true) from
ResourceLimiter::memory_growing never sees a gc unless it calls
Store::gc itself. Embedders that do return false on threshold
also get the cascade-on-recomposition behavior described below.

Secondary — FreeList::first_fit is O(F), but not the dominant cost

crates/wasmtime/src/runtime/vm/gc/enabled/free_list.rs, lines 180-195:

fn first_fit(&mut self, alloc_size: u32) -> Option<(u32, u32)> {
    let (&block_index, &block_len) = self
        .free_block_index_to_len
        .iter()                                  // ← walks BTreeMap in key order
        .find(|(_idx, len)| **len >= alloc_size)?;
    ...
}

The free-block map is keyed by block index (address), not by size.
iter().find(...) walks entries in address order returning the
lowest-address block that fits. O(F) per allocation.

We initially thought this was the dominant cost. It isn't — see the
"What we tried" section. In our workload, frame-level allocation cost
is dwarfed by Compose's render/layout/draw work; switching to a size-
indexed O(log F) lookup didn't move the steady-state frame mean. The
sweep itself is the load-bearing cost, and replacing first_fit
without also addressing the sweep makes things worse (see below).

Direct measurements from a patched wasmtime build

I patched 44.0.1 to log N (over-approximation list size), F
(free-block count), sweep duration, and freed bytes at every sweep.
Three consecutive sweeps during a 45-min soak with active
scrolling/tapping in a Material3 LazyColumn:

11:33:56  sweep 1: dur=478 ms,   N=1,223,135, F:137,269
11:41:52  sweep 2: dur=1248 ms,  N=2,333,452, F:9,96667,394
12:18:20  sweep 3: dur=3000 ms,  N=4,585,421, F:29,53867,394
Sweep 1 Sweep 2 (+8 min) Sweep 3 (+37 min)
N 1.22M 2.33M 4.59M
F_after 37,269 67,394 67,394
Sweep duration 478 ms 1,248 ms 3,000 ms
Per-entry sweep cost 0.39 μs 0.53 μs 0.65 μs
Bytes reclaimed 56 MB 116 MB 249 MB

All three kept counts were 0 — sweeps fired from a host-side
Store::gc(None) invoked between frames, so the precise-stack-roots
set was empty at sweep time.

Notable observations:

  1. F jumps to 37K on the very first sweep. Sweep 1 deallocates
    1.2M scattered objects; the BTreeMap merge-with-neighbors logic in
    FreeList::dealloc only coalesces immediately address-adjacent
    blocks. With live retained objects sandwiching dead ones, ~37K of
    the freed blocks survive as discrete entries. From that point on,
    every guest allocation pays an O(37K) first_fit walk.

  2. F partially reverses between sweeps (37K → 10K before sweep
    2; 67K → 30K before sweep 3) — the allocator does merge some
    blocks as it satisfies new allocations. But each new sweep then
    re-bloats F. Net direction is monotonically up over the soak.

  3. Per-entry sweep cost is rising monotonically (0.39 → 0.53 →
    0.65 μs/entry, +67% over 45 min). At N=4.6M the over-approximation
    linked list's 40-byte headers occupy ~180 MB of GC heap scattered
    across distant addresses. Walking them is bandwidth-bound; the
    working set has clearly blown L1/L2 and is likely thrashing L3 too.

  4. Sweep duration grows faster than N — N grows ~2× per
    inter-sweep interval, but duration grew ~3× from sweep 2 to 3 even
    though that gap was 4× longer than sweep 1 to 2. The two effects
    (more entries to walk, each entry costs more) compound.

Extrapolating the trend: at the observed growth rate, sweep durations
push past 5 s within an hour or two. A single sweep at that scale, or
a small cascade of them inside one render_frame, exceeds Android's
5 s input-dispatch ANR threshold. We've observed this in practice in
multiple soaks (one ANR at ~48 min on the unpatched build, another at
~10 min on a heavier-interaction soak with the instrumentation
adding small per-sweep overhead). Typical ANR cascade pattern in
logcat:

last frame log
+0.0 s   wart-profile: frame N t+X ms  win[N=60 ... ]
+0.0 s   InputDispatcher: "spent 2005 ms processing MotionEvent"
+1.0 s   InputDispatcher: "spent 2350 ms processing MotionEvent"
+4.0 s   InputDispatcher: "spent 2247 ms processing MotionEvent"
+5.0 s   ANR  "Waited 5000 ms for MotionEvent"

Mean steady-state frame time stays flat at ~16 ms outside of
sweep/cascade windows — the problem isn't the normal allocation path,
it's the sweep + the post-sweep allocation cost driven by elevated F.

What we tried that did NOT work

We patched FreeList to add a parallel size-indexed
BTreeSet<(len, index)>, replacing the O(F) linear scan in first_fit
with an O(log F) range((alloc_size, 0)..).next() lookup. All 12
existing FreeList unit tests pass. On device, this **made things
worse**:

Unpatched With FreeList fix Δ
Sweep 1 duration 472 ms 823 ms +74%
Sweep 2 duration 1,222 ms 2,010 ms +64%
Sweep 3 duration (ANR'd before) 4,314 ms new
Per-frame mean (steady) ~16 ms ~16 ms no change
Guest alloc rate 15K refs/sec 19K refs/sec +27%

Two reasons it failed:

  1. first_fit was not the bottleneck. Frame-mean cost is dominated
    by Compose's render/layout/draw work; the alloc-path microseconds
    we shaved off didn't manifest in any observable metric.

  2. Faster allocs made the leak worse. With cheaper allocs, the
    Kotlin/Wasm guest churned through suspendCoroutine cycles faster,
    so N accumulated faster between sweeps. Each subsequent sweep had
    more to walk and more to deallocate. The per-dealloc overhead
    from maintaining a second data structure (one extra BTreeSet op per
    add_block / remove_block / update_block_len) then compounded
    over N+ deallocations per sweep.

The lesson: any allocator-side optimization must be paired with a
sweep-side fix, otherwise it accelerates the pathology. Posting this
finding in case it's useful — happy to share the patch diff if you want
to evaluate it for other workloads where alloc cost is the
bottleneck.

What would help

In order of e
[message truncated]


Last updated: Jun 01 2026 at 09:49 UTC