harryzz opened issue #13403:
Summary
On a long-running Kotlin/Wasm + WasmGC guest (~9 MB/s steady allocation
rate driven bysuspendCoroutine), per-Store::gccost grows
super-linearly over time. After enough sustained use, a single sweep
exceeds 5 s of host-thread time, triggering Android's input-dispatch
ANR. We patched wasmtime to log the over-approximation list sizeN,
free-block countF, and sweep duration at every gc; **the trajectory
is reproducible and the scaling is clear**: sweep duration roughly
doubles every few minutes of typical UI interaction, driven primarily
byN's linear growth.The root cause is architectural: nothing in wasmtime auto-triggers
gc, soN(over-approximated-stack-roots list) grows monotonically
between sweeps, while sweep itself is O(N) with poor cache locality.
Embedders that allocate at a steady rate and letNgrow eventually
hit unworkable pause times.This is distinct from #9701 / PR #10401. That fix was about
transitive ref-decrementing (correctness). We're running on top of it,
on 44.0.1. The remaining issue is pure scaling cost in the sweep.Environment
- wasmtime 44.0.1, features =
["component-model"], sync host calls- Target:
aarch64-linux-android, AOT (--wasm gc --wasm function-references --wasm exceptions)- Host: single-threaded
Storeon the render thread (winit android-native-activity)- Guest: Kotlin/Wasm 2.4.0-RC compiled to a wasm-component
- Device: Pixel 2 XL / Android 15 / API 35
Reproducer
Minimal: https://codeberg.org/harryzz/wart-leak-repro
A bare
suspendCoroutineloop driven by per-frame host calls — no
Compose, no UI. Reproduces the underlying allocation pattern at ~9 MB/s
WasmGC-heap growth, OOM in ~6:37 on a Pixel 2 XL with no manual
Store::gctriggers.The full Compose-UI guest (which we actually run) drives a much larger
retained live set per gc (Composition tree, Snapshot records,
LaunchedEffect jobs, Recomposer invalidation list). The
super-linear-growth symptom is visible there but not in the minimal
repro, because the linear-scan cost is dominated by fragmentation,
which Compose's diverse object sizes produce far better than a tight
loop.Root cause — sweep is O(
N) andNgrows unbounded between gcs
crates/wasmtime/src/runtime/vm/gc/enabled/drc.rs, lines 124-128:/// The head of the over-approximated-stack-roots list. /// /// Note that this is exposed directly to compiled Wasm code through the /// vmctx, so must not move. over_approximated_stack_roots: Box<Option<VMGcRef>>,The over-approximation set is a singly-linked list threaded through
thenext_over_approximated_stack_rootfield in each object's
VMDrcHeader(line 645).sweep(lines 524-618) walks the entire
list every gc:
Unmarked entries → spliced out, dec_ref'd via
dec_ref_and_maybe_dealloc, possibly deallocated through
FreeList::dealloc(O(log F)).Marked entries → stay in the list across gcs (lines 578-588).
So
|list|at sweep time = (refs currently on the precise wasm stack)
- (refs newly borrowed since last sweep). For a steady-state
high-allocation guest with no explicitStore::gccalls, this
monotonically grows. Each entry'sVMDrcHeaderis ~40 B, scattered
across the GC heap → walking is memory-bandwidth-bound (cache miss
per entry on a 95 MB working set atN= 2.4M).Cost per gc: O(
N) linked-list walk, plus N × O(log F) for the
deallocation phase, plus PR #10401's transitive dec-ref. BothNand
Fgrow over time → super-linear total cost.There is no automatic scheduling — wasmtime's
grow_or_collect_gc_heap
fallback path (the only auto-trigger) only fires on memory.grow
failure. An embedder that returnsOk(true)from
ResourceLimiter::memory_growingnever sees a gc unless it calls
Store::gcitself. Embedders that do return false on threshold
also get the cascade-on-recomposition behavior described below.Secondary —
FreeList::first_fitis O(F), but not the dominant cost
crates/wasmtime/src/runtime/vm/gc/enabled/free_list.rs, lines 180-195:fn first_fit(&mut self, alloc_size: u32) -> Option<(u32, u32)> { let (&block_index, &block_len) = self .free_block_index_to_len .iter() // ← walks BTreeMap in key order .find(|(_idx, len)| **len >= alloc_size)?; ... }The free-block map is keyed by block index (address), not by size.
iter().find(...)walks entries in address order returning the
lowest-address block that fits. O(F) per allocation.We initially thought this was the dominant cost. It isn't — see the
"What we tried" section. In our workload, frame-level allocation cost
is dwarfed by Compose's render/layout/draw work; switching to a size-
indexed O(log F) lookup didn't move the steady-state frame mean. The
sweep itself is the load-bearing cost, and replacingfirst_fit
without also addressing the sweep makes things worse (see below).Direct measurements from a patched wasmtime build
I patched 44.0.1 to log
N(over-approximation list size),F
(free-block count), sweep duration, and freed bytes at every sweep.
Three consecutive sweeps during a 45-min soak with active
scrolling/tapping in a Material3 LazyColumn:11:33:56 sweep 1: dur=478 ms, N=1,223,135, F:1→37,269 11:41:52 sweep 2: dur=1248 ms, N=2,333,452, F:9,966→67,394 12:18:20 sweep 3: dur=3000 ms, N=4,585,421, F:29,538→67,394
Sweep 1 Sweep 2 (+8 min) Sweep 3 (+37 min) N1.22M 2.33M 4.59M F_after37,269 67,394 67,394 Sweep duration 478 ms 1,248 ms 3,000 ms Per-entry sweep cost 0.39 μs 0.53 μs 0.65 μs Bytes reclaimed 56 MB 116 MB 249 MB All three
keptcounts were 0 — sweeps fired from a host-side
Store::gc(None)invoked between frames, so the precise-stack-roots
set was empty at sweep time.Notable observations:
Fjumps to 37K on the very first sweep. Sweep 1 deallocates
1.2M scattered objects; the BTreeMap merge-with-neighbors logic in
FreeList::dealloconly coalesces immediately address-adjacent
blocks. With live retained objects sandwiching dead ones, ~37K of
the freed blocks survive as discrete entries. From that point on,
every guest allocation pays an O(37K)first_fitwalk.
Fpartially reverses between sweeps (37K → 10K before sweep
2; 67K → 30K before sweep 3) — the allocator does merge some
blocks as it satisfies new allocations. But each new sweep then
re-bloatsF. Net direction is monotonically up over the soak.Per-entry sweep cost is rising monotonically (0.39 → 0.53 →
0.65 μs/entry, +67% over 45 min). At N=4.6M the over-approximation
linked list's 40-byte headers occupy ~180 MB of GC heap scattered
across distant addresses. Walking them is bandwidth-bound; the
working set has clearly blown L1/L2 and is likely thrashing L3 too.Sweep duration grows faster than N — N grows ~2× per
inter-sweep interval, but duration grew ~3× from sweep 2 to 3 even
though that gap was 4× longer than sweep 1 to 2. The two effects
(more entries to walk, each entry costs more) compound.Extrapolating the trend: at the observed growth rate, sweep durations
push past 5 s within an hour or two. A single sweep at that scale, or
a small cascade of them inside onerender_frame, exceeds Android's
5 s input-dispatch ANR threshold. We've observed this in practice in
multiple soaks (one ANR at ~48 min on the unpatched build, another at
~10 min on a heavier-interaction soak with the instrumentation
adding small per-sweep overhead). Typical ANR cascade pattern in
logcat:last frame log +0.0 s wart-profile: frame N t+X ms win[N=60 ... ] +0.0 s InputDispatcher: "spent 2005 ms processing MotionEvent" +1.0 s InputDispatcher: "spent 2350 ms processing MotionEvent" +4.0 s InputDispatcher: "spent 2247 ms processing MotionEvent" +5.0 s ANR — "Waited 5000 ms for MotionEvent"Mean steady-state frame time stays flat at ~16 ms outside of
sweep/cascade windows — the problem isn't the normal allocation path,
it's the sweep + the post-sweep allocation cost driven by elevated F.What we tried that did NOT work
We patched
FreeListto add a parallel size-indexed
BTreeSet<(len, index)>, replacing the O(F) linear scan infirst_fit
with an O(log F)range((alloc_size, 0)..).next()lookup. All 12
existingFreeListunit tests pass. On device, this **made things
worse**:
Unpatched With FreeList fix Δ Sweep 1 duration 472 ms 823 ms +74% Sweep 2 duration 1,222 ms 2,010 ms +64% Sweep 3 duration (ANR'd before) 4,314 ms new Per-frame mean (steady) ~16 ms ~16 ms no change Guest alloc rate 15K refs/sec 19K refs/sec +27% Two reasons it failed:
first_fitwas not the bottleneck. Frame-mean cost is dominated
by Compose's render/layout/draw work; the alloc-path microseconds
we shaved off didn't manifest in any observable metric.Faster allocs made the leak worse. With cheaper allocs, the
Kotlin/Wasm guest churned throughsuspendCoroutinecycles faster,
soNaccumulated faster between sweeps. Each subsequent sweep had
more to walk and more to deallocate. The per-dealloc overhead
from maintaining a second data structure (one extra BTreeSet op per
add_block/remove_block/update_block_len) then compounded
overN+ deallocations per sweep.The lesson: any allocator-side optimization must be paired with a
sweep-side fix, otherwise it accelerates the pathology. Posting this
finding in case it's useful — happy to share the patch diff if you want
to evaluate it for other workloads where alloc cost is the
bottleneck.What would help
In order of e
[message truncated]
harryzz added the bug label to Issue #13403.
fitzgen commented on issue #13403:
Hello, I appreciate the bug report and will strive to investigate it soon, but can you please avoid dumping AI-generated walls of text going forward? By not reviewing the LLM output and summarizing it / including just the necessary information yourself first before reporting, you are thrusting that work and effort upon Wasmtime's maintainers, which breaks the open-source social contract. Read the Bytecode Alliance AI tool usage policy for more details. Thanks!
fitzgen added the wasm-proposal:gc label to Issue #13403.
fitzgen commented on issue #13403:
@harryzz can you provide the exact Wasmtime invocation you used with
wart-leak-repro.wasmto reproduce the increasing sweep times? I ran the following and it never allocated enough to trigger a GC:wasmtime run -Wgc,function-references,exceptions -Ccollector=drc -Wunknown-imports-default=y wart-leak-repro.wasm
harryzz commented on issue #13403:
@fitzgen first sorry for AI generated post. To the question: standalone can't run, need host function.
I have attached new self driven wasm file. Also repo with kotlin code updated if you need for reference.regards,
ZahariP.S. there short answer from AI if you need :)
<!-- Failed to upload "wart-leak.repro.tgz" -->
<!-- Failed to upload "wart-leak.repro.tgz" -->
▎ Sorry, that's on me — the .wasm I attached doesn't reproduce standalone. It was a WASI component whose suspend loop was driven by an external render-frame export, so a plain wasmtime run calls main() once,
▎ which parks at the first suspendCoroutine and returns — nothing accumulates.
▎
▎ I've replaced it with a self-driving build (attached: wart-leak-repro.wasm) — main() now runs the suspend/resume cycle itself in an unbounded loop, no host imports, no component model. Your exact invocation
▎ reproduces it:
▎
▎ wasmtime run -Wgc,function-references,exceptions -Ccollector=drc wart-leak-repro.wasm
▎
▎ RSS climbs ~100 MB/s as garbage accumulates with no sweep, plateaus at the 4 GB wasm32 ceiling within ~50 s, then enters the steady-state sweep cycle (the tick # counter keeps advancing past the plateau).
▎ With 01-instrumentation.patch applied you get the per-sweep (N, F, dur) trajectory. The ~70-line source is Main.kt — pure suspendCoroutine, zero dependencies.
fitzgen commented on issue #13403:
@harryzz can you verify whether https://github.com/bytecodealliance/wasmtime/pull/13422 fixes the excessively long sweeps for you? FWIW, I wasn't able to repro your sweep times after ten minutes of running the test case on my laptop, but you mention Android in this issue so I assume your hardware is pretty different from mine.
harryzz commented on issue #13403:
@fitzgen verified #13422 both on linux x86 (with reproduce test) and Android armv8 (heavy load: compose material3 kotlin library). It works just fine, but a new problem has appeared - a brief description from the AI:
▎ Tested #13422 on the Android device. It fixes the sweep-growth — sweeps stay flat at ~1 ms. But the ANR returned via a different path: the inline force_gc fires very frequently on our Kotlin/Wasm+Compose ▎ guest, and each GC's root scan (trace_vmctx_roots → RegisteredType::root/from_parts, per GC-typed global) is expensive enough that the cumulative cost blocks the render thread > 5 s. ANR stack: force_gc → ▎ do_gc → trace_vmctx_roots → Global::trace_root → RegisteredType::root. So the trigger frequency × per-GC root-scan cost is the new bottleneck — the trigger may be too aggressive for a guest with many GC ▎ globals, or trace_vmctx_roots is too costly to run at that rate.hope this helps. If more information is needed, I can provide it.
fitzgen commented on issue #13403:
@harryzz can you please file a new issue for that without the verbatim AI text (which is referencing a bunch of stuff unrelated to Wasmtime, eg I don't even know what an ANR is, it isn't a Wasmtime thing, and other stuff implicit in your use case that we don't have context for outside of your project) and instead provide (briefly) a standalone test case, expected results, and actual results, similar to this template? Thanks.
fitzgen closed issue #13403:
Summary
On a long-running Kotlin/Wasm + WasmGC guest (~9 MB/s steady allocation
rate driven bysuspendCoroutine), per-Store::gccost grows
super-linearly over time. After enough sustained use, a single sweep
exceeds 5 s of host-thread time, triggering Android's input-dispatch
ANR. We patched wasmtime to log the over-approximation list sizeN,
free-block countF, and sweep duration at every gc; **the trajectory
is reproducible and the scaling is clear**: sweep duration roughly
doubles every few minutes of typical UI interaction, driven primarily
byN's linear growth.The root cause is architectural: nothing in wasmtime auto-triggers
gc, soN(over-approximated-stack-roots list) grows monotonically
between sweeps, while sweep itself is O(N) with poor cache locality.
Embedders that allocate at a steady rate and letNgrow eventually
hit unworkable pause times.This is distinct from #9701 / PR #10401. That fix was about
transitive ref-decrementing (correctness). We're running on top of it,
on 44.0.1. The remaining issue is pure scaling cost in the sweep.Environment
- wasmtime 44.0.1, features =
["component-model"], sync host calls- Target:
aarch64-linux-android, AOT (--wasm gc --wasm function-references --wasm exceptions)- Host: single-threaded
Storeon the render thread (winit android-native-activity)- Guest: Kotlin/Wasm 2.4.0-RC compiled to a wasm-component
- Device: Pixel 2 XL / Android 15 / API 35
Reproducer
Minimal: https://codeberg.org/harryzz/wart-leak-repro
A bare
suspendCoroutineloop driven by per-frame host calls — no
Compose, no UI. Reproduces the underlying allocation pattern at ~9 MB/s
WasmGC-heap growth, OOM in ~6:37 on a Pixel 2 XL with no manual
Store::gctriggers.The full Compose-UI guest (which we actually run) drives a much larger
retained live set per gc (Composition tree, Snapshot records,
LaunchedEffect jobs, Recomposer invalidation list). The
super-linear-growth symptom is visible there but not in the minimal
repro, because the linear-scan cost is dominated by fragmentation,
which Compose's diverse object sizes produce far better than a tight
loop.Root cause — sweep is O(
N) andNgrows unbounded between gcs
crates/wasmtime/src/runtime/vm/gc/enabled/drc.rs, lines 124-128:/// The head of the over-approximated-stack-roots list. /// /// Note that this is exposed directly to compiled Wasm code through the /// vmctx, so must not move. over_approximated_stack_roots: Box<Option<VMGcRef>>,The over-approximation set is a singly-linked list threaded through
thenext_over_approximated_stack_rootfield in each object's
VMDrcHeader(line 645).sweep(lines 524-618) walks the entire
list every gc:
Unmarked entries → spliced out, dec_ref'd via
dec_ref_and_maybe_dealloc, possibly deallocated through
FreeList::dealloc(O(log F)).Marked entries → stay in the list across gcs (lines 578-588).
So
|list|at sweep time = (refs currently on the precise wasm stack)
- (refs newly borrowed since last sweep). For a steady-state
high-allocation guest with no explicitStore::gccalls, this
monotonically grows. Each entry'sVMDrcHeaderis ~40 B, scattered
across the GC heap → walking is memory-bandwidth-bound (cache miss
per entry on a 95 MB working set atN= 2.4M).Cost per gc: O(
N) linked-list walk, plus N × O(log F) for the
deallocation phase, plus PR #10401's transitive dec-ref. BothNand
Fgrow over time → super-linear total cost.There is no automatic scheduling — wasmtime's
grow_or_collect_gc_heap
fallback path (the only auto-trigger) only fires on memory.grow
failure. An embedder that returnsOk(true)from
ResourceLimiter::memory_growingnever sees a gc unless it calls
Store::gcitself. Embedders that do return false on threshold
also get the cascade-on-recomposition behavior described below.Secondary —
FreeList::first_fitis O(F), but not the dominant cost
crates/wasmtime/src/runtime/vm/gc/enabled/free_list.rs, lines 180-195:fn first_fit(&mut self, alloc_size: u32) -> Option<(u32, u32)> { let (&block_index, &block_len) = self .free_block_index_to_len .iter() // ← walks BTreeMap in key order .find(|(_idx, len)| **len >= alloc_size)?; ... }The free-block map is keyed by block index (address), not by size.
iter().find(...)walks entries in address order returning the
lowest-address block that fits. O(F) per allocation.We initially thought this was the dominant cost. It isn't — see the
"What we tried" section. In our workload, frame-level allocation cost
is dwarfed by Compose's render/layout/draw work; switching to a size-
indexed O(log F) lookup didn't move the steady-state frame mean. The
sweep itself is the load-bearing cost, and replacingfirst_fit
without also addressing the sweep makes things worse (see below).Direct measurements from a patched wasmtime build
I patched 44.0.1 to log
N(over-approximation list size),F
(free-block count), sweep duration, and freed bytes at every sweep.
Three consecutive sweeps during a 45-min soak with active
scrolling/tapping in a Material3 LazyColumn:11:33:56 sweep 1: dur=478 ms, N=1,223,135, F:1→37,269 11:41:52 sweep 2: dur=1248 ms, N=2,333,452, F:9,966→67,394 12:18:20 sweep 3: dur=3000 ms, N=4,585,421, F:29,538→67,394
Sweep 1 Sweep 2 (+8 min) Sweep 3 (+37 min) N1.22M 2.33M 4.59M F_after37,269 67,394 67,394 Sweep duration 478 ms 1,248 ms 3,000 ms Per-entry sweep cost 0.39 μs 0.53 μs 0.65 μs Bytes reclaimed 56 MB 116 MB 249 MB All three
keptcounts were 0 — sweeps fired from a host-side
Store::gc(None)invoked between frames, so the precise-stack-roots
set was empty at sweep time.Notable observations:
Fjumps to 37K on the very first sweep. Sweep 1 deallocates
1.2M scattered objects; the BTreeMap merge-with-neighbors logic in
FreeList::dealloconly coalesces immediately address-adjacent
blocks. With live retained objects sandwiching dead ones, ~37K of
the freed blocks survive as discrete entries. From that point on,
every guest allocation pays an O(37K)first_fitwalk.
Fpartially reverses between sweeps (37K → 10K before sweep
2; 67K → 30K before sweep 3) — the allocator does merge some
blocks as it satisfies new allocations. But each new sweep then
re-bloatsF. Net direction is monotonically up over the soak.Per-entry sweep cost is rising monotonically (0.39 → 0.53 →
0.65 μs/entry, +67% over 45 min). At N=4.6M the over-approximation
linked list's 40-byte headers occupy ~180 MB of GC heap scattered
across distant addresses. Walking them is bandwidth-bound; the
working set has clearly blown L1/L2 and is likely thrashing L3 too.Sweep duration grows faster than N — N grows ~2× per
inter-sweep interval, but duration grew ~3× from sweep 2 to 3 even
though that gap was 4× longer than sweep 1 to 2. The two effects
(more entries to walk, each entry costs more) compound.Extrapolating the trend: at the observed growth rate, sweep durations
push past 5 s within an hour or two. A single sweep at that scale, or
a small cascade of them inside onerender_frame, exceeds Android's
5 s input-dispatch ANR threshold. We've observed this in practice in
multiple soaks (one ANR at ~48 min on the unpatched build, another at
~10 min on a heavier-interaction soak with the instrumentation
adding small per-sweep overhead). Typical ANR cascade pattern in
logcat:last frame log +0.0 s wart-profile: frame N t+X ms win[N=60 ... ] +0.0 s InputDispatcher: "spent 2005 ms processing MotionEvent" +1.0 s InputDispatcher: "spent 2350 ms processing MotionEvent" +4.0 s InputDispatcher: "spent 2247 ms processing MotionEvent" +5.0 s ANR — "Waited 5000 ms for MotionEvent"Mean steady-state frame time stays flat at ~16 ms outside of
sweep/cascade windows — the problem isn't the normal allocation path,
it's the sweep + the post-sweep allocation cost driven by elevated F.What we tried that did NOT work
We patched
FreeListto add a parallel size-indexed
BTreeSet<(len, index)>, replacing the O(F) linear scan infirst_fit
with an O(log F)range((alloc_size, 0)..).next()lookup. All 12
existingFreeListunit tests pass. On device, this **made things
worse**:
Unpatched With FreeList fix Δ Sweep 1 duration 472 ms 823 ms +74% Sweep 2 duration 1,222 ms 2,010 ms +64% Sweep 3 duration (ANR'd before) 4,314 ms new Per-frame mean (steady) ~16 ms ~16 ms no change Guest alloc rate 15K refs/sec 19K refs/sec +27% Two reasons it failed:
first_fitwas not the bottleneck. Frame-mean cost is dominated
by Compose's render/layout/draw work; the alloc-path microseconds
we shaved off didn't manifest in any observable metric.Faster allocs made the leak worse. With cheaper allocs, the
Kotlin/Wasm guest churned throughsuspendCoroutinecycles faster,
soNaccumulated faster between sweeps. Each subsequent sweep had
more to walk and more to deallocate. The per-dealloc overhead
from maintaining a second data structure (one extra BTreeSet op per
add_block/remove_block/update_block_len) then compounded
overN+ deallocations per sweep.The lesson: any allocator-side optimization must be paired with a
sweep-side fix, otherwise it accelerates the pathology. Posting this
finding in case it's useful — happy to share the patch diff if you want
to evaluate it for other workloads where alloc cost is the
bottleneck.What would help
In order of e
[message truncated]
Last updated: Jun 01 2026 at 09:49 UTC