gfx opened issue #13470:
Upgrading Wado from wasmtime 44 to 45 (default `drc` collector) made our GC-heavy benchmarks ~2x slower. I bisected it to #12942.
| workload (drc, -O2) | 44 | 45 | ratio |
|---|---|---|---|
| json/canada | 112 ms | 229 ms | 2.04 |
| sqlite_parse | 618 ms | 1380 ms | 2.23 |
| syntax_highlight | 971 ms | 1962 ms | 2.02 |

Pure-compute workloads (count_prime, mandelbrot, sieve) are unaffected.
The heuristic decides collect-vs-grow purely on `live < capacity/2`, with no notion of allocation rate. For high-allocation / small-live workloads the live set stays small, so a collection always frees enough and the heap never grows past its initial size: it collects on every heap-fill (GC thrashing). This is, ironically, close to the "lots of temporary garbage" case the PR cites as motivation.
`Collector::Null` runs json/canada in ~15 ms vs ~148 ms for drc, so the cost is collection; `gc_heap_reservation` (even 1 GiB) has no effect.
I'd suggest reverting it for now: the grow-to-the-limit behavior in 44 was a reasonable default (memory is still bounded by the GC heap size limit), and the new heuristic regresses a common workload class.
Then, before re-landing a heuristic, it might help to add a GC throughput benchmark to wasmtime's CI so this kind of regression is caught up front. Here is a self-contained reproducer (a Wado-compiled `wasi:cli/command` component that allocates many short-lived GC objects) that could serve as a reference: https://gist.github.com/gfx/133a82db2817c160da6cbc221b0a4329. On it, wasmtime 44 ≈ 1005 ms vs 45 ≈ 1950 ms.
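The thrashing described in this report can be illustrated with a toy model. The sketch below is not Wasmtime's implementation: `Policy`, `Outcome`, and `simulate` are hypothetical names, the heap is counted in abstract units, growth is assumed to double the capacity, and every collection is assumed to reclaim all garbage down to a fixed live set.

```rust
/// What the toy heap does when it is full and another allocation arrives.
/// Both variants are simplified models of the behaviors described in the thread.
#[derive(Clone, Copy)]
enum Policy {
    /// 44-style (assumed): keep growing, collect only once the limit is reached.
    GrowToLimit,
    /// 45-style (assumed): collect whenever `live < capacity / 2`, grow otherwise.
    CollectUnlessHalfLive,
}

#[derive(Debug, PartialEq)]
struct Outcome {
    collections: u64,
    final_capacity: u64,
}

/// Runs `allocations` unit-sized allocations against a heap whose live set is
/// constant; everything else is garbage by the time the heap fills up.
fn simulate(policy: Policy, initial: u64, limit: u64, live: u64, allocations: u64) -> Outcome {
    assert!(live < initial && initial <= limit);
    let mut capacity = initial;
    let mut used = live; // the live set is always resident
    let mut collections = 0;
    for _ in 0..allocations {
        if used == capacity {
            // Heap is full: decide between collecting and growing.
            let collect = match policy {
                Policy::GrowToLimit => capacity == limit,
                Policy::CollectUnlessHalfLive => live < capacity / 2 || capacity == limit,
            };
            if collect {
                used = live; // all garbage is reclaimed
                collections += 1;
            } else {
                capacity = (capacity * 2).min(limit); // assumed doubling growth
            }
        }
        used += 1;
    }
    Outcome { collections, final_capacity: capacity }
}
```

With a live set of 8 units, an initial capacity of 64, a limit of 1024, and 10,000 allocations, the grow-to-limit model collects 9 times and ends at capacity 1024, while the half-live model collects 178 times and never leaves capacity 64. The absolute numbers mean nothing for real workloads; the point is only that a small live set pins the heap at its initial size under the second policy.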
cfallin commented on issue #13470:
Thanks for filing an issue with this data!
I think there is a deeper tradeoff here that's missing in the discussion. tl;dr is that I don't think we should revert. But in more detail:
- There is a fundamental tradeoff in garbage collector design between resident-set size and throughput; this is well-known and reproducible e.g. on a JVM by tweaking maximum memory size.
- The original heuristic would grow up until the limit and then collect (simple!), but this limit was by default fairly high. Wasmtime is designed to be used in (among many scenarios) high-concurrency server situations where memory efficiency is important, and throwing a maximal amount of memory at an individual instance for performance may not be the best choice, or feasible. For example, throwing a 128MiB GC heap at an individual instance in a server running thousands of instances, just so that instance can allocate short-lived garbage with a live-set size of a handful of kilobytes, is very wasteful and a nonstarter. Or as I said in #12942, "That behavior optimizes for allocation performance but at the cost of resident memory size -- it is at one extreme end of that tradeoff spectrum."
- The design goal in #12942 was to make RSS O(live set size) rather than O(max GC heap size). That's now the case, and it's an asymptotic bound improvement, and a very serious improvement in RSS when max heap size is significantly larger than typical working-set size (possibly several orders of magnitude better).
Hopefully that makes clear why the change was made. Contrary to the framing above it's not a single-dimensional performance metric with a straightforward regression; it's a tradeoff space and we bought a new asymptotic bound.
I do think we could entertain an alternative (non-default) option that grows unconditionally up to the max heap size then starts collecting; that configuration makes more sense when there is only one instance, and the user has memory to burn.
cc @fitzgen for more thoughts as the main owner of GC (and much more of an expert on these things than me!)
alexcrichton added the wasm-proposal:gc label to Issue #13470.
gfx commented on issue #13470:
Thanks, that explanation makes sense to me.
I understand the motivation for the new default. For high-concurrency or memory-constrained deployments, optimizing RSS relative to the live set seems important, and I can see why the previous grow-until-limit behavior may be undesirable there.
At the same time, I think changing this default is something to be careful about. For latency-sensitive embedders, this can be a clear regression rather than just a different point in the tradeoff space. In our case, the affected paths are roughly 2x slower, which is large enough to be production-impacting.
So my main request would be that this tradeoff should be configurable. The new default may be the right one for some environments, but embedders that have explicitly budgeted memory for an instance should have a supported way to prefer latency/throughput over minimizing RSS.
Longer-term, I also think it would be valuable to have continuous benchmarking around this area that tracks both sides of the tradeoff: RSS/heap growth and throughput/latency on allocation-heavy workloads. That would make changes like this easier to evaluate as intentional tradeoffs rather than surprising regressions after release.
Thanks again for the detailed context.
cfallin commented on issue #13470:
Sure, I think we'd be happy to review a PR to make the heuristic configurable.
Speaking philosophically for a second, re:
> regression rather than just a different point in the tradeoff space
I don't want us to get into the space where current performance is "locked in" forever on every single axis. There are projects that operate like that (e.g., V8 on performance matters, from what I understand), but we are still in the space where we are figuring out the best designs and tradeoffs. And this is absolutely a tradeoff space on both sides: for a workload with say 128KiB of real GC live-set size peak, and 128MiB heap, that is a 500x-reduction in memory requirement to have the adaptive grow-vs-collect heuristic (guaranteed 2x-live worst-case bound).
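One plausible reading of the "500x" figure, assuming it compares the maximum heap size against the worst-case heap under the 2x-live bound, works out as follows. The helper name `reduction_factor` is made up for this illustration.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Ratio between a heap that grows to `max_heap` and one bounded by
/// twice the live set (the worst case under the 2x-live bound).
fn reduction_factor(max_heap: u64, live: u64) -> u64 {
    max_heap / (2 * live)
}
```

Under that assumption, `reduction_factor(128 * MIB, 128 * KIB)` is 512, which is in line with the roughly 500x reduction quoted above.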
For what it's worth, as well, GC is not yet tier-1, so anyone running it in production today does it "at their own risk"; we haven't yet committed to the kind of stability that might change expectations about "regressions after release". (We might soon, but all of this work comes before that change.) And e.g. we recently changed our default collector away from `drc`. It's great and valuable that you're running things in production and gaining experience + feeding it back, but just wanted to make sure that was explicitly said.
fitzgen commented on issue #13470:
+1 to everything Chris said about trade offs (and pretty much everything else).
@cfallin
> I do think we could entertain an alternative (non-default) option that grows unconditionally up to the max heap size then starts collecting; that configuration makes more sense when there is only one instance, and the user has memory to burn.
I think a nice way to do this would be to make the grow-vs-collect ratio's denominator (or the log2 of the denominator) a tunable. Right now the ratio's denominator is `2` (i.e. the ratio is `1/2`: collect if the previous heap size was less than that, grow otherwise), but you could effectively get the old behavior by changing the denominator to `1 << 31` (i.e. making the ratio `1/2147483648`), which would basically always choose growth instead of collection.
This is a nice way to phrase the problem because it wouldn't actually create any new branches to our existing logic.
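A minimal sketch of that idea, assuming the decision compares the live size after the last collection against `capacity / 2^k`. The type `GrowVsCollect` and its field are hypothetical and do not correspond to any existing Wasmtime tunable.

```rust
/// Hypothetical tunable: log2 of the grow-vs-collect ratio's denominator.
/// A value of 1 models the current 1/2 ratio; 31 models the 1/2147483648 ratio.
#[derive(Clone, Copy)]
struct GrowVsCollect {
    log2_denominator: u32,
}

impl GrowVsCollect {
    /// Returns `true` to collect and `false` to grow.
    /// The same comparison serves every setting, so no extra branch is needed.
    fn should_collect(self, live: u64, capacity: u64) -> bool {
        // capacity / 2^k as a shift; a shift of 64 or more is treated as zero.
        let threshold = capacity.checked_shr(self.log2_denominator).unwrap_or(0);
        live < threshold
    }
}
```

With `log2_denominator: 1`, a heap of 64 units with 8 live units collects, and one with 40 live units grows. With `log2_denominator: 31`, a 128 MiB heap has a threshold of zero, so the comparison never selects collection and the heap grows instead; that matches the "basically always choose growth" behavior described above.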
@gfx fwiw you should probably experiment with using the copying collector instead of the DRC collector. It actually collects cycles, is much faster, and is now the default collector on `main`. It is also what we plan on using when enabling Wasm GC by default.
cfallin commented on issue #13470:
> I think a nice way to do this would be to make the grow-vs-collect ratio's denominator (or the log2 of the denominator) a tunable.
Ah, I really like that! @gfx if you want to send a PR for this I'm happy to review it. Otherwise I can throw it on my to-do list and get to it at some point...
gfx commented on issue #13470:
Thanks, that makes sense. I’ll try benchmarking our workloads with the copying collector on main and report back with numbers for both throughput and memory.
I’m also interested in sending a PR for the tunable denominator/log2-denominator approach. The default can remain as-is, while embedders that explicitly want to trade memory for throughput can configure a much smaller collect tendency / old grow-first behavior.
I’ll take a look at where this should be exposed in Wasmtime’s config/tunables.
gfx commented on issue #13470:
Following up with the copying-collector numbers, as promised.
Workload: a syntax highlighter written in Wado that allocates lots of short-lived GC objects per run, i.e. exactly the high-allocation / small-live-set pattern this heuristic regresses. Standalone `wasi:cli/command` component, `-O2`, 100 iterations/run, on the 45.0.0 CLI with `-C collector=…`. Best of 10 runs per metric.
| collector | ms/iter | vs drc | peak RSS |
|---|---|---|---|
| drc | 13.45 | 1.0× | 38.9 MB |
| copying | 2.34 | ~5.7× faster | 38.2 MB |
| null (never collects) | 1.75 | ~7.7× faster | 205 MB (unbounded) |
`copying` is ~5.7× faster than `drc` and nearly matches the `null` throughput ceiling (it removes almost all of the GC overhead the eager-collect heuristic imposes here) at the same peak RSS as `drc`. (The `null` figure confirms this is genuinely high-allocation / small-live-set.)
I see `copying` is already the default on `main`, so this is fully resolved from our side; we'll switch our embedding to `copying`. Thanks for the pointer, @fitzgen.
fitzgen commented on issue #13470:
Glad the copying collector works for you.
> I’m also interested in sending a PR for the tunable denominator/log2-denominator approach. The default can remain as-is, while embedders that explicitly want to trade memory for throughput can configure a much smaller collect tendency / old grow-first behavior.
> I’ll take a look at where this should be exposed in Wasmtime’s config/tunables.
The new tunable would be added somewhere around here:
And then exposed as a `Config` method somewhere around here:
In general, if you just grep around for the existing GC heap tunables, you should see all the places you'd need to wire this up.
Last updated: Jun 01 2026 at 09:49 UTC