jrmuizel opened issue #11141:
Test Case
binary-trees-wasi-opt.wasm.zip
Steps to Reproduce
- Run
```
time wasmtime run -W gc=y binary-trees-wasi-opt.wasm
```

Actual Results

```
stretch tree of depth 21 check: 4194303
1048576 trees of depth 4 check: 32505856
262144 trees of depth 6 check: 33292288
65536 trees of depth 8 check: 33488896
16384 trees of depth 10 check: 33538048
4096 trees of depth 12 check: 33550336
1024 trees of depth 14 check: 33553408
256 trees of depth 16 check: 33554176
64 trees of depth 18 check: 33554368
16 trees of depth 20 check: 33554416
long lived tree of depth 20 check: 2097151

real	2m13.350s
user	2m12.736s
sys	0m0.454s
```

Expected Results
It should run much faster.
V8: 1.2s
Firefox: 4.8s
Safari: 5.9s

Extra Info
The benchmark is:
```
class TreeNode {
    TreeNode left;
    TreeNode right;

    TreeNode bottomUpTree(int depth) {
        if (depth > 0) {
            TreeNode leftChild = this.bottomUpTree(depth - 1);
            TreeNode rightChild = this.bottomUpTree(depth - 1);
            TreeNode node = new TreeNode();
            node.left = leftChild;
            node.right = rightChild;
            return node;
        } else {
            TreeNode node = new TreeNode();
            node.left = null;
            node.right = null;
            return node;
        }
    }

    int itemCheck() {
        if (this.left == null) {
            return 1;
        } else {
            return 1 + this.left.itemCheck() + this.right.itemCheck();
        }
    }
}

void main() {
    int minDepth = 4;
    int n = 20; // Increase n to make it more interesting
    int maxDepth;
    if (minDepth + 2 > n) {
        maxDepth = minDepth + 2;
    } else {
        maxDepth = n;
    }
    int stretchDepth = maxDepth + 1;

    TreeNode stretchTree = new TreeNode();
    TreeNode temp = stretchTree.bottomUpTree(stretchDepth);
    int check = temp.itemCheck();
    print("stretch tree of depth ");
    print(stretchDepth);
    print(" check: ");
    println(check);

    TreeNode longLivedTree = new TreeNode();
    longLivedTree = longLivedTree.bottomUpTree(maxDepth);

    int depth = minDepth;
    while (depth <= maxDepth) {
        int iterations = 1 << (maxDepth - depth + minDepth);
        check = 0;
        int i = 1;
        while (i <= iterations) {
            TreeNode tree = new TreeNode();
            tree = tree.bottomUpTree(depth);
            check = check + tree.itemCheck();
            i = i + 1;
        }
        print(iterations);
        print(" trees of depth ");
        print(depth);
        print(" check: ");
        println(check);
        depth = depth + 2;
    }

    print("long lived tree of depth ");
    print(maxDepth);
    print(" check: ");
    println(longLivedTree.itemCheck());
}
```

compiled with https://github.com/jrmuizel/wasm-lang1
There's a browser version here: https://zingy-sorbet-48e9a5.netlify.app/binary-trees-opt.html
jrmuizel added the bug label to Issue #11141.
alexcrichton added the wasm-proposal:gc label to Issue #11141.
fitzgen commented on issue #11141:
Thanks for filing this issue with a test case and STR, @jrmuizel! Very appreciated.
First off, in general, when benchmarking Wasm execution in Wasmtime, we want to make sure that we aren't measuring compile time. Unlike browsers, Wasmtime doesn't start with a baseline compiler or interpreter and then dynamically tier up to an optimized compiler; we just (depending on the configuration) compile with our optimizing compiler immediately and then when the whole module is compiled, we start running it. Wasmtime is really more like an AOT compiler than a browser JIT, even though it can technically compile just-in-time. So given all that, in the future, it would be good to do something like this:
```
$ wasmtime compile binary-trees-wasi-opt.wasm -o binary-trees-wasi-opt.cwasm
$ time wasmtime run --allow-precompiled binary-trees-wasi-opt.cwasm
```

(And maybe replace `time` with `hyperfine` as necessary.)
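For reference, the same precompile-then-run split through the Rust embedding API might look roughly like the sketch below; the file names and the timing harness are placeholders, and it assumes the `wasmtime` and `anyhow` crates.

```rust
use wasmtime::{Config, Engine, Module};

fn main() -> anyhow::Result<()> {
    // Enable the Wasm GC proposal, roughly equivalent to `-W gc=y` on the CLI.
    let mut config = Config::new();
    config.wasm_gc(true);
    let engine = Engine::new(&config)?;

    // Precompile outside of any timed region (like `wasmtime compile`).
    let wasm = std::fs::read("binary-trees-wasi-opt.wasm")?;
    let cwasm = engine.precompile_module(&wasm)?;

    // Only the deserialize/instantiate/run portion should be measured
    // (like `wasmtime run --allow-precompiled`).
    let start = std::time::Instant::now();
    let _module = unsafe { Module::deserialize(&engine, &cwasm)? };
    // ... instantiate and invoke the benchmark entry point here ...
    println!("post-compile time: {:?}", start.elapsed());
    Ok(())
}
```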
Second, just to set expectations, you should know that we have thus far focused only on correctness and haven't put much effort into optimizing the GC implementation yet, neither at the level of codegen details nor at the algorithmic level. For example, the default DRC collector uses a fixed-size bump chunk in part of its implementation, and whenever that chunk gets full, we trigger a GC. We don't double that bump chunk based on the GC heap's size or anything like that, and I suspect this could lead to somewhat pathological behavior. See also https://github.com/bytecodealliance/wasmtime/issues/9351 for a collection of GC optimizations that we haven't investigated yet.
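To see why a fixed-size bump chunk can become pathological, here is a standalone counting sketch (not Wasmtime's DRC code): a fixed chunk triggers a collection every `capacity / object_size` allocations no matter how much has been allocated overall, while a chunk that doubles on each collection only triggers logarithmically many.

```rust
// Standalone illustration, not Wasmtime's DRC collector: count how many GCs a
// fixed-size bump chunk triggers versus one that doubles when it fills up.
struct BumpChunk {
    capacity: usize,
    used: usize,
    grow: bool,
    gc_count: usize,
}

impl BumpChunk {
    fn alloc(&mut self, size: usize) {
        if self.used + size > self.capacity {
            self.gc_count += 1; // chunk is full: trigger a collection
            if self.grow {
                self.capacity *= 2; // grow-on-full amortizes future collections
            }
            self.used = 0;
        }
        self.used += size;
    }
}

fn main() {
    let mut fixed = BumpChunk { capacity: 1 << 16, used: 0, grow: false, gc_count: 0 };
    let mut doubling = BumpChunk { capacity: 1 << 16, used: 0, grow: true, gc_count: 0 };
    for _ in 0..10_000_000 {
        fixed.alloc(32);
        doubling.alloc(32);
    }
    println!("fixed chunk GCs: {}, doubling chunk GCs: {}", fixed.gc_count, doubling.gc_count);
}
```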
Finally, if the test case doesn't allocate too much, it is also worth investigating whether using the null collector has a different performance profile than the default DRC collector. If the test case is hitting the potential pathological behavior previously described, then using the null collector will avoid that. That's of course not necessarily a solution for users, but is very useful information to include in issues. Basically, this helps us diagnose whether the issue is in our codegen or the collector implementation itself.
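For anyone reproducing this A/B comparison from a Rust embedding rather than the CLI, a sketch of switching collectors follows; note that `Config::collector` and the `Collector::Null`/`Collector::DeferredReferenceCounting` names are assumptions about the current embedding API and may need adjusting for your Wasmtime version.

```rust
use wasmtime::{Collector, Config, Engine};

// Assumed embedding API: the `Config::collector` knob and the `Collector`
// enum variants below may be named differently in your Wasmtime version.
fn engine_with_null_collector() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    config.wasm_gc(true);
    // Swap to Collector::DeferredReferenceCounting for the default DRC collector.
    config.collector(Collector::Null);
    Engine::new(&config)
}
```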
Anyways, that's all just some hopefully helpful background and context for you. I will investigate what's going on here in more detail in a little bit.
fitzgen commented on issue #11141:
The benchmark doesn't complete with the null collector, but the output before it OOMs comes much more quickly, so this is probably related to something in the DRC collector itself.
jrmuizel commented on issue #11141:
Pre-compiling doesn't seem to help:
```
$ time wasmtime run --allow-precompiled -W gc binary-trees-wasi-opt.cwasm
stretch tree of depth 21 check: 4194303
1048576 trees of depth 4 check: 32505856
262144 trees of depth 6 check: 33292288
65536 trees of depth 8 check: 33488896
16384 trees of depth 10 check: 33538048
4096 trees of depth 12 check: 33550336
1024 trees of depth 14 check: 33553408
256 trees of depth 16 check: 33554176
64 trees of depth 18 check: 33554368
16 trees of depth 20 check: 33554416
long lived tree of depth 20 check: 2097151

real	2m12.869s
user	2m12.303s
sys	0m0.467s
```

The test case allocates a lot. It's basically just this: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-javavm-2.html
The other wasm implementations spend most of their time in GC as well.
Here's a profile of wasmtime: https://share.firefox.dev/4k8jCy0
and the SpiderMonkey bug for the slowness compared to Chrome:
https://bugzilla.mozilla.org/show_bug.cgi?id=1971860
fitzgen commented on issue #11141:
https://github.com/bytecodealliance/wasmtime/pull/11144 fixes the bump chunk issue I was talking about, and cuts the test case's runtime in half for me
fitzgen commented on issue #11141:
15% of the remaining time is spent cloning `Arc`s inside `MmapBase` and `MmapOffset` just to get the base pointer for a `VMMemoryDefinition` :face_palm:
fitzgen commented on issue #11141:
> 15% of the remaining time is spent cloning `Arc`s inside `MmapBase` and `MmapOffset` just to get the base pointer for a `VMMemoryDefinition` :face_palm:

And another 15% of that time is spent dropping those `Arc`s that we just cloned.

WIP patch for this in https://github.com/bytecodealliance/wasmtime/pull/11148
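As background on why `Arc` clones show up in a profile at all: each clone and each drop performs an atomic reference-count update, which adds up quickly on a per-memory-access hot path. A standalone sketch (not Wasmtime's actual types) of clone-per-access versus borrow-per-access:

```rust
use std::sync::Arc;
use std::time::Instant;

// Standalone illustration (not Wasmtime's MmapBase/VMMemoryDefinition types):
// cloning an Arc just to read the value behind it pays for an atomic increment
// on clone and an atomic decrement on drop; borrowing pays neither.
fn main() {
    let base = Arc::new(0x1000usize);
    const N: usize = 50_000_000;

    let start = Instant::now();
    let mut sum = 0usize;
    for _ in 0..N {
        let cloned = Arc::clone(&base); // atomic refcount increment
        sum = sum.wrapping_add(*cloned);
        // `cloned` is dropped here: atomic refcount decrement
    }
    println!("clone + drop per access: {:?} (sum = {sum})", start.elapsed());

    let start = Instant::now();
    let mut sum = 0usize;
    for _ in 0..N {
        sum = sum.wrapping_add(*base); // plain deref, no refcount traffic
    }
    println!("borrow per access:       {:?} (sum = {sum})", start.elapsed());
}
```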
alexcrichton commented on issue #11141:
In the end, @fitzgen this benchmark (or something similar?) might be a good addition to sightglass perhaps
fitzgen commented on issue #11141:
> In the end, @fitzgen this benchmark (or something similar?) might be a good addition to sightglass perhaps
Yeah I was thinking the same, although we'd need to get its runtime a little lower probably for that to be reasonable.
fitzgen commented on issue #11141:
See also https://github.com/bytecodealliance/wasmtime/issues/11159
fitzgen commented on issue #11141:
See also https://github.com/bytecodealliance/wasmtime/issues/11162
fitzgen commented on issue #11141:
See also https://github.com/bytecodealliance/wasmtime/issues/11163
fitzgen commented on issue #11141:
See also https://github.com/bytecodealliance/wasmtime/issues/11164