gaaraw opened issue #13295:
Describe the bug
Repeated ref.func in a tiny hot loop appears to be much slower in Wasmtime than in Wasmer Cranelift. After reduction, I got a minimal reproducer that preserves essentially the same gap, plus two close controls where the gap disappears. The evidence points specifically to the per-iteration ref.func path, not to the loop scaffold or the reference sink.
Primary reproducer:
- primary_reproducer_ref_func_hotloop.wat
Supporting controls:
- supporting_control_ref_func_hoisted.wat
- supporting_control_ref_null_hotloop.wat
Test Case
Primary reproducer loop body:
    ref.func $f0
    global.set $g0
    local.get $i
    i64.const 1
    i64.sub
    local.tee $i
    i64.const 0
    i64.ne
    br_if $body
The reduced reproducer uses:
- trip count: 2^30
- one declared function $f0
- one mutable funcref global sink
- one declarative element entry so that ref.func remains valid
Matched controls:
- same loop shape, but use a hoisted prebuilt non-null reference via global.get $g0
- same loop shape, but replace ref.func with ref.null func
Steps to Reproduce
- Build the primary testcase:
    wat2wasm primary_reproducer_ref_func_hotloop.wat -o primary_reproducer_ref_func_hotloop.wasm
- Warm up once:
    wasmtime primary_reproducer_ref_func_hotloop.wasm
- Measure runtime:
    perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_ref_func_hotloop.wasm
Run the same flow on the two supporting controls above.
For comparison with Wasmer Cranelift:
    wasmer run primary_reproducer_ref_func_hotloop.wasm
    perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_ref_func_hotloop.wasm
Expected and actual Results
Primary reproducer and close controls
| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| primary_reproducer_ref_func_hotloop | 2.7668 | 38.2027 | 13.81x |
| supporting_control_ref_func_hoisted | 0.5029 | 0.5266 | 1.05x |
| supporting_control_ref_null_hotloop | 0.4083 | 0.4468 | 1.09x |
Observed pattern:
- the primary reproducer is dramatically slower in Wasmtime than in Wasmer Cranelift;
- hoisting the non-null function reference out of the loop collapses the gap;
- replacing ref.func with ref.null func also collapses the gap.
This makes the trigger look very specifically tied to repeated hot-loop ref.func.
Family-level consistency
The original generated ref.func seeds showed the same shape:
| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| ref_func_1 | 2.9594 | 38.5392 | 13.02x |
| ref_func_2 | 2.7539 | 38.8498 | 14.11x |
A related mixed testcase from the ref.is_null family also showed the same gap only when the loop used ref.func to create the non-null input each iteration:
| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| ref_is_null_2 (ref.func + ref.is_null) | 2.9624 | 39.3673 | 13.29x |
| hoisted non-null control for ref.is_null | 0.6419 | 0.6855 | 1.07x |
So the ref.is_null outlier seems to be explained by the same repeated-ref.func trigger, rather than by ref.is_null itself.
Versions and Environment
- Wasmtime version: wasmtime 41.0.0 (4898322a4 2025-12-18)
- Host OS: Ubuntu 22.04.5 LTS x64
- Architecture: x86_64
- CPU: 12th Gen Intel(R) Core(TM) i7-12700
Extra Info
I also checked Wasmtime CLIF for the reduced reproducer to make sure the benchmark is still alive.
The hot loop still performs a per-iteration builtin call:
    v6 = call fn0(v0, v32)
    store notrap aligned table v6, v0+96
where fn0 is wasmtime_builtin_ref_func.
That builtin still performs a deeper indirect runtime call with extra frame/return-address bookkeeping:
    v3 = get_frame_pointer.i64
    store notrap aligned v3, v2+40
    v4 = get_return_address.i64
    store notrap aligned v4, v2+48
    v7 = call_indirect sig0, v6(v0, v1)
In contrast, the hoisted control's hot loop is just a load/store path without wasmtime_builtin_ref_func in the loop:
    v5 = load.i64 notrap aligned table v0+96
    store notrap aligned table v5, v0+112
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern:
- repeated ref.func in a tiny hot loop;
- slowdown remains after reduction to a minimal reproducer;
- the gap disappears when ref.func is removed from the loop;
- the gap also disappears for repeated ref.null func.
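As a rough sanity check on the numbers reported above, the sketch below recomputes the ratios from the results table and estimates the extra time per loop iteration attributable to ref.func. It assumes (this is not stated in the issue) that the hoisted control is a fair baseline for loop scaffold plus process startup and compilation, so subtracting it isolates the per-iteration cost. The function and variable names are illustrative only.

```python
# Timings in seconds, copied from the results table in the issue.
TRIP_COUNT = 2**30  # loop trip count stated in the issue

timings = {
    "primary_reproducer_ref_func_hotloop": {"wasmer_cranelift": 2.7668, "wasmtime": 38.2027},
    "supporting_control_ref_func_hoisted": {"wasmer_cranelift": 0.5029, "wasmtime": 0.5266},
    "supporting_control_ref_null_hotloop": {"wasmer_cranelift": 0.4083, "wasmtime": 0.4468},
}


def ratio(name):
    """Wasmtime time divided by Wasmer Cranelift time for one testcase."""
    t = timings[name]
    return t["wasmtime"] / t["wasmer_cranelift"]


def extra_ns_per_iteration(engine, baseline="supporting_control_ref_func_hoisted"):
    """Estimated extra nanoseconds per iteration relative to the baseline control.

    Assumption: the baseline control captures everything except the
    per-iteration ref.func work (loop scaffold, startup, compilation).
    """
    primary = timings["primary_reproducer_ref_func_hotloop"][engine]
    control = timings[baseline][engine]
    return (primary - control) / TRIP_COUNT * 1e9


if __name__ == "__main__":
    for name in timings:
        print(f"{name}: {ratio(name):.2f}x")
    for engine in ("wasmer_cranelift", "wasmtime"):
        print(f"{engine}: ~{extra_ns_per_iteration(engine):.1f} ns extra per iteration")
```

Under that assumption, the estimate comes to roughly 35 ns per iteration for Wasmtime and roughly 2 ns for Wasmer Cranelift, which is consistent with an out-of-line call on every iteration rather than a slow loop scaffold.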
gaaraw added the bug label to Issue #13295.
gaaraw added the fuzz-bug label to Issue #13295.
cfallin commented on issue #13295:
Hi @gaaraw,
A few things:
- It would be great for you to review the AI Tool Policy of the Bytecode Alliance (hence this project). Alex already noted in your previous issue (#13272) that "There's a huge amount of information to sift through when the issue is more-or-less memory.copy is faster in one engine than another" and essentially the same thing is true here too.
- Alex also already noted that it's not expected for runtimes to have exactly the same performance profile. In the spirit of the "talk to us first" guidance in our documentation for external fuzzing campaigns, I'd like to ask: is there a hidden assumption/hypothesis in your work that all engines should converge to a single canonical, fast implementation, and anything else is a bug?
In this case, what I think you're running into is that we have lazy initialization of funcrefs in tables, which we do in a libcall. We could optimize differently by not doing that, but the payoff is that our instantiation is extremely fast because the table contents need not be initialized eagerly. What I'm getting at here is that different engines may choose different implementation strategies that prioritize one dimension or another of performance, and so this "performance fuzzbug" may not really even be considered a bug.
So: it'd be useful to hear your philosophy and purpose behind this fuzzing campaign. Is it to find deltas and raise questions that may point to inefficiencies we can fix? That's fine if so -- but I would perhaps go about it a bit differently. First, don't call it a "fuzzbug" (that has a generally-accepted meaning that is pretty different than what you have here); second, don't bombard us with overly verbose descriptions; third, do some more analysis on the tradeoffs, and come into this with more curiosity about "why", then we can have an interesting discussion. Thanks!
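The tradeoff described in the comment above can be illustrated with a toy model. The sketch below is not Wasmtime's implementation: the class names and the init_work and slow_path_calls counters are hypothetical, and the out-of-line helper path only stands in for the libcall mentioned in the comment. It contrasts building every function reference at instantiation with deferring that work until a reference is requested.

```python
class EagerFuncRefs:
    """Toy model: build every function reference at instantiation time."""

    def __init__(self, num_funcs):
        # Instantiation cost grows with the number of functions.
        self.refs = [("funcref", i) for i in range(num_funcs)]
        self.init_work = num_funcs
        self.slow_path_calls = 0

    def ref_func(self, index):
        # Hot path is a plain load; no helper call is needed.
        return self.refs[index]


class LazyFuncRefs:
    """Toy model: defer building references until they are requested."""

    def __init__(self, num_funcs):
        # Instantiation builds no references up front.
        self.refs = [None] * num_funcs
        self.init_work = 0
        self.slow_path_calls = 0

    def ref_func(self, index):
        # Assumption: every access goes through an out-of-line helper
        # (a stand-in for the libcall), which initializes on first use.
        self.slow_path_calls += 1
        if self.refs[index] is None:
            self.refs[index] = ("funcref", index)
        return self.refs[index]


if __name__ == "__main__":
    for model in (EagerFuncRefs(10_000), LazyFuncRefs(10_000)):
        for _ in range(1_000):
            model.ref_func(0)
        print(type(model).__name__, model.init_work, model.slow_path_calls)
```

In this model the eager variant pays for all 10,000 references at instantiation and nothing extra in the loop, while the lazy variant pays nothing at instantiation and takes the helper path on each of the 1,000 accesses. That mirrors the shape of the tradeoff in the comment, not its actual magnitude.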
gaaraw commented on issue #13295:
Thanks — this is very helpful feedback.
You're right on both points: calling these reports "fuzzbugs" is not the right framing here, and my issue writeups have been too verbose.
Also, no — I am not assuming that all engines should converge to one canonical fast implementation, or that a cross-engine delta automatically means one engine is wrong.
What I am trying to do is use cross-runtime deltas as signals, reduce them to smaller execution shapes, and then figure out whether the result looks more like a missed optimization, an implementation tradeoff, or a benchmark artifact.
Your explanation here is useful exactly for that reason. If this ref.func behavior is tied to lazy funcref initialization through a libcall path, with the payoff being faster instantiation, then this case is better understood as a tradeoff-revealing performance anomaly than as a straightforward bug report.
So I think the lesson for me is to present these cases with more curiosity and less bug-like framing: tighter reports, clearer controls, and more discussion of plausible tradeoffs up front.
Thanks again — I appreciate the clarification and the candid guidance.
alexcrichton removed the bug label from Issue #13295.
alexcrichton removed the fuzz-bug label from Issue #13295.
alexcrichton added the performance label to Issue #13295.
Last updated: Jun 01 2026 at 09:49 UTC