wasmtime / issue #13295 <Performance> fuzzbug: Repeated `... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #13295 <Performance> fuzzbug: Repeated `...

Wasmtime GitHub notifications bot (May 06 2026 at 04:10):

Describe the bug

Repeated ref.func in a tiny hot loop appears to be much slower in Wasmtime than in Wasmer Cranelift.

After reduction, I got a minimal reproducer that preserves essentially the same gap, plus two close controls where the gap disappears. The evidence points specifically to the per-iteration ref.func path, not to the loop scaffold or the reference sink.

test_cases.zip

Primary reproducer:

primary_reproducer_ref_func_hotloop.wat

Supporting controls:

supporting_control_ref_func_hoisted.wat

supporting_control_ref_null_hotloop.wat

Test Case

Primary reproducer loop body:

    ref.func $f0
    global.set $g0
    local.get $i
    i64.const 1
    i64.sub
    local.tee $i
    i64.const 0
    i64.ne
    br_if $body

The reduced reproducer uses:

trip count: 2^30

one declared function $f0

one mutable funcref global sink

one declarative element entry so that ref.func remains valid

Matched controls:

same loop shape, but use a hoisted prebuilt non-null reference via global.get $g0

same loop shape, but replace ref.func with ref.null func
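
The issue shows only the loop body, so the following is a hedged reconstruction of what the three modules might look like, generated from one template. The module scaffold, the `_start` export, the `$g1` sink in the hoisted control, and the helper name `build_wat` are assumptions for illustration; the attached `.wat` files may differ.

```python
# Hypothetical reconstruction of the three test modules described above.
# Only the loop body, trip count, $f0, $g0 and the declarative element
# segment come from the issue; everything else is an assumption.
TRIP_COUNT = 2**30

# Per-variant instructions that differ between the reproducer and controls.
LOOP_BODIES = {
    "primary_reproducer_ref_func_hotloop": ["ref.func $f0", "global.set $g0"],
    "supporting_control_ref_func_hoisted": ["global.get $g0", "global.set $g1"],
    "supporting_control_ref_null_hotloop": ["ref.null func", "global.set $g0"],
}

# Shared countdown scaffold, copied from the loop body in the issue.
COUNTDOWN = [
    "local.get $i", "i64.const 1", "i64.sub", "local.tee $i",
    "i64.const 0", "i64.ne", "br_if $body",
]


def build_wat(name: str) -> str:
    """Return WAT text for one of the three variants."""
    hoisted = name == "supporting_control_ref_func_hoisted"
    # The hoisted control builds the non-null reference once, outside the loop.
    g0_init = "(ref.func $f0)" if hoisted else "(ref.null func)"
    lines = [
        "(module",
        "  (func $f0)",
        "  (elem declare func $f0)",
        f"  (global $g0 (mut funcref) {g0_init})",
    ]
    if hoisted:
        # Assumed second global so the loop still has a reference sink.
        lines.append("  (global $g1 (mut funcref) (ref.null func))")
    lines += [
        '  (func (export "_start") (local $i i64)',
        f"    i64.const {TRIP_COUNT}",
        "    local.set $i",
        "    (loop $body",
    ]
    lines += [f"      {instr}" for instr in LOOP_BODIES[name] + COUNTDOWN]
    lines += ["    )", "  )", ")"]
    return "\n".join(lines) + "\n"
```

Calling `build_wat("primary_reproducer_ref_func_hotloop")` and writing the result to a `.wat` file gives an input for the `wat2wasm` step; the three variants differ only in the first two loop instructions and the global setup.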

Steps to Reproduce

Build the primary testcase:

    wat2wasm primary_reproducer_ref_func_hotloop.wat -o primary_reproducer_ref_func_hotloop.wasm

Warm up once:

    wasmtime primary_reproducer_ref_func_hotloop.wasm

Measure runtime:

    perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_ref_func_hotloop.wasm

Run the same flow on the two supporting controls above.

For comparison with Wasmer Cranelift:

    wasmer run primary_reproducer_ref_func_hotloop.wasm
    perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_ref_func_hotloop.wasm

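
The same build, warm-up, and measure flow can be generated for every testcase and engine. This is a minimal sketch that only assembles the commands listed above; the helper names `plan` and `run_plan` are hypothetical, and it assumes `wat2wasm`, `perf`, `wasmtime`, and `wasmer` are on `PATH`.

```python
import subprocess

# Engine launch prefixes, taken from the commands in the issue.
ENGINES = {"wasmtime": ["wasmtime"], "wasmer": ["wasmer", "run"]}

# Testcase base names, taken from the attached file names.
CASES = [
    "primary_reproducer_ref_func_hotloop",
    "supporting_control_ref_func_hoisted",
    "supporting_control_ref_null_hotloop",
]


def plan(case: str, engine: str, repeats: int = 3) -> list[list[str]]:
    """Return the build, warm-up, and measurement commands for one run."""
    wasm = f"{case}.wasm"
    run = ENGINES[engine] + [wasm]
    return [
        ["wat2wasm", f"{case}.wat", "-o", wasm],  # build
        run,                                      # warm up once
        ["perf", "stat", "-r", str(repeats), "-e", "task-clock", *run],
    ]


def run_plan(commands: list[list[str]]) -> None:
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `for case in CASES: run_plan(plan(case, "wasmtime"))` would repeat the flow for the reproducer and both controls.
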
Expected and actual Results

Primary reproducer and close controls

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
| --- | --- | --- | --- |
| primary_reproducer_ref_func_hotloop | 2.7668 | 38.2027 | 13.81x |
| supporting_control_ref_func_hoisted | 0.5029 | 0.5266 | 1.05x |
| supporting_control_ref_null_hotloop | 0.4083 | 0.4468 | 1.09x |

Observed pattern:

the primary reproducer is dramatically slower in Wasmtime than in Wasmer Cranelift;

hoisting the non-null function reference out of the loop collapses the gap;

replacing ref.func with ref.null func also collapses the gap.

This makes the trigger look very specifically tied to repeated hot-loop ref.func.
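
To put the measurements on a per-iteration scale, the timings can be divided by the 2^30 trip count. This is a rough sketch only: the wall-clock figures include process startup and compilation, and subtracting the hoisted control to isolate the cost of `ref.func` is an assumption of this example, not something the issue does. The function names are hypothetical.

```python
TRIP_COUNT = 2**30

# (wasmer_cranelift seconds, wasmtime seconds), copied from the table above.
TIMINGS = {
    "primary_reproducer_ref_func_hotloop": (2.7668, 38.2027),
    "supporting_control_ref_func_hoisted": (0.5029, 0.5266),
    "supporting_control_ref_null_hotloop": (0.4083, 0.4468),
}
WASMER, WASMTIME = 0, 1


def ratio(case: str) -> float:
    """Wasmtime time divided by Wasmer Cranelift time."""
    wasmer, wasmtime = TIMINGS[case]
    return wasmtime / wasmer


def ns_per_iteration(seconds: float) -> float:
    """Average nanoseconds per loop iteration for a total runtime."""
    return seconds / TRIP_COUNT * 1e9


def extra_ns_per_ref_func(engine: int) -> float:
    """Per-iteration cost of the primary loop beyond the hoisted control."""
    primary = TIMINGS["primary_reproducer_ref_func_hotloop"][engine]
    hoisted = TIMINGS["supporting_control_ref_func_hoisted"][engine]
    return ns_per_iteration(primary - hoisted)


if __name__ == "__main__":
    for case in TIMINGS:
        print(f"{case}: {ratio(case):.2f}x")
    print(f"wasmtime extra: {extra_ns_per_ref_func(WASMTIME):.1f} ns/iter")
    print(f"wasmer extra:   {extra_ns_per_ref_func(WASMER):.1f} ns/iter")
```

Under these assumptions, each in-loop `ref.func` costs roughly 35 ns more than the hoisted load in Wasmtime, against roughly 2 ns in Wasmer Cranelift.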

Family-level consistency

The original generated ref.func seeds showed the same shape:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
| --- | --- | --- | --- |
| ref_func_1 | 2.9594 | 38.5392 | 13.02x |
| ref_func_2 | 2.7539 | 38.8498 | 14.11x |

A related mixed testcase from the ref.is_null family also showed the same gap only when the loop used ref.func to create the non-null input each iteration:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
| --- | --- | --- | --- |
| ref_is_null_2 (ref.func + ref.is_null) | 2.9624 | 39.3673 | 13.29x |
| hoisted non-null control for ref.is_null | 0.6419 | 0.6855 | 1.07x |

So the ref.is_null outlier seems to be explained by the same repeated-ref.func trigger, rather than by ref.is_null itself.

Versions and Environment

Wasmtime version: wasmtime 41.0.0 (4898322a4 2025-12-18)

Host OS: Ubuntu 22.04.5 LTS x64

Architecture: x86_64

CPU: 12th Gen Intel(R) Core(TM) i7-12700

Extra Info

I also checked Wasmtime CLIF for the reduced reproducer to make sure the benchmark is still alive.

The hot loop still performs a per-iteration builtin call:

    v6 = call fn0(v0, v32)
    store notrap aligned table v6, v0+96

where fn0 is wasmtime_builtin_ref_func.

That builtin still performs a deeper indirect runtime call with extra frame/return-address bookkeeping:

    v3 = get_frame_pointer.i64
    store notrap aligned v3, v2+40
    v4 = get_return_address.i64
    store notrap aligned v4, v2+48
    v7 = call_indirect sig0, v6(v0, v1)

In contrast, the hoisted control's hot loop is just a load/store path without wasmtime_builtin_ref_func in the loop:

    v5 = load.i64 notrap aligned table v0+96
    store notrap aligned table v5, v0+112

I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern:

repeated ref.func in a tiny hot loop;

slowdown remains after reduction to a minimal reproducer;

the gap disappears when ref.func is removed from the loop;

the gap also disappears for repeated ref.null func.

Wasmtime GitHub notifications bot (May 06 2026 at 04:10):

gaaraw added the bug label to Issue #13295.

Wasmtime GitHub notifications bot (May 06 2026 at 04:10):

gaaraw added the fuzz-bug label to Issue #13295.

Wasmtime GitHub notifications bot (May 06 2026 at 04:42):

cfallin commented on issue #13295:

Hi @gaaraw,

A few things:

It would be great for you to review the AI Tool Policy of the Bytecode Alliance (hence this project). Alex already noted in your previous issue (#13272) that "There's a huge amount of information to sift through when the issue is more-or-less memory.copy is faster in one engine than another" and essentially the same thing is true here too.

Alex also already noted that it's not expected for runtimes to have exactly the same performance profile. In the spirit of the "talk to us first" guidance in our documentation for external fuzzing campaigns, I'd like to ask: is there a hidden assumption/hypothesis in your work that all engines should converge to a single canonical, fast implementation, and anything else is a bug?

In this case, what I think you're running into is that we have lazy initialization of funcrefs in tables, which we do in a libcall. We could optimize differently by not doing that, but the payoff is that our instantiation is extremely fast because the table contents need not be initialized eagerly. What I'm getting at here is that different engines may choose different implementation strategies that prioritize one dimension or another of performance, and so this "performance fuzzbug" may not really even be considered a bug.

So: it'd be useful to hear your philosophy and purpose behind this fuzzing campaign. Is it to find deltas and raise questions that may point to inefficiencies we can fix? That's fine if so -- but I would perhaps go about it a bit differently. First, don't call it a "fuzzbug" (that has a generally-accepted meaning that is pretty different than what you have here); second, don't bombard us with overly verbose descriptions; third, do some more analysis on the tradeoffs, and come into this with more curiosity about "why", then we can have an interesting discussion. Thanks!
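
The tradeoff described in the comment above can be illustrated with a toy model. This is not Wasmtime's implementation: the class names, the dictionary-backed slots, and the counters are all hypothetical, and the sketch assumes only what the thread states, namely that lazy initialization goes through an out-of-line helper on each `ref.func` while eager initialization pays up front at instantiation.

```python
class EagerFuncRefs:
    """Toy model: build every funcref at instantiation time."""

    def __init__(self, num_funcs: int) -> None:
        # Instantiation cost grows with the number of functions.
        self.init_work = num_funcs
        self.slots = {i: f"funcref#{i}" for i in range(num_funcs)}
        self.helper_calls = 0

    def ref_func(self, index: int) -> str:
        # Fast path: a plain lookup, no helper call.
        return self.slots[index]


class LazyFuncRefs:
    """Toy model: build a funcref only when it is first requested."""

    def __init__(self, num_funcs: int) -> None:
        # Instantiation does no per-function work.
        self.init_work = 0
        self.num_funcs = num_funcs
        self.slots: dict[int, str] = {}
        self.helper_calls = 0

    def ref_func(self, index: int) -> str:
        # Every request goes through the helper (standing in for the
        # libcall), even when the slot is already initialized.
        self.helper_calls += 1
        if index not in self.slots:
            self.slots[index] = f"funcref#{index}"
        return self.slots[index]


def hot_loop(table, iterations: int) -> None:
    """Request the same function reference repeatedly."""
    for _ in range(iterations):
        table.ref_func(0)
```

With 10,000 functions and 1,000 loop iterations, the eager model does 10,000 units of instantiation work and no helper calls, while the lazy model does no instantiation work and 1,000 helper calls. Which side wins depends on whether the workload is dominated by instantiation or by repeated `ref.func` in a hot loop.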

Wasmtime GitHub notifications bot (May 06 2026 at 04:42):

cfallin edited a comment on issue #13295:

Wasmtime GitHub notifications bot (May 06 2026 at 09:22):

gaaraw commented on issue #13295:

Thanks — this is very helpful feedback.

You're right on both points: calling these reports "fuzzbugs" is not the right framing here, and my issue writeups have been too verbose.

Also, no — I am not assuming that all engines should converge to one canonical fast implementation, or that a cross-engine delta automatically means one engine is wrong.

What I am trying to do is use cross-runtime deltas as signals, reduce them to smaller execution shapes, and then figure out whether the result looks more like a missed optimization, an implementation tradeoff, or a benchmark artifact.

Your explanation here is useful exactly for that reason. If this ref.func behavior is tied to lazy funcref initialization through a libcall path, with the payoff being faster instantiation, then this case is better understood as a tradeoff-revealing performance anomaly than as a straightforward bug report.

So I think the lesson for me is to present these cases with more curiosity and less bug-like framing: tighter reports, clearer controls, and more discussion of plausible tradeoffs up front.

Thanks again — I appreciate the clarification and the candid guidance.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the bug label from Issue #13295.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the fuzz-bug label from Issue #13295.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton added the performance label to Issue #13295.

Last updated: Jul 29 2026 at 05:03 UTC