Stream: git-wasmtime

Topic: wasmtime / issue #13272 <Performance> fuzzbug: Repeated s...


Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

gaaraw opened issue #13272:

Describe the bug

Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.

I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.

test_cases.zip

The clearest primary reproducer I found is:

Useful supporting controls are:

Test Case

Primary reproducer loop body:

(local.get $i)
(i32.wrap_i64)
(i32.const 65504)
(i32.and)
(local.get $i)
(i32.wrap_i64)
(i32.const 1431655765)
(i32.xor)
(i32.const 65504)
(i32.and)
(i32.const 32)
(memory.copy)
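
As a sanity check on the address shape in the loop body above, the following Python sketch models the two operand computations. It is only a model: it assumes the module's linear memory is at least one 64 KiB page (the memory declaration is not shown in the report), and the names `addresses` and `check` are illustrative, not taken from the testcase. Under those assumptions, both addresses are 32-byte aligned, every 32-byte access stays inside the first page, and the two ranges never overlap, because the masked XOR constant is 0x5540 and its lowest set bit is 64.

```python
WASM_PAGE = 65536      # one WebAssembly page in bytes (assumed minimum memory size)
MASK = 65504           # 0xFFE0: clears the low 5 bits and keeps the value below 64 KiB
XOR_KEY = 1431655765   # 0x55555555
LEN = 32               # constant copy length used by the primary reproducer


def addresses(i):
    """Return (dst, src) as the loop body computes them for the i64 counter i."""
    lo = i & 0xFFFFFFFF          # i32.wrap_i64
    dst = lo & MASK              # first operand pushed: destination
    src = (lo ^ XOR_KEY) & MASK  # second operand pushed: source
    return dst, src


def check(trips=1 << 16):
    """Verify alignment, bounds, and non-overlap for the first `trips` iterations."""
    for i in range(trips):
        dst, src = addresses(i)
        if dst % 32 or src % 32:
            return False  # not 32-byte aligned
        if dst + LEN > WASM_PAGE or src + LEN > WASM_PAGE:
            return False  # would leave the first page
        if abs(dst - src) < LEN:
            return False  # the two 32-byte ranges would overlap
    return True


if __name__ == "__main__":
    print(addresses(0), check())
```

Only the low 16 bits of the counter influence the addresses, so checking 2^16 iterations covers every address pair the loop can produce.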

The reduced reproducer uses:

The closest controls are:

Steps to Reproduce

  1. Build the primary testcase:
wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm
  2. Warm up once:
wasmtime primary_reproducer_memory_copy_len32.wasm
  3. Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm
  4. For comparison, run the same flow on the supporting testcases listed above.

  5. If helpful, compare against Wasmer Cranelift with:

wasmer run primary_reproducer_memory_copy_len32.wasm
perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm
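
Where `perf` is not available, the same warm-up-then-measure flow can be approximated with a small Python driver. This is a rough sketch, not the method used for the numbers in this report: it measures wall-clock time rather than `task-clock`, it reuses only the two command lines shown above, and the helper names `time_command` and `compare` are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return samples


def compare(wasm_path, runs=3):
    """Warm up and time both engines on one module; return mean seconds and ratio."""
    engines = {
        "wasmtime": ["wasmtime", wasm_path],
        "wasmer": ["wasmer", "run", wasm_path],
    }
    means = {}
    for name, argv in engines.items():
        time_command(argv, runs=1)  # warm-up run, result discarded
        means[name] = statistics.mean(time_command(argv, runs))
    means["ratio"] = means["wasmtime"] / means["wasmer"]
    return means


if __name__ == "__main__" and len(sys.argv) > 1:
    print(compare(sys.argv[1]))
```

Wall-clock samples include process startup and compilation, so short-running modules will show a smaller ratio than the per-operation gap.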

Expected and actual Results

Primary memory.copy reproducer and close controls

| testcase | shape | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|---|
| control_drop | target removed | 0.09570 | 0.08054 | 0.84x |
| memory.copy len=0 | xor-shaped src/dst, bounded window | 0.97108 | 2.61960 | 2.70x |
| memory.copy len=32 | xor-shaped src/dst, bounded window | 0.76792 | 2.68820 | 3.50x |
| memory.fill len=32 | same bounded address shape | 0.64743 | 2.25270 | 3.48x |

Observed pattern:

This makes the anomaly look more like bulk-memory helper/runtime-path cost than payload movement cost.

Length sweep for memory.copy

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| len=0 | 0.97080 | 2.73150 | 2.81x |
| len=1 | 0.97112 | 2.99300 | 3.08x |
| len=4 | 0.89589 | 2.82370 | 3.15x |
| len=8 | 0.89769 | 2.81460 | 3.14x |
| len=16 | 0.91569 | 2.77790 | 3.03x |
| len=32 | 0.76524 | 2.65780 | 3.47x |
| len=64 (safe window) | 0.76253 | 2.68210 | 3.52x |

Observed pattern:

Src/dst relation sweep for memory.copy len=32

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| src == dst | 0.75430 | 2.61270 | 3.46x |
| src = dst + 1024 (safe) | 0.72937 | 2.57010 | 3.52x |

Observed pattern:

Family-level consistency

The original full-trip generated memory_copy_* seeds all showed wasmtime > wasmer_cranelift:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_copy_1 | 12.1567 | 39.8686 | 3.28x |
| memory_copy_2 | 13.4606 | 36.9620 | 2.75x |
| memory_copy_3 | 19.7391 | 36.3320 | 1.84x |
| memory_copy_4 | 23.0015 | 36.1513 | 1.57x |
| memory_copy_5 | 9.9472 | 37.0502 | 3.72x |

Related memory_fill_* seeds also showed the same direction:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_fill_1 | 9.8347 | 31.9666 | 3.25x |
| memory_fill_2 | 12.0900 | 32.1992 | 2.66x |
| memory_fill_3 | 12.9405 | 34.5910 | 2.67x |

Versions and Environment

Extra Info

For the primary reduced testcase, I also checked Wasmtime CLIF to make sure the benchmark is still alive.

I generated CLIF with:

wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm

In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:

call fn0(vmctx, 0, dst, 0, src, len)

The emitted builtin wasmtime_builtin_memory_copy still performs a deeper indirect runtime call:

v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)

So this does not look like dead-code elimination or a broken benchmark scaffold.
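
A quick way to repeat this check on other testcases is to count call sites in the emitted CLIF text. The sketch below is a plain text scan under stated assumptions: it treats every regular file in the `--emit-clif` output directory as CLIF (the file naming is not specified in this report), it matches the `call` and `call_indirect` mnemonics seen in the snippets above, and it does not skip comments. The function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Try the longer mnemonic first so "call_indirect" is never counted as "call".
CALL_RE = re.compile(r"\b(call_indirect|call)\b")


def count_calls(clif_text):
    """Count direct and indirect call mnemonics in a chunk of CLIF text."""
    return Counter(m.group(1) for m in CALL_RE.finditer(clif_text))


def count_calls_in_dir(out_dir):
    """Map each file name in out_dir to its call counts (assumes text files)."""
    return {
        path.name: count_calls(path.read_text(errors="replace"))
        for path in sorted(Path(out_dir).iterdir())
        if path.is_file()
    }


if __name__ == "__main__":
    sample = (
        "call fn0(vmctx, 0, dst, 0, src, len)\n"
        "v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)\n"
    )
    print(count_calls(sample))
```

A nonzero count only shows that calls are present in the function; confirming that one sits inside the hot loop still requires reading the block structure by hand.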

The strongest trigger condition I can currently support is:

I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.

Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

gaaraw added the bug label to Issue #13272.

Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

gaaraw added the fuzz-bug label to Issue #13272.

Wasmtime GitHub notifications bot (May 05 2026 at 21:16):

alexcrichton commented on issue #13272:

Wasmtime's implementation of memory.{fill,copy} right now is "unconditionally call a libcall", which is known to be slower than more optimized strategies such as special-casing constant-sized memcpys and translating them directly. That being said, it's also a significantly simpler implementation because we always have to implement a libcall-based fallback anyway.

Can you detail a bit more why you feel this is a bug that should be fixed? For example, is this purely for fuzzing? If so, performance of an exact shape of something is not guaranteed to be the same across engines, so I suspect you're going to have a difficult time fuzzing that.

On one hand this seems like an optimization we could implement in Wasmtime, but on the other hand that requires nontrivial work, has risk, and needs justification. If the justification is "significantly simplifies a guest toolchain", that seems worth it, but my impression is toolchains like LLVM already hand-unroll memory.copy into component parts. Personally I don't feel "performance differential fuzzing" is the best justification for this sort of performance work myself, but others could also reasonably differ.

Also, FWIW, is this issue AI-generated? There's a huge amount of information to sift through when the issue is more-or-less memory.copy is faster in one engine than another, and most of the other information is just noise around that.

Wasmtime GitHub notifications bot (May 06 2026 at 02:34):

gaaraw commented on issue #13272:

Thanks, this is helpful context.

You're right that my original report was too long. Also yes: it was AI-assisted, and I agree it ended up noisier than it should have been.

The short version of what I'm claiming is:

The strongest evidence is the reduced 2^28-trip controls:

| testcase | Wasmer Cranelift | Wasmtime |
|---|---|---|
| memory.copy len=0 | 0.97s | 2.62s |
| memory.copy len=32 | 0.77s | 2.69s |
| memory.fill len=32 | 0.65s | 2.25s |

len=0 is the key result: when the copy length is zero, the payload work is effectively gone, but most of the gap remains. That makes this look much more like bulk-memory helper/libcall overhead than byte-copy cost.
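
To put the table in per-operation terms, the totals can be divided by the stated 2^28 trip count. The sketch below does only that arithmetic on the figures quoted above; the resulting nanosecond values include loop overhead and process startup, so they are upper bounds on the cost of the operation itself, and the helper names are illustrative.

```python
TRIPS = 2 ** 28  # trip count stated for the reduced controls

# Seconds copied from the table above: (Wasmer Cranelift, Wasmtime)
TIMINGS = {
    "memory.copy len=0": (0.97, 2.62),
    "memory.copy len=32": (0.77, 2.69),
    "memory.fill len=32": (0.65, 2.25),
}


def ns_per_iteration(seconds, trips=TRIPS):
    """Convert a total runtime in seconds to nanoseconds per loop iteration."""
    return seconds / trips * 1e9


def summarize(rows):
    """Per-iteration cost for both engines, plus their gap and ratio."""
    summary = {}
    for name, (wasmer_s, wasmtime_s) in rows.items():
        summary[name] = {
            "wasmer_ns": ns_per_iteration(wasmer_s),
            "wasmtime_ns": ns_per_iteration(wasmtime_s),
            "gap_ns": ns_per_iteration(wasmtime_s - wasmer_s),
            "ratio": wasmtime_s / wasmer_s,
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(TIMINGS).items():
        print(name, {key: round(value, 2) for key, value in row.items()})
```

With these inputs the len=0 case works out to a gap of roughly 6 ns per iteration, which is the amount attributed here to the call path rather than to moving bytes.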

This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case.

I am not claiming that as a confirmed root cause, but I think it is a real, repeatable execution-shape signal rather than just fuzzing noise.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the bug label from Issue #13272.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the fuzz-bug label from Issue #13272.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton added the performance label to Issue #13272.

Wasmtime GitHub notifications bot (May 12 2026 at 19:25):

fitzgen commented on issue #13272:

This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case.

As Alex said, it doesn't make sense for us to inline special cases for len=0 (or any small len) because toolchains already emit inline code for small memory.copy operations. That is, (loop (memory.copy)) is not an interesting performance test case because real toolchains already emit code like (if (is-small-copy) (then (inline-copy)) (else (memory.copy))) instead of just (memory.copy) directly.

But if our libcalls are significantly slower than other engines' libcalls, then that is maybe more interesting.
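
The dispatch shape described in this comment can be modeled in a few lines of Python. This is a toy model, not what any toolchain literally emits: `Memory`, `bulk_copy`, and the threshold `SMALL_COPY_LIMIT` are hypothetical names and values chosen only to show that a loop of small copies never reaches the generic bulk path.

```python
SMALL_COPY_LIMIT = 16  # hypothetical threshold, chosen only for illustration


class Memory:
    """Toy linear memory that counts how often the generic bulk path runs."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.bulk_calls = 0

    def _check(self, addr, n):
        # Mimic a bounds trap; Python slices would otherwise clamp silently.
        if addr < 0 or n < 0 or addr + n > len(self.data):
            raise IndexError("out-of-bounds memory access")

    def bulk_copy(self, dst, src, n):
        """Stand-in for memory.copy: generic path for any length or overlap."""
        self._check(dst, n)
        self._check(src, n)
        self.bulk_calls += 1
        # The right-hand side is copied first, so overlapping ranges are safe.
        self.data[dst:dst + n] = self.data[src:src + n]

    def copy(self, dst, src, n):
        """Guest-side dispatch: inline loads/stores when small, bulk otherwise."""
        if n <= SMALL_COPY_LIMIT:
            self._check(dst, n)
            self._check(src, n)
            loaded = [self.data[src + k] for k in range(n)]  # all loads first
            for k, byte in enumerate(loaded):
                self.data[dst + k] = byte
        else:
            self.bulk_copy(dst, src, n)


if __name__ == "__main__":
    mem = Memory(64)
    mem.data[0:4] = b"abcd"
    mem.copy(8, 0, 4)    # small: handled inline
    mem.copy(32, 0, 32)  # large: falls through to the bulk path
    print(bytes(mem.data[8:12]), mem.bulk_calls)
```

In this model the engine's bulk path only matters for copies above the threshold, which is why the speed of the libcall itself is the more relevant comparison.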

Wasmtime GitHub notifications bot (May 21 2026 at 20:45):

alexcrichton commented on issue #13272:

After https://github.com/bytecodealliance/wasmtime/pull/13368 and https://github.com/bytecodealliance/wasmtime/pull/13367 Wasmtime is ~2x faster across the board. There is still no optimization for constant-length short memcpy/memset, however, which will be the lion's share of what's remaining.


Last updated: Jun 01 2026 at 09:49 UTC