gaaraw opened issue #13272:
Describe the bug
Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.
I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.
The clearest primary reproducer I found is:
- primary_reproducer_memory_copy_len32.wat
Useful supporting controls are:
- supporting_control_memory_copy_len0.wat
- supporting_memory_fill_same_shape.wat
- supporting_memory_copy_len1.wat
- supporting_memory_copy_len64_safe.wat
- supporting_memory_copy_src_eq_dst_len32.wat
- supporting_memory_copy_src_plus1024_len32_safe.wat
Test Case
Primary reproducer loop body:
(local.get $i) (i32.wrap_i64) (i32.const 65504) (i32.and)
(local.get $i) (i32.wrap_i64) (i32.const 1431655765) (i32.xor) (i32.const 65504) (i32.and)
(i32.const 32)
(memory.copy)
The reduced reproducer uses:
- trip count: 2^28
- one page of memory: (memory 1)
- both src/dst addresses constrained to a small low-memory window
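The operand computation in the loop body can be modeled outside Wasm to check that every access stays inside the single page. The following is a minimal sketch, assuming the first masked value on the stack is the destination and the second is the source (the operand order memory.copy uses); the function and constant names are illustrative and do not come from the testcase.

```rust
const PAGE_SIZE: u32 = 65_536; // one Wasm page, matching (memory 1)
const MASK: u32 = 65_504; // 0xFFE0: keeps addresses 32-byte aligned and below 65536
const XOR_KEY: u32 = 1_431_655_765; // 0x5555_5555
const LEN: u32 = 32;

/// Returns (dst, src, len) for loop counter `i`, mirroring the loop body.
fn operands(i: u64) -> (u32, u32, u32) {
    let low = i as u32; // i32.wrap_i64
    let dst = low & MASK;
    let src = (low ^ XOR_KEY) & MASK;
    (dst, src, LEN)
}

fn main() {
    let mut max_end = 0u32;
    // The mask only keeps bits below 2^16, so 2^16 iterations cover every case.
    for i in 0..(1u64 << 16) {
        let (dst, src, len) = operands(i);
        // Both ranges must end at or before the end of the page.
        assert!(dst + len <= PAGE_SIZE && src + len <= PAGE_SIZE);
        // The two 32-byte ranges never overlap for this mask and key.
        assert!(dst.max(src) - dst.min(src) >= len);
        max_end = max_end.max(dst + len).max(src + len);
    }
    println!("largest end offset: {max_end}");
}
```

Under these assumptions the copy never traps and never overlaps, so each iteration pays only the cost of issuing the operation plus moving 32 bytes.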
The closest controls are:
- same shape, but memory.copy length changed to 0
- same shape, but memory.copy replaced with memory.fill
- same shape, but copy lengths changed across 1/4/8/16/32/64
- same shape, but src/dst relation changed to src == dst and src = dst + 1024
Steps to Reproduce
- Build the primary testcase:
wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm
- Warm up once:
wasmtime primary_reproducer_memory_copy_len32.wasm
- Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm
For comparison, run the same flow on the supporting testcases listed above.
If helpful, compare against Wasmer Cranelift with:
wasmer run primary_reproducer_memory_copy_len32.wasm
perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm
Expected and actual Results
Primary memory.copy reproducer and close controls

| testcase | shape | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|---|
| control_drop | target removed | 0.09570 | 0.08054 | 0.84x |
| memory.copy len=0 | xor-shaped src/dst, bounded window | 0.97108 | 2.61960 | 2.70x |
| memory.copy len=32 | xor-shaped src/dst, bounded window | 0.76792 | 2.68820 | 3.50x |
| memory.fill len=32 | same bounded address shape | 0.64743 | 2.25270 | 3.48x |

Observed pattern:
- the target-removed control is fast in both runtimes;
- Wasmtime is already much slower for memory.copy len=0;
- the slowdown remains for memory.copy len=32;
- a related bulk-memory instruction (memory.fill) shows a similar gap.
This makes the anomaly look more like bulk-memory helper/runtime-path cost than payload movement cost.
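The table can be turned into a rough per-operation cost by subtracting the target-removed control and dividing by the trip count. This is only a back-of-the-envelope sketch: it assumes the control_drop time approximates everything other than the bulk-memory operation (process startup, compilation, loop overhead) and that these rows use the 2^28 trip count of the reduced reproducer. The helper name is illustrative.

```rust
const TRIPS: f64 = 268_435_456.0; // 2^28 iterations

/// Estimated nanoseconds per bulk-memory operation, given the total
/// runtime and the runtime of the control with the operation removed.
fn per_op_ns(total_s: f64, baseline_s: f64) -> f64 {
    (total_s - baseline_s) / TRIPS * 1e9
}

fn main() {
    // memory.copy len=0 rows from the table above.
    let wasmtime = per_op_ns(2.61960, 0.08054);
    let wasmer = per_op_ns(0.97108, 0.09570);
    println!("wasmtime: {wasmtime:.2} ns/op");
    println!("wasmer cranelift: {wasmer:.2} ns/op");
    println!("difference: {:.2} ns/op", wasmtime - wasmer);
}
```

With the len=0 numbers this works out to roughly 9.5 ns per operation in Wasmtime against roughly 3.3 ns in Wasmer Cranelift, which is consistent with a fixed per-call cost rather than a per-byte cost.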
Length sweep for memory.copy

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| len=0 | 0.97080 | 2.73150 | 2.81x |
| len=1 | 0.97112 | 2.99300 | 3.08x |
| len=4 | 0.89589 | 2.82370 | 3.15x |
| len=8 | 0.89769 | 2.81460 | 3.14x |
| len=16 | 0.91569 | 2.77790 | 3.03x |
| len=32 | 0.76524 | 2.65780 | 3.47x |
| len=64 (safe window) | 0.76253 | 2.68210 | 3.52x |

Observed pattern:
- from len=0 through len=64, the slowdown ratio stays broadly stable;
- the main trigger does not seem to be the payload size itself.
Src/dst relation sweep for memory.copy len=32

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| src == dst | 0.75430 | 2.61270 | 3.46x |
| src = dst + 1024 (safe) | 0.72937 | 2.57010 | 3.52x |

Observed pattern:
- the gap remains even when the copy is self-copy or a fixed-offset in-bounds copy;
- this does not look specific to the original xor-shaped address relation;
- this also does not look primarily driven by overlap semantics.
Family-level consistency
The original full-trip generated memory_copy_* seeds all showed wasmtime > wasmer_cranelift:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_copy_1 | 12.1567 | 39.8686 | 3.28x |
| memory_copy_2 | 13.4606 | 36.9620 | 2.75x |
| memory_copy_3 | 19.7391 | 36.3320 | 1.84x |
| memory_copy_4 | 23.0015 | 36.1513 | 1.57x |
| memory_copy_5 | 9.9472 | 37.0502 | 3.72x |

Related memory_fill_* seeds also showed the same direction:

| testcase | wasmer_cranelift (s) | wasmtime (s) | ratio |
|---|---|---|---|
| memory_fill_1 | 9.8347 | 31.9666 | 3.25x |
| memory_fill_2 | 12.0900 | 32.1992 | 2.66x |
| memory_fill_3 | 12.9405 | 34.5910 | 2.67x |

Versions and Environment
- wasmtime: 41.0.0 (4898322a4 2025-12-18)
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
Extra Info
For the primary reduced testcase, I also checked Wasmtime CLIF to make sure the benchmark is still alive.
I generated CLIF with:
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm
In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:
call fn0(vmctx, 0, dst, 0, src, len)
The emitted builtin wasmtime_builtin_memory_copy still performs a deeper indirect runtime call:
v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)
So this does not look like dead-code elimination or a broken benchmark scaffold.
The strongest trigger condition I can currently support is:
- repeated small bounded bulk-memory operations;
- one-page memory with a hot low-memory window;
- slowdown present for both memory.copy and memory.fill;
- largely independent of copy length (0..64 in this sweep) and src/dst relation.
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.
gaaraw added the bug label to Issue #13272.
gaaraw added the fuzz-bug label to Issue #13272.
alexcrichton commented on issue #13272:
Wasmtime's implementation of memory.{fill,copy} right now is "unconditionally call a libcall", which is known to be slower than more optimized strategies such as special-casing constant-sized memcpys and translating them directly. That being said, it's also a significantly simpler implementation because we always have to implement a libcall-based fallback anyway.
Can you detail a bit more why you feel this is a bug that should be fixed? For example, is this purely for fuzzing? If so, performance of an exact shape of something is not guaranteed to be the same across engines, so I suspect you're going to have a difficult time fuzzing that.
On one hand this seems like an optimization we could implement in Wasmtime, but on the other hand that requires nontrivial work, has risk, and needs justification. If the justification is "significantly simplifies a guest toolchain", that seems worth it, but my impression is toolchains like LLVM already hand-unroll memory.copy into component parts. Personally I don't feel "performance differential fuzzing" is the best justification for this sort of performance work myself, but others could also reasonably differ.
Also, FWIW, is this issue AI-generated? There's a huge amount of information to sift through when the issue is more-or-less "memory.copy is faster in one engine than another", and most of the other information is just noise around that.
gaaraw commented on issue #13272:
Thanks, this is helpful context.
You're right that my original report was too long. Also yes: it was AI-assisted, and I agree it ended up noisier than it should have been.
The short version of what I'm claiming is:
- not that Wasmtime is incorrect;
- but that repeated small memory.copy / memory.fill operations seem to have a large fixed overhead in Wasmtime relative to other engines, including Wasmer Cranelift;
- and that the evidence points more to call-path/libcall overhead than to payload-movement cost.
The strongest evidence is the reduced 2^28-trip controls:

| testcase | Wasmer Cranelift | Wasmtime |
|---|---|---|
| memory.copy len=0 | 0.97s | 2.62s |
| memory.copy len=32 | 0.77s | 2.69s |
| memory.fill len=32 | 0.65s | 2.25s |

len=0 is the key result: when the copy length is zero, the payload work is effectively gone, but most of the gap remains. That makes this look much more like bulk-memory helper/libcall overhead than byte-copy cost.
This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case.
I am not claiming that as a confirmed root cause, but I think it is a real, repeatable execution-shape signal rather than just fuzzing noise.
alexcrichton removed the bug label from Issue #13272.
alexcrichton removed the fuzz-bug label from Issue #13272.
alexcrichton added the performance label to Issue #13272.
fitzgen commented on issue #13272:
> This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case.
As Alex said, it doesn't make sense for us to inline special cases for len=0 (or any small len) because toolchains already emit inline code for small memory.copys and all that. That is, (loop (memory.copy)) is not an interesting performance test case because real toolchains already emit code like (if (is-small-copy) (then (inline-copy)) (else (memory.copy))) instead of just (memory.copy) directly.
But if our libcalls are significantly slower than other engines' libcalls, then that is maybe more interesting.
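The toolchain-side pattern described here, an inline path for small copies with memory.copy as the fallback, can be sketched in plain Rust over a byte slice standing in for linear memory. This is only an illustration: the 16-byte threshold and the function names are hypothetical, and the slice method copy_within stands in for the engine's memory.copy path.

```rust
/// Hypothetical cutoff below which the copy is done inline.
const SMALL_COPY_MAX: usize = 16;

/// Stand-in for the engine-provided memory.copy (memmove semantics).
fn bulk_copy(mem: &mut [u8], dst: usize, src: usize, len: usize) {
    mem.copy_within(src..src + len, dst);
}

/// What a toolchain might emit: inline code for small lengths,
/// the bulk-memory instruction for everything else.
fn copy(mem: &mut [u8], dst: usize, src: usize, len: usize) {
    if len <= SMALL_COPY_MAX {
        // Load everything before storing so overlapping ranges still
        // behave like memmove, as memory.copy requires.
        let mut tmp = [0u8; SMALL_COPY_MAX];
        tmp[..len].copy_from_slice(&mem[src..src + len]);
        mem[dst..dst + len].copy_from_slice(&tmp[..len]);
    } else {
        bulk_copy(mem, dst, src, len);
    }
}

fn main() {
    let mut mem: Vec<u8> = (0..64).collect();
    copy(&mut mem, 32, 0, 8); // small: takes the inline path
    copy(&mut mem, 0, 16, 48); // large: falls back to bulk_copy
    println!("{:?}", &mem[..8]);
}
```

With code shaped like this, the engine's libcall is only reached for larger copies, which is why a bare loop of small memory.copy operations says little about real workloads.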
alexcrichton commented on issue #13272:
After https://github.com/bytecodealliance/wasmtime/pull/13368 and https://github.com/bytecodealliance/wasmtime/pull/13367 Wasmtime is ~2x faster across the board. There is still no optimization for constant-length short memcpy/memset, however, which will be the lion's share of what's remaining.
Last updated: Jun 01 2026 at 09:49 UTC