wasmtime / issue #13272 <Performance> fuzzbug: Repeated s... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #13272 <Performance> fuzzbug: Repeated s...

Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

Describe the bug

Repeated small bounded bulk-memory operations appear to be much slower in Wasmtime than in Wasmer Cranelift in a minimal microbenchmark family.

I first found this in generated differential tests for memory.copy, then reduced and checked it with a smaller reproducer plus several controls. The slowdown is not limited to one seed, and it is still present after varying copy length and src/dst relation.

test_cases.zip

The clearest primary reproducer I found is:

primary_reproducer_memory_copy_len32.wat

Useful supporting controls are:

supporting_control_memory_copy_len0.wat

supporting_memory_fill_same_shape.wat

supporting_memory_copy_len1.wat

supporting_memory_copy_len64_safe.wat

supporting_memory_copy_src_eq_dst_len32.wat

supporting_memory_copy_src_plus1024_len32_safe.wat

Test Case

Primary reproducer loop body:
(local.get $i)              ;; dst = wrap(i) & 65504
(i32.wrap_i64)
(i32.const 65504)
(i32.and)
(local.get $i)              ;; src = (wrap(i) ^ 1431655765) & 65504
(i32.wrap_i64)
(i32.const 1431655765)
(i32.xor)
(i32.const 65504)
(i32.and)
(i32.const 32)              ;; len = 32
(memory.copy)               ;; operands on the stack: dst, src, len

The reduced reproducer uses:

trip count: 2^28

one page of memory: (memory 1)

both src/dst addresses constrained to a small low-memory window
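
The address arithmetic in the loop body can be replayed outside Wasm as a sanity check on the bounds. The sketch below is a Python model and not part of the original report: the function names and the 2^16 sample range are illustrative assumptions, while the mask, the XOR constant, and the one-page memory size come from the testcase description. It does not model Wasm traps.

```python
MASK = 65504          # 0xFFE0, the i32.and constant from the loop body
XOR_KEY = 1431655765  # 0x55555555, the i32.xor constant from the loop body
PAGE_BYTES = 65536    # size of one Wasm page, matching (memory 1)


def operands(i: int, length: int = 32) -> tuple[int, int, int]:
    """Model one iteration's (dst, src, len) operands for memory.copy."""
    low = i & 0xFFFFFFFF              # i32.wrap_i64 keeps the low 32 bits
    dst = low & MASK                  # first operand pushed
    src = (low ^ XOR_KEY) & MASK      # second operand pushed
    return dst, src, length


def in_bounds(dst: int, src: int, length: int) -> bool:
    """True if both byte ranges fit inside a one-page memory."""
    return dst + length <= PAGE_BYTES and src + length <= PAGE_BYTES


# Hypothetical sample: the first 2^16 counter values, not the full 2^28 trips.
sample = range(1 << 16)
all_ok_len32 = all(in_bounds(*operands(i, 32)) for i in sample)
any_oob_len64 = any(not in_bounds(*operands(i, 64)) for i in sample)
```

Under this model every sampled len=32 copy stays in bounds, whereas the same mask with len=64 can reach past the end of the page. That may be why the len=64 variant is described as using a safe window, though the report does not say so explicitly.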

The closest controls are:

same shape, but memory.copy length changed to 0

same shape, but memory.copy replaced with memory.fill

same shape, but copy lengths changed across 1/4/8/16/32/64

same shape, but src/dst relation changed to src == dst and src = dst + 1024

Steps to Reproduce

Build the primary testcase:
wat2wasm primary_reproducer_memory_copy_len32.wat -o primary_reproducer_memory_copy_len32.wasm
Warm up once:
wasmtime primary_reproducer_memory_copy_len32.wasm
Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_memory_copy_len32.wasm
For comparison, run the same flow on the supporting testcases listed above.

If helpful, compare against Wasmer Cranelift with:
wasmer run primary_reproducer_memory_copy_len32.wasm
perf stat -r 3 -e 'task-clock' wasmer run primary_reproducer_memory_copy_len32.wasm
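
For anyone without perf available, the same warm-up-then-measure flow can be approximated with a small script. This is a minimal sketch, assuming both CLIs are on PATH; the helper names are hypothetical, and it records wall-clock time rather than the task-clock counter that perf stat reports, so absolute numbers may differ from the tables in this issue.

```python
import subprocess
import time


def time_command(cmd: list[str], runs: int = 3) -> list[float]:
    """Run cmd once as a warm-up, then return wall-clock seconds for each timed run."""
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)  # warm-up, not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def slowdown(baseline: list[float], candidate: list[float]) -> float:
    """Ratio of the mean candidate time to the mean baseline time."""
    return (sum(candidate) / len(candidate)) / (sum(baseline) / len(baseline))


# Example usage (assumes the .wasm file has already been built):
# wasm = "primary_reproducer_memory_copy_len32.wasm"
# ratio = slowdown(time_command(["wasmer", "run", wasm]),
#                  time_command(["wasmtime", wasm]))
```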
Expected and actual Results

Primary memory.copy reproducer and close controls

testcase	shape	wasmer_cranelift (s)	wasmtime (s)	ratio
control_drop	target removed	0.09570	0.08054	0.84x
memory.copy len=0	xor-shaped src/dst, bounded window	0.97108	2.61960	2.70x
memory.copy len=32	xor-shaped src/dst, bounded window	0.76792	2.68820	3.50x
memory.fill len=32	same bounded address shape	0.64743	2.25270	3.48x

Observed pattern:

the target-removed control is fast in both runtimes;

Wasmtime is already much slower for memory.copy len=0;

the slowdown remains for memory.copy len=32;

a related bulk-memory instruction (memory.fill) shows a similar gap.

This makes the anomaly look more like bulk-memory helper/runtime-path cost than payload movement cost.
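
One way to read the table above is as a cost per loop iteration. The sketch below divides each total by the stated trip count of 2^28; it assumes every row, including control_drop, ran the same number of trips, which the report implies but does not state for each row. The variable and function names are illustrative.

```python
TRIPS = 2 ** 28  # trip count stated for the reduced reproducer

# (wasmer_cranelift seconds, wasmtime seconds), copied from the table above
timings = {
    "control_drop": (0.09570, 0.08054),
    "memory.copy len=0": (0.97108, 2.61960),
    "memory.copy len=32": (0.76792, 2.68820),
    "memory.fill len=32": (0.64743, 2.25270),
}


def ns_per_iteration(seconds: float, trips: int = TRIPS) -> float:
    """Average nanoseconds spent per loop iteration."""
    return seconds / trips * 1e9


per_iter = {
    name: (ns_per_iteration(wasmer), ns_per_iteration(wasmtime))
    for name, (wasmer, wasmtime) in timings.items()
}

for name, (wasmer_ns, wasmtime_ns) in per_iter.items():
    print(f"{name:20s} wasmer {wasmer_ns:6.2f} ns  wasmtime {wasmtime_ns:6.2f} ns")
```

On these figures the bulk-memory rows cost roughly 8 to 10 ns per iteration in Wasmtime against roughly 2 to 4 ns in Wasmer Cranelift, while the target-removed loop costs well under 1 ns in both.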

Length sweep for memory.copy

testcase	wasmer_cranelift (s)	wasmtime (s)	ratio
len=0	0.97080	2.73150	2.81x
len=1	0.97112	2.99300	3.08x
len=4	0.89589	2.82370	3.15x
len=8	0.89769	2.81460	3.14x
len=16	0.91569	2.77790	3.03x
len=32	0.76524	2.65780	3.47x
len=64 (safe window)	0.76253	2.68210	3.52x

Observed pattern:

from len=0 through len=64, the slowdown ratio stays broadly stable;

the main trigger does not seem to be the payload size itself.
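
The claim that payload size is not the trigger can be checked by fitting a straight line of runtime against copy length. The following is a rough sketch using the figures from the length sweep table above; the helper name is hypothetical, and a seven-point fit over one measurement per length is only indicative.

```python
lengths = [0, 1, 4, 8, 16, 32, 64]
wasmtime_s = [2.73150, 2.99300, 2.82370, 2.81460, 2.77790, 2.65780, 2.68210]
wasmer_s = [0.97080, 0.97112, 0.89589, 0.89769, 0.91569, 0.76524, 0.76253]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Seconds of total runtime added per extra byte of copy length.
wasmtime_slope = least_squares_slope(lengths, wasmtime_s)
wasmer_slope = least_squares_slope(lengths, wasmer_s)

# Per-length slowdown ratios, as in the last column of the table.
ratios = [wt / wm for wt, wm in zip(wasmtime_s, wasmer_s)]
```

Neither runtime shows a positive fitted slope on these seven points, which is consistent with the observation that runtime does not grow with the number of bytes copied in this range.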

Src/dst relation sweep for memory.copy len=32

testcase	wasmer_cranelift (s)	wasmtime (s)	ratio
src == dst	0.75430	2.61270	3.46x
src = dst + 1024 (safe)	0.72937	2.57010	3.52x

Observed pattern:

the gap remains even when the copy is self-copy or a fixed-offset in-bounds copy;

this does not look specific to the original xor-shaped address relation;

this also does not look primarily driven by overlap semantics.

Family-level consistency

The original full-trip generated memory_copy_* seeds all showed wasmtime > wasmer_cranelift:

testcase	wasmer_cranelift (s)	wasmtime (s)	ratio
memory_copy_1	12.1567	39.8686	3.28x
memory_copy_2	13.4606	36.9620	2.75x
memory_copy_3	19.7391	36.3320	1.84x
memory_copy_4	23.0015	36.1513	1.57x
memory_copy_5	9.9472	37.0502	3.72x

Related memory_fill_* seeds also showed the same direction:

testcase	wasmer_cranelift (s)	wasmtime (s)	ratio
memory_fill_1	9.8347	31.9666	3.25x
memory_fill_2	12.0900	32.1992	2.66x
memory_fill_3	12.9405	34.5910	2.67x

Versions and Environment

wasmtime: 41.0.0 (4898322a4 2025-12-18)

wasmer: 6.1.0

WAMR: iwasm 2.4.4

wasmedge: 0.16.1-18-gc457fe30

wabt: 1.0.39

llvm: 21.1.5

Host OS: Ubuntu 22.04.5 LTS x64

CPU: 12th Gen Intel® Core™ i7-12700 × 20

Extra Info

For the primary reduced testcase, I also checked the CLIF that Wasmtime emits to confirm that the copy in the hot loop is not optimized away.

I generated CLIF with:
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_memory_copy_len32.wasm
In the generated CLIF for the hot loop, the operation is still lowered through a per-iteration builtin call equivalent to:
call fn0(vmctx, 0, dst, 0, src, len)
The emitted builtin wasmtime_builtin_memory_copy still performs a deeper indirect runtime call:
v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)
So this does not look like dead-code elimination or a broken benchmark scaffold.
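
The same check can be done mechanically by counting call instructions in the emitted CLIF text. This is a minimal sketch: the regular expression and function name are assumptions of this example, and the sample input is just the two lines quoted above, not a full CLIF dump.

```python
import re
from collections import Counter

# Matches the call opcodes as whole words; other opcodes are ignored.
CALL_RE = re.compile(r"\b(call_indirect|call)\b")


def count_calls(clif_text: str) -> Counter:
    """Count call and call_indirect instructions in CLIF text."""
    counts = Counter()
    for line in clif_text.splitlines():
        code = line.split(";", 1)[0]   # drop trailing ';' comments
        match = CALL_RE.search(code)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two lines quoted in the report, used here as sample input.
sample = """\
call fn0(vmctx, 0, dst, 0, src, len)
v11 = call_indirect sig0, v10(v0, v1, v2, v3, v4, v5)
"""
found = count_calls(sample)
```

To apply it to a real dump, read each text file that wasmtime compile wrote under out_dir and pass its contents to count_calls; a hot loop with no call left in it would suggest the operation had been eliminated or inlined.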

The strongest trigger condition I can currently support is:

repeated small bounded bulk-memory operations;

one-page memory with a hot low-memory window;

slowdown present for both memory.copy and memory.fill;

largely independent of copy length (0..64 in this sweep) and src/dst relation.

I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.

Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

gaaraw added the bug label to Issue #13272.

Wasmtime GitHub notifications bot (May 05 2026 at 08:10):

gaaraw added the fuzz-bug label to Issue #13272.

Wasmtime GitHub notifications bot (May 05 2026 at 21:16):

alexcrichton commented on issue #13272:

Wasmtime's implementation of memory.{fill,copy} right now is "unconditionally call a libcall" which is known to be slower than more optimized strategies such as special-casing constant-sized memcpy's and translating directly. That being said it's also a significantly simpler implementation because we always have to implement a libcall-based fallback anyway.

Can you detail a bit more why you feel this is a bug that should be fixed? For example is this purely for fuzzing? If so performance of an exact shape of something is not guaranteed to be the same across engines, so I suspect you're going to have a difficult time fuzzing that.

On one hand this seems like an optimization we could implement in Wasmtime, but on the other hand that requires nontrivial work, has risk, and needs justification. If the justification is "significantly simplifies a guest toolchain", that seems worth it, but my impression is toolchains like LLVM already hand-unroll memory.copy into component parts. Personally I don't feel "performance differential fuzzing" is the best justification for this sort of performance work myself, but others could also reasonably differ.

Also, FWIW, is this issue AI-generated? There's a huge amount of information to sift through when the issue is more-or-less memory.copy is faster in one engine than another, and most of the other information is just noise around that.

Wasmtime GitHub notifications bot (May 06 2026 at 02:34):

gaaraw commented on issue #13272:

Thanks, this is helpful context.

You're right that my original report was too long. Also yes: it was AI-assisted, and I agree it ended up noisier than it should have been.

The short version of what I'm claiming is:

not that Wasmtime is incorrect;

but that repeated small memory.copy / memory.fill operations seem to have a large fixed overhead in Wasmtime relative to other engines, including Wasmer Cranelift;

and that the evidence points more to call-path/libcall overhead than to payload-movement cost.

The strongest evidence is the reduced 2^28-trip controls:

testcase	Wasmer Cranelift	Wasmtime
memory.copy len=0	0.97s	2.62s
memory.copy len=32	0.77s	2.69s
memory.fill len=32	0.65s	2.25s

len=0 is the key result: when the copy length is zero, the payload work is effectively gone, but most of the gap remains. That makes this look much more like bulk-memory helper/libcall overhead than byte-copy cost.
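
To put a number on "most of the gap remains", the absolute Wasmtime-minus-Wasmer difference at len=0 can be compared with the difference at len=32. This is a small sketch over the rounded figures in the table above; the names are illustrative.

```python
# (Wasmer Cranelift seconds, Wasmtime seconds), rounded figures from the table above
len0 = (0.97, 2.62)
len32 = (0.77, 2.69)


def gap(pair):
    """Absolute Wasmtime-minus-Wasmer time difference in seconds."""
    wasmer, wasmtime = pair
    return wasmtime - wasmer


gap_len0 = gap(len0)     # gap with no payload to move
gap_len32 = gap(len32)   # gap with a 32-byte payload
share_without_payload = gap_len0 / gap_len32
```

On these numbers roughly 86% of the len=32 gap is already present when there are no bytes to copy.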

This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case.

I am not claiming that as a confirmed root cause, but I think it is a real, repeatable execution-shape signal rather than just fuzzing noise.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the bug label from Issue #13272.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton removed the fuzz-bug label from Issue #13272.

Wasmtime GitHub notifications bot (May 06 2026 at 14:49):

alexcrichton added the performance label to Issue #13272.

Wasmtime GitHub notifications bot (May 12 2026 at 19:25):

fitzgen commented on issue #13272:

"This also matches the emitted Wasmtime CLIF: the hot loop still performs a per-iteration builtin call, and that builtin still performs a deeper indirect runtime call. So a plausible explanation is that small bulk-memory ops are still going through a generic helper path rather than being specialized/inlined in the small constant-size case."

As Alex said, it doesn't make sense for us to inline special cases for len=0 (or any small len) because toolchains already emit inline code for small memory.copys and all that. That is, (loop (memory.copy)) is not an interesting performance test case because real toolchains already emit code like (if (is-small-copy) (then (inline-copy)) (else (memory.copy))) instead of just (memory.copy) directly.

But if our libcalls are significantly slower than other engines' libcalls, then that is maybe more interesting.
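
The guarded shape described above, (if (is-small-copy) (then (inline-copy)) (else (memory.copy))), can be modeled roughly as follows. This is an illustrative Python sketch rather than what any particular toolchain emits: the threshold is a hypothetical value, both function names are made up for the example, and out-of-bounds traps are not modeled.

```python
SMALL_COPY_LIMIT = 16  # hypothetical threshold; real toolchains choose their own


def bulk_copy(mem: bytearray, dst: int, src: int, n: int) -> None:
    """Stand-in for memory.copy: overlap-safe, like memmove."""
    mem[dst:dst + n] = mem[src:src + n]   # right-hand slice is copied before the write


def guarded_copy(mem: bytearray, dst: int, src: int, n: int) -> None:
    """Copy small ranges with inline loads and stores, larger ones via bulk_copy."""
    if n <= SMALL_COPY_LIMIT:
        loaded = [mem[src + k] for k in range(n)]   # all loads first, so overlap is safe
        for k, byte in enumerate(loaded):
            mem[dst + k] = byte
    else:
        bulk_copy(mem, dst, src, n)
```

With a guard like this in the guest code, the engine's bulk-memory path is only reached for larger copies, which is why a bare loop around memory.copy exercises a path that compiled programs would rarely hit for small lengths.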

Wasmtime GitHub notifications bot (May 21 2026 at 20:45):

alexcrichton commented on issue #13272:

After https://github.com/bytecodealliance/wasmtime/pull/13368 and https://github.com/bytecodealliance/wasmtime/pull/13367 Wasmtime is ~2x faster across the board. There is still no optimization for constant-length short memcpy/memset, however, which will be the lion's share of what's remaining.

Last updated: Jul 29 2026 at 05:03 UTC
