wasmtime / PR #13367 Optimize the implementation of `memo... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #13367 Optimize the implementation of `memo...

Wasmtime GitHub notifications bot (May 14 2026 at 16:55):

alexcrichton opened PR #13367 from alexcrichton:optimize-memory-fill to bytecodealliance:main:

This commit refactors and reimplements the memory.fill instruction in WebAssembly to be more efficient. Previously the implementation was a libcall that took the raw arguments to the instruction, which incurred overhead in the host to implement the instruction's semantics. The implementation now performs the bounds check inline in wasm itself, and while there's still an unconditional libcall, it's a much simpler libcall.
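As a rough illustration of the split described above (not the code in the PR), the sketch below models linear memory as a byte slice. The names `checked_fill`, `raw_fill`, and `OutOfBounds` are hypothetical: `checked_fill` stands in for the bounds check that compiled wasm would now perform inline, and `raw_fill` stands in for the simplified libcall that only writes bytes.

```rust
/// Hypothetical trap type for this sketch; not a Wasmtime type.
#[derive(Debug, PartialEq)]
pub struct OutOfBounds;

/// Stand-in for the simplified libcall: no validation, just a memset.
pub fn raw_fill(region: &mut [u8], val: u8) {
    region.fill(val);
}

/// Stand-in for the check emitted inline in compiled code.
/// `memory.fill` traps when `dst + len` overflows or exceeds the memory
/// size, even when `len` is zero.
pub fn checked_fill(
    memory: &mut [u8],
    dst: u64,
    val: u8,
    len: u64,
) -> Result<(), OutOfBounds> {
    // Overflow-safe computation of the end of the filled range.
    let end = dst.checked_add(len).ok_or(OutOfBounds)?;
    if end > memory.len() as u64 {
        return Err(OutOfBounds);
    }
    // Both offsets are now known to fit within the slice.
    raw_fill(&mut memory[dst as usize..end as usize], val);
    Ok(())
}
```

With this shape the callee needs nothing beyond a pointer, a byte value, and a length, which is what would let other instructions share a similarly simple routine.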

The goal of this commit is to establish the shape of this optimization and pave the way for other libcalls to be modified as well. In isolation memory.fill isn't particularly interesting here. Eventually, though, I'd like to have memory.copy and array.copy, for example, bottom out in the same libcall, which is a simple memmove or similar. For now memory.fill was simple enough, so I've started here.

Wasmtime GitHub notifications bot (May 14 2026 at 16:55):

alexcrichton requested wasmtime-compiler-reviewers for a review on PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 16:55):

alexcrichton requested wasmtime-core-reviewers for a review on PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 16:55):

alexcrichton requested cfallin for a review on PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 17:12):

alexcrichton updated PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 17:13):

:thumbs_up: cfallin submitted PR review:

Looks reasonable to me overall -- thanks!

I'm curious if you've measured overheads here; naively, I'd expect that having a libcall at all is the main fixed cost, and moving logic from the libcall to compiled Wasm code is at best neutral (we have to do the same checks either way). In my mind the real win will come when we can avoid a trampoline and the usual runtime entry machinery, instead directly calling memmove/memcpy/memset (our local version or the real thing). And, as long as we're avoiding relocations in cwasms, the most efficient thing would be a direct call to a little local routine we put alongside the trampolines; on x86-64 a rep stosb / rep movsb (because these are optimized in microcode to do whole-cache-line-sized things, i.e. they're as good as it gets), on aarch64 and others whatever a reasonably good memcpy/memmove/memset loop would be. What do you think?
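A minimal sketch of the kind of small local fill routine mentioned above, written as Rust inline assembly rather than as something Cranelift would emit. The function name `fill_bytes` is hypothetical, and the non-x86-64 path simply falls back to the standard library's slice fill instead of a hand-tuned loop.

```rust
/// Fill `dst` with `val` using `rep stosb` on x86-64.
#[cfg(target_arch = "x86_64")]
pub fn fill_bytes(dst: &mut [u8], val: u8) {
    use std::arch::asm;
    // SAFETY: `rdi` points at `dst` and `rcx` holds its length, so the
    // instruction writes exactly `dst.len()` bytes inside the slice.
    // Rust guarantees the direction flag is clear on entry to `asm!`,
    // so the store proceeds forward.
    unsafe {
        asm!(
            "rep stosb",
            inout("rcx") dst.len() => _,
            inout("rdi") dst.as_mut_ptr() => _,
            in("al") val,
            options(nostack, preserves_flags),
        );
    }
}

/// Portable fallback for other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn fill_bytes(dst: &mut [u8], val: u8) {
    dst.fill(val);
}
```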

Wasmtime GitHub notifications bot (May 14 2026 at 17:13):

:speech_balloon: cfallin created PR review comment:

cast_index_to_pointer reads a little ambiguously to me -- maybe cast_wasm_addr_ty_to_native_ptr_ty?

Wasmtime GitHub notifications bot (May 14 2026 at 17:36):

alexcrichton commented on PR #13367:

Agreed yeah, and the main goal here isn't performance of memory.fill itself but rather paving the way for making the libcalls more raw. The main low-hanging fruit I'd like to pick is array.copy, which is an unconditional libcall that, for (array i8), will load/store each byte individually, round-tripping through Wasmtime's high-level Val machinery. That ends up being extremely slow, and I'd also prefer to not have a whole bunch of *_copy builtins, so my hope is to use this as a plan to refactor memory.copy and then use that to refactor array.copy, shrinking our builtins and making things faster.
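To make the contrast concrete, here is a hedged sketch of the two copy strategies over a plain byte buffer. The `Val` enum and the helpers `get`, `set`, `copy_via_vals`, and `copy_bulk` are hypothetical stand-ins, not Wasmtime's actual types or builtins. The first path boxes every byte into a tagged value and unboxes it again; the second is a single memmove-style call.

```rust
/// Hypothetical tagged value, loosely modeled on a high-level `Val`.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
pub enum Val {
    I32(i32),
    I64(i64),
}

/// Load one i8 array element as a tagged value.
fn get(data: &[u8], i: usize) -> Val {
    Val::I32(i32::from(data[i]))
}

/// Store a tagged value back as one byte, checking its type.
fn set(data: &mut [u8], i: usize, v: Val) {
    match v {
        Val::I32(x) => data[i] = x as u8,
        Val::I64(_) => panic!("type mismatch"),
    }
}

/// Slow path: one tagged load and store per element.
pub fn copy_via_vals(data: &mut [u8], src: usize, dst: usize, len: usize) {
    if dst <= src {
        for i in 0..len {
            let v = get(data, src + i);
            set(data, dst + i, v);
        }
    } else {
        // Copy backward so overlapping ranges behave like memmove.
        for i in (0..len).rev() {
            let v = get(data, src + i);
            set(data, dst + i, v);
        }
    }
}

/// Fast path: a single overlap-safe bulk copy.
pub fn copy_bulk(data: &mut [u8], src: usize, dst: usize, len: usize) {
    data.copy_within(src..src + len, dst);
}
```

Both functions produce the same bytes, including for overlapping ranges; the difference is only in per-element overhead.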

That being said, our safe VM machinery inside of Wasmtime isn't particularly optimized. A 1024-byte memory.fill benchmark in Criterion shows 10.1ns before this commit and 5.8ns after this commit. So this does shave off a good chunk of the overhead.

The profile before is:
  36.26%  wasmtime         libc.so.6                [.] __memset_avx2_unaligned_erms
  25.87%  wasmtime         wasmtime                 [.] wasmtime::runtime::vm::libcalls::memory_fill
  14.01%  wasmtime         wasmtime                 [.] wasmtime::runtime::vm::libcalls::raw::memory_fill
  10.81%  wasmtime         jitted-2068832-7728.so   [.] criterion::bencher::Bencher<M>::iter::h29a2389639587d69
  10.61%  wasmtime         jitted-2068832-11083.so  [.] wasmtime_builtin_memory_fill
and the profile after is:
  45.07%  wasmtime         libc.so.6                [.] __memset_avx2_unaligned_erms
  24.83%  wasmtime         jitted-2068506-7728.so   [.] criterion::bencher::Bencher<M>::iter::h29a2389639587d69
  10.56%  wasmtime         jitted-2068506-11083.so  [.] wasmtime_builtin_memory_fill
   4.58%  wasmtime         wasmtime                 [.] wasmtime::runtime::vm::libcalls::raw::memory_fill
where wasmtime_builtin_memory_fill and the raw::memory_fill functions are pure overhead, but it's still much leaner than before.

Wasmtime GitHub notifications bot (May 14 2026 at 17:40):

alexcrichton commented on PR #13367:

Benchmark in question is this, in which I golf black_box until memory.fill shows up in a profile

Wasmtime GitHub notifications bot (May 14 2026 at 17:43):

alexcrichton updated PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 17:50):

alexcrichton updated PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 18:18):

alexcrichton added PR #13367 Optimize the implementation of memory.fill to the merge queue

Wasmtime GitHub notifications bot (May 14 2026 at 18:46):

:check: alexcrichton merged PR #13367.

Wasmtime GitHub notifications bot (May 14 2026 at 18:46):

alexcrichton removed PR #13367 Optimize the implementation of memory.fill from the merge queue

Last updated: Jul 29 2026 at 05:03 UTC