alexcrichton opened PR #13367 from alexcrichton:optimize-memory-fill to bytecodealliance:main:
This commit is a refactoring and reimplementation of the `memory.fill` instruction in WebAssembly to be more optimal. Previously the implementation was a libcall with the raw arguments to the instruction which had various amounts of overhead in the host to implement the semantics of the instruction. The implementation now performs a bounds-check inline in wasm itself and while there's still an unconditional libcall it's a much simpler libcall.

The goal of this commit is to spearhead the shape of this optimization to pave the way for other libcalls to get modified as well. In isolation `memory.fill` isn't too particularly interesting here. Eventually though I'd like to have `memory.copy` and `array.copy`, for example, bottom out in the same libcall which is a simple `memmove` or similar. For now `memory.fill` was simple enough so I've started here.
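As a rough illustration of the split described above, the sketch below separates the bounds check (the part the PR moves inline into compiled wasm) from a "raw" fill that assumes its range is already valid. It is not Wasmtime's code: `OutOfBounds`, `memory_fill`, and `raw_fill` are hypothetical names, and linear memory is modeled as a plain byte slice.

```rust
/// Hypothetical trap type for this sketch (not a Wasmtime type).
#[derive(Debug, PartialEq)]
struct OutOfBounds;

/// Sketch of `memory.fill` over a linear memory modeled as a byte slice.
/// The bounds check stands in for the code emitted inline in wasm; the
/// call to `raw_fill` stands in for the simplified libcall.
fn memory_fill(memory: &mut [u8], dst: u64, val: u8, len: u64) -> Result<(), OutOfBounds> {
    // `checked_add` rejects `dst + len` overflowing before the comparison.
    let end = dst.checked_add(len).ok_or(OutOfBounds)?;
    if end > memory.len() as u64 {
        // Trap before writing anything: no partial fill on failure.
        return Err(OutOfBounds);
    }
    // The range is known to be in bounds, so the casts cannot truncate.
    raw_fill(&mut memory[dst as usize..end as usize], val);
    Ok(())
}

/// The "raw" half: no checks of its own, effectively a `memset`.
fn raw_fill(region: &mut [u8], val: u8) {
    region.fill(val);
}
```

Calling `memory_fill(&mut mem, 4, 0xFF, 8)` on a 16-byte memory fills bytes 4 through 11, while a request that runs past the end returns `Err(OutOfBounds)` and leaves the memory untouched.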
alexcrichton requested wasmtime-compiler-reviewers for a review on PR #13367.
alexcrichton requested wasmtime-core-reviewers for a review on PR #13367.
alexcrichton requested cfallin for a review on PR #13367.
alexcrichton updated PR #13367.
:thumbs_up: cfallin submitted PR review:
Looks reasonable to me overall -- thanks!
I'm curious if you've measured overheads here; naively, I'd expect that having a libcall at all is the main fixed cost, and moving logic from the libcall to compiled Wasm code is at best neutral (we have to do the same checks either way). In my mind the real win will come when we can avoid a trampoline and the usual runtime entry machinery, instead directly calling memmove/memcpy/memset (our local version or the real thing). And, as long as we're avoiding relocations in cwasms, the most efficient thing would be a direct call to a little local routine we put alongside the trampolines; on x86-64 a `rep stosb`/`rep movsb` (because these are optimized in microcode to do whole-cache-line-sized things, i.e. they're as good as it gets), on aarch64 and others whatever a reasonably good memcpy/memmove/memset loop would be. What do you think?
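To make the "little local routine" idea concrete, here is a minimal sketch of a fill built on `rep stosb`, written with Rust inline assembly. It only illustrates the instruction and is not something from the PR: `fill_bytes` is a hypothetical name, and other architectures fall back to the standard library's slice fill.

```rust
/// Hypothetical helper: fill `buf` with `val` using `rep stosb` on x86-64.
#[cfg(target_arch = "x86_64")]
fn fill_bytes(buf: &mut [u8], val: u8) {
    // SAFETY: `buf` is a valid, exclusively borrowed region of `buf.len()`
    // bytes, which is exactly the range `rep stosb` writes to.
    unsafe {
        // `rep stosb` stores AL to [RDI] RCX times, advancing RDI each step.
        // Rust guarantees the direction flag is clear on entry to an asm
        // block, so the store moves forward through memory.
        std::arch::asm!(
            "rep stosb",
            inout("rcx") buf.len() => _,
            inout("rdi") buf.as_mut_ptr() => _,
            in("al") val,
            options(nostack, preserves_flags)
        );
    }
}

/// Fallback for other targets: defer to the standard library's fill.
#[cfg(not(target_arch = "x86_64"))]
fn fill_bytes(buf: &mut [u8], val: u8) {
    buf.fill(val);
}
```

The helper takes a slice rather than a raw pointer so that bounds are settled by the caller, mirroring the idea that the check happens before the raw routine runs.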
:speech_balloon: cfallin created PR review comment:
`cast_index_to_pointer` reads a little ambiguously to me -- maybe `cast_wasm_addr_ty_to_native_ptr_ty`?
alexcrichton commented on PR #13367:
Agreed yeah, and the main goal here isn't performance of `memory.fill` itself but rather paving the way for making the libcalls more raw. The main low-hanging-fruit I'd like to pick is `array.copy` which is an unconditional libcall which, for `(array i8)`, will load/store each byte individually roundtripping through Wasmtime's high-level `Val` machinery. That ends up being extremely slow, and I'd also prefer to not have a whole bunch of `*_copy` builtins, so my hope is to use this as a plan to refactor `memory.copy` and then use that to refactor `array.copy`, shrinking our builtins and making things faster.

That being said, our safe VM machinery inside of Wasmtime isn't particularly optimized. A 1024-byte `memory.fill` benchmark in Criterion shows 10.1ns before this commit and 5.8ns after this commit. So this does shave off a good chunk of the overhead.

The profile before is:

    36.26%  wasmtime  libc.so.6                [.] __memset_avx2_unaligned_erms
    25.87%  wasmtime  wasmtime                 [.] wasmtime::runtime::vm::libcalls::memory_fill
    14.01%  wasmtime  wasmtime                 [.] wasmtime::runtime::vm::libcalls::raw::memory_fill
    10.81%  wasmtime  jitted-2068832-7728.so   [.] criterion::bencher::Bencher<M>::iter::h29a2389639587d69
    10.61%  wasmtime  jitted-2068832-11083.so  [.] wasmtime_builtin_memory_fill

and the profile after is:

    45.07%  wasmtime  libc.so.6                [.] __memset_avx2_unaligned_erms
    24.83%  wasmtime  jitted-2068506-7728.so   [.] criterion::bencher::Bencher<M>::iter::h29a2389639587d69
    10.56%  wasmtime  jitted-2068506-11083.so  [.] wasmtime_builtin_memory_fill
     4.58%  wasmtime  wasmtime                 [.] wasmtime::runtime::vm::libcalls::raw::memory_fill

where `wasmtime_builtin_memory_fill` and the `raw::memory_fill` functions are pure overhead, but it's still much leaner than before.
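The plan mentioned above, having `memory.copy` bottom out in a plain `memmove`, might look roughly like the following. This is a sketch under assumptions rather than the eventual Wasmtime implementation: `in_bounds` and `memory_copy` are hypothetical names, linear memory is modeled as a byte slice, and `slice::copy_within` stands in for `memmove` since it handles overlapping ranges.

```rust
use std::ops::Range;

/// Hypothetical helper: validate `start..start + len` against `mem_len`.
fn in_bounds(mem_len: usize, start: u64, len: u64) -> Option<Range<usize>> {
    // Reject arithmetic overflow before comparing against the memory size.
    let end = start.checked_add(len)?;
    if end > mem_len as u64 {
        return None;
    }
    Some(start as usize..end as usize)
}

/// Sketch of `memory.copy`: check both ranges up front, then perform one
/// overlap-safe bulk copy instead of moving bytes individually.
fn memory_copy(memory: &mut [u8], dst: u64, src: u64, len: u64) -> Result<(), &'static str> {
    let src_range = in_bounds(memory.len(), src, len).ok_or("out of bounds")?;
    let dst_range = in_bounds(memory.len(), dst, len).ok_or("out of bounds")?;
    // `copy_within` has `memmove` semantics, so overlap is fine.
    memory.copy_within(src_range, dst_range.start);
    Ok(())
}
```

Both ranges are validated before any byte moves, so a failing copy leaves memory unchanged.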
alexcrichton commented on PR #13367:
Benchmark in question is this, in which I golf `black_box` until `memory.fill` shows up in a profile
alexcrichton updated PR #13367.
alexcrichton updated PR #13367.
alexcrichton added PR #13367 Optimize the implementation of memory.fill to the merge queue
:check: alexcrichton merged PR #13367.
alexcrichton removed PR #13367 Optimize the implementation of memory.fill from the merge queue
Last updated: Jun 01 2026 at 09:49 UTC