Hi. So we're using wasmtime quite heavily in substrate and we're looking to reduce the overhead/latency of instantiating new instances of WASM modules. Our use case is basically this:
1) we compile a single WASM module once at the start (and maybe update it every few weeks),
2) from a single thread we instantiate a fresh module instance, call into it once, and throw the instance away,
3) we repeat (2) many times per second.
So I've done some experiments, and here's roughly the performance we're getting on a benchmark where we instantiate a module (using our production WASM blob) and call an empty function:
Of course a proper production-quality implementation might benchmark slightly differently, but as you can see the numbers here are pretty promising. (And it doesn't even require a recent Linux kernel like uffd does, nor does it require you to manually estimate the module/instance limits like with the pooling allocation strategy.)
So my question here is twofold:
1) Did anyone ever previously investigate introducing a COW-based instance allocation scheme to wasmtime?
2) Would there be interest in adding a COW-based instance instantiation to wasmtime? If so I would be happy to work on it and put up a PR.
I have a similar issue for a game I'm planning on eventually finishing writing: it would call one wasm function perhaps millions of times per second, from multiple threads. I'd like the wasm memory to be copy-on-write because it's a strict requirement that the function be completely deterministic, such that running it twice with the same inputs always gives the exact same result, even if the wasm is malicious and/or poorly designed and tries to detect previous invocations and do something different.
@Jan can you say more about how your CoW scheme works? In particular, do you instrument loads and stores to access an overlay and fall back to the image (i.e., a "software mapping" scheme), or are you altering page mappings (mmap or similar)?
can you say more about how your CoW scheme works?
Sure. It's pretty simple. It roughly goes like this:
1) Map the memory through mmap with a memfd handle as the file descriptor.
2) Instantiate the WASM module once.
3) Fossilize the contents of the memfd by remapping the memory pages in place (using the MAP_PRIVATE + MAP_FIXED flags). Now the instantiated instance can be used normally without modifying the original memory contents.
4) When we're done with the instance, reset it by madvise'ing the memory with MADV_DONTNEED. The memory is reset to its initial state just after it was instantiated, and the instance can be reused.
Ah, OK, so I would be interested if you had any throughput tests of this, not just latency; I suspect the remap (step 3) on instantiation will become a bottleneck, as it will (I think) need to take a one-per-address-space lock on the memory mapping data-structures. This bottleneck was, afaik, the original motivation behind the uffd-based implementation, most relevant in high-concurrency settings
Actually, after reading again: the hotpath (instantiation once everything is ready) is just step 4, is that right? So just the madvise() and nothing else resets the already-existing mapping? Hmm, that's interesting
Yes, the madvise does all of the work to reset the memory; you only run steps 1 to 3 once.
I haven't really done any more thorough throughput tests for this, but I can't imagine it's going to be slower than initializing everything from scratch again on every instantiation in real-world scenarios, although I guess there are probably some scenarios where the uffd-based approach might still be better.
If you have any benchmarks you'd like me to run I'd be happy to oblige. (I'm currently in the process of getting a less hacky implementation put together which I should be able to eventually put up as a PR.)
I'd expect you to need an mmap to reset the mapping to a non-zero initial state. madvise will just tell the kernel it can replace whatever's in that address range with zeros whenever it feels like it; otherwise it's left unmodified. Neither of those resets it to the initial memfd contents.
Actually, no, madvise will automatically repopulate the pages with the memfd contents (because the mapping is backed by the memfd rather than being a private anonymous mapping), so the memory doesn't need to be manually restored. The kernel does it all for you.
FWIW you're definitely not the only one interested in making instantiation blazingly fast; Fastly's primary use case of making an instance per request also motivates a lot of our work to make instantiation fast. If you've got ideas of how to make it faster we're always open to exploring things!
It may also be helpful to share the benchmark you're using to analyze this. One thing Fastly is also concerned about (which may be unique to Fastly and not your use case) is concurrent instantiation on many threads, which originally motivated uffd because it was much faster than using mmap/madvise, since those syscalls require a global process VM lock in the kernel.
I see, thanks for explaining in more detail the motivation behind picking the uffd approach!
Once I get a proof-of-concept PR of my approach into a usable state I'll port our benchmark to use it (my hack with which I initially tested this isn't really something that anyone should see, for their own sanity's sake :sweat_smile: ), and I'll also add extra benchmarks to see how it scales per-thread and share the benchmarks + results.
Ok nice! We are happy to dig into and help with implementation or prototyping as well, so holler if you need help!
I've made a PR with my implementation here; any comments would be highly appreciated!
https://github.com/bytecodealliance/wasmtime/pull/3691
Awesome, thanks for the PR @Jan! As a heads-up, Fastly folks have today and this coming Monday off, but we'll be sure to look at this by Tuesday at the latest.
Last updated: Jan 24 2025 at 00:11 UTC