Stream: general

Topic: Copy on write based instance reuse


view this post on Zulip Jan (Jan 03 2022 at 07:28):

Hi. So we're quite heavily using wasmtime in substrate and we're looking to reduce the overhead/latency of instantiating new instances of WASM modules. Our use case is basically this:

1) we compile a single WASM module once at the start (and maybe update it every few weeks),
2) from a single thread we instantiate a fresh module instance, call into it once, and throw the instance away,
3) we repeat (2) many times per second.

So I've done some experiments, and here's roughly the performance we're getting on a benchmark where we instantiate a module (using our production WASM blob) and call an empty function:

Of course a proper production-quality implementation might benchmark slightly differently, but as you can see the numbers here are pretty promising. (And it doesn't even require a recent Linux kernel like uffd does, nor does it require you to manually estimate the module/instance limits like with the pooling allocation strategy.)

So my question here is twofold:

1) Did anyone ever previously investigate introducing a COW-based instance allocation scheme to wasmtime?
2) Would there be interest in adding a COW-based instance instantiation to wasmtime? If so I would be happy to work on it and put up a PR.


view this post on Zulip Jacob Lifshay (Jan 03 2022 at 17:04):

i have a similar issue for a game I'm planning on eventually finishing writing -- it would call 1 wasm function perhaps millions of times per second and from multiple threads, I'd like the wasm memory to be copy on write because it's a strict requirement that the function be completely deterministic such that running it twice with the same inputs always gives the exact same result -- even if the wasm is malicious and/or poorly designed to try to detect previous invocations and do something different.

view this post on Zulip Chris Fallin (Jan 03 2022 at 17:30):

@Jan can you say more about how your CoW scheme works? In particular, do you instrument loads and stores to access an overlay and fall back to the image (i.e., a "software mapping" scheme), or are you altering page mappings (mmap or similar)?

view this post on Zulip Jan (Jan 04 2022 at 05:34):

can you say more about how your CoW scheme works?

Sure. It's pretty simple. It roughly goes like this:

1) Map the memory through mmap with a memfd handle as the file descriptor.
2) Instantiate the WASM module once.
3) Fossilize the contents of the memfd by remapping the memory pages in place (using MAP_PRIVATE + MAP_FIXED flags). Now the instantiated instance can be used normally without modifying the original memory contents.
4) When we're done with the instance, reset it by madvise'ing the memory with MADV_DONTNEED. The memory is reset to its initial state just after it was instantiated and the instance can be reused again.

view this post on Zulip Chris Fallin (Jan 04 2022 at 05:55):

Ah, OK, so I would be interested if you had any throughput tests of this, not just latency; I suspect the remap (step 3) on instantiation will become a bottleneck, as it will (I think) need to take a one-per-address-space lock on the memory-mapping data structures. This bottleneck was, afaik, the original motivation behind the uffd-based implementation, most relevant in high-concurrency settings.

view this post on Zulip Chris Fallin (Jan 04 2022 at 05:57):

Actually, after reading again: the hotpath (instantiation once everything is ready) is just step 4, is that right? So just the madvise() and nothing else resets the already-existing mapping? Hmm, that's interesting

view this post on Zulip Jan (Jan 04 2022 at 06:31):

Yes, the madvise does all of the work to reset the memory; you only run steps 1 to 3 once.

I haven't really done any more thorough throughput tests for this, but I can't imagine it's going to be slower than initializing everything from scratch again on every instantiation in real-world scenarios, although I guess there are probably some scenarios where the uffd-based approach might still be better.

If you have any benchmarks you'd like me to run I'd be happy to oblige. (I'm currently in the process of getting a less hacky implementation put together which I should be able to eventually put up as a PR.)

view this post on Zulip Jacob Lifshay (Jan 04 2022 at 07:34):

i'd expect you to need a mmap to reset the mapping to a non-zero initial state. madvise will just tell the kernel it can replace whatever's in that address range with zeros whenever it feels like it, otherwise it's unmodified -- neither of those are resetting it to the initial memfd contents.

view this post on Zulip Jan (Jan 04 2022 at 07:51):

Actually, no, after the madvise the kernel will automatically repopulate the pages with the memfd contents on the next access (because it's a private file-backed mapping of the memfd, not an anonymous one), so the memory doesn't need to be manually restored. The kernel does it all for you.

view this post on Zulip Alex Crichton (Jan 04 2022 at 17:12):

FWIW you're definitely not the only one interested in making instantiation blazingly fast, Fastly's primary use case of making an instance-per-request also motivates a lot of our work to make instantiation fast. If you've got ideas of how to make it faster we're always open to exploring things!

view this post on Zulip Alex Crichton (Jan 04 2022 at 17:13):

One thing that may also be helpful is to share the benchmark you're using to analyze that as well. One thing Fastly is concerned about as well (which may be unique to Fastly and not your use case) is concurrent instantiations on many threads which originally motivated uffd because it was much faster than using mmap/madvise due to those syscalls requiring a global process vm lock in the kernel

view this post on Zulip Jan (Jan 05 2022 at 09:13):

I see, thanks for explaining in more detail the motivation behind picking the uffd approach!

Once I get a proof-of-concept PR of my approach into a usable state I'll port our benchmark to use it (my hack with which I initially tested this isn't really something that anyone should see, for their own sanity's sake :sweat_smile: ), and I'll also add extra benchmarks to see how it scales per-thread and share the benchmarks + results.

view this post on Zulip Alex Crichton (Jan 05 2022 at 14:29):

Ok nice! We are happy to dig into and help with implementation or prototyping as well, so holler if you need help!

view this post on Zulip Jan (Jan 14 2022 at 11:13):

I've made a PR with my implementation here; any comments would be highly appreciated!

https://github.com/bytecodealliance/wasmtime/pull/3691

This PR adds a new copy-on-write based instance reuse mechanism on Linux. Usage The general idea is - you instantiate your instance once, and then you can reset its state back to how it was when it...

view this post on Zulip Alex Crichton (Jan 14 2022 at 17:19):

Awesome, thanks for the PR @Jan! As a heads up, Fastly folks have today and this coming Monday off, but we will be sure to look at this by Tuesday at the latest.


Last updated: Jan 24 2025 at 00:11 UTC