koute commented on issue #3691:
It looks like `qemu`'s user mode emulator is not emulating `madvise` properly, and that's why the tests are failing on aarch64; I've just checked, and on a bare metal aarch64 system they all pass. I guess ideally we should disable them when running under qemu-user?
cfallin commented on issue #3691:
Hi @koute -- thanks so much for this PR and for bringing up the ideas behind it (in particular, the memfd mechanism)!
Guilty admission on my part: after you mentioned memfd recently on Zulip, and madvise to reset a private mapping (throw away a CoW overlay), I threw together my own implementation as well and did a lot of internal experimentation. (I've been hacking on the pooling allocator and on performance-related things recently as well, and your idea was a huge epiphany for me.) I need to clean it up a bit still but will put it up soon (with due credit to you for the memfd/madvise/CoW ideas!). Perhaps we can get the best ideas out of both of these PRs :-)
One additional realization I had was that, for performance, we don't want to do any `mprotect()` at all on heap growth (it holds the whole-process mmap lock in write mode, so can quickly become a bottleneck). But we can play tricks with ftruncate and a second anonymous memfd, mapped above the initial image, so that in a steady state with instance-slot reuse on reinstantiation, our only syscalls are madvise and ftruncate. (Key bits: mmap can map beyond the actual size of a file; the size of a file can be changed after the mapping is made, and this doesn't touch any whole-process locks; accesses beyond the end of the file cause SIGBUS.)
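To make those key bits concrete, here is a minimal Linux-only sketch of the trick using the `libc` crate. This is not wasmtime's actual implementation: the sizes are illustrative, error handling is reduced to asserts, and the design described above additionally layers a second anonymous memfd above the image, which is omitted here.

```rust
use std::ptr;

// Illustrative page size; real code would query sysconf(_SC_PAGESIZE).
const PAGE: usize = 4096;

unsafe fn heap_slot_demo() {
    // A memfd holding the initial heap image. Crucially, its *file* size
    // can be changed after a mapping exists, without taking the
    // whole-process mmap lock.
    let fd = libc::memfd_create(b"heap\0".as_ptr().cast(), 0);
    assert!(fd >= 0);
    assert_eq!(libc::ftruncate(fd, (16 * PAGE) as libc::off_t), 0);

    // Map far beyond the file's current size: accesses past end-of-file
    // raise SIGBUS, bounding the heap without any mprotect() calls.
    let reservation = 1024 * PAGE;
    let base = libc::mmap(
        ptr::null_mut(),
        reservation,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE, // CoW: writes stay in this instance's overlay
        fd,
        0,
    );
    assert_ne!(base, libc::MAP_FAILED);

    // "Heap growth": just extend the file; no mmap/mprotect on this path.
    assert_eq!(libc::ftruncate(fd, (32 * PAGE) as libc::off_t), 0);

    // Slot reuse on reinstantiation: throw away the CoW overlay so the
    // mapping reads the pristine image again for the next instance.
    assert_eq!(libc::madvise(base, reservation, libc::MADV_DONTNEED), 0);
}
```

Anyway, a few thoughts on this PR: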
The first-class notion of snapshotting, saving globals, etc., is interesting, and I can see how it could be useful in certain scenarios. However I think it actually mixes a few ideas which we probably want to separate and design independently: the notion of instance state, snapshotting, rewinding, etc., and the implementation-level mechanism of CoW-mapping a heap image.
In particular, at least in scenarios I'm more familiar with, we do snapshotting via a separate tool, Wizer, that writes out a Wasm module post-snapshot. IMHO, including a runtime snapshot-and-reuse-of-state-in-memory mechanism muddies the waters a bit -- which should be used when? -- and whether or not Wasmtime should have such an API deserves further discussion, I think.
Given that, I think something that we could more easily land would be a completely API-transparent extension of the pooling allocator. Basically a new mode that (i) creates a backing image for a module's memories when it's first seen, (ii) CoW-maps these images into memory slots, with the ftruncate-to-extend trick above, and (iii) tries to reuse slots with existing mappings of the desired images when possible. This is basically what I've put together, so I'll try to get it up ASAP.
I'm a bit concerned with the need to read `/proc/self/pagemap` to determine whether pages are present. This has some interesting permissions implications (see Documentation/vm/pagemap.txt in the Linux kernel) -- some bits need root or a process capability to see, and it seems that under some kernel versions the whole file is inaccessible except by root. In any case, this creates a deep coupling with a Linux-specific interface for snapshotting; ideally, such API functionality should be designed, if it exists, so it is not forever limited to being a Linux-specific feature. (In other words: performance-only features can be OS-specific, because that's the nature of optimizing for a platform; but functionality should not be OS-specific.)

I think, though, that my concern here is basically subsumed by the higher-level question of where snapshotting should occur: if it is the domain of a separate tool, then that separate tool can do special things that may not be as efficient (e.g., diff a whole memory, or mmap inaccessible and mprotect() in pages one at a time), since snapshotting is probably rare compared to instantiation.
I think these are some things we should talk through after I've put up my PR and we can do comparisons. I'm really grateful for the ton of effort you put into this and look forward to comparing the approaches in more detail!
fitzgen commented on issue #3691:
Agreed with @cfallin that we should disentangle snapshots and instantiation here, and focus on a relatively transparent extension of the pooling instance allocator for now.
That said, a snapshotting feature could be very useful for doing neat things like `rr`-style record-and-replay debugging. But this is a pretty big design space, and I'd want to hash out our motivation/use cases, technical architecture, and API design with an RFC before we dive head first into implementation.
alexcrichton commented on issue #3691:
@koute would using the pooling instance allocator work for your embedding's use case? @cfallin's work right now is, I believe, entirely focused on that, which means that by default Wasmtime wouldn't have copy-on-write re-instantiation, because Wasmtime by default (as you've seen and modified here) uses the on-demand instance allocator. If your embedding doesn't work well with the pooling instance allocator then I think we'll need to brainstorm a solution which "merges" your work here with @cfallin's on the pooling allocator, taking into account the feedback around snapshots (which I personally agree is best to separate, ideally making the copy-on-write business a simple "go faster" config option).
koute commented on issue #3691:
> I'm a bit concerned with the need to read `/proc/self/pagemap` to determine whether pages are present.

This is technically optional and done entirely for performance, so it could be allowed to fail. Even without using `/proc/self/pagemap` my instance reuse mechanism is faster than anything that's currently in `wasmtime` (in our benchmarks); however, it's not always faster than what we're currently using without the `/proc/self/pagemap` part. (And ideally we don't want to switch to anything that's slower; we want to improve performance.)

> This has some interesting permissions implications (see Documentation/vm/pagemap.txt in the Linux kernel) -- some bits need root or a process capability to see, and it seems that under some kernel versions the whole file is inaccessible except by root.

That is, AFAIK, only applicable to the lower bits, which we don't need in this case. The higher bits (which we need) should always be readable when reading our own process' pagemap.
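For context, a rough sketch (Linux-only; not code from this PR) of the kind of pagemap check being discussed: each 8-byte entry in `/proc/self/pagemap` describes one virtual page, and the high "present"/"swapped" bits are readable for one's own process, while only the low physical-frame-number bits are privilege-gated.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Returns whether the page containing `addr` has been touched, i.e.
/// is present in RAM or swapped out, per this process' pagemap.
fn page_is_touched(addr: usize) -> std::io::Result<bool> {
    const PAGE: usize = 4096; // illustrative; query the real page size
    let mut f = File::open("/proc/self/pagemap")?;
    // One little-endian u64 entry per virtual page.
    f.seek(SeekFrom::Start((addr / PAGE * 8) as u64))?;
    let mut buf = [0u8; 8];
    f.read_exact(&mut buf)?;
    let entry = u64::from_le_bytes(buf);
    let present = entry & (1 << 63) != 0;
    let swapped = entry & (1 << 62) != 0;
    Ok(present || swapped)
}
```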
> The first-class notion of snapshotting, saving globals, etc., is interesting, and I can see how it could be useful in certain scenarios. However I think it actually mixes a few ideas which we probably want to separate and design independently: the notion of instance state, snapshotting, rewinding, etc., and the implementation-level mechanism of CoW-mapping a heap image.

> If your embedding doesn't work well with the pooling instance allocator then I think we'll need to brainstorm a solution which "merges" your work here with @cfallin's on the pooling allocator, taking into account the feedback around snapshots (which I personally agree is best to separate, ideally making the copy-on-write business a simple "go faster" config option).
In my initial prototype implementation I actually tried to do this in the same vein as the current pooling allocator, but in the end decided to go with the current approach. Let me explain.
Basically we have three main requirements:
1. Be as fast as possible. (At least as fast as what we currently use, which is the `legacy_instance_reuse` you see on the graphs.)
2. Be robust. (Instantiation cannot fail.)
3. Be simple to use and maintainable. (Although we can live with this not being the case if absolutely necessary.)
One of the problems with the current pooling allocator (ignoring how it performs) is that it fails at (2) and somewhat at (3), and isn't a simple "go faster" option that you can just blindly toggle. You have to manually specify module and instance limits (and if the WASM blob changes too significantly you need to modify them), and you need to maintain a separate codepath (basically keep another separate `Engine` and the same `Module` twice) for the case when instantiation fails.

So personally I think it just makes more sense (especially for something as fundamental and low-level as `wasmtime`) to just let the user explicitly do the pooling themselves, which is very easy to do with the approach in this PR - just preinstantiate as many instances as you want into a `Vec`, pop one on use, push them back when you're done, and if you run out of pooled instances you can just easily instantiate one from scratch with a single extra `if`.
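In code, that user-managed pool might look something like the minimal sketch below. The generic `I` stands in for whatever resettable instance wrapper this PR would expose; none of these names are from the actual PR.

```rust
/// A trivial user-managed instance pool: pop a pre-instantiated
/// instance when one is available, otherwise create a fresh one.
struct Pool<I> {
    ready: Vec<I>,
    make: Box<dyn Fn() -> I>,
}

impl<I> Pool<I> {
    fn new(prewarm: usize, make: impl Fn() -> I + 'static) -> Self {
        // Preinstantiate as many instances as desired into a Vec.
        let ready = (0..prewarm).map(|_| make()).collect();
        Pool { ready, make: Box::new(make) }
    }

    /// Grab a pooled instance, or instantiate from scratch with the
    /// single extra `if` mentioned above.
    fn take(&mut self) -> I {
        if let Some(instance) = self.ready.pop() {
            instance
        } else {
            (self.make)()
        }
    }

    /// Return an instance to the pool once it has been reset for reuse
    /// (the reset itself is this PR's mechanism and is elided here).
    fn put_back(&mut self, instance: I) {
        self.ready.push(instance);
    }
}
```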
I also considered integrating this into `InstancePre` directly (or into a separate type) - that is, there would be no `reset`, and the instances would automatically reset themselves when dropped, be returned into a pool inside of the `InstancePre`, and then be automatically and transparently reused when `InstancePre::instantiate` is called (so that would actually be a transparent "go faster" option), but I decided that it'd be better to just let the user control the pooling themselves since it's so simple to do anyway. (And also, the way everything is stored inside of the `Store` and not the `Instance` would make that somewhat awkward, since you want to reuse both the `Instance` and the `Store`.)

Basically, in our use case we don't really care about snapshotting at all (it's just an implementation detail to make things go fast); all we really need is to be able to instantiate clean instances as fast as possible.
Would this be more acceptable to you if, API-wise, we made it less like snapshotting and more like an explicit way to pool instances?
> I think these are some things we should talk through after I've put up my PR and we can do comparisons. I'm really grateful for the ton of effort you put into this and look forward to comparing the approaches in more detail!
Sounds good to me! I'll hook up your PR into our benchmarks so that we can compare the performance.
alexcrichton commented on issue #3691:
Ok @koute so to follow up on comments and work from before, https://github.com/bytecodealliance/wasmtime/pull/3733 is the final major instantiation optimization that we know of to implement. There are probably some very minor wins still remaining, but that's the lion's share of the improvements that we're going to get into wasmtime (that plus memfd).
Could you try re-running your benchmark numbers with the old strategy y'all are currently using, your proposal in this PR, and then https://github.com/bytecodealliance/wasmtime/pull/3733 as a PR? Note that when using https://github.com/bytecodealliance/wasmtime/pull/3733 the on-demand allocator, while maybe a little bit slower than the pooling allocator, should still be fine. The intention with https://github.com/bytecodealliance/wasmtime/pull/3733 is that it's fast enough that you won't need to maintain a pool of instances, and each "reuse" of an instance can perform the full re-instantiation process. Note that for re-instantiation it's recommended to start from an `InstancePre<T>`, which is the fastest way today to instantiate something.

My prediction is that the time-to-instantiate with https://github.com/bytecodealliance/wasmtime/pull/3733 is likely quite close to the strategy outlined in this PR. It will probably look a little different one way or another, but what I'm curious to see is whether https://github.com/bytecodealliance/wasmtime/pull/3733 works for your use case in terms of robustness and performance.
If you're up for it then it might be interesting to test the pooling allocator as well. I realize that the pooling allocator as-is isn't a great fit for your use case due to it being too constrained, but as a one-off measurement it might help give an idea of the performance tradeoff between the pooling allocator and the on-demand allocator. Also, unless you're specifically interested in concurrent instantiation performance, it's fine to only get single-threaded instantiation numbers. It's expected that https://github.com/bytecodealliance/wasmtime/pull/3733 does not scale well with cores (like this PR) due to the IPIs necessary at the kernel level with all the calls to `madvise`. In that sense the more interesting comparison is probably the single-threaded numbers (again, unless you're specifically interested in the multi-threaded numbers as well).
alexcrichton commented on issue #3691:
Oh, and for now memfd is disabled-by-default, so with #3733 you'll need to execute `config.memfd(true)` as part of Wasmtime's configuration to be sure that it's enabled. If you're seeing instantiation take more than double-digit microseconds then Wasmtime may not be configured correctly and I can help dig in. It's expected, though, that instantiation is probably in the single-digit microseconds.
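For reference, enabling this at the time looked roughly like the sketch below. `Config::memfd` was the then-current, Linux-only knob quoted above; in later wasmtime releases the equivalent setting is the copy-on-write memory-initialization option (`Config::memory_init_cow`).

```rust
use wasmtime::{Config, Engine};

fn engine_with_memfd() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Opt in to memfd/copy-on-write-based instantiation (disabled by
    // default at the time of this thread).
    config.memfd(true);
    Engine::new(&config)
}
```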
koute commented on issue #3691:
@alexcrichton Got it! I'll rerun all of the benchmarks next week and I'll get back to you.
koute commented on issue #3691:
@alexcrichton Sorry for the delay! I updated to the newest `main` of `wasmtime`, ported over this PR and reran the benchmarks. Here are the results:

![call_empty_function](https://user-images.githubusercontent.com/246574/154690173-ada9148f-700c-4913-9eee-503e64ed126a.png)
![dirty_1mb_of_memory](https://user-images.githubusercontent.com/246574/154690186-99911c50-4eb1-4d1e-aa84-b61d709c9a3a.png)

In table form:

call_empty_function

| threads | legacy_instance_reuse | native_instance_reuse | recreate_instance_memfd_only | recreate_instance_pooling_memfd | recreate_instance_pooling_only | recreate_instance_pooling_uffd | recreate_instance_vanilla |
|---------|-----------------------|-----------------------|------------------------------|---------------------------------|--------------------------------|--------------------------------|---------------------------|
| 1 | 48 | 4 | 17 | 8 | 58 | 25 | 65 |
| 2 | 67 | 11 | 43 | 21 | 88 | 36 | 109 |
| 4 | 91 | 17 | 98 | 31 | 156 | 60 | 195 |
| 8 | 160 | 25 | 305 | 45 | 335 | 135 | 434 |
| 16 | 228 | 35 | 634 | 64 | 847 | 271 | 1140 |

dirty_1mb_of_memory

| threads | legacy_instance_reuse | native_instance_reuse | recreate_instance_memfd_only | recreate_instance_pooling_memfd | recreate_instance_pooling_only | recreate_instance_pooling_uffd | recreate_instance_vanilla |
|---------|-----------------------|-----------------------|------------------------------|---------------------------------|--------------------------------|--------------------------------|---------------------------|
| 1 | 185 | 145 | 159 | 150 | 196 | 293 | 204 |
| 2 | 273 | 212 | 274 | 219 | 289 | 835 | 333 |
| 4 | 374 | 310 | 494 | 311 | 499 | 1305 | 553 |
| 8 | 738 | 520 | 967 | 531 | 919 | 1913 | 1074 |
| 16 | 945 | 734 | 2233 | 767 | 1808 | 3122 | 1985 |
When a lot of memory is dirtied it is indeed competitive now, and when not a lot of memory is touched it also performs quite well! Of course, this is assuming both memfd and pooling are enabled.
So I'd like to ask here - are there any plans to make the pooling less painful to use? Something like this would be ideal:

```rust
config.allocation_strategy(InstanceAllocationStrategy::Pooling {
    strategy: Default::default(),
    instance_limit: 8,
});
```
Basically, completely get rid of the `module_limits` and have it work with modules of any size, and change the `instance_limit` to be a soft cap instead of a hard cap - that is, instead of returning an error when the maximum concurrent instance limit is reached it would just transparently create a new `OnDemand` instance. That would make the pooling strategy truly a drop-in replacement without having to hack around it.
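Until something like that exists, the hack-around mentioned earlier in the thread (keeping a second on-demand `Engine` and the same module compiled twice) might look roughly like this sketch; the function and parameter names are illustrative, and `pooled_pre`/`fallback_pre` are assumed to be `InstancePre` values built from the same wasm against the two engines:

```rust
use wasmtime::{Engine, Instance, InstancePre, Store};

fn instantiate_with_fallback(
    pooled_engine: &Engine,
    pooled_pre: &InstancePre<()>,
    fallback_engine: &Engine,
    fallback_pre: &InstancePre<()>,
) -> anyhow::Result<(Store<()>, Instance)> {
    // Try the pooling allocator's engine first.
    let mut store = Store::new(pooled_engine, ());
    match pooled_pre.instantiate(&mut store) {
        Ok(instance) => Ok((store, instance)),
        Err(_) => {
            // Pool exhausted or limits exceeded: fall back to the
            // separate on-demand engine.
            let mut store = Store::new(fallback_engine, ());
            let instance = fallback_pre.instantiate(&mut store)?;
            Ok((store, instance))
        }
    }
}
```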
alexcrichton commented on issue #3691:
Hm, so actually the "only memfd" line, which I'm assuming is using the default on-demand allocator, is performing much worse than expected. The cost of the on-demand allocator is an extra mmap-or-two, and it's not quite as efficient on reuse (a few `mmap` calls instead of one `madvise`). In that sense I wouldn't expect it to be almost an order of magnitude slower in the measurements; instead, what I've been seeing locally is that it's off by some constant-ish factor from the pooling allocator. Do you have some code I could read to double-check the on-demand allocator was configured correctly?

Otherwise, though, I definitely think we can improve the story with the usability of the pooling allocator. The reason that the module limits and such exist is so that we can create appropriate space for the internal `*mut VMContext` allocation, which is sized to each instance. I think, though, that instead of specifically configuring each field and its limits it would be easier to say "each instance gets N bytes of memory" and then, if that's exceeded during instantiation, fail with a descriptive error message. That would mean that a generous amount, such as a megabyte or two, could be allocated for instances and you wouldn't have to worry about tweaking specific limits.

Again, though, I'm surprised that the on-demand allocator is performing as badly as it did in your measurements, as my suspicion was that it would be sufficient for your use case. I think configuring the pooling allocator by allocation size is probably good to do no matter what, though.
alexcrichton commented on issue #3691:
Actually thinking about this some more, my measurements and impression about the relative cost of these is primarily in the single-threaded case, I haven't looked too too closely at the multithreaded bits. Does your use case stress the multi-threaded aspect heavily? If so we can try to dig in some more, but I'm not sure if you're measuring the multi-threaded performance at the behest of an old request of ours or your own project's motivations as well.
As a point of comparison, for a 16-threaded scenario I get:

| | on-demand | pooling |
|-------------|-----------|---------|
| instantiate | 126us | 660us |
| empty | 327us | 690us |
| dirty | 3025us | 1691us |

(hence some of my surprise, but again I haven't dug into these numbers much myself, especially the discrepancy between pooling/on-demand and how the speedup changes depending on the allocation strategy)
koute commented on issue #3691:
> Do you have some code I could read to double-check the on-demand allocator was configured correctly?

Sure.

This is the crate where we encapsulated our use of `wasmtime` (I'm linking to a fork which I used for benchmarking since it has some extra changes which haven't yet landed in production; please excuse the somewhat messy code in places):

https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_2/client/executor/wasmtime
Let me give you a quick step-by-step walkthrough of the code:
1. The engine's configured.
2. The pooling's configured. (Optional.)
3. The Engine is created.
4. Any potential memory imports are converted into memory exports. (Since we don't really need to support imported memories, and last time I tried there was a big performance difference where the pooling allocator got a lot slower when using imported memories; maybe that's why you're seeing the difference here with on-demand vs pooling?)
5. The Module is created.
6. The Linker is created.
7. Any function imports are registered within the Linker.
8. We create an InstancePre. From now on we'll exclusively use this `InstancePre` to instantiate new instances.
9. Then on each instantiation:
   a. We create a new Store. (Since `Store`s can't be reused.)
   b. We instantiate a new instance through `InstancePre`.
   c. We call a single function inside of the WASM blob.
   d. Both the Instance and the Store are then dropped.

And my benchmarks basically just loop (9) over and over again.
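As a rough illustration, those steps map onto the wasmtime API approximately as below (modern API shape; the import name, entry-point name, and the memory-import rewriting of step 4 are made up or elided for illustration):

```rust
use wasmtime::*;

fn bench_setup_and_one_iteration(wasm: &[u8]) -> anyhow::Result<()> {
    // (1)-(3): configure and create the Engine. ((2), the optional
    // pooling configuration, is omitted here.)
    let config = Config::new();
    let engine = Engine::new(&config)?;

    // (5): create the Module. ((4), converting memory imports into
    // exports, would rewrite `wasm` before this point.)
    let module = Module::new(&engine, wasm)?;

    // (6)-(7): create the Linker and register function imports.
    let mut linker: Linker<()> = Linker::new(&engine);
    linker.func_wrap("env", "host_fn", || { /* host import */ })?;

    // (8): create the InstancePre used for every instantiation.
    let pre = linker.instantiate_pre(&module)?;

    // (9): per instantiation (looped in the benchmark): fresh Store,
    // instantiate, call a single function, drop.
    let mut store = Store::new(&engine, ());
    let instance = pre.instantiate(&mut store)?;
    let entry = instance.get_typed_func::<(), ()>(&mut store, "entry")?;
    entry.call(&mut store, ())?;
    // (9d): `instance` and `store` are dropped here.
    Ok(())
}
```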
> Does your use case stress the multi-threaded aspect heavily?
In certain cases we do call into WASM from multiple threads at the same time, so we do care about the multi-threaded aspect, but an instantiated instance never leaves the thread on which the instantiation happened. (Basically, step 9 as a whole is always executed in one go, without the concrete instance being sent to any other thread or being called twice. [Unless our legacy instance reuse mechanism is used, but we want to get rid of that.])
koute commented on issue #3691:
Okay, so since the current memfd + pooling allocation strategy is fast enough (our primary goal was to remove our legacy instantiation scheme without compromising on performance, and ideally to also get a speedup if possible) I'm going to close this PR now.
As I've said previously I'm not exactly thrilled by the API to be able to use the pooling allocator without introducing arbitrary hard limits, but ultimately that's not a dealbreaker and we can hack around it. (:
(I'm of course still happy to answer questions and help out if necessary, so please feel free to ping me if needed.)
alexcrichton commented on issue #3691:
Ok cool thanks for the links, it looks like nothing is amiss there. I also forget that the machine I'm working on is an 80-core arm64 machine where IPIs are likely more expensive than smaller-core-count machines, so that would likely help explain the discrepancy.
If it helps, I put up https://github.com/bytecodealliance/wasmtime/pull/3837 which removes `ModuleLimits` and folds a few things into the `InstanceLimits` structure. You'll probably want `tables` and `memories` set to 1 (since you're only supporting MVP wasm anyway), and `memory_pages` set to the maximum of 65536 (as it's just the maximally allowed memory, not committed memory). You'll need to manually configure `size` and `table_elements` though for your desired maximums. Increasing them will increase the committed memory of the pooling allocator (well, mmapped-as-zero I guess, so it's not actually committed until it's accessed).

That hopefully makes things a bit more usable!
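Put together, that suggested configuration would look roughly like the sketch below. The field values are illustrative, and the exact shape of `InstanceLimits` changed across wasmtime versions, so treat this as an approximation of the post-#3837 API rather than a definitive example:

```rust
use wasmtime::{
    Config, InstanceAllocationStrategy, InstanceLimits, PoolingAllocationStrategy,
};

let mut config = Config::new();
config.allocation_strategy(InstanceAllocationStrategy::Pooling {
    strategy: PoolingAllocationStrategy::default(),
    instance_limits: InstanceLimits {
        count: 8,               // concurrent instance slots
        memories: 1,            // MVP wasm: at most one linear memory
        tables: 1,              // and one table
        memory_pages: 65536,    // maximally *allowed* memory, not committed
        table_elements: 10_000, // pick for your modules' needs
        size: 1 << 20,          // per-instance allocation budget in bytes
        ..InstanceLimits::default()
    },
});
```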
Oh also, for imported memories: it's true that right now, for copy-on-write-based initialization, we do not apply the optimization to imported memories, only to locally defined memories. All of the "interesting" modules whose runtime we've been optimizing define their own memory and export it, but it's not impossible to support imported memories, and if you've got a use case we can look into implementing support for them.
koute commented on issue #3691:
> Ok cool thanks for the links, it looks like nothing is amiss there. I also forget that the machine I'm working on is an 80-core arm64 machine where IPIs are likely more expensive than smaller-core-count machines, so that would likely help explain the discrepancy.
That could potentially affect things, yes; I was testing this on a mere 32-core machine after all. (:
> If it helps, I put up #3837 which removes `ModuleLimits` and folds a few things into the `InstanceLimits` structure. You'll probably want `tables` and `memories` set to 1 (since you're only supporting MVP wasm anyway), and `memory_pages` set to the maximum of 65536 (as it's just the maximally allowed memory, not committed memory). You'll need to manually configure `size` and `table_elements` though for your desired maximums. Increasing them will increase the committed memory of the pooling allocator (well, mmapped-as-zero I guess, so it's not actually committed until it's accessed).
>
> That hopefully makes things a bit more usable!
It is indeed a significant improvement! Thanks!
> Oh also, for imported memories: it's true that right now, for copy-on-write-based initialization, we do not apply the optimization to imported memories, only to locally defined memories. All of the "interesting" modules whose runtime we've been optimizing define their own memory and export it, but it's not impossible to support imported memories, and if you've got a use case we can look into implementing support for them.
In our use case at this point we're fine with the way it is currently. We need to support WASM blobs of either type, but we can just easily patch one into the other if necessary so that the memory's always defined internally and exported.