github-actions[bot] commented on issue #3697:
Subscribe to Label Action
cc @peterhuene
<details>
This issue or pull request has been labeled: "wasmtime:api"Thus the following users have been cc'd because of the following labels:
- peterhuene: wasmtime:api
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
koute commented on issue #3697:
I've hooked up your implementation to our benchmarks; here are the results:
![call_empty_function](https://user-images.githubusercontent.com/246574/150087000-f1ecfa32-25de-490d-810e-c3e80c85fa94.png)
![dirty_1mb_of_memory](https://user-images.githubusercontent.com/246574/150087009-c371470c-391f-449e-9f17-7c6d90d620ba.png)
Legend:
instance_pooling_memfd
- this PRnative_instance_reuse
- https://github.com/bytecodealliance/wasmtime/pull/3691recreate_instance
- create a fresh instance withInstanceAllocationStrategy::OnDemand
strategyinstance_pooling_with_uffd
: create a fresh instance withInstanceAllocationStrategy::Pooling
strategy with uffd turned on and without this PRinstance_pooling_without_uffd
: create a fresh instance withInstanceAllocationStrategy::Pooling
strategy without uffd turned on and without this PRAnd here's the comparison in raw numeric values (times are in microseconds):
call_empty_function
threads instance_pooling_with_uffd instance_pooling_without_uffd recreate_instance native_instance_reuse instance_pooling_memfd 1 41 78 80 4 27 2 63 110 114 11 43 4 76 172 188 17 66 8 123 360 439 24 89 16 286 695 1117 36 127 dirty_1mb_of_memory
threads instance_pooling_with_uffd instance_pooling_without_uffd recreate_instance native_instance_reuse instance_pooling_memfd 1 312 216 221 144 238 2 844 325 334 209 385 4 1400 526 549 306 453 8 1979 1047 1168 581 618 16 3084 1814 1954 765 840 So there's still some way to go performance-wise. (:
Thanks to Jan on Zulip (are you also @koute from #3691?) for the initial
idea/inspiration!Yes that's me. (:
alexcrichton commented on issue #3697:
Can you share the branch and/or code to reproduce the results with this PR? I was able to reproduce somewhat locally where I got:
call_empty_function_from_kusama_runtime_with_legacy_instance_reuse_on_1_threads
- 97uscall_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads
- 7.1us (your PR)call_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads
- 32us (this PR)I think that roughly aligns with what you measured, but I wanted to confirm by peeking at the code if possible.
Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.
I couldn't for sure drill down into what's causing the issue but my best guess is that this loop is the slow part. That's a one-time-initialization which is pretty branch-y which happens once-per-function in a module, and the modules you're loading are quite large.
I personally think that thtis PR's allocation strategy is more viable in terms of long term maintenance so I'd like to continue to measure this and push on this if we can, but "the numbers don't lie" and your PR has some very impressive numbers! I think we should be working towards that as a goal because that would be awesome to reinstantiate instances that fast. P
Personally I think it would be fruitful to diff the performance between these two PRs. Without memfd enabled I was measuring like 132us for the benchmark above, so I think that this PR shows that 100us of that can be shaved off with memfd/cow, but it looks like there's a remaining ~25us or so to get shaved off. I believe the function-initialization is probably one of those issues, but there may be others lurking. Quantifying the difference and where the remaining 25us or so can go I think would be helpful to figure out the best viable implementation strategy going forward.
alexcrichton edited a comment on issue #3697:
Can you share the branch and/or code to reproduce the results with this PR? I was able to reproduce somewhat locally where I got:
call_empty_function_from_kusama_runtime_with_legacy_instance_reuse_on_1_threads
- 97uscall_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads
- 7.1us (your PR)call_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads
- 32us (this PR)I think that roughly aligns with what you measured, but I wanted to confirm by peeking at the code if possible.
Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.
I couldn't for sure drill down into what's causing the issue but my best guess is that this loop is the slow part. That's a one-time-initialization which is pretty branch-y which happens once-per-function in a module, and the modules you're loading are quite large.
I personally think that this PR's allocation strategy is more viable in terms of long term maintenance so I'd like to continue to measure this and push on this if we can, but "the numbers don't lie" and your PR has some very impressive numbers! I think we should be working towards that as a goal because that would be awesome to reinstantiate instances that fast.
Personally I think it would be fruitful to diff the performance between these two PRs. Without memfd enabled I was measuring like 132us for the benchmark above, so I think that this PR shows that 100us of that can be shaved off with memfd/cow, but it looks like there's a remaining ~25us or so to get shaved off. I believe the function-initialization is probably one of those issues, but there may be others lurking. Quantifying the difference and where the remaining 25us or so can go I think would be helpful to figure out the best viable implementation strategy going forward.
koute commented on issue #3697:
Can you share the branch and/or code to reproduce the results with this PR?
Sure. Here's the branch:
https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_with_cfallin_memfd_cow
And here's how to run them to get numbers for this PR (essentially the same as described in my PR, just with an extra cargo feature):
FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_kusama_runtime_with_recreate_instance_on_1_threads
If you have any further questions feel free to ask! (I can also make myself available for discussion on Zulip.)
Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.
Indeed! One of the reasons why I implemented my PR the way I did is because you have to punch through less layers of abstraction so it's easier to make it faster. (:
cfallin commented on issue #3697:
OK, so I've spent some time digging into this more as well, and thinking about potential designs. My benchmarking (mainly with the
server
benchmark included here, tweaked to load aspidermonkey.wasm
to stress instantiation more) also shows that theVMCallerCheckedAnyfunc
initialization is a large part of theVMContext
initialization time.A few thoughts:
At the highest level, there's the question of what re-instantiation strategy different use-cases require. At least in cases I'm familiar with, and as I understand it, the "freshly instantiated for every request" property is a hard security requirement. The instance-reuse PR #3691 relaxes the boundary somewhat by effectively emulating a fresh instantiation, but without going through all initialization; I remain pretty concerned with the risk here, especially as we add more Wasm features in the future, that we'll forget something and have a small but critical information leak. (In other words, it inverts the initialization: we have to positively erase state or else stale state might come through, rather than today where we'd have to explicitly copy data for it to leak.)
That question feels biggest to me. If all use-cases are OK with reuse and accepting risk of leakage, then let's pursue something like your PR. But if there's still a case for the stronger isolation, then I think it's interesting to see if we can approach it from the "other side" -- making initialization from scratch as fast as possible.
There is also the question of whether we want both, for different circumstances. Currently we seem to be taking for granted that "the fastest approach wins", or in other words, that if we can't get this approach to be at ~parity with #3691, then we pick that approach instead. But the requirements addressed by both are different; this one does more work, for lower risk-profile. Maybe we can optimize it to be close, I'm not sure; but if there is still a gap, I think we should consider them as solving slightly different problems and evaluate appropriately.
All that said, I think it may be possible to optimize and come close. As @alexcrichton pointed out above, the major speed delta between the two is I think almost completely in the function-table initialization in the
VMContext
. (There won't be a significant difference in globals or tables unless the initializers are complex: the reuse-based PR splats a saved array of values over both, while this PR uses the ordinary from-scratch initialization.)We can't fully skip the function-table initialization; the
VMCallerCheckedAnyfunc
s are referred to by pointers in others' tables when one instance imports another's function, as far as I can tell. We also cannot share them between all instances of a module (which would have let us either add indirection to share one block of them, or else do a fastmemcpy
):
- This module's functions (possibly exported) embed the
vmctx
in the 3-tuple, so one of every 3 words needs to be specific to theVMContext
we're building;- Imported functions also may have different
vmctx
s for each instantiation (ie not just the default-callee vmctx for host functions).However, I think we can get away with initializing lazily in
wasmtime_runtime::Instance::get_caller_checked_anyfunc
; just check if null, and if so, lazily init then return. (Re&self
and concurrency, use atomic stores and store the checked-for-null ptr last; updates are idempotent so benign race is safe.) This is sort of loosely inspired by late-binding with dynamic linkers and such, except here the binding happens when someone else asks for the reference (ie during module linking or when filling in table entries), not when a call first occurs. I'm curious what @alexcrichton thinks of all this.Anyway, thoughts on the above? Sorry for the lengthy braindump; I wanted to try to represent how I'm thinking about this and framing the design space in my head, is all. And @koute I want to repeat again a "thank you" for spawning all of this thinking and exploration!
cfallin commented on issue #3697:
I just implemented the lazy-anyfunc-initialization scheme I mentioned above: b8bb1d5d85faeebb3fe275c9a6403fd42d76b8af
It seems to help in my quick-and-dirty local testing; I'm curious what it will do to the benchmarks above but will save that benchmarking (and benchmarking in general with enough precision to have quotable numbers) for Monday :-)
koute commented on issue #3697:
At the highest level, there's the question of what re-instantiation strategy different use-cases require. At least in cases I'm familiar with, and as I understand it, the "freshly instantiated for every request" property is a hard security requirement. The instance-reuse PR Add copy-on-write based instance reuse mechanism #3691 relaxes the boundary somewhat by effectively emulating a fresh instantiation, but without going through all initialization; I remain pretty concerned with the risk here, especially as we add more Wasm features in the future, that we'll forget something and have a small but critical information leak. (In other words, it inverts the initialization: we have to positively erase state or else stale state might come through, rather than today where we'd have to explicitly copy data for it to leak.)
That question feels biggest to me. If all use-cases are OK with reuse and accepting risk of leakage, then let's pursue something like your PR. But if there's still a case for the stronger isolation, then I think it's interesting to see if we can approach it from the "other side" -- making initialization from scratch as fast as possible.
This is a good point; I imagine that there should be use cases on both sides of the spectrum. I guess maybe it'd be nice to survey other
wasmtime
users' and ask their opinion about this?I can only speak for ourselves, but in our main use case it's less of a security issue, and more of a "we just don't want consecutive instantiations to interfere with each other by accident". Our WASM module is - in general - trusted, and (simplifying a lot) we run a distributed system where the nodes essentially "check" each other's results, so leaking information locally within the same node is not a critical issue.
Originally we didn't even clear the linear memory at all - we just manually reinitialized the globals and statics on every "fake" instantiation. But that ended up being problematic, since depending on the flags with which the WASM module was compiled the information about which statics need to be cleared on reinstantiation might not be emitted, so we decided to start just clearing the memory for simplicity as a quick fix, and started working on speeding it up, which resulted in my PR.
As for the new features - in general we're pretty conservative here, and we disable every
wasm_*
feature flag by default until we can validate that there's a benefit to enabling them, and that everything still works as it should.And @koute I want to repeat again a "thank you" for spawning all of this thinking and exploration!
Thank you for tackling this problem!
It seems to help in my quick-and-dirty local testing; I'm curious what it will do to the benchmarks above but will save that benchmarking (and benchmarking in general with enough precision to have quotable numbers) for Monday :-)
I updated my
wasmtime
benchmarking branch (I took your branch and merged the branch from my PR) and reran all the benchmarks and (unless I brainfarted somewhere and did something wrong) unfortunately it looks like the results are unchanged from the previous run for our benchmarks (within the margin of error):call_empty_function | threads | instance_pooling_memfd_v2 | instance_pooling_memfd_v1 | |---------|---------------------------|---------------------------| | 1 | 27 | 27 | | 2 | 47 | 43 | | 4 | 67 | 66 | | 8 | 90 | 89 | | 16 | 125 | 127 | dirty_1mb_of_memory | threads | instance_pooling_memfd_v2 | instance_pooling_memfd_v1 | |---------|---------------------------|---------------------------| | 1 | 239 | 238 | | 2 | 380 | 385 | | 4 | 456 | 453 | | 8 | 613 | 618 | | 16 | 838 | 840 |
(
instance_pooling_memfd_v2
is the new run,instance_pooling_memfd_v1
are the numbers from the original run)
alexcrichton commented on issue #3697:
Thanks for writing all that up @cfallin, and at least my own personal opinion is to primarily err on the side of this PR in the sense of information leakage and spec-compliance. I very much want to get to the performance numbers in #3691 but I am not 100% convinced yet it's necessarily worth the loss in internal structuring (e.g. implementing an "undo" in addition to implementing a fast "redo".). Before making that conclusion though I think it's best to get this approach as close to #3691 as we possibly can.
However, I think we can get away with initializing lazily
Originally I didn't think this would work but digging more into your patch it looks like it's doing everything that would be necessary (although it is pretty gnarly). That being said I believe that the construction of an owned
InstanceAllocationInfo
is probably a large hit to instantiation-time costs, there's some large-ish arrays internally which we don't want to reallocate on all instantiations so to get a speedup from this that would probably need to be cached across instantiations somehow (and may explain why the measured numbers aren't all that different.
I personally still think that the biggest win is going to be not creating a
VMCallerCheckedAnyfunc
for all functions in the module, only for the "possibly exported" ones. That set is at least an order of magnitude smaller than the set of all functions and will make whatever we do that much faster since there's less work to be done. I don't think that it will close the gap fully, though, and further investigation/tweaking may still be warranted.Another possible idea I might have is that the
VMCallerCheckedAnyfunc
representation today is:#[repr(C)] pub struct VMCallerCheckedAnyfunc { pub func_ptr: NonNull<VMFunctionBody>, pub type_index: VMSharedSignatureIndex, pub vmctx: *mut VMContext, }
but we could probably change this to:
#[repr(C)] pub struct VMCallerCheckedAnyfunc { pub info: *mut VMCallerCheckedAnyfuncStaticInfo, pub vmctx: *mut VMContext, } #[repr(C)] pub struct VMCallerCheckedAnyfuncStaticInfo { pub func_ptr: NonNull<VMFunctionBody>, pub type_index: VMSharedSignatureIndex, }
Here
VMCallerCheckedAnyfunc
is per-instance butVMCallerCheckedAnyfuncStaticInfo
can be calculated per-module. This means that atModule
construction time we could build up an array of the "static info" and then instantiation would have less cross-correlation to do across a number of lists (I think we could entirely skip theSharedSignatures
thing being passed down. I think that this will probably speed up that loop but I don't know by how much. This'll also add one more load or two tocall_indirect
and I don't know the performance implications of that. Theoretically though instantiating a module many times consumes less memory sinceVMCallerCheckedAnyfunc
is a pointer smaller!
cfallin commented on issue #3697:
I personally still think that the biggest win is going to be not creating a
VMCallerCheckedAnyfunc
for all functions in the module, only for the "possibly exported" ones. That set is at least an order of magnitude smaller than the set of all functions and will make whatever we do that much faster since there's less work to be done. I don't think that it will close the gap fully, though, and further investigation/tweaking may still be warranted.@alexcrichton I may be misunderstanding something about the internal workings but my intent with the lazy initialization was to actually subsume this; i.e. with lazy init we should at least only construct the anyfuncs that are possibly exported, and ideally even fewer than that. (So in other words I think we should already be getting that benefit with the patch.) Or are you imagining something different here?
[factoring the anyfunc into two parts]
I do like this, but it involves quite a lot more changes to generated code so I'd like to see if we can avoid doing it if we can. I think the most promising approach (what I'm poking at right now) is to share more of the init work wrt the
SharedSignatures
. I'm not sure but I suspect this may be why @koute 's benchmark still isn't showing improvement from lazy-init.(I've been testing by instantiating
spidermonkey.wasm
, so the benefit from skipping initialization of an anyfunc per function is probably significantly magnified compared to smaller modules; I'm planning to get the substrate benchmark working locally for me too, so I can steer based on the same numbers...)
alexcrichton commented on issue #3697:
@cfallin oh that's a good point, I was mostly thinking of other ways to get the speedup without laziness now that I'm thinking about it due to the complexity here. Otherwise though I think your approach is currently slower for unrelated reasons to laziness, the creation of
InstanceAllocationInfo
on each instantiation?
cfallin commented on issue #3697:
@koute I spent some time trying to reproduce numbers on your branch, and ran into a series of issues:
- A clone of the branch linked above refers to a no-longer-existing commit in your wasmtime fork (I guess because of a rebase?); I edited
client/executor/wasmtime/Cargo.toml
andprimitives/wasm-interface/Cargo.toml
to refer to my local clone of your wasmtime;- When trying to build with your
cargo bench
commandline above, I hit a number of build errors inclient/executor/wasmtime/src/tests.rs
related to theSemantics
struct (nofast_instance_reuse
bool, instead aninstantiation_strategy
field);- When I tried to hack that to work (using a "reuse" or "recreate" strategy), I ran into:
thread 'main' panicked at 'creating a full node doesn't fail: Client(VersionInvalid("RuntimeConstruction(Other(\"failed to instantiate a new WASM module instance: Incompatible allocation strategy\"))"))', bin/node/cli/benches/block_production.rs:114:57
I'm benchmarking locally with
benches/server.rs
in this PR, modified to load some other Wasm modules; so far a big one (spidermonkey.wasm) but I'll test with some others as well. I'm planning to improve the lazy-anyfunc-init and will hopefully have another round of changes today at least verified locally :-)
cfallin commented on issue #3697:
I've modified the scheme in this PR to not recompute signature info on every instantiation; now it's showing a decent speedup on SpiderMonkey instantiation, from ~1.69us to ~1.48us on my machine (raw instance creation only, no start function invocation). WIthout the signature fix I was seeing ~1.6us IIRC. I'm redoing some profiling now to see what else might be done...
cfallin commented on issue #3697:
@koute I've now improved perf a bit more -- in my local tests (raw instantiate, no start function,
spidermonkey.wasm
) I went from 1.68us earlier today to 773ns (!):strategy pooling, occupancy 1000, benches ["spidermonkey.wasm"] time: [679.58 ns 773.45 ns 876.66 ns] change: [-58.530% -52.164% -44.172%] (p = 0.00 < 0.05)
I'm curious if the factor of ~2 will translate into your benchmark as well, and/or what constant factors are left. (I'd still like to somehow be able to get your benchmark to run locally too...)
The trick in my latest changes, beyond lazy init of anyfuncs, was to represent non-init with a zeroed bitmap in the vmctx, rather than zeroes in the sparse array. Zeroing the latter was slow; per-field writes, even just a memset, array is too sparse. After the change it doesn't seem to show up in my profiles any more; the actual table-element init (building anyfuncs that are actually referenced) is the bulk of the instance init time.
koute commented on issue #3697:
@cfallin Sorry about the non-working branch! I've updated the code, but I haven't updated the
Cargo.lock
so it was referring to the old commit. (Updating theCargo.lock
would have fixed it.)I've pulled in your most recent changes and updated the branches; here are the numbers:
call_empty_function | threads | native_instance_reuse | instance_pooling_memfd_v3 | instance_pooling_memfd_v1 | |---------|-----------------------|---------------------------|---------------------------| | 1 | 4 | 28 | 27 | | 2 | 11 | 49 | 43 | | 4 | 17 | 71 | 66 | | 8 | 24 | 98 | 89 | | 16 | 35 | 127 | 127 | dirty_1mb_of_memory | threads | native_instance_reuse | instance_pooling_memfd_v3 | instance_pooling_memfd_v1 | |---------|-----------------------|---------------------------|---------------------------| | 1 | 147 | 242 | 238 | | 2 | 214 | 381 | 385 | | 4 | 309 | 453 | 453 | | 8 | 580 | 629 | 618 | | 16 | 732 | 831 | 840 |
Looks like they're mostly... unchanged again?
cfallin commented on issue #3697:
@koute -- thanks! @alexcrichton and I tracked down what we think is the delta in runtime perf; it led back to a comment you left at your use of
Mmap::populated_range
regarding performance of touching anonymous-zero memory vs CoW-mapped memory. Obvious in hindsight (zeroed, or even better pre-zeroed, is faster than copied) but I had missed it. I think this PR should be on par with yours wrt runtime perf now; instantiation still TBD, likely some gap remaining, but the lazy anyfunc init or some other scheme (knowing which functions are reference-taken and having a separate index space for those, for example) is the key for large modules I think.
cfallin commented on issue #3697:
@koute: I'm unfortunately still not able to run your benchmarks: I'm reaching the same "Incompatible allocation strategy" error I quoted above. I'm on the latest commit (
25917541d78
), with Cargo.tomls modified to refer to a local checkout of your Wasmtime branch, commitec95b9365e3
. I had to comment out tests in client/executor/wasmtime/src/tests.rs to get the build to complete.In any case, I suspect that this PR is much closer now -- I'll wait for your confirmation, or not, on this to be sure; if we know we can get close with the "from-scratch" approach, or otherwise if we can quantify what the remaining delta is, then this should help us decide what to do next.
koute edited a comment on issue #3697:
Can you share the branch and/or code to reproduce the results with this PR?
Sure. Here's the branch:
https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_with_cfallin_memfd_cow
And here's how to run them to get numbers for this PR (essentially the same as described in my PR, just with an extra cargo feature):
cd substrate/client/executor/benches FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_kusama_runtime_with_recreate_instance_on_1_threads
If you have any further questions feel free to ask! (I can also make myself available for discussion on Zulip.)
Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.
Indeed! One of the reasons why I implemented my PR the way I did is because you have to punch through less layers of abstraction so it's easier to make it faster. (:
koute commented on issue #3697:
I'm unfortunately still not able to run your benchmarks: I'm reaching the same "Incompatible allocation strategy" error I quoted above.
@cfallin: This might be a silly question, but... are you using the right branch and calling the benchmarks exactly as described? (:
There's the branch from my original PR (which is unmodified, and won't work), and there's the branch for this PR (link is here: https://github.com/bytecodealliance/wasmtime/pull/3697#issuecomment-1017164295). You also have to
cd
into the right directory first. (I don't remember whether the whole codebase compiled on that branch, but it might not, since I was only interested in our WASM executor and the benchmarks.)Anyway, I reran the benchmarks again, and here are the results:
call_empty_function | threads | native_instance_reuse | instance_pooling_memfd_v4 | instance_pooling_memfd_v1 | |---------|-----------------------|---------------------------|---------------------------| | 1 | 5 | 24 | 27 | | 2 | 11 | 46 | 43 | | 4 | 17 | 69 | 66 | | 8 | 24 | 97 | 89 | | 16 | 35 | 128 | 127 | dirty_1mb_of_memory | threads | native_instance_reuse | instance_pooling_memfd_v4 | instance_pooling_memfd_v1 | |---------|-----------------------|---------------------------|---------------------------| | 1 | 144 | 170 | 238 | | 2 | 217 | 241 | 385 | | 4 | 307 | 338 | 453 | | 8 | 589 | 605 | 618 | | 16 | 764 | 815 | 840 |
There's an improvement!
cfallin commented on issue #3697:
You also have to
cd
into the right directory first.Oh, I definitely missed that bit; I was running
cargo bench
from the root of your tree, and indeed hitting build failures. I confirm I can run some of your benchmarks (the "legacy reuse" ones specifically) locally now.However, on a clean checkout of
3d80c9593624dc9f28ab877bae6e7d86459523db
(your latest commit), with no local changes, I get the following error for a "native reuse" benchmark:$ cd client/executor/benches/ $ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads [ ... ] Benchmarking call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads: Warming up for 3.0000 sthread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RuntimeConstruction(Other("failed to instantiate a new WASM module instance: Incompatible allocation strategy"))', client/executor/benches/bench.rs:171:67 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
If I should be running a different command, or from a different commit base, please do let me know!
[new numbers]
It's good to see confirmation here, thanks!
So I think at this point we can conclude something like: when the instance runs for long enough, the performance of this PR converges to that of your PR, because the underlying memory mapping is the same (anonymous mmap for zeroes, CoW of memfd otherwise). There is still an instantiation-time penalty, and this still mostly has to do with anyfunc initialization and initialization of the table elements that refer to those anyfunc structs.
One thing that @alexcrichton and I noticed when looking at your benchmark in particular today is that it does not appear to contain any
memory.grow
instruction at all; so the memory growth path is not stressed here. That is a hot path for other use-cases, and the instance-reuse PR needs to callmprotect
, which takes the process-widemmap_lock
; this is something we want to avoid in highly-concurrent-server contexts. This PR uses a per-slot memfd withftruncate
to avoid that.Anyway, at this point I think it might be fair to say that we've closed part of the gap (between stock wasmtime and your instance-reuse PR), know what the remaining gap can be attributed to, and this PR's approach might be slightly easier to maintain (as per @alexcrichton above); so it comes down to a choice of squeezing every last microsecond with a "trusted module" (your case) vs taking a still-significant improvement over today's mainline with existing security boundaries (this PR). That's a choice that depends on relative priorities; do others want to weigh in more here?
One last thing I might say is that it does seem possible to build a snapshot/rewind-like approach on top of a memfd-as-part-of-normal-instantiation foundation (this PR); basically we would provide an extra hook that does the
madvise()
, and aSnapshot
object attached to an instance that saves and restores globals/table entries, and splats them back en-masse. This layering -- putting the memfd and mmap magic in the traditional instantiation machinery -- lets us use it for a "flash-reset" too, while the other way around, building all of this as a side-mechanism, precludes us from getting any benefit in a "transparent behind existing API" scenario. So this might be a way to satisfy both use-cases with one (or 1.2-ish) implementation(s). Thoughts?
koute commented on issue #3697:
However, on a clean checkout of
3d80c9593624dc9f28ab877bae6e7d86459523db
(your latest commit), with no local changes, I get the following error for a "native reuse" benchmark:```
$ cd client/executor/benches/
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads[ ... ]
Benchmarking call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads: Warming up for 3.0000 sthread 'main' panicked at 'calledResult::unwrap()
on anErr
value: RuntimeConstruction(Other("failed to instantiate a new WASM module instance: Incompatible allocation strategy"))', client/executor/benches/bench.rs:171:67
note: run withRUST_BACKTRACE=1
environment variable to display a backtrace
```If I should be running a different command, or from a different commit base, please do let me know!
Sorry, the way I quickly hacked different instantiation strategies into the benchmarks is little janky, so you can't run all of them with a single command. So for each instantiation strategy you basically have to run them separately, e.g. to get the numbers for your PR:
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator with_recreate_instance
To get the numbers for just recreating each instance from scratch with the ondemand strategy without any pooling:
$ rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime with_recreate_instance
And to get the numbers for my PR:
$ rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime native_instance_reuse
So I think at this point we can conclude something like: when the instance runs for long enough, the performance of this PR converges to that of your PR, because the underlying memory mapping is the same (anonymous mmap for zeroes, CoW of memfd otherwise).
Well, yes, but the whole point here is to reduce the overhead of starting fresh, relatively short-lived instances, right? (: Even just using the ondemand strategy converges to the same performance if it runs long enough.
One thing that @alexcrichton and I noticed when looking at your benchmark in particular today is that it does not appear to contain any
memory.grow
instruction at all; so the memory growth path is not stressed here. That is a hot path for other use-cases, and the instance-reuse PR needs to callmprotect
, which takes the process-widemmap_lock
; this is something we want to avoid in highly-concurrent-server contexts. This PR uses a per-slot memfd withftruncate
to avoid that.Indeed. Do you want me to add some extra benchmarks for that?
Hmm... also, couldn't the
ftruncate
trick be also used with my approach? (I can try adding it in, but I'm not sure if there's a point if the chances of my PR being accepted are low anyway?)lets us use it for a "flash-reset" too, while the other way around, building all of this as a side-mechanism, precludes us from getting any benefit in a "transparent behind existing API" scenario
Since we're talking about the API, I just want to chime in here that from our perspective as
wasmtime
users (this might or might not be true for other people with different use cases) the "side-mechanism" is actually more of a "transparent behind existing API" approach. (:Consider the scenario where someone is currently using the ondemand instance allocator and wants to switch to a faster approach. With the approach from my PR it's basically this (I'm obviously simplifying as this is not the actual patch from our code, but that's essentially what it took):
@@ -1,3 +1,13 @@ +if let Some((instance, store)) = instance_cache.pop() { + return Ok((instance, store)); +} else { let store = initialize_store()?; - let instance = instance_pre.instantiate(&mut store)?; + let instance = instance_pre.instantiate_reusable(&mut store)? return Ok((instance, store)); +} + +// ...and when we're done with the instance... +if instance_cache.len() < MAXIMUM_CACHED_INSTANCES { + instance.reset()?; + instance_cache.push((instance, store)); +}
It's fast, gives the user control and is super simple. There's no need to define any limits on the WASM module itself (so the WASM executor won't suddenly get bricked if the module has more functions/globals/larger memory/etc. than expected), it will still work if we instantiate too many modules at the same time (the extra modules just won't be reused), and it allows the user to tweak how many instances are being cached without reinitializing everything.
(And if we wanted to make it even more transparent the reuse could maybe be integrated into
InstancePre
, which would implicitly create reusable instances with its normalinstantiate
and cache them once they're dropped, without the user having to do anything besides setting a single flag. That would obviously require some more API changes in general since e.g.Instance
is currentlyCopy
, among other things. I didn't do this because it would have been even more controversial.)Now consider switching to the pooling allocator. Yes, turning it on is essentially just a single line of code calling
Config::allocation_strategy
, but that's not all it takes! Now suddenly you have limitations which you didn't have before. Your WASM module can't exceed certain limit, and you can only have a certain number of instances active at the same time. It's not transparent. Now you need to write some code to have a fallback, and since the point where the strategy is configured is when creating theEngine
and not when instantiating the module it's not very convenient.So I totally understand that my approach might be harder to maintain, however purely as a user I think the pooling approach (completely ignoring how it performs) is not very convenient to use. (Unless it would be truly a "set and forget" without having to explicitly set any hard limits.)
alexcrichton commented on issue #3697:
My personal take an opinion is that we should follow a sequence of steps that looks like:
- Focus on the strategy outlined in this PR, accelerating instantiation rather than implementing snapshots.
- Update this PR to be simply memfd-backed-CoW for the pooling allocator
- Enumerate remaining optimizations around instantiation which need to be improved. I think an important point to note here is that we literally haven't been able to measure these other bits of instantiation because the memory initialization was so expensive. That basically means that I think there's a lot of low-hanging fruit to optimize here:
- Optimize the
VMCallerCheckedAnyfunc
array by making it smaller, lazily initialized, faster to initialize, or some combination thereof.- Remove the need to grab an rwlock for inserting a host function into a store.
- Remove the need to clone function types when inserting a host function into a store.
- Optimize tracking of trampolines in the store to probably not use
HashMap
vanilla (either a different data structure or a more optimized hash algorithm)- Add a new feature to Wasmtime which is enabling the memory pooling allocator with the on-demand instance allocator. Perhaps with some default settings about how many memories to pool or similar. (or something like this).
- Work towards enabling memfd by default on Linux where we can with the on-demand instance allocator.
I believe we can solve all the problems here using a hybrid of this PR and @koute's, I don't think that this is an either/or choice. In isolation I don't think that anyone would disagree that start-from-scratch instantiation is more robust and easier to reason about. The main reason to implement some sort of snapshot-like approach would be performance, but I'm pretty sure we have a good handle on performance from poking around at the examples from @koute your substrate fork. This so far has revealed:
- As mentioned above the initialize-an-instance path hasn't really been optimized to this degree because we couldn't measure things. There's lots of low-hanging fruit to optimize.
- @cfallin's original PR and implementation showed that CoW-ing a page of zeros from
memfd
is much slower than from an anonymous mapping. This led to tweaking the design to, as your PR does @koute, minimizing the size of thememfd
image as well as using an anonymousmmap
for the initial heap contents that are all zero.- Locally I'm seeing that for the threaded benchmarks you have @koute the main performance bottleneck seems to be the usage of
std::sync::RwLock<T>
around the signatures in an engine and hitting that rwlock ~100 times during instantiation. (this is also one of those easy "low hanging fruit" to optimize)I'm pretty confident that we have basically quantified the difference in numbers that you're seeing @koute between
native_instance_reuse
andinstance_pooling_memfd_v*
. I don't mean to pretend that they're the same and we don't have to worry about the difference, my point is that I feel like we have a good handle on why there's a difference and a number of avenues ahead of us to make the difference even smaller. It may be the case that this PR's strategy of always-initialize-the-instace may not microsecond-for-microsecond come out on par with "reuse a snapshot", but I believe that there's still legwork to be done to definitively say "this is the difference". Until such a conclusion is reached I don't personally think that it's worthwhile to implement a snapshot-like-scheme in Wasmtime.
I'd like to also ideally address some of our API-wise concerns @koute. You've mentioned that for your use case it's important that instantiation always succeeds and you don't want to configure the pooling allocator so far. I think much of this can be solved by having a default pooling allocator for memories within engines like I mentioned above. We can pretty easily remove the need to have limits on the memory pool as well. I think it's worthwhile to acknowledge, though, that the desire to run an unlimited number of instances concurrently isn't a constraint we've seen yet and is so far unique to you.
With a memory pool that enables the CoW/memfd strategy implemented in this PR to be usable with the on-demand allocation strategy. I think that this would solve the issues around trying to configure the pooling allocator since you wouldn't be using it and you could allocate as many stores/instances as the system can support. The limits related to memory could also be relaxed in this situation so they don't need to be specified up-front.
The final thing you've mentioned is that it's easy to build pooling allocators externally, so why not do that? The problem with this is that it places a lot of constraints on the API design of the
wasmtime
crate itself. We originally wanted to do precisely that with the current pooling allocator, allowing an external crate to implement it rather than baking it intowasmtime
itself. The problem is that the API hooks necessary to implement this were far too complicated and invasive. It effectively would have been a massive amount of effort (too much) to design and support such APIs. Similarly with your approach you're enabling a pooling-style of allocation but it comes at the cost ofwasmtime
having to maintain a snapshot-like API. This snapshot-like API probably won't always be exclusively used for pooling which means that it has to be maintained and grow over time for a variety of use-cases. Overall at least my personal conclusion is that it's better for now for us to implement common practices in Wasmtime to give us the maximum flexibility in the API and support over time to ensure we don't lock ourselves into anything.
Well that's a bit of a long-winded post, sorry for the wall of text. I think that captures well my thoughts on all this, though, and I'd like to hear what others hink about this all too!
cfallin commented on issue #3697:
@alexcrichton I've rebased out the lazy-anyfunc stuff into #3733, and done some preliminary cleanup here (e.g. the
Mmap
changes weren't needed). I think this is ready for review now if you'd like to take a look -- thanks!I'm working on a slot-affinity
InstanceAllocationStrategy
now, and can either add that to this PR when it's ready, or do it as a separate PR if that's preferable.
cfallin commented on issue #3697:
@alexcrichton thanks a bunch for all of the review comments -- this definitely polished off a few rough edges!
The main outstanding question right now seems to be how to handle config and conditional compilation. I think it might make sense to do a followup PR that cleans up the pooling allocator in general, including for uffd code, to have a dynamic "mode" setting, and operate based on that. But since it probably makes sense to do that for uffd along with memfd, I think that it's a little out of scope for this PR. Would you mind if I saved this for a followup?
alexcrichton commented on issue #3697:
The main outstanding question right now seems to be how to handle config and conditional compilation
I agree punting this to a future PR makes the mose sense and following uffd's pattern here is the way to go. I'll take a look at #3737 after this, thanks for that!
cfallin commented on issue #3697:
Thanks @alexcrichton ! I've addressed all of your comments I think. Lots of refactoring but things feel much better now. Hopefully that's mostly it :-)
The remaining questions are mostly around how this looks next to uffd, I think. While I definitely understand the desire to make things more unified, I worry a little that the scope of this PR keeps expanding to "clean up uffd while we're at it", and I feel like that could be an endless treadmill, especially if we're hoping to eventually remove uffd if memfd supersedes it. If you feel very strongly about it of course, I'm happy to keep going with whatever suggestions you have.
cfallin commented on issue #3697:
@alexcrichton Thanks for the continued review feedback -- I think I've addressed everything! (Modulo one of your uffd comments -- a "simple" cleanup turned into a huge rabbithole and so I gingerly climbed out. If we really, really need uffd and really, really need the cleanup then I can invest the significant effort to refactor how uffd initialization works.)
I actually agree that it may make sense to start without the ftruncate trick. Synthetic benchmarks have shown the mmap lock to be a bottleneck, e.g. some of my testing comparing worst-case (never reuse the same image, always re-mmap) to best-case (just madvise) shows a significant delta, and the perf profile shows a lot of time spent inside mmap, as one would expect. The
mprotect
necessary to grow the heap without the ftruncate trip hits the same lock. But what I haven't been able to reproduce is a real-world use-case where, including all the overhead of running code and everything else, the heap growth / mmap lock becomes a significant bottleneck.So I've removed that, in favor of just anon-mmap zeroed memory. I did so as a separate commit so if we want to bring this back some day, it's in the history; and we can certainly look at benchmark results if/when the memfd technique sees more real-world use.
cfallin commented on issue #3697:
Just as an addendum here: @alexcrichton and I discussed the possibility of using the implementation bits here (
MemFdSlot
) in the on-demand allocator as well. This would address any use-case that requires that allocator (@koute , we're still very interested in getting something into the codebase that serves your needs!).I took a bit of a deeper look now and I think it's probably best as a followup PR -- it involves some more refactoring in the alloactor machinery (e.g. the
RuntimeMemoryCreator
andRuntimeLinearMemory
traits). But I'm happy to do this work!
cfallin edited a comment on issue #3697:
Just as an addendum here: @alexcrichton and I discussed the possibility of using the implementation bits here (
MemFdSlot
) in the on-demand allocator as well. This would address any use-case that requires that allocator (@koute , we're still very interested in getting something into the codebase that serves your needs!).I took a bit of a deeper look now and I think it's probably best as a followup PR -- it involves some more refactoring in the allocator machinery (e.g. the
RuntimeMemoryCreator
andRuntimeLinearMemory
traits). But I'm happy to do this work!
cfallin commented on issue #3697:
Actually, it turned out to be pretty simple to extend this to the on-demand allocator too; pushed another commit for this, but happy to split it to a second PR if that makes review easier @alexcrichton .
alexcrichton commented on issue #3697:
Sorry I'm heading out for the day and only recently noticed the addition to the on-demand allocator, I skimmed it and it looks good to me, but I'll need to dig in a bit more depth tomorrow, otherwise I've got one instance of what I think is a bug but otherwise just a bunch of nits which can all be deferred to later.
cfallin commented on issue #3697:
Thanks @alexcrichton !
I did a few last refactors based on your comments; given the criticality of this code I want to make sure you're happy with the last commit before merging!
I played around a bit with simplifying
MemFdSlot
to carry less state, but in the end I think as it manages reuse, and has to know what the current mapping state is to reuse mappings, there is not much complexity reduction in externalizing that state and requiring it to be passed back in. It felt a little more brittle to me. Perhaps we could go the other way and have an abstraction for "something that manages an address range", build another one that doesn't use memfds, and make this "slot manager" the one source of truth. We can probably play with this in followup PRs -- what do you think?
koute commented on issue #3697:
@koute we haven't heard from you in a bit, but to reiterate this PR brings all the CoW benefits to the on-demand allocator as well although the reuse case isn't simply an
madvise
to reset. That being said for your "empty" function benchmark from before I'm clocking this PR (re-instantiation in a loop) at around 40us and "'justmadvise
", your PR, at around 5ns. A huge portion of the remaining time is table initialization, scheduled to be addressed in #3733 after some further work (which should make table initialization effectively zero-cost). Basically I want to reiterate we continue to be very interested in solving your use case and accomodating your performance constraints. If you'd like we'd be happy to ping you when other optimization work has settled down and we're ready to re-benchmark.Yes, I saw you're pretty busy working on this so that's why I was keeping quiet until you finish. (:
Please do ping me once this is ready to rebenchmark!
acfoltzer commented on issue #3697:
(tagging myself as a reviewer on this just so I don't lose track of it, but please don't hold up the merge on me)
cfallin commented on issue #3697:
Hmm, running into the
madvise
semantics gaps inqemu
on s390x it seems -- I recall a similar issue when we were adding tests for aarch64 in the uffd work. I'll find a way to skip the tests when on qemu.
Last updated: Dec 23 2024 at 12:05 UTC