Duslia opened issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |This is my code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Duslia edited issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Duslia edited issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is close. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Duslia edited issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is almost the same. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Duslia edited issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8wThis is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Duslia edited issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8wThis is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
alexcrichton commented on issue #4637:
Thanks for the report! Could you share the Wasmtime version you're using as well as the specific
*.wasm
file that you're benchmarking?
alexcrichton commented on issue #4637:
That being said what you're almost certainly running into is limits on concurrently running
mmap
in multiple threads within one process. Allocation of fiber stacks for asynchronously executing WebAsembly primarily goes throughmmap
right now and unmapping a stack requires broadcasting to other threads with an IPI which slows down concurrent execution.One improvement that you can do to your benchmark is to use the pooling allocator instead of the on-demand allocator. Otherwise though improvements need to come from the kernel itself (assuming you're running Linux, which would also be great to specify in the issue description). Historically the kernel has prototyped a
madvisev
syscall which allows batch unmappings that Wasmtime could use with the pooling allocator. We've tested this historically and it's shown to greatly increase the throughput in situations like this. This isn't landed in Linux, however.Finally it depends on your use case of whether this comes up much in practice. Typically the actual work performed by a wasm module dwarfs the cost of the memory mappings since the wasm modules live for a longer time than "return 1". If that's the case then this microbenchmark isn't necessarily indicative of performance and it's recommended to use a more realistic workload for your use case. If this sort of short-running computation is common for your embedding, however, then multithreaded execution will currently have the impact that you're measuring here.
Duslia commented on issue #4637:
Thanks!! I got it.
Duslia commented on issue #4637:
I'm trying to comment out these two lines and it can scale horizontally with a high performance.
<img width="855" alt="image" src="https://user-images.githubusercontent.com/37136584/183650680-efd55dcf-b8ad-4748-ae9d-3382b96caa74.png">
Is the function of these lines of code to prevent the previous data from being obtained in the next execution?
Duslia edited a comment on issue #4637:
I'm trying to comment out these two lines and it can scale horizontally with a high performance.
<img width="855" alt="image" src="https://user-images.githubusercontent.com/37136584/183650680-efd55dcf-b8ad-4748-ae9d-3382b96caa74.png">
I 'd like to ask that is the function of these lines of code to prevent the previous data from being obtained in the next execution?
bjorn3 commented on issue #4637:
Yes, without that call the old data would be preserved when reusing a memory region as linear memory for the wasm module.
alexcrichton commented on issue #4637:
Ah ok that's a good confirmation that
madvise
and IPIs are indeed likely your scaling bottleneck. In that case there's unfortunately not much that can be done right now that Wasmtime can do. This is some (old) discussion about a possiblemadvisev
syscall which would largely fix this scaling bottleneck within Wasmtime. This was confirmed with a modified kernel with that patch and developing a version of Wasmtime that usesmadvisev
.In the meantime though this is something that basically just needs to be taken into account when capacity planning with Wasmtime. If you're willing assistance in getting a patch such as that into the Linux kernel (or investigating other routes of handling this bottleneck) would be greatly appreciated!
alexcrichton commented on issue #4637:
Oh sorry, and to confirm the
madvise
call is absolutely critical for safety, it cannot be removed and still have a secure sandbox for wasm.
Duslia commented on issue #4637:
OK. I will think about how to solve it. Thanks a lot!!
ihciah commented on issue #4637:
I'm wondering if the following 2 solutions may be accepted?
- Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.
- Maintain 2 index groups, one is dirty, the other one is clean. On drop, the index is put into dirty group and try to merge with others, and won't be madvised instantly. On index allocation, try to get from the clean one, and if there's no available in it, try to obtain a continuous range from dirty group and call madvise, then return one of it back and put all the index left inside the range into the clean group. In this solution, we can call less madvise by clean stacks in batch.
bjorn3 commented on issue #4637:
Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.
As I understand it the IPI is the bottlwneck, not the syscal overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly.
bjorn3 edited a comment on issue #4637:
Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.
As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.
ihciah commented on issue #4637:
Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.
As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.
Thanks for the explain! I also checked the kernel code and found in func
io_madvise
of io_uring.c, it just calldo_madvise
, the lock will be acquired insidedo_madvise
, which means there's no aggregation with it.
To do madvise aggregation without modifying kernel, what about the solution 2? I implemented a demo here(not finished yet): https://github.com/ihciah/wasmtime/commit/071f5d9978d94ee669d72c90bf2d6f2728324303
bjorn3 commented on issue #4637:
I might be misunderstanding, but isn't that what the pooling allocator does?
ihciah commented on issue #4637:
I might be misunderstanding, but isn't that what the pooling allocator does?
The pool inside the wasmtime allocates a big memory one time in a single call, and manages it by maintaining
index
of memory pages manually. Each time the memory page(memory_pages, table_pages, stack_pages) dropped, it will clear the memory content by calling decommit->madvise, and mark theindex
available for the later allocation. And here is the problem.
What I want to impl is to make the indices to recycle continues and clear the memory in batch. I think it is a kind aggregation in userspace.
(I started reading wasmtime code just 2 from days ago. Maybe there's something I misunderstand..)
alexcrichton commented on issue #4637:
Call madvise with io_uring
While we haven't directly experimented with this I think as you've discovered we took a look and concluded it probably wouldn't help.
Maintain 2 index groups, one is dirty, the other one is clean.
This we have actually experimented with historically. Our conclusion was that it did not actually help all that much. We found that the cost of an
madvise
scaled with the size of the region being advised, which meant that when we batched everything up it took effectively just as long as doing everything individually. We had various hypotheses for why this was the case but we didn't bottom it out 100% at the time.Our testing with
madvisev
is mostly what led us to stop investigating this issue becausemadvisev
so clearly fixed the issue relative to all other workarounds.
ihciah commented on issue #4637:
We found that the cost of an madvise scaled with the size of the region being advised
Thank you for sharing past experiments.
But it still confuse me that advising multiple regions with
madvisev
is better thanmadvise
with a merged region. If this is true, does this means we can impl a bettermadvise
in kernel by simply split the region and do it one by one?
pchickey commented on issue #4637:
You can't merge the regions to madvise - they're not contiguous.
ihciah commented on issue #4637:
You can't merge the regions to madvise - they're not contiguous.
I mean maybe we can make the indices as contiguous as possible and then we can do madvise on the biggest contiguous region when alloc. I don't know if this will work.
alexcrichton commented on issue #4637:
IIRC we were just as confused as you are. I remember though reorganizing the pooling allocator to put the table/memory/instance allocations all next to each other so each instance's index identified one large contiguous region of memory. In the benchmarks for this repository parallel instantiation actually got worse than the current performance on
main
. My guess at the time was that it had to do with the very large size of themadvise
sincemadvise
got more expensive as the region got larger.As for kernel improvements that's never something we've looked into beyond
madvisev
ihciah commented on issue #4637:
I finished my experiment and result shows the same conclusion.
What I did: clear stack pool in batch(https://github.com/ihciah/wasmtime/tree/main)
Result: running return 1 demo(on a KVM vm)
- Original 1 core: 117 kQPS
- Original 3 cores: 66 kQPS * 3
- Exp 1 core: 133 kQPS
- Exp 3 cores: 52kQPS * 3
Under 1 core, the modified version has slightly better performance(the madvise batch is 1000 times original); but under 3 cores it becomes worse.
My guess at the time was that it had to do with the very large size of the madvise since madvise got more expensive as the region got larger.
Also, I changed stack pool size 1000 to bigger values,
size=2048
=> 20ms andsize=4096
=> 45ms, which matches your guess.
I haven't looked closely at themadvisev
impl yet. Comparing to callingmadvise
, what does it saves? It still need the same IPI in theory.
alexcrichton commented on issue #4637:
Always good to have independent confirmation, thanks for doing that!
As for why
madvisev
improves the situation, I believe this happens because each call tomadvise
issues an IPI. Withmadvisev
only one IPI is necessary still. This means that instead of N calls tomadvisev
and N IPIs there's only one call tomadvisev
with only one IPI. From the kernel's perspective all the calling process needs is to know that by the time the syscall returns that the address space is as-expected in all threads, but during execution of the syscall there's a lot more flexibility about precisely what all threads see.
ihciah commented on issue #4637:
Thanks for the explanation. But I guess IPI cost(or other bottleneck) is not linear with syscall times but the memory ranges since doing madvise with twice bigger memory a little bit cost more than twice longer time.
It seems there's no easy way to save the madvise cost and avoid its bottleneck for general usage...
Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary.
Is it acceptable to add an option to disable madvise(madvise can be still enabled by default, and the option can only be touched when some features like "dangerous" enabled)?
bjorn3 commented on issue #4637:
Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary.
In that case you could reuse the
Instance
instead of recreate it every time, right? Not callingmadvise
is effectively equivalent to reusing theInstance
while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing. That prevents everything getting out of sync and most observable state isn't reset anyway if you don't domadvise
.
ihciah commented on issue #4637:
Not calling madvise is effectively equivalent to reusing the Instance while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing.
I see. There are 3 callers for
decommit
, thedecommit_memory_pages
anddecommit_table_pages
are with theinstance
lifecycle. So we may not affect them.
The hot path is the third callerdecommit_stack_pages
. Every call forfunc.call_async
will do it. So we may only disabledecommit_stack_pages
optionally.
alexcrichton commented on issue #4637:
Using
madvise
to clear linear memory and tables is required for correctness and we can't add a knob to turn that off, but I think it would be reasonable to add a knob for the stacks to avoidmadvise
since that's purely a defense-in-depth thing and is not required for correctness.
alexcrichton commented on issue #4637:
The
*_keep_resident
options added recently should additionally help by reducing page faults and therefore contention on kernel-level locks. There's also work towards optimizing bounds-checking-mode in Cranelift to avoid pagefaults/mprotect even more to reduce contention even further.Otherwise though this is a known issue and isn't something where there's any easy wins on the table we know of that haven't been applied yet. PRs and more discussion around this are of course always welcome but I'm going to go ahead and close this since there's not a whole lot more that can be done at this time.
alexcrichton closed issue #4637:
When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8wThis is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function return 1.
package main //export echo func echo(ctxLen, reqLen int32) int32 { //nolint return 1 } func main() { }
Last updated: Nov 22 2024 at 16:03 UTC