wasmtime / issue #4637 Wasmtime horizontal scaling result... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4637 Wasmtime horizontal scaling result...

Wasmtime GitHub notifications bot (Aug 08 2022 at 10:41):

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |

This is my code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 10:42):

Duslia edited issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 12:28):

Duslia edited issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is close. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 12:37):

Duslia edited issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is almost the same. This is my performance data:
| thread and task | data |
| --- | ----------- |
| thread 1, task 1 | 11w/task |
| thread 3, task 3 | 4w/task |
| thread 8, task 8 | 1w/task |

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 12:48):

Duslia edited issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, The total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8w

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 12:50):

Duslia edited issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8w

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Wasmtime GitHub notifications bot (Aug 08 2022 at 15:38):

alexcrichton commented on issue #4637:

Thanks for the report! Could you share the Wasmtime version you're using as well as the specific *.wasm file that you're benchmarking?

Wasmtime GitHub notifications bot (Aug 08 2022 at 15:52):

alexcrichton commented on issue #4637:

That being said what you're almost certainly running into is limits on concurrently running mmap in multiple threads within one process. Allocation of fiber stacks for asynchronously executing WebAsembly primarily goes through mmap right now and unmapping a stack requires broadcasting to other threads with an IPI which slows down concurrent execution.

One improvement that you can do to your benchmark is to use the pooling allocator instead of the on-demand allocator. Otherwise though improvements need to come from the kernel itself (assuming you're running Linux, which would also be great to specify in the issue description). Historically the kernel has prototyped a madvisev syscall which allows batch unmappings that Wasmtime could use with the pooling allocator. We've tested this historically and it's shown to greatly increase the throughput in situations like this. This isn't landed in Linux, however.

Finally it depends on your use case of whether this comes up much in practice. Typically the actual work performed by a wasm module dwarfs the cost of the memory mappings since the wasm modules live for a longer time than "return 1". If that's the case then this microbenchmark isn't necessarily indicative of performance and it's recommended to use a more realistic workload for your use case. If this sort of short-running computation is common for your embedding, however, then multithreaded execution will currently have the impact that you're measuring here.

Wasmtime GitHub notifications bot (Aug 09 2022 at 07:34):

Duslia commented on issue #4637:

Thanks!! I got it.

Wasmtime GitHub notifications bot (Aug 09 2022 at 12:55):

Duslia commented on issue #4637:

I'm trying to comment out these two lines and it can scale horizontally with a high performance.
<img width="855" alt="image" src="https://user-images.githubusercontent.com/37136584/183650680-efd55dcf-b8ad-4748-ae9d-3382b96caa74.png">
Is the function of these lines of code to prevent the previous data from being obtained in the next execution?

Wasmtime GitHub notifications bot (Aug 09 2022 at 12:56):

Duslia edited a comment on issue #4637:

I'm trying to comment out these two lines and it can scale horizontally with a high performance.
<img width="855" alt="image" src="https://user-images.githubusercontent.com/37136584/183650680-efd55dcf-b8ad-4748-ae9d-3382b96caa74.png">
I 'd like to ask that is the function of these lines of code to prevent the previous data from being obtained in the next execution?

Wasmtime GitHub notifications bot (Aug 09 2022 at 13:08):

bjorn3 commented on issue #4637:

Yes, without that call the old data would be preserved when reusing a memory region as linear memory for the wasm module.

Wasmtime GitHub notifications bot (Aug 09 2022 at 14:32):

alexcrichton commented on issue #4637:

Ah ok that's a good confirmation that madvise and IPIs are indeed likely your scaling bottleneck. In that case there's unfortunately not much that can be done right now that Wasmtime can do. This is some (old) discussion about a possible madvisev syscall which would largely fix this scaling bottleneck within Wasmtime. This was confirmed with a modified kernel with that patch and developing a version of Wasmtime that uses madvisev.

In the meantime though this is something that basically just needs to be taken into account when capacity planning with Wasmtime. If you're willing assistance in getting a patch such as that into the Linux kernel (or investigating other routes of handling this bottleneck) would be greatly appreciated!

Wasmtime GitHub notifications bot (Aug 09 2022 at 14:33):

alexcrichton commented on issue #4637:

Oh sorry, and to confirm the madvise call is absolutely critical for safety, it cannot be removed and still have a secure sandbox for wasm.

Wasmtime GitHub notifications bot (Aug 09 2022 at 15:23):

Duslia commented on issue #4637:

OK. I will think about how to solve it. Thanks a lot!!

Wasmtime GitHub notifications bot (Aug 10 2022 at 02:40):

ihciah commented on issue #4637:

I'm wondering if the following 2 solutions may be accepted?

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

Maintain 2 index groups, one is dirty, the other one is clean. On drop, the index is put into dirty group and try to merge with others, and won't be madvised instantly. On index allocation, try to get from the clean one, and if there's no available in it, try to obtain a continuous range from dirty group and call madvise, then return one of it back and put all the index left inside the range into the clean group. In this solution, we can call less madvise by clean stacks in batch.

Wasmtime GitHub notifications bot (Aug 10 2022 at 06:51):

bjorn3 commented on issue #4637:

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

As I understand it the IPI is the bottlwneck, not the syscal overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly.

Wasmtime GitHub notifications bot (Aug 10 2022 at 06:51):

bjorn3 edited a comment on issue #4637:

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.

Wasmtime GitHub notifications bot (Aug 10 2022 at 07:43):

ihciah commented on issue #4637:

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.

Thanks for the explain! I also checked the kernel code and found in func io_madvise of io_uring.c, it just call do_madvise, the lock will be acquired inside do_madvise, which means there's no aggregation with it.
To do madvise aggregation without modifying kernel, what about the solution 2? I implemented a demo here(not finished yet): https://github.com/ihciah/wasmtime/commit/071f5d9978d94ee669d72c90bf2d6f2728324303

Wasmtime GitHub notifications bot (Aug 10 2022 at 07:55):

bjorn3 commented on issue #4637:

I might be misunderstanding, but isn't that what the pooling allocator does?

Wasmtime GitHub notifications bot (Aug 10 2022 at 08:09):

ihciah commented on issue #4637:

I might be misunderstanding, but isn't that what the pooling allocator does?

The pool inside the wasmtime allocates a big memory one time in a single call, and manages it by maintaining index of memory pages manually. Each time the memory page(memory_pages, table_pages, stack_pages) dropped, it will clear the memory content by calling decommit->madvise, and mark the index available for the later allocation. And here is the problem.
What I want to impl is to make the indices to recycle continues and clear the memory in batch. I think it is a kind aggregation in userspace.
(I started reading wasmtime code just 2 from days ago. Maybe there's something I misunderstand..)

Wasmtime GitHub notifications bot (Aug 10 2022 at 14:58):

alexcrichton commented on issue #4637:

Call madvise with io_uring

While we haven't directly experimented with this I think as you've discovered we took a look and concluded it probably wouldn't help.

Maintain 2 index groups, one is dirty, the other one is clean.

This we have actually experimented with historically. Our conclusion was that it did not actually help all that much. We found that the cost of an madvise scaled with the size of the region being advised, which meant that when we batched everything up it took effectively just as long as doing everything individually. We had various hypotheses for why this was the case but we didn't bottom it out 100% at the time.

Our testing with madvisev is mostly what led us to stop investigating this issue because madvisev so clearly fixed the issue relative to all other workarounds.

Wasmtime GitHub notifications bot (Aug 10 2022 at 17:23):

ihciah commented on issue #4637:

We found that the cost of an madvise scaled with the size of the region being advised

Thank you for sharing past experiments.

But it still confuse me that advising multiple regions with madvisev is better than madvise with a merged region. If this is true, does this means we can impl a better madvise in kernel by simply split the region and do it one by one?

Wasmtime GitHub notifications bot (Aug 10 2022 at 17:36):

pchickey commented on issue #4637:

You can't merge the regions to madvise - they're not contiguous.

Wasmtime GitHub notifications bot (Aug 10 2022 at 17:56):

ihciah commented on issue #4637:

You can't merge the regions to madvise - they're not contiguous.

I mean maybe we can make the indices as contiguous as possible and then we can do madvise on the biggest contiguous region when alloc. I don't know if this will work.

Wasmtime GitHub notifications bot (Aug 10 2022 at 18:29):

alexcrichton commented on issue #4637:

IIRC we were just as confused as you are. I remember though reorganizing the pooling allocator to put the table/memory/instance allocations all next to each other so each instance's index identified one large contiguous region of memory. In the benchmarks for this repository parallel instantiation actually got worse than the current performance on main. My guess at the time was that it had to do with the very large size of the madvise since madvise got more expensive as the region got larger.

As for kernel improvements that's never something we've looked into beyond madvisev

Wasmtime GitHub notifications bot (Aug 12 2022 at 03:42):

ihciah commented on issue #4637:

I finished my experiment and result shows the same conclusion.
What I did: clear stack pool in batch(https://github.com/ihciah/wasmtime/tree/main)
Result: running return 1 demo(on a KVM vm)

Original 1 core: 117 kQPS

Original 3 cores: 66 kQPS * 3

Exp 1 core: 133 kQPS

Exp 3 cores: 52kQPS * 3

Under 1 core, the modified version has slightly better performance(the madvise batch is 1000 times original); but under 3 cores it becomes worse.

My guess at the time was that it had to do with the very large size of the madvise since madvise got more expensive as the region got larger.

Also, I changed stack pool size 1000 to bigger values, size=2048 => 20ms and size=4096 => 45ms, which matches your guess.
I haven't looked closely at the madvisev impl yet. Comparing to calling madvise, what does it saves? It still need the same IPI in theory.

Wasmtime GitHub notifications bot (Aug 15 2022 at 14:26):

alexcrichton commented on issue #4637:

Always good to have independent confirmation, thanks for doing that!

As for why madvisev improves the situation, I believe this happens because each call to madvise issues an IPI. With madvisev only one IPI is necessary still. This means that instead of N calls to madvisev and N IPIs there's only one call to madvisev with only one IPI. From the kernel's perspective all the calling process needs is to know that by the time the syscall returns that the address space is as-expected in all threads, but during execution of the syscall there's a lot more flexibility about precisely what all threads see.

Wasmtime GitHub notifications bot (Aug 19 2022 at 07:40):

ihciah commented on issue #4637:

Thanks for the explanation. But I guess IPI cost(or other bottleneck) is not linear with syscall times but the memory ranges since doing madvise with twice bigger memory a little bit cost more than twice longer time.

It seems there's no easy way to save the madvise cost and avoid its bottleneck for general usage...
Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary.
Is it acceptable to add an option to disable madvise(madvise can be still enabled by default, and the option can only be touched when some features like "dangerous" enabled)?

Wasmtime GitHub notifications bot (Aug 19 2022 at 07:52):

bjorn3 commented on issue #4637:

Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary.

In that case you could reuse the Instance instead of recreate it every time, right? Not calling madvise is effectively equivalent to reusing the Instance while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing. That prevents everything getting out of sync and most observable state isn't reset anyway if you don't do madvise.

Wasmtime GitHub notifications bot (Aug 19 2022 at 08:12):

ihciah commented on issue #4637:

Not calling madvise is effectively equivalent to reusing the Instance while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing.

I see. There are 3 callers for decommit, the decommit_memory_pages and decommit_table_pages are with the instance lifecycle. So we may not affect them.
The hot path is the third caller decommit_stack_pages. Every call for func.call_async will do it. So we may only disable decommit_stack_pages optionally.

Wasmtime GitHub notifications bot (Aug 19 2022 at 19:49):

alexcrichton commented on issue #4637:

Using madvise to clear linear memory and tables is required for correctness and we can't add a knob to turn that off, but I think it would be reasonable to add a knob for the stacks to avoid madvise since that's purely a defense-in-depth thing and is not required for correctness.

Wasmtime GitHub notifications bot (Dec 02 2022 at 00:04):

alexcrichton commented on issue #4637:

The *_keep_resident options added recently should additionally help by reducing page faults and therefore contention on kernel-level locks. There's also work towards optimizing bounds-checking-mode in Cranelift to avoid pagefaults/mprotect even more to reduce contention even further.

Otherwise though this is a known issue and isn't something where there's any easy wins on the table we know of that haven't been applied yet. PRs and more discussion around this are of course always welcome but I'm going to go ahead and close this since there's not a whole lot more that can be done at this time.

Wasmtime GitHub notifications bot (Dec 02 2022 at 00:04):

alexcrichton closed issue #4637:

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of execute num is almost the same. This is my performance data:
| thread and task | data | total execute nums |
| --- | ----------- | -------- |
| thread 1, task 1 | 11w/task | 11w
| thread 3, task 3 | 4w/task | 12w
| thread 8, task 8 | 1w/task | 8w

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.
package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
    return 1
}

func main() {
}

Last updated: Apr 18 2025 at 10:03 UTC