thomastaylor312 added the bug label to Issue #8034.
thomastaylor312 opened issue #8034:
Test Case
This is the wasm file, zipped up in order to upload to GH. It is from https://github.com/sunfishcode/hello-wasi-http.git
hello_wasi_http.wasm.zip
Steps to Reproduce
Try these steps on a Linux machine and on a macOS machine (preferably of roughly the same size):
- Run the component with wasmtime serve (no additional flags)
- Run hey -z 10s -c 100 http://localhost:8080/
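For convenience, a minimal end-to-end session (assuming the component from the attached zip has been extracted as hello_wasi_http.wasm in the current directory) looks like:
# terminal 1: serve the component with default settings (listens on 0.0.0.0:8080)
wasmtime serve hello_wasi_http.wasm
# terminal 2: drive load for 10 seconds with 100 concurrent connections
hey -z 10s -c 100 http://localhost:8080/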
Expected Results
I expect the number of requests/second to be the same or greater on Linux than on the Mac.
Actual Results
On my Mac (details on OS below), which was running a bunch of other applications, I get around 20k req/s.
On Linux (details on OS below), I get around 4.3k req/s.
Versions and Environment
Wasmtime version or commit: 18.0.2
Mac
Operating system: Sonoma 14.3.1
Architecture: M1 Max (2 performance cores, 8 normal cores) and 64 GB of memory
Linux
Operating system: Debian Bookworm (6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux)
Architecture: AMD64 (16 cores) and 64 GB of memory
This was run on a cloud VM, but I also tested this on an Ubuntu 20.04 amd64 server running at my house with similar performance.
Extra Info
On the Linux server, I did double-check that my file descriptor limit had been raised, and also observed that the wasmtime processes were entering an uninterruptible sleep state almost constantly through the whole test (which could mean nothing). Also, I did a similar test with wasmCloud and Spin, which both use wasmtime, and saw a similar drop in numbers between Mac and Linux. For reference, I also did some smoke tests with normal server traffic (I did a test with Caddy and with NATS) and all of them easily reached the 100k+ req/s range. So this definitely seems like something on the wasmtime side.
I did see #4637 and that does explain some of the horizontal scaling issues, but I didn't expect such a drastic difference between Mac and Linux
fitzgen commented on issue #8034:
First off: are you enabling the pooling allocator? E.g. -O pooling-allocator in the CLI. Enabling or disabling the pooling allocator is going to greatly affect requests/second.
So will disabling virtual memory-based bounds checks and replacing them with explicit bounds checks (-O static-memory-maximum-size=0), which should increase requests/second for short-lived Wasm but will slow down Wasm execution by ~1.5x.
> I expect the number of requests/second to be the same or greater on Linux than on the Mac
I don't think we can make hard guarantees about this unless you disable virtual memory-based bounds checks completely, because the performance bottleneck for concurrent Wasm guests is the kernel's virtual memory subsystem. Even with the pooling allocator, we are bottlenecked on madvise kinds of things and their associated IPIs. Without the pooling allocator, you're essentially benchmarking concurrent mmap.
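As a rough sketch of the configurations being compared here (assuming the hello_wasi_http.wasm component from the issue is in the current directory), the invocations would look something like:
# default (on-demand) allocator
wasmtime serve hello_wasi_http.wasm
# pooling allocator
wasmtime serve -O pooling-allocator hello_wasi_http.wasm
# pooling allocator + explicit bounds checks
wasmtime serve -O pooling-allocator,static-memory-maximum-size=0 hello_wasi_http.wasm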
fitzgen commented on issue #8034:
For example, here are the results I get on my ~6-year-old ThinkPad (4 cores / 8 hyperthreads) running Linux:
No pooling allocator: 5983.7146 requests/second
Pooling allocator: 34980.6398 requests/second
Pooling allocator + explicit bounds checks: 35368.5013 requests/second
(I'd expect the delta between the second and third configuration to be even greater on machines with more cores)
fitzgen commented on issue #8034:
Ah, you also have to pass -O memory-init-cow=n to get rid of all the virtual memory interaction here. Once I do that I get the following results:
No pooling allocator: 5983.7146 requests/second
Pooling allocator: 34980.6398 requests/second
Pooling allocator + explicit bounds checks: 35368.5013 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 45451.2630 requests/second
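That fourth configuration combines the flags already shown in this thread; it would be invoked along the lines of:
# pooling allocator + explicit bounds checks + no memory CoW
wasmtime serve -O pooling-allocator,static-memory-maximum-size=0,memory-init-cow=n hello_wasi_http.wasm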
cfallin commented on issue #8034:
@thomastaylor312 could you tell us more about your hardware on the Linux side? The reason I ask is that "cloud VM" is pretty vague -- it could be an older microarchitecture, or with oversubscribed cores, or something else. Without more specific details, I'm not sure why it's a given that RPS should be higher on Linux on hardware A vs. macOS on hardware B.
Also a possible experiment: are you able to run a Linux VM on your M1 Max hardware (even better, native Linux such as Asahi, but a VM in e.g. UTM is fine too), and test Wasmtime there? That would tell us a lot about how the raw CPU power actually compares.
alexcrichton commented on issue #8034:
I apologize if this is a bit of piling on at this point, but I wanted to comment the same as @fitzgen: this is probably the -O pooling-allocator vs. not. Locally the difference I see is:
- wasmtime serve - 5.5k rps
- wasmtime serve -O pooling-allocator - 236k rps
This is perhaps an argument that we should turn on the pooling allocator by default for the wasmtime serve command!
Also, as mentioned in https://github.com/bytecodealliance/wasmtime/issues/4637, there are various knobs to help with the overhead of virtual memory here. They're not always applicable in all cases; for example
wasmtime serve -O pooling-allocator,memory-init-cow=n,static-memory-maximum-size=0,pooling-memory-keep-resident=$((10<<20)),pooling-table-keep-resident=$((10<<20))
yields 193k rps for me locally. There are zero interactions with virtual memory in the steady state, but wasmtime spends 70% of its time in memset resetting memory between HTTP requests.
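One way to see where that time goes on a given machine (a hypothetical profiling session using standard Linux perf tooling, not something from this thread) is to sample the server process while hey is applying load:
# attach to the running server for 10 seconds and record call stacks
perf record -g -p "$(pgrep -f 'wasmtime serve')" -- sleep 10
perf report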
alexcrichton commented on issue #8034:
(also thank you for such a detailed report!)
thomastaylor312 commented on issue #8034:
@cfallin For sure! This is a GCP machine running on an AMD Rome series processor. The exact instance size is an n2d-standard-16. Also, to call out again, I did try this on a local Linux server running a slightly older Intel processor with similar effects. Working on trying out some of the options suggested and will report back soon.
thomastaylor312 edited a comment on issue #8034:
@cfallin For sure! This is a GCP machine running on an AMD Rome series processor. The exact instance size is an n2d-standard-16. Also, to call out again, I did try this on a local Linux server running a slightly older Intel processor with similar results. Working on trying out some of the options suggested and will report back soon.
thomastaylor312 commented on issue #8034:
Here are my numbers:
No pooling allocator: 4714.3026 requests/second
Pooling allocator: 42696.3823 requests/second
Pooling allocator + explicit bounds checks: 42603.7692 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 55162.2862 requests/second
So that seems to line up with what you were seeing @fitzgen. What I am a little unsure about are the tradeoffs involved here. I think I wouldn't want static-memory-maximum-size=0 since I don't want to slow down execution of longer-lived components. I did already try out the pooling allocator in wasmCloud and saw the benefits, but all of the options were a little confusing as to what the memory footprint would be. I wasn't sure how to set all of those values to use the right amount of memory on any given machine. I was starting to make some guesses but wasn't entirely certain.
Also, are there any tradeoffs around using memory-init-cow=n with the pooling allocator?
cfallin commented on issue #8034:
There are two performance "figures of merit": the instantiation speed and the runtime speed (how fast the Wasm executes once instantiation completes). The first two lines (no pooling and pooling) keep exactly the same generated code, so there's no runtime slowdown; the pooling allocator speedup is pure win on instantiation speed (due to the virtual-memory tricks).
memory-init-cow again has to do with instantiation speed and doesn't alter runtime. One other configuration you haven't run yet, that might be interesting to try, is pooling + no memory CoW (but without explicit bounds checks): that should have fully optimal generated code and no runtime slowdown.
(Reality-is-complicated footnote: there may be some effects in the margins with page-fault latency that do actually affect runtime depending on the virtual memory strategy, but those effects should be much smaller than the explicit vs. static bounds checks.)
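That configuration is just the pooling allocator with CoW disabled; using the flags already mentioned in this thread, it would look something like:
# pooling allocator + no memory CoW, default (virtual-memory) bounds checks
wasmtime serve -O pooling-allocator,memory-init-cow=n hello_wasi_http.wasm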
thomastaylor312 commented on issue #8034:
Yep I actually just tried the pooling + no memory CoW and that looks good. My remaining concern is around how all the different pooling allocator knobs can affect runtime
thomastaylor312 edited a comment on issue #8034:
Yep I actually just tried the pooling + no memory CoW and that looks good. My remaining concern is around how all the different pooling allocator knobs can affect runtime/memory usage.
cfallin commented on issue #8034:
> runtime/memory usage
@thomastaylor312 the best answer is usually "try it and see", since tradeoffs can cross different inflection points depending on your particular workload.
As mentioned above, runtime (that is, execution speed of the Wasm) is unaffected by the pooling allocator. Explicit bounds checks are the only option that modifies the generated code.
Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot. (I don't remember if it retains a "warmed up" slot for a given image to reuse it though, @alexcrichton do you remember?) CoW is more memory-efficient because it can leverage shared read-only mappings of the heap image that aren't modified by the guest.
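A quick way to check the actual impact for a particular workload (generic Linux tooling, not a wasmtime feature) is to compare the resident set size of the serve process under load with each configuration:
# RSS (in KiB) of the running wasmtime process while hey is applying load
ps -o rss,cmd -C wasmtime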
thomastaylor312 closed issue #8034.
thomastaylor312 commented on issue #8034:
Ok, I think between that and the docs that already exist for the pooling allocator, I should be in good enough shape. I'll go ahead and close this, but hopefully this issue can be helpful to others who might be optimizing things. Thanks for the help all!
alexcrichton commented on issue #8034:
I was inspired to summarize some of the bits and bobs here in https://github.com/bytecodealliance/wasmtime/pull/8038 to hopefully help future readers as well.
> that might be interesting to try, is pooling + no memory CoW
> ...
> Yep I actually just tried the pooling + no memory CoW and that looks good
This surprises me! I would not expect disabling CoW to provide much benefit when explicit bounds checks are still enabled. If anything I'd expect it to get a bit slower (like what I measured above).
I say this because even if you disable copy-on-write we still use madvise to clear memory (regardless of whether bounds checks are enabled or not), which involves IPIs that don't scale well. This might be a case of running into something Chris has pointed out in the past: when using CoW, reading a page first faults in a read-only mapping, but then writing to the same page triggers the copy in addition to an IPI to clear the old mapping. This means that CoW, while beneficial for large heap images due to removing startup cost, may be less beneficial over time if pages are read-then-written, causing even more IPIs (which aren't scalable) over time.
> Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot.
I don't think this will actually be the case, because when a slot is deallocated we madvise-reset the entire linear memory, which should release all memory back to the kernel, so with-and-without CoW should have the same RSS for deallocated slots.
Now for allocated slots, as you've pointed out, having 1000 instances with CoW should have less RSS than 1000 instances without CoW, because read-only pages will be shared in the CoW case.