Stream: general

Topic: Wasmtime Performance Troubleshooting


view this post on Zulip Friday More (Mar 26 2025 at 00:58):

Hello,

We have a use case where a module needs to be instantiated per request with large, immutable data for a high-throughput, low-latency server. We use Wizer to snapshot this initialization data into the Wasm module and rely on Wasmtime's CoW semantics for rapid module instance creation. This works great! We see a three-order-of-magnitude performance improvement in our benchmarks when CoW is enabled. However, when we increase the number of threads (each thread creates module instances), we only see minor performance improvements. I am running on a very beefy Linux VM with many cores and plenty of RAM, but the current setup doesn't seem to be able to utilize these resources. I suspect contention on some shared resource is responsible.
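In Rust API terms (our actual code uses the C++ API), the per-request pattern looks roughly like the sketch below; the file name and the empty import list are placeholders for our real setup:

```rust
use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Copy-on-write initialization of linear memory from the compiled module
    // image (the default on Linux), so instantiation mostly just maps pages.
    config.memory_init_cow(true);
    let engine = Engine::new(&config)?;

    // "wizened.wasm" stands in for our Wizer-pre-initialized module; it is
    // compiled once and shared across all request-handling threads.
    let module = Module::from_file(&engine, "wizened.wasm")?;

    // Per request: a fresh Store and a fresh Instance of the shared module.
    let mut store = Store::new(&engine, ());
    let _instance = Instance::new(&mut store, &module, &[])?;
    Ok(())
}
```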

1) Can you think of anything that could be causing the issue?

We are not using the pooling allocator yet (we are using the C++ API, which doesn't expose it) and are wondering whether that could help.

2) Do the pool slots also benefit from CoW? For example, if a module instance takes 1 GB of memory, most of it immutable, can all slots largely point to the same physical memory?

If this should be asked on GitHub instead, I can repost!

Thank you!

view this post on Zulip Chris Fallin (Mar 26 2025 at 01:07):

We are not using the pooling allocator yet (we are using the C++ API, which doesn't expose it) and are wondering whether that could help.

Yes, absolutely, start there. All of our high-performance use-cases use the pooling allocator, and that is what we've optimized, so if you're not doing that, you're only getting "accidental benefits" from the parts of the codebase that are in common with our expected high-performance config.

Especially if it is one common module that is being re-instantiated over and over, the pooling allocator should lead to additional speedups because the same mapping can remain (we just do madvise to clear the CoW overlay, which can avoid taking the mmap lock). There are going to be some sublinear scaling issues as you grow to many cores because of TLB shootdown overheads, but I've seen workloads scale well to around 8 -- 16 cores at least (depending on specifics of the Wasm module, etc).
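For reference, in the Rust API, switching to the pooling allocator is a Config change along these lines. This is a minimal sketch; the PoolingAllocationConfig tuning knobs (slot counts, per-memory size limits) vary across Wasmtime versions, and the defaults are a reasonable starting point:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn pooled_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Keep CoW memory initialization on (the default) so pooled slots can be
    // reset with madvise rather than re-mapped on every instantiation.
    config.memory_init_cow(true);
    // Replace the default on-demand allocator with the pooling allocator,
    // which reuses pre-created instance/memory slots across instantiations.
    let pool = PoolingAllocationConfig::default();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pool));
    Engine::new(&config)
}
```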

Once you enable pooling, if you're still seeing bottlenecks, I'd recommend profiling and showing us the profile -- we can then recommend what else might be an issue, if anything.
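(If you profile with Linux perf, enabling jitdump output lets perf attribute samples to the JIT-compiled Wasm code; record with `perf record -k mono` and post-process with `perf inject --jit`. A minimal sketch, with the caveat that the exact method signature may differ slightly across Wasmtime versions:)

```rust
use wasmtime::{Config, ProfilingStrategy};

fn profiled_config() -> Config {
    let mut config = Config::new();
    // Emit perf jitdump records for JIT-compiled Wasm functions.
    config.profiler(ProfilingStrategy::JitDump);
    config
}
```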

view this post on Zulip Friday More (Mar 26 2025 at 03:01):

Thanks for the quick response! I will proceed with the pooling allocator approach and see how that goes.


Last updated: Apr 07 2025 at 23:03 UTC