I'm evaluating various wasm and wasi features for my project, and one approach I'm exploring to lower memory usage is threads. I've tried the wasi-threads
feature with the wasm32-wasip1-threads
target in Rust, but the memory usage of a module was pretty much the same as running multiple modules. Which of course is not surprising once you look at the implementation, as it starts more WASM instances in Rust threads.
After discovering that I started looking into Core WebAssembly threads, but I'm not really sure how to run them in a WASI context :thinking: When I compile for wasm32-wasip1
and run with wasmtime run -W threads
I get an "operation not supported on this platform" error, and when I compile for wasm32-wasip1-threads
, it obviously compiles the threaded code to use wasi-threads.
I think what I need is a way to compile using the wasm32-wasip1
target but somehow tell Rust to compile threads for Core WebAssembly, and I can't find any way to do that. I know wasm_bindgen
can do that, but that's only for the JS environment.
Piotr Sarnacki said:
When I compile on
wasm32-wasip1
and run with wasmtime run -W threads
I get an error on unsupported feature: "operation not supported on this platform"
What you're seeing here is that the Rust standard library provides std::thread::spawn
but it unconditionally returns an error on this target, as threads aren't supported there.
Piotr Sarnacki said:
when I compile on
wasm32-wasip1-threads
, it obviously compiles the threaded code to use wasi-threads.
Right yeah, this will require -S threads
(or --wasi threads
) on the CLI to execute these binaries.
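For reference, the build-and-run recipe looks roughly like this (a sketch assuming a standard Rust/Wasmtime setup; the module name is a placeholder):

```shell
# Add the threads-enabled WASI target if it isn't installed yet.
rustup target add wasm32-wasip1-threads

# Build the module against the threads-enabled WASI target.
cargo build --release --target wasm32-wasip1-threads

# Run with wasi-threads support enabled in the Wasmtime CLI
# (-S threads is shorthand for --wasi threads).
wasmtime run -S threads target/wasm32-wasip1-threads/release/my_module.wasm
```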
Piotr Sarnacki said:
I think what I need is a way to compile using the
wasm32-wasip1
target but somehow tell Rust to compile threads for Core WebAssembly, and I can't find any way to do that. I know wasm_bindgen
can do that, but that's only for the JS environment.
The story of threads and wasm is unfortunately relatively fraught and complicated. The tl;dr is that instance-per-thread is the only viable model that can be used as of today to implement threads. This model is more-or-less unsuitable for standard libraries to integrate with because it requires a lot of host "glue", be it JS code or Rust embedding code. The wasi-threads
proposal is a codification of instance-per-thread for out-of-browser use cases, but it's a work-in-progress API that is not on track to be standardized or finished. There are known issues with it (e.g. no way to handle thread spawn failure, no TLS destructors, no way to limit guest code in threads, etc.) and no one is working on them.
Put another way, your goal here of getting wasm32-wasip1
working with threads is not possible today. That can only be done with wasm32-wasip1-threads
and comes with the hidden gotchas above, too. Looking to the future, the shared-everything-threads proposal is intended to improve the situation here, but that's still quite a ways off.
Thank you a lot for summarizing the situation, although it is a bit of a bummer for sure. I'll definitely read about shared-everything-threads to see what the future may bring, even if it's not coming out anytime soon.
and fwiw, it is often (although not always) possible to architect your program and its usage of wasm to lift the parallelism up out of wasm, and instead of having a program that instantiates a wasm instance that does threads stuff, have your program spawn threads and each thread instantiates wasm instances that don't share state but instead get subproblems to work on or whatever
probably not the answer you were looking for, but maybe it is helpful nonetheless
parallelism/concurrency is not the problem I'm trying to solve. The problem I'm having is that I want to be able to spin up a big number of WASM instances - ideally in the range of hundreds of thousands. This is obviously hard, as even spawning that number of OS threads becomes problematic, let alone doing that with WASM instances inside. That's why I was hoping for some form of green threads in WASM libraries, so I could use N:M models - for example 5k instances (i.e. OS threads) with 50-100 threads per instance.
For Rust, one way to solve this could maybe be introducing async inside WASM, but that's not ideal, because I would prefer the code compiled to WASM to be synchronous (as I'm doing async calls on the host anyway), and it also doesn't help in any way for other languages like JavaScript.
We've historically talked about having a wasi-sdk mode of support that implements the pthread API in a green-threads fashion. That would mean that wasm would think it has threads, for example, but they would actually be green threads. Sort of like M:N, but without that assumption baked in everywhere. This work hasn't been done yet, though.
If you're curious about per-instance memory usage if you've got an example benchmark/repo we could also help confirm that it's got the right knobs turned in Wasmtime to optimize memory.
One thing to be aware of is that in Wasmtime, by default, each wasm instance typically has a single linear memory which requires a 6GB virtual memory reservation. That reservation in turn ends up being the limiting factor, so you probably can't create more than 64k instances (and maybe not even that many). There are ways to tweak this, though they can have various gotchas and performance impacts.
I want to be able to spin a big number of WASM instances - ideally in the range of hundreds of thousands.
this shouldn't require threads in wasm, nor a green-threading pthreads polyfill, at all: you can multiplex M wasm instances on N OS threads today with wasmtime's async support, using tokio or another async runtime in the host
your bottleneck here is going to be memory (both physical and virtual) required for all the different wasm linear memories. if virtual memory is a limiting factor, then you can use "dynamic" memories instead of "static" memories, which means slower memory accesses (requires explicit bounds checks instead of virtual memory tricks to elide them) but doesn't reserve large virtual memory guard pages.
and since you were considering threads and shared memories, you could do the same memory-sharing but without threading: have your wasm instances import memory, then define one memory to be shared between and imported by n different wasm instances.

this is a bit tricky from a toolchain perspective, however, as you'll need to ensure that each instance uses its own shadow stack and such. basically the same things that threads have to do, but without actually requiring synchronization. this does mean each wasm instance has to trust all the other instances sharing the memory, since any other instance could maliciously or buggily mess with their portion of memory, but that is the same with the threaded version as well.

but yeah, this variant is a bit more pie-in-the-sky, and I personally wouldn't pursue it until after exhausting the basic version without shared memories. wasm itself has everything needed for this variant, but getting toolchains to produce code that works well in this situation will likely be a battle, as I mentioned
oh also, you'll def want to be using the pooling allocator in this situation:
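A rough sketch of what that configuration might look like with the wasmtime crate - method names have shifted across Wasmtime versions, so treat the exact calls as assumptions to check against the current docs:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn build_engine() -> wasmtime::Result<Engine> {
    // Pooling allocator: preallocates slots for instances, memories, and
    // tables up front, making instantiation much cheaper at high counts.
    let mut pooling = PoolingAllocationConfig::default();
    pooling.total_core_instances(100_000);
    pooling.total_memories(100_000);
    pooling.total_tables(100_000);

    let mut config = Config::new();
    // Needed to multiplex many instances over an async runtime like tokio.
    config.async_support(true);
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));

    // Optionally shrink the per-memory virtual reservation to trade bounds-
    // check performance for instance density; the exact knob name depends
    // on the Wasmtime version, so check the Config docs.
    // config.static_memory_maximum_size(0);

    Engine::new(&config)
}
```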
ooooh, that's definitely encouraging! I somehow thought that wasmtime's async support was just an async API for creating instances in blocking threads, but I've run a simple experiment and indeed it's fully async. That's very good to know; I'm not sure where the notion that it needs threads underneath came from. Also thanks for all the pointers, I'll have a look, and I definitely need to learn more about WASM memory!
@Piotr Sarnacki in addition to fitzgen's top-level guidance (strong +1 to "use the async pooling allocator on top of {tokio, ...} runtime" from me too) it might be interesting to us to learn a bit more about your use-case; then we could find ways to fit it best and/or think of optimizations. Are you building an instance-per-request server of some sort (instantiate a Wasm module for every inbound http request, or RPC, or packet, or ...)? Or something else? I ask because that particular path is well-trodden, and is how Wasmtime is used server-side today by most of its major users, but if there's something else, that'd be interesting to know too.
@Chris Fallin I work on a distributed stress testing tool where scenarios are compiled to WASM (repo on GitHub). It's a proof of concept at the moment, but I'm trying to figure out the missing pieces to make it production ready. Hundreds of thousands of tasks is probably not a typical scenario for a stress testing tool (also because of network limitations, i.e. you can open only ~65k outgoing connections per IP/port combination without doing some trickery with virtual network interfaces and whatnot), but it is sometimes useful to be able to do that, for example when testing how many connections a server can handle concurrently before it breaks.
So far the major limitation I found is memory usage. Both in a sense that Alex mentioned (ie. virtual memory per instance) and with the size of runtimes for interpreted languages like JS (I opened another discussion about reusing an instance with an interpreter here).
In order to figure out what people can expect, I usually test against k6, which is a great tool and widely used in this area. I know it will be very hard to match with WASM, as k6 uses a JS interpreter written in Go and thus mostly runs code asynchronously, but I would like to be at least within an order of magnitude on memory usage and ideally at least as fast. My recent rough tests indicate 6 GB per 100k tasks in k6, which is on par with WASM instances for Rust modules, but obviously it's way higher for any kind of JS interpreter (i.e. even with QuickJS I'm looking at 40 GB per 100k tasks). If I can't do better, I might go the route of supporting mostly Rust, but support for interpreted languages would be very desirable for adoption.
Makes sense -- if this is all one trust domain, I would suggest looking into instance reuse as you mention, as you'll likely get significant efficiency improvements that way. At the very least, architecting the host/Wasm interface with reuse in mind (e.g. three entry points: init/run/shutdown or somesuch) allows for it; then you can always set the reuse count to 1 if you want better isolation/determinism/etc.
Copy-on-write instantiation with the pooling allocator might help with actual memory usage too: even with a multi-megabyte heap, if you have a snapshot from Wizer that isn't doing much more than a few hundred lines of execution once it wakes up (and isn't allocating huge amounts of data), most of the heap will hopefully be readonly shared mappings to the underlying initial heap image. Worth measuring anyway
(Also worth noting that those two things can interact: if you reuse instances, their CoW overlay size will grow over time, especially if you have a guest GC'd language with a GC tuned for speed over compactness. Worth playing with both sides and also measuring a manual GC between requests vs. shutdown / instantiate new)
Thanks for all the tips @Chris Fallin! I'll be experimenting with this in the near future, hopefully over the weekend. I'll probably start with shared memory, and as soon as I have that ready I'll move on to figuring out more about dynamic linking.
also when you measure/compare memory overhead, make sure to distinguish between physical memory actually used versus virtual memory pages reserved but not actually backed by any physical memory
Last updated: Dec 23 2024 at 14:03 UTC