Stream: general

Topic: ✔ high performance workers


view this post on Zulip René Rössler (May 13 2024 at 13:11):

Hi,

I'm building a service which pipelines many gigabytes of data from a CouchDB cluster into an Elasticsearch cluster. As the data needs to be transformed and the transformation code comes directly from my clients, I want to use wasm components here. The transform code can load more data from the cluster via a host function. Since loading from the cluster should be batched, I collect all requests for some time and do one big request to the cluster.

At the moment I simulate both clusters as if they have unlimited resources.

I made a PoC which performs OK (about 100 Mbit/s throughput on my laptop) with 1000 instances of the same component, all running on the same engine with async support enabled. I have a central queue (deadqueue::limited) and each instance runs in its own tokio task.
It runs a little bit better if I have more engines and split the 1000 instances over them, but the sweet spot seems to be around 2 engines. (My CPU has 20 threads.)

I searched all over the documentation but can't find any best practices on how to design such a system. In particular, I don't know if having multiple engines is supported at all. If I understood correctly, all instances on one engine each run on their own stack but on the same CPU, so having multiple engines should help here.

On one engine it does not really help to make the component function itself async, as component instances are not Clone, so I had to clone the linker, component and engine to get a new instance for each runner (which runs in its own tokio task). So async here only helps me with the host function itself, or am I missing something?

view this post on Zulip Lann Martin (May 13 2024 at 13:14):

By "engines" are you referring to wasmtime::Engine? There shouldn't be any reason to use multiple engines (assuming they would all have the same wasmtime::Config).

view this post on Zulip Ramon Klass (May 13 2024 at 13:24):

I think you are looking for the instantiate_pre family of functions. It does all the work that can be shared between instances and returns a pre-initialized module that can be instantiated as often as you want with minimal overhead. Each of these instances is separate and can run on a different CPU core.
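A minimal sketch of that setup with the component API (the `HostState` type, function name, and error handling are placeholders, not from this thread):

```rust
use wasmtime::{Config, Engine};
use wasmtime::component::{Component, Linker};

// Placeholder host state for illustration.
struct HostState {}

fn build_instance_pre(
    wasm_bytes: &[u8],
) -> anyhow::Result<(Engine, wasmtime::component::InstancePre<HostState>)> {
    let mut config = Config::new();
    config.async_support(true);
    let engine = Engine::new(&config)?;

    // Compilation happens once, up front.
    let component = Component::new(&engine, wasm_bytes)?;

    let mut linker = Linker::<HostState>::new(&engine);
    // ... register host functions (e.g. the batched cluster loader) on the linker here ...

    // Resolves imports once; the result can be instantiated cheaply many times.
    let instance_pre = linker.instantiate_pre(&component)?;
    Ok((engine, instance_pre))
}
```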

view this post on Zulip René Rössler (May 13 2024 at 13:28):

Yes, multiple wasmtime::Engine. I changed some code and it seems I was wrong in assuming that multiple engines perform better. Thanks for the tip about instantiate_pre!

I still wonder how to call multiple functions concurrently on the same instance (which should run on one core with multiple stacks), as I cannot clone an instance and I need a mutable reference to a Store to call a function on an instance.

view this post on Zulip Lann Martin (May 13 2024 at 13:32):

You cannot run parallel function calls on a single instance or share a single Store across calls today. An InstancePre (returned by instantiate_pre) can be cloned cheaply, and each clone can be instantiated with its own Store.
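A sketch of that pattern, reusing `HostState` and the InstancePre from the sketch above and spawning one tokio task per worker; the exported "transform" function and its string signature are invented for illustration:

```rust
use wasmtime::{Engine, Store};
use wasmtime::component::InstancePre;

// One Store per tokio task, all sharing the same InstancePre.
async fn run_workers(engine: Engine, instance_pre: InstancePre<HostState>) -> anyhow::Result<()> {
    let mut tasks = Vec::new();
    for _ in 0..1000 {
        let engine = engine.clone();              // cheap handle clones
        let instance_pre = instance_pre.clone();  // shares the compiled code
        tasks.push(tokio::spawn(async move {
            // Each task owns its Store; Stores are never shared between tasks.
            let mut store = Store::new(&engine, HostState {});
            let instance = instance_pre.instantiate_async(&mut store).await?;
            let transform = instance
                .get_typed_func::<(String,), (String,)>(&mut store, "transform")?;
            let (output,) = transform
                .call_async(&mut store, ("input".to_string(),))
                .await?;
            transform.post_return_async(&mut store).await?;
            anyhow::Ok(output)
        }));
    }
    for task in tasks {
        task.await??;
    }
    Ok(())
}
```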

view this post on Zulip Ramon Klass (May 13 2024 at 13:32):

Yes, the only reason for multiple Engines is if you have diverging configs; reuse as much as possible if you want performance :)

view this post on Zulip Ramon Klass (May 13 2024 at 13:33):

Each thread can have its own separate instance of your function, but they can't share data, so you have to design your code in a map-reduce way.

view this post on Zulip René Rössler (May 13 2024 at 13:36):

Alright, that's good to know. I read a lot about stack switching in the wasmtime documentation and was really confused about how I'm supposed to use it.

view this post on Zulip Lann Martin (May 13 2024 at 13:41):

Wasmtime's own stack switching (fibers) is an internal detail of its async support. There is, separately, a proposal for stack switching in guest wasm code, which is not yet implemented AFAIK.

view this post on Zulip René Rössler (May 13 2024 at 13:41):

Since I'm already discussing performance with you: as I don't want my tokio executor to lock up for too long, I would need to call engine.increment_epoch maybe every 100ns? There's only one example, which uses 1 second, and that seems a bit much for the tokio executor in my experience.

view this post on Zulip Lann Martin (May 13 2024 at 13:45):

Depends on the workload. Spin, which expects relatively IO-heavy workloads, currently uses 10ms. I haven't really tried to tune this experimentally.
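For reference, a sketch of the epoch plumbing being discussed; the 10ms tick just mirrors the Spin value mentioned above, and the right interval depends on the workload:

```rust
use std::time::Duration;
use wasmtime::{Config, Engine};

fn make_engine_with_epochs() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    config.async_support(true);
    config.epoch_interruption(true);
    let engine = Engine::new(&config)?;

    // A background thread bumps the epoch periodically; guest code that is
    // still running when the deadline passes will yield back to tokio.
    let ticker = engine.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_millis(10));
        ticker.increment_epoch();
    });

    Ok(engine)
}

// In each worker, tell the Store to yield (rather than trap) on each tick:
//     store.set_epoch_deadline(1);
//     store.epoch_deadline_async_yield_and_update(1);
```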

view this post on Zulip Notification Bot (May 14 2024 at 09:18):

René Rössler has marked this topic as resolved.


Last updated: Oct 23 2024 at 20:03 UTC