bkolobara opened Issue #2295:
I'm working on an application where I create one Wasm Instance per HTTP request.
Now I have run into a performance issue where most of the response time is spent on instance creation. I'm not sure how I could structure my app to improve this.
I run a profiler to measure where time is spent exactly and here are some results:
47% wasmtime::func::Func::wrap - wasmtime::trampoline::generate_raw_func_export 16% wasmtime_runtime::instance::InstanceHandle::dealloc - core::ptr::drop_in_place 16% wasmtime::linker::Linker::instantiate - wasmtime::instance::Instance::new 7% wasmtime::externals::Memory::new - wasmtime::externals::Memory::new -wasmtime::trampline::generate_memory_export ...
My biggest issue is the almost 50% of time spent wrapping closures for the Linker (Linker::func). But I don't see a way around this as every Linker is connected to a Store, and because I need to create a Store per Instance I can't reuse the Linker (or they would belong to different Stores). Also, I'm capturing some per request environment in the closures and re-using Linkers would be hard in this case.
For now it's still ok, but as I keep adding functionality I'm afraid this is going to be a big issue. Is there a way I could improve this?
Also the 16% dealloc seems a bit big, but I'm not sure where this is exactly coming from.
tschneidereit commented on Issue #2295:
This kind of use case would indeed be great to make faster! While for all I know there might be some low-hanging fruit left in the operations showing up in your profile, I wonder if a different approach might not be better:
Instead of actually creating a new instance for each request, might it also be possible to reset existing instances, or more specifically their Memory, stack, and tables?
You mention capturing per-request data in closures, so that part might need a different solution. Perhaps you could store that as instance-associated data?
@alexcrichton can you comment on how difficult it'd be to do this kind of instance-resetting?
bkolobara commented on Issue #2295:
Instead of actually creating a new instance for each request, might it also be possible to reset existing instances, or more specifically their Memory, stack, and tables?
This was the first idea I also came up with. Having an
Instance::reset()
would be ideal. Then I just could run a pool of ready Instances and save a lot of work. Zeroing out the memory could be potentially much faster than creating new linear memories. But I'm not sure how hard it would be to implement such a thing.You mention capturing per-request data in closures, so that part might need a different solution. Perhaps you could store that as instance-associated data?
I can work around this by capturing a structure that I internally change for each request. In some way this would be "instance-associated data".
tschneidereit commented on Issue #2295:
Ok, great! I guess this would be about resetting the store, specifically, so that should probably be reflected in the name.
If this turns out not to require deep architectural changes, would you potentially be interested in working on adding this to Wasmtime, @bkolobara?
bkolobara commented on Issue #2295:
Yes, this is an important issue to me and I'm able to dedicate some time to it.
tschneidereit commented on Issue #2295:
Excellent! Let's wait for Alex to weigh in then.
fitzgen commented on Issue #2295:
Resetting memories, globals, and tables shouldn't be too hard as long as there aren't any frames on the stack. This way we wouldn't need to reset the stack, which has bigger implications (like spec deviance, for one).
alexcrichton commented on Issue #2295:
Thanks for the report here @bkolobara! This is definitely a situation we'd like to be quite fast, and it sounds like it's not quite up to par at the moment!
Would it be possible to share some code you're using? I'd love to dig in a bit more into a microbenchmark or something like that to see if there's any low hanging fruit.
For resetting the
Store
, I'm not sure how easy that would be. It could be very tricky to pull off safely and I'd prefer to go the route of optimizing what we have there already first if possible (not a ton of optimization effort has gone in just yet). I think my ideal solution if optimization doesn't work though would be to somehow reuse theLinker
so you don't have to recreate the entire host environment each time, but you could instead create it once globally or once per thread maybe (or something like that).
bkolobara commented on Issue #2295:
Would it be possible to share some code you're using? I'd love to dig in a bit more into a microbenchmark or something like that to see if there's any low hanging fruit.
To not dump the whole unrelated project here, I will just c/p the part handling instantiation here (it is mostly straight forward code):
... let store = Store::new(&engine); let mut linker = Linker::new(&store); let memory = match memory { MemoryChoice::Existing(memory) => memory, MemoryChoice::New(min_memory) => { let memory_ty = MemoryType::new(Limits::new(min_memory, None)); Memory::new(&store, memory_ty) } }; let environment = ProcessEnvironment::new(engine, module.clone(), memory.data_ptr(), other_args); linker.define("env", "memory", memory)?; // A lot of let env = environment.clone(); linker.func("env", "func1", move |index: u32| { ... env.something ... })?; ... let env = environment.clone(); linker.func("env", "func2", move |index: u32| { ... env.something ... })?; ... let instance = linker.instantiate(&module)?; let func = instance.get_func("response").unwrap(); func.call(&[Val::from(arg1 as i32), Val::from(arg2)])?;
If I wrap this code inside
std::time::Instant::now()
&now.elapsed().as_micros()
I get around 900us on average on my machine. This is not that bad, but considering that thefunc.call
finishes really fast for simple endpoints most of the time is spent on instantiating.... would be to somehow reuse the
Linker
so you don't have to recreate the entire host environment each time, but you could instead create it once globally or once per thread maybe (or something like that).I attempted to do something like this, but got stuck on the fact that each
Linker
is created from aStore
and can't be used with others. So I need to keep aLinker
&Store
pair preserved. But if I don't drop theStore
I can't drop theInstances
associated with it.
alexcrichton commented on Issue #2295:
Thanks for the info! That should be plenty good for investigating. As one other piece of info, how many functions are you defining via
linker.func
?Also yeah, you can't reuse
Linker
across aStore
today. You're usingwasmtime
how it's intended, we just need to make it faster or tweak the recommended idioms and APIs to make it faster :)
bkolobara commented on Issue #2295:
how many functions are you defining via
linker.func
?16, but I will need much more as I keep adding features.
alexcrichton commented on Issue #2295:
@bkolobara hey wanted to let you know that I haven't forgotten about this (although this is a bit delayed). I've done some small work here and there to optimize this use case, but I wanted to drill a bit more into the numbers that you were seeing. I whipped up a small benchmark with a number of functions of differing signatures, and on my computer that's taking ~50us per iteration. You mentioned you were getting 900us, so I was wondering if you could run that locally and see how much time it's taking for you?
I'd ideally like to get a gauge on what the expected scaling factor is in the difference between these two environments.
bkolobara commented on Issue #2295:
Thanks a lot for looking into this @alexcrichton! Running your benchmark produces also a result of around ~60us on my machine. This got me curious so I just dumped the part I was previously benchmarking with a
let now = std::time::Instant::now()
&let elapsed = now.elapsed().as_micros()
wrapper into citerion and now I'm getting ~120us for the same code.I wasn't sure why the difference was so big, so I went ahead and wrapped the same code block again with
Instant::now()
inside my app. This resulted in:... Elapsed: 395 us Elapsed: 1123 us Elapsed: 281 us Elapsed: 939 us Elapsed: 306 us Elapsed: 925 us Elapsed: 486 us Elapsed: 517 us Elapsed: 418 us Elapsed: 411 us Elapsed: 329 us Elapsed: 484 us Elapsed: 506 us ...
I'm using multiple threads to create the instances and handle the requests. If I limit the execution to only 1 thread I'm getting more consistent results:
... Elapsed: 160 us Elapsed: 154 us Elapsed: 188 us Elapsed: 223 us Elapsed: 182 us Elapsed: 156 us Elapsed: 155 us Elapsed: 152 us Elapsed: 183 us Elapsed: 152 us Elapsed: 151 us Elapsed: 150 us ...
This is much closer to the citerion results.
I assume that I'm getting a big performance hit from cache locality? Or maybe the shared
Module
&Engine
become a bottleneck? Nothing else is shared between the threads.I will try to investigate this further and report back, but I'm not sure how to exactly pinpoint this kind of issues.
bkolobara commented on Issue #2295:
@alexcrichton I have created a modified version of your benchmark using Rayon to instantiate in parallel.
Using
into_par_iter()
orinto_iter()
doesn't see to make any difference. I would expect the parallel one to take less time as the CPU utilisation is much higher during the benchmark. I could also be doing something obviously wrong :big_smile:.
bkolobara edited a comment on Issue #2295:
@alexcrichton I have created a modified version of your benchmark using Rayon to instantiate in parallel.
Using
into_par_iter()
orinto_iter()
doesn't seem to make any difference. I would expect the parallel one to take less time as the CPU utilisation is much higher during the benchmark. I could also be doing something obviously wrong :big_smile:.
alexcrichton commented on Issue #2295:
Hm interesting! What plaform are you running on and what does the cpu count look like? I'm not sure if this means that there's high contention, though, it sort of makes sense that the per-iteration time doesn't change too too much in parallel vs non-parallel because there shouldn't be contention. I don't really understand the math behind criterion though to understand whether this is an expected result or not.
It is interesting though that with a threaded version of your application you're seeing higher overheads than without. That may point to a threading overhead though rather than a wasmtime overhead?
In any case though we're always looking for ways to improve instantiation time and with new work underway to merge lucet and wasmtime I suspect we'll continue to drive this number down.
alexcrichton commented on Issue #2295:
With this rfc merged and implemented I think this is now fixed.
alexcrichton closed Issue #2295:
I'm working on an application where I create one Wasm Instance per HTTP request.
Now I have run into a performance issue where most of the response time is spent on instance creation. I'm not sure how I could structure my app to improve this.
I run a profiler to measure where time is spent exactly and here are some results:
47% wasmtime::func::Func::wrap - wasmtime::trampoline::generate_raw_func_export 16% wasmtime_runtime::instance::InstanceHandle::dealloc - core::ptr::drop_in_place 16% wasmtime::linker::Linker::instantiate - wasmtime::instance::Instance::new 7% wasmtime::externals::Memory::new - wasmtime::externals::Memory::new -wasmtime::trampline::generate_memory_export ...
My biggest issue is the almost 50% of time spent wrapping closures for the Linker (Linker::func). But I don't see a way around this as every Linker is connected to a Store, and because I need to create a Store per Instance I can't reuse the Linker (or they would belong to different Stores). Also, I'm capturing some per request environment in the closures and re-using Linkers would be hard in this case.
For now it's still ok, but as I keep adding functionality I'm afraid this is going to be a big issue. Is there a way I could improve this?
Also the 16% dealloc seems a bit big, but I'm not sure where this is exactly coming from.
Last updated: Jan 24 2025 at 00:11 UTC