wasmtime / Issue #2295 Instantiation performance improvement · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #2295 Instantiation performance improvement

Wasmtime GitHub notifications bot (Oct 16 2020 at 15:03):

bkolobara opened Issue #2295:

I'm working on an application where I create one Wasm Instance per HTTP request.

Now I have run into a performance issue where most of the response time is spent on instance creation. I'm not sure how I could structure my app to improve this.

I run a profiler to measure where time is spent exactly and here are some results:
47% wasmtime::func::Func::wrap
    - wasmtime::trampoline::generate_raw_func_export
16% wasmtime_runtime::instance::InstanceHandle::dealloc
    - core::ptr::drop_in_place
16% wasmtime::linker::Linker::instantiate
    - wasmtime::instance::Instance::new
 7% wasmtime::externals::Memory::new
    - wasmtime::externals::Memory::new
      -wasmtime::trampline::generate_memory_export
...
My biggest issue is the almost 50% of time spent wrapping closures for the Linker (Linker::func). But I don't see a way around this as every Linker is connected to a Store, and because I need to create a Store per Instance I can't reuse the Linker (or they would belong to different Stores). Also, I'm capturing some per request environment in the closures and re-using Linkers would be hard in this case.

For now it's still ok, but as I keep adding functionality I'm afraid this is going to be a big issue. Is there a way I could improve this?

Also the 16% dealloc seems a bit big, but I'm not sure where this is exactly coming from.

Wasmtime GitHub notifications bot (Oct 16 2020 at 15:51):

tschneidereit commented on Issue #2295:

This kind of use case would indeed be great to make faster! While for all I know there might be some low-hanging fruit left in the operations showing up in your profile, I wonder if a different approach might not be better:

Instead of actually creating a new instance for each request, might it also be possible to reset existing instances, or more specifically their Memory, stack, and tables?

You mention capturing per-request data in closures, so that part might need a different solution. Perhaps you could store that as instance-associated data?

@alexcrichton can you comment on how difficult it'd be to do this kind of instance-resetting?

Wasmtime GitHub notifications bot (Oct 16 2020 at 16:09):

bkolobara commented on Issue #2295:

Instead of actually creating a new instance for each request, might it also be possible to reset existing instances, or more specifically their Memory, stack, and tables?

This was the first idea I also came up with. Having an Instance::reset() would be ideal. Then I just could run a pool of ready Instances and save a lot of work. Zeroing out the memory could be potentially much faster than creating new linear memories. But I'm not sure how hard it would be to implement such a thing.

You mention capturing per-request data in closures, so that part might need a different solution. Perhaps you could store that as instance-associated data?

I can work around this by capturing a structure that I internally change for each request. In some way this would be "instance-associated data".

Wasmtime GitHub notifications bot (Oct 16 2020 at 16:19):

tschneidereit commented on Issue #2295:

Ok, great! I guess this would be about resetting the store, specifically, so that should probably be reflected in the name.

If this turns out not to require deep architectural changes, would you potentially be interested in working on adding this to Wasmtime, @bkolobara?

Wasmtime GitHub notifications bot (Oct 16 2020 at 16:32):

bkolobara commented on Issue #2295:

Yes, this is an important issue to me and I'm able to dedicate some time to it.

Wasmtime GitHub notifications bot (Oct 16 2020 at 16:33):

tschneidereit commented on Issue #2295:

Excellent! Let's wait for Alex to weigh in then.

Wasmtime GitHub notifications bot (Oct 16 2020 at 17:01):

fitzgen commented on Issue #2295:

Resetting memories, globals, and tables shouldn't be too hard as long as there aren't any frames on the stack. This way we wouldn't need to reset the stack, which has bigger implications (like spec deviance, for one).

Wasmtime GitHub notifications bot (Oct 16 2020 at 17:44):

alexcrichton commented on Issue #2295:

Thanks for the report here @bkolobara! This is definitely a situation we'd like to be quite fast, and it sounds like it's not quite up to par at the moment!

Would it be possible to share some code you're using? I'd love to dig in a bit more into a microbenchmark or something like that to see if there's any low hanging fruit.

For resetting the Store, I'm not sure how easy that would be. It could be very tricky to pull off safely and I'd prefer to go the route of optimizing what we have there already first if possible (not a ton of optimization effort has gone in just yet). I think my ideal solution if optimization doesn't work though would be to somehow reuse the Linker so you don't have to recreate the entire host environment each time, but you could instead create it once globally or once per thread maybe (or something like that).

Wasmtime GitHub notifications bot (Oct 16 2020 at 19:03):

bkolobara commented on Issue #2295:

Would it be possible to share some code you're using? I'd love to dig in a bit more into a microbenchmark or something like that to see if there's any low hanging fruit.

To not dump the whole unrelated project here, I will just c/p the part handling instantiation here (it is mostly straight forward code):
...
let store = Store::new(&engine);
let mut linker = Linker::new(&store);

let memory = match memory {
    MemoryChoice::Existing(memory) => memory,
    MemoryChoice::New(min_memory) => {
        let memory_ty = MemoryType::new(Limits::new(min_memory, None));
        Memory::new(&store, memory_ty)
    }
};

let environment =
    ProcessEnvironment::new(engine, module.clone(), memory.data_ptr(), other_args);

linker.define("env", "memory", memory)?;
// A lot of
let env = environment.clone();
linker.func("env", "func1", move |index: u32| { ... env.something ... })?;
...
let env = environment.clone();
linker.func("env", "func2", move |index: u32| { ... env.something ... })?;
...

let instance = linker.instantiate(&module)?;
let func = instance.get_func("response").unwrap();
func.call(&[Val::from(arg1 as i32), Val::from(arg2)])?;
If I wrap this code inside std::time::Instant::now() & now.elapsed().as_micros() I get around 900us on average on my machine. This is not that bad, but considering that the func.call finishes really fast for simple endpoints most of the time is spent on instantiating.

... would be to somehow reuse the Linker so you don't have to recreate the entire host environment each time, but you could instead create it once globally or once per thread maybe (or something like that).

I attempted to do something like this, but got stuck on the fact that each Linker is created from a Store and can't be used with others. So I need to keep a Linker & Store pair preserved. But if I don't drop the Store I can't drop the Instances associated with it.

Wasmtime GitHub notifications bot (Oct 16 2020 at 19:32):

alexcrichton commented on Issue #2295:

Thanks for the info! That should be plenty good for investigating. As one other piece of info, how many functions are you defining via linker.func?

Also yeah, you can't reuse Linker across a Store today. You're using wasmtime how it's intended, we just need to make it faster or tweak the recommended idioms and APIs to make it faster :)

Wasmtime GitHub notifications bot (Oct 16 2020 at 19:54):

bkolobara commented on Issue #2295:

how many functions are you defining via linker.func?

16, but I will need much more as I keep adding features.

Wasmtime GitHub notifications bot (Nov 16 2020 at 20:32):

alexcrichton commented on Issue #2295:

@bkolobara hey wanted to let you know that I haven't forgotten about this (although this is a bit delayed). I've done some small work here and there to optimize this use case, but I wanted to drill a bit more into the numbers that you were seeing. I whipped up a small benchmark with a number of functions of differing signatures, and on my computer that's taking ~50us per iteration. You mentioned you were getting 900us, so I was wondering if you could run that locally and see how much time it's taking for you?

I'd ideally like to get a gauge on what the expected scaling factor is in the difference between these two environments.

Wasmtime GitHub notifications bot (Nov 17 2020 at 10:49):

bkolobara commented on Issue #2295:

Thanks a lot for looking into this @alexcrichton! Running your benchmark produces also a result of around ~60us on my machine. This got me curious so I just dumped the part I was previously benchmarking with a let now = std::time::Instant::now() & let elapsed = now.elapsed().as_micros() wrapper into citerion and now I'm getting ~120us for the same code.

I wasn't sure why the difference was so big, so I went ahead and wrapped the same code block again with Instant::now() inside my app. This resulted in:
...
Elapsed: 395 us
Elapsed: 1123 us
Elapsed: 281 us
Elapsed: 939 us
Elapsed: 306 us
Elapsed: 925 us
Elapsed: 486 us
Elapsed: 517 us
Elapsed: 418 us
Elapsed: 411 us
Elapsed: 329 us
Elapsed: 484 us
Elapsed: 506 us
...
I'm using multiple threads to create the instances and handle the requests. If I limit the execution to only 1 thread I'm getting more consistent results:
...
Elapsed: 160 us
Elapsed: 154 us
Elapsed: 188 us
Elapsed: 223 us
Elapsed: 182 us
Elapsed: 156 us
Elapsed: 155 us
Elapsed: 152 us
Elapsed: 183 us
Elapsed: 152 us
Elapsed: 151 us
Elapsed: 150 us
...
This is much closer to the citerion results.

I assume that I'm getting a big performance hit from cache locality? Or maybe the shared Module & Engine become a bottleneck? Nothing else is shared between the threads.

I will try to investigate this further and report back, but I'm not sure how to exactly pinpoint this kind of issues.

Wasmtime GitHub notifications bot (Nov 17 2020 at 16:00):

bkolobara commented on Issue #2295:

@alexcrichton I have created a modified version of your benchmark using Rayon to instantiate in parallel.

Using into_par_iter() or into_iter() doesn't see to make any difference. I would expect the parallel one to take less time as the CPU utilisation is much higher during the benchmark. I could also be doing something obviously wrong :big_smile:.

Wasmtime GitHub notifications bot (Nov 18 2020 at 11:00):

bkolobara edited a comment on Issue #2295:

@alexcrichton I have created a modified version of your benchmark using Rayon to instantiate in parallel.

Using into_par_iter() or into_iter() doesn't seem to make any difference. I would expect the parallel one to take less time as the CPU utilisation is much higher during the benchmark. I could also be doing something obviously wrong :big_smile:.

Wasmtime GitHub notifications bot (Nov 18 2020 at 15:33):

alexcrichton commented on Issue #2295:

Hm interesting! What plaform are you running on and what does the cpu count look like? I'm not sure if this means that there's high contention, though, it sort of makes sense that the per-iteration time doesn't change too too much in parallel vs non-parallel because there shouldn't be contention. I don't really understand the math behind criterion though to understand whether this is an expected result or not.

It is interesting though that with a threaded version of your application you're seeing higher overheads than without. That may point to a threading overhead though rather than a wasmtime overhead?

In any case though we're always looking for ways to improve instantiation time and with new work underway to merge lucet and wasmtime I suspect we'll continue to drive this number down.

Wasmtime GitHub notifications bot (Mar 19 2021 at 18:50):

alexcrichton commented on Issue #2295:

With this rfc merged and implemented I think this is now fixed.

Wasmtime GitHub notifications bot (Mar 19 2021 at 18:50):

alexcrichton closed Issue #2295:

I'm working on an application where I create one Wasm Instance per HTTP request.

Now I have run into a performance issue where most of the response time is spent on instance creation. I'm not sure how I could structure my app to improve this.

I run a profiler to measure where time is spent exactly and here are some results:
47% wasmtime::func::Func::wrap
    - wasmtime::trampoline::generate_raw_func_export
16% wasmtime_runtime::instance::InstanceHandle::dealloc
    - core::ptr::drop_in_place
16% wasmtime::linker::Linker::instantiate
    - wasmtime::instance::Instance::new
 7% wasmtime::externals::Memory::new
    - wasmtime::externals::Memory::new
      -wasmtime::trampline::generate_memory_export
...
My biggest issue is the almost 50% of time spent wrapping closures for the Linker (Linker::func). But I don't see a way around this as every Linker is connected to a Store, and because I need to create a Store per Instance I can't reuse the Linker (or they would belong to different Stores). Also, I'm capturing some per request environment in the closures and re-using Linkers would be hard in this case.

For now it's still ok, but as I keep adding functionality I'm afraid this is going to be a big issue. Is there a way I could improve this?

Also the 16% dealloc seems a bit big, but I'm not sure where this is exactly coming from.

Last updated: Apr 17 2025 at 21:03 UTC