After a lot of testing and profiling, my conclusion is that calling a no-op exported function takes ~23ns (Intel 8474C), and calling an exported alloc(0) takes ~40ns. Is this the expected overhead? If so, then when the computation I need WASM to do is very fast (on the 100ns scale) but requires extra wasm calls to allocate the input buffer and drop the output buffer, the function call overhead becomes non-negligible?
I also don't understand yet what those traphandlers are doing inside the host code, or what array_to_wasm_trampoline is doing inside the generated wasm code. Based on my profiling I think those take up the ~23ns per call. I'd really appreciate it if anyone can answer.
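For reference, a minimal sketch of the kind of timing loop that produces numbers like these; the module and the loop here are illustrative, not the exact benchmark used:

```rust
use std::time::Instant;
use wasmtime::{Engine, Instance, Module, Store, TypedFunc};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A tiny module exporting a single no-op function.
    let module = Module::new(&engine, r#"(module (func (export "nop")))"#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let nop: TypedFunc<(), ()> = instance.get_typed_func(&mut store, "nop")?;

    // Time many back-to-back calls and divide to approximate per-call cost.
    const ITERS: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        nop.call(&mut store, ())?;
    }
    println!("~{:?} per call", start.elapsed() / ITERS);
    Ok(())
}
```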
If you are using the Rust API, you can use func.typed() to get a TypedFunc from a Func (or use instance.get_typed_func() to get a TypedFunc directly). Calling a TypedFunc should be a bit faster than calling a Func because it doesn't need to type-check the arguments and can use a trampoline which places all arguments directly in the right registers without going through an array on the stack.
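A minimal sketch of the typed path; the module here is a stand-in, and the point is just the two ways of getting a TypedFunc:

```rust
use wasmtime::{Engine, Instance, Module, Store, TypedFunc};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::new(
        &engine,
        r#"(module (func (export "add") (param i32 i32) (result i32)
               local.get 0 local.get 1 i32.add))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Argument/result types are checked once here, not on every call.
    let add: TypedFunc<(i32, i32), i32> = instance.get_typed_func(&mut store, "add")?;
    assert_eq!(add.call(&mut store, (2, 3))?, 5);

    // Equivalent route: start from an untyped Func and call .typed().
    let add_func = instance.get_func(&mut store, "add").unwrap();
    let add2 = add_func.typed::<(i32, i32), i32>(&store)?;
    assert_eq!(add2.call(&mut store, (4, 5))?, 9);
    Ok(())
}
```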
Thanks, I am using instance.get_typed_func() to get a TypedFunc, so that type check may not be contributing to the overhead.
Yes, on the order of 20ns is the expected overhead at this time. Pushing that number down further would require various optimizations we've discussed in the past, but they are nontrivial and so haven't been done yet.
What you're seeing with traphandlers is one of the main points of possible optimization. Notably, Wasmtime calls into C, which calls setjmp, because setjmp isn't safe to call from Rust. That C code then calls back out into Rust, which forces a whole bunch of code to not get inlined by default. I've never measured with cross-language LTO, which should improve the situation here. The ideal fix here, though, is related to the next bit...
The array_to_wasm_trampoline function call you're seeing is how Wasmtime enters Cranelift-generated code. Wasm code has a custom ABI that isn't compatible with anything in Rust or C (to support tail calls and additionally to be flexibly defined irrespective of the host), meaning that to call WebAssembly, Wasmtime goes through a trampoline. This trampoline is responsible for reading arguments off the stack, calling wasm with the appropriate ABI, and then storing the results back to the stack.
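To make that array-shaped boundary concrete, here is the untyped host-side API, where arguments and results travel through slices of Val; this is only the host's view of the call, since the trampoline itself is generated machine code rather than Rust:

```rust
use wasmtime::{Engine, Instance, Module, Store, Val};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::new(
        &engine,
        r#"(module (func (export "add") (param i32 i32) (result i32)
               local.get 0 local.get 1 i32.add))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Untyped path: arguments go in through one array and results come
    // back through another; under the hood the trampoline unpacks the
    // arguments into the wasm ABI and packs the results back.
    let add = instance.get_func(&mut store, "add").unwrap();
    let mut results = [Val::I32(0)];
    add.call(&mut store, &[Val::I32(2), Val::I32(3)], &mut results)?;
    assert_eq!(results[0].i32(), Some(5));
    Ok(())
}
```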
Perhaps the biggest performance win available would be to get setjmp out of the picture. That can in theory be done by moving it directly into array_to_wasm_trampoline, basically making Cranelift responsible for the setjmp. That would require a fair bit of design and work in Cranelift, however. If we were to do that it would remove the C entirely, and much more could be inlined.
I'm not sure what the next biggest win after that would be, since I've always assumed that's the lowest-hanging fruit. We might be able to shift more responsibilities to the trampoline as well to speed things up, but it really depends.
I can also say, though, that while we've tried to optimize this as much as possible in the past, we've generally never had a use case that requires low-overhead function calls. That's meant that some of these meatier optimizations have been tough to prioritize relative to other work. If your use case requires function calls to be faster, though, I'd love to work with you to help identify the hot spots and, if you're willing, help you craft PRs to optimize things. That being said, the main change I think will speed things up, calling setjmp in Cranelift, is likely a very nontrivial thing to take on and may not be a good starter PR.
Thank you for the detailed explanation. I think that right now, in my use case, the overhead can be amortized by reusing the wasm memory that stores input data (so that we don't need calls to the exported alloc and dealloc for each computation), or by increasing the data size for each call. I will get back if I find that those tricks don't work and I need to optimize wasmtime as you proposed.
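A sketch of that amortization; the guest exports alloc, compute, and memory are hypothetical names used for illustration, not a real module's interface:

```rust
use wasmtime::{Engine, Instance, Module, Store, TypedFunc};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "guest.wasm")?; // hypothetical guest module
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    let memory = instance.get_memory(&mut store, "memory").unwrap();
    let alloc: TypedFunc<u32, u32> = instance.get_typed_func(&mut store, "alloc")?;
    let compute: TypedFunc<(u32, u32), u32> = instance.get_typed_func(&mut store, "compute")?;

    // Allocate one input buffer up front and reuse it for every call,
    // so the alloc/dealloc call overhead is no longer paid per computation.
    let buf_len: u32 = 4096;
    let buf_ptr = alloc.call(&mut store, buf_len)?;

    for chunk in [[1u8; 16], [2u8; 16]] {
        // Write the input directly into the reused buffer, then compute.
        memory.write(&mut store, buf_ptr as usize, &chunk)?;
        let _out = compute.call(&mut store, (buf_ptr, chunk.len() as u32))?;
    }
    Ok(())
}
```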