Hello,
I'd like to reduce instantiation time, which is currently ~10s for a 2MB .wasm file.
Are there safety checks at instantiation (or during execution for that matter) that I could disable to improve performance?
I'm working with the WasmCert-Coq formalization and have a (WIP) formal proof that the module instantiates according to the spec.
Do you know about wasmtime compile?
also using InstancePre (https://docs.rs/wasmtime/latest/wasmtime/struct.Linker.html#method.instantiate_pre) and maybe the pooling allocator will help
but yeah if you are seeing 10 second "instantiations" then I am pretty sure you are measuring compile time as well, and compiling your Wasm modules ahead of time will be the biggest single improvement you can do
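For concreteness, a rough (untested) sketch of how those pieces could fit together with the Rust wasmtime crate, assuming a recent release; the "module.cwasm" path and the default pooling limits are placeholders, not anything from the benchmarks in this thread:

```rust
use anyhow::Result;
use wasmtime::{Config, Engine, InstanceAllocationStrategy, Module, PoolingAllocationConfig};

fn load_precompiled() -> Result<Module> {
    // Opt into the pooling instance allocator so instantiation mostly just
    // hands out a pre-allocated slot; limits are tunable on the config.
    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(
        PoolingAllocationConfig::default(),
    ));
    let engine = Engine::new(&config)?;

    // "module.cwasm" would be produced offline, e.g. by `wasmtime compile`
    // (or Engine::precompile_module), so no Cranelift work happens here.
    // deserialize_file is unsafe because wasmtime trusts that the file
    // really is an artifact it produced earlier.
    let module = unsafe { Module::deserialize_file(&engine, "module.cwasm")? };
    Ok(module)
}
```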
Sorry for the late response.
Thanks for the suggestions!
You're right that it's compilation, not instantiation (as I thought at first), that takes so long, so wasmtime compile indeed speeds things up a lot.
I thus measure startup time now (loading the file + compilation + instantiation).
As we finally have a proper testing setup, I included some numbers below:
- for the color benchmark with wasmtime-compile, we can't pretty print the result, as that would require importing a function write_char, which we can't provide for wasmtime run it seems (we don't support wasi-io)
- we run wasm-opt --coalesce-locals on our binaries first
thanks!
I'd be quite interested to learn why Node is quite a bit better on our benchmarks...?
Node uses V8, which has a multi-phase compilation pipeline: https://v8.dev/docs/wasm-compilation-pipeline
This includes a very quick startup baseline compiler (Liftoff) that has "pretty good" performance
There is a very work-in-progress backend for wasmtime called winch which I believe has similar goals
Would you be able to share what you're benchmarking in terms of code/setup/etc? Sounds like the lion's share of improvements, separating compile time from what you're benchmarking, worked well but further improvements may require a bit more careful analysis of what exactly node is doing and how Wasmtime is setup and/or configured.
Sure!
Everything for testing is in this repo; it includes the binaries generated by different versions of our compiler (best) and benchmark.py, which provides a CLI for benchmarking.
It calls run-wasmtime.py and run-node.js in the same folder, which measure one run each. Multiple runs are aggregated by benchmark.py.
I obtained the above numbers like this (in the folder evaluation):
$ ./benchmark.py --folder binaries/non-cps-grow-mem-func-mrch-24-24/ --engine=node --wasm-opt --coalesce-locals
$ ./benchmark.py --folder binaries/non-cps-grow-mem-func-mrch-24-24/ --engine=wasmtime --wasm-opt --coalesce-locals
Some more background information:
- we need --coalesce-locals because, in particular for color, the main function has >20k locals
- wasmtime-compile means we pre-compile the module with wasmtime compile and run the resulting .cwasm with wasmtime run
I have a separate but related, concrete performance issue as well:
If we, instead of this fragment
some_check   ;; leaves the branch condition on the stack
if           ;; early-return when the condition is true
  return
end
...
generate
some_check   ;; same condition
br_if i      ;; branch out to label i instead of returning
...
we measure a noticeable slowdown with wasmtime. Note that i is quite small, typically <= 5.
I am of the opinion that generating the br_if is the correct way to do it, but I don't want to include this change if it makes wasmtime that much slower.
Ah ok this is interesting! Cranelift is known to not do the same degree of optimizations as other compilers, for example LLVM and v8, and it's generally expected that Cranelift's performance will only be on par if the input code has been run through an optimizer beforehand. For LLVM-generated code that's typically the case, but for hand-generated code we recommend running it through an optimizer like wasm-opt first (with all of its bits and pieces turned on).
Not to say that what you're finding with br_if vs if return end isn't a bug of course. That'd still be good to fix, but in general it's expected that Cranelift will lose out performance-wise against v8 if the input wasm wasn't itself optimized
FWIW (for Wolfgang), Cranelift has been gaining a bunch of optimization infrastructure over the past few years (and in some benchmarks is seen to be ~at parity with V8); so "much simpler and needs pre-optimization" is becoming less true. Strictly speaking the only missing "fundamental optimization" is inlining; most everything else one expects from e.g. -O2 is there (GVN, LICM, constant prop, alias analysis and related transforms, a bunch of simplification rules).
That is to say: we're definitely not in the realm of "extremely simple and limited compiler that will fall down with trivially different branch patterns", at least that's the expectation! So I'm very curious what's going on above with the br_if -- @Wolfgang Meier are you sure that the target (i to br_if) is the outermost block? Or is the branch actually to some tail code? The control-flow graph is technically more complex (many edges into one return-block) so it wouldn't surprise me to see some slowdown due to additional processing, but zeroing in on this could be useful...
Reading over the results, it looks like ack might be the one with the largest discrepancy? I notice as well that you're enabling tail-call, and if ack stands for "ackermann" then it's known that tail-call historically has had a perf hit with function calls in wasmtime (even non-tail-call ones). That was fixed recently (I believe at least), so recent versions of wasmtime should perform better. Locally with what I think was a development build I saw only very small differences between v8 and wasmtime.
Do you know what version of wasmtime-py you're using?
Also, I'll note that with the --preload argument you can get write_{int,char} working with the wasmtime CLI, although you're only able to invoke one function so you can't implement the driver script you've got as part of the wasmtime CLI.
And finally, are there other benchmarks you're particularly interested in? For example is there another one that you see a large discrepancy on for wasmtime and v8?
Also if you've got a branch with the br_if vs if return end change I can try to poke around that as well and see if I can't see any low-hanging fruit for wasmtime
Thanks for looking into it, much appreciated.
Quick reply:
I'll experiment some more and report again later in the week...
Quick update, regarding br_if vs return.
I tried a bunch of things in our code generation that didn't amount to anything. Also: we indeed have branches for these two versions, but currently no documentation on how to set them up.
I'll have some time in a month to look into it some more.
We spent a bit of time optimizing our compiler; here is a bar plot comparing Wasmtime against Node.js.
Was hoping for some more ideas on what we could try?
(I figured a bar plot would be more helpful than the table with raw data.)
What is the y-axis unit (there is no label)?
And what is the distinction between wasmtime and wasmtime-compile? Are you doing separate AOT compilation in the former? (Why isn't the non-hatched part of the bar exactly equivalent between the two in that case?)
as far as "why is it slow to compile", this question is fairly impossible to answer without more detail -- have you tried profiling, and observing where the time is going within Wasmtime? If Cranelift, where in Cranelift?
Wasmtime is unfortunately still much slower than Node.js
are you still measuring all of process creation, wasm compilation, and wasm instantiation?
in general, no one has put in work to optimize how long it takes a wasmtime process to start up because no one using wasmtime in production has needed that yet. doesn't mean we wouldn't appreciate PRs improving the current state of things, just that no one has really looked at it or has any need to improve it.
fwiw, wasmtime's happy path for low latency start up, which has been optimized a ton because it is what production users actually do, is roughly the following:
- compile your Wasm modules to .cwasm offline
- create wasmtime::Modules from the offline-compiled .cwasms
- create wasmtime::InstancePres for those modules, early-binding their imports so that we don't have to do string lookups at instantiation time
- call instance_pre.instantiate(...) whenever you need a new instance (see the rough sketch just below)
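A rough Rust sketch of those steps, assuming a module already compiled to "module.cwasm" offline and a hypothetical exported function "_start" (both names are placeholders, not anything from the benchmarks above):

```rust
use anyhow::Result;
use wasmtime::{Engine, InstancePre, Linker, Module, Store};

fn setup(engine: &Engine) -> Result<InstancePre<()>> {
    // Load the offline-compiled artifact; unsafe because wasmtime trusts
    // that this file is something it produced earlier.
    let module = unsafe { Module::deserialize_file(engine, "module.cwasm")? };

    // Define any host imports on the linker (linker.func_wrap(...)), then
    // resolve them once up front so instantiation skips string lookups.
    let linker: Linker<()> = Linker::new(engine);
    linker.instantiate_pre(&module)
}

fn run_once(engine: &Engine, pre: &InstancePre<()>) -> Result<()> {
    // Per run: a fresh store plus a cheap instantiation.
    let mut store = Store::new(engine, ());
    let instance = pre.instantiate(&mut store)?;
    let start = instance.get_typed_func::<(), ()>(&mut store, "_start")?;
    start.call(&mut store, ())?;
    Ok(())
}
```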
we also have benchmarks in the repo that you can run via cargo bench --bench instantiate. these benchmarks roughly reflect this shape of workload. last time I ran them on my laptop, I was seeing ~5 microseconds per instantiation, regardless of the size of the wasm binary
on the other hand, if you want to reduce compile time, I'd suggest looking into using Winch (e.g. -C compiler=winch in the CLI). it is a single-pass "baseline" compiler comparable to V8's Liftoff tier. (as mentioned before, if you are comparing node and wasmtime in compile-and-instantiate, then you are comparing apples and oranges unless you switch wasmtime to using Winch or force node to skip Liftoff and go straight to its optimizing tier, somehow)
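For anyone using the embedding API rather than the CLI, selecting Winch looks roughly like this (a sketch; it assumes the wasmtime crate is built with its winch feature enabled):

```rust
use anyhow::Result;
use wasmtime::{Config, Engine, Strategy};

fn winch_engine() -> Result<Engine> {
    // Use the single-pass Winch baseline compiler instead of Cranelift,
    // trading generated-code quality for much faster compilation.
    let mut config = Config::new();
    config.strategy(Strategy::Winch);
    Engine::new(&config)
}
```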
@Chris Fallin time in ms. Yes, they are separate: Wasmtime uses the Python API; Wasmtime-compile is a pre-compiled .cwasm file that we run with wasmtime run.
Good point, I'll look more into profiling, thanks!
@fitzgen (he/him)
I'm happy with pre-compiling to .cwasm, that makes total sense.
We just couldn't use wasmtime run previously because we had some custom function imports (which we got rid of now).
And the python bindings don't (yet?) support .cwasm files
The title of this thread is perhaps misleading, it's just compilation time, not instantiation time.
I renamed the thread to reflect this
For benchmarking purposes, you should be able to invoke wasmtime compile
independently of the Python API. If you have surprisingly slow compilations, we're definitely interested; for any report to be actionable by us, though, it would either need profiling output and general info about the input ("this callstack in Cranelift is slow with input of this shape"), or ideally an actual .wasm file we can reproduce the issue with
Wolfgang Meier said:
And the python bindings don't (yet?) support
.cwasm
files
For this I think you can use Module.deserialize{,_file}
methods, but if you run into issues with those feel free to file an issue and/or feature request on wasmtime-py!
It's worth mentioning though that most optimization work in Wasmtime has been focused on the Rust-based wasmtime
crate, so for example InstancePre
isn't part of the Python API (yet)
Chris Fallin said:
For benchmarking purposes, you should be able to invoke
wasmtime compile
independently of the Python API.
Yes, that's what we do.
We're mostly happy with wasmtime compile
and wasmtime run
for now, I think.
Alex Crichton said:
For this I think you can use
Module.deserialize{,_file}
methods, but if you run into issues with those feel free to file an issue and/or feature request on wasmtime-py!
I'll look into this.
But I see now that our benchmarking setup is quite different from what you'd actually use in production
Thank you so much for your help!
In the future, if you're curious, custom host functions can sort of be supported through --preload
on the CLI. That'll load a module and use its exports as imports, so you could define your own custom functions in terms of WASI doing that, for example. That doesn't work well for passing chunks of memory (like strings) to imports though. Also if you've otherwise removed the need for host functions that's additionally not as applicable, but figured I could note here at least.
Hi @Alex Crichton a quick question: is 100ms of compilation time (without parallel compilation, None optimization level) typical for a 160KB Wasm module (generated from Rust code in release mode) that does simple SIMD bit-unpacking? My use case cannot use AOT compilation, so I just want to know if this overhead is expected and whether there is any way to optimize it. Thanks
And I did not find the info in this thread workable for me :(. But the numbers in the figure by @Wolfgang Meier match my experiments (the vs_easy and vs_hard).
Whether or not it's typical depends on a lot of factors, e.g. how powerful the cpu is and how big the functions are. I'm not sure many of us are benchmarking on single-threaded compiles ourselves. Would you be able to share the module (or a similar-ish module) so we can test locally? Also have you tried using Winch?
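As a sketch of how one might reproduce that kind of measurement with the Rust API (the knobs below mirror the setup described in the question — single-threaded compilation, no optimization — and the timing is plain std::time, nothing wasmtime-specific):

```rust
use anyhow::Result;
use wasmtime::{Config, Engine, Module, OptLevel};

fn time_compile(wasm_bytes: &[u8]) -> Result<()> {
    let mut config = Config::new();
    // Single-threaded compilation with Cranelift optimizations turned off,
    // matching the configuration described above.
    config.parallel_compilation(false);
    config.cranelift_opt_level(OptLevel::None);
    let engine = Engine::new(&config)?;

    let start = std::time::Instant::now();
    let _module = Module::new(&engine, wasm_bytes)?;
    println!("compile took {:?}", start.elapsed());
    Ok(())
}
```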
Alex Crichton said:
Whether or not it's typical depends on a lot of factors, e.g. how powerful the cpu is and how big the functions are. I'm not sure many of us are benchmarking on single-threaded compiles ourselves. Would you be able to share the module (or a similar-ish module) so we can test locally? Also have you tried using Winch?
Thanks for the reply! The code contains V128 and Winch does not support it yet. The code is a wrapper around https://github.com/spiraldb/fastlanes and the machine is an Intel(R) Xeon(R) Platinum 8474C.
Ah ok yeah our general solution for "you want fast compiles" is indeed not applicable here since Winch doesn't fully support simd yet.
Would you be able to share the wasm binary you're working with?
I have a copy here: https://drive.google.com/file/d/1jtWsDfDEh_uADDqhem4besFAi8NaMnzA/view?usp=sharing
Poking around at this I don't see anything out of the ordinary myself, so what you're seeing is probably expected.
Ok, thanks!