Stream: general

Topic: reduce compilation time


view this post on Zulip Wolfgang Meier (Feb 01 2024 at 21:13):

Hello,

I'd like to reduce instantiation time, which is currently ~10s for a 2MB .wasm file.
Are there safety checks at instantiation (or during execution for that matter) that I could disable to improve performance?

I'm working with the WasmCert-Coq formalization and have a (WIP) formal proof that the module instantiates according to the spec.

view this post on Zulip Pat Hickey (Feb 01 2024 at 21:22):

Do you know about wasmtime compile?
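It precompiles a module ahead of time to a .cwasm that you can then run directly; roughly (file names are placeholders, and exact flags may differ between versions):

$ wasmtime compile module.wasm -o module.cwasm
$ wasmtime run --allow-precompiled module.cwasm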

view this post on Zulip fitzgen (he/him) (Feb 02 2024 at 02:27):

also using InstancePre and maybe the pooling allocator will help: https://docs.rs/wasmtime/latest/wasmtime/struct.Linker.html#method.instantiate_pre

but yeah if you are seeing 10 second "instantiations" then I am pretty sure you are measuring compile time as well, and compiling your Wasm modules ahead of time will be the biggest single improvement you can make
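with the Rust API the shape is roughly this (untested sketch; the config and paths are placeholders):

use wasmtime::*;

fn main() -> anyhow::Result<()> {
    // Optional: pooling allocator for cheap repeated instantiations.
    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::pooling());
    let engine = Engine::new(&config)?;

    // Load a module precompiled with `wasmtime compile`; deserialization is
    // unsafe because the bytes must come from trusted wasmtime output.
    let module = unsafe { Module::deserialize_file(&engine, "module.cwasm")? };

    // Resolve and type-check imports once up front...
    let linker: Linker<()> = Linker::new(&engine);
    let pre = linker.instantiate_pre(&module)?;

    // ...then each instantiation is cheap.
    let mut store = Store::new(&engine, ());
    let _instance = pre.instantiate(&mut store)?;
    Ok(())
}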

view this post on Zulip Wolfgang Meier (Apr 11 2024 at 13:12):

Sorry for the late response.

Thanks for the suggestions!
You're right that it's compilation, not instantiation (as I first thought), that takes so long,
so wasmtime compile indeed speeds things up a lot.
I thus measure startup time now (loading the file + compilation + instantiation).

As we finally have a proper testing setup, I've included some numbers below:

thanks!

benchmarks.png

view this post on Zulip Wolfgang Meier (Apr 11 2024 at 13:25):

I'd be quite interested to learn why Node is quite a bit better on our benchmarks...?

view this post on Zulip Lann Martin (Apr 11 2024 at 14:16):

Node uses V8, which has a multi-phase compilation pipeline: https://v8.dev/docs/wasm-compilation-pipeline

view this post on Zulip Lann Martin (Apr 11 2024 at 14:18):

This includes a very quick startup baseline compiler (Liftoff) that has "pretty good" performance

view this post on Zulip Lann Martin (Apr 11 2024 at 14:19):

There is a very work-in-progress backend for Wasmtime called Winch, which I believe has similar goals.

view this post on Zulip Alex Crichton (Apr 11 2024 at 14:24):

Would you be able to share what you're benchmarking in terms of code/setup/etc.? It sounds like the lion's share of the improvement, separating compile time from what you're benchmarking, worked well, but further improvements may require a bit more careful analysis of what exactly Node is doing and how Wasmtime is set up and/or configured.

view this post on Zulip Wolfgang Meier (Apr 12 2024 at 07:06):

Sure!
Everything for testing is in this repo: it includes the binaries generated by different versions of our compiler (best), and benchmark.py, which provides a CLI for benchmarking.
It calls run-wasmtime.py and run-node.js in the same folder, each of which measures one run. Multiple runs are aggregated by benchmark.py.

I obtained the above numbers like this (in the folder evaluation):
$./benchmark.py --folder binaries/non-cps-grow-mem-func-mrch-24-24/ --engine=node --wasm-opt --coalesce-locals
$./benchmark.py --folder binaries/non-cps-grow-mem-func-mrch-24-24/ --engine=wasmtime --wasm-opt --coalesce-locals

Some more background information:

(GitHub link previews for the womeier/certicoqwasm-testing and womeier/certicoqwasm repositories)

view this post on Zulip Wolfgang Meier (Apr 12 2024 at 10:22):

I have a separate but related, concrete performance issue as well:

If we, instead of this fragment

some_check
if
  return
end
...

generate

some_check
br_if i
...

we measure the following:

I am of the opinion that generating the br_if is the correct way to do it, but I don't want to include this change if it makes wasmtime that much slower.
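For reference, a standalone sketch of the two shapes (function and label names are made up):

;; variant A: if / return
(func $early_exit_if (param $x i32)
  local.get $x      ;; some_check
  if
    return
  end
  ;; ... rest of the function body
)

;; variant B: br_if to an enclosing label
(func $early_exit_br (param $x i32)
  block $exit
    local.get $x    ;; some_check
    br_if $exit
    ;; ... rest of the function body
  end
)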

br_if.png

view this post on Zulip Alex Crichton (Apr 12 2024 at 14:44):

Ah ok this is interesting! Cranelift is known not to do the same degree of optimization as other compilers, for example LLVM and V8, and it's generally expected that Cranelift's performance will only be on par if the input code has been run through an optimizer beforehand. For LLVM-generated code that's typically the case, but for hand-generated code we recommend running it through an optimizer like wasm-opt first (with all of its bits and pieces turned on).
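For example, something along these lines (exact flags depend on your Binaryen version; add feature flags like --enable-tail-call if your module needs them):

$ wasm-opt -O2 --enable-tail-call input.wasm -o input.opt.wasm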

Not to say that what you're finding with br_if vs if return end isn't a bug, of course. That'd still be good to fix, but in general it's expected that Cranelift will lose out performance-wise against V8 if the input Wasm wasn't itself optimized.

view this post on Zulip Chris Fallin (Apr 12 2024 at 17:36):

FWIW (for Wolfgang), Cranelift has been gaining a bunch of optimization infrastructure over the past few years (and in some benchmarks is seen to be ~at parity with V8); so "much simpler and needs pre-optimization" is becoming less true. Strictly speaking the only missing "fundamental optimization" is inlining; most everything else one expects from e.g. -O2 is there (GVN, LICM, constant prop, alias analysis and related transforms, a bunch of simplification rules).

That is to say: we're definitely not in the realm of "extremely simple and limited compiler that will fall down with trivially different branch patterns", at least that's the expectation! So I'm very curious what's going on above with the br_if -- @Wolfgang Meier are you sure that the target of the br_if (the label i) is the outermost block? Or is the branch actually to some tail code? The control-flow graph is technically more complex (many edges into one return block), so it wouldn't surprise me to see some slowdown due to additional processing, but zeroing in on this could be useful...

view this post on Zulip Alex Crichton (Apr 12 2024 at 20:29):

Reading over the results, it looks like ack might be the one with the largest discrepancy? I notice as well that you're enabling tail-call and if ack stands for "ackermann" then it's known that tail-call historically has had a perf hit with function calls in wasmtime (even non-tail-call ones). That was fixed recently (I believe at least) so recent versions of wasmtime should perform better. Locally with what I think was a development build I saw only very small differences between v8 and wasmtime.

Do you know what version of wasmtime-py you're using?

Also, I'll note that with the --preload argument you can get write_{int,char} working with the wasmtime CLI, although you're only able to invoke one function so you can't implement the driver script you've got as part of the wasmtime CLI.

And finally, are there other benchmarks you're particularly interested in? For example is there another one that you see a large discrepancy on for wasmtime and v8?

view this post on Zulip Alex Crichton (Apr 12 2024 at 20:53):

Also if you've got a branch with the br_if vs if return end change I can try to poke around that as well and see if I can't see any low-hanging fruit for wasmtime

view this post on Zulip Wolfgang Meier (Apr 13 2024 at 16:25):

Thanks for looking into it, much appreciated.

Quick reply:

I'll experiment some more and report again later in the week...

view this post on Zulip Wolfgang Meier (Jun 08 2024 at 21:52):

Quick update regarding br_if vs return.
I tried a bunch of things in our code generation that didn't amount to anything; also:

We indeed have branches for these two versions, but currently no documentation on how to set them up.
I'll have some time in a month to look into it some more.

view this post on Zulip Wolfgang Meier (Sep 24 2024 at 20:40):

We spent a bit of time optimizing our compiler; here is a bar plot comparing Wasmtime against Node.js.

I was hoping for some more ideas on what we could try.
(I figured a bar plot would be more helpful than the table with raw data.)

wasmtime_nodejs.png

view this post on Zulip Chris Fallin (Sep 24 2024 at 20:50):

What is the y-axis unit (there is no label)?

view this post on Zulip Chris Fallin (Sep 24 2024 at 20:51):

And what is the distinction between wasmtime and wasmtime-compile? Are you doing separate AOT compilation in the former? (Why isn't the non-hatched part of the bar exactly equivalent between the two in that case?)

view this post on Zulip Chris Fallin (Sep 24 2024 at 20:52):

as far as "why is it slow to compile", this question is fairly impossible to answer without more detail -- have you tried profiling, and observing where the time is going within Wasmtime? If Cranelift, where in Cranelift?

view this post on Zulip fitzgen (he/him) (Sep 24 2024 at 21:01):

Wasmtime is unfortunately still much slower than Node.js

are you still measuring all of process creation, wasm compilation, and wasm instantiation?

in general, no one has put in work to optimize how long it takes a wasmtime process to start up because no one using wasmtime in production has needed that yet. That doesn't mean we wouldn't appreciate PRs improving the current state of things, just that no one has really looked at it or had any need to improve it.

fwiw, wasmtime's happy path for low latency start up, which has been optimized a ton because it is what production users actually do, is roughly the following:

we also have benchmarks in the repo that you can run via cargo bench --bench instantiate. these benchmarks roughly reflect this shape of workload. last time I ran them on my laptop, I was seeing ~5 microseconds per instantiation, regardless of the size of the wasm binary

on the other hand, if you want to reduce compile time, I'd suggest looking into using Winch (e.g. -C compiler=winch in the CLI). it is a single-pass "baseline" compiler comparable to V8's Liftoff tier. (as mentioned before, if you are comparing node and wasmtime in compile-and-instantiate, then you are comparing apples and oranges unless you switch wasmtime to using Winch or force node to skip Liftoff and go straight to its optimizing tier, somehow)
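e.g. (assuming a recent CLI; the file name is a placeholder):

$ wasmtime run -C compiler=winch module.wasm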


view this post on Zulip Wolfgang Meier (Sep 24 2024 at 21:36):

@Chris Fallin time in ms. Yes, they are separate: wasmtime uses the Python API; wasmtime-compile is a pre-compiled .cwasm file that we run with wasmtime run.

Good point, I'll look more into profiling, thanks!

@fitzgen (he/him)

view this post on Zulip fitzgen (he/him) (Sep 24 2024 at 21:38):

The title of this thread is perhaps misleading, it's just compilation time, not instantiation time.

I renamed the thread to reflect this

view this post on Zulip Chris Fallin (Sep 24 2024 at 21:39):

For benchmarking purposes, you should be able to invoke wasmtime compile independently of the Python API. If you have surprisingly slow compilations, we're definitely interested; for any report to be actionable by us, though, it would either need profiling output and general info about the input ("this callstack in Cranelift is slow with input of this shape"), or ideally an actual .wasm file we can reproduce the issue with
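For example, on Linux something like this (paths are placeholders):

$ perf record -g -- wasmtime compile module.wasm -o module.cwasm
$ perf report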

view this post on Zulip Alex Crichton (Sep 24 2024 at 21:40):

Wolfgang Meier said:

And the python bindings don't (yet?) support .cwasm files

For this I think you can use Module.deserialize{,_file} methods, but if you run into issues with those feel free to file an issue and/or feature request on wasmtime-py!

view this post on Zulip Alex Crichton (Sep 24 2024 at 21:41):

It's worth mentioning though that most optimization work in Wasmtime has been focused on the Rust-based wasmtime crate, so for example InstancePre isn't part of the Python API (yet)

view this post on Zulip Wolfgang Meier (Sep 24 2024 at 21:45):

Chris Fallin said:

For benchmarking purposes, you should be able to invoke wasmtime compile independently of the Python API.

Yes, that's what we do.
We're mostly happy with wasmtime compile and wasmtime run for now, I think.

view this post on Zulip Wolfgang Meier (Sep 24 2024 at 21:48):

Alex Crichton said:

For this I think you can use Module.deserialize{,_file} methods, but if you run into issues with those feel free to file an issue and/or feature request on wasmtime-py!

I'll look into this.

But I see now that our benchmarking setup is quite different from what you'd actually use in production

view this post on Zulip Wolfgang Meier (Sep 24 2024 at 21:49):

Thank you so much for your help!

view this post on Zulip Alex Crichton (Sep 24 2024 at 21:50):

In the future, if you're curious, custom host functions can sort of be supported through --preload on the CLI. That'll load a module and use its exports as imports, so you could define your own custom functions in terms of WASI that way, for example. That doesn't work well for passing chunks of memory (like strings) to imports, though. Also, if you've otherwise removed the need for host functions this isn't as applicable, but I figured I could note it here at least.
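Roughly (module and function names are made up):

$ wasmtime run --preload host=host_funcs.wasm --invoke entry main.wasm

where main.wasm's imports from the "host" module are satisfied by host_funcs.wasm's exports.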

view this post on Zulip Xinyu Zeng (Nov 03 2024 at 12:38):

Hi @Alex Crichton a quick question: is 100ms of compilation time (without parallel compilation, None optimization level) typical for a 160KB Wasm module (generated from Rust code in release mode) that does simple SIMD bit-unpacking? My use case cannot use AOT compilation, so I just want to know whether this overhead is expected and whether there is any way to optimize it. Thanks!

view this post on Zulip Xinyu Zeng (Nov 03 2024 at 12:42):

And I did not find the info in this thread workable for me :(. But the numbers in the figure by @Wolfgang Meier match my experiments (the vs_easy and vs_hard ones).

view this post on Zulip Alex Crichton (Nov 03 2024 at 14:40):

Whether or not it's typical depends on a lot of factors, e.g. how powerful the cpu is and how big the functions are. I'm not sure many of us are benchmarking on single-threaded compiles ourselves. Would you be able to share the module (or a similar-ish module) so we can test locally? Also have you tried using Winch?
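If it helps, the compile-time-related knobs on the Rust side look roughly like this (untested sketch):

use wasmtime::{Config, Engine, Module, OptLevel};

fn build_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Trade run-time performance for faster compiles.
    config.cranelift_opt_level(OptLevel::None);
    // Parallel compilation is on by default (with the parallel-compilation feature).
    config.parallel_compilation(true);
    // Or switch to the Winch baseline compiler entirely (feature support
    // permitting): config.strategy(wasmtime::Strategy::Winch);
    Engine::new(&config)
}

fn compile(engine: &Engine, wasm_bytes: &[u8]) -> anyhow::Result<Module> {
    Module::new(engine, wasm_bytes)
}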

view this post on Zulip Xinyu Zeng (Nov 04 2024 at 09:10):

Alex Crichton said:

Whether or not it's typical depends on a lot of factors, e.g. how powerful the cpu is and how big the functions are. I'm not sure many of us are benchmarking on single-threaded compiles ourselves. Would you be able to share the module (or a similar-ish module) so we can test locally? Also have you tried using Winch?

Thanks for the reply! The code contains V128 and Winch does not support it yet. The code is a wrapper around https://github.com/spiraldb/fastlanes and the machine is an Intel(R) Xeon(R) Platinum 8474C.


view this post on Zulip Alex Crichton (Nov 04 2024 at 18:13):

Ah ok yeah our general solution for "you want fast compiles" is indeed not applicable here since Winch doesn't fully support simd yet.

Would you be able to share the wasm binary you're working with?

view this post on Zulip Xinyu Zeng (Nov 05 2024 at 07:23):

I have a copy here: https://drive.google.com/file/d/1jtWsDfDEh_uADDqhem4besFAi8NaMnzA/view?usp=sharing

view this post on Zulip Alex Crichton (Nov 05 2024 at 15:41):

Poking around at this I don't see anything out of the ordinary myself, so what you're seeing is probably expected.

view this post on Zulip Xinyu Zeng (Nov 06 2024 at 02:30):

Ok, thanks!

