wasmtime / PR #8554 build: add "optimized" profile for ru... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #8554 build: add "optimized" profile for ru...

Wasmtime GitHub notifications bot (May 05 2024 at 12:16):

dundargoc requested fitzgen for a review on PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 12:16):

dundargoc requested wasmtime-core-reviewers for a review on PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 12:16):

dundargoc opened PR #8554 from dundargoc:build/performance to bytecodealliance:main:

This is extremely useful for cases where the default optimizations just
are not enough.

Background: neovim is interested to
add wasmtime support in https://github.com/neovim/neovim/pull/28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs

Wasmtime GitHub notifications bot (May 05 2024 at 12:16):

dundargoc requested wasmtime-default-reviewers for a review on PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 12:41):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:41):

bjorn3 created PR review comment:

This is already the default for release mode.

Wasmtime GitHub notifications bot (May 05 2024 at 12:42):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:42):

bjorn3 created PR review comment:

Why disable debuginfo stripping when you don't enable debuginfo anyway?

Wasmtime GitHub notifications bot (May 05 2024 at 12:45):

bjorn3 created PR review comment:

This can probably be enabled unconditionally for the c-api. We aren't catching panics on the C abi boundary, so they will abort anyway (or be UB until the c_unwind rustc feature is fully stabilized).

Wasmtime GitHub notifications bot (May 05 2024 at 12:45):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:46):

clason submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:46):

clason created PR review comment:

This strips more, though (symbols as well). We really need all these settings to close the gap to a build without wasmtime (we tested this).

Wasmtime GitHub notifications bot (May 05 2024 at 12:47):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:47):

bjorn3 created PR review comment:

strip = "none" strips neither debuginfo, nor symbols.

Wasmtime GitHub notifications bot (May 05 2024 at 12:55):

clason submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 12:55):

clason created PR review comment:

My mistake; I was reading this wrong (my own tests were with true).

Wasmtime GitHub notifications bot (May 05 2024 at 13:21):

dundargoc updated PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 13:22):

dundargoc submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 13:22):

dundargoc created PR review comment:

I benchmarked this and it seems strip does not seem to make difference. I'll remove it.

Wasmtime GitHub notifications bot (May 05 2024 at 13:22):

dundargoc updated PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 13:36):

dundargoc submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 13:36):

dundargoc created PR review comment:

I am getting the following warning
Caused by:
  `panic` may not be specified in a `package` profile
It seems as though this is not possible based on the documentation:

Overrides cannot specify the panic, lto, or rpath settings.

Wasmtime GitHub notifications bot (May 05 2024 at 15:07):

dundargoc requested bjorn3 for a review on PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 15:24):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 15:24):

bjorn3 created PR review comment:

I meant setting the CARGO_PROFILE_<profile name>_PANIC="abort" env var in CMakeLists.txt where <profile name> is either DEV or RELEASE depending on if --release was passed or not.

Wasmtime GitHub notifications bot (May 05 2024 at 15:57):

dundargoc updated PR #8554.

Wasmtime GitHub notifications bot (May 05 2024 at 15:57):

dundargoc submitted PR review.

Wasmtime GitHub notifications bot (May 05 2024 at 15:57):

dundargoc created PR review comment:

Fixed

Wasmtime GitHub notifications bot (May 06 2024 at 16:01):

alexcrichton commented on PR #8554:

Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set CARGO_PROFILE_RELEASE_* to appropriate values and it can be customized per-project. Or perhaps could knobs be added to the CMake project in Wasmtime to set the env vars?

Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in? If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.

Wasmtime GitHub notifications bot (May 06 2024 at 16:09):

clason commented on PR #8554:

Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set CARGO_PROFILE_RELEASE_* to appropriate values and it can be customized per-project.

That's what we did before, but we thought that the preset could be used by other projects with similar needs.

Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in?

That is correct.

If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.

This is our thinking as well; if you know of such knobs, we'd be more than happy for the hint. Our own interface to wasmtime is extremely small (but we do link to tree-sitter, which uses more of it).

Wasmtime GitHub notifications bot (May 06 2024 at 16:12):

alexcrichton commented on PR #8554:

Ok sounds good, and a preset also sounds good to me!

Is this on macOS, Linux, Windows, or all three? This would likely be flags to the linker invocation that creates the neovim executable itself which would be along the lines of "only export this list of symbols and no more" and that would internalize everything else. Is it possible to construct a list like that, for example is main the only symbol that's necessary? Or are plugins dlopen'd which need to reach into internal symbols?

Wasmtime GitHub notifications bot (May 06 2024 at 16:20):

clason commented on PR #8554:

All three and more (as many as we can build on), with a wide range of compilers. We use only a very limited set (basically, wasm_engine_{new,delete}, but we link to tree-sitter which uses more (which I don't know off the top of my head).

Wasmtime GitHub notifications bot (May 06 2024 at 16:22):

clason edited a comment on PR #8554:

All three and more (as many as we can build on), with a wide range of compilers. We use only a very limited set (basically, wasm_engine_{new,delete}), but we link to tree-sitter which uses more (which I don't know off the top of my head).

Wasmtime GitHub notifications bot (May 06 2024 at 16:32):

alexcrichton commented on PR #8554:

Oh sorry, to clarify, is the performance regression you're seeing on all platforms? Or only some?

Wasmtime GitHub notifications bot (May 06 2024 at 16:34):

clason commented on PR #8554:

So far we have mostly tested with macOS; of course the exact impact varies with platform and toolchain.

Wasmtime GitHub notifications bot (May 06 2024 at 18:12):

alexcrichton commented on PR #8554:

I'm trying to test my theory with a small script but it's not too successful. The big binaries are slower but they don't get faster when removing the symbols so I don't think this going to bear fruit.

Regardless though I think it's fine to land this, but on reviewing it I think it might be best to call this "lto" in the various places instead of "optimized". I want to avoid confusion between "optimized" and --release builds since both are runtime-wise pretty equally optimized, and "lto" is the main differentiating factor here.

Wasmtime GitHub notifications bot (May 06 2024 at 18:30):

clason commented on PR #8554:

I appreciate your testing! Not married to the name, of course, but note that full LTO is only half the game here -- you also need the other settings for the "optimal" benefit. Unless you consider codegen-units to be part of LTO? Then it'd obviously make sense.

Wasmtime GitHub notifications bot (May 06 2024 at 18:33):

alexcrichton commented on PR #8554:

That's true yeah. I suppose I'm mostly looking for a name that evokes something like "pull out all the stops on optimizing this I'm willing to wait for an hour". That typically includes lto and codegen-units=1. I personally consider "lto" as "take your time please optimize this as much as you can" but that's also just me.

Also, question for y'all, does codegen-units=1 help much here in isolation? If that setting is dropped do you see a difference in perf? I ask because that should only affect generated code quality, not symbol visibility. If codegen-units=1 does help then the mystery deepens in my head for what's going on here (PR can land either way of course, I'm just curious now!)

Wasmtime GitHub notifications bot (May 06 2024 at 19:06):

clason commented on PR #8554:

Yes it does! (We benchmarked each option -- @dundargoc could you edit the PR description to add the individual steps?)

And it's _very_ weird -- but then I don't know any other large C project that statically links a Rust dependency (and we have been quite reticent about it im the past).

If LTO has the full connotation for you, that's fine with us of course!

Wasmtime GitHub notifications bot (May 06 2024 at 19:31):

alexcrichton commented on PR #8554:

Odd! Well in any case happy to r+ and flag for merge with a rename.

Otherwise though if y'all have a series of instructions to reproduce this slowdown I'd be happy to try to dig further in my spare time later.

Wasmtime GitHub notifications bot (May 06 2024 at 21:05):

dundargoc commented on PR #8554:

This is the benchmarking we did to determine which of the flags seemed to make a difference:

--release
  Time (mean ± σ):     132.1 ms ±   3.3 ms    [User: 114.5 ms, System: 12.3 ms]
  Range (min … max):   122.9 ms … 137.0 ms    22 runs

lto=true
  Time (mean ± σ):      83.5 ms ±   1.9 ms    [User: 66.3 ms, System: 12.0 ms]
  Range (min … max):    80.1 ms …  86.8 ms    34 runs

lto=true, codegen-units=1
  Time (mean ± σ):      80.7 ms ±   1.9 ms    [User: 63.6 ms, System: 11.9 ms]
  Range (min … max):    77.3 ms …  84.3 ms    34 runs

lto=true, codegen-units=1, strip="none"
  Time (mean ± σ):      80.9 ms ±   2.0 ms    [User: 64.0 ms, System: 12.0 ms]
  Range (min … max):    77.1 ms …  84.7 ms    34 runs

lto=true, codegen-units=1, strip="none", panic="abort"
  Time (mean ± σ):      52.0 ms ±   1.2 ms    [User: 35.2 ms, System: 11.8 ms]
  Range (min … max):    49.9 ms …  56.4 ms    51 runs

lto=true, codegen-units=1, strip="none", panic="abort", incremental=false
  Time (mean ± σ):      52.9 ms ±   1.1 ms    [User: 35.4 ms, System: 12.0 ms]
  Range (min … max):    50.4 ms …  54.9 ms    52 runs

So from this we were able to see that lto and panic makes the biggest difference, and that codegen-units=1 does make a small yet noticable difference as well. Note that @clason noticed a bigger difference for codegen-units=1 on his intel macos, so it could be platform-dependent as well?

As for the name, I am fine with whatever. lto is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was called fast-runtime. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.

Wasmtime GitHub notifications bot (May 06 2024 at 21:05):

dundargoc edited a comment on PR #8554:

This is the benchmarking we did to determine which of the flags seemed to make a difference:

--release
  Time (mean ± σ):     132.1 ms ±   3.3 ms    [User: 114.5 ms, System: 12.3 ms]
  Range (min … max):   122.9 ms … 137.0 ms    22 runs

lto=true
  Time (mean ± σ):      83.5 ms ±   1.9 ms    [User: 66.3 ms, System: 12.0 ms]
  Range (min … max):    80.1 ms …  86.8 ms    34 runs

lto=true, codegen-units=1
  Time (mean ± σ):      80.7 ms ±   1.9 ms    [User: 63.6 ms, System: 11.9 ms]
  Range (min … max):    77.3 ms …  84.3 ms    34 runs

lto=true, codegen-units=1, strip="none"
  Time (mean ± σ):      80.9 ms ±   2.0 ms    [User: 64.0 ms, System: 12.0 ms]
  Range (min … max):    77.1 ms …  84.7 ms    34 runs

lto=true, codegen-units=1, strip="none", panic="abort"
  Time (mean ± σ):      52.0 ms ±   1.2 ms    [User: 35.2 ms, System: 11.8 ms]
  Range (min … max):    49.9 ms …  56.4 ms    51 runs

lto=true, codegen-units=1, strip="none", panic="abort", incremental=false
  Time (mean ± σ):      52.9 ms ±   1.1 ms    [User: 35.4 ms, System: 12.0 ms]
  Range (min … max):    50.4 ms …  54.9 ms    52 runs

As for the profile name, I am fine with whatever. lto is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was called fast-runtime. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.

Wasmtime GitHub notifications bot (May 06 2024 at 21:06):

dundargoc edited a comment on PR #8554:

This is the benchmarking we did to determine which of the flags seemed to make a difference:

--release
  Time (mean ± σ):     132.1 ms ±   3.3 ms    [User: 114.5 ms, System: 12.3 ms]
  Range (min … max):   122.9 ms … 137.0 ms    22 runs

lto=true
  Time (mean ± σ):      83.5 ms ±   1.9 ms    [User: 66.3 ms, System: 12.0 ms]
  Range (min … max):    80.1 ms …  86.8 ms    34 runs

lto=true, codegen-units=1
  Time (mean ± σ):      80.7 ms ±   1.9 ms    [User: 63.6 ms, System: 11.9 ms]
  Range (min … max):    77.3 ms …  84.3 ms    34 runs

lto=true, codegen-units=1, strip="none"
  Time (mean ± σ):      80.9 ms ±   2.0 ms    [User: 64.0 ms, System: 12.0 ms]
  Range (min … max):    77.1 ms …  84.7 ms    34 runs

lto=true, codegen-units=1, strip="none", panic="abort"
  Time (mean ± σ):      52.0 ms ±   1.2 ms    [User: 35.2 ms, System: 11.8 ms]
  Range (min … max):    49.9 ms …  56.4 ms    51 runs

lto=true, codegen-units=1, strip="none", panic="abort", incremental=false
  Time (mean ± σ):      52.9 ms ±   1.1 ms    [User: 35.4 ms, System: 12.0 ms]
  Range (min … max):    50.4 ms …  54.9 ms    52 runs

Wasmtime GitHub notifications bot (May 06 2024 at 21:11):

dundargoc commented on PR #8554:

Oh, right. Reproduction steps. We build neovim in this PR with the following command:
make distclean #just to remove old builds
make CMAKE_EXTRA_FLAGS=-DENABLE_WASMTIME=ON DEPS_CMAKE_FLAGS=-DENABLE_WASMTIME=ON CMAKE_BUILD_TYPE=Release
cmake --install build --prefix bin
hyperfine --warmup 2 "bin/bin/nvim +'quit'"
I point to a local path of wasmtime in cmake.deps/deps.txt so it says
WASMTIME_URL /Users/dundargoc/programs/wasmtime
It's not the most graceful reproductions steps I'm afraid but that's the current tech we're working with at the moment :sweat_smile:

Wasmtime GitHub notifications bot (May 06 2024 at 21:34):

alexcrichton commented on PR #8554:

"fast-runtime" sounds reasonable to me, although I might bikeshed to "fastest-runtime" perhaps to help head off confusion between the default --release and "fast-runtime" where they're both supposed to be fast

Wasmtime GitHub notifications bot (May 06 2024 at 21:35):

alexcrichton commented on PR #8554:

And no worries! I'll try to kick the tires there and see if I can't bottom something out as to how Wasmtime is affecting things

Wasmtime GitHub notifications bot (May 06 2024 at 22:57):

dundargoc updated PR #8554.

Last updated: Apr 18 2025 at 17:03 UTC