dundargoc requested fitzgen for a review on PR #8554.
dundargoc requested wasmtime-core-reviewers for a review on PR #8554.
dundargoc opened PR #8554 from dundargoc:build/performance
to bytecodealliance:main
:
This is extremely useful for cases where the default optimizations just
are not enough.Background: neovim is interested to
add wasmtime support in https://github.com/neovim/neovim/pull/28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:No wasm Time (mean ± σ): 50.5 ms ± 1.5 ms [User: 32.8 ms, System: 12.3 ms] Range (min … max): 48.3 ms … 54.4 ms 56 runs Wasm, lto=thin Time (mean ± σ): 104.9 ms ± 3.5 ms [User: 86.5 ms, System: 12.7 ms] Range (min … max): 99.5 ms … 111.1 ms 26 runs Wasm, lto=true Time (mean ± σ): 83.8 ms ± 2.5 ms [User: 65.8 ms, System: 12.1 ms] Range (min … max): 80.5 ms … 93.3 ms 31 runs Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort" Time (mean ± σ): 53.1 ms ± 1.0 ms [User: 35.5 ms, System: 12.5 ms] Range (min … max): 50.6 ms … 55.5 ms 54 runs
dundargoc requested wasmtime-default-reviewers for a review on PR #8554.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
This is already the default for release mode.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Why disable debuginfo stripping when you don't enable debuginfo anyway?
bjorn3 created PR review comment:
This can probably be enabled unconditionally for the c-api. We aren't catching panics on the C abi boundary, so they will abort anyway (or be UB until the
c_unwind
rustc feature is fully stabilized).
bjorn3 submitted PR review.
clason submitted PR review.
clason created PR review comment:
This strips more, though (symbols as well). We really need all these settings to close the gap to a build without wasmtime (we tested this).
bjorn3 submitted PR review.
bjorn3 created PR review comment:
strip = "none"
strips neither debuginfo, nor symbols.
clason submitted PR review.
clason created PR review comment:
My mistake; I was reading this wrong (my own tests were with
true
).
dundargoc updated PR #8554.
dundargoc submitted PR review.
dundargoc created PR review comment:
I benchmarked this and it seems strip does not seem to make difference. I'll remove it.
dundargoc updated PR #8554.
dundargoc submitted PR review.
dundargoc created PR review comment:
I am getting the following warning
Caused by: `panic` may not be specified in a `package` profile
It seems as though this is not possible based on the documentation:
Overrides cannot specify the
panic
,lto
, orrpath
settings.
dundargoc requested bjorn3 for a review on PR #8554.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
I meant setting the
CARGO_PROFILE_<profile name>_PANIC="abort"
env var inCMakeLists.txt
where<profile name>
is eitherDEV
orRELEASE
depending on if--release
was passed or not.
dundargoc updated PR #8554.
dundargoc submitted PR review.
dundargoc created PR review comment:
Fixed
alexcrichton commented on PR #8554:
Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set
CARGO_PROFILE_RELEASE_*
to appropriate values and it can be customized per-project. Or perhaps could knobs be added to the CMake project in Wasmtime to set the env vars?Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in? If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.
Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set CARGO_PROFILE_RELEASE_* to appropriate values and it can be customized per-project.
That's what we did before, but we thought that the preset could be used by other projects with similar needs.
Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in?
That is correct.
If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.
This is our thinking as well; if you know of such knobs, we'd be more than happy for the hint. Our own interface to wasmtime is extremely small (but we do link to tree-sitter, which uses more of it).
alexcrichton commented on PR #8554:
Ok sounds good, and a preset also sounds good to me!
Is this on macOS, Linux, Windows, or all three? This would likely be flags to the linker invocation that creates the neovim executable itself which would be along the lines of "only export this list of symbols and no more" and that would internalize everything else. Is it possible to construct a list like that, for example is
main
the only symbol that's necessary? Or are plugins dlopen'd which need to reach into internal symbols?
All three and more (as many as we can build on), with a wide range of compilers. We use only a very limited set (basically,
wasm_engine_{new,delete}
, but we link to tree-sitter which uses more (which I don't know off the top of my head).
clason edited a comment on PR #8554:
All three and more (as many as we can build on), with a wide range of compilers. We use only a very limited set (basically,
wasm_engine_{new,delete}
), but we link to tree-sitter which uses more (which I don't know off the top of my head).
alexcrichton commented on PR #8554:
Oh sorry, to clarify, is the performance regression you're seeing on all platforms? Or only some?
So far we have mostly tested with macOS; of course the exact impact varies with platform and toolchain.
alexcrichton commented on PR #8554:
I'm trying to test my theory with a small script but it's not too successful. The big binaries are slower but they don't get faster when removing the symbols so I don't think this going to bear fruit.
Regardless though I think it's fine to land this, but on reviewing it I think it might be best to call this "lto" in the various places instead of "optimized". I want to avoid confusion between "optimized" and
--release
builds since both are runtime-wise pretty equally optimized, and "lto" is the main differentiating factor here.
I appreciate your testing! Not married to the name, of course, but note that full LTO is only half the game here -- you also need the other settings for the "optimal" benefit. Unless you consider codegen-units to be part of LTO? Then it'd obviously make sense.
alexcrichton commented on PR #8554:
That's true yeah. I suppose I'm mostly looking for a name that evokes something like "pull out all the stops on optimizing this I'm willing to wait for an hour". That typically includes lto and codegen-units=1. I personally consider "lto" as "take your time please optimize this as much as you can" but that's also just me.
Also, question for y'all, does codegen-units=1 help much here in isolation? If that setting is dropped do you see a difference in perf? I ask because that should only affect generated code quality, not symbol visibility. If codegen-units=1 does help then the mystery deepens in my head for what's going on here (PR can land either way of course, I'm just curious now!)
Yes it does! (We benchmarked each option -- @dundargoc could you edit the PR description to add the individual steps?)
And it's _very_ weird -- but then I don't know any other large C project that statically links a Rust dependency (and we have been quite reticent about it im the past).
If LTO has the full connotation for you, that's fine with us of course!
alexcrichton commented on PR #8554:
Odd! Well in any case happy to r+ and flag for merge with a rename.
Otherwise though if y'all have a series of instructions to reproduce this slowdown I'd be happy to try to dig further in my spare time later.
dundargoc commented on PR #8554:
This is the benchmarking we did to determine which of the flags seemed to make a difference:
--release Time (mean ± σ): 132.1 ms ± 3.3 ms [User: 114.5 ms, System: 12.3 ms] Range (min … max): 122.9 ms … 137.0 ms 22 runs lto=true Time (mean ± σ): 83.5 ms ± 1.9 ms [User: 66.3 ms, System: 12.0 ms] Range (min … max): 80.1 ms … 86.8 ms 34 runs lto=true, codegen-units=1 Time (mean ± σ): 80.7 ms ± 1.9 ms [User: 63.6 ms, System: 11.9 ms] Range (min … max): 77.3 ms … 84.3 ms 34 runs lto=true, codegen-units=1, strip="none" Time (mean ± σ): 80.9 ms ± 2.0 ms [User: 64.0 ms, System: 12.0 ms] Range (min … max): 77.1 ms … 84.7 ms 34 runs lto=true, codegen-units=1, strip="none", panic="abort" Time (mean ± σ): 52.0 ms ± 1.2 ms [User: 35.2 ms, System: 11.8 ms] Range (min … max): 49.9 ms … 56.4 ms 51 runs lto=true, codegen-units=1, strip="none", panic="abort", incremental=false Time (mean ± σ): 52.9 ms ± 1.1 ms [User: 35.4 ms, System: 12.0 ms] Range (min … max): 50.4 ms … 54.9 ms 52 runs
So from this we were able to see that
lto
andpanic
makes the biggest difference, and thatcodegen-units=1
does make a small yet noticable difference as well. Note that @clason noticed a bigger difference forcodegen-units=1
on his intel macos, so it could be platform-dependent as well?
As for the name, I am fine with whatever.
lto
is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was calledfast-runtime
. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.
dundargoc edited a comment on PR #8554:
This is the benchmarking we did to determine which of the flags seemed to make a difference:
--release Time (mean ± σ): 132.1 ms ± 3.3 ms [User: 114.5 ms, System: 12.3 ms] Range (min … max): 122.9 ms … 137.0 ms 22 runs lto=true Time (mean ± σ): 83.5 ms ± 1.9 ms [User: 66.3 ms, System: 12.0 ms] Range (min … max): 80.1 ms … 86.8 ms 34 runs lto=true, codegen-units=1 Time (mean ± σ): 80.7 ms ± 1.9 ms [User: 63.6 ms, System: 11.9 ms] Range (min … max): 77.3 ms … 84.3 ms 34 runs lto=true, codegen-units=1, strip="none" Time (mean ± σ): 80.9 ms ± 2.0 ms [User: 64.0 ms, System: 12.0 ms] Range (min … max): 77.1 ms … 84.7 ms 34 runs lto=true, codegen-units=1, strip="none", panic="abort" Time (mean ± σ): 52.0 ms ± 1.2 ms [User: 35.2 ms, System: 11.8 ms] Range (min … max): 49.9 ms … 56.4 ms 51 runs lto=true, codegen-units=1, strip="none", panic="abort", incremental=false Time (mean ± σ): 52.9 ms ± 1.1 ms [User: 35.4 ms, System: 12.0 ms] Range (min … max): 50.4 ms … 54.9 ms 52 runs
So from this we were able to see that
lto
andpanic
makes the biggest difference, and thatcodegen-units=1
does make a small yet noticable difference as well. Note that @clason noticed a bigger difference forcodegen-units=1
on his intel macos, so it could be platform-dependent as well?
As for the profile name, I am fine with whatever.
lto
is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was calledfast-runtime
. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.
dundargoc edited a comment on PR #8554:
This is the benchmarking we did to determine which of the flags seemed to make a difference:
--release Time (mean ± σ): 132.1 ms ± 3.3 ms [User: 114.5 ms, System: 12.3 ms] Range (min … max): 122.9 ms … 137.0 ms 22 runs lto=true Time (mean ± σ): 83.5 ms ± 1.9 ms [User: 66.3 ms, System: 12.0 ms] Range (min … max): 80.1 ms … 86.8 ms 34 runs lto=true, codegen-units=1 Time (mean ± σ): 80.7 ms ± 1.9 ms [User: 63.6 ms, System: 11.9 ms] Range (min … max): 77.3 ms … 84.3 ms 34 runs lto=true, codegen-units=1, strip="none" Time (mean ± σ): 80.9 ms ± 2.0 ms [User: 64.0 ms, System: 12.0 ms] Range (min … max): 77.1 ms … 84.7 ms 34 runs lto=true, codegen-units=1, strip="none", panic="abort" Time (mean ± σ): 52.0 ms ± 1.2 ms [User: 35.2 ms, System: 11.8 ms] Range (min … max): 49.9 ms … 56.4 ms 51 runs lto=true, codegen-units=1, strip="none", panic="abort", incremental=false Time (mean ± σ): 52.9 ms ± 1.1 ms [User: 35.4 ms, System: 12.0 ms] Range (min … max): 50.4 ms … 54.9 ms 52 runs
So from this we were able to see that
lto
andpanic
makes the biggest difference, and thatcodegen-units=1
does make a small yet noticable difference as well. Note that @clason noticed a bigger difference forcodegen-units=1
on his intel macos, so it could be platform-dependent as well.
As for the profile name, I am fine with whatever.
lto
is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was calledfast-runtime
. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.
dundargoc commented on PR #8554:
Oh, right. Reproduction steps. We build neovim in this PR with the following command:
make distclean #just to remove old builds make CMAKE_EXTRA_FLAGS=-DENABLE_WASMTIME=ON DEPS_CMAKE_FLAGS=-DENABLE_WASMTIME=ON CMAKE_BUILD_TYPE=Release cmake --install build --prefix bin hyperfine --warmup 2 "bin/bin/nvim +'quit'"
I point to a local path of wasmtime in
cmake.deps/deps.txt
so it saysWASMTIME_URL /Users/dundargoc/programs/wasmtime
It's not the most graceful reproductions steps I'm afraid but that's the current tech we're working with at the moment :sweat_smile:
alexcrichton commented on PR #8554:
"fast-runtime" sounds reasonable to me, although I might bikeshed to "fastest-runtime" perhaps to help head off confusion between the default
--release
and "fast-runtime" where they're both supposed to be fast
alexcrichton commented on PR #8554:
And no worries! I'll try to kick the tires there and see if I can't bottom something out as to how Wasmtime is affecting things
dundargoc updated PR #8554.
Last updated: Dec 23 2024 at 12:05 UTC