I am the developer of a simple fantasy console called MicroW8, based on WASM and mostly used for sizecoding (for example creating visual effects in 256 bytes of code), which uses wasmtime for the native runner.
Recently, I got reports that recent releases have much worse performance on Windows than releases built two years ago. I'm still trying to reduce this to a minimal test case, but from what I can see, something changed going from rust 1.64 to 1.65, dropping the wasmtime performance on Windows by about 30%.
You can find my ongoing test here: https://github.com/exoticorn/wt-regression
For me, I get the following results (on commit 9197048):
$ cargo +1.64 bench
technotunnel_upd time: [3.9256 ms 3.9273 ms 3.9291 ms]
$ cargo +1.65 bench
technotunnel_upd time: [5.1347 ms 5.1365 ms 5.1385 ms]
change: [+30.715% +30.791% +30.869%] (p = 0.00 < 0.05)
I didn't see a performance regression on Linux. But on Windows, the performance drop appears to be the same with newer versions of both wasmtime and rust.
Does anyone have any idea about this? Maybe there is a simple tweak somewhere I'm missing to regain the lost performance?
Ok, I simplified the test a bit further. I could still make the actual WASM code smaller, but it seems very much the case that calls to the `sin` function defined with `linker.func_wrap("env", "sin", |v: f32| v)?;` (yes, that's no longer a `sin` call in the performance test) have gotten a lot slower with rust 1.65 on windows.
Would a function defined with `func_new` or `func_new_unchecked` likely be faster?
While I didn't notice any regression in the linux builds of the original application, now that I'm running the performance test on linux, I also see a bit of a slowdown with rust 1.65.
So the call overhead to those provided functions probably increased across platforms.
Thanks for all the information here! Would you be up for filing an issue on the Wasmtime repo for this? This looks like something to bottom out and I want to make sure we don't lose track of it
To dig in a bit more though:
`func_new_unchecked` is in theory slower than `func_wrap`, and `func_new` should be even slower. And you're saying that the `env::sin` function there is what's slower? (e.g. even without `v.sin()`, it's slower as a noop?)
I also notice that this is using wasmtime 0.37, if you update wasmtime do you see the same performance regression? That would require using a newer rustc, however
also: are you recompiling the wasm and the wasmtime with the different rustc versions, or just wasmtime?
looking at the benchmarking repo it looks like the rustc version was all that's changing, it's the same wasm and the same wasmtime, and the wasm looks like it's coming from a custom language
I was able to reproduce locally on Windows but wasn't able to figure out how to use a profiler so couldn't go further
the same wasm as in not being recompiled, and is literally the same wasm binary?
correct
the only variable was rustc
and recompiling wasmtime with that rustc right? (while leaving the wasm binary as-is)
yeah
Yes, the wasm is an unchanged binary blob. I'm using wasmtime 0.37 to compare the old rust versions. With MicroW8, newer rustc and wasmtime versions did not make a measurable performance difference, but I will do an updated version of the benchmark to verify.
nah that's ok I was just curious, it didn't look like much in this area had changed from Wasmtime 0.37 to Wasmtime 21
Wasmtime 22 (only a release branch not published yet) may be worth testing though since the host ABI is changing
but that'll just behave similarly to `func_new_unchecked` in terms of performance
I'll also file an issue soon, just wanted to see whether this is already known while I collect a little more info.
Not known yet as far as I know, but if you can I'd also test with the `release-22.0.0` branch of Wasmtime and see if that makes a difference
regardless this is probably a rustc/llvm thing and the best we can do in Wasmtime is figure out what's going wrong and tweak `#[inline]` and/or code structure, we probably can't fix the underlying issue (but I could be wrong too!)
I was on my mobile phone before, so I might flesh out a few replies now:
Alex Crichton said:
And you're saying that the `env::sin` function there is what's slower? (e.g. even without `v.sin()`, it's slower as a noop?)
The current version of the benchmark runs two different wasm binaries: `technotunnel.wasm` and `technotunnel_nosin.wasm`. The former shows a performance regression while the latter doesn't. The only difference between the two is that the latter locally defines
fn sin(v: f32) -> f32 {
v
}
in WASM instead of importing the same function defined in rust. I put the source in the repository as a reference, it is indeed a custom language which essentially maps 1:1 to wasm instructions (https://github.com/exoticorn/curlywas).
And I removed the actual call to `v.sin()` on the rust side and just returned `v` to make sure it's not that math function that's gotten slower.
I added a small wasm binary to the benchmark which just calls the identity `env.sin` in a loop and sums up the return values. This is obviously very artificial, but it shows the change in performance much more clearly. I also quickly updated to wasmtime 21.0.1 in the `latest` branch.
On this windows pc, I got the following results:
wasmtime 0.37, rust 1.64:
technotunnel_upd time: [2.5457 ms 2.5474 ms 2.5491 ms]
technotunnel_nosin_upd time: [2.7231 ms 2.7244 ms 2.7261 ms]
simple_loop_upd time: [537.40 µs 537.78 µs 538.22 µs]
wasmtime 0.37, rust 1.65:
technotunnel_upd time: [3.4619 ms 3.4669 ms 3.4729 ms]
change: [+35.869% +36.093% +36.352%] (p = 0.00 < 0.05)
technotunnel_nosin_upd time: [2.7043 ms 2.7063 ms 2.7083 ms]
change: [-0.7590% -0.6630% -0.5724%] (p = 0.00 < 0.05)
simple_loop_upd time: [1.0879 ms 1.0891 ms 1.0900 ms]
change: [+101.98% +102.69% +103.61%] (p = 0.00 < 0.05)
wasmtime 0.37, rust 1.78:
technotunnel_upd time: [3.7467 ms 3.7499 ms 3.7528 ms]
change: [+7.9532% +8.1630% +8.3516%] (p = 0.00 < 0.05)
technotunnel_nosin_upd time: [2.7039 ms 2.7059 ms 2.7080 ms]
change: [-0.1189% -0.0136% +0.0930%] (p = 0.81 > 0.05)
simple_loop_upd time: [1.4500 ms 1.4505 ms 1.4510 ms]
change: [+32.539% +33.312% +34.042%] (p = 0.00 < 0.05)
wasmtime 21.0.1, rust 1.78
technotunnel_upd time: [3.8879 ms 3.8960 ms 3.9051 ms]
change: [+3.6712% +3.8957% +4.1466%] (p = 0.00 < 0.05)
technotunnel_nosin_upd time: [1.9664 ms 1.9669 ms 1.9676 ms]
change: [-27.352% -27.286% -27.221%] (p = 0.00 < 0.05)
simple_loop_upd time: [1.7066 ms 1.7082 ms 1.7097 ms]
change: [+17.088% +17.555% +17.855%] (p = 0.00 < 0.05)
So for the tests that are calling the imported `env.sin` function, each of these steps was a performance regression. Overall, the simple loop is more than 3x slower with the latest versions than on rust 1.64 + wasmtime 0.37. (I'll try out wasmtime 22 tomorrow)
For the test without import, changing rustc versions had no influence, while the wasmtime update improved performance.
wow interesting!
And unexpected... Thanks for gathering the data though!
I just ran the benchmarks on linux on the same pc and got very similar numbers, so it's probably not platform dependent. (Maybe CPU dependent, if it is indeed caused by changes in LLVM (rust 1.65 did update to a newer LLVM version).)
oh nice, if it reproduces on Linux that'll make it much easier to investigate on our end
I have simplified the benchmark some more, putting the wasm code right into the rust source. While doing so I noticed that there was no regression when calling an `(i32) -> i32` function like there was for an `(f32) -> f32` function.
wasmtime 22.0.0 gains back a little bit of performance compared to 21.0.1, but not that much.
Another round of measurements with the latest benchmark, this time on a linux pc:
wasmtime 0.37, rust 1.64
calling function defined in wasm
time: [2.2450 ms 2.2456 ms 2.2462 ms]
calling function defined in rust
time: [3.1960 ms 3.1972 ms 3.1991 ms]
Back then, there was not too much overhead calling the function defined in rust vs. defined in wasm.
wasmtime 0.37, rust 1.78
calling function defined in rust
time: [7.1093 ms 7.1108 ms 7.1125 ms]
change: [+122.27% +122.40% +122.51%] (p = 0.00 < 0.05)
wasmtime 21.0.1, rust 1.78 (branch `latest`)
calling function defined in rust
time: [11.948 ms 11.953 ms 11.959 ms]
change: [+68.017% +68.102% +68.198%] (p = 0.00 < 0.05)
wasmtime 22.0.0, rust 1.78 (branch `22.0`)
calling function defined in rust
time: [9.4459 ms 9.4482 ms 9.4510 ms]
change: [-21.000% -20.957% -20.916%] (p = 0.00 < 0.05)
(the wasm->wasm performance was constant over all versions)
I have no real insight into this but I did find that the transition from rust 1.64 -> rust 1.65 was when LLVM 15 was introduced. I'd be suspicious that the change is down in there somewhere, but I haven't looked deeper. https://github.com/rust-lang/rust/blob/stable/RELEASES.md#version-1650-2022-11-03
I was doing some investigation of this last night (sorry, should have posted earlier) and there are a huge number of inefficiencies we've introduced in the wasm->host call path since 0.37, the hot functions are littered with dozens of indirect calls which all end up being noops. There's definitely work to do in wasmtime orthogonal to LLVM.
I also suspect that the LLVM upgrade is what "broke" the performance here originally. My best guess is that it has to do with the `catch_unwind` as there seemed to be more stack movement somewhat related to that. Optimizing that is theoretically possible in Rust but will require the `c_unwind` feature to stabilize, and that's actually looking like it'll stabilize in the next nightly or so of Rust
but cc @fitzgen (he/him) since in profiling I saw that the GC hooks were a huge source of slowdowns. I compared Rust 1.64.0/1.65.0 with Wasmtime 0.37 to Rust 1.78 with Wasmtime 22 (on `main`) and adding a single `#[cfg(feature = "gc")]` reduced the slowdown by 150%: Rust 1.65 was 30% slower than Rust 1.64, Wasmtime 22 was ~200% slower than Rust 1.64, and Wasmtime 22 with a small patch was ~40% slower than Rust 1.64.
basically I don't think we've been studiously watching the perf of wasm->host calls and we've got a lot of trait objects which can't be optimized out, especially GC-related ones, and in this case there were no GC bits anyway so it's sort of a bummer it was causing a slowdown (to recover perf I was disabling the whole `gc` feature at compile time which isn't a great solution)
I also wanted to open a more official issue for this but... busy!
hm I'm not totally surprised, but also thought we had been trying to avoid these things except for when actually using GC stuff. happy to work on tweaks and brainstorm designs when we have a concrete example of a call/branch/path that was fast before but slow now.
Unfortunately, I similarly don't have the cycles to do top-level investigations to chase down and discover those concrete things myself at the moment
With a combination of https://github.com/bytecodealliance/wasmtime/pull/8779, https://github.com/bytecodealliance/wasmtime/pull/8780, https://github.com/bytecodealliance/wasmtime/pull/8778, and disabling the `gc` feature of the wasmtime crate I'm currently seeing ~10% better performance than Rust 1.64 with wasmtime 0.37.
With this diff
diff --git a/crates/wasmtime/src/runtime/store.rs b/crates/wasmtime/src/runtime/store.rs
index e66f383251..72a929e10d 100644
--- a/crates/wasmtime/src/runtime/store.rs
+++ b/crates/wasmtime/src/runtime/store.rs
@@ -1125,13 +1125,14 @@ impl<T> StoreInner<T> {
#[inline]
pub fn call_hook(&mut self, s: CallHook) -> Result<()> {
- if self.inner.pkey.is_none() && self.call_hook.is_none() {
- Ok(())
- } else {
- self.call_hook_slow_path(s)
- }
+ // if self.inner.pkey.is_none() && self.call_hook.is_none() {
+ Ok(())
+ // } else {
+ // self.call_hook_slow_path(s)
+ // }
}
+ #[inline(never)]
fn call_hook_slow_path(&mut self, s: CallHook) -> Result<()> {
if let Some(pkey) = &self.inner.pkey {
let allocator = self.engine().allocator();
the number goes down to 50% faster than before.
This final diff is possible to apply but a bit more difficult. @Dennis Ranke is it reasonable to disable the `gc` feature for you? Also is on-par performance the goal here or were you looking to optimize the wasm->host path even further?
Optimizing that is theoretically possible in Rust but will require the `c_unwind` feature to stabilize, and that's actually looking like it'll stabilize in the next nightly or so of Rust

`extern "C-unwind"` has been stable for a while now. What isn't stable yet is the behavior of unwinding out of an `extern "C"` function aborting the process.
Oh, that sounds pretty good, thanks for looking into it!
Disabling `gc` is totally fine for my use-case. I'll try your patches myself this evening.
On-par performance sounds good enough for me. This use-case is A) the demo scene (i.e. realtime animated graphical effects), so people are very sensitive to performance regressions, as it impacts how smooth their effects run, but B) it's the sizecoding niche inside the demo scene, so people will happily inline complex expressions with many trigonometry host calls into their inner loop to save the 4 bytes of a `local.set`, `local.get` pair.
All this to say: This is a very niche use-case in the WASM world, and while we appreciate any performance wins we can get, this is maybe not enough motivation for more difficult changes that gain a few more percent.
@bjorn3 true! I was seeing unwind edges and cleanups which in theory shouldn't be there so I was considering using `extern "C"` for "hey llvm this function won't unwind don't try to add cleanups around this" for internal hooks in wasmtime that should never unwind (e.g. weren't getting inlined), so yeah I was needing the abort-on-unwind behavior of `extern "C"` which is gated behind `c_unwind` (or more specifically I was interested in `nounwind` in the LLVM IR)
@Dennis Ranke the patches have landed on `main` now so if you can kick the tires there and see how it is that should work well! (while ensuring to disable the `gc` feature)
I should also clarify I didn't actually get to the point of testing `nounwind` and its effects on perf, it was just a theory I had and couldn't bottom out
@Alex Crichton I just tried `main` without `gc`. Did you mean 10% faster than wasmtime 0.37 with rust 1.64 or 1.65? I'm getting 10% faster than 1.65, which is already really cool, but still some way off the 1.64 performance.
Just confirmed, but locally on Linux I see the following when comparing Rust 1.64 and Wasmtime 0.37 with Rust 1.78 with Wasmtime `main`:
- `technotunnel_upd` - improved 23%
- `technotunnel_nosin_upd` - improved 50%
- `simple_loop_upd` - regressed 141%

I was only looking at `technotunnel_upd` numbers the other night, is your regression based on `simple_loop_upd`?
Er apologies, the numbers I just gave were with wasmtime with default features enabled; after disabling default features I get:
- `technotunnel_upd` - improved 43%
- `technotunnel_nosin_upd` - improved 52%
- `simple_loop_upd` - improved 3%

You mentioned earlier you're on Windows so this could also be a Windows/Linux difference
Essentially, yes. I simplified the benchmark further, keeping only the simple_loop version.
But to clarify, locally I'm measuring a speedup in `simple_loop_upd` on Rust 1.78 with the `main` branch relative to Rust 1.64
I can try to poke around on Windows later and see if I can reproduce Rust 1.64 still being slower than `main`
er, other way around, `main` still being slower than Rust 1.64
I can recheck with the older benchmark. I did run the performance test on Linux, too, though. (I'm not actually much of a Windows user myself, after I got the reports of regressions on Windows, I first had to spend (what felt like) multiple hours to update my windows install since I hadn't booted into it in more than a year... ;) )
Heh I know what you mean :)
Does this dependency line look correct?
wasmtime = { git = "https://github.com/bytecodealliance/wasmtime", branch = "main", default-features = false, features = ["wat", "cranelift", "runtime"] }
To clarify what I did: I'm on `4dc3dc13871f7a3b66bb6732c4f381e0aa876410` of the `wt-regression` repository. I first ran `cargo +1.64.0 bench --bench wasmtime_benchmark -- --save-baseline original`. Then I applied this diff
diff --git a/Cargo.toml b/Cargo.toml
index db47927..2b1c0ca 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -5,7 +5,8 @@ edition = "2021"
[dependencies]
criterion = { version = "0.3.0", default-features = false }
-wasmtime = "0.37.0"
+#wasmtime = "0.37.0"
+wasmtime = { path = '../wasmtime/crates/wasmtime', default-features = false, features = ['cranelift', 'runtime'] }
anyhow = "1"
[[bench]]
diff --git a/benches/wasmtime_benchmark.rs b/benches/wasmtime_benchmark.rs
index 024854c..5535ce9 100644
--- a/benches/wasmtime_benchmark.rs
+++ b/benches/wasmtime_benchmark.rs
@@ -6,6 +6,7 @@ fn benchmark_frame(c: &mut Criterion) {
fn inner(c: &mut Criterion, wasm: &[u8], id: &str) -> Result<()> {
let mut config = wasmtime::Config::new();
config.cranelift_opt_level(wasmtime::OptLevel::Speed);
+ // config.profiler(wasmtime::ProfilingStrategy::JitDump);
let engine = wasmtime::Engine::new(&config)?;
let mut store = wasmtime::Store::new(&engine, ());
@@ -13,7 +14,7 @@ fn benchmark_frame(c: &mut Criterion) {
let memory = wasmtime::Memory::new(&mut store, MemoryType::new(4, Some(4)))?;
let mut linker = wasmtime::Linker::new(&engine);
- linker.define("env", "memory", memory)?;
+ linker.define(&mut store, "env", "memory", memory)?;
let module = wasmtime::Module::new(&engine, wasm)?;
@@ -21,7 +22,7 @@ fn benchmark_frame(c: &mut Criterion) {
let instance = linker.instantiate(&mut store, &module)?;
- let update = instance.get_typed_func::<(), (), _>(&mut store, "upd")?;
+ let update = instance.get_typed_func::<(), ()>(&mut store, "upd")?;
c.bench_function(id, |b| {
b.iter(|| {
@@ -32,7 +33,12 @@ fn benchmark_frame(c: &mut Criterion) {
Ok(())
}
inner(c, include_bytes!("technotunnel.wasm"), "technotunnel_upd").unwrap();
- inner(c, include_bytes!("technotunnel_nosin.wasm"), "technotunnel_nosin_upd").unwrap();
+ inner(
+ c,
+ include_bytes!("technotunnel_nosin.wasm"),
+ "technotunnel_nosin_upd",
+ )
+ .unwrap();
inner(c, include_bytes!("simple_loop.wasm"), "simple_loop_upd").unwrap();
}
and ran `cargo bench --bench wasmtime_benchmark -- --baseline original`
Yes that features-line looks correct
Actually with https://github.com/bytecodealliance/wasmtime/pull/8794 I'm seeing a 36% improvement on `main` relative to 1.64, an improvement over the 3% above
Hm, I followed the same steps, just with a `git = "https://github.com/bytecodealliance/wasmtime.git"` dependency instead of the `path` one + I had to do a `cargo update`. With that I got the result that `simple_loop_upd` regressed by 64.694%.
Mind pushing up the state of the repo to the wt-regression repo?
That way we can compare two commits of wt-regression and try to make sure we're all sync'd up
sure, one second, I'll re-run it here.
Branch name is `simple-loop-main`, but I'll just use commit ids below:
$ git checkout 4dc3dc13871f7a3b66bb6732c4f381e0aa876410
$ cargo +1.64 bench --bench wasmtime_benchmark -- --save-baseline original
$ git checkout 529af46497eb61787dd4204577831b2169efe486
$ cargo bench --bench wasmtime_benchmark -- --baseline original
simple_loop_upd time: [636.11 µs 636.17 µs 636.27 µs]
change: [+69.564% +69.720% +69.850%] (p = 0.00 < 0.05)
Performance has regressed.
nice, I'll poke around later
with those instructions I get an 87% slowdown on `simple_loop_upd`, meaning `main` is slower.
Looks like we're catching things at the wrong time though:
- A: `c65666c50a231bf4998419f6306e87b1fb18d586`, the commit used by wt-regression, is 87% slower than 1.64
- B: `9f29c6e92629a8552f57fa6b2cec1371bc34f9e8`, the commit before A, is 12% faster than 1.64
- C: `515fa4de012591e4edbc2548870ad377d0b5f3bc`, the commit after A, is 37% faster than 1.64

Can you try re-updating to current `main`? (which is C, one commit after the one you're using). Basically this is all boiling down to `Store::call_hook` and the performance impact of that feature. It may point towards needing a Cargo feature to disable it (but on-by-default like gc)
(sorry sent a message soon so I just updated my prior message)
I can confirm that B (`9f29c6e92629a8552f57fa6b2cec1371bc34f9e8`) is indeed a little faster than 1.64, which is amazing! However, C (`515fa4de012591e4edbc2548870ad377d0b5f3bc`) is still 64% slower than 1.64 for me.
I updated the branch to C.
Interesting! Using `d58b3b730c4e2ae71ee708304a2574a97f770e1f`, which is the latest commit of the wt-regression repository, I see a 36% improvement over Rust 1.64, while it looks like you're seeing a 64% slowdown. My guess is it may have to do with CPU differences? Are you on x64 or arm64?
Also you're testing on Linux, not macos?
Yes, this is all on Linux.
x64?
yes
what cpu?
I'm testing on an Intel(R) Core(TM) i9-14900K myself
AMD Ryzen 7 7700
mk I think this may just be cpu differences then and how good branch predictors and caches and such are
nevertheless I'll land something on `main` to remove all the call hook things
Oh look at that: On a i5-9400F I indeed see a 30% improvement over 1.64. Benchmarking is hard.
with https://github.com/bytecodealliance/wasmtime/pull/8795 the AMD performance should improve, although I'm not sure if it will return to being on-par with 1.64
it's basically another crate feature for you to disable, but with `default-features = false` you'll already have it disabled
It's -33.333% on AMD with that commit!
This is really amazing, given that I was fully prepared to accept "It's some change in LLVM, there's not much we can do." :)
nice!
sometimes a change in a different part of the system is the final motivation you need to change parts of your own system :)
this helps fix a lot of cruft which accidentally snuck in on this path over the releases
so great fixes to have regardless
One thing I'll note on this as well now is that after some follow-up fixes it's no longer necessary to disable default features to get the perf improvements here, they're now present in builds by default
Last updated: Nov 22 2024 at 16:03 UTC