Stream: wasmtime

Topic: Performance regression since rust 1.65


view this post on Zulip Dennis Ranke (Jun 09 2024 at 18:31):

I am the developer of MicroW8, a simple WASM-based fantasy console mostly used for sizecoding (for example, creating visual effects in 256 bytes of code). It uses wasmtime for the native runner.

Recently, I got reports that recent releases have much worse performance on Windows than releases built two years ago. I'm still trying to reduce this to a minimal test case, but from what I can see, something changed going from Rust 1.64 to 1.65, dropping wasmtime's performance on Windows by about 30%.

You can find my ongoing test here: https://github.com/exoticorn/wt-regression

For me, I get the following results (on commit 9197048):

$ cargo +1.64 bench

technotunnel_upd        time:   [3.9256 ms 3.9273 ms 3.9291 ms]

$ cargo +1.65 bench

technotunnel_upd        time:   [5.1347 ms 5.1365 ms 5.1385 ms]
                        change: [+30.715% +30.791% +30.869%] (p = 0.00 < 0.05)

I didn't see a performance regression on Linux. But on Windows, the performance drop appears to be the same with newer versions of both wasmtime and Rust.

Does anyone have any idea about this? Maybe there is a simple tweak somewhere I'm missing to regain the lost performance?


view this post on Zulip Dennis Ranke (Jun 09 2024 at 21:22):

Ok, I simplified the test a bit further. I could still make the actual WASM code smaller, but it seems very much the case that calls to the sin function defined with linker.func_wrap("env", "sin", |v: f32| v)?; (yes, that's no longer an actual sin call in the performance test) have gotten a lot slower with Rust 1.65 on Windows.

Would a function defined with func_new or func_new_unchecked be likely faster?

view this post on Zulip Dennis Ranke (Jun 10 2024 at 07:30):

While I didn't notice any regression in the linux builds of the original application, now that I'm running the performance test on linux, I also see a bit of a slowdown with rust 1.65.

So the call overhead to those provided functions probably increased across platforms.

view this post on Zulip Alex Crichton (Jun 10 2024 at 15:45):

Thanks for all the information here! Would you be up for filing an issue on the Wasmtime repo for this? This looks like something to bottom out and I want to make sure we don't lose track of it

view this post on Zulip Alex Crichton (Jun 10 2024 at 15:48):

To dig in a bit more though:

And you're saying that the env::sin function there is what's slower? (e.g. even without v.sin(), it's slower as a noop?)

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:03):

I also notice that this is using wasmtime 0.37, if you update wasmtime do you see the same performance regression? That would require using a newer rustc, however

view this post on Zulip fitzgen (he/him) (Jun 10 2024 at 17:10):

also: are you recompiling the wasm and the wasmtime with the different rustc versions, or just wasmtime?

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:11):

looking at the benchmarking repo, it looks like the rustc version is all that's changing; it's the same wasm and the same wasmtime, and the wasm looks like it's coming from a custom language

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:11):

I was able to reproduce locally on Windows but wasn't able to figure out how to use a profiler so couldn't go further

view this post on Zulip fitzgen (he/him) (Jun 10 2024 at 17:11):

the same wasm as in not being recompiled, and is literally the same wasm binary?

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:12):

correct

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:12):

the only variable was rustc

view this post on Zulip fitzgen (he/him) (Jun 10 2024 at 17:12):

and recompiling wasmtime with that rustc right? (while leaving the wasm binary as-is)

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:15):

yeah

view this post on Zulip Dennis Ranke (Jun 10 2024 at 17:15):

Yes, the wasm is an unchanged binary blob. I'm using wasmtime 0.37 to compare the old Rust versions. With MicroW8, newer rustc and wasmtime versions did not make a measurable performance difference, but I will do an updated version of the benchmark to verify.

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:15):

nah that's ok, I was just curious; it didn't look like much of significance in this area had changed between Wasmtime 0.37 and Wasmtime 21

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:15):

Wasmtime 22 (only a release branch not published yet) may be worth testing though since the host ABI is changing

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:16):

but that'll just behave similarly to func_new_unchecked in terms of performance

view this post on Zulip Dennis Ranke (Jun 10 2024 at 17:16):

I'll also file an issue soon, I just wanted to see whether this is already known while I collect a little more info.

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:19):

Not known yet as far as I know, but if you can I'd also test with the release-22.0.0 branch of Wasmtime and see if that makes a difference

view this post on Zulip Alex Crichton (Jun 10 2024 at 17:20):

regardless this is probably a rustc/llvm thing and the best we can do in Wasmtime is figure out what's going wrong and tweak #[inline] and/or code structure, we probably can't fix the underlying issue (but I could be wrong too!)

view this post on Zulip Dennis Ranke (Jun 10 2024 at 21:15):

I was on my mobile phone before, so I might flesh out a few replies now:

Alex Crichton said:

And you're saying that the env::sin function there is what's slower? (e.g. even without v.sin(), it's slower as a noop?)

The current version of the benchmark runs two different wasm binaries: technotunnel.wasm and technotunnel_nosin.wasm. The former shows a performance regression while the latter doesn't. The only difference between the two is that the latter locally defines

fn sin(v: f32) -> f32 {
    v
}

in WASM instead of importing the same function defined in Rust. I put the source in the repository as a reference; it is indeed a custom language which essentially maps 1:1 to wasm instructions (https://github.com/exoticorn/curlywas).

And I removed the actual call to v.sin() on the rust side and just returned v to make sure it's not that math function that's gotten slower.


view this post on Zulip Dennis Ranke (Jun 10 2024 at 21:24):

I added a small wasm binary to the benchmark which just calls the identity env.sin in a loop and sums up the return values. This is obviously very artificial, but it shows the changed performance much more clearly. I also quickly updated to wasmtime 21.0.1 in the latest branch.
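The actual wasm binary lives in the benchmark repository; purely for illustration, a hypothetical WAT sketch of that kind of loop (names, signature, and iteration count invented here) could look like:

```wat
;; Hypothetical sketch, not the actual simple_loop.wasm: call the
;; imported identity "sin" in a tight loop and sum the return values,
;; so almost all time is spent in wasm->host call overhead.
(module
  (import "env" "sin" (func $sin (param f32) (result f32)))
  (func (export "upd") (result f32)
    (local $i i32)
    (local $acc f32)
    (loop $again
      (local.set $acc
        (f32.add (local.get $acc)
                 (call $sin (f32.convert_i32_s (local.get $i)))))
      (local.set $i (i32.add (local.get $i) (i32.const 1)))
      (br_if $again (i32.lt_s (local.get $i) (i32.const 1000000))))
    (local.get $acc)))
```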

On this windows pc, I got the following results:

wasmtime 0.37, rust 1.64:

technotunnel_upd        time:   [2.5457 ms 2.5474 ms 2.5491 ms]

technotunnel_nosin_upd  time:   [2.7231 ms 2.7244 ms 2.7261 ms]

simple_loop_upd         time:   [537.40 µs 537.78 µs 538.22 µs]

wasmtime 0.37, rust 1.65:

technotunnel_upd        time:   [3.4619 ms 3.4669 ms 3.4729 ms]
                        change: [+35.869% +36.093% +36.352%] (p = 0.00 < 0.05)

technotunnel_nosin_upd  time:   [2.7043 ms 2.7063 ms 2.7083 ms]
                        change: [-0.7590% -0.6630% -0.5724%] (p = 0.00 < 0.05)

simple_loop_upd         time:   [1.0879 ms 1.0891 ms 1.0900 ms]
                        change: [+101.98% +102.69% +103.61%] (p = 0.00 < 0.05)

wasmtime 0.37, rust 1.78:

technotunnel_upd        time:   [3.7467 ms 3.7499 ms 3.7528 ms]
                        change: [+7.9532% +8.1630% +8.3516%] (p = 0.00 < 0.05)

technotunnel_nosin_upd  time:   [2.7039 ms 2.7059 ms 2.7080 ms]
                        change: [-0.1189% -0.0136% +0.0930%] (p = 0.81 > 0.05)

simple_loop_upd         time:   [1.4500 ms 1.4505 ms 1.4510 ms]
                        change: [+32.539% +33.312% +34.042%] (p = 0.00 < 0.05)

wasmtime 21.0.1, rust 1.78

technotunnel_upd        time:   [3.8879 ms 3.8960 ms 3.9051 ms]
                        change: [+3.6712% +3.8957% +4.1466%] (p = 0.00 < 0.05)

technotunnel_nosin_upd  time:   [1.9664 ms 1.9669 ms 1.9676 ms]
                        change: [-27.352% -27.286% -27.221%] (p = 0.00 < 0.05)

simple_loop_upd         time:   [1.7066 ms 1.7082 ms 1.7097 ms]
                        change: [+17.088% +17.555% +17.855%] (p = 0.00 < 0.05)

So for the tests that are calling the imported env.sin function, each of these steps was a performance regression. Overall, the simple loop is more than 3x slower with the latest versions than on rust 1.64 + wasmtime 0.37. (I'll try out wasmtime 22 tomorrow)

For the test without import, changing rustc versions had no influence, while the wasmtime update improved performance.

view this post on Zulip Alex Crichton (Jun 10 2024 at 21:33):

wow interesting!

view this post on Zulip Alex Crichton (Jun 10 2024 at 21:33):

And unexpected... Thanks for gathering the data though!

view this post on Zulip Dennis Ranke (Jun 10 2024 at 21:41):

I just ran the benchmarks on linux on the same pc and got very similar numbers, so it's probably not platform dependent. (Maybe CPU dependent, if it is indeed caused by changes in LLVM (rust 1.65 did update to a newer LLVM version).)

view this post on Zulip Alex Crichton (Jun 10 2024 at 21:42):

oh nice, if it reproduces on Linux that'll make it much easier to investigate on our end

view this post on Zulip Dennis Ranke (Jun 11 2024 at 21:57):

I have simplified the benchmark some more, putting the wasm code right into the rust source. While doing so I noticed that there was no regression when calling a (i32) -> i32 function like there was for a (f32) -> f32 function.

wasmtime 22.0.0 gains back a little bit of performance compared to 21.0.1, but not that much.

Another round of measurements with the latest benchmark, this time on a linux pc:

wasmtime 0.37, rust 1.64

calling function defined in wasm
                        time:   [2.2450 ms 2.2456 ms 2.2462 ms]

calling function defined in rust
                        time:   [3.1960 ms 3.1972 ms 3.1991 ms]

Back then, there was not too much overhead calling the function defined in rust vs. defined in wasm.

wasmtime 0.37, rust 1.78

calling function defined in rust
                        time:   [7.1093 ms 7.1108 ms 7.1125 ms]
                        change: [+122.27% +122.40% +122.51%] (p = 0.00 < 0.05)

wasmtime 21.0.1, rust 1.78 (branch latest)

calling function defined in rust
                        time:   [11.948 ms 11.953 ms 11.959 ms]
                        change: [+68.017% +68.102% +68.198%] (p = 0.00 < 0.05)

wasmtime 22.0.0, rust 1.78 (branch 22.0)

calling function defined in rust
                        time:   [9.4459 ms 9.4482 ms 9.4510 ms]
                        change: [-21.000% -20.957% -20.916%] (p = 0.00 < 0.05)

(the wasm->wasm performance was constant over all versions)

view this post on Zulip Pat Hickey (Jun 11 2024 at 22:27):

I have no real insight into this, but I did find that the transition from Rust 1.64 -> 1.65 was when LLVM 15 was introduced. I'd be suspicious that the change is down in there somewhere, but I haven't looked deeper. https://github.com/rust-lang/rust/blob/stable/RELEASES.md#version-1650-2022-11-03


view this post on Zulip Alex Crichton (Jun 11 2024 at 22:55):

I was doing some investigation of this last night (sorry, should have posted earlier) and there are a huge number of inefficiencies we've introduced in the wasm->host call path since 0.37; the hot functions are littered with dozens of indirect calls which all end up being noops. There's definitely work to do in wasmtime orthogonal to LLVM.

I also suspect that the LLVM upgrade is what "broke" the performance here originally. My best guess is that it has to do with the catch_unwind as there seemed to be more stack movement somewhat related to that. Optimizing that is theoretically possible in Rust but will require the c_unwind feature to stabilize, and that's actually looking like it'll stabilize in the next nightly or so of Rust

view this post on Zulip Alex Crichton (Jun 11 2024 at 22:56):

but cc @fitzgen (he/him) since in profiling I saw that the GC hooks were a huge source of slowdowns. I compared Rust 1.64.0/1.65.0 with Wasmtime 0.37 to Rust 1.78 with Wasmtime 22 (on main) and adding a single #[cfg(feature = "gc")] reduced the slowdown by 150%: Rust 1.65 was 30% slower than Rust 1.64, Wasmtime 22 was ~200% slower than Rust 1.64, and Wasmtime 22 with a small patch was ~40% slower than Rust 1.64.
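To illustrate the kind of change involved (a generic, hypothetical sketch in plain Rust, not Wasmtime's actual code): putting optional bookkeeping behind a Cargo feature lets the hot call path compile down to nothing when the feature is off.

```rust
// Hypothetical sketch of feature-gating optional GC bookkeeping, so
// that with the `gc` feature disabled the hot wasm->host call path
// contains no trace of it at all.
#[cfg(feature = "gc")]
fn gc_hook(calls: &mut u64) {
    // stand-in for expensive GC-related bookkeeping
    *calls += 1;
}

#[inline]
fn call_host<F: Fn(f32) -> f32>(host: F, v: f32, _calls: &mut u64) -> f32 {
    // With the feature disabled this cfg block vanishes at compile time.
    #[cfg(feature = "gc")]
    gc_hook(_calls);
    host(v)
}

fn main() {
    let mut calls = 0u64;
    // Identity host function, mirroring the benchmark's "sin".
    assert_eq!(call_host(|v| v, 1.5, &mut calls), 1.5);
}
```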

view this post on Zulip Alex Crichton (Jun 11 2024 at 22:57):

basically I don't think we've been studiously watching the perf of wasm->host calls and we've got a lot of trait objects which can't be optimized out, especially GC-related ones, and in this case there were no GC bits anyway, so it's sort of a bummer it was causing a slowdown (to recover perf I was disabling the whole gc feature at compile time, which isn't a great solution)

view this post on Zulip Alex Crichton (Jun 11 2024 at 22:57):

I also wanted to open a more official issue for this but... busy!

view this post on Zulip fitzgen (he/him) (Jun 11 2024 at 23:15):

hm I'm not totally surprised, but also thought we had been trying to avoid these things except for when actually using GC stuff. happy to work on tweaks and brainstorm designs when we have a concrete example of a call/branch/path that was fast before but slow now.

Unfortunately, I similarly don't have the cycles to do top-level investigations to chase down and discover those concrete things myself at the moment

view this post on Zulip Alex Crichton (Jun 12 2024 at 04:18):

With a combination of https://github.com/bytecodealliance/wasmtime/pull/8779, https://github.com/bytecodealliance/wasmtime/pull/8780, https://github.com/bytecodealliance/wasmtime/pull/8778, and disabling the gc of the wasmtime crate I'm currently seeing ~10% better performance than Rust 1.64 with wasmtime 0.37.

With this diff

diff --git a/crates/wasmtime/src/runtime/store.rs b/crates/wasmtime/src/runtime/store.rs
index e66f383251..72a929e10d 100644
--- a/crates/wasmtime/src/runtime/store.rs
+++ b/crates/wasmtime/src/runtime/store.rs
@@ -1125,13 +1125,14 @@ impl<T> StoreInner<T> {

     #[inline]
     pub fn call_hook(&mut self, s: CallHook) -> Result<()> {
-        if self.inner.pkey.is_none() && self.call_hook.is_none() {
-            Ok(())
-        } else {
-            self.call_hook_slow_path(s)
-        }
+        // if self.inner.pkey.is_none() && self.call_hook.is_none() {
+        Ok(())
+        // } else {
+        //     self.call_hook_slow_path(s)
+        // }
     }

+    #[inline(never)]
     fn call_hook_slow_path(&mut self, s: CallHook) -> Result<()> {
         if let Some(pkey) = &self.inner.pkey {
             let allocator = self.engine().allocator();

the number improves to 50% faster than before.

This final diff is possible to apply but a bit more difficult. @Dennis Ranke is it reasonable to disable the gc feature for you? Also is on-par performance the goal here or were you looking to optimize the wasm->host path even further?

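Distilled into a standalone, hypothetical sketch (invented types, not Wasmtime's real Store), the pattern the diff above is toggling looks like this: keep the cheap check inlined at every call site, and outline the rare slow path so it doesn't pollute the hot wasm->host path.

```rust
// Generic sketch of the inlined-fast-path / outlined-slow-path split.
struct Store {
    pkey: Option<u32>,
    call_hook: Option<Box<dyn Fn() -> Result<(), String>>>,
}

impl Store {
    #[inline]
    fn call_hook(&mut self) -> Result<(), String> {
        if self.pkey.is_none() && self.call_hook.is_none() {
            Ok(()) // hot path: two cheap branches, no function call
        } else {
            self.call_hook_slow_path()
        }
    }

    // Kept out of line so the rare case doesn't bloat every call site.
    #[inline(never)]
    fn call_hook_slow_path(&mut self) -> Result<(), String> {
        if let Some(hook) = &self.call_hook {
            hook()?;
        }
        Ok(())
    }
}

fn main() {
    let mut store = Store { pkey: None, call_hook: None };
    assert!(store.call_hook().is_ok());
}
```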

view this post on Zulip bjorn3 (Jun 12 2024 at 07:13):

Optimizing that is theoretically possible in Rust but will require the c_unwind feature to stabilize, and that's actually looking like it'll stabilize in the next nightly or so of Rust

extern "C-unwind" has been stable for a while now. What isn't stable yet is the behavior of unwinding out of an extern "C" function aborting the process.

view this post on Zulip Dennis Ranke (Jun 12 2024 at 07:39):

Oh, that sounds pretty good, thanks for looking into it!

Disabling gc is totally fine for my use-case. I'll try your patches myself this evening.

On-par performance sounds good enough for me. This use-case is A) the demo scene (i.e. realtime animated graphical effects), so people are very sensitive to performance regressions, as they impact how smooth their effects run, but B) it's the sizecoding niche inside the demo scene, so people will happily inline complex expressions with many trigonometry host calls into their inner loop to save the 4 bytes of a local.set, local.get pair.

All this to say: This is a very niche use-case in the WASM world, and while we appreciate any performance wins we can get, this is maybe not enough motivation for more difficult changes that gain a few more percent.

view this post on Zulip Alex Crichton (Jun 12 2024 at 18:13):

@bjorn3 true! I was seeing unwind edges and cleanups which in theory shouldn't be there, so I was considering using extern "C" as a "hey llvm, this function won't unwind, don't try to add cleanups around this" for internal hooks in wasmtime that should never unwind (e.g. ones that weren't getting inlined). So yeah, I was needing the abort-on-unwind behavior of extern "C", which is gated behind c_unwind (or more specifically, I was interested in nounwind in the LLVM IR)

view this post on Zulip Alex Crichton (Jun 12 2024 at 18:13):

@Dennis Ranke the patches have landed on main now so if you can kick the tires there and see how it is that should work well! (while ensuring to disable the gc feature)

view this post on Zulip Alex Crichton (Jun 12 2024 at 18:15):

I should also clarify I didn't actually get to the point of testing nounwind and its effects on perf, it was just a theory I had and couldn't bottom out

view this post on Zulip Dennis Ranke (Jun 13 2024 at 07:54):

@Alex Crichton I just tried main without gc. Did you mean 10% faster than wasmtime 0.37 with Rust 1.64 or 1.65? I'm getting 10% faster than 1.65, which is already really cool, but still some way off the 1.64 performance.

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:47):

Just confirmed, but locally on Linux I see the following when comparing Rust 1.64 + Wasmtime 0.37 against Rust 1.78 + Wasmtime main:

I was only looking at technotunnel_upd numbers the other night, is your regression based on simple_loop_upd?

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:49):

Er apologies the numbers I just gave were with wasmtime with default features enabled, after disabling default features I get

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:50):

You mentioned earlier you're on Windows so this could also be a Windows/Linux difference

view this post on Zulip Dennis Ranke (Jun 13 2024 at 14:50):

Essentially, yes. I simplified the benchmark further, which only kept the simple_loop version.

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:51):

But to clarify locally I'm measuring a speedup in simple_loop_upd on Rust 1.78 with the main branch relative to Rust 1.64

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:52):

I can try to poke around on Windows later and see if I can reproduce Rust 1.64 still being slower than main

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:52):

er, other way around, main still being slower than Rust 1.64

view this post on Zulip Dennis Ranke (Jun 13 2024 at 14:55):

I can recheck with the older benchmark. I did run the performance test on Linux, too, though. (I'm not actually much of a Windows user myself, after I got the reports of regressions on Windows, I first had to spend (what felt like) multiple hours to update my windows install since I hadn't booted into it in more than a year... ;) )

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:56):

Heh I know what you mean :)

view this post on Zulip Dennis Ranke (Jun 13 2024 at 14:56):

Does this dependency line look correct?

wasmtime = { git = "https://github.com/bytecodealliance/wasmtime", branch = "main", default-features = false, features = ["wat", "cranelift", "runtime"] }

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:57):

To clarify what I did I'm on 4dc3dc13871f7a3b66bb6732c4f381e0aa876410 of the wt-regression repository. I first ran cargo +1.64.0 bench --bench wasmtime_benchmark -- --save-baseline original. Then I applied this diff

diff --git a/Cargo.toml b/Cargo.toml
index db47927..2b1c0ca 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -5,7 +5,8 @@ edition = "2021"

 [dependencies]
 criterion = { version = "0.3.0", default-features = false }
-wasmtime = "0.37.0"
+#wasmtime = "0.37.0"
+wasmtime = { path = '../wasmtime/crates/wasmtime', default-features = false, features = ['cranelift', 'runtime'] }
 anyhow = "1"

 [[bench]]
diff --git a/benches/wasmtime_benchmark.rs b/benches/wasmtime_benchmark.rs
index 024854c..5535ce9 100644
--- a/benches/wasmtime_benchmark.rs
+++ b/benches/wasmtime_benchmark.rs
@@ -6,6 +6,7 @@ fn benchmark_frame(c: &mut Criterion) {
     fn inner(c: &mut Criterion, wasm: &[u8], id: &str) -> Result<()> {
         let mut config = wasmtime::Config::new();
         config.cranelift_opt_level(wasmtime::OptLevel::Speed);
+        // config.profiler(wasmtime::ProfilingStrategy::JitDump);
         let engine = wasmtime::Engine::new(&config)?;

         let mut store = wasmtime::Store::new(&engine, ());
@@ -13,7 +14,7 @@ fn benchmark_frame(c: &mut Criterion) {
         let memory = wasmtime::Memory::new(&mut store, MemoryType::new(4, Some(4)))?;

         let mut linker = wasmtime::Linker::new(&engine);
-        linker.define("env", "memory", memory)?;
+        linker.define(&mut store, "env", "memory", memory)?;

         let module = wasmtime::Module::new(&engine, wasm)?;

@@ -21,7 +22,7 @@ fn benchmark_frame(c: &mut Criterion) {

         let instance = linker.instantiate(&mut store, &module)?;

-        let update = instance.get_typed_func::<(), (), _>(&mut store, "upd")?;
+        let update = instance.get_typed_func::<(), ()>(&mut store, "upd")?;

         c.bench_function(id, |b| {
             b.iter(|| {
@@ -32,7 +33,12 @@ fn benchmark_frame(c: &mut Criterion) {
         Ok(())
     }
     inner(c, include_bytes!("technotunnel.wasm"), "technotunnel_upd").unwrap();
-    inner(c, include_bytes!("technotunnel_nosin.wasm"), "technotunnel_nosin_upd").unwrap();
+    inner(
+        c,
+        include_bytes!("technotunnel_nosin.wasm"),
+        "technotunnel_nosin_upd",
+    )
+    .unwrap();
     inner(c, include_bytes!("simple_loop.wasm"), "simple_loop_upd").unwrap();
 }

and ran cargo bench --bench wasmtime_benchmark -- --baseline original

view this post on Zulip Alex Crichton (Jun 13 2024 at 14:57):

Yes that features-line looks correct

view this post on Zulip Alex Crichton (Jun 13 2024 at 15:21):

Actually with https://github.com/bytecodealliance/wasmtime/pull/8794 I'm seeing a 36% improvement on main relative to 1.64, an improvement over the 3% above


view this post on Zulip Dennis Ranke (Jun 13 2024 at 16:04):

Hm, I followed the same steps, just with a git = "https://github.com/bytecodealliance/wasmtime.git" dependency instead of the path one + I had to do a cargo update. With that I got the result that simple_loop_upd regressed by 64.694%.

view this post on Zulip Alex Crichton (Jun 13 2024 at 16:05):

Mind pushing up the state of the repo to the wt-regression repo?

view this post on Zulip Alex Crichton (Jun 13 2024 at 16:05):

That way we can compare two commits of wt-regression and try to make sure we're all sync'd up

view this post on Zulip Dennis Ranke (Jun 13 2024 at 16:12):

sure, one second, I'll re-run it here.

view this post on Zulip Dennis Ranke (Jun 13 2024 at 16:15):

Branch name is simple-loop-main, but I'll just use commit ids below:

$ git checkout 4dc3dc13871f7a3b66bb6732c4f381e0aa876410
$ cargo +1.64 bench --bench wasmtime_benchmark -- --save-baseline original
$ git checkout 529af46497eb61787dd4204577831b2169efe486
$ cargo bench --bench wasmtime_benchmark -- --baseline original

simple_loop_upd         time:   [636.11 µs 636.17 µs 636.27 µs]
                        change: [+69.564% +69.720% +69.850%] (p = 0.00 < 0.05)
                        Performance has regressed.

view this post on Zulip Alex Crichton (Jun 13 2024 at 16:16):

nice, I'll poke around later

view this post on Zulip Alex Crichton (Jun 13 2024 at 17:32):

with those instructions I get an 87% slowdown on simple_loop_upd, meaning main is slower.

Looks like we're catching things at the wrong time though:

Can you try re-updating to current main? (which is C, one commit after the one you're using). Basically this is all boiling down to Store::call_hook and the performance impact of that feature. It may point towards needing a Cargo feature to disable it (but on-by-default like gc)

view this post on Zulip Alex Crichton (Jun 13 2024 at 17:34):

(sorry sent a message soon so I just updated my prior message)

view this post on Zulip Dennis Ranke (Jun 13 2024 at 18:21):

I can confirm that B (9f29c6e92629a8552f57fa6b2cec1371bc34f9e8) is indeed a little faster than 1.64, which is amazing! However, C (515fa4de012591e4edbc2548870ad377d0b5f3bc) is still 64% slower than 1.64 for me.

I updated the branch to C.

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:24):

Interesting! Using d58b3b730c4e2ae71ee708304a2574a97f770e1f which is the latest commit of the wt-regression repository I see a 36% improvement over Rust 1.64, and looks like you're seeing a 64% slowdown. My guess is it may have to do with CPU differences? Are you on x64 or arm64?

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:25):

Also you're testing on Linux, not macos?

view this post on Zulip Dennis Ranke (Jun 13 2024 at 19:26):

Yes, this is all on Linux.

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:26):

x64?

view this post on Zulip Dennis Ranke (Jun 13 2024 at 19:26):

yes

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:27):

what cpu?

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:27):

I'm testing on Intel(R) Core(TM) i9-14900K myself

view this post on Zulip Dennis Ranke (Jun 13 2024 at 19:27):

AMD Ryzen 7 7700

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:27):

mk I think this may just be cpu differences then and how good branch predictors and caches and such are

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:28):

nevertheless I'll land something on main to remove all the call hook things

view this post on Zulip Dennis Ranke (Jun 13 2024 at 19:37):

Oh, look at that: on an i5-9400F I indeed see a 30% improvement over 1.64. Benchmarking is hard.

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:39):

with https://github.com/bytecodealliance/wasmtime/pull/8795 the AMD performance should improve, although I'm not sure if it will return to being on par with 1.64


view this post on Zulip Alex Crichton (Jun 13 2024 at 19:39):

it's basically another crate feature for you to disable, but with default-features = false you'll already have it disabled

view this post on Zulip Dennis Ranke (Jun 13 2024 at 19:44):

It's -33.333% on AMD with that commit!

This is really amazing, given that I was fully prepared to accept "It's some change in LLVM, there's not much we can do." :)

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:45):

nice!

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:45):

sometimes a change in a different part of the system is the final motivation you need to change parts of your own system :)

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:46):

this helps fix a lot of cruft which accidentally snuck in on this path over the releases

view this post on Zulip Alex Crichton (Jun 13 2024 at 19:46):

so great fixes to have regardless

view this post on Zulip Alex Crichton (Jun 16 2024 at 21:59):

One thing I'll note on this as well now: after some follow-up fixes, it's no longer necessary to disable default features to get the perf improvements here; they're now present in builds by default


Last updated: Nov 22 2024 at 16:03 UTC