wasmtime / issue #4327 Implement lowered-then-lifted func... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4327 Implement lowered-then-lifted func...

Wasmtime GitHub notifications bot (Jun 27 2022 at 15:41):

I was originally going to include lifted-then-lowered functions in the same component here as well, but that has snowballed quite a lot so I opted to split this out and get this in first.

Wasmtime GitHub notifications bot (Jun 27 2022 at 16:47):

github-actions[bot] commented on issue #4327:

Subscribe to Label Action

cc @peterhuene

<details>
This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

Wasmtime GitHub notifications bot (Aug 17 2022 at 18:28):

cfallin commented on issue #4327:

@alexcrichton I'm doing some historical benchmarking to look at how various features have improved Wasmtime + Cranelift over time, and I found a regression in compile time that I've bisected to this PR.

In particular, compiling spidermonkey.wasm (with wasmtime compile on the command line measured using hyperfine, on a machine with CPU frequency locked to 2.8GHz and turbo-boost disabled):

commit df1502531d437b2105effb990a7caa28796aa985, just prior to this PR's merge: 1.23s

commit c1b3962f7b9fc1c193e1b9709db64c455699a295, this PR's merge: 1.43s, or 16% slower

Is this expected (I see a note above about compiling more trampolines)?

The regression persists to today, so I'm definitely interested in ways we could ameliorate this!

Wasmtime GitHub notifications bot (Aug 17 2022 at 18:34):

alexcrichton commented on issue #4327:

Hm no this PR shouldn't have affected core wasm compilation, I will try to dig in further and see what this caused.

Wasmtime GitHub notifications bot (Aug 17 2022 at 18:50):

alexcrichton commented on issue #4327:

Hm ok I am unable to reproduce this. I'm running:
$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"
where that spidermonkey.wasm just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run the hyperfine command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.

The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.

Wasmtime GitHub notifications bot (Aug 17 2022 at 18:51):

alexcrichton edited a comment on issue #4327:

Hm ok I am unable to reproduce this. I'm running:
$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"
where that spidermonkey.wasm just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run the hyperfine command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.

The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled serially previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.

Wasmtime GitHub notifications bot (Aug 17 2022 at 18:59):

cfallin commented on issue #4327:

Weird -- here's what I see:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 './df15025 compile ../wasm-tests/spidermonkey.wasm' './c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.413 s ±  0.050 s    [User: 20.426 s, System: 0.605 s]
  Range (min … max):    1.349 s …  1.492 s    10 runs

Benchmark 2: ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.649 s ±  0.024 s    [User: 20.312 s, System: 0.579 s]
  Range (min … max):    1.636 s …  1.715 s    10 runs

Summary
  './df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.17 ± 0.04 times faster than './c1b3962 compile ../wasm-tests/spidermonkey.wasm'

Parameters: 12-core/24-thread system (Ryzen 3900X), frequency governor locked at 2.8GHz, Linux/x86-64, built with Rust 1.63.0, nothing else running on system (ssh into headless box).

Wasmtime GitHub notifications bot (Aug 17 2022 at 19:02):

cfallin commented on issue #4327:

Notably the user CPU time (sum of all threads' time) is ~about the same or a little better in "after" (c1b) than "before" (df1), so parallelism is perhaps the culprit here.

Wasmtime GitHub notifications bot (Aug 17 2022 at 20:25):

cfallin commented on issue #4327:

Doing some more experimentation on my system:

with 12 cores (so, one SMT sibling on each core, removing the effects of SMT), a 26% regression:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.694 s ±  0.012 s    [User: 14.586 s, System: 0.422 s]
  Range (min … max):    1.679 s …  1.717 s    10 runs

Benchmark 2: taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.135 s ±  0.020 s    [User: 14.597 s, System: 0.429 s]
  Range (min … max):    2.109 s …  2.175 s    10 runs

Summary
  'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.26 ± 0.01 times faster than 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'

with the first 6 cores (one of two chiplets, eliminating any interconnect or NUMA-related issues), a 31% regression:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.862 s ±  0.018 s    [User: 14.378 s, System: 0.375 s]
  Range (min … max):    2.841 s …  2.907 s    10 runs

Benchmark 2: taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      3.748 s ±  0.018 s    [User: 14.368 s, System: 0.377 s]
  Range (min … max):    3.736 s …  3.795 s    10 runs

Summary
  'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.31 ± 0.01 times faster than 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'

@alexcrichton , would you be OK with reverting the parallel compilation of trampolines that this PR introduced? Even if this regression appears only on some systems or in some parallelism scenarios, I think these numbers are too big to ignore and while from first principles the parallel-iter runtime should do the right thing, it seems to be falling flat here...

Wasmtime GitHub notifications bot (Aug 17 2022 at 21:33):

alexcrichton commented on issue #4327:

Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of RAYON_NUM_THREADS=N and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. Given it seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.

Wasmtime GitHub notifications bot (Aug 17 2022 at 21:34):

alexcrichton edited a comment on issue #4327:

Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of RAYON_NUM_THREADS=N and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. It seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.

Wasmtime GitHub notifications bot (Aug 17 2022 at 21:53):

alexcrichton commented on issue #4327:

The problem here seems to stem from rayon::join. If I move out compilation of the trampolines entirely from the rayon::join and compile all trampolines serially I still observe the slowdown locally at 6 threads. This is where one closure to rayon::join does a parallel compilation of all of a module's functions, and the other closure does nothing. Seems like we may need to avoid rayon::join, but I'll try to dig a bit more.

Wasmtime GitHub notifications bot (Aug 17 2022 at 22:01):

alexcrichton commented on issue #4327:

I can confirm that this diff:

diff --git a/crates/wasmtime/src/module.rs b/crates/wasmtime/src/module.rs
index 5524ebb73..1494bed18 100644
--- a/crates/wasmtime/src/module.rs
+++ b/crates/wasmtime/src/module.rs
@@ -372,8 +372,9 @@ impl Module {
         let tunables = &engine.config().tunables;
         let functions = mem::take(&mut translation.function_body_inputs);
         let functions = functions.into_iter().collect::<Vec<_>>();
-        let funcs = engine
-            .run_maybe_parallel(functions, |(index, func)| {
+        let mut funcs = None;
+        rayon::scope(|_| {
+            funcs = Some(engine.run_maybe_parallel(functions, |(index, func)| {
                 let offset = func.body.range().start;
                 engine
                     .compiler()
@@ -389,9 +390,9 @@ impl Module {
                             "failed to compile wasm function {index}{name} at offset {offset:#x}"
                         )
                     })
-            })?
-            .into_iter()
-            .collect();
+            }));
+        });
+        let funcs = funcs.unwrap()?.into_iter().collect();

         // Collect all the function results into a final ELF object.
         let mut obj = engine.compiler().object()?;

on df1502531d437b2105effb990a7caa28796aa985 produces the slowdown I'm seeing. Unsure what's happening within rayon...

Wasmtime GitHub notifications bot (Aug 17 2022 at 22:51):

alexcrichton commented on issue #4327:

I don't really know what's going on here. I dug into the actual compile time of SLOW=1 vs not with the above patch. Summing the time spent compiling each individual function shows a 2.34% difference in time which doesn't account for the 20% wall time seen for the entire command. My best guess is that something is tickling rayon in just the wrong fashion here, although I have no idea what.

Wasmtime GitHub notifications bot (Aug 17 2022 at 23:28):

cfallin commented on issue #4327:

Wow, this is wild: sure enough, when run with RAYON_NUM_THREADS=8 on my system, the regression disappears (+/- 1% delta, within noise).

I agree this seems like a Rayon bug of some sort -- a dependence on there being a power-of-two number of worker threads (some sort of work-distribution imbalance maybe?).

As a very short-term solution I will benchmark with this environment variable set; but I will file a bug upstream and see what folks think.

Last updated: Apr 17 2025 at 19:03 UTC