alexcrichton commented on issue #4327:
I was originally going to include lifted-then-lowered functions in the same component here as well, but that has snowballed quite a lot so I opted to split this out and get this in first.
github-actions[bot] commented on issue #4327:
Subscribe to Label Action
cc @peterhuene
<details>
This issue or pull request has been labeled: "wasmtime:api"Thus the following users have been cc'd because of the following labels:
- peterhuene: wasmtime:api
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin commented on issue #4327:
@alexcrichton I'm doing some historical benchmarking to look at how various features have improved Wasmtime + Cranelift over time, and I found a regression in compile time that I've bisected to this PR.
In particular, compiling
spidermonkey.wasm
(withwasmtime compile
on the command line measured usinghyperfine
, on a machine with CPU frequency locked to 2.8GHz and turbo-boost disabled):
- commit
df1502531d437b2105effb990a7caa28796aa985
, just prior to this PR's merge: 1.23s- commit
c1b3962f7b9fc1c193e1b9709db64c455699a295
, this PR's merge: 1.43s, or 16% slowerIs this expected (I see a note above about compiling more trampolines)?
The regression persists to today, so I'm definitely interested in ways we could ameliorate this!
alexcrichton commented on issue #4327:
Hm no this PR shouldn't have affected core wasm compilation, I will try to dig in further and see what this caused.
alexcrichton commented on issue #4327:
Hm ok I am unable to reproduce this. I'm running:
$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"
where that
spidermonkey.wasm
just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run thehyperfine
command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.
alexcrichton edited a comment on issue #4327:
Hm ok I am unable to reproduce this. I'm running:
$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"
where that
spidermonkey.wasm
just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run thehyperfine
command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled serially previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.
cfallin commented on issue #4327:
Weird -- here's what I see:
[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 './df15025 compile ../wasm-tests/spidermonkey.wasm' './c1b3962 compile ../wasm-tests/spidermonkey.wasm' Benchmark 1: ./df15025 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 1.413 s ± 0.050 s [User: 20.426 s, System: 0.605 s] Range (min … max): 1.349 s … 1.492 s 10 runs Benchmark 2: ./c1b3962 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 1.649 s ± 0.024 s [User: 20.312 s, System: 0.579 s] Range (min … max): 1.636 s … 1.715 s 10 runs Summary './df15025 compile ../wasm-tests/spidermonkey.wasm' ran 1.17 ± 0.04 times faster than './c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Parameters: 12-core/24-thread system (Ryzen 3900X), frequency governor locked at 2.8GHz, Linux/x86-64, built with Rust 1.63.0, nothing else running on system (ssh into headless box).
cfallin commented on issue #4327:
Notably the user CPU time (sum of all threads' time) is ~about the same or a little better in "after" (c1b) than "before" (df1), so parallelism is perhaps the culprit here.
cfallin commented on issue #4327:
Doing some more experimentation on my system:
- with 12 cores (so, one SMT sibling on each core, removing the effects of SMT), a 26% regression:
[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm' Benchmark 1: taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 1.694 s ± 0.012 s [User: 14.586 s, System: 0.422 s] Range (min … max): 1.679 s … 1.717 s 10 runs Benchmark 2: taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 2.135 s ± 0.020 s [User: 14.597 s, System: 0.429 s] Range (min … max): 2.109 s … 2.175 s 10 runs Summary 'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran 1.26 ± 0.01 times faster than 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
- with the first 6 cores (one of two chiplets, eliminating any interconnect or NUMA-related issues), a 31% regression:
[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm' Benchmark 1: taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 2.862 s ± 0.018 s [User: 14.378 s, System: 0.375 s] Range (min … max): 2.841 s … 2.907 s 10 runs Benchmark 2: taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm Time (mean ± σ): 3.748 s ± 0.018 s [User: 14.368 s, System: 0.377 s] Range (min … max): 3.736 s … 3.795 s 10 runs Summary 'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran 1.31 ± 0.01 times faster than 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
@alexcrichton , would you be OK with reverting the parallel compilation of trampolines that this PR introduced? Even if this regression appears only on some systems or in some parallelism scenarios, I think these numbers are too big to ignore and while from first principles the parallel-iter runtime should do the right thing, it seems to be falling flat here...
alexcrichton commented on issue #4327:
Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of
RAYON_NUM_THREADS=N
and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. Given it seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.
alexcrichton edited a comment on issue #4327:
Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of
RAYON_NUM_THREADS=N
and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. It seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.
alexcrichton commented on issue #4327:
The problem here seems to stem from
rayon::join
. If I move out compilation of the trampolines entirely from therayon::join
and compile all trampolines serially I still observe the slowdown locally at 6 threads. This is where one closure torayon::join
does a parallel compilation of all of a module's functions, and the other closure does nothing. Seems like we may need to avoidrayon::join
, but I'll try to dig a bit more.
alexcrichton commented on issue #4327:
I can confirm that this diff:
diff --git a/crates/wasmtime/src/module.rs b/crates/wasmtime/src/module.rs index 5524ebb73..1494bed18 100644 --- a/crates/wasmtime/src/module.rs +++ b/crates/wasmtime/src/module.rs @@ -372,8 +372,9 @@ impl Module { let tunables = &engine.config().tunables; let functions = mem::take(&mut translation.function_body_inputs); let functions = functions.into_iter().collect::<Vec<_>>(); - let funcs = engine - .run_maybe_parallel(functions, |(index, func)| { + let mut funcs = None; + rayon::scope(|_| { + funcs = Some(engine.run_maybe_parallel(functions, |(index, func)| { let offset = func.body.range().start; engine .compiler() @@ -389,9 +390,9 @@ impl Module { "failed to compile wasm function {index}{name} at offset {offset:#x}" ) }) - })? - .into_iter() - .collect(); + })); + }); + let funcs = funcs.unwrap()?.into_iter().collect(); // Collect all the function results into a final ELF object. let mut obj = engine.compiler().object()?;
on df1502531d437b2105effb990a7caa28796aa985 produces the slowdown I'm seeing. Unsure what's happening within rayon...
alexcrichton commented on issue #4327:
I don't really know what's going on here. I dug into the actual compile time of
SLOW=1
vs not with the above patch. Summing the time spent compiling each individual function shows a 2.34% difference in time which doesn't account for the 20% wall time seen for the entire command. My best guess is that something is tickling rayon in just the wrong fashion here, although I have no idea what.
cfallin commented on issue #4327:
Wow, this is wild: sure enough, when run with
RAYON_NUM_THREADS=8
on my system, the regression disappears (+/- 1% delta, within noise).I agree this seems like a Rayon bug of some sort -- a dependence on there being a power-of-two number of worker threads (some sort of work-distribution imbalance maybe?).
As a very short-term solution I will benchmark with this environment variable set; but I will file a bug upstream and see what folks think.
Last updated: Dec 23 2024 at 12:05 UTC