fitzgen commented on issue #4163:
Great! I will take a look at this tomorrow.
cfallin commented on issue #4163:
I ran the Sightglass benchmarks on this and the only delta is for meshoptimizer:

```
execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 957833251.60 ± 610441836.41 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [17612149560 17873615806.60 18098060500] new.so
  [17544919808 18831449058.20 19306513928] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 252025449.20 ± 160632667.79 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [4634388228 4703136639.50 4762142132] new.so
  [4616626357 4955162088.70 5080175835] old.so
```
(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless it seems nice to have this in, if it doesn't hurt, and it could become more applicable if/when we inline.
bjorn3 commented on issue #4163:
I think you will need to add support for volatile loads and stores for this not to miscompile Rust code. It would also be nice if you could state an exhaustive list of the UB it depends on. For example, that reading/writing memory racing with the compiled CLIF function is UB unless it is an atomic or volatile load/store.
bjorn3 commented on issue #4163:
By the way, what is the time complexity of this optimization pass?
cfallin commented on issue #4163:
volatile loads/stores to not miscompile
@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?
Otherwise removing a load and replacing it with a known value is always legal, given CLIF semantics today. (If it helps, imagine the disjoint alias categories don't exist: by default all loads/stores fall into "other", so memory locations are named by the last store of any kind.)
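To make "memory locations are named by the last store of any kind" concrete, here is a minimal single-block sketch of that scheme. This is purely illustrative: the types and names are made up, and the real pass works over CLIF instructions and alias categories rather than this toy representation.

```rust
use std::collections::HashMap;

type Value = u32; // SSA value id
type Inst = u32;  // instruction id

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MemKey {
    last_store: Option<Inst>, // store that last (re)defined this location, if any
    base: Value,
    offset: i32,
}

#[derive(Clone, Copy)]
enum MemOp {
    Store { inst: Inst, base: Value, offset: i32, data: Value },
    Load { inst: Inst, base: Value, offset: i32, result: Value },
    Call, // unknown side effects: acts as a full barrier
}

/// For each load, return the already-available value it could be replaced with.
fn forward_loads(ops: &[MemOp]) -> HashMap<Inst, Value> {
    let mut last_store: Option<Inst> = None;
    let mut known: HashMap<MemKey, Value> = HashMap::new();
    let mut replacements: HashMap<Inst, Value> = HashMap::new();

    for op in ops {
        match *op {
            MemOp::Store { inst, base, offset, data } => {
                // A store renames the location: later loads are keyed by this
                // store, so values remembered before it can never be reused.
                last_store = Some(inst);
                known.insert(MemKey { last_store, base, offset }, data);
            }
            MemOp::Load { inst, base, offset, result } => {
                let key = MemKey { last_store, base, offset };
                match known.get(&key) {
                    // Same address and same last store: the load is redundant.
                    Some(&v) => { replacements.insert(inst, v); }
                    // Otherwise remember its result for later redundant loads.
                    None => { known.insert(key, result); }
                }
            }
            MemOp::Call => {
                // A call may store anywhere: conservatively forget everything.
                last_store = None;
                known.clear();
            }
        }
    }
    replacements
}
```

Whether the known value comes from a prior store (store-to-load forwarding) or a prior load (redundant-load elimination), replacing the later load never changes behavior under these semantics.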
Atomics should use atomic ops at the CLIF level of course; I believe those should act as full barriers across all categories, as calls do. (I don't think I've actually added this logic yet; will ensure it's there tomorrow.)
time complexity of this optimization pass?
Same as any of the other iterative dataflow analyses: the worst-case number of passes over any given basic block is bounded by the maximum descending-chain depth in the lattice. Here the last-store vector meets to "current inst" on any conflict at a merge point, so the lattice is effectively three levels deep, a constant; that gives O(|blocks|) to converge the last-store info (likely one visit per block, rarely two). Then one more visit per block discovers and uses the aliases.
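For illustration, the three-level lattice and its meet could look roughly like this (names and types are invented for this sketch, not taken from cranelift-codegen):

```rust
type Inst = u32;

#[derive(Clone, Copy, PartialEq, Eq)]
enum LastStore {
    Top,            // no information yet (unvisited predecessor)
    Store(Inst),    // location last defined by this single known store
    Conflict(Inst), // predecessors disagree; named by the merge point itself
}

fn meet(a: LastStore, b: LastStore, merge_point: Inst) -> LastStore {
    use LastStore::*;
    match (a, b) {
        (Top, x) | (x, Top) => x,
        (x, y) if x == y => x,
        // Any disagreement collapses to the merge point. A state can only move
        // down Top -> Store -> Conflict, so the descending-chain depth is a
        // small constant and the fixpoint converges in O(|blocks|) visits.
        _ => Conflict(merge_point),
    }
}
```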
bjorn3 commented on issue #4163:
@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?
Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized, as Rust defines data races to be UB.
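(A minimal Rust illustration of that distinction; the functions are hypothetical, not taken from cg_clif:)

```rust
use std::ptr;

// Volatile: both reads must actually happen, in order. A device status
// register may change between them or have side effects on read, so the
// compiler may not merge or drop these accesses.
unsafe fn read_status_twice(reg: *const u32) -> (u32, u32) {
    unsafe { (ptr::read_volatile(reg), ptr::read_volatile(reg)) }
}

// Non-volatile, non-atomic: since Rust defines data races to be UB, a
// compiler is free to load once and reuse the value, i.e. exactly the kind
// of redundant-load elimination this pass performs.
unsafe fn read_plain_twice(p: *const u32) -> (u32, u32) {
    unsafe { (*p, *p) }
}
```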
bjorn3 commented on issue #4163:
This is a slight regression for the simple-raytracer benchmark of cg_clif. I would guess it increases live-range sizes, causing a regalloc pessimization. Implementing https://github.com/bytecodealliance/wasmtime/pull/4163#discussion_r876805710 would make it a beneficial optimization, I think. I used to have a simple implementation of that in cg_clif which also looked at which bytes of a stack slot were accessed by the load/store to effectively allow SROA, but I removed it in https://github.com/bjorn3/rustc_codegen_cranelift/commit/a793be8ee8895538e99acc2a855d9c4ae145fc78 as it was buggy and I didn't think I knew how to fix it. It also only operated within a single basic block. See https://github.com/bjorn3/rustc_codegen_cranelift/issues/846.
```
Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]
  Range (min … max):    5.169 s …  5.211 s    10 runs

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]
  Range (min … max):    5.197 s …  5.263 s    10 runs

Summary
  './raytracer_cg_clif_release_alias_analysis' ran
    1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main'
```
cfallin commented on issue #4163:
@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?
Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.
OK, so if I understand correctly, cg_clif is currently compiling volatiles at the Rust level to normal loads/stores at the CLIF level? That I agree would result in miscompiles; the issue though isn't alias analysis, as the semantics of `load` and `store` in CLIF have always been (even if not exploited until now) those of normal loads/stores. Adding volatile memory ops seems like a reasonable feature request though, and I'd be happy to review a PR for that as well.
cfallin commented on issue #4163:
@bjorn3 I may be misreading your benchmark result, but:
```
Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]
```
looks like `alias_analysis` ran a bit faster than `main`? Then "'./raytracer_cg_clif_release_alias_analysis' ran 1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main'" seems to confirm -- +1%, though maybe within the margin of error.
bjorn3 commented on issue #4163:
looks like alias_analysis ran a bit faster than main?
Right, my bad.
cfallin commented on issue #4163:
Updated, thanks!
bjorn3 commented on issue #4163:
Hang on, I think I know why it didn't help much for cg_clif: cg_clif uses `stack_load` and `stack_store` rather than the `stack_addr` + `load`/`store` sequence this opt pass needs.
cfallin commented on issue #4163:
@bjorn3 the alias analysis / redundant-load pass runs after legalization, so it should see plain loads/stores rather than `stack_load`/`stack_store`, I think.
pepyakin commented on issue #4163:
FWIW, I've noticed a ≈15% regression, and bisection points to this commit.
The results for 0824abb:
```
./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs
```
previous good, 89ccc56:
```
./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs
```
This particular benchmark is a standalone wasm file, i.e. it has no imports. The entry point loads a test wasm binary and interprets it with the compiled-in wasmi. The test binary is also standalone; it generates random numbers using xorshift and takes a keccak hash of them. I can publish the wasm or the source used to build it.
pepyakin commented on issue #4163:
Well, it gets weird.
If I apply the following change:
```diff
diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }
diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),
```
then the result would be like this:
```
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs
```
I'm confused because I'm not sure how that could possibly be related.
pepyakin edited a comment on issue #4163:
Well, it gets weird.
If I apply the following change on top of 0824abb
```diff
diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }
diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),
```
then the result would be like this:
```
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs
```
i.e. the regression disappears.
I'm confused because I'm not sure how that could possibly be related.
cfallin commented on issue #4163:
@pepyakin I'm happy to take a look at this next week (I'm out of office now); can you publish the .wasm somewhere in the meantime?
pepyakin commented on issue #4163:
Sure!
pepyakin edited a comment on issue #4163:
FWIW, I've noticed a ≈15% regression, and bisection points to this commit.
The results for 0824abb:
```
./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs
```
previous good, 89ccc56:
```
./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs
```
This particular benchmark is a standalone wasm file, i.e. it has no imports. The entry point loads a test wasm binary and interprets it with the compiled-in wasmi. The test binary is also standalone; it generates random numbers using xorshift and takes a keccak hash of them. I can publish the wasm or the source used to build it.
UPD:
uname: Linux hetzner 5.10.78 #1-NixOS SMP Sat Nov 6 13:10:10 UTC 2021 x86_64 GNU/Linux
CPU: AMD Ryzen 9 5950X 16-Core Processor
cfallin commented on issue #4163:
@pepyakin I'm just getting to this now (sorry for the delay!) and I'm actually seeing a speedup from alias analysis, consistently, of about 6% (6.2s to 5.8s):
```
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke   6.22s user 0.01s system 99% cpu 6.246 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.84s user 0.01s system 99% cpu 5.856 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke   6.19s user 0.01s system 99% cpu 6.216 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.85s user 0.01s system 99% cpu 5.868 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke   6.22s user 0.01s system 99% cpu 6.242 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.77s user 0.01s system 99% cpu 5.792 total
```
These were built with current `main` (c1b3962f7b9fc1c193e1b9709db64c455699a295) with this patch to make alias analysis / redundant-load elimination optional; then

```
$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=true wasmibox.wasm -o wasmibox_aa.cwasm
$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=false wasmibox.wasm -o wasmibox_no_aa.cwasm
```
and I double-checked I wasn't swapping the two. Looking at a diff of the disassemblies, it's sort of what I expect: a few extra loads are removed and carried in registers instead, and this I would indeed expect to improve performance, as it does.
Can you confirm you're still seeing a slowdown, not speedup, and if so at which commit and anything else about your environment and measurements?
pepyakin commented on issue #4163:
With the following script:
```
./wasmtime-0824abbae compile -g -O wasmibox.wasm -o wasmibox-0824abbae.cwasm
./wasmtime-89ccc56e4 compile -g -O wasmibox.wasm -o wasmibox-89ccc56e4.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=true -o wasmibox-a2197ebbe-aa.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=false -o wasmibox-a2197ebbe-no-aa.cwasm

hyperfine --show-output --warmup 2 "./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run"
```

I am getting the following results:

```shell
++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.887 s ±  0.046 s    [User: 6.872 s, System: 0.013 s]
  Range (min … max):    6.848 s …  6.990 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.941 s ±  0.024 s    [User: 5.927 s, System: 0.013 s]
  Range (min … max):    5.914 s …  5.978 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.666 s ±  0.037 s    [User: 4.651 s, System: 0.013 s]
  Range (min … max):    4.627 s …  4.743 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.524 s ±  0.133 s    [User: 4.510 s, System: 0.013 s]
  Range (min … max):    4.472 s …  4.898 s    10 runs
```
It confirms that on my machine the pre-AA commit performs considerably better than the commit where AA was introduced. The current main performs better than both of them regardless of whether AA is enabled; no-AA still performs better than AA on main, though.
However, if I remove the `-g` flag then the whole picture changes:

```shell
++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.826 s ±  0.082 s    [User: 5.815 s, System: 0.010 s]
  Range (min … max):    5.759 s …  6.040 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.600 s ±  0.073 s    [User: 6.589 s, System: 0.009 s]
  Range (min … max):    6.543 s …  6.722 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.375 s ±  0.013 s    [User: 4.364 s, System: 0.010 s]
  Range (min … max):    4.364 s …  4.406 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.721 s ±  0.016 s    [User: 4.711 s, System: 0.009 s]
  Range (min … max):    4.702 s …  4.746 s    10 runs
```
Now it starts to make more sense, although I still don't understand how that could possibly be related.
cfallin commented on issue #4163:
OK, I agree that it doesn't make any sense that `-g` would make the difference here. I suspect either some odd measurement effect or some pessimization caused by debuginfo registration or something like that. Given that (i) the normal config behaves as expected (alias analysis produces a nontrivial speedup), (ii) codegen looks as expected on manual examination (a diff of no-AA vs. AA in the default config on `main`), and (iii) debuginfo in general is a big mess that needs more focused attention, I don't think I'm able to justify spending significantly more time looking into this; but if you find anything more, please let us know.