Stream: git-wasmtime

Topic: wasmtime / issue #4163 Add a basic alias analysis with re...


view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 00:30):

fitzgen commented on issue #4163:

Great! I will take a look at this tomorrow.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 00:31):

cfallin commented on issue #4163:

I ran the Sightglass benchmarks on this and the only delta is for meshoptimizer:

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 957833251.60 ± 610441836.41 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [17612149560 17873615806.60 18098060500] new.so
  [17544919808 18831449058.20 19306513928] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 252025449.20 ± 160632667.79 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [4634388228 4703136639.50 4762142132] new.so
  [4616626357 4955162088.70 5080175835] old.so

(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless it seems nice to have this in, if it doesn't hurt, and it could become more applicable if/when we inline.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 00:31):

cfallin edited a comment on issue #4163:

I ran the Sightglass benchmarks on this and the only delta is for meshoptimizer:

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 957833251.60 ± 610441836.41 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [17612149560 17873615806.60 18098060500] new.so
  [17544919808 18831449058.20 19306513928] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 252025449.20 ± 160632667.79 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [4634388228 4703136639.50 4762142132] new.so
  [4616626357 4955162088.70 5080175835] old.so

(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless it seems nice to have this in, if it doesn't hurt, and it could become more applicable if/when we inline.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 05:59):

bjorn3 commented on issue #4163:

I think you will need to add support for volatile loads and stores for this to not miscompile rust code. It would also be nice if you could state an exhaustive list of UB it depends on. For example that reading/writing memory racing with the compiled clif function is UB unless it is an atomic or volatile load/store.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 06:04):

bjorn3 commented on issue #4163:

By the way, what is the time complexity of this optimization pass?

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 06:15):

cfallin commented on issue #4163:

volatile loads/stores to not miscompile

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Otherwise removing a load and replacing it with a known value is always legal, given CLIF semantics today. (If it helps, imagine the disjoint alias categories don't exist: by default all loads/stores fall into "other", so memory locations are named by the last store of any kind.)

Atomics should use atomic ops at the CLIF level of course; I believe those should act as full barriers across all categories, as calls do. (I don't think I've actually added this logic yet; will ensure it's there tomorrow.)

time complexity of this optimization pass?

Same as any of the other iterative dataflow analyses: in the worst case, the number of passes over any given basic block is bounded by the maximum descending-chain depth of the lattice. Here the last-store vector meets to "current inst" on any conflict at a merge point, so the lattice is effectively three levels deep, a constant; that gives O(|blocks|) to converge the last-store info (likely one visit per block, rarely two). Then one more visit per block discovers and uses aliases.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 08:59):

bjorn3 commented on issue #4163:

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 09:49):

bjorn3 commented on issue #4163:

This is a slight regression for the simple-raytracer benchmark of cg_clif. I would guess it increases live-range sizes, causing a regalloc pessimization. Implementing https://github.com/bytecodealliance/wasmtime/pull/4163#discussion_r876805710 would make it a beneficial optimization, I think. I used to have a simple implementation of that in cg_clif which also looked at which bytes of a stack slot were accessed by the load/store, to effectively allow SROA, but I removed it in https://github.com/bjorn3/rustc_codegen_cranelift/commit/a793be8ee8895538e99acc2a855d9c4ae145fc78 as it was buggy and I didn't know how to fix it. It also only operated within a single basic block. See https://github.com/bjorn3/rustc_codegen_cranelift/issues/846.

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]
  Range (min … max):    5.169 s …  5.211 s    10 runs

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]
  Range (min … max):    5.197 s …  5.263 s    10 runs

Summary
  './raytracer_cg_clif_release_alias_analysis' ran
    1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main'

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 15:02):

cfallin commented on issue #4163:

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

OK, so if I understand correctly, cg_clif is currently compiling volatiles at the Rust level to normal loads/stores at the CLIF level? I agree that would result in miscompiles; the issue, though, isn't alias analysis, as the semantics of load and store in CLIF have always been (even if not exploited until now) those of normal loads/stores. Adding volatile memory ops seems like a reasonable feature request, though, and I'd be happy to review a PR for that as well.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 15:05):

cfallin commented on issue #4163:

@bjorn3 I may be misreading your benchmark result, but:

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]

looks like alias_analysis ran a bit faster than main? Then './raytracer_cg_clif_release_alias_analysis' ran 1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main' seems to confirm -- +1%, though maybe within a margin of error.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 22:10):

bjorn3 commented on issue #4163:

looks like alias_analysis ran a bit faster than main?

Right, my bad.

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2022 at 23:23):

cfallin commented on issue #4163:

Updated, thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 20 2022 at 06:17):

bjorn3 commented on issue #4163:

Hang on, I think I know why it didn't help much for cg_clif: cg_clif uses stack_load and stack_store rather than the stack_addr+load/store that this opt pass needs.

view this post on Zulip Wasmtime GitHub notifications bot (May 20 2022 at 06:46):

cfallin commented on issue #4163:

@bjorn3 the alias analysis / redundant-load pass runs after legalization, so it should see plain loads/stores rather than stack_load/stack_store, I think.
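For reference, a rough sketch of the two CLIF shapes being discussed (illustrative only, written from memory of CLIF syntax rather than taken from the issue): a frontend may emit stack_load/stack_store directly, which legalization lowers to stack_addr plus a plain load/store that the alias-analysis pass can see.

```
;; illustrative CLIF fragment, not from the actual issue
ss0 = explicit_slot 8

;; as a frontend like cg_clif may emit it:
stack_store v0, ss0
v1 = stack_load.i64 ss0

;; after legalization, roughly:
v2 = stack_addr.i64 ss0
store v0, v2
v3 = load.i64 v2
```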

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2022 at 16:36):

pepyakin commented on issue #4163:

FWIW, I've noticed a ≈15% regression, and bisection points to this commit.

the results for 0824abb:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs

previous good, 89ccc56:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs

This particular benchmark is a standalone wasm file, i.e. no imports. The entry point loads a test wasm binary and interprets it with the compiled-in wasmi. The test binary is also standalone. The test wasm generates random numbers using xorshift and takes the keccak hash of them. I can publish the wasm, or the source to build the wasm.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2022 at 17:15):

pepyakin commented on issue #4163:

Well, it gets weird.

If I apply the following change:

diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }

diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),

then the result would be like this:

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs

I'm confused, because I'm not sure how that could possibly be related.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2022 at 17:16):

pepyakin edited a comment on issue #4163:

Well, it gets weird.

If I apply the following change on top of 0824abb

diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }

diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),

then the result would be like this:

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs

i.e. the regression disappears.

I'm confused, because I'm not sure how that could possibly be related.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2022 at 20:41):

cfallin commented on issue #4163:

@pepyakin I'm happy to take a look at this next week (I'm out of office now); can you publish the .wasm somewhere in the meantime?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 18 2022 at 10:16):

pepyakin commented on issue #4163:

Sure!

wasmibox.wasm.gz

view this post on Zulip Wasmtime GitHub notifications bot (Jun 18 2022 at 10:26):

pepyakin edited a comment on issue #4163:

FWIW, I've noticed a ≈15% regression, and bisection points to this commit.

the results for 0824abb:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs

previous good, 89ccc56:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs

This particular benchmark is a standalone wasm file, i.e. no imports. The entry point loads a test wasm binary and interprets it with the compiled-in wasmi. The test binary is also standalone. The test wasm generates random numbers using xorshift and takes the keccak hash of them. I can publish the wasm, or the source to build the wasm.

UPD:

uname: Linux hetzner 5.10.78 #1-NixOS SMP Sat Nov 6 13:10:10 UTC 2021 x86_64 GNU/Linux
CPU: AMD Ryzen 9 5950X 16-Core Processor

view this post on Zulip Wasmtime GitHub notifications bot (Jun 28 2022 at 21:43):

cfallin commented on issue #4163:

@pepyakin I'm just getting to this now (sorry for the delay!) and I'm actually seeing a speedup from alias analysis, consistently, of about 6% (6.2s to 5.8s):

[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.246 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.84s user 0.01s system 99% cpu 5.856 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.19s user 0.01s system 99% cpu 6.216 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.85s user 0.01s system 99% cpu 5.868 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.242 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.77s user 0.01s system 99% cpu 5.792 total

These were built with current main (c1b3962f7b9fc1c193e1b9709db64c455699a295) with this patch to make alias analysis / redundant-load elimination optional; then

$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=true wasmibox.wasm -o wasmibox_aa.cwasm
$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=false wasmibox.wasm -o wasmibox_no_aa.cwasm

and I double-checked I wasn't swapping the two. Looking at a diff of the disassemblies, it's sort of what I expect: a few extra loads are removed and carried in registers instead, and this I would indeed expect to improve performance, as it does.

Can you confirm you're still seeing a slowdown, not speedup, and if so at which commit and anything else about your environment and measurements?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 04 2022 at 10:43):

pepyakin commented on issue #4163:

With the following script:

./wasmtime-0824abbae compile -g -O wasmibox.wasm -o wasmibox-0824abbae.cwasm
./wasmtime-89ccc56e4 compile -g -O wasmibox.wasm -o wasmibox-89ccc56e4.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=true -o wasmibox-a2197ebbe-aa.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=false -o wasmibox-a2197ebbe-no-aa.cwasm

hyperfine --show-output --warmup 2 "./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2  "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run"

I am getting the following results:

++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.887 s ±  0.046 s    [User: 6.872 s, System: 0.013 s]
  Range (min … max):    6.848 s …  6.990 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.941 s ±  0.024 s    [User: 5.927 s, System: 0.013 s]
  Range (min … max):    5.914 s …  5.978 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.666 s ±  0.037 s    [User: 4.651 s, System: 0.013 s]
  Range (min … max):    4.627 s …  4.743 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.524 s ±  0.133 s    [User: 4.510 s, System: 0.013 s]
  Range (min … max):    4.472 s …  4.898 s    10 runs

It confirms that on my machine the pre-AA commit performs considerably better than the commit where AA was introduced. The current main performs better than both of them, regardless of whether AA is enabled. No-AA performs better than AA on main, though.

However, if I remove the -g flag then the whole picture changes:

++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.826 s ±  0.082 s    [User: 5.815 s, System: 0.010 s]
  Range (min … max):    5.759 s …  6.040 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.600 s ±  0.073 s    [User: 6.589 s, System: 0.009 s]
  Range (min … max):    6.543 s …  6.722 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.375 s ±  0.013 s    [User: 4.364 s, System: 0.010 s]
  Range (min … max):    4.364 s …  4.406 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.721 s ±  0.016 s    [User: 4.711 s, System: 0.009 s]
  Range (min … max):    4.702 s …  4.746 s    10 runs

Now, it starts to make more sense, although I still don't understand how that could possibly be related.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 05 2022 at 05:42):

cfallin commented on issue #4163:

OK, I agree that it doesn't make any sense that -g would make the difference here. I suspect either some odd measurement effect or some pessimization caused by debuginfo registration or something like that. Given that (i) the normal config behaves as expected (alias analysis produces a nontrivial speedup), (ii) codegen looks as expected on manual examination (diff of no-AA vs AA in default config on main), and (iii) debuginfo in general is a big mess that needs more focused attention, I don't think I'm able to justify spending significantly more time looking into this; but if you find anything more please let us know.


Last updated: Oct 23 2024 at 20:03 UTC