Stream: git-wasmtime

Topic: wasmtime / issue #6094 Further Improve Execution Speed wi...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:35):

fitzgen labeled issue #6094:

I've been digging into Cranelift's and Wasmtime's code quality and Wasm execution throughput when "dynamic memories" with explicit bounds checks are used to implement Wasm linear memories. Here I'm summarizing the progress made thus far, and laying out what I believe the next steps to continue this effort should be.

Static memories require large reservations of virtual memory, to the extent that it can become a bottleneck for how many Wasm instances can exist concurrently in a system. Dynamic memories, on the other hand, require very little or even zero virtual memory reservations (depending on configuration) beyond the actually resident pages backing the Wasm linear memory itself. Additionally, when dynamic memories are configured without any guard regions, they incur zero virtual memory-related syscalls on instantiation and tear down (modulo any copy-on-write heap image initialization). These syscalls have been observed to be bottlenecks in concurrent systems due to VMA locks taken in the kernel.

Furthermore these issues will only compound in the future. Right now, each tenant in a multi-tenant system is a single core Wasm module with a single Wasm linear memory. With the component model, a tenant will instead be a Wasm component that is made up of many core Wasm modules, each with their own linear memories. If the average Wasm component contains five linear memories, then the multi-tenant system can only support one fifth as many concurrent tenants under the component model, and instantiating and tearing down each tenant's service will take five times as many syscalls as it does today.

So there is a lot of pressure towards implementing Wasm linear memories with dynamic memories. The downside is that dynamic memories currently incur roughly 55% overhead to Wasm execution throughput compared to static memories. That is, a Wasm computation that takes 1 second in a static memory will take about 1.55 seconds in a dynamic memory. This is because dynamic memory accesses must be explicitly bounds checked rather than caught after the fact via page faults like out-of-bounds static memory accesses. The goal of the effort described here is to minimize this overhead.
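To make the overhead concrete, here is a sketch (in Rust, not Wasmtime's actual codegen) of the check a dynamic memory must perform before every Wasm load or store, whereas a static memory simply issues the access and relies on a page fault to catch out-of-bounds addresses. The function name and shape are illustrative assumptions, not Wasmtime internals:

```rust
// Hedged sketch of an explicit bounds check for a 1-byte Wasm load.
// `addr` and `offset` come from the Wasm instruction; the heap's
// current length is consulted at runtime on every access.
fn checked_load(heap: &[u8], addr: usize, offset: usize) -> Option<u8> {
    let current_length = heap.len();
    // addr + offset + access_size must not exceed the current heap
    // length; checked_add guards against address-arithmetic overflow.
    let end = addr.checked_add(offset)?.checked_add(1)?;
    if end > current_length {
        return None; // real codegen would emit a trap here
    }
    heap.get(addr + offset).copied()
}

fn main() {
    let heap = vec![0u8; 65536];
    assert_eq!(checked_load(&heap, 0, 10), Some(0));
    assert_eq!(checked_load(&heap, 65535, 1), None); // one past the end
    assert_eq!(checked_load(&heap, usize::MAX, 1), None); // overflow caught
    println!("ok");
}
```

The extra compare-and-branch (plus the load of the current heap length) on every memory access is the source of the throughput gap the rest of this issue is about.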

I don't intend to continue this work in the immediate future, so I wanted to make sure everything was summarized and documented for posterity, as well as written down while it is still fresh in my mind. Additionally I have proposals for new optimization passes we should implement in the future, and this issue can serve as a forum for discussing them.

Progress Made Thus Far

Before I began this effort, running the spidermonkey.wasm benchmark in Sightglass with static memories was 3.06x faster than doing the same with dynamic memories. Now we are down to static memories being only 1.55x faster.

This improvement comes from a number of sources:

Additionally, I created the wasmtime explore subcommand to dig into native code disassemblies and associate each native instruction with the Wasm bytecode it implements (similar to the Godbolt Compiler Explorer). This new tool has already been incredibly helpful for analyzing what room for improvement still exists, and should help future optimization efforts in Cranelift and Wasmtime.

Recent Sightglass Benchmark Results

Here are some recent Sightglass benchmark results comparing static memories (static.so) versus dynamic memories with a small (64KiB) guard region (dyn-with-guard.so) versus dynamic memories without a guard region (dyn-without-guard.so).

static.so vs dyn-with-guard.so

<details>

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 517183665.48 ± 6392410.71 (confidence = 99%)

  static.so is 1.55x to 1.56x faster than dyn-with-guard.so!

  [1420787062 1450201603.68 1522542159] dyn-with-guard.so
  [914013248 933017938.20 1036142782] static.so
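For readers unfamiliar with Sightglass's output: my reading is that the "Nx to Mx faster" range follows from the reported Δ (the difference of the two means) and its ± confidence interval, relative to the faster engine's mean. A quick sketch using the numbers from the result above (the derivation is my assumption, not Sightglass documentation):

```python
# Numbers from the spidermonkey execution result above.
delta = 517_183_665.48        # dyn-with-guard mean minus static mean, in cycles
ci = 6_392_410.71             # half-width of the 99% confidence interval
static_mean = 933_017_938.20  # static.so mean cycles

# Speedup range: how much slower the other config is, relative to static.so.
lo = 1 + (delta - ci) / static_mean
hi = 1 + (delta + ci) / static_mean
print(f"static.so is {lo:.2f}x to {hi:.2f}x faster")  # → 1.55x to 1.56x
```

The bracketed triples are [minimum, mean, maximum] of the raw samples for each configuration.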

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45802386.02 ± 1486278.64 (confidence = 99%)

  static.so is 1.42x to 1.45x faster than dyn-with-guard.so!

  [146742085 150423911.30 159078923] dyn-with-guard.so
  [100319894 104621525.28 149448856] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3014103.57 ± 316843.43 (confidence = 99%)

  static.so is 1.32x to 1.40x faster than dyn-with-guard.so!

  [10662276 11365102.25 16630094] dyn-with-guard.so
  [7683153 8350998.68 13449448] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45130213.24 ± 4838200.92 (confidence = 99%)

  static.so is 1.18x to 1.22x faster than dyn-with-guard.so!

  [249993052 271654226.14 316970220] dyn-with-guard.so
  [208845361 226524012.90 290090035] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 53326476.11 ± 5528961.43 (confidence = 99%)

  static.so is 1.16x to 1.19x faster than dyn-with-guard.so!

  [330373001 356824375.72 397185785] dyn-with-guard.so
  [279072702 303497899.61 348984338] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1163853854.64 ± 81595847.03 (confidence = 99%)

  static.so is 1.15x to 1.18x faster than dyn-with-guard.so!

  [8097294016 8213392062.47 10103274666] dyn-with-guard.so
  [6969056325 7049538207.83 8442021515] static.so

</details>

static.so vs dyn-without-guard.so

<details>

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 594932587.44 ± 6811906.18 (confidence = 99%)

  static.so is 1.63x to 1.65x faster than dyn-without-guard.so!

  [1494952789 1524945424.11 1606143187] dyn-without-guard.so
  [908877847 930012836.67 980176520] static.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 49482140.46 ± 2175287.73 (confidence = 99%)

  static.so is 1.45x to 1.49x faster than dyn-without-guard.so!

  [149370743 154540290.45 163713075] dyn-without-guard.so
  [99902180 105058149.99 163151023] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 54959651.89 ± 5371934.09 (confidence = 99%)

  static.so is 1.22x to 1.26x faster than dyn-without-guard.so!

  [258958338 282812487.71 327560254] dyn-without-guard.so
  [210542623 227852835.82 315990395] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1592325533.89 ± 100212378.19 (confidence = 99%)

  static.so is 1.21x to 1.24x faster than dyn-without-guard.so!

  [8494305552 8674966162.14 10607319542] dyn-without-guard.so
  [6985640940 7082640628.25 8405451996] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 66518323.42 ± 5778757.98 (confidence = 99%)

  static.so is 1.20x to 1.24x faster than dyn-without-guard.so!

  [342947509 371023668.35 415514806] dyn-without-guard.so
  [277471563 304505344.93 339940435] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1217702.33 ± 627813.45 (confidence = 99%)

  static.so is 1.06x to 1.18x faster than dyn-without-guard.so!

  [11111517 11632130.88 14705814] dyn-without-guard.so
  [7682848 10414428.55 14122261] static.so

</details>

Comparing Wasmtime and SpiderMonkey

To get a sense of whether our 1.55x slowdown when enabling explicit bounds checks was typical of other engines, I also benchmarked SpiderMonkey (native) running Sightglass's spidermonkey.wasm benchmark with virtual memory guard pages vs with explicit bounds checks.

execution :: milliseconds :: spidermonkey.wasm

  Δ = 163.62 ± 5.57 (confidence = 99%)

  virtual-memory-spidermonkey is 1.52x to 1.56x faster than bounds-checks-spidermonkey!

  [446 465.49 520] bounds-checks-spidermonkey
  [290 301.87 353] virtual-memory-spidermonkey

Because SpiderMonkey's slowdown is almost identical to Wasmtime's, we can infer that there aren't any critical bounds-checking optimizations that SpiderMonkey performs but Wasmtime/Cranelift does not. This is reassuring: it means we can basically ignore other engines (or at least SpiderMonkey) for now, when trying to further improve codegen when explicit bounds checks are enabled, and focus only on ourselves.

Proposed Future Work

What fruit remains to be harvested? I've been digging into spidermonkey.wasm's hot functions (as reported by perf on Linux) and staring at disassemblies to get an idea of how much room for improvement we still have. In short: a lot!

I found many sequences of back-to-back
[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:35):

fitzgen opened issue #6094:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:35):

fitzgen labeled issue #6094:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:35):

fitzgen labeled issue #6094:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:46):

fitzgen edited issue #6094:

I've been digging into Cranelift's and Wasmtime's code quality and Wasm execution throughput when "dynamic memories" with explicit bounds checks are used to implement Wasm linear memories. Here I'm summarizing the progress made thus far, and laying out what I believe the next steps to continue this effort should be.

Static memories require large reservations of virtual memory, to the extent that it can become a bottleneck for how many Wasm instances can exist concurrently in a system. Dynamic memories, on the other hand, require very little or even zero virtual memory reservations (depending on configuration) beyond the actually resident pages backing the Wasm linear memory itself. Additionally, when dynamic memories are configured without any guard regions, they incur zero virtual memory-related syscalls on instantiation and tear down (modulo any copy-on-write heap image initialization). These syscalls have been observed to be bottlenecks in concurrent systems due to VMA locks taken in the kernel.

Furthermore these issues will only compound in the future. Right now, each tenant in a multi-tenant system is a single core Wasm module with a single Wasm linear memory. With the component model, a tenant will instead be a Wasm component that is made up of many core Wasm modules, each with their own linear memories. If the average Wasm component contains five linear memories, then the multi-tenant system can only support one fifth as many concurrent tenants under the component model, and instantiating and tearing down each tenant's service will take five times as many syscalls as it does today.

So there is a lot of pressure towards implementing Wasm linear memories with dynamic memories. The downside is that dynamic memories currently incur roughly 55% overhead to Wasm execution throughput compared to static memories. That is, a Wasm computation that takes 1 second in a static memory will take about 1.55 seconds in a dynamic memory. This is because dynamic memory accesses must be explicitly bounds checked rather than caught after the fact via page faults like out-of-bounds static memory accesses. The goal of the effort described here is to minimize this overhead.
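To make the cost concrete, here is a minimal sketch (illustrative Python, not Wasmtime's actual codegen) of the per-access compare-and-branch that a dynamic memory must pay, where a static memory would instead let an out-of-bounds access fault in its virtual-memory guard region:

```python
def checked_load32(mem: bytearray, bound: int, index: int, offset: int) -> int:
    """Explicitly bounds-checked 4-byte load, as emitted for a dynamic memory."""
    if index + offset + 4 > bound:  # the extra compare-and-branch on every access
        raise RuntimeError("wasm trap: out of bounds memory access")
    ea = index + offset
    return int.from_bytes(mem[ea:ea + 4], "little")

mem = bytearray(65536)  # one 64 KiB Wasm page
mem[100:104] = (0xDEADBEEF).to_bytes(4, "little")
print(hex(checked_load32(mem, len(mem), 96, 4)))  # prints 0xdeadbeef
```

The roughly 55% throughput overhead discussed above is the aggregate cost of that check (and the Spectre-safe select discussed later in the thread) on every load and store.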

I don't intend to continue this work in the immediate future, so I wanted to make sure everything was summarized and documented for posterity, as well as written down while it is still fresh in my mind. Additionally I have proposals for new optimization passes we should implement in the future, and this issue can serve as a forum for discussing them.

Progress Made Thus Far

Before I began this effort, running the spidermonkey.wasm benchmark in Sightglass with static memories was 3.06x faster than doing the same with dynamic memories. Now we are down to static memories being only 1.55x faster.

This improvement comes from a number of sources:

Additionally, I created the wasmtime explore subcommand to dig into native code disassemblies and associate each native instruction with the Wasm bytecode it implements (similar to the Godbolt Compiler Explorer). This new tool has already been incredibly helpful for analyzing what room for improvement still exists, and should help future optimization efforts in Cranelift and Wasmtime.

Recent Sightglass Benchmark Results

Here are some recent Sightglass benchmark results comparing static memories (static.so) versus dynamic memories with a small (64KiB) guard region (dyn-with-guard.so) versus dynamic memories without a guard region (dyn-without-guard.so).
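For readers unfamiliar with Sightglass's output: each result line is a [min mean max] triple of cycle samples, and the headline "A is X to Y faster" range comes from the confidence interval on Δ. The point estimate is just the ratio of the two means; for example, using the spidermonkey execution means reported below:

```python
# Means copied from the spidermonkey execution result in the first comparison.
dyn_with_guard_mean = 1450201603.68  # mean cycles, dyn-with-guard.so
static_mean = 933017938.20           # mean cycles, static.so
speedup = dyn_with_guard_mean / static_mean
print(f"static.so is ~{speedup:.2f}x faster")  # prints: static.so is ~1.55x faster
```

This matches the reported 1.55x-to-1.56x range for that benchmark.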

static.so vs dyn-with-guard.so

<details>

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 517183665.48 ± 6392410.71 (confidence = 99%)

  static.so is 1.55x to 1.56x faster than dyn-with-guard.so!

  [1420787062 1450201603.68 1522542159] dyn-with-guard.so
  [914013248 933017938.20 1036142782] static.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45802386.02 ± 1486278.64 (confidence = 99%)

  static.so is 1.42x to 1.45x faster than dyn-with-guard.so!

  [146742085 150423911.30 159078923] dyn-with-guard.so
  [100319894 104621525.28 149448856] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3014103.57 ± 316843.43 (confidence = 99%)

  static.so is 1.32x to 1.40x faster than dyn-with-guard.so!

  [10662276 11365102.25 16630094] dyn-with-guard.so
  [7683153 8350998.68 13449448] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45130213.24 ± 4838200.92 (confidence = 99%)

  static.so is 1.18x to 1.22x faster than dyn-with-guard.so!

  [249993052 271654226.14 316970220] dyn-with-guard.so
  [208845361 226524012.90 290090035] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 53326476.11 ± 5528961.43 (confidence = 99%)

  static.so is 1.16x to 1.19x faster than dyn-with-guard.so!

  [330373001 356824375.72 397185785] dyn-with-guard.so
  [279072702 303497899.61 348984338] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1163853854.64 ± 81595847.03 (confidence = 99%)

  static.so is 1.15x to 1.18x faster than dyn-with-guard.so!

  [8097294016 8213392062.47 10103274666] dyn-with-guard.so
  [6969056325 7049538207.83 8442021515] static.so

</details>

static.so vs dyn-without-guard.so

<details>

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 594932587.44 ± 6811906.18 (confidence = 99%)

  static.so is 1.63x to 1.65x faster than dyn-without-guard.so!

  [1494952789 1524945424.11 1606143187] dyn-without-guard.so
  [908877847 930012836.67 980176520] static.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 49482140.46 ± 2175287.73 (confidence = 99%)

  static.so is 1.45x to 1.49x faster than dyn-without-guard.so!

  [149370743 154540290.45 163713075] dyn-without-guard.so
  [99902180 105058149.99 163151023] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 54959651.89 ± 5371934.09 (confidence = 99%)

  static.so is 1.22x to 1.26x faster than dyn-without-guard.so!

  [258958338 282812487.71 327560254] dyn-without-guard.so
  [210542623 227852835.82 315990395] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1592325533.89 ± 100212378.19 (confidence = 99%)

  static.so is 1.21x to 1.24x faster than dyn-without-guard.so!

  [8494305552 8674966162.14 10607319542] dyn-without-guard.so
  [6985640940 7082640628.25 8405451996] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 66518323.42 ± 5778757.98 (confidence = 99%)

  static.so is 1.20x to 1.24x faster than dyn-without-guard.so!

  [342947509 371023668.35 415514806] dyn-without-guard.so
  [277471563 304505344.93 339940435] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1217702.33 ± 627813.45 (confidence = 99%)

  static.so is 1.06x to 1.18x faster than dyn-without-guard.so!

  [11111517 11632130.88 14705814] dyn-without-guard.so
  [7682848 10414428.55 14122261] static.so

</details>

Comparing Wasmtime and SpiderMonkey

To get a sense of whether our 1.55x slowdown when enabling explicit bounds checks was typical of other engines, I also benchmarked SpiderMonkey (native) running Sightglass's spidermonkey.wasm benchmark with virtual memory guard pages vs with explicit bounds checks.

execution :: milliseconds :: spidermonkey.wasm

  Δ = 163.62 ± 5.57 (confidence = 99%)

  virtual-memory-spidermonkey is 1.52x to 1.56x faster than bounds-checks-spidermonkey!

  [446 465.49 520] bounds-checks-spidermonkey
  [290 301.87 353] virtual-memory-spidermonkey

Because SpiderMonkey's slowdown is almost identical to Wasmtime's, we can infer that there aren't any critical bounds-checking optimizations that SpiderMonkey performs but Wasmtime/Cranelift does not. This is reassuring: it means we can basically ignore other engines (or at least SpiderMonkey) for now, when trying to further improve codegen when explicit bounds checks are enabled, and focus only on ourselves.

Proposed Future Work

What fruit remains to be harvested? I've been digging into spidermonkey.wasm's hot functions (as reported by perf on Linux) and staring at disassemblies to get an idea of how much room for improvement we still have. In short: a lot!

I found many sequences of back-to-back m
[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (Mar 23 2023 at 18:53):

fitzgen commented on issue #6094:

For posterity, here is the suuuuuper-hacky WASI polyfill I used to get SpiderMonkey running the spidermonkey.wasm benchmark from sightglass (which was forked off from a smaller polyfill that @alexcrichton shared with me, thank you).

<details>

const ITERS = 100;

////////////////////////////////////////////////////////////////////////////////

class TextEncoder {
  constructor(enc) {
    if (enc != "utf-8") {
      throw new Error("FITZGEN: unsupported encoding: " + enc);
    }
  }
  encode(s, n) {
    let buf = new Uint8Array(s.length);
    for (let i = 0; i < Math.min(s.length, n); i++) {
      buf[i] = s.charCodeAt(i); // lol
    }
    return buf;
  }
};

class TextDecoder {
  constructor(enc) {
    if (enc != "utf-8") {
      throw new Error("FITZGEN: unsupported encoding: " + enc);
    }
  }
  decode(buf) {
    let buf8 = new Uint8Array(buf);
    let s = "";
    for (let i = 0; i < buf8.length; i++) {
      s += String.fromCharCode(buf8[i]); // lol
    }
    return s;
  }
};

////////////////////////////////////////////////////////////////////////////////

async function ionCompile(wasm) {
  let module = await WebAssembly.compile(wasm);
  while (!wasmHasTier2CompilationCompleted(module)) {
    sleep(1);
  }
  return module;
}

function unimplemented(name) {
  return function (...args) {
    throw new Error(name + " is unimplemented! args = " + args);
  }
}

class ProcExitError extends Error {
  constructor(code) {
    super("Program exited with code " + code); // super() must run before `this` is touched
    this.code = code;
  }
  toString() {
    return "ProcExitError: " + this.message
  }
}

function trace(obj) {
  let proxy = {};
  for (let key of Object.keys(obj)) {
    if (typeof obj[key] != "function") {
      proxy[key] = obj[key];
      continue;
    }
    proxy[key] = function (...args) {
      print("TRACE: " + key + "(" + args + ")");
      let ret = obj[key](...args);
      print("TRACE:   -> " + ret);
      return ret;
    };
  }
  return proxy;
}

////////////////////////////////////////////////////////////////////////////////

async function main() {
  let smWasm = os.file.readFile("/home/nick/sightglass/benchmarks/spidermonkey/benchmark.wasm", "binary");
  let smModule = await ionCompile(smWasm);

  for (let i = 0; i < ITERS; i++) {
    let args = ["js"];
    let mem = null;
    let start = null;
    let unread = true;

    let smInstance = await WebAssembly.instantiate(smModule, {
      bench: {
        start: function () {
          start = monotonicNow()
        },
        end: function () {
          let end = monotonicNow();
          print("ITER: " + (end - start));
        }
      },
      wasi_snapshot_preview1: {
        random_get: function(a, b) {
          while (b > 0) {
            (new DataView(mem.buffer)).setInt8(a, 1, true);
            b -= 1;
            a += 1;
          }
          return 0;
        },
        args_get: function(a, b) {
          for (let arg of args) {
            (new DataView(mem.buffer)).setInt32(a, b, true);
            a += 4;
            for (let c in arg) {
              (new DataView(mem.buffer)).setInt8(b, arg.charCodeAt(c), true);
              b += 1;
            }
            (new DataView(mem.buffer)).setInt8(b, 0, true);
            b += 1;
          }
          return 0;
        },
        args_sizes_get: function(a, b) {
          let len = 0;
          for (let arg of args) {
            len += arg.length + 1;
          }
          (new DataView(mem.buffer)).setInt32(a, args.length, true);
          (new DataView(mem.buffer)).setInt32(b, len, true);
          return 0;
        },
        clock_res_get: function () { return 1; },
        clock_time_get: function(a, b, c) {
          const now = Math.round(performance.now() * 1000000);
          (new DataView(mem.buffer)).setBigInt64(c, BigInt(now), true);
          return 0;
        },
        fd_filestat_get: function() { throw new Error('fd_filestat_get'); },
        fd_read: function(fd, iovecs_ptr, iovecs_len, out_ptr) {
          let mem8 = new Uint8Array(mem.buffer);
          switch (fd) {
          case 4:
            let data = os.file.readFile("/home/nick/sightglass/benchmarks/spidermonkey/default.input.md");
            let k = 0;
            for (let i = 0; i < iovecs_len && k < data.length && unread; i++) {
              let ptr = (new DataView(mem.buffer)).getUint32(iovecs_ptr + i * 8, true);
              let len = (new DataView(mem.buffer)).getUint32(iovecs_ptr + i * 8 + 4, true);
              for (let j = 0; j < len && k < data.length; j++) {
                mem8[ptr + j] = data.charCodeAt(k++);
              }
            }
            unread = false;
            (new DataView(mem.buffer)).setUint32(out_ptr, k, true);
            return 0;
          default:
            return 8;
          }
        },
        fd_seek: function(fd, offset, whence, out_ptr) {
          switch (fd) {
          case 4:
            let len = os.file.readFile("/home/nick/sightglass/benchmarks/spidermonkey/default.input.md").length;
            (new DataView(mem.buffer)).setBigUint64(out_ptr, BigInt(len), true);
            return 0;
          default:
            return 8;
          }
        },
        fd_write: function(a, b, c, d) {
          let s = '';
          let total = 0;
          while (c > 0) {
            let base = (new DataView(mem.buffer)).getInt32(b, true);
            let len = (new DataView(mem.buffer)).getInt32(b + 4, true);
            b += 8;
            c -= 1;

            while (len > 0) {
              let c = new Uint8Array(mem.buffer)[base]
              s += String.fromCharCode(c);
              len -= 1;
              base += 1;
              total += 1;
            }
          }
          // print("fd_write(" + a + "): " + s.trimEnd());
          (new DataView(mem.buffer)).setInt32(d, total, true);
          return 0;
        },
        fd_fdstat_set_flags: function() { throw new Error('fd_fdstat_set_flags'); },
        path_filestat_get: function() { throw new Error('path_filestat_get'); },
        path_open: function(fd, dirflags, path_ptr, path_len, oflags, fs_rights_base, fs_rights_inheriting, fdflags, out_ptr) {
          let buf = new Uint8Array(path_len);
          let mem8 = new Uint8Array(mem.buffer);
          for (let i = 0; i < path_len; i++) {
            buf[i] = mem8[path_ptr + i];
          }
          let path = (new TextDecoder('utf-8')).decode(buf);
          switch (path) {
          case "default.input.md":
            (new DataView(mem.buffer)).setInt32(out_ptr, 4, true);
            return 0;
          default:
            print('denying path_open(' + path + ')');
            return 2;
          }
        },
        path_remove_directory: function() { throw new Error('path_remove_directory'); },
        path_unlink_file: function() { throw new Error('path_unlink_file'); },
        sched_yield: function() { throw new Error('sched_yield'); },
        environ_get: function() { return 0; },
        environ_sizes_get: function(a, b) {
          (new DataView(mem.buffer)).setInt32(a, 0, true);
          (new DataView(mem.buffer)).setInt32(b, 0, true);
          return 0;
        },
        fd_close: function(fd) {
          return 0;
        },
        fd_fdstat_get: function(fd, out_ptr) {
          (new DataView(mem.buffer)).setInt32(out_ptr, 0);
          // type
          switch (fd) {
          case 3:
            (new DataView(mem.buffer)).setInt32(out_ptr + 4, 3, true);
            break;
          case 4:
            (new DataView(mem.buffer)).setInt32(out_ptr + 4, 4, true);
            break;
          default:
            (new DataView(mem.buffer)).setInt32(out_ptr + 4, 0, true);
            break;
          }
          (new DataView(mem.buffer)).setInt32(out_ptr + 8, 0, true);
          (new DataView(mem.buffer)).setInt32(out_ptr + 12, 0, true);
          (new DataView(mem.buffer)).setInt32(out_ptr + 16, 0, true);
          (new DataView(mem.buffer)).setInt32(out_ptr + 20, 0, true);
          return 0;
        },
        fd_prestat_get: function(a, b) {
          // case for preopened "."
          if (a == 3) {
            // discriminant for directory
            (new DataView(mem.buffer)).setInt32(b, 0, true);
            // one entry in directory
            (new DataView(mem.buffer)).setInt32(b + 4, 1, true);
            return 0;
          }
          return 8;
        },
        fd_prestat_dir_name: function(fd, ptr, len) {
          let buf = (new TextEncoder('utf-8')).encode(".", 1);
          let mem8 = new Uint8Array(mem.buffer);
          for (let i = 0; i < Math.min(buf.length, len); i++) {
            mem8[ptr + i] = buf[i];
          }
          mem8[ptr + Math.min(buf.length, len)] = 0; // NUL-terminate relative to ptr, not absolute address 0
          return 0;
        },
        proc_exit: function(code) { throw new ProcExitError(code); },
      },
    });

    mem = smInstance.exports.memory;

    try {
      smInstance.exports._start();
    } catch (e) {
      if ((e instanceof ProcExitError) && e.code == 0) {
        continue;
      }
      throw e;
    }
  }
  print("Okay!");
}

main().catch(function (e) {
  print("====== ERROR!!! ======");
  print(e);
  print(e.stack);
});

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 13:55):

fitzgen commented on issue #6094:

I talked with @cfallin about these optimizations a little bit yesterday and he pointed out that the select_spectre_guard-elimination pass is probably not safe because it relies on certain guards _dominating_ other guards. This is a control-flow property, but we can only rely on data-flow properties when optimizing Spectre guards. Speculative execution can guess (incorrectly) that control will transfer to a dominated block (where we optimized away a Spectre guard) from some code path that doesn't include the dominator block (where the remaining Spectre guard lives). This guess will ultimately fail, but until then the speculative execution gets to execute loads/stores that weren't Spectre-guarded, modifying micro-architectural state (e.g. caches) that could then leak data via timing side channels.

(Thinking out loud a bit here:) It is unclear to me whether a local (within the same basic block) version of the select_spectre_guard-elimination pass would be safe. This would still give us nearly all the benefit for the examples I was previously digging into in spidermonkey.wasm. But this is still a control-flow property, not a data-flow property. Speculative execution could technically still guess that control would transfer into the middle of a block (even though that would be impossible at runtime, assuming we maintain control-flow integrity -- and if we aren't, we have bigger problems), but that seems like it would never actually happen in practice? I'm not sure I really want to make that argument and stand by it...

All that said, I think the basic constraint-elimination pass would still be beneficial here, but we wouldn't get as much as I originally hoped for in the OP.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 15:23):

cfallin commented on issue #6094:

Thanks for summarizing all this @fitzgen; sorry for not getting to it sooner!

I don't think a "remove dominated checks within a basic block" optimization is safe, either, for the same reasons; I want to try to give a little intuition into how the out-of-order execution works to show why and help guide any further discussions.

The main idea behind OoO is restricted dataflow execution: "restricted" because it operates like a sliding window over the program's execution trace, and "dataflow execution" because within that window, instructions are decoded into a true dependency graph and can execute whenever their inputs are ready.

You can think of the CPU as having a frontend that predicts some trace (at each branch, picks a direction; at indirects, predicts a target; at returns, predicts based on an internal predictor callstack); that output is a linear stream of instructions. This stream goes through a "rename" stage that computes dependencies and does a sort of SSA-like thing, assigning new physical registers to instructions' written dests; and an "allocate" stage that puts the instructions into a "reorder buffer" (ROB) as well as the scheduling data structures (think "ready queues" and such). Then there is a "retirement stage" that crawls the reorder buffer from a tail pointer, and at any cycle can take some number of the oldest instructions in-flight and "commit" or "retire".

Speculation is a behavior that arises out of the cooperation of the frontend and backend. The frontend is predicting some path; the OoO engine then treats this linear stream as truth; but if a branch resolves differently than it was predicted, or an instruction traps, we need to flush and resteer. In the simplest designs, this happens when the instruction reaches retirement (is the oldest instruction): we need a consistent architectural machine state to restart from, and the most natural way to get that is to let retirement put things back in order. More complex approaches use periodic checkpoints but the effect is the same -- the speculation recovery is delayed a bit, and younger instructions (those "beyond" the misspeculation or trap) were still in the dataflow window for however long and could have executed.

So if we have a program something like:

block0:
jnz oob // if out-of-bounds, go elsewhere
r0 <- cmove0 // check conditions for unsafe_load0
unsafe_load0 r0
r1 <- cmove1 // check conditions for unsafe_load1
unsafe_load1 r1

and let's say the branch is mispredicted as falling through (predicted not-OOB, but actually is). Then the reorder buffer will contain those five instructions in order and execution obeys only true dataflow dependencies. Even if we can resolve the branch relatively early, misspeculation recovery is delayed somewhat (possibly until retirement in simple designs, and imagine there are 100 instructions prior to these in the window, so it could be a while). If it weren't for the cmoves, the loads could execute and there's our side-channel. But unsafe_load0 can't execute until cmove0 produces its result, and likewise unsafe_load1, and even though we predicted this wrong path, the actual dataflow is correct, so the cmoves will pick a NULL pointer and we don't leak anything.

Now say we have

jnz oob
r0 <- cmove0
unsafe_load r0
unsafe_load r1 // assume first inst trapping will guard this one

we get these instructions into the window, and the jnz might be marked in the ROB as mispredicted (resteer at retirement), and the unsafe_load r0 might be marked as faulting (resteer to pagefault handler when it reaches retirement; this will never happen though because the jnz recovery happens first), but hey look, this unsafe_load r1 uses a register that has been available for a while; nothing is stopping us from running it out-of-order (just as it says on the tin; out-of-order execution to the rescue!). And there's our side-channel leak again.

The basic intuition behind the Spectre mitigations, and guidance from all the CPU vendors, is that processors can speculate control-flow and execute instructions in arbitrary order (and this is fundamental to the 10x perf per clock cycle we've gotten in hardware in the last few decades), but they won't speculate dataflow. (Value speculation is a thing in the uarch research literature but AFAIK no one has ever shipped it; and now that Spectre is a thing, likely no one will?). This is why the cmove thing works. Fences also work, because a fence instruction can't enter the OoO window until every older instruction retires (the window is empty). But of course that window flush is quite expensive.

Given that, I think we need to obey the invariant that some cmove based on an OOB condition exists in the computation of every Wasm address we load. It's possible we could do a multi-stage thing: if we access p and later p - 16, the computation of p - 16 could use the cmove-guarded p as input. So something like:

v2 = icmp ge v1, bound
v3 = select_spectre_guard v2, v1, nullptr
v4 = load v3+0
v5 = load v3-16

But I suspect that's the best we can do, unless we start playing with fences. My intuition is that this might still be OK if it means we can hoist the one cmove out of a loop -- basically we have a single choke-point where we can "turn off" all the address dataflow. But we'd have to think through it a bit further...

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 15:39):

cfallin commented on issue #6094:

Ah, the tidbit of intuition I forgot to state explicitly as well, in case it helps: we can think of faulting loads/stores as a kind of "branch" as well, given that we use the same recovery mechanism as for mispredicted branches. So instructions beyond one that must fault may still be executed speculatively, because we either haven't executed the to-fault inst yet, or we have but our recovery mechanism hasn't yet let us flush the window and redirect to the trap handler. So the "make the load fault" mitigation protects that load, because it won't result to a value that lets its dependent instructions run; but it doesn't protect younger instructions, because they can run before or after (of course with architectural effects only committed once earlier insts resolve successfully, so, never if earlier fault).

Other tidbit of intuition that might help: it's more accurate to think of an OoO CPU as executing every instruction speculatively, rather than thinking of speculation as a thing that happens when the branch predictor meets a branch. The latter is certainly a form of speculation, which we need to produce the linear stream of instructions into the backend. But strictly speaking every execution of a node in the dataflow graph in the backend when that node is not the oldest is speculation that some condition (trap, misspeculated memory ordering or aliasing (st-to-ld-fwd) condition; etc) will not cause a flush. Effects of this speculative early execution are kept in the ROB (or equivalent structures alongside, such as the store buffer) until retirement.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 17:38):

fitzgen edited a comment on issue #6094:

If we add a knob for tuning the size of the null guard page, and both offset1 and offset2 were within the null guard page and offset1 > offset2, then with the help of the basic constraint-elimination pass, we could do this series of rewrites

Original input, what cranelift-wasm generates now:

addr1_oob = index + offset1 > heap_bound
addr1 = spectre_select_guard(addr1_oob, 0, heap_base + index + offset1)
addr2_oob = index + offset2 > heap_bound
addr2 = spectre_select_guard(addr2_oob, 0, heap_base + index + offset2)

Pull offsets out the spectre guard because they are smaller than the null guard page (or just modify cranelift-wasm to emit them like this in the first place, since that is the location where we know the heap configuration):

addr1_oob = index + offset1 > heap_bound
addr1 = spectre_select_guard(addr1_oob, 0, heap_base + index) + offset1
addr2_oob = index + offset2 > heap_bound
addr2 = spectre_select_guard(addr2_oob, 0, heap_base + index) + offset2

Basic constraint-elimination pass makes addr2_oob an alias of addr1_oob because offset1 > offset2:

addr1_oob = index + offset1 > heap_bound
addr1 = spectre_select_guard(addr1_oob, 0, heap_base + index) + offset1
addr2 = spectre_select_guard(addr1_oob, 0, heap_base + index) + offset2

"GVN" (via e-graphs) the spectre_select_guard operation:

addr1_oob = index + offset1 > heap_bound
base_and_index = spectre_select_guard(addr1_oob, 0, heap_base + index)
addr1 = base_and_index + offset1
addr2 = base_and_index + offset2

Now we only have a single spectre_select_guard! And we kept a spectre_select_guard in the DFG for all addresses.

@cfallin, how does that sound?
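The end state of the rewrite sequence above can be sanity-checked with a small executable model (hypothetical constants and a plain conditional standing in for select_spectre_guard; this is not Cranelift code):

```python
NULL_GUARD_SIZE = 4096  # hypothetical size knob for the page-zero guard region

def spectre_select_guard(oob: bool, on_oob: int, addr: int) -> int:
    """Data-dependent select: yields a null base when out of bounds."""
    return on_oob if oob else addr

def two_accesses(heap_base, heap_bound, index, offset1, offset2):
    assert offset2 < offset1 < NULL_GUARD_SIZE
    # One check against the *larger* offset; its guarded base serves both
    # accesses, since addr2_oob implies addr1_oob when offset1 > offset2.
    addr1_oob = index + offset1 > heap_bound
    base_and_index = spectre_select_guard(addr1_oob, 0, heap_base + index)
    return base_and_index + offset1, base_and_index + offset2

# In-bounds case: both addresses are real heap addresses.
a1, a2 = two_accesses(heap_base=0x100000, heap_bound=0x10000,
                      index=0x200, offset1=16, offset2=8)
assert (a1, a2) == (0x100210, 0x100208)

# Out-of-bounds case: both addresses land inside the null guard page, so
# even a speculatively executed access cannot touch real heap data.
a1, a2 = two_accesses(0x100000, 0x10000, 0x10000, 16, 8)
assert a1 < NULL_GUARD_SIZE and a2 < NULL_GUARD_SIZE
```

Note the conservatism: when addr1 is out of bounds but addr2 would have been in bounds, both get a null base, which is fine because the access through addr1 traps first.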

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2023 at 20:39):

cfallin commented on issue #6094:

I think this looks workable, yep. It's worth defining here precisely what the "replace addr2_oob with addr1_oob" condition is because it's a little nonstandard -- it isn't equivalence but rather implication (addr2_oob --> addr1_oob). It thus wouldn't be a valid substitution on its own, sans any control-flow context (if addr1 and addr2 were just value outputs of the program, we'd be changing addr2 sometimes); but here it is fine because the access of addr1 dominates the access of addr2. Said another way: altering the computed value is fine because such alteration is only observable on a path past a trap.
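The implication (rather than equivalence) point above can be checked exhaustively on small numbers (the parameters below are invented for illustration):

```python
# With offset1 > offset2, addr2_oob implies addr1_oob, but not the
# converse, so replacing addr2_oob with addr1_oob is an implication-based
# substitution, not an equivalence.
heap_bound = 16
offset1, offset2 = 8, 4

for index in range(64):
    addr1_oob = index + offset1 > heap_bound
    addr2_oob = index + offset2 > heap_bound
    # Implication: whenever addr2 is out of bounds, addr1 is too.
    assert (not addr2_oob) or addr1_oob

# The converse fails: at index = 12, addr1 is out of bounds but addr2
# is not, so the substitution alters addr2 on that path -- which is
# only observable past the trapping access of addr1.
assert (12 + offset1 > heap_bound) and not (12 + offset2 > heap_bound)
```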

view this post on Zulip Wasmtime GitHub notifications bot (Mar 05 2024 at 19:47):

jameysharp commented on issue #6094:

Could cranelift-wasm track enough information to fully do this transformation itself, instead of implementing a generic optimization in Cranelift?

I think that because wasm only has reducible control flow, the dominator tree is implied by the control-flow instructions in the input wasm. So I think we can keep track of which linear-memory accesses dominate the current instruction while making a single pass over the input program.

If so, then we could maintain a scoped hash map keyed on CLIF Values which have been used as the index to some dominating memory access. (Since local.get and various stack manipulations are all rewritten to SSA values during this phase, this should suffice to normalize a variety of weird patterns in the original wasm.)

These keys would map to a pair of the maximum offset checked so far, and the select_spectre_guard result that was used in that check. If the current access is less than or equal to that offset then the result can be reused. Otherwise we need another bounds-check but can update the map with the new larger offset and new bounds-checked base.

That's pretty simple compared to doing it generically in Cranelift, right?
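A minimal sketch of the single-pass scheme described above, assuming a map keyed by index value (all names are invented; the scoping needed at control-flow joins, and the scoped-hash-map pops it requires, are elided):

```python
class BoundsCheckDedup:
    """Track, per index value, the largest offset already bounds-checked
    and the guarded base that check produced (a stand-in for the
    select_spectre_guard result)."""

    def __init__(self, heap_base, heap_bound):
        self.heap_base = heap_base
        self.heap_bound = heap_bound
        self.checked = {}        # index value -> (max offset checked, guarded base)
        self.emitted_checks = 0  # how many bounds checks we actually emitted

    def access(self, index, offset):
        entry = self.checked.get(index)
        if entry is None or offset > entry[0]:
            # No dominating check covers this offset: emit a fresh bounds
            # check and record the new, larger checked offset.
            oob = index + offset > self.heap_bound
            base = 0 if oob else self.heap_base + index  # models the spectre select
            self.emitted_checks += 1
            entry = (offset, base)
            self.checked[index] = entry
        # Reuse the dominating check's guarded base; adding the offset is
        # the only per-access work once the check is deduplicated.
        return entry[1] + offset
```

Under these assumptions, a second access at the same index with a smaller offset reuses the existing check, while a larger offset forces a new one and widens the recorded bound.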

view this post on Zulip Wasmtime GitHub notifications bot (Mar 05 2024 at 22:54):

fitzgen commented on issue #6094:

I think it is possible, yes. The idea of fusing this optimization with an existing pass where we have domination "for free" is definitely tempting.

One hardship I expect we would encounter is that this runs before the phi-elimination pass, and therefore we would likely leave some optimizations on the table, especially around Wasm locals and joining control flow.

That's pretty simple compared to doing it generically in Cranelift, right?

Perhaps?

I don't think the passes I outlined above are too complicated; it's mostly just a matter of writing them such that they can be fused with each other and with the egraphs pass (similar to how the alias analysis is written). As long as that is done, then it seems fairly straightforward to me: two passes that each do a single shape of optimization.

In contrast, in cranelift-wasm we are already doing a bunch of translation and such, and adding another optimization responsibility in addition to that feels a bit like conflating concerns.


Last updated: Dec 23 2024 at 13:07 UTC