Stream: git-wasmtime

Topic: wasmtime / PR #5004 x86_64: align loop headers to 64 bytes


view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 10:40):

pepyakin requested akirilov-arm for a review on PR #5004.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 10:40):

pepyakin requested cfallin for a review on PR #5004.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 10:40):

pepyakin opened PR #5004 from pep-align-loops to main:

Closes #4883

This PR introduces alignment for the loop headers to 64 bytes on x86_64. I benchmarked this on AMD Zen 3 x5950 and it produced the following results.

<details>

instantiation :: cycles :: benchmarks/shootout-gimli/benchmark.wasm

  Δ = 26281.32 ± 4598.52 (confidence = 99%)

  pep-align-loops.so is 1.44x to 1.63x faster than main.so!

  [53890 75407.92 118592] main.so
  [36448 49126.60 87992] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-seqhash/benchmark.wasm

  Δ = 34856.12 ± 3933.07 (confidence = 99%)

  pep-align-loops.so is 1.46x to 1.57x faster than main.so!

  [82756 102507.28 150960] main.so
  [50592 67651.16 106658] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-keccak/benchmark.wasm

  Δ = 28160.16 ± 5341.58 (confidence = 99%)

  pep-align-loops.so is 1.40x to 1.59x faster than main.so!

  [59942 85105.40 132532] main.so
  [42262 56945.24 91494] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-heapsort/benchmark.wasm

  Δ = 30212.40 ± 4510.98 (confidence = 99%)

  pep-align-loops.so is 1.41x to 1.56x faster than main.so!

  [68714 92486.12 158032] main.so
  [47056 62273.72 97852] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-xchacha20/benchmark.wasm

  Δ = 26062.70 ± 4747.21 (confidence = 99%)

  pep-align-loops.so is 1.39x to 1.56x faster than main.so!

  [59262 81086.94 130798] main.so
  [40392 55024.24 87312] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-fib2/benchmark.wasm

  Δ = 34163.20 ± 3949.15 (confidence = 99%)

  pep-align-loops.so is 1.42x to 1.53x faster than main.so!

  [83232 106564.84 146948] main.so
  [55590 72401.64 104278] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-sieve/benchmark.wasm

  Δ = 30320.18 ± 4710.86 (confidence = 99%)

  pep-align-loops.so is 1.39x to 1.53x faster than main.so!

  [73882 96454.94 189584] main.so
  [50558 66134.76 110466] pep-align-loops.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  Δ = 26868.16 ± 4298.60 (confidence = 99%)

  pep-align-loops.so is 1.38x to 1.53x faster than main.so!

  [63682 85859.86 126990] main.so
  [43044 58991.70 83606] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-memmove/benchmark.wasm

  Δ = 25414.32 ± 4195.94 (confidence = 99%)

  pep-align-loops.so is 1.37x to 1.52x faster than main.so!

  [57460 82822.64 127364] main.so
  [46376 57408.32 98498] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-ed25519/benchmark.wasm

  Δ = 39825.22 ± 5512.82 (confidence = 99%)

  pep-align-loops.so is 1.38x to 1.50x faster than main.so!

  [93432 130104.74 168538] main.so
  [57222 90279.52 166124] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-minicsv/benchmark.wasm

  Δ = 24848.90 ± 3936.23 (confidence = 99%)

  pep-align-loops.so is 1.37x to 1.51x faster than main.so!

  [60656 81573.48 123760] main.so
  [41990 56724.58 90338] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-random/benchmark.wasm

  Δ = 26969.48 ± 4123.30 (confidence = 99%)

  pep-align-loops.so is 1.35x to 1.48x faster than main.so!

  [67694 91377.04 137054] main.so
  [46206 64407.56 93228] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-nestedloop/benchmark.wasm

  Δ = 25948.46 ± 4511.46 (confidence = 99%)

  pep-align-loops.so is 1.34x to 1.48x faster than main.so!

  [69088 89657.66 136612] main.so
  [46784 63709.20 90916] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-ctype/benchmark.wasm

  Δ = 27127.58 ± 4332.85 (confidence = 99%)

  pep-align-loops.so is 1.33x to 1.45x faster than main.so!

  [76228 96649.76 169218] main.so
  [55964 69522.18 109922] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-xblabla20/benchmark.wasm

  Δ = 20529.54 ± 4116.39 (confidence = 99%)

  pep-align-loops.so is 1.29x to 1.44x faster than main.so!

  [58446 76712.84 112710] main.so
  [42534 56183.30 83572] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-matrix/benchmark.wasm

  Δ = 24574.18 ± 4205.02 (confidence = 99%)

  pep-align-loops.so is 1.30x to 1.43x faster than main.so!

  [73576 92209.02 127092] main.so
  [50830 67634.84 120972] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-switch/benchmark.wasm

  Δ = 28385.24 ± 4903.24 (confidence = 99%)

  pep-align-loops.so is 1.30x to 1.42x faster than main.so!

  [78370 107118.02 161194] main.so
  [61778 78732.78 114172] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-ratelimit/benchmark.wasm

  Δ = 23482.78 ± 4058.42 (confidence = 99%)

  pep-align-loops.so is 1.29x to 1.42x faster than main.so!

  [66096 89596.12 135694] main.so
  [48586 66113.34 98260] pep-align-loops.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 28377.42 ± 5294.94 (confidence = 99%)

  pep-align-loops.so is 1.29x to 1.42x faster than main.so!

  [85680 108899.96 149872] main.so
  [63138 80522.54 135082] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-base64/benchmark.wasm

  Δ = 22593.34 ± 4545.94 (confidence = 99%)

  pep-align-loops.so is 1.26x to 1.40x faster than main.so!

  [71060 91037.72 146880] main.so
  [54060 68444.38 107202] pep-align-loops.so

instantiation :: cycles :: benchmarks/noop/benchmark.wasm

  Δ = 17670.82 ± 4192.78 (confidence = 99%)

  pep-align-loops.so is 1.25x to 1.40x faster than main.so!

  [59908 72167.72 107202] main.so
  [43724 54496.90 93670] pep-align-loops.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 35288.26 ± 5934.66 (confidence = 99%)

  pep-align-loops.so is 1.26x to 1.37x faster than main.so!

  [102340 146073.52 194038] main.so
  [80784 110785.26 180472] pep-align-loops.so

instantiation :: cycles :: benchmarks/shootout-ackermann/benchmark.wasm

  Δ = 21964.34 ± 4552.01 (confidence = 99%)

  pep-align-loops.so is 1.24x to 1.36x faster than main.so!

  [71502 95173.48 139842] main.so
  [54060 73209.14 108086] pep-align-loops.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  Δ = 30523.16 ± 5684.40 (confidence = 99%)

  pep-align-loops.so is 1.21x to 1.31x faster than main.so!

  [127704 147993.16 218076] main.so
  [85170 117470.00 152490] pep-align-loops.so

instantiation :: cycles :: benchmarks/meshoptimizer/benchmark.wasm

  Δ = 27309.14 ± 3876.14 (confidence = 99%)

  pep-align-loops.so is 1.19x to 1.25x faster than main.so!

  [131002 152892.56 180472] main.so
  [89828 125583.42 153068] pep-align-loops.so

execution :: cycles :: benchmarks/shootout-sieve/benchmark.wasm

  Δ = 160632987.42 ± 5972804.43 (confidence = 99%)

  pep-align-loops.so is 1.17x to 1.18x faster than main.so!

  [1039973776 1079490706.44 1137124492] main.so
  [884091222 918857719.02 940632032] pep-align-loops.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22790.88 ± 7872.59 (confidence = 99%)

  pep-align-loops.so is 1.09x to 1.19x faster than main.so!

  [151266 187293.76 236504] main.so
  [115294 164502.88 223210] pep-align-loops.so

execution :: cycles :: benchmarks/shootout-xchacha20/benchmark.wasm

  Δ = 750342.94 ± 343121.40 (confidence = 99%)

  main.so is 1.07x to 1.18x faster than pep-align-loops.so!

  [4639198 5917878.50 8029610] main.so
  [5627170 6668221.44 9286080] pep-align-loops.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 40131994.18 ± 6247250.64 (confidence = 99%)

  pep-align-loops.so is 1.08x to 1.11x faster than main.so!

  [447057432 471406190.64 514834086] main.so
  [398536440 431274196.46 494554072] pep-align-loops.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 17023452.18 ± 6101055.67 (confidence = 99%)

  pep-align-loops.so is 1.05x to 1.11x faster than main.so!

  [197441876 225073129.96 300179132] main.so
  [186926662 208049677.78 259271080] pep-align-loops.so

execution :: cycles :: benchmarks/shootout-fib2/benchmark.wasm

  Δ = 229041837.76 ± 62294164.86 (confidence = 99%)

  main.so is 1.06x to 1.10x faster than pep-align-loops.so!

  [2760966328 2877473442.90 2963631080] main.so
  [3000497042 3106515280.66 5393578372] pep-align-loops.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 680581047.42 ± 58488688.83 (confidence = 99%)

  pep-align-loops.so is 1.06x to 1.07x faster than main.so!

  [10586670916 10705050364.66 10875778390] main.so
  [9403424098 10024469317.24 10473503548] pep-align-loops.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 14350847.50 ± 5439729.54 (confidence = 99%)

  pep-align-loops.so is 1.04x to 1.09x faster than main.so!

  [223399074 234169891.16 298711012] main.so
  [203019610 219819043.66 281894306] pep-align-loops.so

execution :: cycles :: benchmarks/shootout-switch/benchmark.wasm

  Δ = 6943520.80 ± 883372.86 (confidence = 99%)

  main.so is 1.06x to 1.07x faster than pep-align-loops.so!

  [103285506 107387684.20 112425352] main.so
  [108359326 114331205.00 123381886] pep-align-loops.so

compilation :: cycles :: benchmarks/shootout-ackermann/benchmark.wasm

  Δ = 4700652.32 ± 4147779.68 (confidence = 99%)

  pep-align-loops.so is 1.01x to 1.11x faster than main.so!

  [78591068 87546072.66 131549298] main.so
  [74343754 82845420.34 119083402] pep-align-loops.so

compilation :: cycles :: benchmarks/shootout-ed25519/benchmark.wasm

  Δ = 32031030.76 ± 10733541.38 (confidence = 99%)

  pep-align-loops.so is 1.04x to 1.07x faster than main.so!


[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:06):

pepyakin updated PR #5004 from pep-align-loops to main.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:26):

bjorn3 created PR review comment:

Could you disable this when optimizing for size? Also won't this cause a large blowup when there are a lot of tiny loops in a function with multiple being able to fit in a single cacheline?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:26):

bjorn3 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:27):

bjorn3 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:27):

bjorn3 created PR review comment:

Also maybe disable it for cold blocks?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:29):

bjorn3 submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 04 2022 at 11:29):

bjorn3 created PR review comment:

Would it make sense to merge cold_blocks and loop_headers into a single SecondaryMap with a bitflag as element? Or are both still too infrequently used such that it causes a large increase in memory usage?

view this post on Zulip Wasmtime GitHub notifications bot (May 21 2024 at 22:18):

lpereira commented on PR #5004:

Doing some necromancing here, but I found a nice article detailing how the .NET JIT is padding their loops and adopting a similar strategy might be useful. It's quite a bit more involved than this PR.

view this post on Zulip Wasmtime GitHub notifications bot (May 21 2024 at 22:34):

cfallin commented on PR #5004:

@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.

The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time) so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.

view this post on Zulip Wasmtime GitHub notifications bot (May 21 2024 at 23:44):

cfallin edited a comment on PR #5004:

@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.

The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time) so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.

(EDIT: to clarify, I'm hoping to learn more about e.g. "block-weight meets a certain weight threshold" in that article: is a block-weight static or dynamic, and how is it measured?)

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2024 at 00:27):

lpereira commented on PR #5004:

@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate?

I don't know!

What I know, however, is that CoreCLR has a tiered JIT: AOT compilation is possible and usually performed to speed up startup, but the compiler in AOT mode does as much work optimizing code as Winch, IIRC. Once it detects that a function needs a second look (and I don't know how it does, maybe a per-function counter and some tiny code in the epilogue that decrements and calls into a "jit this harder" function that does that, then patches the caller to become a trampoline? I really don't know; it's probably something a whole lot more clever than that), it then spends more time recompiling things. This is pretty complicated to implement, and wouldn't help with the pure AOT scenarios we have, of course (unless we had some kind of PGO thing).

I'll spend some time looking through CoreCLR later to figure out exactly what they do and how they do it. I'll look at how GCC and Clang does it too.

The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments).

Right. I've given this some thought, and we could experiment with some heuristics, based on:

It's all a bit fiddly, of course, and we'd need to empirically adjust these scores, but I think it's a doable strategy to identify hot loops.

Then, when lowering, at least on x86, two things need to happen if a loop is considered hot:


Last updated: Dec 23 2024 at 12:05 UTC