pepyakin requested akirilov-arm for a review on PR #5004.
pepyakin requested cfallin for a review on PR #5004.
pepyakin opened PR #5004 from pep-align-loops
to main
:
Closes #4883
This PR introduces alignment for the loop headers to 64 bytes on x86_64. I benchmarked this on AMD Zen 3 x5950 and it produced the following results.
<details>
instantiation :: cycles :: benchmarks/shootout-gimli/benchmark.wasm Δ = 26281.32 ± 4598.52 (confidence = 99%) pep-align-loops.so is 1.44x to 1.63x faster than main.so! [53890 75407.92 118592] main.so [36448 49126.60 87992] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-seqhash/benchmark.wasm Δ = 34856.12 ± 3933.07 (confidence = 99%) pep-align-loops.so is 1.46x to 1.57x faster than main.so! [82756 102507.28 150960] main.so [50592 67651.16 106658] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-keccak/benchmark.wasm Δ = 28160.16 ± 5341.58 (confidence = 99%) pep-align-loops.so is 1.40x to 1.59x faster than main.so! [59942 85105.40 132532] main.so [42262 56945.24 91494] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-heapsort/benchmark.wasm Δ = 30212.40 ± 4510.98 (confidence = 99%) pep-align-loops.so is 1.41x to 1.56x faster than main.so! [68714 92486.12 158032] main.so [47056 62273.72 97852] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-xchacha20/benchmark.wasm Δ = 26062.70 ± 4747.21 (confidence = 99%) pep-align-loops.so is 1.39x to 1.56x faster than main.so! [59262 81086.94 130798] main.so [40392 55024.24 87312] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-fib2/benchmark.wasm Δ = 34163.20 ± 3949.15 (confidence = 99%) pep-align-loops.so is 1.42x to 1.53x faster than main.so! [83232 106564.84 146948] main.so [55590 72401.64 104278] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-sieve/benchmark.wasm Δ = 30320.18 ± 4710.86 (confidence = 99%) pep-align-loops.so is 1.39x to 1.53x faster than main.so! [73882 96454.94 189584] main.so [50558 66134.76 110466] pep-align-loops.so instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm Δ = 26868.16 ± 4298.60 (confidence = 99%) pep-align-loops.so is 1.38x to 1.53x faster than main.so! [63682 85859.86 126990] main.so [43044 58991.70 83606] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-memmove/benchmark.wasm Δ = 25414.32 ± 4195.94 (confidence = 99%) pep-align-loops.so is 1.37x to 1.52x faster than main.so! [57460 82822.64 127364] main.so [46376 57408.32 98498] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-ed25519/benchmark.wasm Δ = 39825.22 ± 5512.82 (confidence = 99%) pep-align-loops.so is 1.38x to 1.50x faster than main.so! [93432 130104.74 168538] main.so [57222 90279.52 166124] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-minicsv/benchmark.wasm Δ = 24848.90 ± 3936.23 (confidence = 99%) pep-align-loops.so is 1.37x to 1.51x faster than main.so! [60656 81573.48 123760] main.so [41990 56724.58 90338] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-random/benchmark.wasm Δ = 26969.48 ± 4123.30 (confidence = 99%) pep-align-loops.so is 1.35x to 1.48x faster than main.so! [67694 91377.04 137054] main.so [46206 64407.56 93228] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-nestedloop/benchmark.wasm Δ = 25948.46 ± 4511.46 (confidence = 99%) pep-align-loops.so is 1.34x to 1.48x faster than main.so! [69088 89657.66 136612] main.so [46784 63709.20 90916] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-ctype/benchmark.wasm Δ = 27127.58 ± 4332.85 (confidence = 99%) pep-align-loops.so is 1.33x to 1.45x faster than main.so! [76228 96649.76 169218] main.so [55964 69522.18 109922] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-xblabla20/benchmark.wasm Δ = 20529.54 ± 4116.39 (confidence = 99%) pep-align-loops.so is 1.29x to 1.44x faster than main.so! [58446 76712.84 112710] main.so [42534 56183.30 83572] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-matrix/benchmark.wasm Δ = 24574.18 ± 4205.02 (confidence = 99%) pep-align-loops.so is 1.30x to 1.43x faster than main.so! [73576 92209.02 127092] main.so [50830 67634.84 120972] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-switch/benchmark.wasm Δ = 28385.24 ± 4903.24 (confidence = 99%) pep-align-loops.so is 1.30x to 1.42x faster than main.so! [78370 107118.02 161194] main.so [61778 78732.78 114172] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-ratelimit/benchmark.wasm Δ = 23482.78 ± 4058.42 (confidence = 99%) pep-align-loops.so is 1.29x to 1.42x faster than main.so! [66096 89596.12 135694] main.so [48586 66113.34 98260] pep-align-loops.so instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm Δ = 28377.42 ± 5294.94 (confidence = 99%) pep-align-loops.so is 1.29x to 1.42x faster than main.so! [85680 108899.96 149872] main.so [63138 80522.54 135082] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-base64/benchmark.wasm Δ = 22593.34 ± 4545.94 (confidence = 99%) pep-align-loops.so is 1.26x to 1.40x faster than main.so! [71060 91037.72 146880] main.so [54060 68444.38 107202] pep-align-loops.so instantiation :: cycles :: benchmarks/noop/benchmark.wasm Δ = 17670.82 ± 4192.78 (confidence = 99%) pep-align-loops.so is 1.25x to 1.40x faster than main.so! [59908 72167.72 107202] main.so [43724 54496.90 93670] pep-align-loops.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 35288.26 ± 5934.66 (confidence = 99%) pep-align-loops.so is 1.26x to 1.37x faster than main.so! [102340 146073.52 194038] main.so [80784 110785.26 180472] pep-align-loops.so instantiation :: cycles :: benchmarks/shootout-ackermann/benchmark.wasm Δ = 21964.34 ± 4552.01 (confidence = 99%) pep-align-loops.so is 1.24x to 1.36x faster than main.so! [71502 95173.48 139842] main.so [54060 73209.14 108086] pep-align-loops.so instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm Δ = 30523.16 ± 5684.40 (confidence = 99%) pep-align-loops.so is 1.21x to 1.31x faster than main.so! [127704 147993.16 218076] main.so [85170 117470.00 152490] pep-align-loops.so instantiation :: cycles :: benchmarks/meshoptimizer/benchmark.wasm Δ = 27309.14 ± 3876.14 (confidence = 99%) pep-align-loops.so is 1.19x to 1.25x faster than main.so! [131002 152892.56 180472] main.so [89828 125583.42 153068] pep-align-loops.so execution :: cycles :: benchmarks/shootout-sieve/benchmark.wasm Δ = 160632987.42 ± 5972804.43 (confidence = 99%) pep-align-loops.so is 1.17x to 1.18x faster than main.so! [1039973776 1079490706.44 1137124492] main.so [884091222 918857719.02 940632032] pep-align-loops.so instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 22790.88 ± 7872.59 (confidence = 99%) pep-align-loops.so is 1.09x to 1.19x faster than main.so! [151266 187293.76 236504] main.so [115294 164502.88 223210] pep-align-loops.so execution :: cycles :: benchmarks/shootout-xchacha20/benchmark.wasm Δ = 750342.94 ± 343121.40 (confidence = 99%) main.so is 1.07x to 1.18x faster than pep-align-loops.so! [4639198 5917878.50 8029610] main.so [5627170 6668221.44 9286080] pep-align-loops.so compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 40131994.18 ± 6247250.64 (confidence = 99%) pep-align-loops.so is 1.08x to 1.11x faster than main.so! [447057432 471406190.64 514834086] main.so [398536440 431274196.46 494554072] pep-align-loops.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 17023452.18 ± 6101055.67 (confidence = 99%) pep-align-loops.so is 1.05x to 1.11x faster than main.so! [197441876 225073129.96 300179132] main.so [186926662 208049677.78 259271080] pep-align-loops.so execution :: cycles :: benchmarks/shootout-fib2/benchmark.wasm Δ = 229041837.76 ± 62294164.86 (confidence = 99%) main.so is 1.06x to 1.10x faster than pep-align-loops.so! [2760966328 2877473442.90 2963631080] main.so [3000497042 3106515280.66 5393578372] pep-align-loops.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 680581047.42 ± 58488688.83 (confidence = 99%) pep-align-loops.so is 1.06x to 1.07x faster than main.so! [10586670916 10705050364.66 10875778390] main.so [9403424098 10024469317.24 10473503548] pep-align-loops.so compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm Δ = 14350847.50 ± 5439729.54 (confidence = 99%) pep-align-loops.so is 1.04x to 1.09x faster than main.so! [223399074 234169891.16 298711012] main.so [203019610 219819043.66 281894306] pep-align-loops.so execution :: cycles :: benchmarks/shootout-switch/benchmark.wasm Δ = 6943520.80 ± 883372.86 (confidence = 99%) main.so is 1.06x to 1.07x faster than pep-align-loops.so! [103285506 107387684.20 112425352] main.so [108359326 114331205.00 123381886] pep-align-loops.so compilation :: cycles :: benchmarks/shootout-ackermann/benchmark.wasm Δ = 4700652.32 ± 4147779.68 (confidence = 99%) pep-align-loops.so is 1.01x to 1.11x faster than main.so! [78591068 87546072.66 131549298] main.so [74343754 82845420.34 119083402] pep-align-loops.so compilation :: cycles :: benchmarks/shootout-ed25519/benchmark.wasm Δ = 32031030.76 ± 10733541.38 (confidence = 99%) pep-align-loops.so is 1.04x to 1.07x faster than main.so! [message truncated]
pepyakin updated PR #5004 from pep-align-loops
to main
.
bjorn3 created PR review comment:
Could you disable this when optimizing for size? Also won't this cause a large blowup when there are a lot of tiny loops in a function with multiple being able to fit in a single cacheline?
bjorn3 submitted PR review.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Also maybe disable it for cold blocks?
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Would it make sense to merge
cold_blocks
andloop_headers
into a singleSecondaryMap
with a bitflag as element? Or are both still too infrequently used such that it causes a large increase in memory usage?
lpereira commented on PR #5004:
Doing some necromancing here, but I found a nice article detailing how the .NET JIT is padding their loops and adopting a similar strategy might be useful. It's quite a bit more involved than this PR.
cfallin commented on PR #5004:
@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.
The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time) so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.
cfallin edited a comment on PR #5004:
@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.
The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time) so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.
(EDIT: to clarify, I'm hoping to learn more about e.g. "block-weight meets a certain weight threshold" in that article: is a block-weight static or dynamic, and how is it measured?)
lpereira commented on PR #5004:
@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate?
I don't know!
What I know, however, is that CoreCLR has a tiered JIT: AOT compilation is possible and usually performed to speed up startup, but the compiler in AOT mode does as much work optimizing code as Winch, IIRC. Once it detects that a function needs a second look (and I don't know how it does, maybe a per-function counter and some tiny code in the epilogue that decrements and calls into a "jit this harder" function that does that, then patches the caller to become a trampoline? I really don't know; it's probably something a whole lot more clever than that), it then spends more time recompiling things. This is pretty complicated to implement, and wouldn't help with the pure AOT scenarios we have, of course (unless we had some kind of PGO thing).
I'll spend some time looking through CoreCLR later to figure out exactly what they do and how they do it. I'll look at how GCC and Clang does it too.
The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments).
Right. I've given this some thought, and we could experiment with some heuristics, based on:
- The number of (WASM) instructions in the loop. Loops with over, I don't know, 100 instructions in the body is likely to not be that tight.
- The kind of Cranelift IR instructions in the loop. For instance:
- If it's mostly arithmetic, SIMD, and memory read/writes, then it's probably a hot loop
- If there are non-math-related libcalls, then it's less likely it's a hot loop
- If there are host calls, it's also less likely it's a hot loop
- If most calls are to small, trivial functions (say ~16 WASM instructions, no loops, maybe at most 2 branches, no other calls, or some other heuristic similar we'd use for inlining a function)
- Maybe factor in a ratio of branches/number of instructions too, and if it's not branch-heavy, it's probably a hot loop?- If we ever have some sort of PGO, feed that into the "loop hotness" score
It's all a bit fiddly, of course, and we'd need to empirically adjust these scores, but I think it's a doable strategy to identify hot loops.
Then, when lowering, at least on x86, two things need to happen if a loop is considered hot:
- Align the first instruction of the loop on a 32-byte boundary if there's SIMD, and 16-byte otherwise
- Ensure that no instructions in the loop crosses a cache line boundary
Last updated: Jan 24 2025 at 00:11 UTC