Spidermonkey significantly improved their regalloc performance for large functions: https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html Is there anything in here that would be useful for regalloc2?
Semi-NCA (from Linear-Time Algorithms for Dominators and Related Problems by Loukas Georgiadis) is a different algorithm that’s also used by LLVM and the Julia compiler. I prototyped this and was surprised to see how much faster it was: it got our total compilation time down from 14 seconds to less than 8 seconds. For a single-threaded compilation, it reduced the time under
ComputeImmediateDominators
from 7.1 seconds to 0.15 seconds.
we use the simple-fast algorithm in cranelift, not sure if regalloc2 recomputes dominators again and if so what algorithm it uses, but it would definitely be nice to replace simple-fast with semi-nca or something else that isn't quadratic
the first change, switching from linked lists to vectors, regalloc2 has already done, as noted in the post
not sure about the sorting bits tho
https://github.com/bytecodealliance/wasmtime/issues/9481
some discussion with Jan on mastodon here: https://mastodon.world/@mgaudet@discuss.systems/113318166531502647
tl;dr is that RA2 independently came to the same optimizations more or less
sparse bitsets (we have our own thing that is adaptive past a threshold), and also avoiding quadratic behavior in move generation by doing half-moves with sorting to match them up
RA3 uses a new algorithm for computing live ranges which is faster than the one used in RA2 and only requires 2 bitsets total (live-in blocks and live-out blocks). This works by first creating the use list for each value and then computing live range for each value one at a time.
Since live-in/out data is only needed for a single value at a time, we only need O(blocks) bits in the bitset instead of O(blocks*values).
It does use a dense bitset though, so there may still be perf issues with very large functions like the one in ONNX, I'll need to do some testing.
On the same ONNX wasm blob from that issue, regalloc2 seems to spend 72% of the total regalloc time sorting ranges when merging bundles. That can be improved in 2 ways:
Last updated: Nov 22 2024 at 16:03 UTC