spidermonkey regalloc speedup · cranelift

Stream: cranelift

Topic: spidermonkey regalloc speedup

bjorn3 (Oct 17 2024 at 07:13):

Spidermonkey significantly improved their regalloc performance for large functions: https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html Is there anything in here that would be useful for regalloc2?

75x faster: optimizing the Ion compiler backend

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

fitzgen (he/him) (Oct 17 2024 at 15:22):

Semi-NCA (from Linear-Time Algorithms for Dominators and Related Problems by Loukas Georgiadis) is a different algorithm that’s also used by LLVM and the Julia compiler. I prototyped this and was surprised to see how much faster it was: it got our total compilation time down from 14 seconds to less than 8 seconds. For a single-threaded compilation, it reduced the time under ComputeImmediateDominators from 7.1 seconds to 0.15 seconds.

we use the simple-fast algorithm in cranelift, not sure if regalloc2 recomputes dominators again and if so what algorithm it uses, but it would definitely be nice to replace simple-fast with semi-nca or something else that isn't quadratic

fitzgen (he/him) (Oct 17 2024 at 15:23):

the first change, switching from linked lists to vectors, regalloc2 has already done, as noted in the post

fitzgen (he/him) (Oct 17 2024 at 15:23):

not sure about the sorting bits tho

fitzgen (he/him) (Oct 17 2024 at 15:27):

https://github.com/bytecodealliance/wasmtime/issues/9481

Use a linear time dominators algorithm · Issue #9481 · bytecodealliance/wasmtime

Right now we use the classic simple-fast algorithm (A Simple, Fast Dominance Algorithm by Cooper et al) and while this works well most of the time in practice, it has quadratic worst case time. Lin...

Chris Fallin (Oct 17 2024 at 18:10):

some discussion with Jan on mastodon here: https://mastodon.world/@mgaudet@discuss.systems/113318166531502647

Chris Fallin (Oct 17 2024 at 18:10):

tl;dr is that RA2 independently came to the same optimizations more or less

Chris Fallin (Oct 17 2024 at 18:10):

sparse bitsets (we have our own thing that is adaptive past a threshold), and also avoiding quadratic behavior in move generation by doing half-moves with sorting to match them up

Amanieu (Oct 17 2024 at 18:55):

RA3 uses a new algorithm for computing live ranges which is faster than the one used in RA2 and only requires 2 bitsets total (live-in blocks and live-out blocks). This works by first creating the use list for each value and then computing live range for each value one at a time.
Since live-in/out data is only needed for a single value at a time, we only need O(blocks) bits in the bitset instead of O(blocks*values).

regalloc3/src/internal/value_live_ranges.rs at main · Amanieu/regalloc3

New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3

Amanieu (Oct 17 2024 at 19:03):

It does use a dense bitset though, so there may still be perf issues with very large functions like the one in ONNX, I'll need to do some testing.

Amanieu (Oct 18 2024 at 04:37):

On the same ONNX wasm blob from that issue, regalloc2 seems to spend 72% of the total regalloc time sorting ranges when merging bundles. That can be improved in 2 ways:

More thorough phi elimination before regalloc.
Using a linear-time merge instead of a sort.

Last updated: Apr 08 2025 at 03:15 UTC