Stream: cranelift

Topic: spidermonkey regalloc speedup


view this post on Zulip bjorn3 (Oct 17 2024 at 07:13):

Spidermonkey significantly improved their regalloc performance for large functions: https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html Is there anything in here that would be useful for regalloc2?

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

view this post on Zulip fitzgen (he/him) (Oct 17 2024 at 15:22):

Semi-NCA (from Linear-Time Algorithms for Dominators and Related Problems by Loukas Georgiadis) is a different algorithm that’s also used by LLVM and the Julia compiler. I prototyped this and was surprised to see how much faster it was: it got our total compilation time down from 14 seconds to less than 8 seconds. For a single-threaded compilation, it reduced the time under ComputeImmediateDominators from 7.1 seconds to 0.15 seconds.

we use the simple-fast algorithm in cranelift, not sure if regalloc2 recomputes dominators again and if so what algorithm it uses, but it would definitely be nice to replace simple-fast with semi-nca or something else that isn't quadratic

view this post on Zulip fitzgen (he/him) (Oct 17 2024 at 15:23):

the first change, switching from linked lists to vectors, regalloc2 has already done, as noted in the post

view this post on Zulip fitzgen (he/him) (Oct 17 2024 at 15:23):

not sure about the sorting bits tho

view this post on Zulip fitzgen (he/him) (Oct 17 2024 at 15:27):

https://github.com/bytecodealliance/wasmtime/issues/9481

Right now we use the classic simple-fast algorithm (A Simple, Fast Dominance Algorithm by Cooper et al) and while this works well most of the time in practice, it has quadratic worst case time. Lin...

view this post on Zulip Chris Fallin (Oct 17 2024 at 18:10):

some discussion with Jan on mastodon here: https://mastodon.world/@mgaudet@discuss.systems/113318166531502647

view this post on Zulip Chris Fallin (Oct 17 2024 at 18:10):

tl;dr is that RA2 independently came to the same optimizations more or less

view this post on Zulip Chris Fallin (Oct 17 2024 at 18:10):

sparse bitsets (we have our own thing that is adaptive past a threshold), and also avoiding quadratic behavior in move generation by doing half-moves with sorting to match them up

view this post on Zulip Amanieu (Oct 17 2024 at 18:55):

RA3 uses a new algorithm for computing live ranges which is faster than the one used in RA2 and only requires 2 bitsets total (live-in blocks and live-out blocks). This works by first creating the use list for each value and then computing live range for each value one at a time.
Since live-in/out data is only needed for a single value at a time, we only need O(blocks) bits in the bitset instead of O(blocks*values).

New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3

view this post on Zulip Amanieu (Oct 17 2024 at 19:03):

It does use a dense bitset though, so there may still be perf issues with very large functions like the one in ONNX, I'll need to do some testing.

view this post on Zulip Amanieu (Oct 18 2024 at 04:37):

On the same ONNX wasm blob from that issue, regalloc2 seems to spend 72% of the total regalloc time sorting ranges when merging bundles. That can be improved in 2 ways:


Last updated: Nov 22 2024 at 16:03 UTC