Happy new year! I'd like to give an update on my progress with regalloc3: I've been working on a new algorithm for live range splitting that is inspired by the one used in LLVM. The basic idea is as follows:
Benchmark results show a promising performance improvement, although compilation speed suffers a lot. This is mainly because I haven't focused on optimizing the code yet, in particular the part that collects interference. I believe the compilation speed can be significantly improves to be in line with the current performance of regalloc2.
These benchmarks were run using the split
branch on https://github.com/Amanieu/regalloc3, using the regalloc2-to-regalloc3 adapter from the regalloc3
branch of https://github.com/Amanieu/regalloc2.
compilation
benchmarks/bz2/benchmark.wasm
cycles
[105151020 112663113.33 142774958] ./ra2.so
[58353818 65341754.47 105462298] ./ra3-spill.so
[98641846 112952493.80 157907792] ./ra3-split-new5.so
execution
benchmarks/bz2/benchmark.wasm
cycles
[87786800 88337229.47 94848250] ./ra2.so
[86056240 87312036.27 93641650] ./ra3-spill.so
[82282446 82995957.27 89504782] ./ra3-split-new5.so
compilation
benchmarks/pulldown-cmark/benchmark.wasm
cycles
[89363986 108134334.47 136772140] ./ra2.so
[75750146 93458888.47 111040262] ./ra3-spill.so
[139417192 165489712.73 209262902] ./ra3-split-new5.so
execution
benchmarks/pulldown-cmark/benchmark.wasm
cycles
[6561456 6676379.93 6795542] ./ra2.so
[6519470 6589951.00 6883314] ./ra3-spill.so
[6302414 6376103.27 6464282] ./ra3-split-new5.so
compilation
benchmarks/spidermonkey/benchmark.wasm
cycles
[556633590 595953455.53 678524764] ./ra2.so
[493291612 550306873.73 691952498] ./ra3-spill.so
[823929652 879684180.93 1008727294] ./ra3-split-new5.so
execution
benchmarks/spidermonkey/benchmark.wasm
cycles
[781633522 785413154.47 795564016] ./ra2.so
[785475380 788557095.93 799293270] ./ra3-spill.so
[765625126 767272451.73 773143238] ./ra3-split-new5.so
Also I think now would be a good time to discuss how the Cranelift team wants to move forward with this. The first step would probably be to add it via the regalloc2 wrapper as a 3rd algorithm (in addition to the existing ion
and fastalloc
). Maybe this could be discussed during the cranelift meeting next week?
Hi @Amanieu -- I gave my thoughts on this last month in this message. Technically I'd like to see the blockparams-on-single-pred-blocks issue resolved; we don't want to have to have a rewrite pass even in no-opt builds in Cranelift.
I think the social side is a much bigger concern: right now we have general familiarity with RA2 and every line has been code-reviewed; BA standards would require us to code-review all of RA3 and we'd want to make sure that folks on our team are familiar with the codebase before switching to it. We also have standards around ensuring key bits are owned/published by team permissions rather than individuals -- we don't want a bus factor of one when we need to push security fixes, for example -- which practically speaking would probably mean maybe trying to adopt RA3 as a BA project. Basically, outsourcing regalloc to a "third-party crate" (that third party is you, hi, at least we know you!) is a risk factor w.r.t. flexibility during security incidents, and flexibility with our own requirements, that we'd want to address somehow.
Happy to discuss further in the CL meeting of course -- please feel free to add an agenda item
@Chris Fallin The rewrite is very straightforward, as you can see from how it is implemented in the regalloc wrapper here. I don't think this will cost much in compilation time and it significantly simplifies the implementation of the register allocator.
Fair enough, perhaps we can do the remapping as part of our lowering pass (which runs regardless of optimizations). A PR that adds this invariant to CL's VCode blockparams would be welcome
Last updated: Jan 24 2025 at 00:11 UTC