Are there any reports on Cranelift and RA2's performance? Any flamegraphs generated? What percentage of overall compilation time should I expect register allocation to take? (Right now it's around 40%, whereas isel (DAG tiling), optimizations, and the rest occupy less than half that.)
For a paper we took time measurements of the overall compilation pipeline (see the attached figure).
We did a bit of profiling, but not a lot of it made it into the results due to space constraints, and we concluded that the problems are more of an algorithmic nature. Wrt RA2 specifically, we got similar results (roughly 50% of time spent in register allocation). IMO it is simply a pretty slow algorithm (and the strict separation between Cranelift and RA2 also doesn't help in some cases).
In my own compiler using RA2 it spends around 70% of the time in register allocation.
Is that paper published?
bjorn3 said:
Is that paper published?
If you mean our paper, it was published at CGO this year. A preprint is available here
Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.
T0b1 said:
~~
Ok glad to know I'm somewhat on track.
Flamegraph revealing RA2 spends 30% (13% of overall compilation) of its time in .collect()....
And the results are steady across compilations of lots of binaries, including several over 200 MB.
Cranelift-IR (CIR)
fyi: Cranelift's IR is called CLIF
bjorn3 said:
Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.
Haven't actually read the paper yet, just skimming now, but is it actually faster or does it take less relative time because LLVM is doing so much more in the mid-end?
It's actually faster. But there may be several factors involved. For example if Cranelift generates more code due to poor optimizations then it will have to spend more time in register allocation.
While I agree with that, I don't know if it could account for this big of a difference.
Flamegraph revealing RA2 spends 30% (13% of overall compilation) of its time in .collect()....
This doesn't mean too much to me in isolation -- in several places RA2 uses iterator-chain constructions to compute results, which are in the end collected into a new data structure. Certainly we've been really careful not to allocate small intermediate objects (almost all the work is operating on a few large arrays of different entity types) so it's not the usual low-hanging fruit that one could infer from this.
In general I've spent a lot of time (several months full-time) profiling RA2, and at least back in 2021/2022 I found most of what I could, heavily optimized data structure size and layout, and eked out a bunch of 1-2% wins, so I believe it's pretty close to a local optimum for its design point. I think any big wins are going to have to come from algorithmic changes, as someone mentioned upthread: fewer splits, for example, or better bundle merging, or ... and all those have complexity and generated-code-quality tradeoffs.
If someone comparing against LLVM were to point out some specific design choices that seem to make a difference, or compare other statistics (e.g., number of liveranges, number of splits, number of backtracking points, that sort of thing) that'd be quite useful -- and as always PRs welcome!
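(As a rough illustration of the point above -- hypothetical types, not regalloc2's actual code: when an iterator chain is collected into one large per-entity array, a profiler attributes both the allocation and the per-element work of the whole chain to .collect(), so it can dominate a flamegraph even with no small intermediate heap objects.)

```rust
/// Illustrative only: a hypothetical entity type and pass, not regalloc2's.
struct LiveRange {
    vreg: u32,
    from: u32,
    to: u32,
}

fn compute_ranges(uses: &[(u32, u32)]) -> Vec<LiveRange> {
    // The driving loop of this chain runs inside `collect`, so the
    // flamegraph frames for the map closure sit underneath `.collect()`.
    // The one allocation made here is the large result Vec itself.
    uses.iter()
        .map(|&(vreg, point)| LiveRange { vreg, from: point, to: point + 1 })
        .collect()
}
```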
Reading a bit more of the paper, I find myself a little perplexed, e.g. here
roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.
The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.
Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.
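(For readers unfamiliar with the allocation maps: here is a minimal sketch, with assumed names and types rather than regalloc2's real internals, of why a backtracking allocator needs per-register maps keyed by program-point ranges instead of a snapshot-in-time free-register bitmask.)

```rust
use std::collections::BTreeMap;

/// Program points are plain instruction indices here; regalloc2's real
/// ProgPoint/LiveRange types are more involved. Sketch only.
type ProgPoint = u32;

/// One allocation map per physical register: range start -> range end
/// (exclusive). Non-overlapping entries record every interval at which
/// the register is occupied, across the whole function.
#[derive(Default)]
struct PRegAllocMap {
    occupied: BTreeMap<ProgPoint, ProgPoint>,
}

impl PRegAllocMap {
    /// Is the register free for the whole half-open range [from, to)?
    fn is_free(&self, from: ProgPoint, to: ProgPoint) -> bool {
        // The occupied range starting at or before `from` must end before
        // `from`, and no occupied range may start inside [from, to).
        if let Some((_, &end)) = self.occupied.range(..=from).next_back() {
            if end > from {
                return false;
            }
        }
        self.occupied.range(from..to).next().is_none()
    }

    fn allocate(&mut self, from: ProgPoint, to: ProgPoint) {
        debug_assert!(self.is_free(from, to));
        self.occupied.insert(from, to);
    }

    /// Backtracking: evicting a bundle removes its ranges again, which is
    /// why the full history of allocations must stay queryable rather than
    /// just "which registers are free right now".
    fn evict(&mut self, from: ProgPoint) {
        self.occupied.remove(&from);
    }
}
```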
I'm playing around with creating a reusable environment to lower the number of heap allocations made. This is looking very promising when compiling large binaries :eyes:
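(A minimal sketch of the kind of reuse being described -- hypothetical names, not an actual Cranelift or RA2 API: keep the large buffers in a context that outlives a single function and clear them between runs instead of reallocating.)

```rust
/// Hypothetical scratch context reused across function compilations.
#[derive(Default)]
struct AllocScratch {
    ranges: Vec<(u32, u32)>,
    worklist: Vec<u32>,
}

impl AllocScratch {
    /// `clear` keeps each Vec's capacity, so steady-state compilation of
    /// many functions performs far fewer heap allocations.
    fn reset(&mut self) {
        self.ranges.clear();
        self.worklist.clear();
    }
}

fn compile_all(functions: &[Vec<u32>]) {
    let mut scratch = AllocScratch::default();
    for func in functions {
        scratch.reset();
        // ... fill scratch.ranges / scratch.worklist for this function
        // instead of collecting into freshly allocated Vecs ...
        scratch.worklist.extend(func.iter().copied());
    }
}
```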
Chris Fallin said:
roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.
The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.
The wording is confusing, yes. What we mean is that we think switching to a simple (linear-scan-style) greedy algorithm like LLVM's FastRegAlloc or DirectEmit's is a better way to fix the register allocation performance (at least in a JIT setting where compilation performance is important).
Both of these allocators iterate only once over the IR (LLVM backwards, DirectEmit forwards) and just assign any free register or spill using some simple heuristics without being able to reverse any decision made.
LLVM's allocator does not construct liveness information; DirectEmit constructs live intervals based on loops instead of ranges.
I'm not too familiar with LLVM's design in particular, however; the other author looked at that, so I do not know exactly what heuristics they are using.
Go's register allocator might also be interesting, it does some more advanced stuff but is still relatively fast AFAIK (though we haven't benchmarked it).
This obviously leads to worse codegen (DirectEmit, for example, has some problems with "overspilling" due to its architecture), but it is still pretty good (as seen in the runtime benchmarks).
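(To make the contrast concrete, here is a heavily simplified sketch -- hypothetical straight-line IR, not LLVM's or DirectEmit's actual code -- of a greedy single forward pass: each definition takes any free register or is spilled on the spot, and no decision is ever revisited. Note that a forward pass like this needs last-use information up front, which is exactly what the backwards variant discussed further down avoids.)

```rust
/// Hypothetical, heavily simplified straight-line SSA: instruction i
/// defines value i, and `last_use_of[i]` lists the values whose last use
/// is instruction i. Sketch only; spilled values simply get `None`.
const NUM_REGS: usize = 4;

fn greedy_forward_alloc(last_use_of: &[Vec<usize>]) -> Vec<Option<usize>> {
    let num_values = last_use_of.len();
    let mut free: Vec<usize> = (0..NUM_REGS).collect();
    let mut assignment: Vec<Option<usize>> = vec![None; num_values];

    for (inst, dying) in last_use_of.iter().enumerate() {
        // Values whose last use is here release their registers first, so
        // this instruction's def can reuse one of its inputs' registers.
        for &v in dying {
            if let Some(r) = assignment[v] {
                free.push(r);
            }
        }
        // Greedy: take any free register, otherwise spill (None). The
        // decision is final -- no eviction, no backtracking.
        assignment[inst] = free.pop();
    }
    assignment
}
```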
Chris Fallin said:
Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.
I might have gotten confused. I think I saw some references to the original Linear Scan paper when I was reading up about it so that's where that might be coming from though I'm not sure anymore.
You'll be happy to know that a Google Summer of Code participant is actually writing a linear-scan allocator as an option in RA2, so we will have that option!
Ah, cool that it got picked up. I'll definitely check out the results once it's available.
It's not actually linear scan; it's a single-pass backwards allocator. Linear scan requires building live ranges for each value, which a backwards allocator doesn't need.
I too am guilty of approximating the meaning of linear-scan :-) Yes, even simpler and faster; the inspiration came from the SSRA (solid-state register allocator) post.
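(A tiny sketch of the backwards idea, using the same hypothetical straight-line IR as the forward sketch above -- not RA2's or the GSoC code: walking in reverse, the first time a value is seen as a use is its last use, so a register can be assigned on the spot and freed when its def is reached, with no precomputed live ranges.)

```rust
/// Hypothetical straight-line SSA: instruction i defines value i and
/// `uses[i]` lists the values it reads. Illustrative sketch only.
const NUM_REGS: usize = 4;

fn backwards_alloc(uses: &[Vec<usize>]) -> Vec<Option<usize>> {
    let num_values = uses.len();
    let mut free: Vec<usize> = (0..NUM_REGS).collect();
    let mut assignment: Vec<Option<usize>> = vec![None; num_values];
    let mut spilled = vec![false; num_values];

    for inst in (0..uses.len()).rev() {
        // In reverse order the def comes first: the value's live range
        // starts here, so its register can be released (and may even be
        // reused by an input below, since reads precede the write).
        if let Some(r) = assignment[inst] {
            free.push(r);
        }
        for &v in &uses[inst] {
            // First sighting in the reverse walk == last use in program
            // order: allocate here, or mark as spilled if nothing is free
            // (a real allocator would insert reload code around uses).
            if assignment[v].is_none() && !spilled[v] {
                match free.pop() {
                    Some(r) => assignment[v] = Some(r),
                    None => spilled[v] = true,
                }
            }
        }
    }
    assignment
}
```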
By the way, I noticed in profiles that regalloc2 spends 2.5x more time in merge_vreg_bundle than regalloc3 does, despite using the same algorithm.
In regalloc3 I only allow blockparams on blocks with more than one predecessor, so the regalloc2->regalloc3 adaptation layer is eliminating all the unnecessary blockparams by renaming vregs into values. Perhaps that's a sign that there is some performance to gain if Cranelift stops emitting these unnecessary blockparams?
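(A minimal sketch of the kind of rewrite being described -- hypothetical types, not the actual regalloc2/regalloc3 adaptation layer: if a block has exactly one predecessor, each of its blockparams can be renamed to the value the predecessor passes, and the param dropped.)

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified CFG representation.
struct Block {
    params: Vec<u32>,                     // vregs defined as blockparams
    preds: Vec<usize>,                    // predecessor block indices
    args_from: HashMap<usize, Vec<u32>>,  // pred index -> branch args passed
}

/// Returns a renaming map: blockparam vreg -> the vreg that should replace
/// it everywhere. Only single-predecessor blocks are rewritten; a real
/// pass would also resolve chains of renames transitively.
fn strip_single_pred_params(blocks: &[Block]) -> HashMap<u32, u32> {
    let mut rename = HashMap::new();
    for block in blocks {
        if block.preds.len() != 1 {
            continue;
        }
        let pred = block.preds[0];
        let args = &block.args_from[&pred];
        for (&param, &arg) in block.params.iter().zip(args) {
            rename.insert(param, arg);
        }
    }
    rename
}
```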
certainly possible; this is with optimizations enabled (including CL's constant phi removal pass) or no?
we have had issues with excess blockparam creation in the past
It was with the default for wasmtime compile; I'm not sure if this enables optimizations.
It was definitely triggering errors in my validator about blockparams on blocks with only one predecessor.
I even tried including https://github.com/bytecodealliance/wasmtime/pull/8565 but it was still generating unnecessary blockparams.
Does Cranelift produce pruned SSA? I don’t think so…
So it isn’t trying to remove all truly useless phis
Or rather, it may be removing dead phis that don’t really join anything, but it might not be (and I don’t think it is) removing all truly useless phis
I’m not sure yet though, and I’m studying the code daily trying to get a better understanding of it
@Leaves The remove-constant-phis pass at least claims to remove "useless phis" (those with all inputs the same) as well
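(For reference, a sketch of the check for a "useless" phi in this sense -- hypothetical types, not Cranelift's remove_constant_phis code; this uses the standard SSA notion of a trivial phi, where a self-reference also doesn't count as a distinct input.)

```rust
/// Hypothetical: a blockparam/phi with its defining vreg and the vregs
/// passed in from each predecessor.
struct Phi {
    result: u32,
    inputs: Vec<u32>,
}

/// A phi is "useless" if every input is either the same single value or
/// the phi's own result; it can then be replaced by that value.
fn useless_phi_replacement(phi: &Phi) -> Option<u32> {
    let mut unique = None;
    for &input in &phi.inputs {
        if input == phi.result {
            continue; // self-reference doesn't count
        }
        match unique {
            None => unique = Some(input),
            Some(v) if v == input => {}
            Some(_) => return None, // two distinct inputs: a real join
        }
    }
    unique
}
```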