Are there any reports on Cranelift and RA2's performance? Any flamegraphs generated? What percentage of overall compilation time should I expect register allocation to take? (Right now it's around 40%, whereas isel (DAG tiling), optimizations, and more occupy less than half that.)
For a paper we took time measurements of the overall compilation pipeline (BDEE5997-5C92-4E1B-BAAC-B5984B44EB67.jpg)
We did a bit of profiling, but not a lot made it into the results due to space constraints, and we concluded that the problems are more of an algorithmic nature. Wrt RA2 specifically, we got similar results (roughly 50% of time spent in register allocation). IMO it is simply a pretty slow algorithm (and the strict separation between Cranelift and RA2 also doesn't help in some cases).
In my own compiler using RA2 it spends around 70% of the time in register allocation.
Is that paper published?
bjorn3 said:
Is that paper published?
If you mean our paper, it was published at CGO this year. A preprint is available here
Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.
T0b1 said:
Wrt RA2 specifically, we got similar results (roughly 50% of time spent in register allocation).
Ok glad to know I'm somewhat on track.
A flamegraph reveals that RA2 spends 30% (13% of overall compilation) of its time in .collect()...
And the results are steady across compilations of lots of binaries, including several over 200 MB.
Cranelift-IR (CIR)
fyi: Cranelift's IR is called CLIF
bjorn3 said:
Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.
Haven't actually read the paper yet, just skimming now, but is it actually faster or does it take less relative time because LLVM is doing so much more in the mid-end?
It's actually faster. But there may be several factors involved. For example if Cranelift generates more code due to poor optimizations then it will have to spend more time in register allocation.
While I agree with that, I don't know if it could account for this big of a difference.
A flamegraph reveals that RA2 spends 30% (13% of overall compilation) of its time in .collect()...
This doesn't mean too much to me in isolation -- in several places RA2 uses iterator-chain constructions to compute results that are in the end collected into a new data structure. Certainly we've been really careful not to allocate small intermediate objects (almost all the work operates on a few large arrays of different entity types), so it's not the usual low-hanging fruit that one could infer from this.
In general I've spent a lot of time (several months fulltime) profiling RA2, and at least back in 2021/2022 I found most of what I could, heavily optimized data structure size and layout, and eked out a bunch of 1-2% wins, so I believe it's pretty close to a local optimum for its design point. I think any big wins are going to have to come from algorithmic changes, as someone mentioned upthread: fewer splits, for example, or better bundle merging, or ... and all those have complexity and generated-code-quality tradeoffs.
If someone comparing against LLVM were to point out some specific design choices that seem to make a difference, or compare other statistics (e.g., number of liveranges, number of splits, number of backtracking points, that sort of thing) that'd be quite useful -- and as always PRs welcome!
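As a side note, here is a minimal, self-contained sketch (hypothetical names, not RA2's actual code) of the iterator-chain-plus-collect pattern mentioned above. The chain is lazy, so all of the filtering and mapping work executes inside `collect()`, and a profiler attributes that time to `.collect()` even though only one final `Vec` is allocated:

```rust
// Hedged sketch: the names `range_owners`/`indices_owned_by` are made up.
// The point is that the lazy chain only runs when `collect()` drives it,
// so flamegraph samples land under `.collect()`.
fn indices_owned_by(range_owners: &[u32], target: u32) -> Vec<usize> {
    range_owners
        .iter()
        .enumerate()
        .filter(|&(_, &owner)| owner == target) // work is deferred here...
        .map(|(i, _)| i)
        .collect() // ...and actually executed (and attributed) here
}

fn main() {
    let owners = vec![3, 7, 3, 9, 3];
    // Time spent here would show up under `collect()` in a flamegraph.
    assert_eq!(indices_owned_by(&owners, 3), vec![0, 2, 4]);
    println!("ok");
}
```

So a large `.collect()` sample can simply mean "this is where the real work executes" rather than pointing at avoidable allocation.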
Reading a bit more of the paper, I find myself a little perplexed, e.g. here
roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.
The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.
Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.
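To make the role of those allocation maps concrete, here is a hedged sketch (assumed names and types, not RA2's actual data structures) of a per-physical-register map keyed by program points: allocations are recorded across the whole function so a backtracking step can later evict them, which a single snapshot-in-time "free" bit per register could not support:

```rust
use std::collections::BTreeMap;

type ProgPoint = u32;

#[derive(Default)]
struct AllocMap {
    // key: start of an allocated range; value: (end of range, bundle id)
    ranges: BTreeMap<ProgPoint, (ProgPoint, u32)>,
}

impl AllocMap {
    /// Try to reserve the half-open range [start, end) for `bundle`;
    /// the B-tree lookups here answer "is this register free over this range?".
    fn try_allocate(&mut self, start: ProgPoint, end: ProgPoint, bundle: u32) -> bool {
        if let Some((_, &(prev_end, _))) = self.ranges.range(..=start).next_back() {
            if prev_end > start {
                return false; // overlaps a range beginning at or before `start`
            }
        }
        if let Some((&next_start, _)) = self.ranges.range(start..).next() {
            if next_start < end {
                return false; // overlaps a range beginning inside [start, end)
            }
        }
        self.ranges.insert(start, (end, bundle));
        true
    }

    /// Backtracking: evict a recorded allocation so its bundle can be retried
    /// elsewhere. A single "free right now" bit could not support this undo.
    fn evict(&mut self, start: ProgPoint) -> Option<u32> {
        self.ranges.remove(&start).map(|(_, bundle)| bundle)
    }
}
```

In a design like this, the queries in `try_allocate` are the kind of B-tree lookups that would show up in a profile.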
I'm playing around with creating a reusable environment to lower the number of heap allocations made. This is looking very promising when compiling large binaries :eyes:
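As a hedged illustration of that idea (hypothetical names, nothing from Cranelift's actual API), a reusable context can hold its scratch buffers across compilations and clear them instead of reallocating, so their capacity is reused:

```rust
// Sketch only: `ReusableCtx` and its fields are made-up stand-ins for
// whatever per-compilation scratch state would otherwise be reallocated.
#[derive(Default)]
struct ReusableCtx {
    liveness_scratch: Vec<u32>,
    spill_slots: Vec<u32>,
}

impl ReusableCtx {
    fn compile_function(&mut self, input: &[u32]) -> usize {
        // `clear` keeps the previously grown capacity, so later calls
        // usually allocate nothing at all.
        self.liveness_scratch.clear();
        self.spill_slots.clear();

        self.liveness_scratch.extend_from_slice(input);
        // ... the real pipeline work would happen here ...
        self.liveness_scratch.len()
    }
}

fn main() {
    let mut ctx = ReusableCtx::default();
    for f in 0..3u32 {
        // After the first call, no new allocation is needed if capacity suffices.
        let n = ctx.compile_function(&[f, f + 1, f + 2]);
        println!("compiled function {f} with {n} values");
    }
}
```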
Chris Fallin said:
roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.
The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.
The wording is confusing, yes. What we mean is that we think switching to a simple (linear-scan) greedy algorithm like LLVM's FastRegAlloc or DirectEmit's for register allocation is a better way to fix register allocation performance (at least in a JIT setting where compilation performance is important).
Both of these allocators iterate only once over the IR (LLVM backwards, DirectEmit forwards) and just assign any free register or spill using some simple heuristics without being able to reverse any decision made.
LLVM's allocator does not construct liveness information; DirectEmit constructs live intervals based on loops instead of ranges.
I'm not too familiar with LLVM's design in particular, however; the other author looked at that, so I don't know exactly what heuristics they're using.
Go's register allocator might also be interesting, it does some more advanced stuff but is still relatively fast AFAIK (though we haven't benchmarked it).
This obviously leads to worse codegen (DirectEmit, for example, has some problems with "overspilling" due to its architecture), but it is still pretty good (as seen in the runtime benchmarks).
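To ground the single-pass idea described above, here is a minimal, hedged Rust sketch (straight-line code only, hypothetical types, not LLVM's or DirectEmit's actual implementation). Walking backwards, the first time a vreg is seen is its last use, so a location can be handed out there and released again at its def, with no precomputed live ranges and no decision ever revisited:

```rust
use std::collections::HashMap;

type VReg = u32;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Loc {
    Reg(u8),
    Spill(u32),
}

// Simplified instruction shape: at most one def and any number of uses.
struct Inst {
    def: Option<VReg>,
    uses: Vec<VReg>,
}

fn allocate(insts: &[Inst], num_regs: u8) -> HashMap<VReg, Loc> {
    let mut free: Vec<u8> = (0..num_regs).collect();
    let mut live: HashMap<VReg, Loc> = HashMap::new();
    let mut assignment: HashMap<VReg, Loc> = HashMap::new();
    let mut next_spill = 0u32;

    for inst in insts.iter().rev() {
        // Seen backwards, a def ends the vreg's live range, so its register
        // is free again for anything earlier in the program.
        if let Some(d) = inst.def {
            if let Some(Loc::Reg(r)) = live.remove(&d) {
                free.push(r);
            }
        }
        // A use we haven't seen yet (walking backwards) is a last use:
        // greedily grab a free register, or fall back to a spill slot,
        // and never revisit the decision.
        for &u in &inst.uses {
            if !live.contains_key(&u) {
                let loc = match free.pop() {
                    Some(r) => Loc::Reg(r),
                    None => {
                        let s = Loc::Spill(next_spill);
                        next_spill += 1;
                        s
                    }
                };
                live.insert(u, loc);
                assignment.insert(u, loc);
            }
        }
    }
    assignment
}
```

This simplified version gives each vreg one location for its whole lifetime and ignores control flow, register constraints, and reload insertion, which is where the codegen-quality tradeoffs mentioned above come in.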
Chris Fallin said:
Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.
I might have gotten confused. I think I saw some references to the original linear scan paper when I was reading up about it, so that might be where it's coming from, though I'm not sure anymore.
You'll be happy to know that a Google Summer of Code project participant is actually writing a linear-scan algorithm to have as an option in RA2, so we will actually have that option!
Ah, cool that it got picked up. I'll definitely check out the results once it's available.
It's not actually linear-scan, it's a single-pass backwards allocator. Linear scan requires building live ranges for each value, which a backwards allocator doesn't.
I too am guilty of approximating the meaning of linear-scan :-) Yes, even simpler and faster; inspiration came from the SSRA (solid-state register allocator) post.
By the way, I noticed in profiles that regalloc2 spends 2.5x more time in merge_vreg_bundle than regalloc3 does, despite using the same algorithm.
In regalloc3 I only allow blockparams on blocks with more than one predecessor, so the regalloc2->regalloc3 adaptation layer is eliminating all the unnecessary blockparams by renaming vregs into values. Perhaps that's a sign that there is some performance to gain if Cranelift stops emitting these unnecessary blockparams?
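For illustration, here is a hedged sketch (hypothetical IR types, not regalloc2's or regalloc3's actual API) of the cleanup being described: on a block with exactly one predecessor, each blockparam can simply be renamed to the value its predecessor passes, making the blockparam unnecessary:

```rust
use std::collections::HashMap;

// Made-up, simplified IR: each block lists its predecessors, its blockparam
// vregs, and the argument vregs each predecessor passes (same arity as params).
struct Block {
    preds: Vec<usize>,
    params: Vec<u32>,
    incoming_args: Vec<Vec<u32>>,
}

// On single-predecessor blocks the params join nothing, so each one can be
// replaced by a rename of param -> incoming argument.
fn strip_single_pred_blockparams(blocks: &mut [Block]) -> HashMap<u32, u32> {
    let mut rename = HashMap::new();
    for b in blocks.iter_mut() {
        if b.preds.len() == 1 {
            for (&param, &arg) in b.params.iter().zip(&b.incoming_args[0]) {
                rename.insert(param, arg);
            }
            b.params.clear();
            b.incoming_args.clear();
        }
    }
    // A caller would then rewrite every use of a renamed param (chasing
    // chains of renames) to the underlying value.
    rename
}

fn main() {
    // block1 has a single predecessor (block0) passing v7 for param v10,
    // so v10 can just be renamed to v7.
    let mut blocks = vec![
        Block { preds: vec![], params: vec![], incoming_args: vec![] },
        Block { preds: vec![0], params: vec![10], incoming_args: vec![vec![7]] },
    ];
    let rename = strip_single_pred_blockparams(&mut blocks);
    assert_eq!(rename.get(&10), Some(&7));
    println!("renamed {} blockparams", rename.len());
}
```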
certainly possible; this is with optimizations enabled (including CL's constant phi removal pass) or no?
we have had issues with excess blockparam creation in the past
It was with the defaults for `wasmtime compile`; I'm not sure if this enables optimizations.
It was definitely triggering errors in my validator about blockparams on blocks with only one predecessor.
I even tried including https://github.com/bytecodealliance/wasmtime/pull/8565 but it was still generating unnecessary blockparams.
Does Cranelift produce pruned SSA? I don't think so…
So it isn’t trying to remove all truly useless phis
Or rather, it may be removing dead phis that don't really join anything, but it might not be (and I don't think it is) removing all truly useless phis.
I’m not sure yet though and I’m studying the code daily trying to get a better understanding of
@Leaves The remove-constant-phis pass at least claims to remove "useless phis" (those with all inputs the same) as well
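As a hedged illustration of that "useless phi" criterion (hypothetical helper, not the actual remove-constant-phis code): if every incoming argument of a blockparam is the same value, ignoring self-references, the param can be replaced by that value; if it genuinely joins two different values, it stays.

```rust
// Sketch only: returns Some(replacement) when the phi/blockparam is useless.
fn useless_phi_replacement(param: u32, incoming: &[u32]) -> Option<u32> {
    let mut unique = None;
    for &arg in incoming {
        if arg == param {
            continue; // a self-reference doesn't make the phi meaningful
        }
        match unique {
            None => unique = Some(arg),
            Some(v) if v == arg => {}
            Some(_) => return None, // joins two distinct values: keep it
        }
    }
    unique
}

fn main() {
    // v5 = phi(v2, v2, v5)  ->  every real input is v2, so v5 can become v2.
    assert_eq!(useless_phi_replacement(5, &[2, 2, 5]), Some(2));
    // v6 = phi(v2, v3)      ->  a real join, not useless.
    assert_eq!(useless_phi_replacement(6, &[2, 3]), None);
    println!("ok");
}
```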