Stream: cranelift

Topic: Cranelift Profiling.


view this post on Zulip Leaves (Jul 29 2024 at 06:00):

Are there any reports on Cranelift and RA2's performance? Flamegraphs generated? What percentage of overall compilation time should I expect register allocation to take? (Right now it's around 40%, whereas isel (DAG tiling), optimizations, and everything else occupy less than half that.)

view this post on Zulip T0b1 (Jul 29 2024 at 06:34):

For a paper we took time measurements of the overall compilation pipeline (BDEE5997-5C92-4E1B-BAAC-B5984B44EB67.jpg)

We did a bit of profiling but not a lot made it into the results due to space constraints, and we concluded that the problems are more of an algorithmic nature. Wrt RA2 specifically, we got similar results (roughly 50% of time spent in register allocation). IMO it is simply a pretty slow algorithm (and the strict separation between Cranelift and RA2 also doesn't help in some cases).

view this post on Zulip Amanieu (Jul 29 2024 at 07:36):

In my own compiler using RA2 it spends around 70% of the time in register allocation.

view this post on Zulip bjorn3 (Jul 29 2024 at 12:31):

Is that paper published?

view this post on Zulip T0b1 (Jul 29 2024 at 14:39):

bjorn3 said:

Is that paper published?

If you mean our paper, it was published at CGO this year. A preprint is available here

view this post on Zulip bjorn3 (Jul 29 2024 at 15:16):

Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.

view this post on Zulip Leaves (Jul 29 2024 at 16:26):


Ok glad to know I'm somewhat on track.

Flamegraph revealing RA2 spends 30% (13% of overall compilation) of its time in .collect()....

view this post on Zulip Leaves (Jul 29 2024 at 16:28):

And the results are steady across compilations of lots of binaries, including several over 200 MB.

view this post on Zulip fitzgen (he/him) (Jul 29 2024 at 16:36):

Cranelift-IR (CIR)

fyi: Cranelift's IR is called CLIF

view this post on Zulip fitzgen (he/him) (Jul 29 2024 at 16:37):

bjorn3 said:

Thanks! Interesting to see that regalloc is much faster for LLVM than for Cranelift.

Haven't actually read the paper yet, just skimming now, but is it actually faster or does it take less relative time because LLVM is doing so much more in the mid-end?

view this post on Zulip Amanieu (Jul 29 2024 at 16:47):

It's actually faster. But there may be several factors involved. For example if Cranelift generates more code due to poor optimizations then it will have to spend more time in register allocation.

view this post on Zulip Leaves (Jul 29 2024 at 16:47):

While I agree with that, I don't know if it could account for this big a difference.

view this post on Zulip Chris Fallin (Jul 29 2024 at 18:42):

Flamegraph revealing RA2 spends 30% (13% of overall compilation) of its time in .collect()....

This doesn't mean too much to me in isolation -- in several places RA2 uses iterator-chain constructions to compute results that are in the end collected into a new data structure. Certainly we've been really careful not to allocate small intermediate objects (almost all the work is operating on a few large arrays of different entity types), so it's not the usual low-hanging fruit that one could infer from this.

In general I've spent a lot of time (several months fulltime) profiling RA2, and at least back in 2021/2022 I found most of what I could, heavily optimized data structure size and layout, and eked out a bunch of 1-2% wins, so I believe it's pretty close to a local optimum for its design point. I think any big wins are going to have to come from algorithmic changes, as someone mentioned upthread: fewer splits, for example, or better bundle merging, or ... and all those have complexity and generated-code-quality tradeoffs.

If someone comparing against LLVM were to point out some specific design choices that seem to make a difference, or compare other statistics (e.g., number of liveranges, number of splits, number of backtracking points, that sort of thing) that'd be quite useful -- and as always PRs welcome!
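For readers unfamiliar with the pattern being discussed, here is a minimal sketch of an iterator chain whose result is collected into a new Vec, next to the same computation writing into a reused buffer. These are invented functions, not RA2 code; the point is that neither version allocates small per-item objects, but `.collect()` still pays for one large allocation per call, which is what a flamegraph attributes to it.

```rust
// Illustrative only -- hypothetical functions, not RA2's actual code.

fn ends_collected(ranges: &[(u32, u32)]) -> Vec<u32> {
    // Iterator chain collected into a fresh Vec: one allocation per call.
    ranges.iter().map(|&(_, end)| end).collect()
}

fn ends_reused(ranges: &[(u32, u32)], buf: &mut Vec<u32>) {
    // Same computation into a caller-owned buffer: the allocation cost
    // amortizes to zero across repeated calls.
    buf.clear();
    buf.extend(ranges.iter().map(|&(_, end)| end));
}

fn main() {
    let ranges = [(0u32, 4u32), (2, 9)];
    let mut buf = Vec::new();
    ends_reused(&ranges, &mut buf);
    assert_eq!(ends_collected(&ranges), buf); // both yield [4, 9]
}
```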

view this post on Zulip Chris Fallin (Jul 29 2024 at 18:50):

Reading a bit more of the paper, I find myself a little perplexed, e.g. here

roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.

The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.

Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.
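To make the distinction concrete, here is a toy version of a range-keyed allocation map with the undo operation that a one-bit-per-register snapshot cannot support. All names and the structure are invented for illustration; RA2's real per-preg B-trees are more involved.

```rust
use std::collections::BTreeMap;

/// Toy per-physical-register map from live-range start to end point.
/// Keyed by start so lookups can find overlapping allocations -- this is
/// why a backtracking allocator needs ordered-map queries rather than a
/// snapshot-in-time "free" bit per register.
#[derive(Default)]
struct AllocMap {
    ranges: BTreeMap<u32, u32>, // start -> end, half-open [start, end)
}

impl AllocMap {
    /// Does [start, end) overlap any recorded allocation? Assuming the
    /// recorded ranges are disjoint, only the neighbor with the largest
    /// start below `end` can overlap.
    fn conflicts(&self, start: u32, end: u32) -> bool {
        self.ranges
            .range(..end)
            .next_back()
            .map_or(false, |(_, &e)| e > start)
    }

    /// Record an allocation; returns false on conflict.
    fn allocate(&mut self, start: u32, end: u32) -> bool {
        if self.conflicts(start, end) {
            return false;
        }
        self.ranges.insert(start, end);
        true
    }

    /// Undo a previous allocation -- the eviction/backtracking case.
    fn evict(&mut self, start: u32) {
        self.ranges.remove(&start);
    }
}

fn main() {
    let mut m = AllocMap::default();
    assert!(m.allocate(0, 10));
    assert!(!m.allocate(5, 15)); // overlaps [0, 10)
    assert!(m.allocate(10, 20)); // half-open: adjacent ranges are fine
    m.evict(0); // backtrack: undo the first allocation
    assert!(m.allocate(5, 10)); // that span is free again
}
```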

view this post on Zulip Leaves (Jul 29 2024 at 19:42):

I'm playing around with creating a reusable environment to lower the number of heap allocations made. This is looking very promising when compiling large binaries :eyes:
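A minimal sketch of the buffer-reuse pattern being described, with a hypothetical scratch context (all names invented): allocate the working vectors once, then clear and reuse them per function so repeated compilations stop hitting the allocator.

```rust
/// Hypothetical reusable scratch state for one allocator invocation.
#[derive(Default)]
struct RegallocCtx {
    live_ranges: Vec<(u32, u32)>,
    spill_slots: Vec<u32>,
}

impl RegallocCtx {
    /// Clear contents but keep the heap capacity for the next function.
    fn reset(&mut self) {
        self.live_ranges.clear();
        self.spill_slots.clear();
    }
}

fn main() {
    let mut ctx = RegallocCtx::default();
    for _func in 0..3 {
        ctx.reset();
        ctx.live_ranges.push((0, 4));
        let cap = ctx.live_ranges.capacity();
        ctx.reset();
        // Capacity is retained across reset: the next push reuses it
        // instead of allocating again.
        assert!(ctx.live_ranges.capacity() >= cap);
        assert!(ctx.live_ranges.is_empty());
    }
}
```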

view this post on Zulip T0b1 (Jul 30 2024 at 13:43):

Chris Fallin said:

roughly 6% of register allocation time is spent inserting and looking up values in B-trees. Using a greedy approach instead could significantly improve register allocation performance.

The B-trees (the allocation maps) are how we know which registers are free; these lookups are necessary. I'm not sure what "a greedy approach" could mean: do you mean something like a linear-scan allocator, where we only need the current snapshot-in-time availability (one bit per register)? That fundamentally won't work because this is a backtracking allocator and needs to record allocations across all program points to undo them.

The wording is confusing, yes. What we mean is that we think that switching to a simple (linear scan) greedy algorithm like LLVM's FastRegAlloc or DirectEmit's for register allocation is a better way to fix the register allocation performance (at least in a JIT setting where compilation performance is important).
Both of these allocators iterate only once over the IR (LLVM backwards, DirectEmit forwards) and just assign any free register or spill using some simple heuristics without being able to reverse any decision made.
LLVM's allocator does not construct liveness information, DirectEmit constructs live intervals based on loops instead of ranges.
I'm not too familiar with LLVM's design in particular, however; the other author looked at that, so I do not know exactly what heuristics they are using.

Go's register allocator might also be interesting, it does some more advanced stuff but is still relatively fast AFAIK (though we haven't benchmarked it).

This obviously leads to worse codegen (DirectEmit for example has some problems with "overspilling" due to its architecture)
but it is still pretty good (as seen in the runtime benchmarks).

Chris Fallin said:

Also the description of RA2 as " a modified linear-scan allocator that calculates live-ranges for virtual registers, merges non-overlapping ones into bundles and splits them if too many are live in physical registers at the same time." is not really accurate: linear-scan implies a single pass to allocate (possibly with a pre-pass to compute liveness?), whereas RA2 can evict, split, and reallocate the smaller pieces -- it is backtracking.

I might have gotten confused. I think I saw some references to the original Linear Scan paper when I was reading up about it so that's where that might be coming from though I'm not sure anymore.


view this post on Zulip Chris Fallin (Jul 30 2024 at 15:26):

You'll be happy to know that a Google Summer of Code participant is actually writing a linear-scan algorithm as an option in RA2, so we will have that option!

view this post on Zulip T0b1 (Jul 30 2024 at 16:33):

Ah, cool that it got picked up. I'll definitely check out the results once it's available

view this post on Zulip Amanieu (Jul 30 2024 at 16:37):

It's not actually linear-scan, it's a single-pass backwards allocator. Linear scan requires building live ranges for each value, which a backwards allocator doesn't.
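A toy sketch of that single-pass backwards approach, with invented types and no spilling modeled: walking instructions last-to-first, a vreg is bound to a register at its backwards-first use (i.e. its program-order-last use) and released at its def, so no live ranges are ever computed.

```rust
use std::collections::HashMap;

// Toy IR instruction: at most one vreg defined, any number used.
struct Inst {
    def: Option<u32>,
    uses: Vec<u32>,
}

/// Single-pass *backwards* allocation in the spirit described above.
/// Illustrative only; real allocators handle spills, constraints, etc.
fn alloc_backwards(insts: &[Inst], num_pregs: usize) -> HashMap<u32, usize> {
    let mut free: Vec<usize> = (0..num_pregs).rev().collect();
    let mut assigned: HashMap<u32, usize> = HashMap::new();
    let mut result = HashMap::new();
    for inst in insts.iter().rev() {
        // A use seen while walking backwards is the last use in program
        // order: bind the vreg to a register here if it has none yet.
        for &v in &inst.uses {
            if !assigned.contains_key(&v) {
                let p = free.pop().expect("toy model: no spilling");
                assigned.insert(v, p);
                result.insert(v, p);
            }
        }
        // The def is where liveness ends (walking backwards), so the
        // register becomes free for earlier instructions.
        if let Some(v) = inst.def {
            if let Some(p) = assigned.remove(&v) {
                free.push(p);
            }
        }
    }
    result
}

fn main() {
    let insts = vec![
        Inst { def: Some(0), uses: vec![] },    // v0 = ...
        Inst { def: Some(1), uses: vec![0] },   // v1 = f(v0)
        Inst { def: None, uses: vec![0, 1] },   // use v0, v1
    ];
    let a = alloc_backwards(&insts, 2);
    // v0 and v1 are simultaneously live, so they get distinct registers.
    assert_ne!(a[&0], a[&1]);
}
```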

view this post on Zulip Chris Fallin (Jul 30 2024 at 16:43):

I too am guilty of approximating the meaning of linear-scan :-) Yes, even simpler and faster; inspiration from the SSRA (solid-state register allocator) post

view this post on Zulip Amanieu (Jul 30 2024 at 23:28):

By the way, I noticed in profiles that regalloc2 spends 2.5x more time in merge_vreg_bundle than regalloc3 does despite using the same algorithm.

In regalloc3 I only allow blockparams on blocks with more than one predecessor, so the regalloc2->regalloc3 adaptation layer is eliminating all the unnecessary blockparams by renaming vregs into values. Perhaps that's a sign that there is some performance to gain if Cranelift stops emitting these unnecessary blockparams?

view this post on Zulip Chris Fallin (Jul 30 2024 at 23:29):

certainly possible; this is with optimizations enabled (including CL's constant phi removal pass) or no?

view this post on Zulip Chris Fallin (Jul 30 2024 at 23:29):

we have had issues with excess blockparam creation in the past

view this post on Zulip Amanieu (Jul 30 2024 at 23:35):

It was with the default for wasmtime compile, I'm not sure if this enables optimizations.

view this post on Zulip Amanieu (Jul 30 2024 at 23:35):

It was definitely triggering errors in my validator about blockparams on blocks with only 1 predecessor.

view this post on Zulip Amanieu (Jul 30 2024 at 23:36):

I even tried including https://github.com/bytecodealliance/wasmtime/pull/8565 but it was still generating unnecessary blockparams.


view this post on Zulip Leaves (Jul 31 2024 at 06:38):

Does Cranelift produce a pruned SSA? I don't think so…

view this post on Zulip Leaves (Jul 31 2024 at 06:38):

So it isn’t trying to remove all truly useless phis

view this post on Zulip Leaves (Jul 31 2024 at 06:38):

Or rather, it may be removing dead phis that don't really join anything, but it might not be (and I don't think it is) removing all truly useless phis

view this post on Zulip Leaves (Jul 31 2024 at 06:39):

I’m not sure yet though, and I’m studying the code daily trying to get a better understanding of it

view this post on Zulip Chris Fallin (Jul 31 2024 at 15:19):

@Leaves The remove-constant-phis pass at least claims to remove "useless phis" (those with all inputs the same) as well
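A toy version of that check, with invented names: a block parameter whose incoming values are all the same (ignoring self-references, which arise in loops) can be replaced by that single value; a parameter that genuinely joins distinct values cannot.

```rust
/// Hypothetical helper: if every predecessor passes the same value to this
/// block parameter (self-references excluded), return that value as the
/// replacement; otherwise return None. Not Cranelift's actual code.
fn phi_replacement(param: u32, incoming: &[u32]) -> Option<u32> {
    let mut unique = None;
    for &v in incoming {
        if v == param {
            continue; // self-reference on a loop back-edge contributes nothing
        }
        match unique {
            None => unique = Some(v),
            Some(u) if u == v => {}
            Some(_) => return None, // joins distinct values: a real phi
        }
    }
    unique
}

fn main() {
    // All non-self inputs are v7: useless phi, replace with v7.
    assert_eq!(phi_replacement(3, &[7, 7, 3]), Some(7));
    // Joins v7 and v8: a genuine join point, keep it.
    assert_eq!(phi_replacement(3, &[7, 8]), None);
}
```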


Last updated: Nov 22 2024 at 17:03 UTC