Stream: cranelift

Topic: Cranelift regalloc on arm64


view this post on Zulip pitust (Nov 17 2025 at 17:23):

I'm having some issues with cranelift's code generation being quite bad...

IR and assembly: https://paste.sr.ht/~pitust/29f8e92da8a90f94336d4d7b648f3a514dd612b4

Note that at line 24 of the asssembly, cranelift decides to move a saved register into x1, dereference it, then overwrite x1 with something else. Is there something weird going on that I'm not aware of around regalloc in cranelift or something like that? Or maybe I'm missing some option (I set opt_level to speed and regalloc_algorithm to backtracking, but the behavior persists...)

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:27):

The short answer is that regalloc (in general, not just ours) is full of heuristics and sometimes does the wrong thing because it's an NP-hard binpacking problem. Unfortunately I don't have time to look at specific cases at the moment, but if you're willing to, PRs welcome. (Sorry, I don't mean to sound dismissive with that, just trying to level-set -- we aren't at a level of staffing that can allow for fine-tuning at this level at the moment)

view this post on Zulip pitust (Nov 17 2025 at 17:30):

Yeah, I know it's an NP-hard problem and all, but this specific case is also not a very complex one (there are more available machine registers than values, and not all of the values even need a register of their own, and this definitely has the "some heuristic broke" vibe rather than the "this is just a really hard problem to solve" vibe), so I figured I'd ask anyway in case it's like a known bug or something.

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:31):

It's a fair question to ask! If you decide to dig into this, I'm at least happy to point at relevant bits of code

view this post on Zulip pitust (Nov 17 2025 at 17:31):

Sure! I guess the first step is enabling logs from log?

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:32):

yes, the trace-log feature in cranelift-codegen guards a bunch of logging to eliminate dynamic checks, so enable that, then WASMTIME_LOG=trace will give you gigabytes of stuff; probably also RAYON_NUM_THREADS=1 so parallel compilation doesn't confuse you

view this post on Zulip pitust (Nov 17 2025 at 17:33):

Does cranelift create threads on its own? (since I'm writing a custom frontend)

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:33):

ah, no, that's a Wasmtime thing

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:33):

so RUST_LOG=trace (or whatever your logging frontend is in your embedding) too

view this post on Zulip pitust (Nov 17 2025 at 17:37):

Okay, sure. I have a whole lot of logs now. What should I look for?

view this post on Zulip pitust (Nov 17 2025 at 17:44):

I think I know what might be going wrong. It seems like cranelift tries really hard to keep the arguments in the same registers as they were passed in, even if that makes no sense...

view this post on Zulip pitust (Nov 17 2025 at 17:54):

Yeah, the bad codegen disappears if i set opt_level to none and mangle the pointer by running let ctx = b.ins().band_imm(ctx, (!1u64) as i64);...

view this post on Zulip pitust (Nov 17 2025 at 17:55):

With optimizations enabled, then the bitand gets rematerialized and the bad regalloc is back.

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:58):

What should I look for?

You aren't going to like the answer but unfortunately it's "understand the regalloc's algorithms and see where they are doing something suboptimal" :-/

view this post on Zulip Chris Fallin (Nov 17 2025 at 17:59):

cranelift tries really hard to keep the arguments in the same registers as they were passed in

It's not exactly that, more that regalloc has constraints, the fixed-reg constraints apply to the args, and there are heuristics for when one wants to split liveranges

view this post on Zulip pitust (Nov 17 2025 at 18:03):

Yeah, I figured. What I think happens is that cranelift decides to not split the liverange but just spill, and then after unspilling it maintains the same register assignment (for whatever reason), and then later it decides to move it into a different register instead of spilling, and then move elision fails for whatever reason.

view this post on Zulip Chris Fallin (Nov 17 2025 at 18:05):

a few things just to help: (i) there aren't any spills here, just moves between registers (those are distinct); (ii) move elision will remove moves that are redundant, but it's not going to see that A was copied to B and we can use A directly instead of B

view this post on Zulip pitust (Nov 17 2025 at 18:09):

thanks

view this post on Zulip pitust (Nov 17 2025 at 18:55):

I think the issue is around split_and_requeue_bundle: when a bundle is split because one fixed register constraint contradicts another, then the register allocator should explicitly try to not use that fixed register when allocating, instead of having it be the preferred register...

view this post on Zulip Chris Fallin (Nov 17 2025 at 18:58):

when a split happens, the split-off portion has a preference but not a hard requirement anymore -- the requirements for each half are recomputed

view this post on Zulip pitust (Nov 17 2025 at 18:59):

Yeah, but the soft assignment is still problematic.

view this post on Zulip pitust (Nov 17 2025 at 19:00):

The ranges may have been split because that register cannot be used, in which case, the register is explicitly the one you don't want to use.

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:01):

OK -- if there is a conflict, or if other liveranges have stronger preference, the weights should move the split-off half to another register instead

view this post on Zulip pitust (Nov 17 2025 at 19:01):

Yeah. I'll try implementing something like that.

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:01):

No, I mean, that's how things already work

view this post on Zulip pitust (Nov 17 2025 at 19:01):

Oh?

view this post on Zulip pitust (Nov 17 2025 at 19:01):

That.. doesn't seem to be happening?

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:03):

Well, it's all heuristics. There are use-weights and there are costs that are considered during assignment. A simplistic "I would simply not use that register" heuristic stacked on top isn't going to be the answer, because there is already a fairly complex cost machinery considering the tradeoffs

view this post on Zulip pitust (Nov 17 2025 at 19:03):

In the split off LR, the hint is checked as the first thing to be processed. It can work (because the thing stopping the register from being viable is between the two LRs), but it results in some extra shuffling (because to get the register from one LR to the other, some other register needs to be used).

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:04):

This is the backtracking part of the allocator: when another use comes along that can make use of the register, it has the opportunity to undo that choice

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:04):

if it doesn't, then the cost machinery could use tuning

view this post on Zulip pitust (Nov 17 2025 at 19:04):

Ah, wait, I didn't read the code carefully enough. Hmm.

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:04):

I'd caution against simplistic answers here though: "don't consider the hint" is going to break a lot of other stuff

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:04):

(or not "break" but "make much less good")

view this post on Zulip pitust (Nov 17 2025 at 19:23):

I'm confused by something. When cranelift splits the liverange into two parts, the log it prints issplit bundle LiveBundleIndex(0) at progpoint4-pre and requeue with reg hint (for first part) PReg(hw = 0, class = Int, index = 0).

But it uses the hint for both parts?

view this post on Zulip Chris Fallin (Nov 17 2025 at 19:31):

I believe the hint is just for the first part, but the code would be the definitive answer. Unfortunately I can't give much more time to this this morning, but best of luck (sorry!)

view this post on Zulip pitust (Nov 17 2025 at 19:33):

no worries

view this post on Zulip Amanieu (Nov 18 2025 at 04:36):

I looked into this: the root cause from regalloc2's choice to split bundles into 3 parts instead of 2. It moves the middle part of the live range between 2 conflicting uses directly to the spill bundle (which is then allocated to a register in the second chance allocation pass). This results in the unnecessary move because the middle and end bundles end up assigned different registers since regalloc2 doesn't know they are related.

view this post on Zulip Amanieu (Nov 18 2025 at 04:44):

I also tried the example code with regalloc3 (disassembly since it has a completely different live range splitting algorithm. However it avoids the splitting in the first place by not putting a fixed-register constraint on the whole live range, instead only treating the constraint as a hint (which only applies to the first part of the live range if it is split).

I also looked into why regalloc2 spills more than regalloc3 and it's because of regalloc2's randomization of the probe order. This will tend to spread bundles over all available registers instead of minimizing the number of registers used.

GitHub Gist: instantly share code, notes, and snippets.

view this post on Zulip pitust (Nov 18 2025 at 10:15):

Yeah, the regaloc3 codegen looks much more reasonable. Out of curiousity, is it possible to use regalloc3 with cranelift? (I know it's not yet production-ready, but this isn't very security-critical code and I don't mind getting crashes).

view this post on Zulip Chris Fallin (Nov 18 2025 at 17:22):

We talked about that over the summer in Cranelift meetings; the conclusion was that we who use Wasmtime+Cranelift in security-critical contexts (i) would need to review regalloc3 but (ii) don't have the bandwidth to do so to the degree that we would understand the codebase and be able to fix it in an emergency or add features as needed. Basically, regalloc is too load-bearing to outsource

view this post on Zulip pitust (Nov 18 2025 at 17:26):

Ah, and I guess the burden of maintaining both concurrently is too great? That is understandable, then.

view this post on Zulip Chris Fallin (Nov 18 2025 at 17:28):

Unfortunately, yeah. If we had a large compiler team it would be an obvious choice to ramp up on it, but, sadly, that's not the case

view this post on Zulip Amanieu (Nov 18 2025 at 17:44):

@pitust If you really want to try it out then you can use my branch from https://github.com/bytecodealliance/wasmtime/pull/11213. It may be slightly out of date though since the regalloc2 API has changed a bit since.

This is based on bytecodealliance/regalloc2#230 which adds regalloc3 as a back-end to regalloc2. Only one change had to be made to codegen: tail calls were reserving r11 as a scratch register by ma...

Last updated: Dec 06 2025 at 07:03 UTC