I'm having some issues with cranelift's code generation being quite bad...
IR and assembly: https://paste.sr.ht/~pitust/29f8e92da8a90f94336d4d7b648f3a514dd612b4
Note that at line 24 of the asssembly, cranelift decides to move a saved register into x1, dereference it, then overwrite x1 with something else. Is there something weird going on that I'm not aware of around regalloc in cranelift or something like that? Or maybe I'm missing some option (I set opt_level to speed and regalloc_algorithm to backtracking, but the behavior persists...)
The short answer is that regalloc (in general, not just ours) is full of heuristics and sometimes does the wrong thing because it's an NP-hard binpacking problem. Unfortunately I don't have time to look at specific cases at the moment, but if you're willing to, PRs welcome. (Sorry, I don't mean to sound dismissive with that, just trying to level-set -- we aren't at a level of staffing that can allow for fine-tuning at this level at the moment)
Yeah, I know it's an NP-hard problem and all, but this specific case is also not a very complex one (there are more available machine registers than values, and not all of the values even need a register of their own, and this definitely has the "some heuristic broke" vibe rather than the "this is just a really hard problem to solve" vibe), so I figured I'd ask anyway in case it's like a known bug or something.
It's a fair question to ask! If you decide to dig into this, I'm at least happy to point at relevant bits of code
Sure! I guess the first step is enabling logs from log?
yes, the trace-log feature in cranelift-codegen guards a bunch of logging to eliminate dynamic checks, so enable that, then WASMTIME_LOG=trace will give you gigabytes of stuff; probably also RAYON_NUM_THREADS=1 so parallel compilation doesn't confuse you
Does cranelift create threads on its own? (since I'm writing a custom frontend)
ah, no, that's a Wasmtime thing
so RUST_LOG=trace (or whatever your logging frontend is in your embedding) too
Okay, sure. I have a whole lot of logs now. What should I look for?
I think I know what might be going wrong. It seems like cranelift tries really hard to keep the arguments in the same registers as they were passed in, even if that makes no sense...
Yeah, the bad codegen disappears if i set opt_level to none and mangle the pointer by running let ctx = b.ins().band_imm(ctx, (!1u64) as i64);...
With optimizations enabled, then the bitand gets rematerialized and the bad regalloc is back.
What should I look for?
You aren't going to like the answer but unfortunately it's "understand the regalloc's algorithms and see where they are doing something suboptimal" :-/
cranelift tries really hard to keep the arguments in the same registers as they were passed in
It's not exactly that, more that regalloc has constraints, the fixed-reg constraints apply to the args, and there are heuristics for when one wants to split liveranges
Yeah, I figured. What I think happens is that cranelift decides to not split the liverange but just spill, and then after unspilling it maintains the same register assignment (for whatever reason), and then later it decides to move it into a different register instead of spilling, and then move elision fails for whatever reason.
a few things just to help: (i) there aren't any spills here, just moves between registers (those are distinct); (ii) move elision will remove moves that are redundant, but it's not going to see that A was copied to B and we can use A directly instead of B
thanks
I think the issue is around split_and_requeue_bundle: when a bundle is split because one fixed register constraint contradicts another, then the register allocator should explicitly try to not use that fixed register when allocating, instead of having it be the preferred register...
when a split happens, the split-off portion has a preference but not a hard requirement anymore -- the requirements for each half are recomputed
Yeah, but the soft assignment is still problematic.
The ranges may have been split because that register cannot be used, in which case, the register is explicitly the one you don't want to use.
OK -- if there is a conflict, or if other liveranges have stronger preference, the weights should move the split-off half to another register instead
Yeah. I'll try implementing something like that.
No, I mean, that's how things already work
Oh?
That.. doesn't seem to be happening?
Well, it's all heuristics. There are use-weights and there are costs that are considered during assignment. A simplistic "I would simply not use that register" heuristic stacked on top isn't going to be the answer, because there is already a fairly complex cost machinery considering the tradeoffs
In the split off LR, the hint is checked as the first thing to be processed. It can work (because the thing stopping the register from being viable is between the two LRs), but it results in some extra shuffling (because to get the register from one LR to the other, some other register needs to be used).
This is the backtracking part of the allocator: when another use comes along that can make use of the register, it has the opportunity to undo that choice
if it doesn't, then the cost machinery could use tuning
Ah, wait, I didn't read the code carefully enough. Hmm.
I'd caution against simplistic answers here though: "don't consider the hint" is going to break a lot of other stuff
(or not "break" but "make much less good")
I'm confused by something. When cranelift splits the liverange into two parts, the log it prints issplit bundle LiveBundleIndex(0) at progpoint4-pre and requeue with reg hint (for first part) PReg(hw = 0, class = Int, index = 0).
But it uses the hint for both parts?
I believe the hint is just for the first part, but the code would be the definitive answer. Unfortunately I can't give much more time to this this morning, but best of luck (sorry!)
no worries
I looked into this: the root cause from regalloc2's choice to split bundles into 3 parts instead of 2. It moves the middle part of the live range between 2 conflicting uses directly to the spill bundle (which is then allocated to a register in the second chance allocation pass). This results in the unnecessary move because the middle and end bundles end up assigned different registers since regalloc2 doesn't know they are related.
I also tried the example code with regalloc3 (disassembly since it has a completely different live range splitting algorithm. However it avoids the splitting in the first place by not putting a fixed-register constraint on the whole live range, instead only treating the constraint as a hint (which only applies to the first part of the live range if it is split).
I also looked into why regalloc2 spills more than regalloc3 and it's because of regalloc2's randomization of the probe order. This will tend to spread bundles over all available registers instead of minimizing the number of registers used.
Yeah, the regaloc3 codegen looks much more reasonable. Out of curiousity, is it possible to use regalloc3 with cranelift? (I know it's not yet production-ready, but this isn't very security-critical code and I don't mind getting crashes).
We talked about that over the summer in Cranelift meetings; the conclusion was that we who use Wasmtime+Cranelift in security-critical contexts (i) would need to review regalloc3 but (ii) don't have the bandwidth to do so to the degree that we would understand the codebase and be able to fix it in an emergency or add features as needed. Basically, regalloc is too load-bearing to outsource
Ah, and I guess the burden of maintaining both concurrently is too great? That is understandable, then.
Unfortunately, yeah. If we had a large compiler team it would be an obvious choice to ramp up on it, but, sadly, that's not the case
@pitust If you really want to try it out then you can use my branch from https://github.com/bytecodealliance/wasmtime/pull/11213. It may be slightly out of date though since the regalloc2 API has changed a bit since.
Last updated: Dec 06 2025 at 07:03 UTC