Hi all -- thought I would add a pointer here to a new blog post describing the new backend's branch optimizations and efficient binary-code generation: https://cfallin.org/blog/2021/01/22/cranelift-isel-2/. Feel free to share where you think it might be interesting :-) Happy to discuss here as well
The diagram for step 2 uses L2/L3 for the metadata where L1/L2 should have been used it seems.
This is a really well written article @Chris Fallin!
Thanks @bjorn3, fixed!
What did you use to make the diagrams by the way?
I drew them in Inkscape and saved as SVGs; text-to-path first to avoid issues with web fonts
I dunno if that's the best way to do vector graphics for blog posts but I've been using Inkscape too long to try anything else :-)
Excellent writeup! I'm looking forward to reading future entries in the series.
Are (3,5) and (4,5) not merged into 3&4 because (2,3) and (2,4) were? I saw a note about that in the linked implementation. Otherwise they would be candidate as they are not critical edges, correct?
Thanks! And yes -- edges (3,5) and (4,5) are not merged because we have (for simplicity) a design invariant that only one edge block can merge with an original block. Otherwise, the enum that encodes the source of a lowered block -- edge, edge+orig, orig+edge, or orig -- and its successor function would become more complex, and computing the successor would require more lookahead. I opted to keep it simple and it seemed that the marginal gain would be minimal beyond that :-)
Hello! I have a question about branch optimization during lowering, as described in the article.
My understanding is that we have a CFG of separately-allocated CLIF blocks, which we transform into a continuous buffer of VCode instructions, which we then transform into a continuous buffer of machine instructions; most of the branch optimizations are done in that last pass with a peephole optimizer of sorts, that looks at the last few branch instructions emitted and changes them as needed.
Now, I understand how this pass can handle fallthroughs and branch relaxation. But I don't understand how it handles jump threading and dead block removal.
Like, the principle of jump threading is that, if you have a block starting at L1 that does nothing but jump at L2, and the block starting at L3 jumps at L1, you can replace the jump at L1 to a jump at L2 and elide all the code in L1.
But I don't understand how you can do that in a streaming compiler. By the time the code at L3 is processed, the code at L1 is already "frozen". And in the opposite scenario (a block B1 ends with a jump to "future" block B4 that itself jumps to future code B5, etc), by the time you're writing B4 and you figure out that B1 can jump directly to B5, B1 has already been written and frozen. How do you optimize that?
Hi @Olivier FAURE ! So there are a few interesting cases here that we can handle. The key is the redirect table described in the post; by updating that on-the-fly, consulting it eagerly, and using it when doing branch-target fixups, we can handle all of the cases you describe.
L1 -> L2; L3 -> L1. Here, we would emit the branch at L1 to L2, but also note in the table that L1 is an alias for L2. Importantly, also, if the only instruction in the block at L1 is a branch to L2 (i.e., the immediately preceding instruction is also an unconditional branch), we can completely remove the branch. Then L1 exists only as an alias, and not as any machine code at all. That's this case in the code. Finally, when we process the branch at L3 to L1, we note the fixup record ("target at this offset must refer to L1") via the LabelUse
mechanism; when we patch in the offset, we use the redirect table to note that L1 is actually L2, and patch in the offset for L2.
So the key insight is that we remove the empty block before it's frozen, and we can know it's an empty block by recognizing the two-back-to-back-branch pattern. This is safe because we set up the redirect at the same time and always refer to it.
L1 -> L4, L4 -> L5. Here we will note the redirect L4 -> L5 as above. When we later come back to patch the offset in the first branch to refer to L4, we actually patch to refer to L5 instead.
So the key insight here is that we have the limited second pass in which we just process fixups, and at that time we know about all "future" redirects. It's not really a conventional or full two-pass algorithm because the second pass is not O(code size), just O(branches); it's like relocation processing, and is truly unavoidable if there are any forward branches at all (since we don't know offsets until we see them).
Does that make sense? If you're curious you can also add test cases with interesting patterns in buffer.rs
above and enable debug logging; you should see the branch-chomping in real time :-)
I might start hacking into the code with debug logging soon, yeah. I've been meaning to for a while but 1) it's intimidating and I'm about to take a short break from coding 2) I'm not sure where to start anyway. But your post is super helpful!
So just to be clear, you're saying we have a full pass where we produce all the instructions in our contiguous buffer, and a second pass where go over all jump/branch instructions in that buffer and fix them up to do jump threading?
Well, not quite: most of the optimizations (and all of the code deletion) happen on the main pass, during emission. The second pass is only over the fixup list. The reason that is relevant to your question is that we know "the future" when we handle target fixups, so we can chase through redirects. But emphatically we don't have an LLVM-style "generate it all first then edit it later" design
So what do the fixups cover?
They are "this branch at this offset refers to this target" records
The struct is here
The name may be a bit of a misnomer, sorry: it's not a general fixup (in the sense that we do all the branch opts); it's a label reference, i.e., patch these bits to the final code offset
So jump threading is the only thing it can do?
Well, no; it handles all relocations
If I jump to a future block, a fixup record will record that and the code will come back later and patch it
but if you mean that it cannot indicate that we delete code, then yes; all code is frozen once emission goes beyond it
Ok, so, if I'm following, you can send the following instructions to the machine instruction backend:
Then we do a second pass where we replace all the "TODO" jumps with their final frozen offset; this may result in jump threading if there is an occasion to do so, since all frozen offsets are final.
yep, exactly!
and to add possibly useful detail on your second point, the backwards -> TODO case happens when the branch goes backward to another branch that has an unresolved target (so the final, optimized target is still in the future)
(and then there's the conditional inversion stuff but you could see that as an orthogonal streaming opt)
Oh, ok. So if we're doing L1->L3 ; L2->... ; L3->... ; L4->L1
(where ...
means "non-empty block") then L1->L3
needs to be fixed up because L3 was in the future when L1 was emitted, but L4->L1
doesn't need to be fixed up because it threaded up to L3 which was in the past when L4 was emitted.
yes
Another part I'm having trouble with:
Any offset that (i) immediately follows an unconditional jump, and (ii) has no labels bound to it, is unreachable; an unconditional jump at an unreachable offset can be elided. (Actually, any code at an unreachable offset can be removed, but for simplicity and to make it easier to reason about correctness, we restrict the MachBuffer’s edits to code explicitly marked as branch instructions only.)
How can an offset be unreachable if there's an unconditional jump to it?
ah, there's no unconditional jump to it; rather, it immediately follows an unconditional jump in the linear code stream
so in other words, jmp somewhere_else; unreachable_inst
if there are no labels on unreachable_inst
, then there's no way to execute it (because there's no fallthrough into it either)
So what gets elided in that case?
just the unreachable_inst
and specifically only if unreachable_inst is a jump too
Oh, okay. But only if that instr is a jump?
Right
(that's for simplicity as noted above)
it basically is complementary to the redirection; we first redirect all labels, then often it's the case the block was just a jump and we can delete it
(and these empty blocks happen all the time because of split critical edges)
I'm guessing the assumption is that most unreachable blocks would be elided in MIR or CLIR, if at all possible, and remaining unreachable blocks will mostly be short blocks with jumps and nothing else?
well, these are generally not coming from unreachable blocks in CLIF; in fact the BlockLoweringOrder would not include such blocks at all
they become unreachable only after we redirect labels
Ok. Thanks for the detailed answers :D
No problem, happy to help!
Last updated: Dec 23 2024 at 12:05 UTC