Stream: cranelift

Topic: blog post on branch optimizations in new backend


view this post on Zulip Chris Fallin (Jan 22 2021 at 20:52):

Hi all -- thought I would add a pointer here to a new blog post describing the new backend's branch optimizations and efficient binary-code generation: https://cfallin.org/blog/2021/01/22/cranelift-isel-2/. Feel free to share where you think it might be interesting :-) Happy to discuss here as well

view this post on Zulip bjorn3 (Jan 22 2021 at 22:18):

The diagram for step 2 uses L2/L3 for the metadata where L1/L2 should have been used it seems.

view this post on Zulip bjorn3 (Jan 22 2021 at 22:30):

This is a really well written article @Chris Fallin!

view this post on Zulip Chris Fallin (Jan 22 2021 at 22:49):

Thanks @bjorn3, fixed!

view this post on Zulip bjorn3 (Jan 22 2021 at 23:07):

What did you use to make the diagrams by the way?

view this post on Zulip Chris Fallin (Jan 22 2021 at 23:10):

I drew them in Inkscape and saved as SVGs; text-to-path first to avoid issues with web fonts

view this post on Zulip Chris Fallin (Jan 22 2021 at 23:11):

I dunno if that's the best way to do vector graphics for blog posts but I've been using Inkscape too long to try anything else :-)

view this post on Zulip Kasey C (Jan 26 2021 at 00:25):

Excellent writeup! I'm looking forward to reading future entries in the series.

Are (3,5) and (4,5) not merged into 3&4 because (2,3) and (2,4) were? I saw a note about that in the linked implementation. Otherwise they would be candidate as they are not critical edges, correct?

view this post on Zulip Chris Fallin (Jan 26 2021 at 00:36):

Thanks! And yes -- edges (3,5) and (4,5) are not merged because we have (for simplicity) a design invariant that only one edge block can merge with an original block. Otherwise, the enum that encodes the source of a lowered block -- edge, edge+orig, orig+edge, or orig -- and its successor function would become more complex, and computing the successor would require more lookahead. I opted to keep it simple and it seemed that the marginal gain would be minimal beyond that :-)

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

view this post on Zulip Olivier FAURE (Feb 18 2021 at 11:37):

Hello! I have a question about branch optimization during lowering, as described in the article.

My understanding is that we have a CFG of separately-allocated CLIF blocks, which we transform into a continuous buffer of VCode instructions, which we then transform into a continuous buffer of machine instructions; most of the branch optimizations are done in that last pass with a peephole optimizer of sorts, that looks at the last few branch instructions emitted and changes them as needed.

Now, I understand how this pass can handle fallthroughs and branch relaxation. But I don't understand how it handles jump threading and dead block removal.

Like, the principle of jump threading is that, if you have a block starting at L1 that does nothing but jump at L2, and the block starting at L3 jumps at L1, you can replace the jump at L1 to a jump at L2 and elide all the code in L1.

But I don't understand how you can do that in a streaming compiler. By the time the code at L3 is processed, the code at L1 is already "frozen". And in the opposite scenario (a block B1 ends with a jump to "future" block B4 that itself jumps to future code B5, etc), by the time you're writing B4 and you figure out that B1 can jump directly to B5, B1 has already been written and frozen. How do you optimize that?

view this post on Zulip Chris Fallin (Feb 18 2021 at 16:25):

Hi @Olivier FAURE ! So there are a few interesting cases here that we can handle. The key is the redirect table described in the post; by updating that on-the-fly, consulting it eagerly, and using it when doing branch-target fixups, we can handle all of the cases you describe.

  1. L1 -> L2; L3 -> L1. Here, we would emit the branch at L1 to L2, but also note in the table that L1 is an alias for L2. Importantly, also, if the only instruction in the block at L1 is a branch to L2 (i.e., the immediately preceding instruction is also an unconditional branch), we can completely remove the branch. Then L1 exists only as an alias, and not as any machine code at all. That's this case in the code. Finally, when we process the branch at L3 to L1, we note the fixup record ("target at this offset must refer to L1") via the LabelUse mechanism; when we patch in the offset, we use the redirect table to note that L1 is actually L2, and patch in the offset for L2.

    So the key insight is that we remove the empty block before it's frozen, and we can know it's an empty block by recognizing the two-back-to-back-branch pattern. This is safe because we set up the redirect at the same time and always refer to it.

  2. L1 -> L4, L4 -> L5. Here we will note the redirect L4 -> L5 as above. When we later come back to patch the offset in the first branch to refer to L4, we actually patch to refer to L5 instead.

    So the key insight here is that we have the limited second pass in which we just process fixups, and at that time we know about all "future" redirects. It's not really a conventional or full two-pass algorithm because the second pass is not O(code size), just O(branches); it's like relocation processing, and is truly unavoidable if there are any forward branches at all (since we don't know offsets until we see them).

Does that make sense? If you're curious you can also add test cases with interesting patterns in buffer.rs above and enable debug logging; you should see the branch-chomping in real time :-)

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

view this post on Zulip Olivier FAURE (Feb 18 2021 at 17:45):

I might start hacking into the code with debug logging soon, yeah. I've been meaning to for a while but 1) it's intimidating and I'm about to take a short break from coding 2) I'm not sure where to start anyway. But your post is super helpful!

view this post on Zulip Olivier FAURE (Feb 18 2021 at 17:48):

So just to be clear, you're saying we have a full pass where we produce all the instructions in our contiguous buffer, and a second pass where go over all jump/branch instructions in that buffer and fix them up to do jump threading?

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:51):

Well, not quite: most of the optimizations (and all of the code deletion) happen on the main pass, during emission. The second pass is only over the fixup list. The reason that is relevant to your question is that we know "the future" when we handle target fixups, so we can chase through redirects. But emphatically we don't have an LLVM-style "generate it all first then edit it later" design

view this post on Zulip Olivier FAURE (Feb 18 2021 at 17:52):

So what do the fixups cover?

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:52):

They are "this branch at this offset refers to this target" records

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:53):

The struct is here

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:53):

The name may be a bit of a misnomer, sorry: it's not a general fixup (in the sense that we do all the branch opts); it's a label reference, i.e., patch these bits to the final code offset

view this post on Zulip Olivier FAURE (Feb 18 2021 at 17:54):

So jump threading is the only thing it can do?

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:54):

Well, no; it handles all relocations

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:54):

If I jump to a future block, a fixup record will record that and the code will come back later and patch it

view this post on Zulip Chris Fallin (Feb 18 2021 at 17:55):

but if you mean that it cannot indicate that we delete code, then yes; all code is frozen once emission goes beyond it

view this post on Zulip Olivier FAURE (Feb 18 2021 at 17:59):

Ok, so, if I'm following, you can send the following instructions to the machine instruction backend:

Then we do a second pass where we replace all the "TODO" jumps with their final frozen offset; this may result in jump threading if there is an occasion to do so, since all frozen offsets are final.

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:00):

yep, exactly!

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:01):

and to add possibly useful detail on your second point, the backwards -> TODO case happens when the branch goes backward to another branch that has an unresolved target (so the final, optimized target is still in the future)

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:02):

(and then there's the conditional inversion stuff but you could see that as an orthogonal streaming opt)

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:05):

Oh, ok. So if we're doing L1->L3 ; L2->... ; L3->... ; L4->L1 (where ... means "non-empty block") then L1->L3needs to be fixed up because L3 was in the future when L1 was emitted, but L4->L1 doesn't need to be fixed up because it threaded up to L3 which was in the past when L4 was emitted.

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:07):

yes

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:07):

Another part I'm having trouble with:

Any offset that (i) immediately follows an unconditional jump, and (ii) has no labels bound to it, is unreachable; an unconditional jump at an unreachable offset can be elided. (Actually, any code at an unreachable offset can be removed, but for simplicity and to make it easier to reason about correctness, we restrict the MachBuffer’s edits to code explicitly marked as branch instructions only.)

How can an offset be unreachable if there's an unconditional jump to it?

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:07):

ah, there's no unconditional jump to it; rather, it immediately follows an unconditional jump in the linear code stream

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:08):

so in other words, jmp somewhere_else; unreachable_inst

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:08):

if there are no labels on unreachable_inst, then there's no way to execute it (because there's no fallthrough into it either)

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:09):

So what gets elided in that case?

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:10):

just the unreachable_inst

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:11):

and specifically only if unreachable_inst is a jump too

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:11):

Oh, okay. But only if that instr is a jump?

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:11):

Right

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:11):

(that's for simplicity as noted above)

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:12):

it basically is complementary to the redirection; we first redirect all labels, then often it's the case the block was just a jump and we can delete it

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:12):

(and these empty blocks happen all the time because of split critical edges)

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:12):

I'm guessing the assumption is that most unreachable blocks would be elided in MIR or CLIR, if at all possible, and remaining unreachable blocks will mostly be short blocks with jumps and nothing else?

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:12):

well, these are generally not coming from unreachable blocks in CLIF; in fact the BlockLoweringOrder would not include such blocks at all

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:13):

they become unreachable only after we redirect labels

view this post on Zulip Olivier FAURE (Feb 18 2021 at 18:13):

Ok. Thanks for the detailed answers :D

view this post on Zulip Chris Fallin (Feb 18 2021 at 18:15):

No problem, happy to help!


Last updated: Nov 22 2024 at 16:03 UTC