Stream: git-wasmtime

Topic: wasmtime / issue #6818 Cranelift: should we do branch opt...


Wasmtime GitHub notifications bot (Aug 07 2023 at 19:05):

cfallin opened issue #6818:

@alexcrichton brought up in #6810

And again, to clarify: I'm not advocating for removing branch optimizations from the mach buffer here and now. I realize that would be a deep refactoring, would take quite some work, and that the current design is the result of historical work and balancing efforts. I still believe, though, that this probably isn't the best place to do this, because it's extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch folding, with knock-on effects from that. For example, I'd imagine that if constant-branch folding were implemented, then many of the optimizations the mach buffer does would fall out of it, and in that situation the mach buffer might not need to be as clever as it is today.

This issue is meant to answer the question: should we do this refactor, and why or why not?

First, addressing some specific points:

extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch folding, with knock-on effects from that.

It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in MachBuffer precludes replacing a brif with a jump when we know the conditional is constant (and we hope to do that optimization in the future!).

For example, I'd imagine that if constant-branch folding were implemented, then many of the optimizations the mach buffer does would fall out of it

Unfortunately, I think this is also not quite the case, for several reasons:

So basically, pass ordering and regalloc needs mean that block ordering is only known late, which means that block-order-dependent changes cannot happen in the mid-end.


I think both of the above are possibly springing from a slight misinterpretation of what purpose the MachBuffer's branch-folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter corresponds to "EBBs" (extended basic blocks), which were formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.
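
To make the two forms concrete, here's a minimal sketch (hypothetical Rust types, not Cranelift's actual VCode terminators):

    type Label = u32;

    // N-target form: every terminator names all of its successors explicitly,
    // so a block's meaning does not depend on where its neighbors are placed.
    enum NTargetTerminator {
        Jump { dest: Label },
        CondBr { taken: Label, not_taken: Label }, // both targets explicit
    }

    // Machine-style form: a conditional branch names only the taken target;
    // the not-taken case falls through to whatever block is emitted next, so
    // correctness now depends on the final block order.
    enum MachTerminator {
        Jump { dest: Label },
        CondBr { taken: Label }, // fallthrough target is implicit
        Fallthrough,             // no branch at all; successor must be next in order
    }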

EBBs were removed four years or so ago (I think? it was before my time), for good reason: they require subtle and not-well-understood tweaks to common algorithms, they're hard to reason about, and they infect all of the compiler backend. Basically, moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.

EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant that precludes a bunch of other optimizations, as noted above.

So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.


Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label-resolution algorithm. I don't think this is necessarily or fundamentally the case. Label resolution (island/veneer insertion) and the branch optimizations interact in subtle ways, but they are separate, tied together by one invariant: the list of most-recent branches contiguous to the end of the buffer (this is the visibility window of the peephole opts) must correspond to the latest fixups in the fixup list.

Veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous to the end of the buffer anymore), so we can actually do whatever we want with more complex data structures and algorithms without violating the invariant. The only requirement is that between islands we keep a linear list of the most recent fixups (which we can clear whenever we emit a non-branch instruction).
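
To spell out that invariant in code, here's a heavily simplified sketch (hypothetical field and type names, not MachBuffer's real layout):

    struct PendingBranch {
        start: u32,         // offset where this branch's bytes begin
        end: u32,           // offset just past this branch's bytes
        fixup_index: usize, // the fixup recorded when this branch was emitted
    }

    struct Fixup {
        label: u32,
        offset: u32,
    }

    struct Buffer {
        data: Vec<u8>,
        latest_branches: Vec<PendingBranch>, // contiguous run ending at the buffer tip
        fixups: Vec<Fixup>,
    }

    impl Buffer {
        // The invariant: walking the tracked branches backwards from the buffer
        // tip, each branch must end exactly where the one after it starts, and
        // each must point at a fixup that actually exists in the fixup list.
        fn check_invariant(&self) {
            let mut expected_end = self.data.len() as u32;
            for b in self.latest_branches.iter().rev() {
                assert_eq!(b.end, expected_end);
                assert!(b.fixup_index < self.fixups.len());
                expected_end = b.start;
            }
        }

        // Emitting a non-branch instruction or an island breaks contiguity, so
        // the tracked-branch window is simply cleared; the fixups themselves stay.
        fn clear_branch_window(&mut self) {
            self.latest_branches.clear();
        }
    }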


So from my perspective, there is a proposal for a large, high-risk refactor of the compiler backend, motivated by two reasons that I don't think actually hold (branch constant folding gets better? --> no, these are different and orthogonal things; it precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal, though -- curious what others think!

Wasmtime GitHub notifications bot (Aug 07 2023 at 19:06):

cfallin commented on issue #6818:

(Thanks to @jameysharp, @elliottt and others for talking through this a bit offline and especially Jamey for noting the implications on block order above. @alexcrichton if you want to discuss further we can definitely do so in the next Cranelift meeting, if that's easier!)

Wasmtime GitHub notifications bot (Aug 08 2023 at 21:21):

alexcrichton commented on issue #6818:

Thanks for taking the time to write this up! I've added this to the agenda for tomorrow to go over if there's anything remaining, mostly because I'm still curious to dig into more details if you and others are willing!

extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch folding, with knock-on effects from that.

It's unclear to me why this is the case -- could you clarify?

My gut is that this level of optimization is best done earlier in the pipeline, which is where my thinking stems from, but there are other concrete issues which I think would rely on doing branch-related optimizations earlier rather than at the very end:

Basically, my thinking is that while we could make MachBuffer fancier with respect to branching, it seems like much of the benefit should come from the mid-end optimizations and their knock-on effects.

I think both of the above are possibly springing from a slight misinterpretation of what purpose the MachBuffer's branch-folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step

This is a good point, yeah, and I think a good way to help me frame my thinking. On one hand, we as a compiler have the problem of optimizing a CFG: for example, A->B->C could become A+B -> C under some conditions, and things like br_if true a, b can become jump a, which might enable further CFG simplifications. This, in my mind, is all about taking a messy CFG to a "clean" CFG, which would handle the issues I described above (e.g. no ret-in-its-own-block, and DCE would fall out of CFG rewriting). There is, however, still the problem of taking a perfect CFG and lowering it to optimal machine code.
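
To illustrate, here's a toy sketch of that kind of constant-branch rewrite (hypothetical types, not actual CLIF data structures):

    enum Value {
        Const(bool),
        Unknown,
    }

    enum Term {
        BrIf { cond: Value, then_dest: u32, else_dest: u32 },
        Jump { dest: u32 },
    }

    // br_if on a known-constant condition becomes an unconditional jump,
    // which can then unlock further CFG simplification (edge removal, block
    // merging, DCE of the now-unreachable side).
    fn fold_constant_branch(t: Term) -> Term {
        match t {
            Term::BrIf { cond: Value::Const(c), then_dest, else_dest } => {
                // The condition is known: keep only the reachable edge.
                Term::Jump { dest: if c { then_dest } else { else_dest } }
            }
            other => other,
        }
    }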

I feel like MachBuffer has historically conflated these two operations, performing both CFG optimization and optimal linear layout of a CFG at the same time. My thinking has been that the CFG-optimizing parts of MachBuffer would ideally be split out into an independent pass in Cranelift, executed at some earlier point, perhaps intertwined with egraph optimizations somehow (e.g. to make use of constant propagation and/or GVN).

I'll also clarify that I barely know what an EBB is, much less advocate for it. And again, if anything I'm thinking of comes with a fundamental compile-time or runtime performance penalty then it's not worth it; I'm trying to see if we can have our cake and eat it too.

I'll definitely admit that this may all be stemming from me misunderstanding MachBuffer and its optimizations. If all it's doing is the optimal linear layout of a CFG, without extra optimizations, then there's no escaping that class of problem and there's nothing more to be done.


All that being said, your observation about the recent-branches list being empty at islands is, I think, a great one, and would be a way to solve the suboptimal branch generation from https://github.com/bytecodealliance/wasmtime/pull/6804. I might try to give that a stab in the future!

Wasmtime GitHub notifications bot (Aug 09 2023 at 16:55):

alexcrichton commented on issue #6818:

Ok, I talked more with folks at the Cranelift meeting today, and the conclusion I took away was that the MachBuffer optimizations are all required at this time to produce an optimal linear layout of the input CFG. In terms of the distinction I drew above between CFG optimization and optimal CFG layout, the MachBuffer only deals with the latter, which all sounds good.

In that sense, I believe this issue is resolved, as the answer is "yes, the MachBuffer should continue to produce high-quality code".

As for the one remaining issue, the eager promotion of veneers, I think @cfallin's idea will solve it well. I mentioned today that I had a branch (which is here) implementing the BinaryHeap idea, although it only works by disabling optimization, so it's of course not landable as-is, but I wanted to share it since it has the bits about removing the precalculation of the island deadline.
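
For reference, the shape of the BinaryHeap idea is roughly this (a hedged sketch with hypothetical names, not the code on that branch):

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    #[derive(Default)]
    struct VeneerTracker {
        // Reverse(...) turns Rust's max-heap into a min-heap, so peek() yields
        // the (deadline, label) pair with the soonest deadline.
        pending: BinaryHeap<Reverse<(u32, u32)>>,
    }

    impl VeneerTracker {
        // Record a fixup whose referenced label must resolve (or get a veneer)
        // by code offset `deadline`.
        fn add_fixup(&mut self, deadline: u32, label: u32) {
            self.pending.push(Reverse((deadline, label)));
        }

        // True if some pending fixup would go out of range before the buffer
        // reaches `next_offset`, i.e. an island must be emitted now.
        fn island_needed(&self, next_offset: u32) -> bool {
            matches!(self.pending.peek(), Some(Reverse((deadline, _))) if *deadline <= next_offset)
        }
    }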

Wasmtime GitHub notifications bot (Aug 09 2023 at 17:09):

jameysharp commented on issue #6818:

Given that there is a small, ISA-specific, constant number of possible deadline distances (e.g. 2^20, 2^27, 2^32), it occurs to me that a priority queue isn't strictly necessary. We could keep one VecDeque per distance instead. Then all the labels that are at their deadlines sit together at the beginning of their respective lists; you can still remove the last-added labels efficiently if needed; and nothing needs to repeatedly visit labels in the middle. Does that help?
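
Something like this minimal sketch, say (hypothetical names; real Cranelift gets the distance from MachInstLabelUse::max_pos_range, and offsets are widened to u64 here so a 2^32 range fits):

    use std::collections::VecDeque;

    // One queue per distinct maximum reach (only a handful exist per ISA, so
    // a linear scan over this Vec is effectively O(1)).
    #[derive(Default)]
    struct DeadlineQueues {
        queues: Vec<(u64, VecDeque<(u32, u64)>)>, // (max_range, queue of (label, use_offset))
    }

    impl DeadlineQueues {
        // Fixups arrive in increasing offset order, so pushing to the back
        // keeps each queue sorted: the front is always its earliest deadline.
        fn add_fixup(&mut self, max_range: u64, label: u32, use_offset: u64) {
            match self.queues.iter_mut().find(|(r, _)| *r == max_range) {
                Some((_, q)) => q.push_back((label, use_offset)),
                None => {
                    let mut q = VecDeque::new();
                    q.push_back((label, use_offset));
                    self.queues.push((max_range, q));
                }
            }
        }

        // Earliest deadline overall: the minimum of use_offset + max_range
        // over the fronts of all queues.
        fn earliest_deadline(&self) -> Option<u64> {
            self.queues
                .iter()
                .filter_map(|(r, q)| q.front().map(|(_, off)| off + r))
                .min()
        }
    }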

Wasmtime GitHub notifications bot (Aug 09 2023 at 18:05):

alexcrichton commented on issue #6818:

I think that would work as well, yeah, but wouldn't that still require the "flush" behavior during island insertion? Otherwise, I'm at least not too worried about the overhead of a BinaryHeap, and it feels like a simpler alternative to reach for -- but I think I'm on a bit of a one-track mind right now...

Wasmtime GitHub notifications bot (Aug 09 2023 at 18:18):

cfallin commented on issue #6818:

It's possible we can make do with just Vecs, actually. The key things are that (i) as fixups are pushed prior to an island, the only operation we need is "chomp" (so push/pop on a vec suffices); and (ii) after the island, we can't chomp anymore, so it suffices to sort by deadline. As you're observing, for a given class we will have monotonically increasing deadlines, so we can keep the "long-term vecs" per class in reverse-deadline order and pop off the back of all of them.

That said, and as I mentioned in the earlier thread, I worry a lot about complexity here so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on LabelUse arms either (they are different per architecture), strictly in terms of generic programming in Rust. (I guess the trait on the enum could map the kinds into an integer index space, with each concrete implementation a repr(u8) or whatever, but now we're way down the rabbit hole.) So maybe it's best to try the simpler thing and then see whether we need anything more...

Wasmtime GitHub notifications bot (Aug 09 2023 at 18:19):

cfallin edited a comment on issue #6818:

It's possible we can make do with just Vecs, actually. The key things are that (i) as fixups are pushed prior to an island, the only operation we need is "chomp" (so push/pop on a vec suffices); and (ii) after the island, we can't chomp anymore, so it suffices to sort by deadline. As you're observing, for a given class we will have monotonically increasing deadlines, so we can keep the "long-term vecs" per class in reverse-deadline order and pop off the back of all of them.

That said, and as I mentioned in the earlier thread, I worry a lot about complexity here so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on LabelUse arms either (they are different per architecture), strictly in terms of generic programming in Rust. (I guess the trait on the enum could map the kinds into an integer index space, with each concrete implementation a repr(u8) or whatever, but now we're way down the rabbit hole.) So maybe it's best to try the simpler thing (EDIT for clarity: that would be @alexcrichton 's BinaryHeap suggestion) and then see whether we need anything more...
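
A minimal sketch of that two-phase scheme (hypothetical names, sidestepping the per-class generic-programming question):

    struct Fixup {
        label: u32,
        deadline: u32,
    }

    #[derive(Default)]
    struct FixupLists {
        pending: Vec<Fixup>,   // push/pop ("chomp") between islands
        long_term: Vec<Fixup>, // reverse-deadline order after an island
    }

    impl FixupLists {
        // Between islands the newest fixup is always last, so editing away
        // the most recent branch is just a pop.
        fn chomp(&mut self) -> Option<Fixup> {
            self.pending.pop()
        }

        // At island emission no more chomping is possible, so fold the
        // pending fixups into the long-term list, soonest deadline last.
        fn flush_to_long_term(&mut self) {
            self.long_term.append(&mut self.pending);
            self.long_term.sort_by(|a, b| b.deadline.cmp(&a.deadline));
        }

        // The soonest deadline is popped/read from the back.
        fn next_deadline(&self) -> Option<u32> {
            self.long_term.last().map(|f| f.deadline)
        }
    }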

Wasmtime GitHub notifications bot (Aug 09 2023 at 18:52):

jameysharp commented on issue #6818:

I figured we'd just have what amounts to a map keyed on the CodeOffset (a u32) from MachInstLabelUse::max_pos_range. We don't actually care what kind of label it is for this purpose; we care what maximum offset it has. There are no more than four distinct max offsets on any current target, so even a linear scan over an array-of-pairs "map" is asymptotically O(1), and the constant factors might be better than a HashMap's.

But I don't quite understand the interaction between veneer emission and label chomping, so I'll take your word for it, both of you, that the binary heap is fine.

Wasmtime GitHub notifications bot (Aug 25 2023 at 00:37):

cfallin closed issue #6818.


Last updated: Oct 23 2024 at 20:03 UTC