cfallin opened issue #6818:
@alexcrichton brought up in #6810
And again to clarify I'm not advocating for here-and-now removing branch optimizations from the mach buffer. I realize it's a deep refactoring and would take quite some work and that additionally this is the result of historical work and balancing efforts. I still do believe though that this probably isn't the best place to do this because it's extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that. For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that, where in such a situation it may not be necessary for the mach buffer to be as clever as it is today.
This issue is meant to answer the question: should we do this refactor, and why or why not?
First, addressing some specific points:
extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that.
It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in
MachBuffer
precludes replacing abrif
with ajump
when we know the conditional is constant (and we hope to do that optimization in the future!).For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that
Unfortunately this is also not quite the case I think, for several reasons:
- The
MachBuffer
's rearranging of branches requires knowledge of block order, because it is taking advantage of fallthroughs. We don't know block order when we're doing midend optimizations, and the flexibility is important because...- ... we also don't know if some blocks will become empty or not, as a result of optimizations or regalloc. For example, we need to split critical edges so regalloc has a place to put spills/reloads in all cases; but many of those split edges end up being an empty block, and
MachBuffer
can then eat it (chomp the one jump in the body and redirect the label).So basically pass ordering and regalloc needs mean that block ordering is only known late, and this means that block-order-dependent changes cannot happen in the mid-end.
I think both of the above are possibly springing from a slight mis-interpretation of what purpose the
MachBuffer
's branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: it requires subtle and not-well-understood tweaks to common algorithms, it's hard to reason about, and it infects all of the compiler backend. Basically moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.
EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant thing that precludes a bunch of other optimizations as noted above.
So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.
Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label resolution algorithm. I don't think this is necessarily or fundamentally the case. The label resolution (island/veneer insertion) and branch optimizations interact in subtle ways, but they are separate, tied by the invariant: the buffer of most-recent branches continguous to the end of the buffer (this is the visibility of the peephole opts) must correspond to the latest fixups in the fixup list.
The veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous to the end of buffer anymore), so actually we can do whatever we want with more complex data structures and algorithms, without violating the invariant. The only requirement is that between islands, we do keep a linear list of most recent fixups (we can clear it whenever we emit a non-branch instruction).
So from my perspective, there is a proposal for a large refactor of the compiler backend, with high risk, motivated by two reasons that I don't think are really the case (branch constant folding gets better? --> no, different and orthogonal things; precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal though -- curious what others think!
cfallin commented on issue #6818:
(Thanks to @jameysharp, @elliottt and others for talking through this a bit offline and especially Jamey for noting the implications on block order above. @alexcrichton if you want to discuss further we can definitely do so in the next Cranelift meeting, if that's easier!)
alexcrichton commented on issue #6818:
Thanks for taking the time to write this up! I've added this to the agenda for tomorrow to go over if there's anything remaining, mostly because I'm still curious to dig into more details if you and others are willing!
extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that.
It's unclear to me why this is the case -- could you clarify?
My gut is that this level of optimization is best done earlier in the pipeline which is where my thinking stems from, but there's other concrete issues which I think would rely on doing branch-related optimizations earlier rather than at the very end:
- https://github.com/bytecodealliance/wasmtime/issues/6154 - this is where current lowering rules won't match across blocks and while the
jump
isn't present in the final compiled output it's thwarting other things.- https://github.com/bytecodealliance/wasmtime/issues/6094 - ignoring the spectre parts of that, GVN-ing bounds checks to eliminate dominated bounds checks can result in a bunch of dead code which would ideally be removed rather than simply having branches fused in the MachBuffer
Basically my thinking is that while we could make MachBuffer fancier related to branching it seems like much of the benefit should come from the mid-end optimizations and their knock-on effects.
I think both of the above are possibly springing from a slight mis-interpretation of what purpose the MachBuffer's branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step
This is a good point yeah and I think a good way to help me frame my thinking. On one hand we as a compiler have the problem of optimizing a CFG, for example A->B->C could be come A+B -> C under some conditions and things like that. Also things like
br_if true a, b
can becomejump a
which might enable further CFG simplifications. This, in my mind, is all about taking a messy CFG to a "clean" CFG which would handle the issues I described above (e.g. noret
-in-its-own-block and DCE would fall out of CFG rewriting). There is however still the problem of taking a perfect CFG and lowering it to optimal machine code.I feel like MachBuffer has historically conflated these two operations, performing both CFG optimization and optimal linear layout of a CFG at the same time. My thinking has been that the CFG-optimizing parts of MachBuffer would ideally be removed into an independent pass in Cranelift to get executed at some point, perhaps intertwined with egraph optimizations somehow (e.g. to make use of constant propagation and/or GVN).
I'll also clarify that I barely know what an EBB is much less advocate for it. And again if anything I'm thinking comes with a fundamental compile-time or runtime performance penalty then it's not worth it, I'm trying to see if we can have our cake and eat it too.
I'll definitely admit that this may all be stemming from me misunderstanding MachBuffer and its optimizations. If all it's doing is the optimal linear layout of a CFG without extra optimizations then there's no escaping that class of problem so there's nothing more that can be done.
All that being said your observation about the recent branches list being empty on islands I think is a great one and would be a way to solve the suboptimal branch generation from https://github.com/bytecodealliance/wasmtime/pull/6804. I might try to give that a stab in the future!
alexcrichton commented on issue #6818:
Ok I talked more with folks at the Cranelift meeting today and the conclusion I took away was that the MachBuffer optimizations are all required at this time to have an optimal linear layout of a CFG we input. In that sense what I was mentioning above about CFG optimizations and optimal CFG layout the MachBuffer only deals with the latter which all sounds good.
In that sense I believe this issue is resolved as the answer is "yes, the MachBuffer should continue to produce high-quality code".
As for the one remaining issue of the eager promotion of veneers I think @cfallin's idea will solve it well. I mentioned today I had a branch (which is here) which implemented the
BinaryHeap
idea although it only works by disabling optimization, so not landable as-is of course but wanted to share as it had the bits about removing precalculation of the island deadline.
jameysharp commented on issue #6818:
Given that there are a small ISA-specific constant number of possible deadline-distances (e.g. 2^20, 2^27, 2^32) it occurs to me a priority queue isn't strictly necessary. We could keep one
VecDeque
per distance instead. Then all the labels that are at their deadlines are together at the beginning of their respective lists; you can still remove the last-added labels efficiently if needed; and nothing needs to repeatedly visit labels in the middle. Does that help?
alexcrichton commented on issue #6818:
I think that would work as well yeah, but I think that'd still require the "flush" behavior during island insertion right? Otherwise I'm at least not too worried about overhead with a
BinaryHeap
and it feels like a simpler alternative to reach for, but I think I'm on a bit of a one-track mind right now...
cfallin commented on issue #6818:
It's possible we can make do with just
Vec
s actually. The key thing is that (i) as fixups are pushed prior to an island, the only operation we need is "chomp" (so push/pop on a vec suffices); and after the island, we can't chomp anymore, so it suffices to sort by deadline. As you're observing, for a given class we will have monotonically increasing deadlines, so we can put the "long-term vecs" per class in reverse-deadline order and pop off the back of all of them.That said, and as I mentioned in the earlier thread, I worry a lot about complexity here so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on
LabelUse
arms either (they are different per architecture), strictly in terms of generic programming in Rust. (I guess the trait on the enum could map the kinds into an integer index space, with each concrete implementation arepr(u8)
or whatever, but now we're way down the rabbit hole.) So maybe it's best to try the simpler thing and then see whether we need anything more...
cfallin edited a comment on issue #6818:
It's possible we can make do with just
Vec
s actually. The key thing is that (i) as fixups are pushed prior to an island, the only operation we need is "chomp" (so push/pop on a vec suffices); and after the island, we can't chomp anymore, so it suffices to sort by deadline. As you're observing, for a given class we will have monotonically increasing deadlines, so we can put the "long-term vecs" per class in reverse-deadline order and pop off the back of all of them.That said, and as I mentioned in the earlier thread, I worry a lot about complexity here so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on
LabelUse
arms either (they are different per architecture), strictly in terms of generic programming in Rust. (I guess the trait on the enum could map the kinds into an integer index space, with each concrete implementation arepr(u8)
or whatever, but now we're way down the rabbit hole.) So maybe it's best to try the simpler thing (EDIT for clarity: that would be @alexcrichton 'sBinaryHeap
suggestion) and then see whether we need anything more...
jameysharp commented on issue #6818:
I figured we'd just have what amounts to a map keyed on the
CodeOffset
(which is au32
) fromMachInstLabelUse::max_pos_range
. We don't actually care what kind of label it is for this purpose; we care what maximum offset it has. There are no more than four distinct max offsets on any current target, so even linear-scan in an array-of-pairs-based "map" is asymptotically O(1) and the constant factors might be better than a HashMap.But I don't quite understand the interaction between veneer emission and label chomping and I'll take your word for it, both of you, that the binary heap is fine.
cfallin closed issue #6818:
@alexcrichton brought up in #6810
And again to clarify I'm not advocating for here-and-now removing branch optimizations from the mach buffer. I realize it's a deep refactoring and would take quite some work and that additionally this is the result of historical work and balancing efforts. I still do believe though that this probably isn't the best place to do this because it's extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that. For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that, where in such a situation it may not be necessary for the mach buffer to be as clever as it is today.
This issue is meant to answer the question: should we do this refactor, and why or why not?
First, addressing some specific points:
extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that.
It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in
MachBuffer
precludes replacing abrif
with ajump
when we know the conditional is constant (and we hope to do that optimization in the future!).For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that
Unfortunately this is also not quite the case I think, for several reasons:
- The
MachBuffer
's rearranging of branches requires knowledge of block order, because it is taking advantage of fallthroughs. We don't know block order when we're doing midend optimizations, and the flexibility is important because...- ... we also don't know if some blocks will become empty or not, as a result of optimizations or regalloc. For example, we need to split critical edges so regalloc has a place to put spills/reloads in all cases; but many of those split edges end up being an empty block, and
MachBuffer
can then eat it (chomp the one jump in the body and redirect the label).So basically pass ordering and regalloc needs mean that block ordering is only known late, and this means that block-order-dependent changes cannot happen in the mid-end.
I think both of the above are possibly springing from a slight mis-interpretation of what purpose the
MachBuffer
's branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: it requires subtle and not-well-understood tweaks to common algorithms, it's hard to reason about, and it infects all of the compiler backend. Basically moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.
EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant thing that precludes a bunch of other optimizations as noted above.
So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.
Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label resolution algorithm. I don't think this is necessarily or fundamentally the case. The label resolution (island/veneer insertion) and branch optimizations interact in subtle ways, but they are separate, tied by the invariant: the buffer of most-recent branches continguous to the end of the buffer (this is the visibility of the peephole opts) must correspond to the latest fixups in the fixup list.
The veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous to the end of buffer anymore), so actually we can do whatever we want with more complex data structures and algorithms, without violating the invariant. The only requirement is that between islands, we do keep a linear list of most recent fixups (we can clear it whenever we emit a non-branch instruction).
So from my perspective, there is a proposal for a large refactor of the compiler backend, with high risk, motivated by two reasons that I don't think are really the case (branch constant folding gets better? --> no, different and orthogonal things; precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal though -- curious what others think!
Last updated: Dec 23 2024 at 13:07 UTC