Consider the following function...
function u0:0(r64) system_v {
block0(v0: r64):
v1 = iconst.i64 0
v2 = iconst.i64 0
v3 = icmp eq v1, v2 ; v1 = 0, v2 = 0
brif v3, block1, block2
block1:
v4 = iconst.i64 0
jump block3(v4) ; v4 = 0
block2:
v5 = load.i64 notrap aligned v0
jump block3(v5)
block3(v6: i64):
return
}
This is currently optimized to ...
function u0:0(r64) system_v {
block0(v0: r64):
v7 = iconst.i8 1
brif v7, block1, block2 ; v7 = 1
block1:
v1 = iconst.i64 0
jump block3(v1) ; v1 = 0
block2:
v5 = load.i64 notrap aligned v0
jump block3(v5)
block3(v6: i64):
return
}
Completely missing the fact that the brif v7, block1, block2 ; v7 = 1 uses a constant and can be replaced with a jump block1. Ideally this should reduce to something like...
function u0:0(r64) system_v {
block0(v0: r64):
v7 = iconst.i8 1
jump block1
block1:
v1 = iconst.i64 0
jump block3(v1) ; v1 = 0
block3(v6: i64):
return
}
or even better ...
function u0:0(r64) system_v {
block0(v0: r64):
return
}
The reason this is mattering to me is if the result of the branch is constant and the value passed to the final block becomes constant, more of the middle layer optimizations start to take effect and great big spaghetti code (ie naively lifted code) condenses down really nice.
I thought I would be super slick and add a optimization to cprop.isle. Turns out brif is a "pure" instruction and doesn't pass through ISLE.
So tracing through the code led to me optimize_skeleton_inst in egraph.rs. I think this would be where you would put in a condition for checking on constant branches, but I am totally at a loss on where to even begin. I feel like once I got out of the ISLE land it was way too deep in the weeds to start hacking and slashing without causing too much grief. Am I correct in understanding that brif instructions will never pass in to ISLE land? Is this something that can be done simply or should I create a github issue?
The egraph based optimizations don't support any control flow.
(ISLE is a generic pattern matching language. It is used by the egraph based optimizations, but also by backends to lower from clif ir to machine specific instructions. In the later case brif definitively does pass through ISLE.)
@Michael Conrad you're correct that the egraph midend does not invoke the simplify
entry-point on non-pure insts, like branches. However there's been a lot of thought about how we could do this. Most recently, @Jamey Sharp, @Trevor Elliott , @fitzgen (he/him) and I were discussing how we could build branch simplification directly into the egraph pass and the idea is basically to check the condition of a brif
when we reach it (since it will have been eagerly optimized to a const by this point) and replace as necessary
So: yes, we'd like to do this, we just haven't gotten around to it yet :-)
Chris Fallin said:
Michael Conrad you're correct that the egraph midend does not invoke the
simplify
entry-point on non-pure insts, like branches. However there's been a lot of thought about how we could do this. Most recently, Jamey Sharp, Trevor Elliott , fitzgen (he/him) and I were discussing how we could build branch simplification directly into the egraph pass and the idea is basically to check the condition of abrif
when we reach it (since it will have been eagerly optimized to a const by this point) and replace as necessary
This was basically what I was thinking and I think I found the "right" place to do it, but got lost at that point trying to extract the information needed to do the check. I am glad you have it on your minds to do and will either keep pondering it or watching for it to appear. Thank you!
If you want to give it a go, I'd be happy to review a PR and help guide the way! I think it should basically come down to:
OptimizeCtx
for branches;all that does imply some context/familiarity so I'm happy to help point things out but it's also fine if you want to leave this for others later :-)
To add to Chris' comment, specifically, I want to change the optimize_skeleton_inst
function you found so that it calls a different ISLE term than the one used to simplify pure instructions.
There are a bunch of rewrites that would make sense to do on not just branches but really any instructions that have side effects. However that doesn't fit into the "egraph" idea, where we're declaring that there are several equivalent ways to compute the same "Value". Instead, I think we just want to do a more traditional optimization here and pick a single "best" instruction that has the same effect as the original. That's why these "skeleton" instructions need to be treated differently than pure ops.
I think we should do a minimal PR that invokes an Inst
->Inst
ISLE term on these instructions (which is tricky for a few reasons) but doesn't actually implement any optimization rules, and then people should be able to write a bunch of interesting rules in that framework.
Also I've been thinking about the pruning unreachable blocks detail: instead of ref-counting branches as we remove them, I think we just need to mark a block as reachable whenever we process a branch instruction that points to it. As we walk the blocks in dominance-tree pre-order, if we see one that we haven't already marked as reachable, then at that point I believe it's definitely unreachable.
I think we should do a minimal PR that invokes an Inst->Inst ISLE term on these instructions (which is tricky for a few reasons) but doesn't actually implement any optimization rules, and then people should be able to write a bunch of interesting rules in that framework.
I actually think there's value in considering not doing this, and having the branch folding be built-in behavior in the egraph pass. (I'm not 100% tied to this, but good reasons follow...) The original thought behind ISLE-for-midend was that it works well for value-substitutions, so pure nodes make perfect sense, but there are many considerations (moving side effects? creating new ones?) and plumbing issues (talking about block-targets in ISLE; multi-value, which will arise for calls; probably more that I'm forgetting now) when generalizing to all instructions.
I say all this because I too went down this path when trying to do some branch optimizations (removing uextend
on a (brif (uextend (icmp ...)))
pattern, specifically) and found that I kept bumping into all of these unfinished bits and incongruities. My "this is the wrong solution and I'm bulldozing through a wall" intuition went off.
At the very least, such a PR is far from "minimal", as compared to branch-folding that I think could be done in ~50-100 LoC and provide immediate value, in a simple and verifiable way!
I have the start of the branch folding implemented, and am looking into to removing dead blocks.
Clearly the easiest thing to do is bail out when a branch target is eliminated, recompute the graphs and start over. Gross. So I'm thinking I need to get access to the ControlFlowGraph inside egraph, fix up the ControlFlowGraph and the DominatorTree / DomTreeWithChildren before continuing on, correct?
Initial code is at https://github.com/mmitton/wasmtime/tree/branch_folding
ah, I would actually not try to remove dead blocks -- lowering will skip them automatically
the only relevant thing to do on the midend, IMHO, is to skip them during traversal, as an optimization
Without removing them, you miss out on optimizations in the midend.
What I have at the moment is code that folds BrIf and BrTable to Jump if there is a constant, but then it converts the block params to aliases if there is one and only one pred that matches the block (looking inside the pred inst for the block call that matches, and only if there is ONLY one block call that matches. ie, you cannot have two block calls in a BrTable that branch to that block)
Without elimination dead code, the target block may never know it only has one pred and then cannot alias the block params / enable more optimizations.
I have committed my changes to my fork and have a sample test app at https://github.com/mmitton/test-opt which shows the results of folding. Commenting out line 404 in codegen/src/context.rs shows the gain/loss of doing this.
I am still not thrilled that the result code has jump jump jump jump to the final block and then the return. I suspect that is something in the lowering that can realize a jump to the very next instruction isn't needed?
I think we're using terms slightly differently: by "removing" I mean actually removing from the body. Branch folding can make something unreachable, while the dead block sticks around in the layout
We can absolutely do the right thing once we know that the block has one less pred in that case
function u0:0() -> i32 system_v {
block0:
v0 = iconst.i32 0
brif v0, block1, block2 ; v0 = 0
block1:
v1 = iconst.i32 4
jump block3(v1) ; v1 = 4
block2:
v2 = iconst.i32 8
jump block3(v2) ; v2 = 8
block3(v3: i32):
return v3
}
Folding the brif in block0 makes block2 unreachable, but block3 doesn't get the memo without removing block2. You could, and I contemplated at one point, realize block2 was unreachable after folding the brif, but if block2 is involved in a loop, it would still have a pred after taking block0 out of its list of preds.
Now that I'm talking this through, there is probably a way to extract that from LoopAnalysis? I may have to explore that.
I think it could look something like: keep is-constant state per blockparam; this is a lattice (Top
, Const(value)
, Bottom
); on processing a branch, meet state into succs. On starting new block, check preds: if any do not dominate this block, do nothing (all blockparams are effectively unknown); else, use state as starting constant-state when optimizing this block.
Basically, we are implicitly OK because if we exclude a block by not visiting it (if it becomes unreachable), we just won't meet into the blockparams of its succ(s)
this requires keeping a "reachability" bit that is set whenever a branch targets another block, and is set for the entry block initially
There's a refinement to this that @Jamey Sharp and I talked about where if we see a pred uses our own blockparam value as an arg to that blockparam, we can ignore it (trivial self-loop); this subsumes the "remove useless phis" pass; but that's an extension to the above and not necessary at first
This might be the type of optimization where something like a RVSDG form might be more natural: dead code elimination would become another egraph rule, eg if true X else Y <=> X
. Then again, you would need to reduce the CFG first, and you would need to keep some special node types for irreducible graphs.
Yes, this would be somewhat easier with an RVSDG input than a CFG. Of course, getting to that point is a much bigger change. And some difficulties apply to both approaches.
In particular, Cranelift's "acyclic egraph" currently relies on visiting each block only once during the equality saturation phase, which misses some optimization opportunities for loops but has a bunch of advantages. In an RVSDG, we'd miss the same loop optimizations. I think it's possible to offer a compiler option for re-visiting already-processed blocks when you care more about generated code quality than compile time, but I don't have the details right yet. If we had that though, it would work equally well on CFGs and RVSDGs.
Regarding irreducible graphs, have you read Perfect Reconstructability of Control Flow from Demand Dependence Graphs by Helge Bahmann _et al_? I suspect it would work out better to use their "predicate continuation normal form" to transform arbitrary control flow to purely reducible control flow and back, rather than having special-case nodes.
Jamey Sharp said:
Regarding irreducible graphs, have you read Perfect Reconstructability of Control Flow from Demand Dependence Graphs by Helge Bahmann _et al_? I suspect it would work out better to use their "predicate continuation normal form" to transform arbitrary control flow to purely reducible control flow and back, rather than having special-case nodes.
Trying to read it now, but it's a bit dense for me. That paper needs more illustrations =(
I've read the paper now; maybe I missed something, but it really doesn't seem to address irreducible control flows at all. Or at least, it doesn't seem to treat them in a specific way, and they barely test any irreducible CFGs in their benchmarks. It just seems like a more (?) efficient way to transform CFGs into RVSDGs and back
The abstract summarizes this by saying, "Our destruction algorithm can generate arbitrarily complex control flow; we show this by proving that the original CFG an RVSDG was derived from can, apart from overspecific detail, be reconstructed perfectly." They mean that regardless of whether the input CFG was reducible or not. The really cool part is that even irreducible control flow is reconstructed perfectly.
Section 4, "TRANSFORMING CFGS TO RVSDGS", begins by describing a "direct transformation from any structured CFG to an RVSDG". Structured CFGs exclude irreducible CFGs and even some reducible CFGs. But the next page notes, "The remainder of this section describes an algorithm that converts an arbitrary CFG into a structured one." And figure 3 is an irreducible CFG, though it's only labeled as "unstructured" in the figure caption.
I totally agree that the paper could have used more illustrations. I spent a lot of time staring at figure 3 when writing my implementation of sections 4.1 and 4.2. Maybe reading my implementation will help clarify the paper, or maybe it'll just be more confusing… https://github.com/jameysharp/optir/blob/main/src/cfg/restructure.rs
Last updated: Nov 22 2024 at 17:03 UTC