cfallin opened PR #5382 from egraph-in-dfg-rebase
to main
:
(Stacked on #5380.)
This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one. This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have multiple representations: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.The implementation performs two passes overall:
One, a forward pass in RPO (to see defs before uses), that (i) removes
"pure" instructions from the layout and (ii) optimizes as it goes. As
before, we eagerly optimize, so we form the entire union of optimized
forms of a value before we see any uses of that value. This lets us
rewrite uses to use the most "up-to-date" form of the value and
canonicalize and optimize that form.The eager rewriting and acyclic representation make each other work
(we could not eagerly rewrite if there were cycles; and acyclicity
does not miss optimization opportunities only because the first time
we introduce a value, we immediately produce its "best" form). This
design choice is also what allows us to avoid the "parent pointers"
and fixpoint loop of traditional egraphs.This forward optimization pass keeps a scoped hashmap to "intern"
nodes (thus performing GVN), and also interleaves on a per-instruction
level with alias analysis. The interleaving with alias analysis allows
alias analysis to see the most optimized form of each address (so it
can see equivalences), and allows the next value to see any
equivalences (reuses of loads or stored values) that alias analysis
uncovers.Two, a forward pass in domtree preorder, that "elaborates" pure enodes
back into the layout, possibly in multiple places if needed. This
tracks the loop nest and hoists nodes as needed, performing LICM as it
goes. Note that by doing this in forward order, we avoid the
"fixpoint" that traditional LICM needs: we hoist a def before its
uses, so when we place a node, we place it in the right place the
first time rather than moving later.This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.On
spidermonkey.wasm
running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup). More benchmarks to come.Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:Co-authored-by: Jamey Sharp <jsharp@fastly.com>
cfallin requested fitzgen for a review on PR #5382.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
We already have an fxhash impl in the fx module.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
What exactly is hashbrown necessary for? std::collection::HashMap is a wrapper around hashbrown.
cfallin created PR review comment:
It's necessary for
crate::ctxhash::CtxHashMap
, which is a hashmap that uses external context for hashing and equality.HashMap
isn't sufficient for this.Note that this was previously in the
cranelift-egraph
crate, which was unconditionally required bycranelift-codegen
, and now has been moved intocranelift-codegen
; so there's no change in the dependency footprint.
cfallin submitted PR review.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin submitted PR review.
cfallin created PR review comment:
Ah, yes, I had forgotten that; thanks, updated.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin edited PR #5382 from egraph-in-dfg-rebase
to main
:
This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one. This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have multiple representations: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.The implementation performs two passes overall:
One, a forward pass in RPO (to see defs before uses), that (i) removes
"pure" instructions from the layout and (ii) optimizes as it goes. As
before, we eagerly optimize, so we form the entire union of optimized
forms of a value before we see any uses of that value. This lets us
rewrite uses to use the most "up-to-date" form of the value and
canonicalize and optimize that form.The eager rewriting and acyclic representation make each other work
(we could not eagerly rewrite if there were cycles; and acyclicity
does not miss optimization opportunities only because the first time
we introduce a value, we immediately produce its "best" form). This
design choice is also what allows us to avoid the "parent pointers"
and fixpoint loop of traditional egraphs.This forward optimization pass keeps a scoped hashmap to "intern"
nodes (thus performing GVN), and also interleaves on a per-instruction
level with alias analysis. The interleaving with alias analysis allows
alias analysis to see the most optimized form of each address (so it
can see equivalences), and allows the next value to see any
equivalences (reuses of loads or stored values) that alias analysis
uncovers.Two, a forward pass in domtree preorder, that "elaborates" pure enodes
back into the layout, possibly in multiple places if needed. This
tracks the loop nest and hoists nodes as needed, performing LICM as it
goes. Note that by doing this in forward order, we avoid the
"fixpoint" that traditional LICM needs: we hoist a def before its
uses, so when we place a node, we place it in the right place the
first time rather than moving later.This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.On
spidermonkey.wasm
running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup). More benchmarks to come.Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:Co-authored-by: Jamey Sharp <jsharp@fastly.com>
fitzgen submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
Nitpick: in general, I think a
where
clause would make these methods easier to read.
fitzgen created PR review comment:
Could we wrap this up into
process_skeleton_inst
or something like that, just to make the overall algorithm a little easier to read at a glance?
fitzgen created PR review comment:
InstructionData
isCopy
so this can beNewOrExistingInst::New(data, ty) => (*ty, *data),
fitzgen created PR review comment:
This is shorter and allows for reserving capacity for all children at once:
block_stack.extend(self.domtree_children.children(block));
fitzgen created PR review comment:
Can we package this up in a
self.egraph_pass()
method, similar to other passes, rather than have all the bits of state exposed in our pass-ordering / mid end? I think that would be a bit nicer.
fitzgen created PR review comment:
- What about something like
iadd_carry
, which is a pure instruction with two results, how does that fit in here? We used to return tuples and then project fields out of them, but now that we are using the DFG directly, how is this handled?- This could be simplified as
suggestion let orig_value = self.func.dfg.first_result(inst);
fitzgen created PR review comment:
Should this be behind
if cfg!(feature = "trace-log")
or something? Or are we pretty confident that LLVM will clean this up?
fitzgen created PR review comment:
I would expect this to more closely mirror
std::hash::Hash::hash
, likefn ctx_hash<H>(&self, state: &mut H, value: &Value) where H: Hasher;
Is there a reason it doesn't?
fitzgen submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
A comment at the top of
Cost
explaining what sentinel values there are would be useful here, especially since I had a comment about how the loop level max should be 3 not 2 until I realized that it was leaving room for sentinels.
fitzgen created PR review comment:
Do we really need
u32
s here? With theCost
we limit the loop level to 2 at most...
fitzgen created PR review comment:
Additionally, can we debug assert that there are no
ValueDef::Union
s in the layout after running this pass?
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
No particular reason, just expediency at the time (it was this way in the original egraph prototype then carried over). Updated as suggested!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Good point, added the conditional to be sure (it never hurts to make LLVM's life easier in this way either...).
cfallin submitted PR review.
cfallin created PR review comment:
Ah, the
DomTreeChildIter
doesn't give a size hint (it traverses a linked list internally) so we won't get that one-reservation benefit, but this form is still clearer, so updated, thanks!
cfallin submitted PR review.
cfallin created PR review comment:
This was a bit ugly because of
AliasAnalysis
carrying a lifetime parameter, and ended up propagating lifetimes outward to the point that the ISLE context has three lifetimes with two constraints between them. But now that it's done, the factoring is kind of nice, so I think I'll leave it...
cfallin submitted PR review.
cfallin created PR review comment:
Ah, great point with
first_result
, I hadn't remembered that method existed. Updated!Regarding multi-output instructions in general: yes, this was a bit of a compromise for now. Jamey and I went back and forth a bit on an
Inst
-centric vs.Value
-centric design in the elaborator and ultimately basing everything onValue
s was far, far simpler; so to keep everything straight, we need to deal only with single-result instructions.We judged this to be acceptable-ish because the most common multi-result instructions are calls. There are indeed also adds-with-carries but, for now, not optimizing these probably will not have a material impact on speedups.
Ultimately I'm curious whether we might be able to adapt a tuple-types-and-projection approach to CLIF as well, though that's a much larger change.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin submitted PR review.
cfallin created PR review comment:
Added! Hopefully the explanation helps.
cfallin submitted PR review.
cfallin created PR review comment:
Ah, reverted, thanks for catching this!
For the curious, this change came from a dead-end (since reverted) where we invented a notion of a "pseudo-loop nest" based on the domtree (basic idea is that we wanted the loop nest to fully be a subtree of the domtree; consider e.g. a loop exit dominating the rest of the function, that's where the loop tree and domtree differ) and loop nests could get much deeper. But we've reverted all that and
LoopLevel
should correspond only to true loop-nest level again, now.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin updated PR #5382 from egraph-in-dfg-rebase
to main
.
cfallin merged PR #5382.
Last updated: Nov 22 2024 at 16:03 UTC