:wave: @Chris Fallin
I'm trying to add some of the div/rem opts I talked about on GitHub; I just had a few questions on things I'm a bit stuck on, if you don't mind:
would this be the overall correct way to implement the basic transform?
(rule (simplify (udiv ty x c @ (iconst ty (power_of_two _))))
      (ushr ty x c))
I am mostly just confused on how to handle the immediate RHS value. I wrote this small extractor here:
fn power_of_two(&mut self, imm: Imm64) -> Option<Imm64> {
    if let Some((false, pow)) = i64_is_power_of_two(imm.bits()) {
        Some(Imm64::new(pow as i64))
    } else {
        None
    }
}
As part of prelude_opt.isle, to extract the power-of-two imm. I then run cargo run -- test ./filetests/filetests/egraph/algebraic.clif, for which I added another function below the existing one:
function %div_rem_pow_2(i32) -> i32 {
block0(v0: i32):
    v1 = iconst.i32 2
    v2 = udiv v0, v1
    ; check: v2 = ushr v0, v1
    return v2
}
It seems to fail, however. I presume I'm doing something wrong or missing something. I'm also unsure if I should insert an upper bound for power_of_two, since the existing preopts seem to do that too.
In particular, I'm not sure how to match a const on the RHS with a specific extractor, and then also use the value in the rewrite
@Riccardo D'Ambrosio the general shape here looks correct. I'm not sure why the if-let on i64_is_power_of_two's result matches a false (what does the returned tuple mean? the function definition isn't here to check). Ah, and the rule as-written will use the original 2^x constant as the shift amount, rather than x
You'll probably want to write (iconst ty (power_of_two c)) and then use c as the shift-amount
it's the function from simple_preopt.rs; the false is to check that the power of two is not negative, which it should not be because this is udiv, but I'm not sure
Chris Fallin said:
You'll probably want to write (iconst ty (power_of_two c)) and then use c as the shift-amount
iirc I tried that, but it complained that Variable 'c' has type Imm64 but we need Id in context. Which makes sense to me, but I wasn't quite sure how to put a new value into the graph and get back an Id
ah! for that you probably want (iconst ty c) in the right-hand side
(iconst, along with the other opcodes, works as both extractor and constructor in the mid-end)
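putting those pieces together, the whole rule would look something like this (untested sketch; the exact extractor spellings may differ):

```
(rule (simplify (udiv ty x (iconst ty (power_of_two c))))
      (ushr ty x (iconst ty c)))
```

i.e. the extractor binds c to the log2 of the constant on the left-hand side, and the right-hand side re-constructs an iconst from it to use as the shift amount.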
I see
it compiles but it does not pass the test:
FAIL ./filetests/filetests/egraph/algebraic.clif: optimize
Caused by:
filecheck failed for function on line 15:
#0 check: v2 = ushr v0, v1
> function %div_rem_pow_2(i32) -> i32 fast {
Missed #0: \bv2 = ushr v0, v1\b
> block0(v0: i32):
> v1 = iconst.i32 2
> v2 = udiv v0, v1
> return v2
> }
ok. I think I just need to look at this myself -- this stuff is literally a few days old and I'm still working out the kinks; debugging indirectly is an order of magnitude harder. But thank you for kicking the tires early :-)
no problem :+1:
oh wait, I just noticed, that should not be an iconst.i32 2; I need to check for v1 = iconst.i32 1, tho I don't think that's the root issue
seems like udiv stuff overall doesn't match, even with the existing x / 1 => x transform, odd
@Chris Fallin I figured it out: imul maps to a Pure node because it cannot trap, while udiv/sdiv map to Instr because they could trap. So it is not matching the Instr node.
a possible solution I noticed is adding Node::Instr to the ContextIter handler, though I am unsure if that's what you planned to do to handle such node types
Ah! That would do it. I suspect we might want to actually make the extractor iterate over all nodes instead (or Pure and Inst at least) for this reason, and perhaps have a way of marking a non-pure node as "deleted" when we can prove that it's safe, e.g. if we statically know the trap will never occur (this case)
Yeah that would make sense
thanks for tracking this down; I've been working on integrating the alias analysis then I'll likely make this change
:+1:
kinda surprised you can integrate alias analysis without this change tho haha
then again I'm not quite sure what an e-graph alias analysis would look like, since alias analysis requires tracking information across multiple passes and things like that
oh, I had been writing a special set of ctors/etors to make it work; and the analysis itself (last-store analysis) is separate
interestingly Redundant Load Elimination can actually just fall out of the node dedup, if one defines the node identity (Hash/Eq) carefully: based on addr eclass and offset (and load type), but not specific load instruction. Store-to-load forwarding is a bit harder :-)
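the idea, sketched with hypothetical stand-in types (the real aegraph uses its own Id and Type types; this just shows the identity trick):

```rust
use std::collections::HashMap;

// Stand-in for the aegraph's eclass id type.
type EclassId = u32;

// The load node's identity deliberately excludes which load instruction
// it came from: just the address eclass, constant offset, load type, and
// the memory state computed by the separate last-store analysis. Two
// loads with equal keys hash-cons to one e-node, which is redundant-load
// elimination falling out of dedup.
#[derive(Hash, PartialEq, Eq, Clone, Copy, Debug)]
struct LoadKey {
    addr: EclassId,
    offset: i32,
    ty: &'static str,
    mem_state: u32,
}

// Return the existing eclass for this key, or claim `fresh` for it.
fn intern_load(dedup: &mut HashMap<LoadKey, EclassId>, key: LoadKey, fresh: EclassId) -> EclassId {
    *dedup.entry(key).or_insert(fresh)
}

fn main() {
    let mut dedup = HashMap::new();
    let key = LoadKey { addr: 7, offset: 16, ty: "i64", mem_state: 3 };
    let id1 = intern_load(&mut dedup, key, 100); // first load: fresh eclass
    let id2 = intern_load(&mut dedup, key, 200); // identical second load: reused
    assert_eq!(id1, id2);
    println!("deduped to eclass {}", id1);
}
```

the mem_state field is what keeps this sound: two loads only dedup when the last-store analysis says the same memory state reaches both.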
Chris Fallin said:
oh, I had been writing a special set of ctors/etors to make it work; and the analysis itself (last-store analysis) is separate
I see. What does the analysis pass run on, and when? My confusion is: wouldn't you need to run the pass at every iteration of the graph?
No, because stores are side-effecting so always exist and will not move; the analysis happens on the CLIF before egraph build and computes a property for each load
Oh interesting
I imagine that might get a bit more complex though once things like unrolling are added
If/when we want dead-store elimination that gets more complex, but DSE is quite tricky for Wasm at least because we have to get the state at any given trap point right
I see
unrolling falls into "control-flow opt" which is outside the scope of the egraph-based work for now -- needs more thought!
Yeah absolutely
there is certainly a fixpoint between at least inlining and egraph value opts (specifically: constant prop can lead to constant func ptr can lead to inlining can lead to more constant prop; important for dynamic-dispatch cases) so I do want to think more about control flow in the context of the egraph work eventually
Yeah
I imagine inlining in an e-graph may work fairly well if the whole program is a single e-graph, since it could select between inlining and not inlining much better, because it knows when inlining allowed for better optimization that outweighed the cost
yep! and that was suggested by mwillsey over in the rfc thread actually too
one could in theory have a bridge from the callsite to the function body also in the egraph, and let elaboration go transparently across that bridge
Right, I noticed some talk about whether everything should be a single e-graph or not, but I forget what the final decision was/if there was one
it gets a bit trickier once one wants to optimize though: fundamentally the different inlined bodies want to diverge sometimes. so some sort of cloning is needed
for now it's a big architectural change (everything in Cranelift assumes per-function compilation) so it's more of a "future thoughts" sort of thing
everything being one graph sounds awesome for memory usage since e-graphs are great at sharing
but more complex as well
Chris Fallin said:
for now it's a big architectural change (everything in Cranelift assumes per-function compilation) so it's more of a "future thoughts" sort of thing
Ah ok
btw, does cranelift have a concept of NSW/NUW? i.e. ops which assume overflow does not happen. I know that condition is required for a significant number of optimizations in LLVM
nope; that actually came up somewhere here a bit ago when I was looking at address-computation code
the major issue with that is it introduces semantically undefined behavior, which is a notorious correctness footgun; it's better, all other things equal, to fully define the output for every set of inputs
Yeah it's required for things like optimizing out bounds checks
Yeah true
I suspect that there may be cases where we can instead encode the knowledge in the types, or ... basically I want to find a way around adding it
but this, too, needs more thought
Riccardo D'Ambrosio said:
Yeah it's required for things like optimizing out bounds checks
well, to some degree, llvm has to also prove the condition does not overflow before optimizing it out
This presentation by the LLVM devs on how they optimize out math-based conditions was very interesting: 2021 LLVM Dev Mtg “A New Approach to Removing Redundant Conditions in LLVM” (YouTube)
right, so for example a range-analysis could drive a rule that widens adds or removes uextends or whatever when the wraparound behavior can't be hit
cool, I'll add that to my queue, thanks!
I was wondering at some point if it's possible to do it using e-graphs by tracking invariants using egg::Analysis-like things, but I never explored it a lot
range analysis should definitely be possible to build using such a framework
because it requires storing a variable number of invariants for each node/eclass, which is difficult to manage, allocation-optimization-wise
and it uses Fourier-Motzkin elimination to check the invariants, which is probably possible in a constructor, but it needs more thought
if it doesn't explode the number of invariants kept, it's probably possible to store just an id into a bump-allocator or something
hmm, I guess I've envisioned range analysis as something much simpler: if the analysis domain allows just a single integer range (open-ended or closed on both sides) then it's constant space per value, no?
my concern is mainly that it might greatly increase memory usage as optimization proceeds, because data is associated with every eclass, so you may end up having to allocate a lot of invariants, especially if you support a variable number of range invariants per class
this seems sufficient for something like eliminating overflow cases: if we can carry the fact that x \in [0, 4095) or whatever then we can use that, and we can propagate these through a number of binary operators too
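a minimal sketch of that propagation, with made-up types (illustrative, not actual Cranelift code):

```rust
// One closed interval per value: constant space per eclass.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Range {
    lo: u64,
    hi: u64, // inclusive
}

impl Range {
    // Propagate the fact through a 32-bit add: if the result can't
    // exceed u32::MAX, the wraparound case is provably unreachable,
    // and a rule could e.g. widen the add or drop a uextend.
    fn add32(self, other: Range) -> Option<Range> {
        let lo = self.lo.checked_add(other.lo)?;
        let hi = self.hi.checked_add(other.hi)?;
        if hi <= u32::MAX as u64 {
            Some(Range { lo, hi })
        } else {
            None // overflow possible: no fact, keep the checks
        }
    }
}

fn main() {
    // x in [0, 4095), y in [0, 4095): their sum can't wrap at 32 bits.
    let x = Range { lo: 0, hi: 4094 };
    let y = Range { lo: 0, hi: 4094 };
    assert_eq!(x.add32(y), Some(Range { lo: 0, hi: 8188 }));

    // A range reaching u32::MAX yields no overflow-freedom fact.
    let big = Range { lo: 0, hi: u32::MAX as u64 };
    assert!(big.add32(y).is_none());
    println!("ok");
}
```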
Yeah that's true, a simple single-range system might work fairly well as a starting experiment
anything that involves solving more complex systems (variable number of constraints etc) I think would probably not be practical in our compilation-time envelope, though I would be curious to see what use-cases/applications of that you had in mind
Yeah i think it's probably out of scope for cranelift
I'm writing a shader language, and naturally the core of optimizing shaders is optimizing the math to hell and back, which is why I'm exploring e-graphs: from simple experiments, they work unbelievably well for this task. I had this wild optimization my experiment did a while back:
2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0) ----> fma(x, neg(x), fma(y, recip(z), 5))
Cost: 26.5 -> 12
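(as a quick standalone numeric check of that rewrite; fma/neg/recip here are just stand-ins for the IR ops, mapped onto Rust's f64 methods:)

```rust
// fma(a, b, c) = a * b + c with a single rounding, via f64::mul_add.
fn fma(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    for &(x, y, z) in &[(1.5f64, 2.0f64, 4.0f64), (-3.0, 7.0, 2.0), (0.25, 1.0, 8.0)] {
        // original: 2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0)
        let original = 2.0 - (x * 1.0 * -1.0).powi(2) + (y / z + 3.0);
        // rewritten: fma(x, neg(x), fma(y, recip(z), 5))
        let rewritten = fma(x, -x, fma(y, z.recip(), 5.0));
        // equal up to the rounding differences fma/recip introduce
        assert!((original - rewritten).abs() < 1e-9, "{} vs {}", original, rewritten);
    }
    println!("forms agree");
}
```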
However, one intricacy about shader optimization is that some transforms only work on normalized vectors/floats, so i must track that invariant somehow
interesting, that's pretty cool
this is definitely a case where more expensive analysis is warranted
I think a single range associated with each eclass may work; I probably won't have indexing conditions that require disjoint ranges
Chris Fallin said:
this is definitely a case where more expensive analysis is warranted
To some degree I actually think using e-graphs vastly simplifies my job, since the e-graph can propagate invariants for me really easily. I just have to store the invariants efficiently and write rules that check them
though I have an RVSDG-based IR also, so I'm not solely working off the e-graph, which lets me do some analyses much more easily and then propagate those results into the e-graph
another analysis which might work very well with e-graphs is scalar evolution (that's what LLVM calls it), which builds a "chain of recurrences" to describe how a loop calculates its next result, i.e. how it evolves. An e-graph that knows about such info might be able to make really good inferences about moving out or transforming things inside the loop
yup, that's a really good point
it's probably also possible to piggyback off the CR for LICM: any recurrence not directly associated with an induction variable can be moved out
not sure if you are familiar with chains of recurrences, but this article breaks it down fairly well: https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geometric-sums.html
I'm unsure how expensive an analysis it actually is, however
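a toy version of the chain-of-recurrences representation from that article (my own sketch, not LLVM's actual SCEV API):

```rust
// {base, +, step}: starts at base and grows by step each iteration.
// Nesting steps gives polynomial evolutions.
enum Cr {
    Const(i64),
    AddRec { base: i64, step: Box<Cr> },
}

impl Cr {
    // Deliberately the naive interpretation, just to show what the
    // representation means; SCEV uses closed forms instead.
    fn eval(&self, i: i64) -> i64 {
        match self {
            Cr::Const(c) => *c,
            Cr::AddRec { base, step } => {
                *base + (0..i).map(|j| step.eval(j)).sum::<i64>()
            }
        }
    }
}

fn main() {
    // the induction variable i itself is {0, +, 1}
    let iv = Cr::AddRec { base: 0, step: Box::new(Cr::Const(1)) };
    assert_eq!(iv.eval(5), 5);

    // i*i evolves as {0, +, {1, +, 2}}: steps are the odd numbers 1,3,5,...
    let sq = Cr::AddRec {
        base: 0,
        step: Box::new(Cr::AddRec { base: 1, step: Box::new(Cr::Const(2)) }),
    };
    assert_eq!(sq.eval(4), 16);
    println!("ok");
}
```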
yep, I've looked at the general idea before. there's certainly a lot we could do here
one thing I want to be a little careful of avoiding is actually introducing loops into the egraph; right now blockparams (hence part of the path for any recurrence) are terminals (nodes without children)
Yeah, I imagine they play a somewhat significant part in performance for some programs
I see
but perhaps there's a way of giving rules access to preds/inputs without actually wiring up the dataflow (a separate extractor, etc)
could loops perhaps be special Loop nodes?
so, this, too, "needs more thought" (sorry that's becoming my frequent refrain :-) )
that gets hairy really quickly. I looked at more structured forms of graph IR (RVSDGs etc) but the issue is going into and out of it from the CFG world on either end
hehe, yeah, I feel like this is mostly uncharted waters
hmm, yeah
so for now I keep the CFG in the "side-effect skeleton" and avoid completely any questions of reasoning about CFG-related patterns, but perhaps there's something we could do
at least an overlay (a node representing "this is a loop and these are the recurrent inputs") that doesn't feed into the lowering-back-out but can be used to match
couldn't loop outputs be treated as opaque variables? like variables used in more than one place
inputs you mean? or specifically outputs?
uhhh, mostly outputs, but maybe it could apply to inputs too? idk haha
kind of like how RVSDG has input and output edges for loop nodes
in general, what are the issues with treating structured control flow as special nodes if you know they fit a particular control flow pattern like looping or switching/branching?
I guess the main issue I was trying to avoid was fitting all control flow into that form; we have to be able to accept arbitrary CFGs as input and I don't want a Relooper-like algorithm
Yeah absolutely
so then there's a question of how to mix the two; an overlay node for informational purposes only seems totally reasonable, as long as the control flow also exists in the current blockparam nodes and side-effecting skeleton
Yeah it needs some way of combining both unstructured control flow, and structured control flow known to fit some pattern
ok, I am going to disappear from this conversation for a bit to get some hacking done :-) but thanks for all the ideas, I'll keep chewing on this
bye :wave:
noticed the PR seems to now support inst nodes, I'll try to add the div/rem transforms now
most of the hard work seems to already be done from simple_preopt so I hope it won't be too bad
also, has there been any thought on being able to mark certain ISLE rules as bidirectional? it's helpful for some rules, for example x * 2 <---> x << 1, which could help in discovering other transforms based on the mul
the bidirectional thing is interesting but I would want to be a bit careful with this as it would be easy to produce a lot of blowup
Yes absolutely, from my experiments it seems that rules that allow the optimizer to create random terms are really bad, not sure what I expected tho...
in general I think I need to nail down some "rules for rules" more precisely -- something like layering for forward-progress guarantees
Yeah totally
I've found it not too hard to accidentally create a generative loop, and because we don't do full eq-sat we end up creating an infinite chain of union nodes instead
associativity/commutativity rules are an interesting case specifically
but in limited cases ("either of these two nodes creates the other automatically") I agree some sort of syntax sugar for one rule to become two might be useful
Yeah, and you don't do exponential backoff either, so it's a recipe for disaster haha
associativity/commutativity scare me a bit because it's easy to get combinatorial blowup
yeah, I never really decided on an approach to standardize how they would work so they don't end up dominating rewrite time
so for now what I'm doing is having two forms of each rule (in cprop for example) for commutativity, and some rules that try to reassociate in certain cases for cprop and licm
makes sense, although I think such rewrite rules are very helpful once you have simplifications that can take multiple terms, such as FMA (a * b + c -> fma(a, b, c)). Though this might not be a problem for cranelift
maybe doing some sort of marking to keep one associative rewrite from happening infinitely many times, idk
This is another problem with bidirectional rules: something like an FMA rewrite can result in infinite loops, and I'm not quite sure it's possible to avoid such a problem without having a way to make the rewriter realize that it already went through such a rewrite
right, the interning/dedup is important there. over the weekend I did some hacking on the aegraph implementation to do hash/eq on nodes based on a "canonical id" which is different from the "latest id" (pointing to the whole tree of union nodes) that should help in that regard...
one still has to be careful about truly generative rules of course
Yeah
it might also be possible to order the rules in a way where actually "meaningful" rewrites are tried first, but doing that automatically seems challenging
yeah, what I'm finding is that there's a good amount of tweaking/engineering that goes into this if one wants to make it performant on par with a handwritten pass
it's still better than actually writing a pass by hand! but I'm thinking we'll want to invest in tooling to show what's going on maybe, or at least write down some guidelines
yeah, especially with no full eq-sat it might be easy for a contributor to accidentally create hard-to-debug infinite loops
Another random thought I had: e-graphs might make it really easy to implement at least basic SLP vectorization, assuming the optimizer can figure out how to potentially reorder statements to allow this to happen
Hmm, seems like inst nodes still don't match. I think the issue now is that there is only a single constructor, and that is for pure enodes. Though I'm unsure how to solve this, since constructing an inst node requires a sourceloc as well, so the sourceloc value has to be either passed transparently through the rewrite somehow, or matched in the rule
Yeah, you're hitting a very "under construction" bit of code again
right now my focus has been on pure nodes as reasoning about merging and code motion with side-effects is quite a bit more complex
makes sense
I kind of wonder if it might be possible to do something similar to what RVSDG does for solving the side-effect problem with graphs
where you have edges for every side-effectful op and they connect together to form an ordering constraint for which ops must run before/after
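a toy encoding of that idea (illustrative only, not Cranelift's or RVSDG's actual node types):

```rust
// Effectful ops consume and produce an explicit state value, so
// "store B must run after store A" is just an ordinary data
// dependency on A's output state, like RVSDG state edges.
#[derive(Debug)]
enum Node {
    InitialState,
    Const(i64),
    // store(value, state_in); the node's own index doubles as state_out
    Store { value: usize, state_in: usize },
}

fn add(nodes: &mut Vec<Node>, n: Node) -> usize {
    nodes.push(n);
    nodes.len() - 1
}

// Recover the required order of stores by walking the state chain back.
fn stores_in_reverse_order(nodes: &[Node], mut state: usize) -> Vec<usize> {
    let mut out = Vec::new();
    while let Node::Store { state_in, .. } = &nodes[state] {
        out.push(state);
        state = *state_in;
    }
    out
}

fn main() {
    let mut nodes = Vec::new();
    let s0 = add(&mut nodes, Node::InitialState);
    let v1 = add(&mut nodes, Node::Const(1));
    let v2 = add(&mut nodes, Node::Const(2));
    let st1 = add(&mut nodes, Node::Store { value: v1, state_in: s0 });
    let st2 = add(&mut nodes, Node::Store { value: v2, state_in: st1 });
    // walking back from the final state yields st2 then st1
    assert_eq!(stores_in_reverse_order(&nodes, st2), vec![st2, st1]);
    println!("ordering recovered");
}
```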
I'm late to the conversation but Riccardo, your project sounds cool! I did experiments a few months ago on manipulating RVSDGs in e-graphs (https://github.com/jameysharp/optir), if you want to see a way you can do full control-flow inside your e-graph. Cranelift isn't going that direction because the trade-offs are different but I bet it'd expose more opportunities in a shader compiler. regarding the earlier conversation about inlining: I found it's especially neat in e-graphs because you can optimize call sites based on information you learn from inlining, but then still decide not to inline; that doesn't change the fact that Cranelift isn't designed for that, but I think it makes the potential future gains from this work bigger.
@Jamey Sharp oh hey, I read optir a lot to check out ideas! awesome. I was a little confused in some areas but I got the general gist, and it inspired me to try RVSDG + e-graphs even more. I'm interested in e-graphs mostly for superoptimizing math, since shaders are just blobs of math, and float math is criminally underoptimized, especially trigonometry. Inlining is something I'd like to explore a bit more fine-grained; my current (and potentially final) approach is to just recursively inline everything and optimize a single graph for each shader entry point. Reason being that shaders don't exactly have true function calls, and literally every single GPU compiler on the planet inlines everything it can get away with, even CUDA compilers, especially when it comes to math and optimizing math across functions. But maybe it would be cool to let e-graphs drive cross-function discovery to choose whether to inline or decide to treat it as "opaque" for optimization purposes
the timing of cranelift's e-graphs RFC could not have been more perfect :smile:
in particular I think RVSDG + e-graphs could help tremendously for cases where the ordering of actual operations doesn't matter, only how they are connected together
and once you think about it, this fits so many optimizations, even something pretty complex like superword-level parallelism
so I think it's the perfect combination once it can be explored more in depth: RVSDG excels at the "large bits", analyzing how things connect, things like that; and e-graphs excel at discovering the smaller bits and how they can be potentially rewritten to be better
I think there are also interesting opportunities with noticing that different expressions of control-flow are equivalent. I didn't previously have any intraprocedural examples where e-graphs help with that, but the link you shared last week about "How LLVM optimizes power sums" is a good one: for small loop counts, the closed-form loop-free version may take longer than the simpler loop. doing that transformation on an e-graph means you can defer the decision about whether it's profitable until you've reached saturation, at which point your cost function may have more information to work with. like with inlining, it also allows surrounding code to be improved based on information learned from the power-sum analysis, even if the later decision is to emit a loop after all.
and yeah, I agree that there are a ton of interesting directions to go, various kinds of parallelism definitely included! I think the general criterion for "would an e-graph help here" is: are you sure this transformation is an improvement in all cases? if not, an e-graph is great, and I think most optimizations have this kind of uncertainty. (although I think it'd be nice if egg provided some better support for pruning alternatives if you do know they're always worse; the way egg's examples do constant-folding seems like it's a foot-gun without some more restricted API.)
on the other hand I think you're right that for shaders you should just always inline everything.
Yeah, shaders in particular make it easier to do certain things; for example, it's rarely not profitable to unroll a loop unless it's a really big loop.
Last updated: Dec 23 2024 at 12:05 UTC