:wave: Hey!
Lately i've been working on a jump threading pass. The scope has blown up a bit and it has become more of a general control flow optimization pass now.
The pass does a few optimizations:
I also had plans to add another optimization that would inline jump terminated blocks into all its predecessors gated on some cost function. This is the optimization that I think has the most potential to bring performance benefits, and I have this mostly working but it fails a few fuzz cases that I'm not entirely sure how to fix.
The pass works iteratively, by visiting all blocks and applying the optimizations. Each optimization may also schedule visits to other blocks if it thinks there are further optimization opportunities.
I'm also very surprised that all of the optimizations above run at least once on pretty much all wasm files, I wouldn't have expected that, especially from the "const eval" optimization.
The performance results of this haven't been very positive so far, they're mostly neutral or slightly negative. So now I'm not sure if this is worth upstreaming, but either way, I think it's worth exploring.
The backend already does jump threading implicitly. It tries to layout all basic blocks such that fallthroughs are possible as much as possible. And then when emitting machine code it will omit unconditional jump instructions if the target is the next basic block.
And then when emitting machine code it will omit unconditional jump instructions if the target is the next basic block.
Yes, these show up a lot in generated machine code, and are something that we may be able to avoid. (i.e. when it isn't the next basic block)
The backend already does jump threading implicitly. It tries to layout all basic blocks such that fallthroughs are possible as much as possible.
Right, that works for simple jump terminated blocks, my goal was to try to perform more advanced optimizations, such as inlining a block into multiple predecessors if we can justify it for a given cost function. Or the const eval thing. Those are probably harder to do later on, and can allow further optimization opportunities for egraphs.
Although that doesen't seem to be reflected in the performance numbers above.
this is something we've talked about a lot over the years, exciting to see you looking into it!
I wouldn't be that discouraged by the initial results: a 1% compilation overhead is actually pretty low, and is something we can work on improving, so as a starting point this is pretty promising
regarding the jump fusion that the lowering does: its great that it does that stuff, but it is also fairly limited and cannot, by its design, feed into other mid-end optimizations, which is the kind of thing I'd hope this new pass would be able to do
have you tried measuring each optimization in isolation? eg enabled const eval'ing br_if
and br_table
but not any of the other things, etc... and doing a sightglass run for each of them? one suspicion I have is that the "selectify" optimization is actually a pessimization for well-predicted branches, and the conditional moves that the select
s are lowering to are effectively acting as barriers to speculation
aside: we should try lowering plain select
s (not select_spectre_guard
s, of course) into pseudo-instructions that become little branches, instead of conditional moves, at emit time
n I have is that the "selectify" optimization is actually a pessimization for well-predicted branches, and the conditional moves that the
select
s are lowering to are effectively acting as barriers to speculation
Yes! I had the same concern wrt the selectify pass. I haven't exactly measured each one individually, but I did measure selectify and it didn't have any impact on the benchmarks. All of these slowdowns show up before that.
That being said, I'm not too confident in my setup for measuring such small changes. I'll run those individual benchmarks.
My only goal with selectify was really to get more optimization opportunites for egraphs, since select
is something we can optimize, otherwise i'm not 100% convinced it's worth it
my other big piece of feedback would be to see if we can figure out a way to do this at the same time as the e-graphs pass, so that if we merge two blocks together and replace block params with uses of instruction results, that the e-graph rewrite rules can try (re?)running now that we have additional visibility on the inputs to some of the instructions
That would be amazing for the const eval opt here, since it's quite limited. I have no idea how to go about it, I've never looked too much into how egraphs work internally
at the same time as the e-graphs pass
specifically, I mean not having two different passes here, one for e-graph and one for the control-flow-y stuff. because we don't want to have a phase ordering problem where the br_if
const-eval could benefit from having already cleaned up the straight-line code leading to the condition, but the e-graph rules would benefit from having already cleaned up control flow. Ideally we would be Doing Everything All At Once, which was the big benefit that e-graphs give us (supposedly)
yeah, I know a bit about it, but not as much as Chris (who is out sick rn, unfortunately) and @Jamey Sharp
Also, one of the things that might be slightly incompatible (not sure) is that we invalidate the DFG in this pass very early on (on the first change) and never rebuild it until the end of the pass.
This was one of the most impactful changes in terms of build time. BZ2 had a 10-15% regression before doing this.
hm gotcha
I'm not sure if this would be a dealbreaker for merging with egraphs
are you available to talk about this stuff in more detail on Wednesday's cranelift meeting?
Sorry, I unfortunately have some work related stuff on that time slot
Although I think I might be able on the 14th of August
Afonso Bordado said:
I'm not sure if this would be a dealbreaker for merging with egraphs
the e-graphs pass needs the DFG in a non-broken state
10-15% compile time regression is not acceptable, in my view. but if we could make it optional, with a 0% overhead to compile time when disabled, that is probably okay to land, with the understanding that it is going to be improved and worked on in the near future (and ultimately removed if we can't get it to some acceptable threshold or if it gets abandoned). this is something that would be good to talk to more folks about tho, and get more opinions on.
When looking into this on LLVM's side, i noticed that they can incrementally rebuild the DFG, which would be very neat, because we know precisely which blocks have changed
another option is to open an issue or even a draft PR with more details there and consolidate larger, more code-focused discussion
Yeah, I'm happy to do that
I would love to chat about this a bunch more! We haven't re-evaluated the Cranelift meeting's schedule in a while, maybe it's time to poll folks. I have a bunch of thoughts on this topic, but a few very brief comments:
it's not obvious to me that merging blocks that are connected by a single-predecessor/single-successor edge buys us anything in the egraph pass, but if there are optimizations we can't do without it I'd love to hear about them
yeah I guess the remove-constant-phis pass already makes that stuff visible
exactly. between that and the machbuffer optimizations (which should always fire for that specific case) I don't know what else we can get from doing this earlier
My real goal was to get a single-successor/multi-predecessor opt working, That's the one I think would bring some real benefits, but unfortunatley I ran into a bunch of issues with that
makes sense!
Late to this thread (sorry, was out sick the last few days) but my major +1 is to the "let's do this as part of the egraph pass" discussion above. I share the intuition that there is likely a lot of opportunity if we can get the whole thing into a fixpoint loop: branch folding and block merging and redundant blockparam removal leads to more visibility for mid-end rules and further cprop. I suspect that this is where we'll find the wins that make the cost (if any) worth it; and I have a hunch that the integration (doing this as part of existing passes) may reduce the compile-time overhead as well.
The challenge is dealing with cycles (removing redundant blockparams means we need to -- or at least should -- revisit nodes in our eager egraph build/rewrite pass; treating blockparams as opaque roots is what let us get away with a single pass before) and updating dominance, as Jamey notes.
On the latter (domtree updates) I'll throw out an observation that might help: for correctness the egraph pass I think technically only needs us to keep entries in the scopedhashmap for at most their actual scope; in other words it's fine if we process a block as-if it were further down the domtree, I think, it just loses GVN opportunity. If we can do a precise update then of course all the better!
(oh, and keyword for the above "all in the fixpoint": this is Sparse Conditional Constant Propagation, IIUC)
(and one more footnote, I know Jamey has a lot more thoughts about the "revisit blockparams" bit above; he's done way more thinking about this than I have :-) )
It's very possible I'm confused, but I think SCCP is different than any of the optimizations Afonso has been working on so far. I understood SCCP to mean path-dependent constant propagation, where we know an SSA value has a particular constant value at a use site due to conditional branches which dominate that use. I think Afonso has done constant-folding on conditional branches, where we know that an SSA value is a particular constant at all uses and we see it used as the controlling condition in a conditional branch, so we know only one of the branches can ever be taken. Our current approach to Spectre-safety prohibits SCCP as far as I understand it, but should allow what I think Afonso has implemented.
And it's true, I do have a lot of thoughts on revisiting blocks. :laughing: I'm not sure which of those thoughts are immediately useful but I'm definitely interested in discussing the topic.
Last updated: Nov 22 2024 at 16:03 UTC