Stream: cranelift

Topic: Jump threading pass


view this post on Zulip Afonso Bordado (Jul 22 2024 at 10:27):

:wave: Hey!

Lately i've been working on a jump threading pass. The scope has blown up a bit and it has become more of a general control flow optimization pass now.

The pass does a few optimizations:

I also had plans to add another optimization that would inline jump terminated blocks into all its predecessors gated on some cost function. This is the optimization that I think has the most potential to bring performance benefits, and I have this mostly working but it fails a few fuzz cases that I'm not entirely sure how to fix.

The pass works iteratively, by visiting all blocks and applying the optimizations. Each optimization may also schedule visits to other blocks if it thinks there are further optimization opportunities.

I'm also very surprised that all of the optimizations above run at least once on pretty much all wasm files, I wouldn't have expected that, especially from the "const eval" optimization.

The performance results of this haven't been very positive so far, they're mostly neutral or slightly negative. So now I'm not sure if this is worth upstreaming, but either way, I think it's worth exploring.

Jump threading sightglass benchmark results. GitHub Gist: instantly share code, notes, and snippets.
Standalone JIT-style runtime for WebAssembly, using Cranelift - GitHub - afonso360/wasmtime at jump-threading

view this post on Zulip bjorn3 (Jul 22 2024 at 12:43):

The backend already does jump threading implicitly. It tries to layout all basic blocks such that fallthroughs are possible as much as possible. And then when emitting machine code it will omit unconditional jump instructions if the target is the next basic block.

view this post on Zulip Afonso Bordado (Jul 22 2024 at 12:47):

And then when emitting machine code it will omit unconditional jump instructions if the target is the next basic block.

Yes, these show up a lot in generated machine code, and are something that we may be able to avoid. (i.e. when it isn't the next basic block)

The backend already does jump threading implicitly. It tries to layout all basic blocks such that fallthroughs are possible as much as possible.

Right, that works for simple jump terminated blocks, my goal was to try to perform more advanced optimizations, such as inlining a block into multiple predecessors if we can justify it for a given cost function. Or the const eval thing. Those are probably harder to do later on, and can allow further optimization opportunities for egraphs.

Although that doesen't seem to be reflected in the performance numbers above.

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:40):

this is something we've talked about a lot over the years, exciting to see you looking into it!

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:40):

I wouldn't be that discouraged by the initial results: a 1% compilation overhead is actually pretty low, and is something we can work on improving, so as a starting point this is pretty promising

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:42):

regarding the jump fusion that the lowering does: its great that it does that stuff, but it is also fairly limited and cannot, by its design, feed into other mid-end optimizations, which is the kind of thing I'd hope this new pass would be able to do

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:44):

have you tried measuring each optimization in isolation? eg enabled const eval'ing br_if and br_table but not any of the other things, etc... and doing a sightglass run for each of them? one suspicion I have is that the "selectify" optimization is actually a pessimization for well-predicted branches, and the conditional moves that the selects are lowering to are effectively acting as barriers to speculation

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:46):

aside: we should try lowering plain selects (not select_spectre_guards, of course) into pseudo-instructions that become little branches, instead of conditional moves, at emit time

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:47):

n I have is that the "selectify" optimization is actually a pessimization for well-predicted branches, and the conditional moves that the selects are lowering to are effectively acting as barriers to speculation

Yes! I had the same concern wrt the selectify pass. I haven't exactly measured each one individually, but I did measure selectify and it didn't have any impact on the benchmarks. All of these slowdowns show up before that.

That being said, I'm not too confident in my setup for measuring such small changes. I'll run those individual benchmarks.

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:48):

My only goal with selectify was really to get more optimization opportunites for egraphs, since select is something we can optimize, otherwise i'm not 100% convinced it's worth it

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:50):

my other big piece of feedback would be to see if we can figure out a way to do this at the same time as the e-graphs pass, so that if we merge two blocks together and replace block params with uses of instruction results, that the e-graph rewrite rules can try (re?)running now that we have additional visibility on the inputs to some of the instructions

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:51):

That would be amazing for the const eval opt here, since it's quite limited. I have no idea how to go about it, I've never looked too much into how egraphs work internally

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:53):

at the same time as the e-graphs pass

specifically, I mean not having two different passes here, one for e-graph and one for the control-flow-y stuff. because we don't want to have a phase ordering problem where the br_if const-eval could benefit from having already cleaned up the straight-line code leading to the condition, but the e-graph rules would benefit from having already cleaned up control flow. Ideally we would be Doing Everything All At Once, which was the big benefit that e-graphs give us (supposedly)

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:53):

yeah, I know a bit about it, but not as much as Chris (who is out sick rn, unfortunately) and @Jamey Sharp

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:55):

Also, one of the things that might be slightly incompatible (not sure) is that we invalidate the DFG in this pass very early on (on the first change) and never rebuild it until the end of the pass.

This was one of the most impactful changes in terms of build time. BZ2 had a 10-15% regression before doing this.

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:56):

hm gotcha

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:56):

I'm not sure if this would be a dealbreaker for merging with egraphs

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:56):

are you available to talk about this stuff in more detail on Wednesday's cranelift meeting?

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:56):

Sorry, I unfortunately have some work related stuff on that time slot

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:57):

Although I think I might be able on the 14th of August

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:58):

Afonso Bordado said:

I'm not sure if this would be a dealbreaker for merging with egraphs

the e-graphs pass needs the DFG in a non-broken state

10-15% compile time regression is not acceptable, in my view. but if we could make it optional, with a 0% overhead to compile time when disabled, that is probably okay to land, with the understanding that it is going to be improved and worked on in the near future (and ultimately removed if we can't get it to some acceptable threshold or if it gets abandoned). this is something that would be good to talk to more folks about tho, and get more opinions on.

view this post on Zulip Afonso Bordado (Jul 22 2024 at 18:59):

When looking into this on LLVM's side, i noticed that they can incrementally rebuild the DFG, which would be very neat, because we know precisely which blocks have changed

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 18:59):

another option is to open an issue or even a draft PR with more details there and consolidate larger, more code-focused discussion

view this post on Zulip Afonso Bordado (Jul 22 2024 at 19:03):

Yeah, I'm happy to do that

view this post on Zulip Jamey Sharp (Jul 22 2024 at 19:04):

I would love to chat about this a bunch more! We haven't re-evaluated the Cranelift meeting's schedule in a while, maybe it's time to poll folks. I have a bunch of thoughts on this topic, but a few very brief comments:

view this post on Zulip fitzgen (he/him) (Jul 22 2024 at 19:05):

it's not obvious to me that merging blocks that are connected by a single-predecessor/single-successor edge buys us anything in the egraph pass, but if there are optimizations we can't do without it I'd love to hear about them

yeah I guess the remove-constant-phis pass already makes that stuff visible

view this post on Zulip Jamey Sharp (Jul 22 2024 at 19:06):

exactly. between that and the machbuffer optimizations (which should always fire for that specific case) I don't know what else we can get from doing this earlier

view this post on Zulip Afonso Bordado (Jul 22 2024 at 19:06):

My real goal was to get a single-successor/multi-predecessor opt working, That's the one I think would bring some real benefits, but unfortunatley I ran into a bunch of issues with that

view this post on Zulip Jamey Sharp (Jul 22 2024 at 19:07):

makes sense!

view this post on Zulip Chris Fallin (Jul 24 2024 at 16:39):

Late to this thread (sorry, was out sick the last few days) but my major +1 is to the "let's do this as part of the egraph pass" discussion above. I share the intuition that there is likely a lot of opportunity if we can get the whole thing into a fixpoint loop: branch folding and block merging and redundant blockparam removal leads to more visibility for mid-end rules and further cprop. I suspect that this is where we'll find the wins that make the cost (if any) worth it; and I have a hunch that the integration (doing this as part of existing passes) may reduce the compile-time overhead as well.

The challenge is dealing with cycles (removing redundant blockparams means we need to -- or at least should -- revisit nodes in our eager egraph build/rewrite pass; treating blockparams as opaque roots is what let us get away with a single pass before) and updating dominance, as Jamey notes.

On the latter (domtree updates) I'll throw out an observation that might help: for correctness the egraph pass I think technically only needs us to keep entries in the scopedhashmap for at most their actual scope; in other words it's fine if we process a block as-if it were further down the domtree, I think, it just loses GVN opportunity. If we can do a precise update then of course all the better!

view this post on Zulip Chris Fallin (Jul 24 2024 at 16:40):

(oh, and keyword for the above "all in the fixpoint": this is Sparse Conditional Constant Propagation, IIUC)

view this post on Zulip Chris Fallin (Jul 24 2024 at 16:40):

(and one more footnote, I know Jamey has a lot more thoughts about the "revisit blockparams" bit above; he's done way more thinking about this than I have :-) )

view this post on Zulip Jamey Sharp (Jul 24 2024 at 19:54):

It's very possible I'm confused, but I think SCCP is different than any of the optimizations Afonso has been working on so far. I understood SCCP to mean path-dependent constant propagation, where we know an SSA value has a particular constant value at a use site due to conditional branches which dominate that use. I think Afonso has done constant-folding on conditional branches, where we know that an SSA value is a particular constant at all uses and we see it used as the controlling condition in a conditional branch, so we know only one of the branches can ever be taken. Our current approach to Spectre-safety prohibits SCCP as far as I understand it, but should allow what I think Afonso has implemented.

And it's true, I do have a lot of thoughts on revisiting blocks. :laughing: I'm not sure which of those thoughts are immediately useful but I'm definitely interested in discussing the topic.


Last updated: Oct 23 2024 at 20:03 UTC