Dominator trees · cranelift · Zulip Chat Archive

Hi! Looking into dominator tree computation, I've found out, that less than 0.5% of compilation time is spent there. Nevertheless, I've implemented semi-NCA algorithm and unsurprisingly, the changes didn't register in any of the sightglass benchmarks. Note, that I haven't compared old and new implementations in isolation.

Given that it doesn't seem to bring noticeable performance benefits right now, would it be worth it to introduce to Cranelift?

Unrelated, it seems like in the existing code all branch instructions are terminators now, which would allow for a nice little simplification of storing blocks instead of instructions in the existing DominatorTree implementation. This would allow to get rid of several awkward layout.inst_block(..).expect(..) uses. Is my understanding correct? Would such simplification be desirable?

Chris Fallin (Nov 06 2024 at 21:41):

In general I think we have a few possible reasons to accept new code into Cranelift. The first, most straightforward is that it improves performance by some metric (compile time, runtime, etc). Another is when it is simpler than the existing implementation, or more general, or easier to convince ourselves of correctness.

How does the implementation compare to the existing one? Is it shorter, simpler, etc.? Can you show us a diff?

IMHO if the code is not simpler, and has no other quantitative changes, then probably the best engineering verdict is to leave the existing implementation as-is -- the reason being that new code carries risk as well, and so there has to be something that outweighs that. (Again, many reasons will do!)

There are two other questions in your message that I want to separate out: (i) unifying implementations, and (ii) reasoning in terms of blocks rather than instructions. Both of these are good refactors that I think would be worthwhile to pursue regardless of which algorithm we end up using.

Amanieu (Nov 06 2024 at 23:08):

Dominator tree computation actually came up as an issue in SpiderMonkey (https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html) which used the same algorithm as Cranelift currently does. The old algorithm had trouble with a function that had 132856 basic blocks, but the perf issues were solved by switching to semi-NCA for computing dominator trees.

75x faster: optimizing the Ion compiler backend

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

Chris Fallin (Nov 06 2024 at 23:23):

Right, I'm aware of that as well; I'm going off of @amartosch 's claim that there aren't noticeable performance benefits, but @amartosch , were you able to test with very large function inputs?

Amanieu (Nov 06 2024 at 23:37):

David Lloyd (Nov 07 2024 at 15:23):

fitzgen (he/him) (Nov 07 2024 at 19:17):

amartosch (Nov 07 2024 at 21:21):

I've tried to run hyperfine on the file linked by @Amanieu, it didn't show time difference above noise.
The command line used was wasmtime compile -C compiler=cranelift -C parallel-compilation=n ort-wasm-simd-threaded.jsep.wasm.
However, perf consistently shows ~4.2% of samples in DominatorTree::computeand only 0.2% for semi-NCA version. This suggests that I might be doing something very wrong. In order not to waste everybody's time I'd like to clean up the implementation and post it as draft for others to discuss and/or measure themselves.

fitzgen (he/him) (Nov 07 2024 at 21:25):

Chris Fallin (Nov 07 2024 at 21:35):

4.2% -> 0.2% for domtree computation is a great speedup, that's more than enough to justify this -- let's see a draft PR and we can go from there. Thanks for working on this!

Chris Fallin (Nov 07 2024 at 21:36):

(we've done a lot of work shaving off the obvious low-hanging fruit to speed things up, so even 1-2% improvements are exciting now)

Alex Crichton (Nov 07 2024 at 21:44):

Locally I can confirm that DominatorTree::compute takes awhile. On the wasm here I get this profile where I removed regalloc from the profile (which was 80% of the time) and 20% of the remaining time (4% total) it's DominatorTree::compute

amartosch (Nov 07 2024 at 22:10):

amartosch (Nov 07 2024 at 22:11):

@Alex Crichton's results seem to coincide with mine. Thanks for directions, will post PR as soon as it's ready!

Stream: cranelift

Topic: Dominator trees

amartosch (Nov 06 2024 at 21:32):

Chris Fallin (Nov 06 2024 at 21:41):

Amanieu (Nov 06 2024 at 23:08):

Chris Fallin (Nov 06 2024 at 23:23):

Amanieu (Nov 06 2024 at 23:37):

David Lloyd (Nov 07 2024 at 15:23):

fitzgen (he/him) (Nov 07 2024 at 19:17):

fitzgen (he/him) (Nov 07 2024 at 19:17):

amartosch (Nov 07 2024 at 21:21):

fitzgen (he/him) (Nov 07 2024 at 21:25):

Chris Fallin (Nov 07 2024 at 21:35):

Chris Fallin (Nov 07 2024 at 21:36):

Alex Crichton (Nov 07 2024 at 21:44):

amartosch (Nov 07 2024 at 22:10):

amartosch (Nov 07 2024 at 22:11):