Stream: cranelift

Topic: Dominator trees


view this post on Zulip amartosch (Nov 06 2024 at 21:32):

Hi! Looking into dominator tree computation, I've found out, that less than 0.5% of compilation time is spent there. Nevertheless, I've implemented semi-NCA algorithm and unsurprisingly, the changes didn't register in any of the sightglass benchmarks. Note, that I haven't compared old and new implementations in isolation.

Given that it doesn't seem to bring noticeable performance benefits right now, would it be worth it to introduce to Cranelift?

I feel like there is a couple of additional point to consider:

Unrelated, it seems like in the existing code all branch instructions are terminators now, which would allow for a nice little simplification of storing blocks instead of instructions in the existing DominatorTree implementation. This would allow to get rid of several awkward layout.inst_block(..).expect(..) uses. Is my understanding correct? Would such simplification be desirable?

view this post on Zulip Chris Fallin (Nov 06 2024 at 21:41):

Hi @amartosch -- thanks for working on this!

Given that it doesn't seem to bring noticeable performance benefits right now, would it be worth it to introduce to Cranelift?

In general I think we have a few possible reasons to accept new code into Cranelift. The first, most straightforward is that it improves performance by some metric (compile time, runtime, etc). Another is when it is simpler than the existing implementation, or more general, or easier to convince ourselves of correctness.

How does the implementation compare to the existing one? Is it shorter, simpler, etc.? Can you show us a diff?

IMHO if the code is not simpler, and has no other quantitative changes, then probably the best engineering verdict is to leave the existing implementation as-is -- the reason being that new code carries risk as well, and so there has to be something that outweighs that. (Again, many reasons will do!)

There are two other questions in your message that I want to separate out: (i) unifying implementations, and (ii) reasoning in terms of blocks rather than instructions. Both of these are good refactors that I think would be worthwhile to pursue regardless of which algorithm we end up using.

view this post on Zulip Amanieu (Nov 06 2024 at 23:08):

Dominator tree computation actually came up as an issue in SpiderMonkey (https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html) which used the same algorithm as Cranelift currently does. The old algorithm had trouble with a function that had 132856 basic blocks, but the perf issues were solved by switching to semi-NCA for computing dominator trees.

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

view this post on Zulip Chris Fallin (Nov 06 2024 at 23:23):

Right, I'm aware of that as well; I'm going off of @amartosch 's claim that there aren't noticeable performance benefits, but @amartosch , were you able to test with very large function inputs?

view this post on Zulip Amanieu (Nov 06 2024 at 23:37):

Here's the wasm file with the massive function: https://bugzilla.mozilla.org/attachment.cgi?id=9426114

view this post on Zulip David Lloyd (Nov 07 2024 at 15:23):

I think such a thing is at least interesting at any rate :)

view this post on Zulip fitzgen (he/him) (Nov 07 2024 at 19:17):

definitely worth testing compilation on the massive wasm file from that bug

view this post on Zulip fitzgen (he/him) (Nov 07 2024 at 19:17):

using hyperfine or something

view this post on Zulip amartosch (Nov 07 2024 at 21:21):

I've tried to run hyperfine on the file linked by @Amanieu, it didn't show time difference above noise.
The command line used was wasmtime compile -C compiler=cranelift -C parallel-compilation=n ort-wasm-simd-threaded.jsep.wasm.
However, perf consistently shows ~4.2% of samples in DominatorTree::computeand only 0.2% for semi-NCA version. This suggests that I might be doing something very wrong. In order not to waste everybody's time I'd like to clean up the implementation and post it as draft for others to discuss and/or measure themselves.

view this post on Zulip fitzgen (he/him) (Nov 07 2024 at 21:25):

just to be sure, can you try running again and also passing -C cache=n?

view this post on Zulip Chris Fallin (Nov 07 2024 at 21:35):

4.2% -> 0.2% for domtree computation is a great speedup, that's more than enough to justify this -- let's see a draft PR and we can go from there. Thanks for working on this!

view this post on Zulip Chris Fallin (Nov 07 2024 at 21:36):

(we've done a lot of work shaving off the obvious low-hanging fruit to speed things up, so even 1-2% improvements are exciting now)

view this post on Zulip Alex Crichton (Nov 07 2024 at 21:44):

Locally I can confirm that DominatorTree::compute takes awhile. On the wasm here I get this profile where I removed regalloc from the profile (which was 80% of the time) and 20% of the remaining time (4% total) it's DominatorTree::compute

view this post on Zulip amartosch (Nov 07 2024 at 22:10):

fitzgen (he/him) said:

just to be sure, can you try running again and also passing -C cache=n?

Tried, it didn't seem to change anything.

view this post on Zulip amartosch (Nov 07 2024 at 22:11):

@Alex Crichton's results seem to coincide with mine. Thanks for directions, will post PR as soon as it's ready!


Last updated: Dec 23 2024 at 12:05 UTC