Hi! Looking into dominator tree computation, I've found out, that less than 0.5% of compilation time is spent there. Nevertheless, I've implemented semi-NCA algorithm and unsurprisingly, the changes didn't register in any of the sightglass benchmarks. Note, that I haven't compared old and new implementations in isolation.
Given that it doesn't seem to bring noticeable performance benefits right now, would it be worth it to introduce to Cranelift?
I feel like there is a couple of additional point to consider:
DominatorTreePreorder
. This would unify two dominator trees, which is mentioned in #7954. The optimized structure might then be computed in place and on demand.Unrelated, it seems like in the existing code all branch instructions are terminators now, which would allow for a nice little simplification of storing blocks instead of instructions in the existing DominatorTree implementation. This would allow to get rid of several awkward layout.inst_block(..).expect(..)
uses. Is my understanding correct? Would such simplification be desirable?
Hi @amartosch -- thanks for working on this!
Given that it doesn't seem to bring noticeable performance benefits right now, would it be worth it to introduce to Cranelift?
In general I think we have a few possible reasons to accept new code into Cranelift. The first, most straightforward is that it improves performance by some metric (compile time, runtime, etc). Another is when it is simpler than the existing implementation, or more general, or easier to convince ourselves of correctness.
How does the implementation compare to the existing one? Is it shorter, simpler, etc.? Can you show us a diff?
IMHO if the code is not simpler, and has no other quantitative changes, then probably the best engineering verdict is to leave the existing implementation as-is -- the reason being that new code carries risk as well, and so there has to be something that outweighs that. (Again, many reasons will do!)
There are two other questions in your message that I want to separate out: (i) unifying implementations, and (ii) reasoning in terms of blocks rather than instructions. Both of these are good refactors that I think would be worthwhile to pursue regardless of which algorithm we end up using.
Dominator tree computation actually came up as an issue in SpiderMonkey (https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend.html) which used the same algorithm as Cranelift currently does. The old algorithm had trouble with a function that had 132856 basic blocks, but the perf issues were solved by switching to semi-NCA for computing dominator trees.
Right, I'm aware of that as well; I'm going off of @amartosch 's claim that there aren't noticeable performance benefits, but @amartosch , were you able to test with very large function inputs?
Here's the wasm file with the massive function: https://bugzilla.mozilla.org/attachment.cgi?id=9426114
I think such a thing is at least interesting at any rate :)
definitely worth testing compilation on the massive wasm file from that bug
using hyperfine or something
I've tried to run hyperfine on the file linked by @Amanieu, it didn't show time difference above noise.
The command line used was wasmtime compile -C compiler=cranelift -C parallel-compilation=n ort-wasm-simd-threaded.jsep.wasm
.
However, perf consistently shows ~4.2% of samples in DominatorTree::compute
and only 0.2% for semi-NCA version. This suggests that I might be doing something very wrong. In order not to waste everybody's time I'd like to clean up the implementation and post it as draft for others to discuss and/or measure themselves.
just to be sure, can you try running again and also passing -C cache=n
?
4.2% -> 0.2% for domtree computation is a great speedup, that's more than enough to justify this -- let's see a draft PR and we can go from there. Thanks for working on this!
(we've done a lot of work shaving off the obvious low-hanging fruit to speed things up, so even 1-2% improvements are exciting now)
Locally I can confirm that DominatorTree::compute takes awhile. On the wasm here I get this profile where I removed regalloc from the profile (which was 80% of the time) and 20% of the remaining time (4% total) it's DominatorTree::compute
fitzgen (he/him) said:
just to be sure, can you try running again and also passing
-C cache=n
?
Tried, it didn't seem to change anything.
@Alex Crichton's results seem to coincide with mine. Thanks for directions, will post PR as soon as it's ready!
Last updated: Nov 22 2024 at 16:03 UTC