I know there used to be a function in do_dce but I can't seem to find it anymore. Anyone happen to know where it was moved?
It was removed when we removed the old optimization pipeline (in favor of the new egraphs-based pipeline)
The new pipeline also does DCE, implicitly, as part of the "elaboration" pass from the egraph back to CLIF
Any references you would recommend for writing a DCE algorithm for an IR that uses basic block parameters? Specifically, handling the elimination of parameters when used in loops. With PHI's it's somewhat obvious because the phi joins itself with an incoming variable. But the BBparams I guess I need to write some algorithm to discover the same situation.
reading over the OLD cranelift dce pass... It doesn't seem to do anything with block parameters. If unused block parameters are not being removed, then predecessor blocks cannot know that successors don't need them. Which would mean that only code unused by the block it is in is going to be removed. which makes me question the post order traversal in the first place?
Chris Fallin said:
The new pipeline also does DCE, implicitly, as part of the "elaboration" pass from the egraph back to CLIF
also during lowering, right?
during lowering might make sense, since I just can't seem to find where useless bb params are being removed.
..... seems we have spammer.
this is how you know a technology has really made it!
(I don't have permissions to ban accounts, finding someone who does)
More specifically, how and where does cranelift deal with situations like this:
v0, created in BB1 is dead and has no actual uses other than being passed around in a loop by BB2.
fitzgen (he/him) said:
also during lowering, right?
yep, also during lowering, implicitly (due to our demand-driven approach); so we have two implicit DCEs, one in the mid-end and one in the back-end
Jim said:
It doesn't seem to do anything with block parameters. If unused block parameters are not being removed, then predecessor blocks cannot know that successors don't need them. Which would mean that only code unused by the block it is in is going to be removed. which makes me question the post order traversal in the first place?
we separately have a remove_constant_phis
pass that is meant to address this (link)
note that integrating this with at least the mid-end DCE (egraph approach overall, really) is something we'd like to do in the future, but for now they're separate
It would seem to me like this would HAVE to be integrated with some sort of DCE pass to produce good results?Additionally, because these are separate, confirming that the original do_dce pass wasn't doing this, what was the point of the dfs post_order traversal in there? Since parameters wouldn't be changed, there could be no interblock/crossblock dce.
The dfs post-order makes sense in the same way that traversing the instructions in a single block backwards make sense, and when the cfg is a DAG my dce algorithm works as it does change block parameters as it traverses the dfs post-order. But without changing block parameters, I can't see what good the traversal of the BBs in that order is(was, since it's removed now) doing?
seems to me like this would HAVE to be integrated
Well, no, the constant-phis removal pass turned out to be useful at the time even without being in a DCE (or full opt rewrite) fixpoint. I believe the main issue when Julian wrote it (circa 2020 or so) was excessive redundancy in blockparams coming from the frontend at the time; subsequently Jamey did some cleanups to that and that may have reduced some pressure.
what was the point of the dfs post_order traversal in there? Since parameters wouldn't be changed, there could be no interblock/crossblock dce
Note that SSA allows inter-block use-def links; the only constraint is that the use is in a block dominated by the def (including but not limited to the same block).
Given that, removing dead instructions in a block deeper in the domtree (earlier in postorder) can lead to instructions becoming dead in "parent" (later) blocks.
I believe you're imagining the IR as a sort of "max-SSA" construction where every live value passes through blockparams, so unchanging blockparams "lock down" liveness at block boundaries; that's (fortunately!) not the case.
Oh... I'm definitely imagining the IR as that, very good point. As it stands, I do pass all values from one block to another through block parameters. Why do you say "fortunately"?
And yeah this locking down of liveness does seem to be what I need to deal with.
And also sorry if I haven't been clear, I am not actually using cranelift, I am designing my own IR but am following several of the same design decisions as cranelift(and using Regalloc2 in the process because it's so handy!)
Ah, "fortunately" because it's a huge efficiency loss to require all live values to pass through blockparams, for both direct and indirect reasons (I learned this when building another tool using SSA -- part of weval -- that initially required this invariant for correctness):
Directly, led to something like ~5x as many SSA value numbers; gets worse when many small blocks exist due to simplifications in the frontend or stray blocks left by transforms. This leads to significant memory and processing-time bloat in all subsequent stages
Indirectly, leads to much lower opportunity to optimize -- dead-phis as you're observing here but also, e.g., pattern-matching for rewrites might not look through blockparams/phis (even if all sources are actually the same value and so the rewrite would apply)
also:
Yeah I can see how that might choke things up for regalloc2. Ok going to spend some time reworking things now. No real need for the maximal ssa.
what I really want is a pruned or optimal ssa I guess. Need to figure out how to produce that.
Reviving this since I came across something that might be related (let me know if I should make a new thread!)
I was curious to see how cranelift would optimize the following program
int foo(int x) {
int y;
if (x) {
y = 1;
} else {
y = 1;
}
return y;
}
I hand wrote the following wasm:
(func $foo (type 1) (param i32) (result i32)
(local i32)
block
block
local.get 0
i32.eqz
br_if 0
i32.const 1
local.set 1
br 1
end
i32.const 1
local.set 1
end
local.get 1
return)
which cranelift compiled down to
0000000000000020 <wasm[0]::function[1]>:
20: 55 push %rbp
21: 48 89 e5 mov %rsp,%rbp
24: 85 d2 test %edx,%edx
26: 0f 84 0a 00 00 00 je 36 <wasm[0]::function[1]+0x16>
2c: b8 01 00 00 00 mov $0x1,%eax
31: e9 05 00 00 00 jmp 3b <wasm[0]::function[1]+0x1b>
36: b8 01 00 00 00 mov $0x1,%eax
3b: 48 89 ec mov %rbp,%rsp
3e: 5d pop %rbp
3f: c3 ret
Evidently, the branch was not optimized away. This was slightly surprising since I expected that egraph GVN + elaboration should be able to eliminate the distinct assignments. Is this optimization missed because it requires composing elaboration with remove_constant_phis
and some kind of branch elimination?
Thanks!
Right, we don't remove blockparams unless they are actually taking the same SSA value on all in-edges ("constant phis" is a little bit of a misnomer -- it doesn't refer to integer constants or any sort of constant-value analysis). That's why the value is materialized on both sides of the diamond (though fortunately register hinting and liverange merging worked well in regalloc here, so materialized directly to the return register)
the egraph infrastructure actually doesn't mutate blockparams at all -- that adds significant additional complexity that we haven't worked through yet (it's fair to say it's a research question, I guess?); "constant phi removal" happens first, to clean up anything that SSA construction from the Wasm may have missed
@Chris Fallin I actually do this in my compiler. I don't use egraphs but a forward-dataflow optimization (const-prop, simplification, etc) but the same principle applies. Blockparams are basically multi-input pure operations, the only tricky part is that the input may come from a later instruction in the processing order. The way to handle this is to re-run the pass until a fixed point: every time an branch argument is modified and the branch target has already been processed in the current pass then you need to do run another pass.
To avoid redundant work I do track for each value whether it has changed since the last pass. If no instruction inputs have changed then there is no need to re-process it in this pass.
Indeed, it's definitely possible! I do it in waffle (a Wasm-to-Wasm compiler library) as well. The trick with Cranelift is integrating it into the aegraph -- it's not a traditional multipass setup, there are interesting data structure implications to modifying the "skeleton", that's really the challenge
Chris Fallin said:
The trick with Cranelift is integrating it into the aegraph
I'm not as familiar with cranelift's internals, but assuming you have one aegraph for the function, couldn't you add the blockparams to the aegraph as phi functions since they're pure functions (with an extra input for the program counter)? this would allow pretty trivial dce of blockparams, assuming the aegraph can support operations that have cycles in the data flow graph
this would allow you to make it look like the skeleton wasn't changing since the changing parts would instead be the phi functions
I'm not as familiar with cranelift's internals ... couldn't you add the blockparams to the aegraph
Well, not quite; the aegraph is acyclic exactly because we break it at blockparams (i.e. do not see through them), and the acyclicity is critical for a bunch of other properties.
There's a lot of subtlety here; I'd encourage you to study the code (there are a lot of comments) and watch the talk; a number of folks have thought pretty hard about various aspects and "why not do X ..." thoughts are best had after all of that background is absorbed, though I do appreciate it is well-intentioned!
Chris Fallin said:
the acyclicity is critical for a bunch of other properties.
ah, yeah, interning doesn't really work with cyclic graphs.
watch the talk
I did a while ago, but apparently was brain-dead tired enough today that I forgot about needing acyclicity
Last updated: Dec 23 2024 at 13:07 UTC