Stream: cranelift

Topic: cranelift dce


view this post on Zulip Leaves (Jul 01 2024 at 22:29):

I know there used to be a function in do_dce but I can't seem to find it anymore. Anyone happen to know where it was moved?

view this post on Zulip Chris Fallin (Jul 01 2024 at 23:58):

It was removed when we removed the old optimization pipeline (in favor of the new egraphs-based pipeline)

view this post on Zulip Chris Fallin (Jul 01 2024 at 23:58):

The new pipeline also does DCE, implicitly, as part of the "elaboration" pass from the egraph back to CLIF

view this post on Zulip Leaves (Jul 02 2024 at 02:23):

Any references you would recommend for writing a DCE algorithm for an IR that uses basic block parameters? Specifically, handling the elimination of parameters when used in loops. With PHI's it's somewhat obvious because the phi joins itself with an incoming variable. But the BBparams I guess I need to write some algorithm to discover the same situation.

view this post on Zulip Leaves (Jul 02 2024 at 15:03):

reading over the OLD cranelift dce pass... It doesn't seem to do anything with block parameters. If unused block parameters are not being removed, then predecessor blocks cannot know that successors don't need them. Which would mean that only code unused by the block it is in is going to be removed. which makes me question the post order traversal in the first place?

view this post on Zulip fitzgen (he/him) (Jul 02 2024 at 15:47):

Chris Fallin said:

The new pipeline also does DCE, implicitly, as part of the "elaboration" pass from the egraph back to CLIF

also during lowering, right?

view this post on Zulip Leaves (Jul 02 2024 at 16:28):

during lowering might make sense, since I just can't seem to find where useless bb params are being removed.

view this post on Zulip Leaves (Jul 02 2024 at 16:31):

..... seems we have spammer.

view this post on Zulip David Lloyd (Jul 02 2024 at 16:44):

this is how you know a technology has really made it!

view this post on Zulip fitzgen (he/him) (Jul 02 2024 at 16:44):

(I don't have permissions to ban accounts, finding someone who does)

view this post on Zulip Leaves (Jul 02 2024 at 23:30):

More specifically, how and where does cranelift deal with situations like this:

view this post on Zulip Leaves (Jul 02 2024 at 23:31):

v0, created in BB1 is dead and has no actual uses other than being passed around in a loop by BB2.

view this post on Zulip Chris Fallin (Jul 03 2024 at 00:11):

fitzgen (he/him) said:

also during lowering, right?

yep, also during lowering, implicitly (due to our demand-driven approach); so we have two implicit DCEs, one in the mid-end and one in the back-end

Jim said:

It doesn't seem to do anything with block parameters. If unused block parameters are not being removed, then predecessor blocks cannot know that successors don't need them. Which would mean that only code unused by the block it is in is going to be removed. which makes me question the post order traversal in the first place?

we separately have a remove_constant_phis pass that is meant to address this (link)

note that integrating this with at least the mid-end DCE (egraph approach overall, really) is something we'd like to do in the future, but for now they're separate

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip Leaves (Jul 03 2024 at 00:20):

It would seem to me like this would HAVE to be integrated with some sort of DCE pass to produce good results?Additionally, because these are separate, confirming that the original do_dce pass wasn't doing this, what was the point of the dfs post_order traversal in there? Since parameters wouldn't be changed, there could be no interblock/crossblock dce.

The dfs post-order makes sense in the same way that traversing the instructions in a single block backwards make sense, and when the cfg is a DAG my dce algorithm works as it does change block parameters as it traverses the dfs post-order. But without changing block parameters, I can't see what good the traversal of the BBs in that order is(was, since it's removed now) doing?

view this post on Zulip Chris Fallin (Jul 03 2024 at 00:34):

seems to me like this would HAVE to be integrated

Well, no, the constant-phis removal pass turned out to be useful at the time even without being in a DCE (or full opt rewrite) fixpoint. I believe the main issue when Julian wrote it (circa 2020 or so) was excessive redundancy in blockparams coming from the frontend at the time; subsequently Jamey did some cleanups to that and that may have reduced some pressure.

what was the point of the dfs post_order traversal in there? Since parameters wouldn't be changed, there could be no interblock/crossblock dce

Note that SSA allows inter-block use-def links; the only constraint is that the use is in a block dominated by the def (including but not limited to the same block).

Given that, removing dead instructions in a block deeper in the domtree (earlier in postorder) can lead to instructions becoming dead in "parent" (later) blocks.

I believe you're imagining the IR as a sort of "max-SSA" construction where every live value passes through blockparams, so unchanging blockparams "lock down" liveness at block boundaries; that's (fortunately!) not the case.

view this post on Zulip Leaves (Jul 03 2024 at 00:41):

Oh... I'm definitely imagining the IR as that, very good point. As it stands, I do pass all values from one block to another through block parameters. Why do you say "fortunately"?

view this post on Zulip Leaves (Jul 03 2024 at 00:42):

And yeah this locking down of liveness does seem to be what I need to deal with.

view this post on Zulip Leaves (Jul 03 2024 at 00:43):

And also sorry if I haven't been clear, I am not actually using cranelift, I am designing my own IR but am following several of the same design decisions as cranelift(and using Regalloc2 in the process because it's so handy!)

view this post on Zulip Chris Fallin (Jul 03 2024 at 00:57):

Ah, "fortunately" because it's a huge efficiency loss to require all live values to pass through blockparams, for both direct and indirect reasons (I learned this when building another tool using SSA -- part of weval -- that initially required this invariant for correctness):

view this post on Zulip Chris Fallin (Jul 03 2024 at 00:58):

also:

view this post on Zulip Leaves (Jul 03 2024 at 01:39):

Yeah I can see how that might choke things up for regalloc2. Ok going to spend some time reworking things now. No real need for the maximal ssa.

view this post on Zulip Leaves (Jul 03 2024 at 01:44):

what I really want is a pruned or optimal ssa I guess. Need to figure out how to produce that.

view this post on Zulip Altan Haan (Sep 18 2024 at 22:53):

Reviving this since I came across something that might be related (let me know if I should make a new thread!)

I was curious to see how cranelift would optimize the following program

int  foo(int x) {
    int y;
    if (x) {
        y = 1;
    } else {
        y = 1;
    }
    return y;
}

I hand wrote the following wasm:

  (func $foo (type 1) (param i32) (result i32)
    (local i32)
    block
      block
        local.get 0
        i32.eqz
        br_if 0
        i32.const 1
        local.set 1
        br 1
      end
      i32.const 1
      local.set 1
    end
    local.get 1
    return)

which cranelift compiled down to

0000000000000020 <wasm[0]::function[1]>:
      20:       55                      push   %rbp
      21:       48 89 e5                mov    %rsp,%rbp
      24:       85 d2                   test   %edx,%edx
      26:       0f 84 0a 00 00 00       je     36 <wasm[0]::function[1]+0x16>
      2c:       b8 01 00 00 00          mov    $0x1,%eax
      31:       e9 05 00 00 00          jmp    3b <wasm[0]::function[1]+0x1b>
      36:       b8 01 00 00 00          mov    $0x1,%eax
      3b:       48 89 ec                mov    %rbp,%rsp
      3e:       5d                      pop    %rbp
      3f:       c3                      ret

Evidently, the branch was not optimized away. This was slightly surprising since I expected that egraph GVN + elaboration should be able to eliminate the distinct assignments. Is this optimization missed because it requires composing elaboration with remove_constant_phis and some kind of branch elimination?

Thanks!

view this post on Zulip Chris Fallin (Sep 18 2024 at 22:58):

Right, we don't remove blockparams unless they are actually taking the same SSA value on all in-edges ("constant phis" is a little bit of a misnomer -- it doesn't refer to integer constants or any sort of constant-value analysis). That's why the value is materialized on both sides of the diamond (though fortunately register hinting and liverange merging worked well in regalloc here, so materialized directly to the return register)

view this post on Zulip Chris Fallin (Sep 18 2024 at 23:00):

the egraph infrastructure actually doesn't mutate blockparams at all -- that adds significant additional complexity that we haven't worked through yet (it's fair to say it's a research question, I guess?); "constant phi removal" happens first, to clean up anything that SSA construction from the Wasm may have missed

view this post on Zulip Amanieu (Sep 18 2024 at 23:56):

@Chris Fallin I actually do this in my compiler. I don't use egraphs but a forward-dataflow optimization (const-prop, simplification, etc) but the same principle applies. Blockparams are basically multi-input pure operations, the only tricky part is that the input may come from a later instruction in the processing order. The way to handle this is to re-run the pass until a fixed point: every time an branch argument is modified and the branch target has already been processed in the current pass then you need to do run another pass.

view this post on Zulip Amanieu (Sep 18 2024 at 23:58):

To avoid redundant work I do track for each value whether it has changed since the last pass. If no instruction inputs have changed then there is no need to re-process it in this pass.

view this post on Zulip Chris Fallin (Sep 19 2024 at 00:12):

Indeed, it's definitely possible! I do it in waffle (a Wasm-to-Wasm compiler library) as well. The trick with Cranelift is integrating it into the aegraph -- it's not a traditional multipass setup, there are interesting data structure implications to modifying the "skeleton", that's really the challenge

Wasm Analysis Framework For Lightweight Experiments - bytecodealliance/waffle

view this post on Zulip Jacob Lifshay (Sep 19 2024 at 03:17):

Chris Fallin said:

The trick with Cranelift is integrating it into the aegraph

I'm not as familiar with cranelift's internals, but assuming you have one aegraph for the function, couldn't you add the blockparams to the aegraph as phi functions since they're pure functions (with an extra input for the program counter)? this would allow pretty trivial dce of blockparams, assuming the aegraph can support operations that have cycles in the data flow graph

view this post on Zulip Jacob Lifshay (Sep 19 2024 at 03:18):

this would allow you to make it look like the skeleton wasn't changing since the changing parts would instead be the phi functions

view this post on Zulip Chris Fallin (Sep 19 2024 at 03:51):

I'm not as familiar with cranelift's internals ... couldn't you add the blockparams to the aegraph

Well, not quite; the aegraph is acyclic exactly because we break it at blockparams (i.e. do not see through them), and the acyclicity is critical for a bunch of other properties.

There's a lot of subtlety here; I'd encourage you to study the code (there are a lot of comments) and watch the talk; a number of folks have thought pretty hard about various aspects and "why not do X ..." thoughts are best had after all of that background is absorbed, though I do appreciate it is well-intentioned!

view this post on Zulip Jacob Lifshay (Sep 19 2024 at 05:40):

Chris Fallin said:

the acyclicity is critical for a bunch of other properties.

ah, yeah, interning doesn't really work with cyclic graphs.

watch the talk

I did a while ago, but apparently was brain-dead tired enough today that I forgot about needing acyclicity


Last updated: Dec 23 2024 at 13:07 UTC