Stream: git-wasmtime

Topic: wasmtime / PR #7891 egraphs: Undo changes to union find a...


view this post on Zulip Wasmtime GitHub notifications bot (Feb 07 2024 at 22:49):

elliottt opened PR #7891 from bytecodealliance:revert-union-find to bytecodealliance:main:

This is not an efficient implementation, and is meant to test a theory.

Explicitly cache and restore the values of the union find and gvn map structures when backtracking while traversing the dominator graph. This avoids accidentally violating the invariant provided by Union values: they should only union together values that locally are valid substitutions.

Additionally, remove uses of the union find structure during elaboration, as we have already computed the best values prior to elaboration.

<!--
Please make sure you include the following information:

Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.html

Please ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 00:29):

cfallin commented on PR #7891:

Re: the "scoping" approach, an efficient implementation (I think!) occurred to me a few days ago. It would probably get at what I think y'all are trying to get at here. Basically what I think we need to do is:

Finally, when computing the best nodes in each class, compare based on these intervals first, so the most-widely-available value wins.

This should I think have O(1) cost, require storage of an integer pair per node (maybe a u32 if we feel like limiting to 2^16 blocks, u64 otherwise), and (I think!) retains the "cost function is purely heuristic" property without the need to think about subsumes.

(I think I have a proof sketch that all this works; more next week if it doesn't break in the meantime :-) )

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 00:29):

cfallin edited a comment on PR #7891:

Re: the "scoping" approach, an efficient implementation (I think!) occurred to me a few days ago. It would probably get at what I think y'all are trying to get at here. Basically what I think we need to do is:

This should I think have O(1) cost, require storage of an integer pair per node (maybe a u32 if we feel like limiting to 2^16 blocks, u64 otherwise), and (I think!) retains the "cost function is purely heuristic" property without the need to think about subsumes.

(I think I have a proof sketch that all this works; more next week if it doesn't break in the meantime :-) )

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 00:30):

cfallin edited a comment on PR #7891:

Re: the "scoping" approach, an efficient implementation (I think!) occurred to me a few days ago. It would probably get at what I think y'all are trying to get at here. Basically what I think we need to do is:

This should I think have O(1) cost (edit: per node; O(n) over the program), require storage of an integer pair per node (maybe a u32 if we feel like limiting to 2^16 blocks, u64 otherwise), and (I think!) retains the "cost function is purely heuristic" property without the need to think about subsumes.

(I think I have a proof sketch that all this works; more next week if it doesn't break in the meantime :-) )

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 00:31):

cfallin edited a comment on PR #7891:

Re: the "scoping" approach, an efficient implementation (I think!) occurred to me a few days ago. It would probably get at what I think y'all are trying to get at here. Basically what I think we need to do is:

This should I think have O(1) cost (edit: per node; O(n) over the program), require storage of an integer pair per node (maybe a u32 if we feel like limiting to 2^16 blocks, u64 otherwise), and (I think!) retains the "cost function is purely heuristic" property without the need to think about subsumes.

(I think I have a proof sketch that all this works; more next week if it doesn't break in the meantime :-) )

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 00:32):

cfallin edited a comment on PR #7891:

Re: the "scoping" approach, an efficient implementation (I think!) occurred to me a few days ago. It would probably get at what I think y'all are trying to get at here. Basically what I think we need to do is:

This should I think have O(1) cost (edit: per node; O(n) over the program), require storage of an integer pair per node (maybe a u32 if we feel like limiting to 2^16 blocks, u64 otherwise), and (I think!) retains the "cost function is purely heuristic" property without the need to think about subsumes.

(I think I have a proof sketch that all this works; more next week if it doesn't break in the meantime :-) )

view this post on Zulip Wasmtime GitHub notifications bot (Feb 08 2024 at 01:56):

github-actions[bot] commented on PR #7891:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Feb 14 2024 at 23:10):

jameysharp commented on PR #7891:

When creating pure nodes, record the intersection of the intervals of the args (max of starts, min of ends). Note that the intervals must overlap because of a property of SSA: all inputs ultimately come from blocks up the direct ancestry in the domtree, not "side blocks".

I'm trying to make sure I follow your reasoning here so I want to restate this in a way that doesn't refer to preorder/postorder intervals, which I think may be a clever implementation detail rather than a necessary aspect of the solution.

All of an instruction's operands must dominate the instruction. The SSA properties imply that dominance imposes a total order on those values which are used together as operands in the same instruction, and in particular one operand must be dominated by all the others. Therefore, the instruction can be placed anywhere in the subtree dominated by that one operand.

Is that right? If so we need to be able to quickly evaluate which block is dominated by all the operands, and we can use the pre/post-order intervals as an implementation detail to identify one operand whose block is the right choice using constant-time comparisons. We can use the same trick to optimize dominance comparisons elsewhere in Cranelift so this may be a useful trick to apply more broadly.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 14 2024 at 23:35):

jameysharp commented on PR #7891:

If I have that right then the neat follow-up trick that I think works is to record a block ID for each value rather than an interval per value. We just need to compare values according to their associated block's interval, we don't need to compute new intervals or map computed intervals back to blocks. And the block ID for a value is just the block where the value is actually defined, so we can get it from the layout rather than needing to store anything extra per value.

I'm not sure how to extend this to union nodes, but I'm also not clear on whether we need to answer dominance questions about union nodes. If your sketch of an algorithm is correct about union nodes, Chris, and if I understand it correctly, then the block associated with a union node is whichever block dominates all the values in the eclass identified by that union node. I'm not sure what that tells us.

By the way, I said this verbally to somebody recently but we should write it down: In the aegraph, every Value identifies an eclass, but some eclasses contain others. It's important to remember that our form of eclass doesn't grow over time, we just may start referring to a different eclass that contains it as a subset. We had quite a bit of confusion about aegraph invariants until we clarified that.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 15 2024 at 00:33):

cfallin commented on PR #7891:

That all sounds basically correct to me! A few thoughts:

The last bit is pretty subtle: it wasn't apparent to me at first that this would always work. Consider e.g. block B1 dominating two sub-blocks B2 and B3 in the domtree, and two defs of a value v1, v2 in B2 and v3 in B3 (i.e., v1 = union v2, v3.) Maybe v2's highest-possible-block is B2 and v3's highest-possible-block is B3. Should v1's highest-possible-block be B1? There are no valid ways to compute it there!

It turns out that the answer is that this can't happen: the highest-possible-block is sort of a max of a set of "fixed-location" (impure) inputs, and so the only way for v2 and v3 to be in this situation is for them to be functions of blockparams (or e.g. memory loads) in B2 and B3 respectively. But then they shouldn't be in the same eclass -- they are two separate values that never exist at the same time (there is no block that is dominated by both B2 and B3) -- unless the eclass is incomplete and they both e.g. merge to iconst 0 (in which case the highest-available-block is the root).

Said another way, I think the invariant is that all members of an eclass have highest-available-block that lie along a path from some block upward toward the root of the domtree; there is always some unique answer for "block that dominates all other blocks" in this set of blocks. That block is what we hope to propagate forward through union nodes.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2024 at 02:35):

jameysharp commented on PR #7891:

Okay, @elliottt and @lpereira and I just spent a couple hours trying to reason through this and I, for one, am toast. But I'm going to see if I can write down what we learned anyway.

As I was writing this I became uncertain that we investigated the right questions, but for reference, the approach we were planning was to filter the result of the gvn_map.get() call in insert_pure_enode. We'd pretend that there was no entry in the map if the value returned was associated with a block that's not dominated by the block associated with the instruction we're trying to look up. (Or maybe it was the other way around, I've already forgotten.)

I wonder if #7922 might have already fixed the underlying issue that #7879 introduced more subsume uses to work around. (We have not yet tried reverting #7879 on top of main to check that, which we should try since #7922 has landed.)

The remaining part of this draft PR that was not covered in #7922 is the changes in egraph.rs that undo union-find and gvn-map changes when backtracking up the dominator tree in remove_pure_and_optimize. So we tried to figure out what impact those changes actually have on correctness. To do that, we commented out the line in this PR that says gvn_map = saved_gvn_map, while still leaving self.eclasses = eclasses, and re-ran the filetests. (We had previously tried the other way around, restoring the GVN map but letting the union-find keep growing, and found that all filetests pass.)

The only test which failed was from cranelift/filetests/filetests/pcc/succeed/opt.clif. Specifically, only the test which is supposed to demonstrate redundant-load elimination failed.

Side note: this proof-carrying-code RLE test does not appear to test what it was supposed to because the v2 load is notrap and never used, so it is deleted by the DCE pass before the egraph even sees it. Neither of the remaining loads is redundant with the other as they're on different branches, so the egraph pass doesn't change the function at all.

So the bug we see when we undo changes to the union-find but not to the GVN map only occurs when there are two different branches of the dominance tree containing instructions which have equivalent InstructionData. (Apparently we need to write more filetests with this pattern?) When we encounter the second copy of the instruction, insert_pure_enode tries to record in the union-find that it's equivalent to the first copy as found in the GVN map, but the union-find no longer knows that the first copy exists, and panics because we tried to union with a value that we didn't add first.

I want to think more carefully about whether this situation can lead to bugs outside of this specific situation (undoing changes to the union-find but not to the GVN map). I guess the question is, what happens with a function like the following? (I haven't tested this.)

block0(v0: i64):
  v1 = iconst.i64 0
  brif v0, block1(v0), block2

block1(v2: i64):
  v3 = isub v2, v2
  v4 = icmp v3, v0
  jmp block1(v4)

block2:
  v5 = icmp v1, v0
  return v5

If we don't undo changes in either the GVN map or the union-find, then we'll have v3 and v1 in the same set in the union-find when we try to look up the instruction defining v5 in the map, so we'll decide that it's the same instruction as v4 and try to re-use that. If we then choose v3's isub instruction during elaboration of block2 instead of the iconst, we'll try to use block1's block-parameter in block2, where it isn't in scope. This is because the return instruction points to an eclass (union node) that is only valid in the block1 scope.

I am way out of energy at this point but I think this will be a good test case to study when we pick this up again next week, and I think it might be made safe by the plan we discussed today.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2024 at 02:51):

cfallin commented on PR #7891:

@jameysharp most of this sounds basically right to me and agrees with my understanding; thank you all for looking into this more deeply!

One thought on the "trying to revert..." and experiments in general: I may be misunderstanding the intent here (if so, my apologies), but at least to me it doesn't seem too important to understand exactly which combination of partial fixes is just enough to mask issues; and the empirical approach ("does this particular bug occur on this particular test-case with these particular algorithm knobs turned this way") to me feels like it muddies the waters a bit. I suspect maybe y'all are trying to map the landscape better? Or is there a question of finding some other fix or minimize changes somehow?

In any case, I think the test case you wrote out, combined with worst-case cost extraction (say that we always prefer isub to iconst, for example...) is basically the canonical example to test with here. Maybe the way to develop this fix is to make the cost function "evil" in this way, and then get the dominance-scoping right so the isub is ignored? That way, we can be sure that no matter what we do with the cost function, it's not load-bearing.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2024 at 03:03):

cfallin commented on PR #7891:

(Actually, that gives me an idea: maybe a chaos-mode knob, or just an OptLevel, could be "as pessimized as possible"; this inverts cost comparisons so we always pick the most expensive enode in any eclass. Can't wait to see what that would do for testing!)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2024 at 06:37):

elliottt commented on PR #7891:

I love the idea of a chaos mode cost function! Much of what we discussed this afternoon was that when choosing the best value from union nodes, it really must be acceptable to choose either branch. Having a chaos mode for the cost function would be a great way to test that we're not accidentally relying on the cost function for correctness anymore.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2024 at 23:50):

jameysharp commented on PR #7891:

I suspect maybe y'all are trying to map the landscape better?

Yeah, exactly. I won't speak for anyone else but I certainly didn't feel like I had a firm grasp on which invariants are important for the bug we investigated a couple weeks ago. So we tried breaking various implicit invariants to see what would fail and whether that would give us better intuitions about what's necessary. I think these experiments did help us get closer.

I also love the idea of the chaos-mode cost function, so maybe we can work on that next week.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 00:52):

jameysharp commented on PR #7891:

In the interests of trying to make Cranelift give correct results without relying on the subsume guidelines that we recently introduced as a temporary measure, I've constructed an interesting CLIF test.

test optimize
set opt_level=speed
target x86_64

function %bad_reuse(i32) -> i8 {
block0(v0: i32):
    v1 = iconst.i32 0
    brif v0, block1, block2(v0)

block1:
    brif v0, block3, block2(v1)

block2(v2: i32):
    v3 = isub v2, v2
    v4 = icmp eq v0, v3
    return v4

block3:
    v5 = icmp eq v0, v1
    return v5
}

This test currently passes, but with two small changes to the egraph pass it fails.

With those changes, the verifier rejects the output of the egraph pass:

    function %bad_reuse(i32) -> i8 fast {
    block0(v0: i32):
        brif v0, block1, block2(v0)

    block1:
        v1 = iconst.i32 0
        brif.i32 v0, block3, block2(v1)  ; v1 = 0

    block2(v2: i32):
        v7 = isub v2, v2
        v8 = icmp.i32 eq v0, v7
        return v8

    block3:
        v3 = isub.i32 v2, v2
    ;   ^~~~~~~~~~~~~~~~~~~~
    ; error: inst3 (v3 = isub.i32 v2, v2): uses value arg from non-dominating block2

        v4 = icmp.i32 eq v0, v3
        return v4
    }

This shows that we're currently relying on either subsume or the cost function to pick valid choices from an eclass constructed through GVN. If we break _both_ of those mechanisms then this test is miscompiled by duplicating the isub into a branch where it isn't legal.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 02:08):

cfallin commented on PR #7891:

@jameysharp this aligns with my understanding from our discussion yesterday; thanks for putting together the concrete test-case! Just to ensure I understand your path from here (and to have it written down for others): is the current plan to implement the domtree-scoping idea per node (from here)?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 21:06):

jameysharp commented on PR #7891:

Yes, roughly, @cfallin, but we've discussed enough variations on that idea that I'm going to try writing it down again explicitly.

The first piece is to compute, for every value in the DataFlowGraph, that value's "available block". This is the unique block, closest to the root of the dominator tree, at which it is legal to place an instruction which computes that value.

So the second piece of this plan is to establish this new invariant. The only time we construct union nodes is in optimize_pure_enode, when we call the ISLE simplify entry-point and union together all the alternatives it gives us. So that's the only place we need to change.

We'll choose to do it by discarding any alternative with an available block that is strictly dominated by the available block of some other alternative. In other words we choose only alternatives that all have the same available block, where that is the block which is closest, of all the alternatives, to the entry block in the dominator tree. One way to think of this is like an eagerly-evaluated cost function, applied to prune the egraph while we're still building it.

This is a form of subsume which automatically covers exactly the cases where we currently need subsume for correctness, but does not cover all cases where we currently use subsume for performance. For example, a rule which rewrites isub x, x to iconst 0 will only automatically eliminate the isub from later consideration if x is not available in the entry block, where iconst 0 is always available. So we might still want subsume annotations even if we don't strictly need them.

The third piece to consider here is the union-find data structure and the GVN map. There are three general cases where we union values together in the union-find.

So there's at least one hole in this plan that I haven't sorted out yet.


Here's a slightly simpler test case that occurred to me while thinking through this plan.

test optimize
set opt_level=speed
target x86_64

function %bad_reuse(i64) -> i8 {
block0(v0: i64):
    brif v0, block1, block2

block1:
    v1 = load.i64 heap v0
    v2 = isub v1, v1
    v3 = icmp eq v0, v2
    return v3

block2:
    v4 = iconst.i64 0
    v5 = icmp eq v0, v4
    return v5
}

When subsume is disabled and the cost model picks the worst alternative, elaboration tries to duplicate the load into block2. That triggers the "something has gone very wrong if we are elaborating effectful instructions, they should have remained in the skeleton" assertion that was added recently in #7859.


It's not immediately obvious to me that this plan is correct with regard to side-effecting instructions, but I think it's okay. The question is, why is an "available block" precise enough, rather than an "available instruction"? What if we construct an eclass where one alternative is legal before a load but the other must be placed after the load because it uses its result? Could elaboration ever pick the latter alternative when trying to place it earlier?

block0(v0: i64) -> i64:
    v1 = iconst.i32 0
    v2 = load.i32 v0
    v3 = isub v2, v2
    return v3

I think the answer is that the load, and any earlier instruction in the side-effecting skeleton, can only reference versions of eclasses which contain values that were originally defined before that instruction. So we can only extract a "best" instruction from an eclass that transitively depends on the load when we're elaborating later instructions in the side-effecting skeleton.

That reasoning only holds now that we merged #7922 though, because before that I think the use of the union-find structure could smuggle later versions of eclasses to earlier instructions during elaboration.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 21:18):

cfallin commented on PR #7891:

Thanks for writing this all out! One question:

When subsume is disabled and the cost model picks the worst alternative, elaboration tries to duplicate the load into block2. That triggers the "something has gone very wrong if we are elaborating effectful instructions, they should have remained in the skeleton" assertion that was added recently in #7859.

I'm not sure I understand why the icmp-against-load would still be in the eclass: given your invariant, and checks on union to maintain it, wouldn't the rewriting while still in block1 retain only icmp-against-iconst (a new iconst created by the rewrite on isub v1, v1), with an available block of block0 (constrained only by v0), and then we match that when looking up the GVN map in block2?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 21:20):

cfallin commented on PR #7891:

Ah and re:

The question is, why is an "available block" precise enough, rather than an "available instruction"?

is it enough that we do a forward scan through the block and process in that order? (In other words, what is the bad case that happens with dominance? It occurs because we move back up the domtree, but retain some state from our deep traversal. That doesn't happen within a single block.)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 22:23):

jameysharp commented on PR #7891:

When subsume is disabled and the cost model picks the worst alternative, elaboration tries to duplicate the load into block2. That triggers the "something has gone very wrong if we are elaborating effectful instructions, they should have remained in the skeleton" assertion that was added recently in #7859.

I'm not sure I understand why the icmp-against-load would still be in the eclass: given your invariant, and checks on union to maintain it, […]

I was describing what happens with that test case today, when I break both subsume and the cost model. I believe you're right that the invariant I described should fix this test case, and I believe it should also fix the test case I posted yesterday.

The question is, why is an "available block" precise enough, rather than an "available instruction"?

is it enough that we do a forward scan through the block and process in that order? (In other words, what is the bad case that happens with dominance? It occurs because we move back up the domtree, but retain some state from our deep traversal. That doesn't happen within a single block.)

Yeah, I think that's getting at the same thing I was trying to say. But scanning in forward order is only sufficient if we're also careful not to let any eclass travel "back in time" during elaboration, like what we fixed in #7922.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 22 2024 at 23:55):

jameysharp commented on PR #7891:

After reasoning this through more with @cfallin, we're both pretty convinced that my worries about the union-find and GVN map are not necessary, and that just pruning the result of simplify is sufficient for correctness.

I was worried that if an instruction was added to the GVN map in one branch, and matched in an unrelated branch, that using the same value could lead to using an eclass that transitively references values that aren't available in the new context. (That's what's happening in the two test cases I've posted in this thread.)

However, Chris pointed out that the new invariant I described means that the only values which can be added to the GVN map are valid everywhere that the instruction could appear.

The most important thing about that is it makes the original motivation of this draft PR irrelevant: We don't need to undo changes in the union-find or in the GVN map during backtracking.

If anyone has more questions about this, please ask, because attempting to answer them will probably help me straighten this out more in my head.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2024 at 18:18):

jameysharp closed without merge PR #7891.


Last updated: Nov 22 2024 at 16:03 UTC