wasmtime / PR #8006 egraph: Ensure eclass members are all... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #8006 egraph: Ensure eclass members are all...

Wasmtime GitHub notifications bot (Feb 28 2024 at 02:08):

jameysharp opened PR #8006 from jameysharp:egraph-available-block to bytecodealliance:main:

During elaboration, we want the property that it's safe to pick any member of an eclass to implement that eclass. To establish that property, this commit defines a notion of the "available block" for every value, and enforces that all values in an eclass have the same available block.

TODO: this needs more comments, especially documenting the available_block structure and the linear-time filtering loop in optimize_pure_enode. If somebody else wants to write those I would really appreciate it.

Future work: immediately after optimized_values.push(orig_value), we should try optimized_values.sort_unstable() and then optimized_values.dedup(). That avoids creating union nodes for duplicate values, which should save a little time downstream in the compiler pipeline, but causes values to get renumbered in some filetests so they need more inspection. It also means we can delete the if optimized_value == orig_value test in the loop afterward. But it's an O(n log n) pass over a potentially large list so we should benchmark the change too.

cc: @elliottt @cfallin @lpereira

Wasmtime GitHub notifications bot (Feb 28 2024 at 02:22):

jameysharp edited PR #8006:

During elaboration, we want the property that it's safe to pick any member of an eclass to implement that eclass. To establish that property, this commit defines a notion of the "available block" for every value, and enforces that all values in an eclass have the same available block.

Closes #7891 where we devised this plan.

TODO: this needs more comments, especially documenting the available_block structure and the linear-time filtering loop in optimize_pure_enode. If somebody else wants to write those I would really appreciate it.

Future work: immediately after optimized_values.push(orig_value), we should try optimized_values.sort_unstable() and then optimized_values.dedup(). That avoids creating union nodes for duplicate values, which should save a little time downstream in the compiler pipeline, but causes values to get renumbered in some filetests so they need more inspection. It also means we can delete the if optimized_value == orig_value test in the loop afterward. But it's an O(n log n) pass over a potentially large list so we should benchmark the change too.

cc: @elliottt @cfallin @lpereira

Wasmtime GitHub notifications bot (Feb 28 2024 at 02:25):

jameysharp updated PR #8006.

Wasmtime GitHub notifications bot (Feb 28 2024 at 14:42):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Feb 28 2024 at 14:42):

fitzgen created PR review comment:

Might be worth pre-reserving space here, since IIUC we are going to end up filling this in for every single value.

Wasmtime GitHub notifications bot (Feb 28 2024 at 14:42):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Feb 28 2024 at 14:42):

fitzgen created PR review comment:

Rather than the indirect pushing and then popping, this could be
        let mut union_value = optimized_values.pop().unwrap_or(orig_value);
Then, as a reader, I don't have to wonder if the push is going to trigger a realloc or not (probably not because it is a smallvec and we limit the number of returns from ISLE in ISLE, but I think it is possible if we return the max ISLE returns and then push orig_vlaue?)

Wasmtime GitHub notifications bot (Feb 28 2024 at 16:03):

cfallin submitted PR review:

Very nice overall! Just some comment nits below.

Also, could we have a test case that triggers the trimming behavior, perhaps with our subsume-for-correctness disabled? Actually, a thought on that: we could (with this PR, not before!) add an option disable_egraph_subsume to Cranelift; keep it off by default, but add it to a test we know triggers this auto-subsume; then fuzz with the option; then eventually turn it on and remove our careful subsume cases.

Wasmtime GitHub notifications bot (Feb 28 2024 at 16:03):

cfallin created PR review comment:

Can we add a comment here about the SSA property we're relying on? Something like: "Note that the def-point of all arguments to an instruction in SSA lie on a line of direct ancestors in the domtree, and so do their available-blocks. This means that either x dom y or y dom x; they cannot be unordered wrt each other."

Wasmtime GitHub notifications bot (Feb 28 2024 at 16:03):

cfallin submitted PR review:

Very nice overall! Just some comment nits below.

Also, could we have a test case that triggers the trimming behavior, perhaps with our subsume-for-correctness disabled? Actually, a thought on that: we could (with this PR, not before!) add an option disable_egraph_subsume to Cranelift; keep it off by default, but add it to a test we know triggers this auto-subsume; then fuzz with the option; then eventually turn it on and remove our careful subsume cases.

Wasmtime GitHub notifications bot (Feb 28 2024 at 16:03):

cfallin created PR review comment:

This loop looks correct to me (very slick in-place compaction!) but could we have a comment describing what it's doing? It took me a little bit, with knowledge of what it's supposed to be doing, to grok it. Perhaps something like: "Remove any values from optimized_values that do not have the highest possible available block in the domtree." at the top, and "This value's available block is higher; all higher indices we've already scanned have lower blocks and need to be removed, so truncate after this element." in the first case, and "This value's available block is lower, and we want to skip it; remove it from the list and keep going (order does not matter so swap_remove is OK)." in the second?

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:52):

lpereira submitted PR review.

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:52):

lpereira created PR review comment:

One of the things I'm not sure here is that, when swap_remove() is called, the element that was at the end of optimized_values is put in the idx-th position, but then it's never used again to lookup this_block for the next iteration. Is this the desirable behavior, or am I missing something here?

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:56):

lpereira edited PR review comment.

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:56):

lpereira edited PR review comment.

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:59):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Feb 28 2024 at 22:59):

cfallin created PR review comment:

(Re: your edit) yep, indeed; I think it might be good to add a comment with the loop invariant here, for clarity. Something like: "Loop scans in reverse; all values at indices >= idx have the same available block, which is the best (highest in domtree) available block seen so far."

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:09):

jameysharp submitted PR review:

I think I've addressed all the current review comments and added comments in all the places I intended to add them. I expect this will still fail CI as I haven't investigated why it failed before yet.

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:09):

jameysharp submitted PR review:

I think I've addressed all the current review comments and added comments in all the places I intended to add them. I expect this will still fail CI as I haven't investigated why it failed before yet.

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:09):

jameysharp created PR review comment:

Pre-allocating space for this map is not totally trivial for two reasons.

We add more values to the data-flow graph while the egraph pass is running and I don't know what bound to use if we want to avoid reallocating later.

SecondaryMap has with_default and with_capacity constructors, but no constructor that does both, and types like Block don't implement Default. This is mostly just an annoyance since we can call resize after construction, but resize initializes all of the backing memory, while with_capacity doesn't, so it might make a difference.

I'll go ahead and do an initial resize to the number of values present before the start of the egraph pass, but if you have any suggestions for estimating a reasonable larger size, let me know.

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:09):

jameysharp created PR review comment:

We need the ability to filter out the original value, not just the optimized values. We could add an extra check after the loop to compare the original value's available block against best_block, but that means pretty much duplicating the body of the loop, and I think the cleanest thing to do then would still involve pushing the original value sometimes.

Besides, this smallvec is allocated with size MATCHES_LIMIT, which is 5, but ISLE truncates at MAX_ISLE_RETURNS, which is 8, so we already might heap-allocate. Now that I've learned that I have a suggestion for simplifying ISLE and fixing that discrepancy at the same time, but that should be a separate issue.

The simplest way to ensure that this push can't cause a realloc is to push _before_ calling constructor_simplify, which does not require that the provided vector is empty on entry. I can do that if you prefer.

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:10):

jameysharp updated PR #8006.

Wasmtime GitHub notifications bot (Mar 04 2024 at 23:19):

cfallin submitted PR review:

LGTM wrt my comments at least; thanks very much for working on this!

Wasmtime GitHub notifications bot (Mar 05 2024 at 00:00):

jameysharp submitted PR review:

I forgot about Chris' request for some new test cases, so I guess I haven't actually addressed all review comments yet.

And I think I don't want to address Nick's request, to avoid a possible realloc, in this PR. It's an easy change but it changes a bunch of filetests unnecessarily because it means that the optimized_values and orig_value are visited in a different order.

Also, and more importantly, there's something wrong with GVN.

Wasmtime GitHub notifications bot (Mar 05 2024 at 00:00):

jameysharp submitted PR review:

I forgot about Chris' request for some new test cases, so I guess I haven't actually addressed all review comments yet.

And I think I don't want to address Nick's request, to avoid a possible realloc, in this PR. It's an easy change but it changes a bunch of filetests unnecessarily because it means that the optimized_values and orig_value are visited in a different order.

Also, and more importantly, there's something wrong with GVN.

Wasmtime GitHub notifications bot (Mar 05 2024 at 00:00):

jameysharp created PR review comment:

This debug-assert is failing in a few of the wasm spec tests and I don't understand why yet.

In every case that's failed, the GVN map says the instruction we looked up is equivalent to an existing instruction, but the available-block for the existing instruction is the entry block, while the available-block for the new instruction is somewhere deeper in the dominator tree.

I believed that GVN would always yield the same available-block for pure instructions so this is concerning.

Wasmtime GitHub notifications bot (Mar 05 2024 at 19:58):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 05 2024 at 19:58):

jameysharp created PR review comment:

I'm still reasoning through this. For GVN to match, both instructions must be identical, except that each operand is only required to be in the same set in the union-find rather than actually equal. The assertion failing means that for some operand in the instruction, the two instructions consume values which are in the same union-find set but have different available-blocks.

I had been concerned about this specific situation but Chris and I reasoned that it was okay (https://github.com/bytecodealliance/wasmtime/pull/7891#issuecomment-1960531054) and now I'm confused about whether we were right and the assertion is overly strict. If so, what is the correct available-block to use here?

Wasmtime GitHub notifications bot (Mar 05 2024 at 23:54):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Mar 05 2024 at 23:54):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Mar 05 2024 at 23:54):

cfallin created PR review comment:

I suspect what is happening here is that the instruction gets rewritten the first time it is seen and the GVN-map entry points to the rewritten result (which could be e.g. an iconst available everywhere), even though the instruction per its arguments is only available in one block. So I think the assertion that we want is that the available block of orig_result (which we're going to use as our value here) dominates the available block of inst, rather than is exactly the same. I think that resolves the issue?

Wasmtime GitHub notifications bot (Mar 06 2024 at 02:41):

jameysharp updated PR #8006.

Wasmtime GitHub notifications bot (Mar 06 2024 at 02:41):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 06 2024 at 02:41):

jameysharp created PR review comment:

That does make the assert pass in all the tests we have. I think you're correct but the reason why was obscured by the surrounding code.

First off, the GVN-hit case did not always return orig_result. If we were looking up a NewOrExistingInst::Existing instruction, insert_pure_enode actually returned the result of that instruction. But that case only happens when this function was called from remove_pure_and_optimize, which ignores the return value. The return value only matters for make_inst_ctor, which always passes NewOrExistingInst::New, in which case we were returning orig_result.

Therefore it's safe to change this function to always return orig_result, which I think is what we probably intended in the first place, and based on your comment is what you thought we were already doing. :grin:

The next observation is that if the result of inst never appears in the values of the value_to_opt_value map, then it will never get looked up in available_block (or in the union-find, for that matter). So we don't actually need an answer to the question of what block this instruction should be available in. In fact it's better to leave it at the "reserved" default value so we'll get assertion failures if we accidentally do look it up.

I've made these changes; could you review again when you have a chance? Hopefully CI will pass now too.

Wasmtime GitHub notifications bot (Mar 06 2024 at 04:51):

cfallin submitted PR review:

Changes LGTM. OK to merge with a test-case added that exercises the "auto-subsume" behavior (or if you've seen that it's already covered by one of the existing tests, note that here and that's fine too). Thanks!

Wasmtime GitHub notifications bot (Mar 06 2024 at 04:52):

cfallin commented on PR #8006:

(Or, I guess that could be slightly tricky with our current rules around subsume; OK to punt test to followup removal of the subsumes if you like.)

Wasmtime GitHub notifications bot (Mar 07 2024 at 17:36):

jameysharp updated PR #8006.

Wasmtime GitHub notifications bot (Mar 07 2024 at 17:37):

jameysharp has marked PR #8006 as ready for review.

Wasmtime GitHub notifications bot (Mar 07 2024 at 17:37):

jameysharp requested elliottt for a review on PR #8006.

Wasmtime GitHub notifications bot (Mar 07 2024 at 17:37):

jameysharp requested wasmtime-compiler-reviewers for a review on PR #8006.

Wasmtime GitHub notifications bot (Mar 07 2024 at 17:39):

jameysharp has enabled auto merge for PR #8006.

Wasmtime GitHub notifications bot (Mar 07 2024 at 18:18):

jameysharp merged PR #8006.

Last updated: Apr 18 2025 at 08:04 UTC