jameysharp opened PR #8006 from jameysharp:egraph-available-block
to bytecodealliance:main
:
During elaboration, we want the property that it's safe to pick any member of an eclass to implement that eclass. To establish that property, this commit defines a notion of the "available block" for every value, and enforces that all values in an eclass have the same available block.
TODO: this needs more comments, especially documenting the
available_block
structure and the linear-time filtering loop inoptimize_pure_enode
. If somebody else wants to write those I would really appreciate it.Future work: immediately after
optimized_values.push(orig_value)
, we should tryoptimized_values.sort_unstable()
and thenoptimized_values.dedup()
. That avoids creating union nodes for duplicate values, which should save a little time downstream in the compiler pipeline, but causes values to get renumbered in some filetests so they need more inspection. It also means we can delete theif optimized_value == orig_value
test in the loop afterward. But it's an O(n log n) pass over a potentially large list so we should benchmark the change too.cc: @elliottt @cfallin @lpereira
jameysharp edited PR #8006:
During elaboration, we want the property that it's safe to pick any member of an eclass to implement that eclass. To establish that property, this commit defines a notion of the "available block" for every value, and enforces that all values in an eclass have the same available block.
Closes #7891 where we devised this plan.
TODO: this needs more comments, especially documenting the
available_block
structure and the linear-time filtering loop inoptimize_pure_enode
. If somebody else wants to write those I would really appreciate it.Future work: immediately after
optimized_values.push(orig_value)
, we should tryoptimized_values.sort_unstable()
and thenoptimized_values.dedup()
. That avoids creating union nodes for duplicate values, which should save a little time downstream in the compiler pipeline, but causes values to get renumbered in some filetests so they need more inspection. It also means we can delete theif optimized_value == orig_value
test in the loop afterward. But it's an O(n log n) pass over a potentially large list so we should benchmark the change too.cc: @elliottt @cfallin @lpereira
jameysharp updated PR #8006.
fitzgen submitted PR review.
fitzgen created PR review comment:
Might be worth pre-reserving space here, since IIUC we are going to end up filling this in for every single value.
fitzgen submitted PR review.
fitzgen created PR review comment:
Rather than the indirect pushing and then popping, this could be
let mut union_value = optimized_values.pop().unwrap_or(orig_value);
Then, as a reader, I don't have to wonder if the push is going to trigger a realloc or not (probably not because it is a smallvec and we limit the number of returns from ISLE in ISLE, but I think it is possible if we return the max ISLE returns and then push
orig_vlaue
?)
cfallin submitted PR review:
Very nice overall! Just some comment nits below.
Also, could we have a test case that triggers the trimming behavior, perhaps with our subsume-for-correctness disabled? Actually, a thought on that: we could (with this PR, not before!) add an option
disable_egraph_subsume
to Cranelift; keep it off by default, but add it to a test we know triggers this auto-subsume; then fuzz with the option; then eventually turn it on and remove our careful subsume cases.
cfallin created PR review comment:
Can we add a comment here about the SSA property we're relying on? Something like: "Note that the def-point of all arguments to an instruction in SSA lie on a line of direct ancestors in the domtree, and so do their available-blocks. This means that either
x dom y
ory dom x
; they cannot be unordered wrt each other."
cfallin submitted PR review:
Very nice overall! Just some comment nits below.
Also, could we have a test case that triggers the trimming behavior, perhaps with our subsume-for-correctness disabled? Actually, a thought on that: we could (with this PR, not before!) add an option
disable_egraph_subsume
to Cranelift; keep it off by default, but add it to a test we know triggers this auto-subsume; then fuzz with the option; then eventually turn it on and remove our careful subsume cases.
cfallin created PR review comment:
This loop looks correct to me (very slick in-place compaction!) but could we have a comment describing what it's doing? It took me a little bit, with knowledge of what it's supposed to be doing, to grok it. Perhaps something like: "Remove any values from
optimized_values
that do not have the highest possible available block in the domtree." at the top, and "This value's available block is higher; all higher indices we've already scanned have lower blocks and need to be removed, so truncate after this element." in the first case, and "This value's available block is lower, and we want to skip it; remove it from the list and keep going (order does not matter soswap_remove
is OK)." in the second?
lpereira submitted PR review.
lpereira created PR review comment:
One of the things I'm not sure here is that, when
swap_remove()
is called, the element that was at the end ofoptimized_values
is put in theidx-th
position, but then it's never used again to lookupthis_block
for the next iteration. Is this the desirable behavior, or am I missing something here?
lpereira edited PR review comment.
lpereira edited PR review comment.
cfallin submitted PR review.
cfallin created PR review comment:
(Re: your edit) yep, indeed; I think it might be good to add a comment with the loop invariant here, for clarity. Something like: "Loop scans in reverse; all values at indices >= idx have the same available block, which is the best (highest in domtree) available block seen so far."
jameysharp submitted PR review:
I think I've addressed all the current review comments and added comments in all the places I intended to add them. I expect this will still fail CI as I haven't investigated why it failed before yet.
jameysharp submitted PR review:
I think I've addressed all the current review comments and added comments in all the places I intended to add them. I expect this will still fail CI as I haven't investigated why it failed before yet.
jameysharp created PR review comment:
Pre-allocating space for this map is not totally trivial for two reasons.
- We add more values to the data-flow graph while the egraph pass is running and I don't know what bound to use if we want to avoid reallocating later.
SecondaryMap
haswith_default
andwith_capacity
constructors, but no constructor that does both, and types likeBlock
don't implementDefault
. This is mostly just an annoyance since we can callresize
after construction, butresize
initializes all of the backing memory, whilewith_capacity
doesn't, so it might make a difference.I'll go ahead and do an initial resize to the number of values present before the start of the egraph pass, but if you have any suggestions for estimating a reasonable larger size, let me know.
jameysharp created PR review comment:
We need the ability to filter out the original value, not just the optimized values. We could add an extra check after the loop to compare the original value's available block against
best_block
, but that means pretty much duplicating the body of the loop, and I think the cleanest thing to do then would still involve pushing the original value sometimes.Besides, this smallvec is allocated with size
MATCHES_LIMIT
, which is 5, but ISLE truncates atMAX_ISLE_RETURNS
, which is 8, so we already might heap-allocate. Now that I've learned that I have a suggestion for simplifying ISLE and fixing that discrepancy at the same time, but that should be a separate issue.The simplest way to ensure that this push can't cause a realloc is to push _before_ calling
constructor_simplify
, which does not require that the provided vector is empty on entry. I can do that if you prefer.
jameysharp updated PR #8006.
cfallin submitted PR review:
LGTM wrt my comments at least; thanks very much for working on this!
jameysharp submitted PR review:
I forgot about Chris' request for some new test cases, so I guess I haven't actually addressed all review comments yet.
And I think I don't want to address Nick's request, to avoid a possible realloc, in this PR. It's an easy change but it changes a bunch of filetests unnecessarily because it means that the optimized_values and orig_value are visited in a different order.
Also, and more importantly, there's something wrong with GVN.
jameysharp submitted PR review:
I forgot about Chris' request for some new test cases, so I guess I haven't actually addressed all review comments yet.
And I think I don't want to address Nick's request, to avoid a possible realloc, in this PR. It's an easy change but it changes a bunch of filetests unnecessarily because it means that the optimized_values and orig_value are visited in a different order.
Also, and more importantly, there's something wrong with GVN.
jameysharp created PR review comment:
This debug-assert is failing in a few of the wasm spec tests and I don't understand why yet.
In every case that's failed, the GVN map says the instruction we looked up is equivalent to an existing instruction, but the available-block for the existing instruction is the entry block, while the available-block for the new instruction is somewhere deeper in the dominator tree.
I believed that GVN would always yield the same available-block for pure instructions so this is concerning.
jameysharp submitted PR review.
jameysharp created PR review comment:
I'm still reasoning through this. For GVN to match, both instructions must be identical, except that each operand is only required to be in the same set in the union-find rather than actually equal. The assertion failing means that for some operand in the instruction, the two instructions consume values which are in the same union-find set but have different available-blocks.
I had been concerned about this specific situation but Chris and I reasoned that it was okay (https://github.com/bytecodealliance/wasmtime/pull/7891#issuecomment-1960531054) and now I'm confused about whether we were right and the assertion is overly strict. If so, what is the correct available-block to use here?
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
I suspect what is happening here is that the instruction gets rewritten the first time it is seen and the GVN-map entry points to the rewritten result (which could be e.g. an
iconst
available everywhere), even though the instruction per its arguments is only available in one block. So I think the assertion that we want is that the available block oforig_result
(which we're going to use as our value here) dominates the available block ofinst
, rather than is exactly the same. I think that resolves the issue?
jameysharp updated PR #8006.
jameysharp submitted PR review.
jameysharp created PR review comment:
That does make the assert pass in all the tests we have. I think you're correct but the reason why was obscured by the surrounding code.
First off, the GVN-hit case did not always return
orig_result
. If we were looking up aNewOrExistingInst::Existing
instruction,insert_pure_enode
actually returned the result of that instruction. But that case only happens when this function was called fromremove_pure_and_optimize
, which ignores the return value. The return value only matters formake_inst_ctor
, which always passesNewOrExistingInst::New
, in which case we were returningorig_result
.Therefore it's safe to change this function to always return
orig_result
, which I think is what we probably intended in the first place, and based on your comment is what you thought we were already doing. :grin:The next observation is that if the result of
inst
never appears in the values of thevalue_to_opt_value
map, then it will never get looked up inavailable_block
(or in the union-find, for that matter). So we don't actually need an answer to the question of what block this instruction should be available in. In fact it's better to leave it at the "reserved" default value so we'll get assertion failures if we accidentally do look it up.I've made these changes; could you review again when you have a chance? Hopefully CI will pass now too.
cfallin submitted PR review:
Changes LGTM. OK to merge with a test-case added that exercises the "auto-subsume" behavior (or if you've seen that it's already covered by one of the existing tests, note that here and that's fine too). Thanks!
cfallin commented on PR #8006:
(Or, I guess that could be slightly tricky with our current rules around subsume; OK to punt test to followup removal of the
subsume
s if you like.)
jameysharp updated PR #8006.
jameysharp has marked PR #8006 as ready for review.
jameysharp requested elliottt for a review on PR #8006.
jameysharp requested wasmtime-compiler-reviewers for a review on PR #8006.
jameysharp has enabled auto merge for PR #8006.
jameysharp merged PR #8006.
Last updated: Dec 23 2024 at 12:05 UTC