jameysharp opened issue #6260:
Feature
When any optimization level is enabled, Cranelift runs e-graph equality saturation, which removes all pure instructions from the function layout and leaves only a data-flow graph representation. We then have to put those instructions back into a linear order before we can emit machine code.
Currently, we pick an arbitrary topo-sort of the instructions. Instead, I propose implementing list scheduling and, when there are multiple instructions whose operands are all ready, using a heuristic to decide between them.
Benefit
See #6159 for one in-depth analysis of a workload where our instruction scheduling has a measurable impact on performance.
Implementation
I suggest a combination of two heuristics, based on the results in this paper:
Shobaki, G., Sakka, L., Abu Rmaileh, N. E., and Al-Hamash, H. (2015) Experimental evaluation of various register-pressure-reduction heuristics. Softw. Pract. Exper., 45: 1497– 1517. doi: 10.1002/spe.2297
The first heuristic is called "Last Use Count" (LUC). It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. If there aren't any, then this instruction is the last user of that value. Count how many operands that applies to. We want to choose an instruction with maximum LUC at each point, because that tends to reduce register pressure.
LUC should have frequent ties. Since most instructions have at most two operands, they can end zero, one, or two live ranges. So we need a tie-breaker.
The tie-breaker heuristic is "Critical Path" (CP) length. It counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block. By itself, CP optimizes for instruction-level parallelism (ILP) by starting long chains of operations as soon as possible, but it tends to increase register pressure. However, when used as a tie-breaker for LUC, it can reduce register pressure more than LUC alone. We want to choose an instruction with maximum CP.
If both LUC and CP are tied, we need a last-resort tie-breaker. Prefer the instruction with the smallest instruction number. That will tend to keep instructions in the same order that they appeared in the input.
Computing CP for each instruction can be done in linear time with a breadth-first traversal starting from the block terminator. If there's an edge from instruction A to instruction B, then the CP length for B is whichever is higher of: 1) A's CP length plus 1, or 2) any CP length from any other path. Note that each instruction in the side-effecting skeleton should be treated as if it has an edge to the side-effecting instruction immediately before it.
While visiting all the instructions in the block we should also record the reverse edges: the set of instructions which use each value. Also initialize a side table recording, for each instruction, the number of values it uses which are defined by instructions inside the current block.
The main scheduling loop should proceed like this:
- Pick the highest-priority instruction that's currently ready and append it to the current block's layout.
- For each operand of this instruction, remove this instruction from the set of the value's users. If the value now has exactly one user left, find that user and increment its LUC.
- For each result defined by this instruction, find all instructions which use the value and decrement their count of unready operands. If any now has zero unready operands, add it to the priority queue of ready instructions.
- If this instruction was in the side-effecting skeleton, check the next side-effecting instruction to see if it's ready. Add it to the queue if so.
Because the LUC can increase while an instruction is waiting in the ready queue, we need a priority queue which supports an efficient "decrease-key" operation. (Textbooks and academic papers generally assume you want to find/extract the minimum element, so we need to reverse the order.) "Rank-pairing heaps" (by Haeupler, Sen, and Tarjan) are the newest option I found in a literature review, and they cite all the earlier papers that I found, so that paper is a good starting point for picking a suitable data structure. It should be possible to get O(1) time for all the operations we need except for extract-min, which should be O(log n); but depending on the data structure chosen these bounds will either be worst-case or amortized time.
I _think_ the overall running time of this algorithm is then O(n log n) in the number of instructions in the block. (There should maybe be a factor for the number of edges as well, but we can treat that as bounded by a constant factor of the number of instructions. Again, for most instructions that bound is 2.)
Alternatives
@cfallin suggested a variation on this idea, where we use some other heuristic to decide whether it's worth spending extra CPU time on heuristic-guided list scheduling. We can always fall back to the current topo-sort. One heuristic for this could be some function of the critical path length of the block as a whole (that is, the maximum CP for any instruction) versus the number of instructions in the block, which is one way of estimating the maximum amount of parallelism available in the block. If there's very little parallelism then there aren't many choices of topo-sort and they're probably all pretty close to equivalent.
Critical path length could be defined in terms of an estimate of the number of CPU cycles needed to evaluate the instruction, rather than assuming that all CLIF instructions take the same amount of time. This is probably not worth bothering with but, given a suitable cost model, it's easy to integrate by using a priority queue instead of a FIFO queue when computing CP length.
There are no doubt other ordering heuristics we could try.
jameysharp labeled issue #6260:
Feature
When any optimization level is enabled, Cranelift runs e-graph equality saturation, which removes all pure instructions from the function layout and leaves only a data-flow graph representation. We then have to put those instructions back into a linear order before we can emit machine code.
Currently, we pick an arbitrary topo-sort of the instructions. Instead, I propose implementing list scheduling and, when there are multiple instructions whose operands are all ready, using a heuristic to decide between them.
Benefit
See #6159 for one in-depth analysis of a workload where our instruction scheduling has a measurable impact on performance.
Implementation
I suggest a combination of two heuristics, based on the results in this paper:
Shobaki, G., Sakka, L., Abu Rmaileh, N. E., and Al-Hamash, H. (2015) Experimental evaluation of various register-pressure-reduction heuristics. Softw. Pract. Exper., 45: 1497– 1517. doi: 10.1002/spe.2297
The first heuristic is called "Last Use Count" (LUC). It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. If there aren't any, then this instruction is the last user of that value. Count how many operands that applies to. We want to choose an instruction with maximum LUC at each point, because that tends to reduce register pressure.
LUC should have frequent ties. Since most instructions have at most two operands, they can end zero, one, or two live ranges. So we need a tie-breaker.
The tie-breaker heuristic is "Critical Path" (CP) length. It counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block. By itself, CP optimizes for instruction-level parallelism (ILP) by starting long chains of operations as soon as possible, but it tends to increase register pressure. However, when used as a tie-breaker for LUC, it can reduce register pressure more than LUC alone. We want to choose an instruction with maximum CP.
If both LUC and CP are tied, we need a last-resort tie-breaker. Prefer the instruction with the smallest instruction number. That will tend to keep instructions in the same order that they appeared in the input.
Computing CP for each instruction can be done in linear time with a breadth-first traversal starting from the block terminator. If there's an edge from instruction A to instruction B, then the CP length for B is whichever is higher of: 1) A's CP length plus 1, or 2) any CP length from any other path. Note that each instruction in the side-effecting skeleton should be treated as if it has an edge to the side-effecting instruction immediately before it.
While visiting all the instructions in the block we should also record the reverse edges: the set of instructions which use each value. Also initialize a side table recording, for each instruction, the number of values it uses which are defined by instructions inside the current block.
The main scheduling loop should proceed like this:
- Pick the highest-priority instruction that's currently ready and append it to the current block's layout.
- For each operand of this instruction, remove this instruction from the set of the value's users. If the value now has exactly one user left, find that user and increment its LUC.
- For each result defined by this instruction, find all instructions which use the value and decrement their count of unready operands. If any now has zero unready operands, add it to the priority queue of ready instructions.
- If this instruction was in the side-effecting skeleton, check the next side-effecting instruction to see if it's ready. Add it to the queue if so.
Because the LUC can increase while an instruction is waiting in the ready queue, we need a priority queue which supports an efficient "decrease-key" operation. (Textbooks and academic papers generally assume you want to find/extract the minimum element, so we need to reverse the order.) "Rank-pairing heaps" (by Haeupler, Sen, and Tarjan) are the newest option I found in a literature review, and they cite all the earlier papers that I found, so that paper is a good starting point for picking a suitable data structure. It should be possible to get O(1) time for all the operations we need except for extract-min, which should be O(log n); but depending on the data structure chosen these bounds will either be worst-case or amortized time.
I _think_ the overall running time of this algorithm is then O(n log n) in the number of instructions in the block. (There should maybe be a factor for the number of edges as well, but we can treat that as bounded by a constant factor of the number of instructions. Again, for most instructions that bound is 2.)
Alternatives
@cfallin suggested a variation on this idea, where we use some other heuristic to decide whether it's worth spending extra CPU time on heuristic-guided list scheduling. We can always fall back to the current topo-sort. One heuristic for this could be some function of the critical path length of the block as a whole (that is, the maximum CP for any instruction) versus the number of instructions in the block, which is one way of estimating the maximum amount of parallelism available in the block. If there's very little parallelism then there aren't many choices of topo-sort and they're probably all pretty close to equivalent.
Critical path length could be defined in terms of an estimate of the number of CPU cycles needed to evaluate the instruction, rather than assuming that all CLIF instructions take the same amount of time. This is probably not worth bothering with but, given a suitable cost model, it's easy to integrate by using a priority queue instead of a FIFO queue when computing CP length.
There are no doubt other ordering heuristics we could try.
jameysharp labeled issue #6260:
Feature
When any optimization level is enabled, Cranelift runs e-graph equality saturation, which removes all pure instructions from the function layout and leaves only a data-flow graph representation. We then have to put those instructions back into a linear order before we can emit machine code.
Currently, we pick an arbitrary topo-sort of the instructions. Instead, I propose implementing list scheduling and, when there are multiple instructions whose operands are all ready, using a heuristic to decide between them.
Benefit
See #6159 for one in-depth analysis of a workload where our instruction scheduling has a measurable impact on performance.
Implementation
I suggest a combination of two heuristics, based on the results in this paper:
Shobaki, G., Sakka, L., Abu Rmaileh, N. E., and Al-Hamash, H. (2015) Experimental evaluation of various register-pressure-reduction heuristics. Softw. Pract. Exper., 45: 1497– 1517. doi: 10.1002/spe.2297
The first heuristic is called "Last Use Count" (LUC). It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. If there aren't any, then this instruction is the last user of that value. Count how many operands that applies to. We want to choose an instruction with maximum LUC at each point, because that tends to reduce register pressure.
LUC should have frequent ties. Since most instructions have at most two operands, they can end zero, one, or two live ranges. So we need a tie-breaker.
The tie-breaker heuristic is "Critical Path" (CP) length. It counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block. By itself, CP optimizes for instruction-level parallelism (ILP) by starting long chains of operations as soon as possible, but it tends to increase register pressure. However, when used as a tie-breaker for LUC, it can reduce register pressure more than LUC alone. We want to choose an instruction with maximum CP.
If both LUC and CP are tied, we need a last-resort tie-breaker. Prefer the instruction with the smallest instruction number. That will tend to keep instructions in the same order that they appeared in the input.
Computing CP for each instruction can be done in linear time with a breadth-first traversal starting from the block terminator. If there's an edge from instruction A to instruction B, then the CP length for B is whichever is higher of: 1) A's CP length plus 1, or 2) any CP length from any other path. Note that each instruction in the side-effecting skeleton should be treated as if it has an edge to the side-effecting instruction immediately before it.
While visiting all the instructions in the block we should also record the reverse edges: the set of instructions which use each value. Also initialize a side table recording, for each instruction, the number of values it uses which are defined by instructions inside the current block.
The main scheduling loop should proceed like this:
- Pick the highest-priority instruction that's currently ready and append it to the current block's layout.
- For each operand of this instruction, remove this instruction from the set of the value's users. If the value now has exactly one user left, find that user and increment its LUC.
- For each result defined by this instruction, find all instructions which use the value and decrement their count of unready operands. If any now has zero unready operands, add it to the priority queue of ready instructions.
- If this instruction was in the side-effecting skeleton, check the next side-effecting instruction to see if it's ready. Add it to the queue if so.
Because the LUC can increase while an instruction is waiting in the ready queue, we need a priority queue which supports an efficient "decrease-key" operation. (Textbooks and academic papers generally assume you want to find/extract the minimum element, so we need to reverse the order.) "Rank-pairing heaps" (by Haeupler, Sen, and Tarjan) are the newest option I found in a literature review, and they cite all the earlier papers that I found, so that paper is a good starting point for picking a suitable data structure. It should be possible to get O(1) time for all the operations we need except for extract-min, which should be O(log n); but depending on the data structure chosen these bounds will either be worst-case or amortized time.
I _think_ the overall running time of this algorithm is then O(n log n) in the number of instructions in the block. (There should maybe be a factor for the number of edges as well, but we can treat that as bounded by a constant factor of the number of instructions. Again, for most instructions that bound is 2.)
Alternatives
@cfallin suggested a variation on this idea, where we use some other heuristic to decide whether it's worth spending extra CPU time on heuristic-guided list scheduling. We can always fall back to the current topo-sort. One heuristic for this could be some function of the critical path length of the block as a whole (that is, the maximum CP for any instruction) versus the number of instructions in the block, which is one way of estimating the maximum amount of parallelism available in the block. If there's very little parallelism then there aren't many choices of topo-sort and they're probably all pretty close to equivalent.
Critical path length could be defined in terms of an estimate of the number of CPU cycles needed to evaluate the instruction, rather than assuming that all CLIF instructions take the same amount of time. This is probably not worth bothering with but, given a suitable cost model, it's easy to integrate by using a priority queue instead of a FIFO queue when computing CP length.
There are no doubt other ordering heuristics we could try.
jameysharp labeled issue #6260:
Feature
When any optimization level is enabled, Cranelift runs e-graph equality saturation, which removes all pure instructions from the function layout and leaves only a data-flow graph representation. We then have to put those instructions back into a linear order before we can emit machine code.
Currently, we pick an arbitrary topo-sort of the instructions. Instead, I propose implementing list scheduling and, when there are multiple instructions whose operands are all ready, using a heuristic to decide between them.
Benefit
See #6159 for one in-depth analysis of a workload where our instruction scheduling has a measurable impact on performance.
Implementation
I suggest a combination of two heuristics, based on the results in this paper:
Shobaki, G., Sakka, L., Abu Rmaileh, N. E., and Al-Hamash, H. (2015) Experimental evaluation of various register-pressure-reduction heuristics. Softw. Pract. Exper., 45: 1497– 1517. doi: 10.1002/spe.2297
The first heuristic is called "Last Use Count" (LUC). It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. If there aren't any, then this instruction is the last user of that value. Count how many operands that applies to. We want to choose an instruction with maximum LUC at each point, because that tends to reduce register pressure.
LUC should have frequent ties. Since most instructions have at most two operands, they can end zero, one, or two live ranges. So we need a tie-breaker.
The tie-breaker heuristic is "Critical Path" (CP) length. It counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block. By itself, CP optimizes for instruction-level parallelism (ILP) by starting long chains of operations as soon as possible, but it tends to increase register pressure. However, when used as a tie-breaker for LUC, it can reduce register pressure more than LUC alone. We want to choose an instruction with maximum CP.
If both LUC and CP are tied, we need a last-resort tie-breaker. Prefer the instruction with the smallest instruction number. That will tend to keep instructions in the same order that they appeared in the input.
Computing CP for each instruction can be done in linear time with a breadth-first traversal starting from the block terminator. If there's an edge from instruction A to instruction B, then the CP length for B is whichever is higher of: 1) A's CP length plus 1, or 2) any CP length from any other path. Note that each instruction in the side-effecting skeleton should be treated as if it has an edge to the side-effecting instruction immediately before it.
While visiting all the instructions in the block we should also record the reverse edges: the set of instructions which use each value. Also initialize a side table recording, for each instruction, the number of values it uses which are defined by instructions inside the current block.
The main scheduling loop should proceed like this:
- Pick the highest-priority instruction that's currently ready and append it to the current block's layout.
- For each operand of this instruction, remove this instruction from the set of the value's users. If the value now has exactly one user left, find that user and increment its LUC.
- For each result defined by this instruction, find all instructions which use the value and decrement their count of unready operands. If any now has zero unready operands, add it to the priority queue of ready instructions.
- If this instruction was in the side-effecting skeleton, check the next side-effecting instruction to see if it's ready. Add it to the queue if so.
Because the LUC can increase while an instruction is waiting in the ready queue, we need a priority queue which supports an efficient "decrease-key" operation. (Textbooks and academic papers generally assume you want to find/extract the minimum element, so we need to reverse the order.) "Rank-pairing heaps" (by Haeupler, Sen, and Tarjan) are the newest option I found in a literature review, and they cite all the earlier papers that I found, so that paper is a good starting point for picking a suitable data structure. It should be possible to get O(1) time for all the operations we need except for extract-min, which should be O(log n); but depending on the data structure chosen these bounds will either be worst-case or amortized time.
I _think_ the overall running time of this algorithm is then O(n log n) in the number of instructions in the block. (There should maybe be a factor for the number of edges as well, but we can treat that as bounded by a constant factor of the number of instructions. Again, for most instructions that bound is 2.)
Alternatives
@cfallin suggested a variation on this idea, where we use some other heuristic to decide whether it's worth spending extra CPU time on heuristic-guided list scheduling. We can always fall back to the current topo-sort. One heuristic for this could be some function of the critical path length of the block as a whole (that is, the maximum CP for any instruction) versus the number of instructions in the block, which is one way of estimating the maximum amount of parallelism available in the block. If there's very little parallelism then there aren't many choices of topo-sort and they're probably all pretty close to equivalent.
Critical path length could be defined in terms of an estimate of the number of CPU cycles needed to evaluate the instruction, rather than assuming that all CLIF instructions take the same amount of time. This is probably not worth bothering with but, given a suitable cost model, it's easy to integrate by using a priority queue instead of a FIFO queue when computing CP length.
There are no doubt other ordering heuristics we could try.
dimitris-aspetakis commented on issue #6260:
Hello!
As part of a university project, me and a colleague of mine (Panagiotis Karouzakis) have started working on this proposal in this branch of our fork. We are of course hitting a few assertions in the code we shouldn't.
We believe we have implemented every part of the algorithm proposed, but we might have missed some important things, since (despite our efforts) our understanding of the elaboration stack machinery would probably be considered incomplete...
Current Approach
Currently, we have three passes over every block:
- The elaboration pass mostly as it was (
elaborate_block(...)
), only with the actual layout placement removed.- The breadth-first pass for the construction of critical paths, the
value_uses
map and the reverse edges usingdependencies_count
(the last one includes the manufactured dependencies among skeleton instructions too).- The ready-queue pass, inside which we place the instruction to the current place in the layout, or where the LICM optimization from the elaboration pass told us.
Our Open Questions
- We temporarily removed rematerialization to focus on other bugs. When we re-enable it, it should be placed right before instruction layout placement in our ready-queue pass, right?
- We are not certain if the work that elaboration pass does (namely
value_to_best_value
andvalue_to_elaborated_value
construction) is independent of the layout order of instructions. Should we leave it as a separate pass, or should we integrate it somehow into the ready-queue pass? Our uncertainty applies to LICM optimizations as well.In general, any comments on our approach would be much appreciated!
jameysharp commented on issue #6260:
This is so cool. I don't understand your implementation yet and can't give much feedback right now, but I'm very excited that you're working on this!
Regarding the construction of
value_to_best_value
, at least, I can tell you that I'm pretty surecompute_best_values
can be merged into the initialremove_pure_and_optimize
pass. It requires visiting definitions before uses, but the initial pass that's visiting all the original instructions and adding union nodes and alternative instructions also requires that order. Computing best values does not require knowing either the original layout or the final layout, though.For
value_to_elaborated_value
, that reflects which layout decisions have been made by elaboration in the current branch of the dominator tree, so I'd guess you need something like it, but maybe you have something else that's equivalent. Or maybe not having that is why you have comments about trying to place instructions which have already been placed elsewhere?I think it's an excellent idea to figure out how to make this work without rematerialization or LICM first, and then add those back later. I'd love to see this get to the point where we can look at what effect it has on
cargo test -p cranelift-tools --test filetests
(probably withCRANELIFT_TEST_BLESS=1
set so you can just look atgit diff
after it rewrites all the tests for you). Even better to see what effect this has on our benchmark suite, Sightglass, but don't worry about that yet.
jameysharp commented on issue #6260:
Just want to note again that I'm very excited you're working on this. I'm not sure what you need right now; let me know how I can help!
dimitris-aspetakis commented on issue #6260:
Thanks for the encouraging words!
We're trying to pass all tests from cranelift-tools right now, and (since today) we are down to seemingly 12 not passing (most of which probably come from 2–3 distinct problems). We'll probably give them a go for a few more days, and then reach for possible help, since the cost of explaining our implementation in detail is probably non-trivial :sweat_smile:
dimitris-aspetakis edited a comment on issue #6260:
Thanks for the encouraging words!
We're trying to pass all tests from
cranelift-tools
right now, and (since today) we are down to seemingly 12 not passing (most of which probably come from 2–3 distinct problems). We'll probably give them a go for a few more days, and then reach for possible help, since the cost of explaining our implementation in detail is probably non-trivial :sweat_smile:
dimitris-aspetakis commented on issue #6260:
Progress Update
We believe we have now basically fixed every bug we encountered, and are able to successfully execute all Sightglass benchmarks! There might be some
cranelift-tools
ordisas
tests we fail, but those are most probably from checks that implicitly check for scheduling according to the currently used dependency-driven approach (we have manually checked and changed some already incranelift-tools
).We removed entirely the elaboration pass mentioned in https://github.com/bytecodealliance/wasmtime/issues/6260#issuecomment-2123288457, merging its functionality to the other two passes. We also re-implemented rematerialization and LICM.
As for the correctness of our implementation, while we have checked a couple of small examples manually, we have not created any tests to reassure us that we conform to the expected scheduling of the LUC-CP-sequence heuristic. I'm thinking it might be useful to manually create some in
cranelift/filetests/filetests/egraph/
(or possibly in a new directory) whenever we find time.Compilation Performance
Iterating through flamegraph sessions, we have managed to reduce compilation penalties to ~15–30% in spidermonkey (on a Ryzen 5700G with an isolated core). There are some parts of our implementation (e.g. use of
FxHashSet
forvalue_users
sets) that we aren't sure how much optimization opportunities they might hold. As for the priority queue, samply recordings tell us its pushes and pops take ~13% of theelaborate_domtree()
function.For the priority queue, we initially used
heapz
, a Rank-Pairing heap implementation. Due to some bugs we encountered with its implementation though, we switched to a Hash Heap implementation. The performance profile seemed similar to that ofheapz
(specifically for the compilation metrics).I'm not certain about this, but I believe we might be able to create a simpler and (in practice) better structure for the priority queue, given that the majority of our LUC values are 0, 1 and 2. I was thinking something like three separate buckets for each LUC value, each using some structure with fast insertion, and a fourth bucket for the slow-path of LUC values higher than 2. Another idea was to do a merge of the three metrics (LUC, CP and sequences) into a single value for the structure's ordering fields, with the expectation of reducing comparisons — since their importance is linear and well-defined.
Execution Performance
Unfortunately, our changes do not seem to have the expected execution-time improvements, with some benchmarks giving us regressions too. The best results we got were from Sightglass's
regex
benchmark, and the sparse FP32 benchmarks from XNNPACK (as used in https://github.com/bytecodealliance/wasmtime/issues/6159). This text file includes results from a set of five implementations. The original, dependency-driven approach (dep-driven), our heuristics implementation using all three metrics (LUC, CP and sequence), an implementation without the LUC, one without the CP and one using just sequences (original program order sequence numbers). For the last three implementations we just changed the ordering implementation of the key structure for the priority queue — meaning that the compilation performance of those is not optimized.
fitzgen commented on issue #6260:
I'm not certain about this, but I believe we might be able to create a simpler and (in practice) better structure for the priority queue
Did you happen to try https://doc.rust-lang.org/nightly/std/collections/struct.BinaryHeap.html ? Not that I think it is probably the best option but if it is easy to swap in/out it is at least worth trying.
dimitris-aspetakis commented on issue #6260:
Did you happen to try https://doc.rust-lang.org/nightly/std/collections/struct.BinaryHeap.html ?
The problem with it is that (as far as I saw from its interface) it does not support an "update" or at least a "promote" operation on the keys (assuming the keys are used for sorting — I kinda got confused with all the implementations we tested). We need an efficient implementation of that because LUC get dynamically upgraded in nodes while they are in the priority queue. We could simply remove the node and re-insert it, resulting in a (theoretically) asymptotically worse implementation.
That's kinda what I was thinking of when suggesting this four-bucket structure — each bucket would have a simpler structure compared to Rank Pairing Heaps / Hash Heaps (e.g. Rust's
BinaryHeap
), with the expectation that while this would have higher theoretical time complexity, it would be faster in practice. I guess we could have oneBinaryHeap
instead of four, but splitting it in four would give us smaller sizes in each collection (without any regressions?) — resulting in faster removals & insertions.I have to add though, that I believe we should first focus on seeing what is wrong with execution performance in the benchmarks before trying to optimize compilation any further. Of course, if anyone wants to help with the modules' compilation optimizations, he should feel more than free to help :upside_down:. One thing we could probably do is start diffing machine code from V8 / node with our implementation just like in https://github.com/bytecodealliance/wasmtime/issues/6159 and work backwards from there to see what our scheduling is missing.
fitzgen commented on issue #6260:
That all makes perfect sense, thanks for explaining!
dimitris-aspetakis edited a comment on issue #6260:
Did you happen to try https://doc.rust-lang.org/nightly/std/collections/struct.BinaryHeap.html ?
The problem with it is that (as far as I saw from its interface) it does not support an "update" or at least a "promote" operation on the keys (assuming the keys are used for sorting — I kinda got confused with all the implementations we tested). We need an efficient implementation of that because LUC get dynamically upgraded in nodes while they are in the priority queue. We could simply remove the node and re-insert it, resulting in a (theoretically) asymptotically worse implementation.
That's kinda what I was thinking of when suggesting this four-bucket structure — each bucket would have a simpler structure compared to Rank Pairing Heaps / Hash Heaps (e.g. Rust's
BinaryHeap
), with the expectation that while this would have higher theoretical time complexity, it would be faster in practice. I guess we could have oneBinaryHeap
instead of four, but splitting it in four would give us smaller sizes in each collection (without any regressions?) — resulting in faster removals & insertions.I have to add though, that I believe we should first focus on seeing what is wrong with execution performance in the benchmarks before trying to optimize compilation any further. Of course, if anyone wants to help with the modules' compilation optimizations, they should feel more than free to help :upside_down:. One thing we could probably do is start diffing machine code from V8 / node with our implementation just like in https://github.com/bytecodealliance/wasmtime/issues/6159 and work backwards from there to see what our scheduling is missing.
dimitris-aspetakis edited a comment on issue #6260:
Did you happen to try https://doc.rust-lang.org/nightly/std/collections/struct.BinaryHeap.html ?
The problem with it is that (as far as I saw from its interface) it does not support an "update" or at least a "promote" operation on the keys (assuming the keys are used for sorting — I kinda got confused with all the implementations we tested). We need an efficient implementation of that because LUC gets dynamically upgraded in nodes while they are in the priority queue. We could simply remove the node and re-insert it, resulting in a (theoretically) asymptotically worse implementation.
That's kinda what I was thinking of when suggesting this four-bucket structure — each bucket would have a simpler structure compared to Rank Pairing Heaps / Hash Heaps (e.g. Rust's
BinaryHeap
), with the expectation that while this would have higher theoretical time complexity, it would be faster in practice. I guess we could have oneBinaryHeap
instead of four, but splitting it in four would give us smaller sizes in each collection (without any regressions?) — resulting in faster removals & insertions.I have to add though, that I believe we should first focus on seeing what is wrong with execution performance in the benchmarks before trying to optimize compilation any further. Of course, if anyone wants to help with the modules' compilation optimizations, they should feel more than free to help :upside_down:. One thing we could probably do is start diffing machine code from V8 / node with our implementation just like in https://github.com/bytecodealliance/wasmtime/issues/6159 and work backwards from there to see what our scheduling is missing.
dimitris-aspetakis edited a comment on issue #6260:
Progress Update
We believe we have now basically fixed every bug we encountered, and are able to successfully execute all Sightglass benchmarks! There might be some
cranelift-tools
ordisas
tests we fail, but those are most probably from checks that implicitly check for scheduling according to the currently used dependency-driven approach (we have manually checked and changed some already incranelift-tools
).We removed entirely the elaboration pass mentioned in https://github.com/bytecodealliance/wasmtime/issues/6260#issuecomment-2123288457, merging its functionality to the other two passes. We also re-implemented rematerialization and LICM.
As for the correctness of our implementation, while we have checked a couple of small examples manually, we have not created any tests to reassure ourselves that we conform to the expected scheduling of the LUC-CP-sequence heuristic. I'm thinking it might be useful to manually create some in
cranelift/filetests/filetests/egraph/
(or possibly in a new directory) whenever we find time.Compilation Performance
Iterating through flamegraph sessions, we have managed to reduce compilation penalties to ~15–30% in spidermonkey (on a Ryzen 5700G with an isolated core). There are some parts of our implementation (e.g. use of
FxHashSet
forvalue_users
sets) that we aren't sure how much optimization opportunities they might hold. As for the priority queue, samply recordings tell us its pushes and pops take ~13% of theelaborate_domtree()
function.For the priority queue, we initially used
heapz
, a Rank-Pairing heap implementation. Due to some bugs we encountered with its implementation though, we switched to a Hash Heap implementation. The performance profile seemed similar to that ofheapz
(specifically for the compilation metrics).I'm not certain about this, but I believe we might be able to create a simpler and (in practice) better structure for the priority queue, given that the majority of our LUC values are 0, 1 and 2. I was thinking something like three separate buckets for each LUC value, each using some structure with fast insertion, and a fourth bucket for the slow-path of LUC values higher than 2. Another idea was to do a merge of the three metrics (LUC, CP and sequences) into a single value for the structure's ordering fields, with the expectation of reducing comparisons — since their importance is linear and well-defined.
Execution Performance
Unfortunately, our changes do not seem to have the expected execution-time improvements, with some benchmarks giving us regressions too. The best results we got were from Sightglass's
regex
benchmark, and the sparse FP32 benchmarks from XNNPACK (as used in https://github.com/bytecodealliance/wasmtime/issues/6159). This text file includes results from a set of five implementations. The original, dependency-driven approach (dep-driven), our heuristics implementation using all three metrics (LUC, CP and sequence), an implementation without the LUC, one without the CP and one using just sequences (original program order sequence numbers). For the last three implementations we just changed the ordering implementation of the key structure for the priority queue — meaning that the compilation performance of those is not optimized.
dimitris-aspetakis edited a comment on issue #6260:
Hello!
As part of a university project, me and a colleague of mine (Panagiotis Karouzakis) have started working on this proposal in this branch of our fork. We are of course hitting a few assertions in the code we shouldn't.
We believe we have implemented every part of the algorithm proposed, but we might have missed some important things, since (despite our efforts) our understanding of the elaboration stack machinery would probably be considered incomplete...
Current Approach
Currently, we have three passes over every block:
- The elaboration pass mostly as it was (
elaborate_block(...)
), only with the actual layout placement removed.- The breadth-first pass for the construction of critical paths, the
value_uses
map and the reverse edges usingdependencies_count
(the last one includes the manufactured dependencies among skeleton instructions too).- The ready-queue pass, inside which we place the instruction to the current place in the layout, or where the LICM optimization from the elaboration pass told us.
Our Open Questions
- We temporarily removed rematerialization to focus on other bugs. When we re-enable it, it should be placed right before instruction layout placement in our ready-queue pass, right?
- We are not certain if the work that elaboration pass does (namely
value_to_best_value
andvalue_to_elaborated_value
construction) is independent of the layout order of instructions. Should we leave it as a separate pass, or should we integrate it somehow into the ready-queue pass? Our uncertainty applies to LICM optimizations as well.In general, any comments on our approach would be much appreciated!
dimitris-aspetakis commented on issue #6260:
After writing https://github.com/bytecodealliance/wasmtime/issues/6154#issuecomment-2334157945, I retroactively understood that I should probably write it here, referencing #6154.
The comment summarized:
I can't help but note here my feeling (...) that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically.
In that spirit, I tend to support the suggestion where elaborating instructions back to the layout is delayed.
As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.
Any feedback (how hard would this be, is it even a practical approach in terms of efficiency etc) would be much appreciated :)
Last updated: Dec 23 2024 at 12:05 UTC