Stream: git-wasmtime

Topic: wasmtime / issue #8520 Cranelift: Add a vcode peephole pa...


Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:goal:optimize-speed label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:area:regalloc label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:area:machinst label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the isle label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen opened issue #8520:

(Fleshing out and jotting down for posterity an idea I floated and we collectively refined at a recent Cranelift meeting.)

This peephole pass could run after regalloc, just before encoding instructions. It wouldn't need to be a whole new traversal over the IR; it could run as we walk the vcode for encoding, essentially implemented as an iterator combinator.

This peephole pass would not be DFG-based. It would be a classic sliding-window-of-instructions peephole pass. This both avoids constructing a more-complete DFG for vcode (like the one we have for clif), which could be expensive, and lets us benefit from opportunities that arise only because instruction scheduling happens to place two sequences next to each other that are otherwise disconnected in the DFG.

Put another way: if you want DFG-y rewrites, use the mid-end for target-agnostic rules and the clif-to-vcode instruction selection lowering for target-specific rules; we have good solutions for that kind of thing. If, however, you want non-DFG-y rewrites, we currently have nothing, and this issue proposes adding something along the target-specific dimension, which is where I suspect most of the opportunity lies.

Implementation idea brain dump:

Between integrating with an existing traversal over the vcode, not constructing a DFG, and the string-matching sketch just above, this new peephole pass should add very little compile-time overhead.
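A minimal sketch of what such an iterator-combinator pass could look like, with a toy instruction type and one hard-coded rewrite rule standing in for real, data-driven rules:

```rust
use std::collections::VecDeque;

/// Toy stand-in for a lowered machine instruction (an assumption for
/// this sketch, not Cranelift's real `MInst`).
#[derive(Clone, Debug, PartialEq)]
enum Inst {
    Sub { dst: u8, src: u8 }, // sets flags according to `dst`
    Test { a: u8, b: u8 },    // sets flags according to `a & b`
    Other(u32),
}

/// Slides a two-instruction window over the vcode walk, rewriting as
/// it goes and yielding the surviving instructions for encoding.
struct Peephole<I: Iterator<Item = Inst>> {
    inner: I,
    window: VecDeque<Inst>,
}

impl<I: Iterator<Item = Inst>> Iterator for Peephole<I> {
    type Item = Inst;

    fn next(&mut self) -> Option<Inst> {
        // Keep the window topped up.
        while self.window.len() < 2 {
            match self.inner.next() {
                Some(inst) => self.window.push_back(inst),
                None => break,
            }
        }
        // Example rule: `sub dst, _` immediately followed by
        // `test dst, dst` makes the `test` redundant, because the
        // `sub` already set the flags according to `dst`.
        if let (Some(Inst::Sub { dst, .. }), Some(Inst::Test { a, b })) =
            (self.window.get(0), self.window.get(1))
        {
            if a == b && a == dst {
                let _ = self.window.remove(1);
            }
        }
        // Emit the front of the window once no rule applies to it.
        self.window.pop_front()
    }
}

fn main() {
    let vcode = vec![
        Inst::Sub { dst: 0, src: 1 },
        Inst::Test { a: 0, b: 0 }, // removed by the pass
        Inst::Other(42),
    ];
    let peephole = Peephole { inner: vcode.into_iter(), window: VecDeque::new() };
    let out: Vec<Inst> = peephole.collect();
    assert_eq!(out, vec![Inst::Sub { dst: 0, src: 1 }, Inst::Other(42)]);
}
```

Because the combinator wraps the iterator we already use for encoding, no extra traversal over the vcode is needed.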

A few concrete example peephole optimizations we could do (and I'm sure we will think of more once this infra is in place):

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

github-actions[bot] commented on issue #8520:

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "isle"

Thus the following users have been cc'd because of that label: @cfallin, @fitzgen

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Wasmtime GitHub notifications bot (May 02 2024 at 21:24):

jameysharp commented on issue #8520:

Thanks for writing this up!

I'd add that some of these peephole opportunities are also available before register allocation, and might simplify the register allocation problem. (Bonus points if you can match patterns in reverse program order, so the pre-regalloc pass can be integrated with the backwards lowering pass.) But as you point out, regalloc can introduce more instances of these patterns, so it's good to do it afterwards as well.

For example: When removing a test or cmp instruction that's redundant with an earlier instruction, there may be other instructions in between that don't set flags. In that case, the test/cmp unnecessarily extend the live ranges of their inputs, so deleting them decreases register pressure. Also, removing these instructions before regalloc should be sufficient; regalloc should never introduce them.
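Concretely, the check could look something like this sketch (toy instruction type; a real pass would consult the backend's metadata about which instructions read and write flags):

```rust
#[derive(Clone, Copy, PartialEq)]
struct Reg(u8);

/// Toy instructions for this sketch, not Cranelift's real ones.
#[derive(Clone, Copy)]
enum Inst {
    Sub { dst: Reg, src: Reg }, // sets flags according to `dst`
    Lea { dst: Reg },           // writes `dst`, leaves flags alone
    Test { a: Reg, b: Reg },    // sets flags according to `a & b`
}

/// Is the `test x, x` at `insts[idx]` redundant with an earlier
/// instruction that already set the flags according to `x`, with only
/// harmless instructions in between?
fn test_is_redundant(insts: &[Inst], idx: usize) -> bool {
    let x = match insts[idx] {
        Inst::Test { a, b } if a == b => a,
        _ => return false,
    };
    for prev in insts[..idx].iter().rev() {
        match *prev {
            // Found `sub x, _`: the flags already reflect `x`.
            Inst::Sub { dst, .. } if dst == x => return true,
            // Flag-neutral and doesn't redefine `x`: keep scanning.
            Inst::Lea { dst } if dst != x => continue,
            // Anything else may clobber the flags or `x`: give up.
            _ => return false,
        }
    }
    false
}
```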

Your note about load-sinking is another example. It's useful to combine loads into memory operands of other instructions before register allocation because that reduces register pressure. Because this is useful, we currently do this optimization in ISLE, and by your guidelines it should still be in ISLE because liveness is a dataflow property. However, we have repeatedly gotten this optimization wrong, because it's a property of the dataflow of the lowered instructions, not the dataflow of the CLIF.

So I actually think there are good reasons to do this kind of peephole optimization even when we need a dataflow analysis to justify it. In particular, an analysis that "this use is the last use of this register" is necessary for load-sinking. Before regalloc we can rely on vcode being in SSA form, which means we can maintain the set of live virtual registers incrementally during the backward lowering pass. Unfortunately I think this analysis requires a full backward dataflow fix-point if we want to do it reliably after register allocation, but your point that it would be nice to sink reloads generated by RA2 is a good one. We can approximate it by checking if there's another definition of the same physical register in the same basic block following the use we want to sink into.
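A sketch of that incremental last-use analysis over a single block, with instructions reduced to toy (defs, uses) sets of virtual register numbers (an assumed representation, not vcode's real one):

```rust
use std::collections::HashSet;

/// For each instruction, which of its uses are last uses? Walking
/// backwards, a use is last exactly when its register is not yet in
/// the live set; defs kill liveness. SSA form means no fix-point is
/// needed within a block.
fn last_uses(insts: &[(Vec<u32>, Vec<u32>)]) -> Vec<Vec<u32>> {
    let mut live: HashSet<u32> = HashSet::new();
    let mut last = vec![Vec::new(); insts.len()];
    for (i, (defs, uses)) in insts.iter().enumerate().rev() {
        for d in defs {
            live.remove(d); // nothing above this def sees this value
        }
        for u in uses {
            if live.insert(*u) {
                last[i].push(*u); // first sighting going backwards
            }
        }
    }
    last
}

fn main() {
    // v0 = ...; v1 = use(v0); use(v0, v1)
    let insts = vec![
        (vec![0], vec![]),
        (vec![1], vec![0]),
        (vec![], vec![0, 1]),
    ];
    // Both v0's and v1's last uses are in the final instruction.
    assert_eq!(last_uses(&insts)[2], vec![0, 1]);
}
```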

That same analysis would be useful for another pre-regalloc optimization I want. When we lower to (x64_add x y), we tell RA2 that it needs to put the result in the same physical register that it assigns to virtual register x. If x has more uses later, then RA2 has to copy x to another register first. In that case, if y doesn't have any more uses, we can swap the operands (since "add" is commutative), avoiding a mov instruction and reducing register pressure.
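Sketched with toy vreg numbers and a hypothetical is_dead query (in reality this would come from the same last-use analysis as above):

```rust
type VReg = u32;

/// Choose the operand order for a commutative two-operand `add` whose
/// destination is constrained to reuse the first source's register.
/// If `x` lives on but `y` dies here, tie the result to `y` instead,
/// so regalloc needn't insert a `mov` to preserve `x`.
fn order_add_operands(x: VReg, y: VReg, is_dead: impl Fn(VReg) -> bool) -> (VReg, VReg) {
    if !is_dead(x) && is_dead(y) {
        (y, x)
    } else {
        (x, y)
    }
}

fn main() {
    // v0 is used again later; v1 is not: swap so the dying v1 is tied.
    assert_eq!(order_add_operands(0, 1, |v| v == 1), (1, 0));
}
```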

One extension to an optimization you mentioned: test x, x is redundant when it follows any instruction that sets flags according to its result, if that result is x. (Same goes for cmp x, 0 I think, but we carefully don't generate that.)

As far as implementation goes: I don't see that this has anything in common with ISLE. I'd lean toward a Thompson-style virtual machine simulating an NFA, because I think we can start by writing that in pure Rust without it being incomprehensible, and then decide later if it's worth building a new DSL for it. That has the same property as Aho-Corasick that it doesn't backtrack. If we transform to a DFA then when matching a set of fixed strings I believe the resulting automaton is identical to the one produced by Aho-Corasick, but determinization and minimization are complicated and I think we can build a useful first version without them. Further, I think we can generalize beyond finite automata in order to match patterns like "the same register I saw in an earlier instruction" (a limited form of back-references) while staying within the Thompson VM model; that means we don't have a constant bound on the number of live states, but everything else works.
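A minimal sketch of that Thompson-style simulation over instruction opcodes (the Op program encoding is made up for illustration, and back-references are left out): a set of live NFA states advances in lockstep over the input, so nothing ever backtracks.

```rust
/// NFA program ops over abstract instruction opcodes (toy encoding).
#[derive(Clone, Copy)]
enum Op {
    Opcode(u8),          // match this opcode, advance to next state
    Split(usize, usize), // fork into two states (epsilon transition)
    Accept,              // the pattern matched
}

/// Add `s` and everything reachable from it via `Split`.
fn add_state(prog: &[Op], s: usize, set: &mut Vec<usize>) {
    if set.contains(&s) {
        return;
    }
    set.push(s);
    if let Op::Split(a, b) = prog[s] {
        add_state(prog, a, set);
        add_state(prog, b, set);
    }
}

/// Simulate the NFA: at most `prog.len()` live states, one pass over
/// the input, no backtracking.
fn nfa_matches(prog: &[Op], input: &[u8]) -> bool {
    let mut states = Vec::new();
    add_state(prog, 0, &mut states);
    for &tok in input {
        if states.iter().any(|&s| matches!(prog[s], Op::Accept)) {
            return true;
        }
        let mut next = Vec::new();
        for &s in &states {
            if let Op::Opcode(c) = prog[s] {
                if c == tok {
                    add_state(prog, s + 1, &mut next);
                }
            }
        }
        states = next;
    }
    states.iter().any(|&s| matches!(prog[s], Op::Accept))
}

fn main() {
    // Pattern: `sub` (opcode 1), then at most one flag-neutral filler
    // (opcode 9), then `test` (opcode 2).
    let prog = [
        Op::Opcode(1),
        Op::Split(2, 3),
        Op::Opcode(9),
        Op::Opcode(2),
        Op::Accept,
    ];
    assert!(nfa_matches(&prog, &[1, 2]));
    assert!(nfa_matches(&prog, &[1, 9, 2]));
    assert!(!nfa_matches(&prog, &[1, 9, 9, 2]));
}
```

Generalizing Opcode to predicates over whole instructions (including "same register as one captured earlier") keeps this VM shape while dropping only the constant bound on live states.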

Wasmtime GitHub notifications bot (May 03 2024 at 19:36):

fitzgen commented on issue #8520:

Agreed that this could be useful both pre- and post-regalloc.

It definitely seems doable to run the pass both forwards and backwards, integrating with the instruction-encoding traversal and with lowering, respectively.

So I actually think there are good reasons to do this kind of peephole optimization even when we need a dataflow analysis to justify it.

I agree that there are additional improvements we could do on vcode if we had a DFG.

But it isn't clear to me that constructing a full DFG for vcode sits in the same "sweet spot" we've been trying to aim for with Cranelift in general with regard to balancing complexity, compile time, and code quality. Not saying it doesn't fit! I honestly don't know, and I think we'd have to investigate the costs and complexity some more to clarify that for me. (And I would just have to think about it some more.)

I do think that a sliding-window-style peephole pass is, in my eyes, clearly consistent with the existing system and its costs and tradeoffs, because of its relative simplicity and how it can be integrated with existing traversals and data structures (or lack thereof).

Are you imagining that we would have both DFG-based and sliding-window-based peephole passes? Or that the same pass could do either style of rewrite? Because (and I don't mean to imply that you said this) I don't think the DFG-based approach subsumes the sliding-window-based approach because of how the latter can leverage instruction scheduling.

As far as implementation goes: I don't see that this has anything in common with ISLE.

In general:

But I agree that it is not reusable off-the-shelf without a new/subset prelude and some other tweaks.

Wasmtime GitHub notifications bot (May 03 2024 at 21:06):

jameysharp commented on issue #8520:

I'm beginning to think we're talking about different things when we say "dataflow".

It sounds like you mean in the sense of our DataFlowGraph type, which supports querying where a value is defined, for example.

I agree that we don't want that here.

I was referring to "dataflow analysis" in the sense of the dragon book. We may need to compute some collection of facts about instructions in the program, and that computation may depend on control flow, but we don't need a graph representation of dataflow for it.

The analysis results allow us to write patterns that are only correct if their surrounding context is okay, without needing to examine the rest of the function in the pattern. This is necessary for load sinking, for example: the pattern to match is simple, but the transformation is only valid if the load is not used again afterward.

In some cases we can do the analysis incrementally while evaluating the patterns in a single pass. That works for pre-regalloc liveness analysis as long as we're doing a backward pass in post-order, because of SSA form. In other cases we may need a full fix-point analysis or a heuristic approximation, such as for post-regalloc liveness analysis.

Wasmtime GitHub notifications bot (May 04 2024 at 17:01):

fitzgen commented on issue #8520:

Ah yeah, I must have been reading too fast: you clearly said "dataflow analysis", but I assumed you were talking about DFG construction because I had originally been discussing avoiding the cost of constructing a DFG.

Agreed that some dataflow analyses can be done incrementally in a single pass, integrated with a traversal that performs both the analysis and the peephole rewrites, and agreed that it would be cool to do.

(And, of course, if/when we get around to actually prioritizing this stuff, we can talk about incremental steps: whether it makes sense to support running these kinds of dataflow analyses right off the bat, when to implement what, and how to break all this apart into milestones.)

Wasmtime GitHub notifications bot (May 12 2024 at 00:50):

jameysharp commented on issue #8520:

I think this might subsume #4124.

