wasmtime / issue #8520 Cranelift: Add a vcode peephole pa... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #8520 Cranelift: Add a vcode peephole pa...

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:goal:optimize-speed label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:area:regalloc label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the cranelift:area:machinst label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen added the isle label to Issue #8520.

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

fitzgen opened issue #8520:

(Fleshing out and jotting down for posterity an idea I floated and we collectively refined at a recent Cranelift meeting.)

This peephole pass could run after regalloc, just before encoding instructions. It wouldn't need to be a whole new traversal over the IR, it could run as we walk the vcode for encoding, essentially implemented as an iterator combinator.

This peephole pass would not be DFG-based. It would be a classic sliding-window-of-instructions peephole pass. This both allows us to avoid constructing a more-complete DFG for vcode like what we have for clif, which could be expensive, and it allows us to benefit from opportunities that arise only because of instruction scheduling of two different sequences that are otherwise disconnected in the DFG.

Put another way: if you want DFG-y rewrites, use the mid-end for target-agnostic rules and the clif-to-vcode instruction selection lowering for target-specific rules. We have good solutions for that kind of thing. If however you want non-DFG-y rewrites, we currently have nothing and this issue is proposing we add something for the target-specific dimension, which is where I suspect most of the opportunity exists.

Implementation idea brain dump:

Treat finding matches as a a string-matching algorithm.

For simplicity, and to cut down on the size of tables, ignore operands and only consider opcodes.

So we treat the vcode as a string where the alphabet is its opcodes

Take all the left-hand sides of our rewrites and treat them as substrings in that same alphabet

Do an online version of aho-corasick or similar to find all substring matches in the vcode, without needing to repeat work for each index in the vcode or for each rewrite rule

On each substring match

run additional predicates to capture preconditions in the associated pattern(s)'s left-hand side that are not accounted for in the string match (i.e. that one of the operands is zero)

if those additional predicates are satisfied, replace the matched sequence with the rule's right-hand side

As far as ISLE integration goes:

I don't think we could do this with just custom external multi extractors, since we need to process all the rules to get the substring patterns to create the aho-corasick automaton

This means we would need to add a new kind of mode/backend to ISLE or something?

We could, I guess, just accept that we are doing redundant work at each index in the vcode "string" and not do the full aho-corasick thing.

We would still get the trie-y bits via islec

But we wouldn't be able to avoid repeatedly looking at the same instructions when we call into ISLE at vcode[i] that we've already looked at when we called into ISLE from vcode[i - 1].

pretty sure we could implement this less-ambitious version with ISLE today

but this is also way less cool

Between integrating with an existing traversal over the vcode, not constructing a DFG, and the string-matching sketch just above, this new peephole pass should impose extremely little overhead on compile times.

A few concrete example peephole optimizations we could do, I'm sure we can think of more and will do so if we put this infra in place:

turn adjacent pairs of loads/stores into ldp/stp on aarch64

similarly: turn a pair of adjacent u32 loads into an unaligned u64 load and split on targets where the cost of an unaligned u64 load is less than the cost of two u32 loads (I think x64, at least)

sink spills/reloads introduced by regalloc into the actual adds or whatever that produces/consumes the value on x64

on x64 (and others?) avoid unnecessary cmp and test instructions when we know that the last instruction in the stream happened to set identical flags (eg and rax, r10; test rax, r10 or sub rdi, rsi; cmp rdi, rsi)

Wasmtime GitHub notifications bot (May 01 2024 at 18:37):

github-actions[bot] commented on issue #8520:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle

fitzgen: isle

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

Wasmtime GitHub notifications bot (May 02 2024 at 21:24):

jameysharp commented on issue #8520:

Thanks for writing this up!

I'd add that some of these peephole opportunities are also available before register allocation, and might simplify the register allocation problem. (Bonus points if you can match patterns in reverse program order, so the pre-regalloc pass can be integrated with the backwards lowering pass.) But as you point out, regalloc can introduce more instances of these patterns so it's good to do it afterwards as well.

For example: When removing a test or cmp instruction that's redundant with an earlier instruction, there may be other instructions in between that don't set flags. In that case, the test/cmp unnecessarily extend the live ranges of their inputs, so deleting them decreases register pressure. Also, removing these instructions before regalloc should be sufficient; regalloc should never introduce them.

Your note about load-sinking is another example. It's useful to combine loads into memory operands of other instructions before register allocation because that reduces register pressure. Because this is useful we currently we do this optimization in ISLE, and by your guidelines it should still be in ISLE because liveness is a dataflow property. However we have repeatedly gotten this optimization wrong, because it's a property of the dataflow of the lowered instructions, not the dataflow of the CLIF.

So I actually think there are good reasons to do this kind of peephole optimization even when we need a dataflow analysis to justify it. In particular, an analysis that "this use is the last use of this register" is necessary for load-sinking. Before regalloc we can rely on vcode being in SSA form, which means we can maintain the set of live virtual registers incrementally during the backward lowering pass. Unfortunately I think this analysis requires a full backward dataflow fix-point if we want to do it reliably after register allocation, but your point that it would be nice to sink reloads generated by RA2 is a good one. We can approximate it by checking if there's another definition of the same physical register in the same basic block following the use we want to sink into.

That same analysis would be useful for another pre-regalloc optimization I want. When we lower to (x64_add x y), we tell RA2 that it needs to put the result in the same physical register that it assigns to virtual register x. If x has more uses later, then RA2 has to copy x to another register first. In that case, if y doesn't have any more uses, we can swap the operands (since "add" is commutative), avoiding a mov instruction and reducing register pressure.

One extension to an optimization you mentioned is when test x, x follows any instruction that sets flags according to its result, and its result is x. (Same goes for cmp x, 0 I think, but we carefully don't generate that.)

As far as implementation goes: I don't see that this has anything in common with ISLE. I'd lean toward a Thompson-style virtual machine simulating an NFA, because I think we can start by writing that in pure Rust without it being incomprehensible, and then decide later if it's worth building a new DSL for it. That has the same property as Aho-Corasick that it doesn't backtrack. If we transform to a DFA then when matching a set of fixed strings I believe the resulting automaton is identical to the one produced by Aho-Corasick, but determinization and minimization are complicated and I think we can build a useful first version without them. Further, I think we can generalize beyond finite automata in order to match patterns like "the same register I saw in an earlier instruction" (a limited form of back-references) while staying within the Thompson VM model; that means we don't have a constant bound on the number of live states, but everything else works.

Wasmtime GitHub notifications bot (May 03 2024 at 19:36):

fitzgen commented on issue #8520:

Agreed that this could be useful both pre- and post-regalloc.

It definitely seems doable to be able to run the pass both forwards and backwards, to integrate with the instruction encoding traversal and lowering respectively.

So I actually think there are good reasons to do this kind of peephole optimization even when we need a dataflow analysis to justify it.

I agree that there are additional improvements we could do on vcode if we had a DFG.

But it isn't clear to me that constructing a full DFG for vcode sits in the same "sweet spot" we've been trying to aim for with Cranelift in general with regards to balancing comlexity, compile time, and code quality. Not saying it doesn't fit! I honestly don't know, and I think we'd have to investigate the costs and complexity some more to clarify that for me. (And I would just have to think about it some more.)

I do think that a sliding-window style peephole pass is obviously consistent with the existing system with its existing costs and tradeoffs, in my eyes, because of its relative simplicity and how it can be integrated with existing traversals and data structures (or lack thereof).

Are you imagining that we would have both DFG-based and sliding-window-based peephole passes? Or that the same pass could do either style of rewrite? Because (and I don't mean to imply that you said this) I don't think the DFG-based approach subsumes the sliding-window-based approach because of how the latter can leverage instruction scheduling.

As far as implementation goes: I don't see that this has anything in common with ISLE.

In general:

It would be nice to reuse some helper terms (e.g. mach inst constructors).

ISLE is a nice DSL for writing patterns and matching.

As soon as we have more than a handful of rewrites, we probably want a DSL. Would be nice not to have multiple DSLs.

But I agree that it is not off-the-shelf reusable as-is without a new/subset prelude and some other tweaks.

Wasmtime GitHub notifications bot (May 03 2024 at 21:06):

jameysharp commented on issue #8520:

I'm beginning to think we're talking about different things when we say "dataflow".

It sounds like you mean in the sense of our DataFlowGraph type, which supports querying where a value is defined, for example.

I agree that we don't want that here.

I was referring to "dataflow analysis" in the sense of the dragon book. We may need to compute some collection of facts about instructions in the program, and that computation may depend on control flow, but we don't need a graph representation of dataflow for it.

The analysis results allow us to write patterns that are only correct if their surrounding context is okay, without needing to examine the rest of the function in the pattern. This is necessary for load sinking, for example: the pattern to match is simple, but the transformation is only valid if the load is not used again afterward.

In some cases we can do the analysis incrementally while evaluating the patterns in a single pass. That works for pre-regalloc liveness analysis as long as we're doing a backward pass in post-order, because of SSA form. In other cases we may need a full fix-point analysis or a heuristic approximation, such as for post-regalloc liveness analysis.

Wasmtime GitHub notifications bot (May 04 2024 at 17:01):

fitzgen commented on issue #8520:

Ah yeah I must have been reading too fast because you clearly said "dataflow analysis", and I assumed you were talking about DFG construction because I had originally been discussing avoiding the cost of constructing a DFG.

Agreed that some dataflow analyses can be done incrementally in a single pass, and could integrate with a single traversal that is doing both the analysis and the peephole rewrites, and agreed that it would be cool to do.

(and, of course, if/when we get around to actually prioritizing this stuff, we can talk about incremental steps, whether it makes sense to start with support for running these kinds of dataflow analyses right off the bat or not, when to implement what, and how to break all this apart into milestones at that time)

Wasmtime GitHub notifications bot (May 12 2024 at 00:50):

jameysharp commented on issue #8520:

I think this might subsume #4124.

Last updated: Apr 17 2025 at 20:03 UTC