ValueUseState::Multiple and multi-result instructions · cranelift

I'm curious to explore some 128-bit stuff a bit and right now I'm looking at input such as:

function u0:0(i64, i64) -> i64, i64 tail {
block0(v0: i64, v1: i64):
    v10 = load.i64 v0
    v18 = uextend.i128 v1
    v20 = uextend.i128 v10
    v14 = iadd v18, v20
    v15, v16 = isplit v14
    jump block1

block1:
    return v15, v16
}

   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   4c 8b 0f                movq    (%rdi), %r9
   7:   48 31 c9                xorq    %rcx, %rcx
   a:   48 89 f0                movq    %rsi, %rax
   d:   4c 01 c8                addq    %r9, %rax
  10:   48 83 d1 00             adcq    $0, %rcx
  14:   48 89 ec                movq    %rbp, %rsp
  17:   5d                      popq    %rbp
  18:   c3                      retq

notably what I'm trying to get happen is to sink the movq (%rdi), %r9 into addq %r9, %rax. Currently the sinking is being prevented by this line but I don't fully understand the comment there. It seems like if this heuristic is needed then it's probably a conservative one because load sinking should be ok in this specific scenario at least.

Which leads me to my question: could this force_multiple be explained a bit more? Is there an example of how this could go wrong if force_multiple were ignored? (if I set it to false unconditionally no filetests break so I wasn't able to learn more via that). Furthermore is there a way to perhaps make this check less conservative to allow load sinking in the above case?

Implement a few minor optimizations around 128-bit integers by alexcrichton · Pull Request #9136 · bytecodealliance/wasmtime

This commit implements a few minor changes for i128 in both the egraph optimizations and lowerings for x64. The optimization pass will now transform iconcat into a uextend or sextend where appropri...

wasmtime/cranelift/codegen/src/machinst/lower.rs at 3f5c21bff4babe54f51deceed1bb1d3321b6f2aa · bytecodealliance/wasmtime

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

fitzgen (he/him) (Aug 16 2024 at 18:03):

Cranelift: avoid load coalescing when an operand is reused · Issue #3953 · bytecodealliance/wasmtime

In examining #3951 and #3934, the main problem in both cases seems to be that the load coalescing logic cannot detect when a loaded operand may be reused. Because of this, we must manually ensure t...

fitzgen (he/him) (Aug 16 2024 at 18:03):

Chris Fallin (Aug 16 2024 at 18:04):

yep, that; the case you show in CLIF above is a special case where a single instruction is responsible for both of the uses, in theory we could detect that, we'd want to think it through very carefully though

Chris Fallin (Aug 16 2024 at 18:05):

(nit: not "heuristic" but "invariant" or "rule"; heuristics can be violated without correctness issues)

Alex Crichton (Aug 16 2024 at 18:08):

hm ok, I still don't really understand what this is doing but sounds like I should assume this can't easily change

Chris Fallin (Aug 16 2024 at 18:09):

so the basic idea of the analysis is to prove that an instruction is used only (or at most) once; that's the condition we need to sink it

Chris Fallin (Aug 16 2024 at 18:09):

(we could do better if we reasoned about location and sunk it to the first of several locations it's used if it dominates all the others, but that's way more complex)

Chris Fallin (Aug 16 2024 at 18:10):

when there is a single output, this is reasonably straightforward: we do a transitive fixpoint algorithm where we generate a "Multiple" at any op that's used multiple times, and propagate that backward

Chris Fallin (Aug 16 2024 at 18:10):

(because matching/sinking can go arbitrarily high up the tree, we need a transitive notion of "used more than once")

Chris Fallin (Aug 16 2024 at 18:10):

when there are multiple outputs, even if each output is used only once, the op itself is used more than once

Chris Fallin (Aug 16 2024 at 18:11):

we could refine the lattice more and encode logic for 0+0 -> 0, 0+1 / 1+0 -> 1, 1+1 -> many, 1+many / many+1 -> many

Chris Fallin (Aug 16 2024 at 18:11):

but that's... a lot of subtlety and I don't have confidence enough in it to do it right now

Chris Fallin (Aug 16 2024 at 18:11):

Alex Crichton (Aug 16 2024 at 18:12):

One part I don't understand though is why multiple-results matter here. ISLE can't match instructions with multiple results in patterns afaik so we can't possibly go up the tree and somehow duplicate the operands of a multi-result instruction, right?

Chris Fallin (Aug 16 2024 at 18:13):

well, when building the analysis we try not to couple current implementation limits such as "ISLE can't match instructions with multiple results" with the invariants elsewhere; that could easily lead to very subtle bugs later when we expand its capability, and in any case leads to very hard to reason about invariants

Alex Crichton (Aug 16 2024 at 18:13):

the specific case of isplit returning two things and then ret consuming both is a bit of a red herring as it was the first test case I came up with, the actual benchmark I think stores the two results of isplit into memory or just one or something like that

Chris Fallin (Aug 16 2024 at 18:13):

I think the right fix here, if we want to make the analysis more powerful, is to do as noted above -- recognize the special case

Chris Fallin (Aug 16 2024 at 18:13):

or alternately, have a notion of "rooted value" that stops propagation of multiplicity

Chris Fallin (Aug 16 2024 at 18:14):

Chris Fallin (Aug 16 2024 at 18:15):

the key there would be to have a single source of truth for that notion/kind of instruction and consult it in both places

Alex Crichton (Aug 16 2024 at 18:17):

I'm still not understanding though because isn't it structurally not possible to match through a multi-result instruction?

Alex Crichton (Aug 16 2024 at 18:17):

Chris Fallin (Aug 16 2024 at 18:18):

it currently isn't possible, but let's not bake in that assumption, is what I'm saying; it's an implementation limit not something fundamental

Chris Fallin (Aug 16 2024 at 18:18):

the "rooted instruction" notion above is my attempt to reify it as an actual concept we can rely on :-)

Chris Fallin (Aug 16 2024 at 18:18):

said another way: can we describe accurately what the result of the analysis is?

Alex Crichton (Aug 16 2024 at 18:19):

ok yeah that makes sense, I'm just trying to affirm my current understanding of the world to understand what you're saying we want to move towards

Alex Crichton (Aug 16 2024 at 18:19):

Chris Fallin (Aug 16 2024 at 18:19):

if the answer has to invoke implementation limits and just-so arguments why it won't cause an issue, that's a very bad place to be in

Chris Fallin (Aug 16 2024 at 18:19):

Alex Crichton (Aug 16 2024 at 18:19):

and it's difficult for me to reason about matching through things when that's currently just a hypothetical for multi-result instructions

Chris Fallin (Aug 16 2024 at 18:20):

"rooted instruction" is a concept that reifies the notion of "thing that cannot be matched through"

Chris Fallin (Aug 16 2024 at 18:20):

once we have that guarantee, semantically, we can define multiplicity in terms of it

Chris Fallin (Aug 16 2024 at 18:20):

basically my whole thing here is: we need to define multiplicity succinctly and clearly so we can think about it. if our definition includes reasoning about matching/lowering algorithm stuff, we've failed

Chris Fallin (Aug 16 2024 at 18:21):

Alex Crichton (Aug 16 2024 at 18:21):

sorry I'm also trying to catch up because it sounds like you're trying to rebut a change I'm proposing but I'm not proposing anything and am trying to better understand what we have today

Chris Fallin (Aug 16 2024 at 18:22):

basically I'm justifying why we have what we have today against the "why can't we remove force_multiple"

Chris Fallin (Aug 16 2024 at 18:24):

as succinctly as I can put it: multiplicity has been defined (in this analysis) to mean "is an instruction referenced multiple times, or referenced by another instruction that is referenced multiple times". The algorithm approximates that in the presence of multiple-output instructions by the use of the force_multiple mechanism

Chris Fallin (Aug 16 2024 at 18:24):

what I'm then proposing is that we can change that definition to change the answer to your question ("why can't we remove" -> "now we can") by introducing this notion of rooted instructions

Alex Crichton (Aug 16 2024 at 18:27):

Chris Fallin (Aug 16 2024 at 18:27):

cool. happy to review if you want to take a stab at this (I unfortunately don't have the time or mental energy atm to do it myself!)

Alex Crichton (Aug 16 2024 at 18:27):

is this function, get_value_as_source_or_const, a suitable "choke point" where the concept of a rooted instruction might be able to be inserted?

wasmtime/cranelift/codegen/src/machinst/lower.rs at 3f5c21bff4babe54f51deceed1bb1d3321b6f2aa · bytecodealliance/wasmtime

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

Alex Crichton (Aug 16 2024 at 18:28):

wasmtime/cranelift/codegen/src/machinst/lower.rs at 3f5c21bff4babe54f51deceed1bb1d3321b6f2aa · bytecodealliance/wasmtime

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

Chris Fallin (Aug 16 2024 at 18:29):

yep, that's indeed where we have all the conditions for "seeing through" a use, so let's add conditions there (for pure and impure cases) for "is rooted" and avoid seeing it as any more than an opaque value if so

Alex Crichton (Aug 16 2024 at 18:29):

Alex Crichton (Aug 16 2024 at 18:30):

let's say that you v0 = load ... ; v1, v2 = isplit v0 - if you "look through" the isplit, despite it being pure, you could "look through" a use of v1 and v2 and possibly generate two loads

Alex Crichton (Aug 16 2024 at 18:30):

Chris Fallin (Aug 16 2024 at 18:30):

Chris Fallin (Aug 16 2024 at 18:31):

so we cut things at the isplit, it's "rooted" now, it becomes just a single use for the load, and we can't look through it

Chris Fallin (Aug 16 2024 at 18:33):

(I'm not sure if that addresses your original desire to get more load-sinking; the loads you want sunk are actually single-use and the splits happen as consumers of the things like iadds that consume loads, right?)

Alex Crichton (Aug 16 2024 at 18:34):

yeah what specifically I was seeing was that any operand to isplit would transitively mark all operands as Multiple which thwarted load sinking elsewhere

Alex Crichton (Aug 16 2024 at 18:36):

the downside of this "simple" change though is that if a multi-result instruction has a dead result, e.g. the second return value of isplit, it in theory should allow seeing through the isplit but it doesn't?

Chris Fallin (Aug 16 2024 at 18:36):

that delta is I think the logical outcome of the concept of "rooted" being defined for now as any multi-output; I was going to suggest a more generalized thing but then it would result in silly dead code; as long as we have comments where force_multiple is noting that we explicitly do not look through multi-output insts when matching, and add a comment where we block such seeing-through noting that it is load-bearing

Chris Fallin (Aug 16 2024 at 18:37):

another alternative is to make the multiplicity analysis smarter: Once + Once -> Multiple, including across multiple inputs, but Once + not-used should result in Once

Chris Fallin (Aug 16 2024 at 18:37):

Alex Crichton (Aug 16 2024 at 18:37):

Alex Crichton (Aug 16 2024 at 18:38):

hm actually wait I don't think "just" removing force_multiple enough, the DFS would have to stop propagating through a multi-result instruction

Alex Crichton (Aug 16 2024 at 18:39):

b/c otherwise if you use both results of isplit it'll naturally mark the operand as used twice right? (if the only change is to remove force_multiple)

Chris Fallin (Aug 16 2024 at 18:39):

right, I think we actually need the notion of "rooted"; it's actually that we force codegen at that matching root (by not seeing through it) and really do only use the input once as a result

Alex Crichton (Aug 16 2024 at 18:41):

Chris Fallin (Aug 16 2024 at 18:41):

Alex Crichton (Aug 16 2024 at 20:55):

Don't force `Multiple` on multi-result instructions by alexcrichton · Pull Request #9137 · bytecodealliance/wasmtime

This commit is a result of discussion on Zulip and is attempting to fix an issue where some 128-bit instructions aren't fully benefitting from load sinking on x64. At a high level 128-bit addit...

Stream: cranelift

Topic: ValueUseState::Multiple and multi-result instructions

Alex Crichton (Aug 16 2024 at 18:00):

fitzgen (he/him) (Aug 16 2024 at 18:03):

fitzgen (he/him) (Aug 16 2024 at 18:03):

Chris Fallin (Aug 16 2024 at 18:04):

Chris Fallin (Aug 16 2024 at 18:05):

Alex Crichton (Aug 16 2024 at 18:08):

Chris Fallin (Aug 16 2024 at 18:09):

Chris Fallin (Aug 16 2024 at 18:09):

Chris Fallin (Aug 16 2024 at 18:10):

Chris Fallin (Aug 16 2024 at 18:10):

Chris Fallin (Aug 16 2024 at 18:10):

Chris Fallin (Aug 16 2024 at 18:11):

Chris Fallin (Aug 16 2024 at 18:11):

Chris Fallin (Aug 16 2024 at 18:11):

Chris Fallin (Aug 16 2024 at 18:11):

Alex Crichton (Aug 16 2024 at 18:12):

Chris Fallin (Aug 16 2024 at 18:13):

Alex Crichton (Aug 16 2024 at 18:13):

Chris Fallin (Aug 16 2024 at 18:13):

Chris Fallin (Aug 16 2024 at 18:13):

Chris Fallin (Aug 16 2024 at 18:14):

Chris Fallin (Aug 16 2024 at 18:15):

Alex Crichton (Aug 16 2024 at 18:17):

Alex Crichton (Aug 16 2024 at 18:17):

Chris Fallin (Aug 16 2024 at 18:18):

Chris Fallin (Aug 16 2024 at 18:18):

Chris Fallin (Aug 16 2024 at 18:18):

Alex Crichton (Aug 16 2024 at 18:19):

Alex Crichton (Aug 16 2024 at 18:19):

Chris Fallin (Aug 16 2024 at 18:19):

Chris Fallin (Aug 16 2024 at 18:19):

Alex Crichton (Aug 16 2024 at 18:19):

Chris Fallin (Aug 16 2024 at 18:20):

Chris Fallin (Aug 16 2024 at 18:20):

Chris Fallin (Aug 16 2024 at 18:20):

Chris Fallin (Aug 16 2024 at 18:20):

Chris Fallin (Aug 16 2024 at 18:21):

Alex Crichton (Aug 16 2024 at 18:21):

Chris Fallin (Aug 16 2024 at 18:22):

Chris Fallin (Aug 16 2024 at 18:24):

Chris Fallin (Aug 16 2024 at 18:24):

Alex Crichton (Aug 16 2024 at 18:27):

Chris Fallin (Aug 16 2024 at 18:27):

Alex Crichton (Aug 16 2024 at 18:27):

Alex Crichton (Aug 16 2024 at 18:28):

Chris Fallin (Aug 16 2024 at 18:29):

Alex Crichton (Aug 16 2024 at 18:29):

Alex Crichton (Aug 16 2024 at 18:30):

Alex Crichton (Aug 16 2024 at 18:30):

Alex Crichton (Aug 16 2024 at 18:30):

Chris Fallin (Aug 16 2024 at 18:30):

Chris Fallin (Aug 16 2024 at 18:31):

Chris Fallin (Aug 16 2024 at 18:33):

Alex Crichton (Aug 16 2024 at 18:34):

Alex Crichton (Aug 16 2024 at 18:34):

Alex Crichton (Aug 16 2024 at 18:36):

Chris Fallin (Aug 16 2024 at 18:36):

Chris Fallin (Aug 16 2024 at 18:37):

Chris Fallin (Aug 16 2024 at 18:37):

Alex Crichton (Aug 16 2024 at 18:37):

Alex Crichton (Aug 16 2024 at 18:37):

Alex Crichton (Aug 16 2024 at 18:38):

Alex Crichton (Aug 16 2024 at 18:39):

Chris Fallin (Aug 16 2024 at 18:39):

Alex Crichton (Aug 16 2024 at 18:41):

Chris Fallin (Aug 16 2024 at 18:41):

Alex Crichton (Aug 16 2024 at 20:55):