Stream: cranelift

Topic: ValueUseState::Multiple and multi-result instructions


view this post on Zulip Alex Crichton (Aug 16 2024 at 18:00):

I'm curious to explore some 128-bit stuff a bit and right now I'm looking at input such as:

function u0:0(i64, i64) -> i64, i64 tail {
block0(v0: i64, v1: i64):
    v10 = load.i64 v0
    v18 = uextend.i128 v1
    v20 = uextend.i128 v10
    v14 = iadd v18, v20
    v15, v16 = isplit v14
    jump block1

block1:
    return v15, v16
}

which after https://github.com/bytecodealliance/wasmtime/pull/9136 produces:

   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   4c 8b 0f                movq    (%rdi), %r9
   7:   48 31 c9                xorq    %rcx, %rcx
   a:   48 89 f0                movq    %rsi, %rax
   d:   4c 01 c8                addq    %r9, %rax
  10:   48 83 d1 00             adcq    $0, %rcx
  14:   48 89 ec                movq    %rbp, %rsp
  17:   5d                      popq    %rbp
  18:   c3                      retq

notably what I'm trying to get happen is to sink the movq (%rdi), %r9 into addq %r9, %rax. Currently the sinking is being prevented by this line but I don't fully understand the comment there. It seems like if this heuristic is needed then it's probably a conservative one because load sinking should be ok in this specific scenario at least.

Which leads me to my question: could this force_multiple be explained a bit more? Is there an example of how this could go wrong if force_multiple were ignored? (if I set it to false unconditionally no filetests break so I wasn't able to learn more via that). Furthermore is there a way to perhaps make this check less conservative to allow load sinking in the above case?

This commit implements a few minor changes for i128 in both the egraph optimizations and lowerings for x64. The optimization pass will now transform iconcat into a uextend or sextend where appropri...
A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip fitzgen (he/him) (Aug 16 2024 at 18:03):

blame says it is a fix for https://github.com/bytecodealliance/wasmtime/issues/3953

In examining #3951 and #3934, the main problem in both cases seems to be that the load coalescing logic cannot detect when a loaded operand may be reused. Because of this, we must manually ensure t...

view this post on Zulip fitzgen (he/him) (Aug 16 2024 at 18:03):

cc @Chris Fallin @Andrew Brown

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:04):

yep, that; the case you show in CLIF above is a special case where a single instruction is responsible for both of the uses, in theory we could detect that, we'd want to think it through very carefully though

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:05):

(nit: not "heuristic" but "invariant" or "rule"; heuristics can be violated without correctness issues)

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:08):

hm ok, I still don't really understand what this is doing but sounds like I should assume this can't easily change

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:09):

so the basic idea of the analysis is to prove that an instruction is used only (or at most) once; that's the condition we need to sink it

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:09):

(we could do better if we reasoned about location and sunk it to the first of several locations it's used if it dominates all the others, but that's way more complex)

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:10):

when there is a single output, this is reasonably straightforward: we do a transitive fixpoint algorithm where we generate a "Multiple" at any op that's used multiple times, and propagate that backward

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:10):

(because matching/sinking can go arbitrarily high up the tree, we need a transitive notion of "used more than once")

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:10):

when there are multiple outputs, even if each output is used only once, the op itself is used more than once

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:11):

we could refine the lattice more and encode logic for 0+0 -> 0, 0+1 / 1+0 -> 1, 1+1 -> many, 1+many / many+1 -> many

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:11):

and then add a special case for 1+1 (if same 1) -> 1

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:11):

but that's... a lot of subtlety and I don't have confidence enough in it to do it right now

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:11):

hopefully makes more sense?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:12):

One part I don't understand though is why multiple-results matter here. ISLE can't match instructions with multiple results in patterns afaik so we can't possibly go up the tree and somehow duplicate the operands of a multi-result instruction, right?

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:13):

well, when building the analysis we try not to couple current implementation limits such as "ISLE can't match instructions with multiple results" with the invariants elsewhere; that could easily lead to very subtle bugs later when we expand its capability, and in any case leads to very hard to reason about invariants

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:13):

the specific case of isplit returning two things and then ret consuming both is a bit of a red herring as it was the first test case I came up with, the actual benchmark I think stores the two results of isplit into memory or just one or something like that

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:13):

I think the right fix here, if we want to make the analysis more powerful, is to do as noted above -- recognize the special case

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:13):

or alternately, have a notion of "rooted value" that stops propagation of multiplicity

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:14):

then we need to guarantee that this is never matched through

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:15):

the key there would be to have a single source of truth for that notion/kind of instruction and consult it in both places

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:17):

I'm still not understanding though because isn't it structurally not possible to match through a multi-result instruction?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:17):

in that aren't we already guaranteed that doesn't happen?

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:18):

it currently isn't possible, but let's not bake in that assumption, is what I'm saying; it's an implementation limit not something fundamental

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:18):

the "rooted instruction" notion above is my attempt to reify it as an actual concept we can rely on :-)

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:18):

said another way: can we describe accurately what the result of the analysis is?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:19):

ok yeah that makes sense, I'm just trying to affirm my current understanding of the world to understand what you're saying we want to move towards

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:19):

e.g. I don't know what you mean by "rooted instruction"

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:19):

if the answer has to invoke implementation limits and just-so arguments why it won't cause an issue, that's a very bad place to be in

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:19):

vs. having a semantic concept and defining things around it

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:19):

and it's difficult for me to reason about matching through things when that's currently just a hypothetical for multi-result instructions

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:20):

"rooted instruction" is a concept that reifies the notion of "thing that cannot be matched through"

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:20):

rooted because we need to actually codegen it at the root of a matching tree

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:20):

once we have that guarantee, semantically, we can define multiplicity in terms of it

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:20):

basically my whole thing here is: we need to define multiplicity succinctly and clearly so we can think about it. if our definition includes reasoning about matching/lowering algorithm stuff, we've failed

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:21):

(in the sense that, that's a recipe for muddy thinking and further bugs)

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:21):

sorry I'm also trying to catch up because it sounds like you're trying to rebut a change I'm proposing but I'm not proposing anything and am trying to better understand what we have today

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:22):

basically I'm justifying why we have what we have today against the "why can't we remove force_multiple"

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:24):

as succinctly as I can put it: multiplicity has been defined (in this analysis) to mean "is an instruction referenced multiple times, or referenced by another instruction that is referenced multiple times". The algorithm approximates that in the presence of multiple-output instructions by the use of the force_multiple mechanism

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:24):

what I'm then proposing is that we can change that definition to change the answer to your question ("why can't we remove" -> "now we can") by introducing this notion of rooted instructions

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:27):

ok that makes sense to me yeah, thanks!

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:27):

cool. happy to review if you want to take a stab at this (I unfortunately don't have the time or mental energy atm to do it myself!)

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:27):

is this function, get_value_as_source_or_const, a suitable "choke point" where the concept of a rooted instruction might be able to be inserted?

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:28):

I see that for side-effecting instructions it already ensures there's only one output but I suspect more would be needed

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:29):

yep, that's indeed where we have all the conditions for "seeing through" a use, so let's add conditions there (for pure and impure cases) for "is rooted" and avoid seeing it as any more than an opaque value if so

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:29):

how come it would be necessary for the "pure" case?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:30):

oh wait let me try to answer myself

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:30):

let's say that you v0 = load ... ; v1, v2 = isplit v0 - if you "look through" the isplit, despite it being pure, you could "look through" a use of v1 and v2 and possibly generate two loads

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:30):

is that right?

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:30):

yep exactly

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:31):

so we cut things at the isplit, it's "rooted" now, it becomes just a single use for the load, and we can't look through it

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:33):

(I'm not sure if that addresses your original desire to get more load-sinking; the loads you want sunk are actually single-use and the splits happen as consumers of the things like iadds that consume loads, right?)

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:34):

so as perhaps the most simple change, it would perhaps look like:

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:34):

yeah what specifically I was seeing was that any operand to isplit would transitively mark all operands as Multiple which thwarted load sinking elsewhere

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:36):

the downside of this "simple" change though is that if a multi-result instruction has a dead result, e.g. the second return value of isplit, it in theory should allow seeing through the isplit but it doesn't?

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:36):

that delta is I think the logical outcome of the concept of "rooted" being defined for now as any multi-output; I was going to suggest a more generalized thing but then it would result in silly dead code; as long as we have comments where force_multiple is noting that we explicitly do not look through multi-output insts when matching, and add a comment where we block such seeing-through noting that it is load-bearing

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:37):

another alternative is to make the multiplicity analysis smarter: Once + Once -> Multiple, including across multiple inputs, but Once + not-used should result in Once

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:37):

s/multiple inputs/multiple outputs/

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:37):

removing force_multiple gets that for free though right?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:37):

b/c if you only use one result of isplit the operand only gets a single use?

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:38):

hm actually wait I don't think "just" removing force_multiple enough, the DFS would have to stop propagating through a multi-result instruction

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:39):

b/c otherwise if you use both results of isplit it'll naturally mark the operand as used twice right? (if the only change is to remove force_multiple)

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:39):

right, I think we actually need the notion of "rooted"; it's actually that we force codegen at that matching root (by not seeing through it) and really do only use the input once as a result

view this post on Zulip Alex Crichton (Aug 16 2024 at 18:41):

hm ok I'll try to poke at this and see if I can't bottom this out more

view this post on Zulip Chris Fallin (Aug 16 2024 at 18:41):

happy to talk more in next week's CL meeting too!

view this post on Zulip Alex Crichton (Aug 16 2024 at 20:55):

I've ended up with https://github.com/bytecodealliance/wasmtime/pull/9137 as an attempt of this

This commit is a result of discussion on Zulip and is attempting to fix an issue where some 128-bit instructions aren't fully benefitting from load sinking on x64. At a high level 128-bit addit...

Last updated: Nov 22 2024 at 17:03 UTC