I'm curious to explore some 128-bit stuff a bit and right now I'm looking at input such as:
function u0:0(i64, i64) -> i64, i64 tail {
block0(v0: i64, v1: i64):
v10 = load.i64 v0
v18 = uextend.i128 v1
v20 = uextend.i128 v10
v14 = iadd v18, v20
v15, v16 = isplit v14
jump block1
block1:
return v15, v16
}
which after https://github.com/bytecodealliance/wasmtime/pull/9136 produces:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 4c 8b 0f movq (%rdi), %r9
7: 48 31 c9 xorq %rcx, %rcx
a: 48 89 f0 movq %rsi, %rax
d: 4c 01 c8 addq %r9, %rax
10: 48 83 d1 00 adcq $0, %rcx
14: 48 89 ec movq %rbp, %rsp
17: 5d popq %rbp
18: c3 retq
notably what I'm trying to get happen is to sink the movq (%rdi), %r9
into addq %r9, %rax
. Currently the sinking is being prevented by this line but I don't fully understand the comment there. It seems like if this heuristic is needed then it's probably a conservative one because load sinking should be ok in this specific scenario at least.
Which leads me to my question: could this force_multiple
be explained a bit more? Is there an example of how this could go wrong if force_multiple
were ignored? (if I set it to false
unconditionally no filetests break so I wasn't able to learn more via that). Furthermore is there a way to perhaps make this check less conservative to allow load sinking in the above case?
blame says it is a fix for https://github.com/bytecodealliance/wasmtime/issues/3953
cc @Chris Fallin @Andrew Brown
yep, that; the case you show in CLIF above is a special case where a single instruction is responsible for both of the uses, in theory we could detect that, we'd want to think it through very carefully though
(nit: not "heuristic" but "invariant" or "rule"; heuristics can be violated without correctness issues)
hm ok, I still don't really understand what this is doing but sounds like I should assume this can't easily change
so the basic idea of the analysis is to prove that an instruction is used only (or at most) once; that's the condition we need to sink it
(we could do better if we reasoned about location and sunk it to the first of several locations it's used if it dominates all the others, but that's way more complex)
when there is a single output, this is reasonably straightforward: we do a transitive fixpoint algorithm where we generate a "Multiple" at any op that's used multiple times, and propagate that backward
(because matching/sinking can go arbitrarily high up the tree, we need a transitive notion of "used more than once")
when there are multiple outputs, even if each output is used only once, the op itself is used more than once
we could refine the lattice more and encode logic for 0+0 -> 0, 0+1 / 1+0 -> 1, 1+1 -> many, 1+many / many+1 -> many
and then add a special case for 1+1 (if same 1) -> 1
but that's... a lot of subtlety and I don't have confidence enough in it to do it right now
hopefully makes more sense?
One part I don't understand though is why multiple-results matter here. ISLE can't match instructions with multiple results in patterns afaik so we can't possibly go up the tree and somehow duplicate the operands of a multi-result instruction, right?
well, when building the analysis we try not to couple current implementation limits such as "ISLE can't match instructions with multiple results" with the invariants elsewhere; that could easily lead to very subtle bugs later when we expand its capability, and in any case leads to very hard to reason about invariants
the specific case of isplit
returning two things and then ret
consuming both is a bit of a red herring as it was the first test case I came up with, the actual benchmark I think stores the two results of isplit
into memory or just one or something like that
I think the right fix here, if we want to make the analysis more powerful, is to do as noted above -- recognize the special case
or alternately, have a notion of "rooted value" that stops propagation of multiplicity
then we need to guarantee that this is never matched through
the key there would be to have a single source of truth for that notion/kind of instruction and consult it in both places
I'm still not understanding though because isn't it structurally not possible to match through a multi-result instruction?
in that aren't we already guaranteed that doesn't happen?
it currently isn't possible, but let's not bake in that assumption, is what I'm saying; it's an implementation limit not something fundamental
the "rooted instruction" notion above is my attempt to reify it as an actual concept we can rely on :-)
said another way: can we describe accurately what the result of the analysis is?
ok yeah that makes sense, I'm just trying to affirm my current understanding of the world to understand what you're saying we want to move towards
e.g. I don't know what you mean by "rooted instruction"
if the answer has to invoke implementation limits and just-so arguments why it won't cause an issue, that's a very bad place to be in
vs. having a semantic concept and defining things around it
and it's difficult for me to reason about matching through things when that's currently just a hypothetical for multi-result instructions
"rooted instruction" is a concept that reifies the notion of "thing that cannot be matched through"
rooted because we need to actually codegen it at the root of a matching tree
once we have that guarantee, semantically, we can define multiplicity in terms of it
basically my whole thing here is: we need to define multiplicity succinctly and clearly so we can think about it. if our definition includes reasoning about matching/lowering algorithm stuff, we've failed
(in the sense that, that's a recipe for muddy thinking and further bugs)
sorry I'm also trying to catch up because it sounds like you're trying to rebut a change I'm proposing but I'm not proposing anything and am trying to better understand what we have today
basically I'm justifying why we have what we have today against the "why can't we remove force_multiple
"
as succinctly as I can put it: multiplicity has been defined (in this analysis) to mean "is an instruction referenced multiple times, or referenced by another instruction that is referenced multiple times". The algorithm approximates that in the presence of multiple-output instructions by the use of the force_multiple
mechanism
what I'm then proposing is that we can change that definition to change the answer to your question ("why can't we remove" -> "now we can") by introducing this notion of rooted instructions
ok that makes sense to me yeah, thanks!
cool. happy to review if you want to take a stab at this (I unfortunately don't have the time or mental energy atm to do it myself!)
is this function, get_value_as_source_or_const
, a suitable "choke point" where the concept of a rooted instruction might be able to be inserted?
I see that for side-effecting instructions it already ensures there's only one output but I suspect more would be needed
yep, that's indeed where we have all the conditions for "seeing through" a use, so let's add conditions there (for pure and impure cases) for "is rooted" and avoid seeing it as any more than an opaque value if so
how come it would be necessary for the "pure" case?
oh wait let me try to answer myself
let's say that you v0 = load ... ; v1, v2 = isplit v0
- if you "look through" the isplit
, despite it being pure, you could "look through" a use of v1
and v2
and possibly generate two loads
is that right?
yep exactly
so we cut things at the isplit, it's "rooted" now, it becomes just a single use for the load, and we can't look through it
(I'm not sure if that addresses your original desire to get more load-sinking; the loads you want sunk are actually single-use and the splits happen as consumers of the things like iadds that consume loads, right?)
so as perhaps the most simple change, it would perhaps look like:
force_multiple
get_value_as_source_or_const
to return InputSourceInst::None
if the instruction has multiple outputsyeah what specifically I was seeing was that any operand to isplit
would transitively mark all operands as Multiple
which thwarted load sinking elsewhere
the downside of this "simple" change though is that if a multi-result instruction has a dead result, e.g. the second return value of isplit
, it in theory should allow seeing through the isplit
but it doesn't?
that delta is I think the logical outcome of the concept of "rooted" being defined for now as any multi-output; I was going to suggest a more generalized thing but then it would result in silly dead code; as long as we have comments where force_multiple
is noting that we explicitly do not look through multi-output insts when matching, and add a comment where we block such seeing-through noting that it is load-bearing
another alternative is to make the multiplicity analysis smarter: Once + Once -> Multiple, including across multiple inputs, but Once + not-used should result in Once
s/multiple inputs/multiple outputs/
removing force_multiple
gets that for free though right?
b/c if you only use one result of isplit
the operand only gets a single use?
hm actually wait I don't think "just" removing force_multiple
enough, the DFS would have to stop propagating through a multi-result instruction
b/c otherwise if you use both results of isplit
it'll naturally mark the operand as used twice right? (if the only change is to remove force_multiple
)
right, I think we actually need the notion of "rooted"; it's actually that we force codegen at that matching root (by not seeing through it) and really do only use the input once as a result
hm ok I'll try to poke at this and see if I can't bottom this out more
happy to talk more in next week's CL meeting too!
I've ended up with https://github.com/bytecodealliance/wasmtime/pull/9137 as an attempt of this
Last updated: Jan 24 2025 at 00:11 UTC