Stream: git-wasmtime

Topic: wasmtime / issue #5783 Souper-synthesized optimizations t...


view this post on Zulip Wasmtime GitHub notifications bot (Feb 15 2023 at 00:15):

fitzgen opened issue #5783:

Here are some synthesized optimizations for CLIF harvested from
spidermonkey.wasm with explicit bounds checks enabled. I won't have time to
investigate, generalize, or implement them before I go on vacation, so I want to
note them down in an issue for posterity.


Replacing clz(x) == 0 with a comparison:

========= ./2559773154913247904.result =========
%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult 2147483647:i32, %0
result %3

Similarly, we can do this for clz(x) == 1, which wasn't harvested from the
CLIF:

%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = slt 1073741823:i32, %0
result %3

Note that there is no generalization for clz(x) == C available here (unless
C is larger than x's bit width, in which case it is always false). It only
works for zero and one because we don't need to check for a lower bound on x.

Probably a similar rewrite we could do with clz(x) == OP_BIT_WIDTH.


"Reverse" const propagation of a shl into a select:

========= ./11901186248914954319.result =========
%1:i1 = var
%2:i32 = select %1, 9:i32, 1:i32
%3:i32 = shl -1:i32, %2
infer %3

; RHS inferred successfully
%3:i32 = select %1, 4294966784:i32, 4294967294:i32
result %3

Maybe not profitable if %2 is used multiple times, in which case %1's live
range might be extended after this optimization. But I guess this is always true
with these sorts of peepholes...


Haven't dug into this one yet:

========= ./2969648510667058023.result =========
%0:i32 = var
%1:i32 = and %0, 2147483647:i32
%2:i32 = shl %1, 1:i32
infer %2

; RHS inferred successfully
%3:i32 = add %0, %0
result %3

A bunch of masking off bits that will be shifted out anyways:

========= ./4985840733093600201.result =========
%0:i32 = var
%1:i32 = and %0, 1073741823:i32
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 2:i32
result %3

========= ./17860953418551930245.result =========
%0:i32 = var
%1:i32 = and %0, 268435455:i32
%2:i32 = shl %1, 4:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 4:i32
result %3

========= ./8617343957324108668.result =========
%0:i32 = var
%1:i32 = and %0, 536870911:i32
%2:i32 = shl %1, 3:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 3:i32
result %3

Replacing an and and an eq with an ult. Haven't thought about these yet.

========= ./9633751011751143656.result =========
%0:i32 = var
%1:i32 = and %0, -2147483648:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 2147483648:i32
result %3

========= ./3840756061434214754.result =========
%0:i32 = var
%1:i32 = and %0, 4294967280:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 16:i32
result %3

Unnecessary ors:

========= ./14551618322220925804.result =========
%0:i32 = var
%1:i32 = or %0, 1024:i32
%2:i32 = and %1, 2048:i32
infer %2

; RHS inferred successfully
%3:i32 = and 2048:i32, %0
result %3

========= ./13190508444419006141.result =========
%0:i32 = var
%1:i32 = or %0, 33032:i32
%2:i32 = and %1, 192:i32
infer %2

; RHS inferred successfully
%3:i32 = and 192:i32, %0
result %3

More masking off bits that will be shifted away:

========= ./3703682576609255056.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 24:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 24:i32
result %3

========= ./5310030443405410098.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 25:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 25:i32
result %3

A ton of "reverse" const prop with comparisons found, here are a few:

========= ./13877742632332331899.result =========
%0:i32 = var
%1:i32 = add %0, 1:i32
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 0:i32, %0
result %3

========= ./3568272213429168939.result =========
%0:i32 = var
%1:i32 = add %0, 44:i32
%2:i1 = eq %1, 3608:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 3564:i32, %0
result %3

========= ./12119931432587256453.result =========
%0:i32 = var
%1:i32 = ashr %0, 2:i32
%2:i1 = slt %1, 24:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 96:i32
result %3

========= ./7033427926372082757.result =========
%0:i32 = var
%1:i32 = and %0, -8:i32
%2:i1 = sle %1, 31:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 32:i32
result %3

And also a bunch "reverse" const prop with other operators as well (not as many
as with comparisons though). Again, here are a few:

========= ./7402500702784523674.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = shl %1, 16:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 17:i32
result %3

========= ./10936462652029262118.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = mul %1, 12:i32
infer %2

; RHS inferred successfully
%3:i32 = mul 24:i32, %0
result %3

========= ./13431533325972989820.result =========
%0:i32 = var
%1:i32 = shl -1:i32, %0
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl 4294967292:i32, %0
result %3

view this post on Zulip Wasmtime GitHub notifications bot (Feb 15 2023 at 00:15):

fitzgen labeled issue #5783:

Here are some synthesized optimizations for CLIF harvested from
spidermonkey.wasm with explicit bounds checks enabled. I won't have time to
investigate, generalize, or implement them before I go on vacation, so I want to
note them down in an issue for posterity.


Replacing clz(x) == 0 with a comparison:

========= ./2559773154913247904.result =========
%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult 2147483647:i32, %0
result %3

Similarly, we can do this for clz(x) == 1, which wasn't harvested from the
CLIF:

%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = slt 1073741823:i32, %0
result %3

Note that there is no generalization for clz(x) == C available here (unless
C is larger than x's bit width, in which case it is always false). It only
works for zero and one because we don't need to check for a lower bound on x.

Probably a similar rewrite we could do with clz(x) == OP_BIT_WIDTH.


"Reverse" const propagation of a shl into a select:

========= ./11901186248914954319.result =========
%1:i1 = var
%2:i32 = select %1, 9:i32, 1:i32
%3:i32 = shl -1:i32, %2
infer %3

; RHS inferred successfully
%3:i32 = select %1, 4294966784:i32, 4294967294:i32
result %3

Maybe not profitable if %2 is used multiple times, in which case %1's live
range might be extended after this optimization. But I guess this is always true
with these sorts of peepholes...


Haven't dug into this one yet:

========= ./2969648510667058023.result =========
%0:i32 = var
%1:i32 = and %0, 2147483647:i32
%2:i32 = shl %1, 1:i32
infer %2

; RHS inferred successfully
%3:i32 = add %0, %0
result %3

A bunch of masking off bits that will be shifted out anyways:

========= ./4985840733093600201.result =========
%0:i32 = var
%1:i32 = and %0, 1073741823:i32
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 2:i32
result %3

========= ./17860953418551930245.result =========
%0:i32 = var
%1:i32 = and %0, 268435455:i32
%2:i32 = shl %1, 4:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 4:i32
result %3

========= ./8617343957324108668.result =========
%0:i32 = var
%1:i32 = and %0, 536870911:i32
%2:i32 = shl %1, 3:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 3:i32
result %3

Replacing an and and an eq with an ult. Haven't thought about these yet.

========= ./9633751011751143656.result =========
%0:i32 = var
%1:i32 = and %0, -2147483648:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 2147483648:i32
result %3

========= ./3840756061434214754.result =========
%0:i32 = var
%1:i32 = and %0, 4294967280:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 16:i32
result %3

Unnecessary ors:

========= ./14551618322220925804.result =========
%0:i32 = var
%1:i32 = or %0, 1024:i32
%2:i32 = and %1, 2048:i32
infer %2

; RHS inferred successfully
%3:i32 = and 2048:i32, %0
result %3

========= ./13190508444419006141.result =========
%0:i32 = var
%1:i32 = or %0, 33032:i32
%2:i32 = and %1, 192:i32
infer %2

; RHS inferred successfully
%3:i32 = and 192:i32, %0
result %3

More masking off bits that will be shifted away:

========= ./3703682576609255056.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 24:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 24:i32
result %3

========= ./5310030443405410098.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 25:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 25:i32
result %3

A ton of "reverse" const prop with comparisons found, here are a few:

========= ./13877742632332331899.result =========
%0:i32 = var
%1:i32 = add %0, 1:i32
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 0:i32, %0
result %3

========= ./3568272213429168939.result =========
%0:i32 = var
%1:i32 = add %0, 44:i32
%2:i1 = eq %1, 3608:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 3564:i32, %0
result %3

========= ./12119931432587256453.result =========
%0:i32 = var
%1:i32 = ashr %0, 2:i32
%2:i1 = slt %1, 24:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 96:i32
result %3

========= ./7033427926372082757.result =========
%0:i32 = var
%1:i32 = and %0, -8:i32
%2:i1 = sle %1, 31:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 32:i32
result %3

And also a bunch "reverse" const prop with other operators as well (not as many
as with comparisons though). Again, here are a few:

========= ./7402500702784523674.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = shl %1, 16:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 17:i32
result %3

========= ./10936462652029262118.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = mul %1, 12:i32
infer %2

; RHS inferred successfully
%3:i32 = mul 24:i32, %0
result %3

========= ./13431533325972989820.result =========
%0:i32 = var
%1:i32 = shl -1:i32, %0
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl 4294967292:i32, %0
result %3

view this post on Zulip Wasmtime GitHub notifications bot (Feb 15 2023 at 00:15):

fitzgen labeled issue #5783:

Here are some synthesized optimizations for CLIF harvested from
spidermonkey.wasm with explicit bounds checks enabled. I won't have time to
investigate, generalize, or implement them before I go on vacation, so I want to
note them down in an issue for posterity.


Replacing clz(x) == 0 with a comparison:

========= ./2559773154913247904.result =========
%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult 2147483647:i32, %0
result %3

Similarly, we can do this for clz(x) == 1, which wasn't harvested from the
CLIF:

%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = slt 1073741823:i32, %0
result %3

Note that there is no generalization for clz(x) == C available here (unless
C is larger than x's bit width, in which case it is always false). It only
works for zero and one because we don't need to check for a lower bound on x.

Probably a similar rewrite we could do with clz(x) == OP_BIT_WIDTH.


"Reverse" const propagation of a shl into a select:

========= ./11901186248914954319.result =========
%1:i1 = var
%2:i32 = select %1, 9:i32, 1:i32
%3:i32 = shl -1:i32, %2
infer %3

; RHS inferred successfully
%3:i32 = select %1, 4294966784:i32, 4294967294:i32
result %3

Maybe not profitable if %2 is used multiple times, in which case %1's live
range might be extended after this optimization. But I guess this is always true
with these sorts of peepholes...


Haven't dug into this one yet:

========= ./2969648510667058023.result =========
%0:i32 = var
%1:i32 = and %0, 2147483647:i32
%2:i32 = shl %1, 1:i32
infer %2

; RHS inferred successfully
%3:i32 = add %0, %0
result %3

A bunch of masking off bits that will be shifted out anyways:

========= ./4985840733093600201.result =========
%0:i32 = var
%1:i32 = and %0, 1073741823:i32
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 2:i32
result %3

========= ./17860953418551930245.result =========
%0:i32 = var
%1:i32 = and %0, 268435455:i32
%2:i32 = shl %1, 4:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 4:i32
result %3

========= ./8617343957324108668.result =========
%0:i32 = var
%1:i32 = and %0, 536870911:i32
%2:i32 = shl %1, 3:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 3:i32
result %3

Replacing an and and an eq with an ult. Haven't thought about these yet.

========= ./9633751011751143656.result =========
%0:i32 = var
%1:i32 = and %0, -2147483648:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 2147483648:i32
result %3

========= ./3840756061434214754.result =========
%0:i32 = var
%1:i32 = and %0, 4294967280:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 16:i32
result %3

Unnecessary ors:

========= ./14551618322220925804.result =========
%0:i32 = var
%1:i32 = or %0, 1024:i32
%2:i32 = and %1, 2048:i32
infer %2

; RHS inferred successfully
%3:i32 = and 2048:i32, %0
result %3

========= ./13190508444419006141.result =========
%0:i32 = var
%1:i32 = or %0, 33032:i32
%2:i32 = and %1, 192:i32
infer %2

; RHS inferred successfully
%3:i32 = and 192:i32, %0
result %3

More masking off bits that will be shifted away:

========= ./3703682576609255056.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 24:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 24:i32
result %3

========= ./5310030443405410098.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 25:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 25:i32
result %3

A ton of "reverse" const prop with comparisons found, here are a few:

========= ./13877742632332331899.result =========
%0:i32 = var
%1:i32 = add %0, 1:i32
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 0:i32, %0
result %3

========= ./3568272213429168939.result =========
%0:i32 = var
%1:i32 = add %0, 44:i32
%2:i1 = eq %1, 3608:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 3564:i32, %0
result %3

========= ./12119931432587256453.result =========
%0:i32 = var
%1:i32 = ashr %0, 2:i32
%2:i1 = slt %1, 24:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 96:i32
result %3

========= ./7033427926372082757.result =========
%0:i32 = var
%1:i32 = and %0, -8:i32
%2:i1 = sle %1, 31:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 32:i32
result %3

And also a bunch "reverse" const prop with other operators as well (not as many
as with comparisons though). Again, here are a few:

========= ./7402500702784523674.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = shl %1, 16:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 17:i32
result %3

========= ./10936462652029262118.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = mul %1, 12:i32
infer %2

; RHS inferred successfully
%3:i32 = mul 24:i32, %0
result %3

========= ./13431533325972989820.result =========
%0:i32 = var
%1:i32 = shl -1:i32, %0
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl 4294967292:i32, %0
result %3

view this post on Zulip Wasmtime GitHub notifications bot (Feb 23 2023 at 10:42):

afonso360 commented on issue #5783:

I was looking into souper the other day. I wasn't able to set it up properly, but I noticed we are missing a translation for Bswap in our harvest program. So there might be a few more optimizations possible there.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2024 at 17:39):

regehr commented on issue #5783:

hello, please see this list of generalized versions of these:

https://gist.github.com/manasij7479/602e770d45169d5ffa73d8cd100d5b05

maybe read them over and if you have any questions, ask me and @manasij7479 to explain.

in Hydra's output, sext(1) is just a bitwidth-independent way to say -1

hope this is useful!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2024 at 17:44):

regehr commented on issue #5783:

I should add that in the general case, generalizing an optimization is not a well-posed problem. there are often multiple solutions, and it is often the case that it's not totally clear which one to prefer. so while the Hydra output is (hopefully) correct, it probably contains some examples where you might want to generalize things a different way in practice.

also, this notation:

(sext(1) << x0) << C2
  =>
(sext(1) << C2) << x0

comes from a pretty-printer and is intended to make things easy to read for humans. the machine-readable form is Hydra's internal IR which is an extended version of Souper IR. what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer. the right answer is probably to write some C++ that fits into Hydra and directly emits Cranelift pattern-matching / rewriting code. if you want to do this, please talk to us and hopefully also Manasij will help out

view this post on Zulip Wasmtime GitHub notifications bot (Mar 20 2024 at 18:20):

fitzgen commented on issue #5783:

Thanks @regehr!! Will take a look at these when I have some free cycles.

Out of curiosity, can you provide some more details on what the reported "profit" scores are?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 20 2024 at 18:24):

regehr commented on issue #5783:

ah, sorry, that's the difference in cost between the LHS and RHS, where cost is souper's idiosyncratic cost model where most stuff has cost 1, a few things like select have cost 3, and then some stuff like intrinsics have cost 5.

we never really arrived at a good cost model for LLVM IR (I talked to many LLVM people about this many times and never really made good progress) and I don't expect it to be a great cost model for Cranelift either. but perhaps it's close enough.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 20 2024 at 18:29):

fitzgen commented on issue #5783:

what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer.

FWIW, if Hydra outputted the same text format as Souper, we would be able to reuse the existing parser I wrote: https://docs.rs/souper-ir/latest/souper_ir/

But yeah, we will definitely reach out again when we look at automating this whole process in the future

view this post on Zulip Wasmtime GitHub notifications bot (Mar 20 2024 at 19:00):

manasij7479 commented on issue #5783:

what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer.

FWIW, if Hydra outputted the same text format as Souper, we would be able to reuse the existing parser I wrote: https://docs.rs/souper-ir/latest/souper_ir/

But yeah, we will definitely reach out again when we look at automating this whole process in the future

It does, and that is what our automation for generating an LLVM pass uses.

It assigns meaning to specific identifier names, and has different semantics for width constraints.
For example a %symconst prefix means it is a symbolic constant. It is pretty much Souper IR other than some details like this, so your parser could work pretty well.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 20 2024 at 21:39):

regehr commented on issue #5783:

this all sounds great.

but note that some things like bitwidth independence are handled outside of Souper IR, and require some care

if there's something we can do on our side, please let us know, but keep in mind that Manasij plans to defend later this spring, so the earlier the better!


Last updated: Oct 23 2024 at 20:03 UTC