Stream: git-wasmtime

Topic: wasmtime / PR #8565 wasmtime: Remove ALL constant phis


view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 03:04):

jameysharp requested cfallin for a review on PR #8565.

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 03:04):

jameysharp requested wasmtime-compiler-reviewers for a review on PR #8565.

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 03:04):

jameysharp opened PR #8565 from jameysharp:yes-all-constant-phis to bytecodealliance:main:

When we're trying to delete block-params that can be replaced by a single dominating value, we weren't handling some cases.

In particular, if we concluded that a block formal parameter (say, v3) had more than one value, then we believed that any uses of that parameter (say, defining another formal parameter v4) also had more than one value, and therefore could not be deleted.

However, in such cases we can try using v3 itself as the definition of v4. If there are no other definitions of v4, then it can be deleted.

With this change, if a block has only one predecessor, it is now true that this pass will delete all of its block params. We couldn't rely on that property before.

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 15:30):

cfallin submitted PR review:

This looks good, thanks! Intuitively it seems the difference to me is the two-stage meet of sorts -- we get an actual Value (replacement) then test that against the abstract value and handle the case where it's actually the same as the existing One; is that where these two diverged?

I'm not sure if there's a comment of some sort that would help with this subtlety (I also didn't write the original and haven't looked at this code in a while so I may have misunderstood some of the original thinking); but it's already quite well commented, so I'll leave it up to you if you can think of any further explanation-comments to add or not. In any case, thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 20:00):

cfallin commented on PR #8565:

@jameysharp would you mind also doing a Sightglass run on this? It should, from first principles, mainly result in speedups; but I'm getting some weird results when trying it on some weval'd code (-5% deltas) and I'm wondering if it has some unfortunate interactions in my case with regalloc (longer liveranges) or if it's something else. In any case having a baseline measurement on "normal" code (not my horrific spaghetti) would be useful I suspect :-)

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 22:22):

jameysharp commented on PR #8565:

This looks good, thanks! Intuitively it seems the difference to me is the two-stage meet of sorts -- we get an actual Value (replacement) then test _that_ against the abstract value and handle the case where it's actually the same as the existing One; is that where these two diverged?

I'm not sure if there's a comment of some sort that would help with this subtlety (I also didn't write the original and haven't looked at this code in a while so I may have misunderstood some of the original thinking); but it's already quite well commented, so I'll leave it up to you if you can think of any further explanation-comments to add or not. In any case, thanks!

I'm not sure I understand the way you phrased this question, so I'm going to try answering a different question and see if that helps.

If we decide that a formal block parameter can be replaced by another value, then we want to forward _that_ value in the dataflow analysis if that formal parameter is used as the actual parameter for another branch. I'm thinking of it like continuing the analysis _as if_ one rewrite had already happened, to see if that makes more rewrites visible.

The existing implementation did that, except it forwarded any abstract value from the formal parameter, including the Many value indicating that a rewrite is not possible. Where this PR diverges is that once we have learned that a rewrite is not possible for some formal parameter, we should treat it like any other value that we aren't rewriting, such as entry-block formal parameters or instruction results.

As a result it turns out that the only abstract value we can get from that check is the One case: either the one value we currently believe we can rewrite to, or the non-rewritable value itself. Therefore I simplified join because we always join with the same kind of abstract value. But I don't think of it as a two-stage meet.

By the way, one thing I worried about is whether this dataflow transfer function is monotonic, and I had trouble convincing myself. Since this is a fix-point, it can optimistically assume that a rewrite will succeed until it finds evidence to the contrary, at which point that evidence propagates. But that means we can have a formal parameter which we assigned an abstract value of One to, and forwarded to other blocks as above, but then on the next iteration we learn it's actually Many so this time we forward a different value onward. Could there exist a CFG where doing so causes the formal parameter to go back to One state?

Now that I've written this out I remember the answer is trivially "no": we join the previous abstract value with the new one, and Many is "sticky". So once we've decided we can't rewrite a value, we never go back.

@jameysharp would you mind also doing a Sightglass run on this? It _should_, from first principles, mainly result in speedups; but I'm getting some weird results when trying it on some weval'd code (-5% deltas) and I'm wondering if it has some unfortunate interactions in my case with regalloc (longer liveranges) or if it's something else. In any case having a baseline measurement on "normal" code (not my horrific spaghetti) would be useful I suspect :-)

Sightglass on the default benchmark set says there is no significant difference in:

During execution, by instructions retired, this PR is very slightly worse on all three benchmarks. By CPU cycles, bz2 is faster with this PR and the others are a wash. Now that I know this, I'm not sure what to do with it.

execution :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 3838403.12 ± 2463605.04 (confidence = 99%)

  yes-all-constant-phis-27c1c1d74.so is 1.01x to 1.06x faster than no-all-constant-phis-81a89169f.so!

  [100771190 107900272.66 126167246] no-all-constant-phis-81a89169f.so
  [100584228 104061869.54 121997952] yes-all-constant-phis-27c1c1d74.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 482232.60 ± 94.51 (confidence = 99%)

  no-all-constant-phis-81a89169f.so is 1.02x to 1.02x faster than yes-all-constant-phis-27c1c1d74.so!

  [19708191 19708335.27 19708917] no-all-constant-phis-81a89169f.so
  [20190380 20190567.87 20191097] yes-all-constant-phis-27c1c1d74.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2499390.23 ± 27298.54 (confidence = 99%)

  no-all-constant-phis-81a89169f.so is 1.00x to 1.00x faster than yes-all-constant-phis-27c1c1d74.so!

  [2658325340 2658415971.80 2658657194] no-all-constant-phis-81a89169f.so
  [2660806656 2660915362.03 2661215422] yes-all-constant-phis-27c1c1d74.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 84593.43 ± 0.36 (confidence = 99%)

  no-all-constant-phis-81a89169f.so is 1.00x to 1.00x faster than yes-all-constant-phis-27c1c1d74.so!

  [227350470 227350471.64 227350474] no-all-constant-phis-81a89169f.so
  [227435064 227435065.07 227435068] yes-all-constant-phis-27c1c1d74.so

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 02:50):

cfallin commented on PR #8565:

OK, cool; that's what I had understood, I was mostly just wondering if there was some sort of doc-comment we could add to clarify. I think it's probably fine as-is.

Re: termination and monotonicity, I agree. The stickiness of values and using the meet-function of a finite-height lattice alone is enough to ensure termination, but a unique solution (i.e. ensuring that the result is not dependent on processing order or whatnot) requires monotonicity of the transfer functions too (i.e. x <= y implies f(x) <= f(y) for any f, including the abstract-value behavior of blockparams). I think that holds in this case; even the "Many laundered through a blockparam becomes a One" property is monotonic. That's because it's never the case that an input moving down the lattice (going from One to Many) can turn the output from Many to One (right?).

Re: performance: the mixed results are a little concerning combined with my observations from earlier; maybe this is something we could dig a bit further into? I agree the change seems very nice; but if in practice it does bad things to regalloc today then perhaps we can keep it around 'til those issues are resolved or we understand the conditions better...

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 07:33):

jameysharp commented on PR #8565:

Re: termination and monotonicity, I agree. The stickiness of values and using the meet-function of a finite-height lattice alone is enough to ensure termination, but a unique solution (i.e. ensuring that the result is not dependent on processing order or whatnot) requires monotonicity of the transfer functions too (i.e. x <= y implies f(x) <= f(y) for any f, including the abstract-value behavior of blockparams). I think that holds in this case; even the "Many laundered through a blockparam becomes a One" property is monotonic. That's because it's never the case that an input moving down the lattice (going from One to Many) can turn the output from Many to One (right?).

Oh right, uniqueness of the solution would be nice too. Yeah, this is what I had trouble reasoning about.

Let's say we see that v5 is a formal block parameter which on some branch has received v4 as its actual parameter, and that v4 in turn is another formal block parameter which so far we've only seen v3 given as the actual parameter for. Therefore we tentatively conclude that both v4 and v5 can be rewritten to use v3 instead.

Now later let's say we discover that v4 has more actual parameters besides v3, so we bump it to Many on the lattice. With this PR, that changes the value we want to meet with v5's abstract value from One(v3) to One(v4). Since those are different values, we bump v5 to Many on the lattice as well. But perhaps we'd have gotten a stable fix-point from using One(v4) instead of Many?

I think there's something here about the fact that we do this analysis pass in reverse post-order that helps, maybe only for reducible flow graphs.

But I also wonder if this needs a different lattice, distinguishing between "the one value coming from a blockparam" versus "the one value coming from a concrete source such as an instruction result".

Re: performance: the mixed results are a little concerning combined with my observations from earlier; maybe this is something we could dig a bit further into? I agree the change seems very nice; but if in practice it does bad things to regalloc today then perhaps we can keep it around 'til those issues are resolved or we understand the conditions better...

Let me make the performance question more complicated :sweat_smile:

I discovered this missed-optimization bug because I was trying to address #7639, which was prompted by work Amanieu wanted to do in bytecodealliance/regalloc2#170. This PR establishes the guarantee that any block which has only one predecessor has no incoming block-params, and that makes it easier to avoid having any block-params on branches that have multiple outgoing destinations, since then only the critical edges have block-params and you can move them into the block that's created for splitting the critical edge.

I've done a Sightglass run comparing this PR against the local branch I have for eliminating block-params from branches with multiple destinations; that branch is based on this one. Measured by instructions retired during execution, all the usual benchmarks are a little faster on that branch. Spidermonkey is also slightly faster by CPU cycles on that branch.

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 312520.28 ± 107.30 (confidence = 99%)

  no-multi-branch-blockparams-73d4f890e.so is 1.02x to 1.02x faster than yes-all-constant-phis-27c1c1d74.so!

  [19877813 19878054.49 19878621] no-multi-branch-blockparams-73d4f890e.so
  [20190380 20190574.77 20191114] yes-all-constant-phis-27c1c1d74.so

execution :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 1068752.77 ± 254583.37 (confidence = 99%)

  yes-all-constant-phis-27c1c1d74.so is 1.01x to 1.01x faster than no-multi-branch-blockparams-73d4f890e.so!

  [100489713 101440920.46 104541932] no-multi-branch-blockparams-73d4f890e.so
  [99842117 100372167.69 100906771] yes-all-constant-phis-27c1c1d74.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 14461536.49 ± 29120.93 (confidence = 99%)

  no-multi-branch-blockparams-73d4f890e.so is 1.01x to 1.01x faster than yes-all-constant-phis-27c1c1d74.so!

  [2646345104 2646446038.97 2646651491] no-multi-branch-blockparams-73d4f890e.so
  [2660807100 2660907575.46 2661200553] yes-all-constant-phis-27c1c1d74.so

execution :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2435891.43 ± 1730336.92 (confidence = 99%)

  no-multi-branch-blockparams-73d4f890e.so is 1.00x to 1.00x faster than yes-all-constant-phis-27c1c1d74.so!

  [953247671 960026022.33 990561731] no-multi-branch-blockparams-73d4f890e.so
  [953674753 962461913.76 989849731] yes-all-constant-phis-27c1c1d74.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 148464.74 ± 0.20 (confidence = 99%)

  no-multi-branch-blockparams-73d4f890e.so is 1.00x to 1.00x faster than yes-all-constant-phis-27c1c1d74.so!

  [227286598 227286599.79 227286602] no-multi-branch-blockparams-73d4f890e.so
  [227435063 227435064.53 227435066] yes-all-constant-phis-27c1c1d74.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 164098.23 ± 128446.28 (confidence = 99%)

  yes-all-constant-phis-27c1c1d74.so is 1.00x to 1.00x faster than no-multi-branch-blockparams-73d4f890e.so!

  [424854236 425728699.98 426706876] no-multi-branch-blockparams-73d4f890e.so
  [424694001 425564601.75 426230047] yes-all-constant-phis-27c1c1d74.so

I'll do another Sightglass run comparing that branch to no-all-constant-phis-81a89169f.so (the parent of this PR) so we can discuss the performance of the combination.

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 16:31):

jameysharp commented on PR #8565:

Okay, I've compared no-all-constant-phis-81a89169f.so (the parent of this PR) with no-multi-branch-blockparams-73d4f890e.so (the follow-on change I mentioned). (Sightglass stripped the "no-" prefix off my filenames in the below output.)

Overall effect size is small in all cases. The biggest percentage change is for pulldown-cmark, where the combination is maybe 1% slower during execution by either instructions retired or CPU cycles. No significant difference in compilation.

<details><summary>pulldown-cmark Sightglass results</summary>

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 169617.17 ± 94.37 (confidence = 99%)

  all-constant-phis-81a89169f.so is 1.01x to 1.01x faster than multi-branch-blockparams-73d4f890e.so!

  [19708183 19708354.86 19708941] all-constant-phis-81a89169f.so
  [19877805 19877972.03 19878516] multi-branch-blockparams-73d4f890e.so

execution :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 36355.47 ± 19560.48 (confidence = 99%)

  all-constant-phis-81a89169f.so is 1.00x to 1.01x faster than multi-branch-blockparams-73d4f890e.so!

  [7450646 7532210.67 7757423] all-constant-phis-81a89169f.so
  [7493914 7568566.14 7749656] multi-branch-blockparams-73d4f890e.so

</details>

Spidermonkey is marginally faster during both compilation and execution, by instructions retired; no significant difference in CPU cycles.

<details><summary>spidermonkey Sightglass results</summary>

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 11971268.75 ± 27736.55 (confidence = 99%)

  multi-branch-blockparams-73d4f890e.so is 1.00x to 1.00x faster than all-constant-phis-81a89169f.so!

  [2658323222 2658418217.79 2658675495] all-constant-phis-81a89169f.so
  [2646347136 2646446949.04 2646669800] multi-branch-blockparams-73d4f890e.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 133883.56 ± 121584.27 (confidence = 99%)

  multi-branch-blockparams-73d4f890e.so is 1.00x to 1.00x faster than all-constant-phis-81a89169f.so!

  [425093370 425735710.00 426816063] all-constant-phis-81a89169f.so
  [424905839 425601826.44 426469074] multi-branch-blockparams-73d4f890e.so

</details>

And bz2 is slower by CPU cycles but faster by instructions retired, during execution. No significant difference during compilation.

<details><summary>bz2 Sightglass results</summary>

execution :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 286689.44 ± 172752.93 (confidence = 99%)

  all-constant-phis-81a89169f.so is 1.00x to 1.00x faster than multi-branch-blockparams-73d4f890e.so!

  [100185679 100876101.21 102588446] all-constant-phis-81a89169f.so
  [100330544 101162790.65 103125048] multi-branch-blockparams-73d4f890e.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 63871.05 ± 0.20 (confidence = 99%)

  multi-branch-blockparams-73d4f890e.so is 1.00x to 1.00x faster than all-constant-phis-81a89169f.so!

  [227350470 227350470.75 227350472] all-constant-phis-81a89169f.so
  [227286598 227286599.70 227286601] multi-branch-blockparams-73d4f890e.so

</details>

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 16:42):

cfallin commented on PR #8565:

Interesting; given the choice between judging performance by instructions-retired and judging performance by cycles, let's take the latter since it's what the user sees; so we have slower (pulldown-cmark), no change (spidermonkey), slower (bz2). Unfortunately it seems there's more to understand here. I share your (assumed) frustration that a clear and obvious optimization doesn't result in a speedup -- in fact I also implemented redundant-phi elimination in waffle yesterday just to see if it changed with wasm-producer-side vs native-side optimizations and got similar slowdowns. I suspect it means we have work to do in regalloc :-/


Last updated: Nov 22 2024 at 16:03 UTC