Stream: git-wasmtime

Topic: wasmtime / issue #5546 cranelift: Optimize `select+icmp` ...


view this post on Zulip Wasmtime GitHub notifications bot (Jan 09 2023 at 21:48):

afonso360 commented on issue #5546:

Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.

Yeah, that sounds like a good idea, I'll try that!

The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.

:+1:, I'll add those in this PR if that's OK with you.

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'm not sure,

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

I don't expect any benefits for x86 since I think right now we lower min/max as cmp+cmov* which is the same as the unoptimized form, but we can try it. Although we might see some benefits on the vselect version (maybe?).

/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Jan 09 2023 at 21:49):

afonso360 edited a comment on issue #5546:

Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.

Yeah, that sounds like a good idea, I'll try that!

The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.

:+1:, I'll add those in this PR if that's OK with you.

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'm not sure, I'll try to look into it.

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

I don't expect any benefits for x86 since I think right now we lower min/max as cmp+cmov* which is the same as the unoptimized form, but we can try it. Although we might see some benefits on the vselect version (maybe?).

/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Jan 09 2023 at 21:56):

jlb6740 commented on issue #5546:

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[160490 179150.32 293708] commit.so
[155924 172967.04 189250] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[202562282 217920659.92 241944548] commit.so
[199991838 212302882.08 235356036] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[186500 203277.04 247336] commit.so
[186216 207677.60 249312] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[293540152 337751307.12 374158094] commit.so
[295516870 341031354.56 393946812] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6965675788 7114549484.48 7213034856] commit.so
[6955473648 7075012926.00 7165106004] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[488512 504472.32 573802] commit.so
[488122 501985.20 525900] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8327116 8476234.16 8683322] commit.so
[8339612 8512672.24 8750996] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111521870 115012543.92 117344984] commit.so
[111540296 114710730.08 122547780] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1024608234 1038058547.20 1067188634] commit.so
[1027357456 1037305556.48 1048046760] main.so

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 17:59):

afonso360 commented on issue #5546:

I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal icmp eq. I've left it as a separate commit so that its easier to investigate.

I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.

This is what we get now:

test optimize
set use_egraphs=true
target x86_64

function %icmp_eq(i32, i32) -> i8 {
block0(v0: i32, v1: i32):
    v2 = icmp eq v0, v1
    return v2
}

; check: block0(v0: i32, v1: i32):
; check:    v8 = icmp eq v0, v1
; check:    v9 -> v8
; check:    v10 -> v9
; check:    v11 -> v10
; check:    v12 -> v11
; check:    v13 -> v12
; check:    v14 -> v13
; check:    return v8

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

So, I don't think there is an exact match for the full fcmp space, but the f{min,max}_pseudo instructions are an defined in terms of fcmp {lt,gt} + select, so we can at least get those.


I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...

<details>
<summary>Sightglass SIMD benchmark results</summary>

     Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm
`

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [73345320 95091825.36 287267718] main.so
  [70722628 101732805.98 299985933] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [168256 216817.60 382976] main.so
  [177600 220696.00 344096] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [133184 167869.12 416352] main.so
  [132128 165781.12 341184] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [709243689 815567491.90 1073223272] main.so
  [729598384 821242801.38 1027388168] select.so

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [3855579532 3914533632.01 4071619781] main.so
  [3855359377 3917102361.59 4042283617] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [439936 513683.52 823392] main.so
  [439488 513457.28 862848] select.so

</details>

/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 17:59):

afonso360 edited a comment on issue #5546:

I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal icmp eq. I've left it as a separate commit so that its easier to investigate.

I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.

This is what we get now:

test optimize
set use_egraphs=true
target x86_64

function %icmp_eq(i32, i32) -> i8 {
block0(v0: i32, v1: i32):
    v2 = icmp eq v0, v1
    return v2
}

; check: block0(v0: i32, v1: i32):
; check:    v8 = icmp eq v0, v1
; check:    v9 -> v8
; check:    v10 -> v9
; check:    v11 -> v10
; check:    v12 -> v11
; check:    v13 -> v12
; check:    v14 -> v13
; check:    return v8

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

So, I don't think there is an exact match for the full fcmp space, but the f{min,max}_pseudo instructions are defined in terms of fcmp {lt,gt} + select, so we can at least get those.


I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...

<details>
<summary>Sightglass SIMD benchmark results</summary>

     Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm
`

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [73345320 95091825.36 287267718] main.so
  [70722628 101732805.98 299985933] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [168256 216817.60 382976] main.so
  [177600 220696.00 344096] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [133184 167869.12 416352] main.so
  [132128 165781.12 341184] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [709243689 815567491.90 1073223272] main.so
  [729598384 821242801.38 1027388168] select.so

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [3855579532 3914533632.01 4071619781] main.so
  [3855359377 3917102361.59 4042283617] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [439936 513683.52 823392] main.so
  [439488 513457.28 862848] select.so

</details>

/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 18:01):

afonso360 edited a comment on issue #5546:

I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal icmp eq. I've left it as a separate commit so that its easier to investigate.

I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.

This is what we get now:

test optimize
set use_egraphs=true
target x86_64

function %icmp_eq(i32, i32) -> i8 {
block0(v0: i32, v1: i32):
    v2 = icmp eq v0, v1
    return v2
}

; check: block0(v0: i32, v1: i32):
; check:    v8 = icmp eq v0, v1
; check:    v9 -> v8
; check:    v10 -> v9
; check:    v11 -> v10
; check:    v12 -> v11
; check:    v13 -> v12
; check:    v14 -> v13
; check:    return v8

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

So, I don't think there is an exact match for the full fcmp space, but the f{min,max}_pseudo instructions are defined in terms of fcmp {lt,gt} + select, so we can at least get those.

We have some docs for those instructions, but they are better described in the Webassembly PR that introduces them


I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...

<details>
<summary>Sightglass SIMD benchmark results</summary>

     Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm
`

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [73345320 95091825.36 287267718] main.so
  [70722628 101732805.98 299985933] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [168256 216817.60 382976] main.so
  [177600 220696.00 344096] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [133184 167869.12 416352] main.so
  [132128 165781.12 341184] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [709243689 815567491.90 1073223272] main.so
  [729598384 821242801.38 1027388168] select.so

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [3855579532 3914533632.01 4071619781] main.so
  [3855359377 3917102361.59 4042283617] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [439936 513683.52 823392] main.so
  [439488 513457.28 862848] select.so

</details>

/bench_x64

view this post on Zulip Wasmtime GitHub notifications bot (Jan 15 2023 at 18:06):

jlb6740 commented on issue #5546:

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[195254 206675.84 239848] commit.so
[185986 200724.96 227526] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[158588 177232.96 200728] commit.so
[153838 181267.68 359084] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[310637436 342353446.56 390973658] commit.so
[295484178 336498505.76 366404166] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[205285250 220044999.92 262015540] commit.so
[202251150 217512094.00 237476954] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111173460 114115222.96 119495454] commit.so
[111448476 114811907.92 115933528] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6940653952 7096700981.20 7212066130] commit.so
[6942954936 7124108036.08 7229159218] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[478890 506487.60 662056] commit.so
[485120 504785.28 608956] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1023014568 1034258955.84 1055892690] commit.so
[1019706902 1030842327.04 1044519376] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8352392 8496849.04 8691182] commit.so
[8360734 8482021.60 8626984] main.so

view this post on Zulip Wasmtime GitHub notifications bot (Jan 17 2023 at 19:55):

cfallin commented on issue #5546:

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default? (I'm cooking up a PR to flip that, but it's not done yet.)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 18 2023 at 15:43):

afonso360 commented on issue #5546:

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

It was also really intuitive to get to this solution which is awesome!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?

Oh! I completely forgot about that! :big_smile:

I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do max/min as a single instruction.

The vselect ones I think might make a difference? Unless existing toolchains already perform this optimization which is somewhat likely.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 18 2023 at 15:44):

afonso360 edited a comment on issue #5546:

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

It was also really intuitive to get to this solution which is awesome!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?

Oh! I completely forgot about that! :big_smile:

I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do max/min as a single instruction.

The vselect ones I think might make a difference? Unless existing toolchains already perform this optimization when producing wasm which is somewhat likely.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 29 2023 at 12:02):

EdorianDark commented on issue #5546:

Egraph based optimizations are now on by default, so I think an rebase should be enough to test if something changes.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 29 2023 at 16:34):

bjorn3 commented on issue #5546:

Egraph optimizations are on by default if optimizations are on. Do benchmarks run with optimizations enabled?

view this post on Zulip Wasmtime GitHub notifications bot (Jan 29 2023 at 19:03):

afonso360 commented on issue #5546:

Do benchmarks run with optimizations enabled?

I think optimizations are enabled by default, at least on the wasmtime cli that is the case. Although I'm not entirely sure about sightglass.

Rebased this on main and ran sightglass on AArch64, doesen't seem to have much of an impact. I wonder if the results would be any different for something like cg_clif that doesen't run on already optimized code.
Unfortunately I can't run this on RISCV which is where I would hope to see some positive effect.

<details>
<summary>Scalar Benchmarks AArch64</summary>

ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3282352.64 ± 1066807.74 (confidence = 99%)

  main.so is 1.02x to 1.04x faster than select.so!

  [111047763 118113547.11 123328519] main.so
  [115034752 121395899.75 128445997] select.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 6938.23 ± 1941.66 (confidence = 99%)

  select.so is 1.01x to 1.01x faster than main.so!

  [876198 884956.45 902123] main.so
  [869354 878018.22 901740] select.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [2849 3260.40 12227] main.so
  [2804 3148.16 4268] select.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4381352 5597680.00 6817168] main.so
  [4465781 5715523.09 7002568] select.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2361 2621.85 4044] main.so
  [2284 2575.55 3465] select.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2440354 2586186.45 3163455] main.so
  [2460769 2621895.24 3155162] select.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [69303 75400.87 94495] main.so
  [70033 76346.64 97401] select.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [5533 6360.26 18487] main.so
  [5558 6418.39 19011] select.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [8367754 8415382.47 8499654] main.so
  [8377518 8420745.48 8547947] select.so

</details>

<details>
<summary>SIMD Benchmarks AArch64</summary>

ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  Δ = 203985.92 ± 8663.93 (confidence = 99%)

  select.so is 1.00x to 1.00x faster than main.so!

  [45561021 45611295.11 45720621] main.so
  [45363259 45407309.19 45486165] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [3472 4728.82 13547] main.so
  [3477 4261.65 15734] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [1749 2124.65 3341] main.so
  [1724 2101.14 3474] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [5448821 5745023.23 6331575] main.so
  [5487095 5779248.67 6320724] select.so

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [526706 615112.06 1567920] main.so
  [536207 613883.57 1149844] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [2246 2495.36 3522] main.so
  [2126 2497.00 4032] select.so

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Feb 09 2023 at 21:16):

jameysharp commented on issue #5546:

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

view this post on Zulip Wasmtime GitHub notifications bot (Feb 10 2023 at 12:15):

uweigand commented on issue #5546:

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

Well, we don't have any instructions for umin/umax on scalars, so the only possible implementation would be to turn it right back into (the equivalent of) select + icmp ... Is this what we're supposed to be doing here?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 10 2023 at 12:24):

afonso360 commented on issue #5546:

I think that's what x86 does, as well.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 10 2023 at 15:28):

uweigand commented on issue #5546:

I've now opened https://github.com/bytecodealliance/wasmtime/pull/5762 to add support for these clif instructions on s390x.


Last updated: Jan 24 2025 at 00:11 UTC