afonso360 commented on issue #5546:
Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.
Yeah, that sounds like a good idea, I'll try that!
The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.
:+1:, I'll add those in this PR if that's OK with you.
Is there an equivalent floating-point rewrite? NaNs make my head hurt.
I'm not sure,
I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.
I don't expect any benefits for x86 since I think right now we lower
min
/max
ascmp
+cmov*
which is the same as the unoptimized form, but we can try it. Although we might see some benefits on thevselect
version (maybe?)./bench_x64
afonso360 edited a comment on issue #5546:
Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.
Yeah, that sounds like a good idea, I'll try that!
The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.
:+1:, I'll add those in this PR if that's OK with you.
Is there an equivalent floating-point rewrite? NaNs make my head hurt.
I'm not sure, I'll try to look into it.
I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.
I don't expect any benefits for x86 since I think right now we lower
min
/max
ascmp
+cmov*
which is the same as the unoptimized form, but we can try it. Although we might see some benefits on thevselect
version (maybe?)./bench_x64
jlb6740 commented on issue #5546:
instantiation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[160490 179150.32 293708] commit.so
[155924 172967.04 189250] main.socompilation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[202562282 217920659.92 241944548] commit.so
[199991838 212302882.08 235356036] main.soinstantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[186500 203277.04 247336] commit.so
[186216 207677.60 249312] main.socompilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[293540152 337751307.12 374158094] commit.so
[295516870 341031354.56 393946812] main.socompilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[6965675788 7114549484.48 7213034856] commit.so
[6955473648 7075012926.00 7165106004] main.soinstantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[488512 504472.32 573802] commit.so
[488122 501985.20 525900] main.soexecution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[8327116 8476234.16 8683322] commit.so
[8339612 8512672.24 8750996] main.soexecution :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[111521870 115012543.92 117344984] commit.so
[111540296 114710730.08 122547780] main.soexecution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[1024608234 1038058547.20 1067188634] commit.so
[1027357456 1037305556.48 1048046760] main.so
afonso360 commented on issue #5546:
I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal
icmp eq
. I've left it as a separate commit so that its easier to investigate.I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.
This is what we get now:
test optimize set use_egraphs=true target x86_64 function %icmp_eq(i32, i32) -> i8 { block0(v0: i32, v1: i32): v2 = icmp eq v0, v1 return v2 } ; check: block0(v0: i32, v1: i32): ; check: v8 = icmp eq v0, v1 ; check: v9 -> v8 ; check: v10 -> v9 ; check: v11 -> v10 ; check: v12 -> v11 ; check: v13 -> v12 ; check: v14 -> v13 ; check: return v8
Is there an equivalent floating-point rewrite? NaNs make my head hurt.
So, I don't think there is an exact match for the full
fcmp
space, but thef{min,max}_pseudo
instructions are an defined in terms offcmp {lt,gt} + select
, so we can at least get those.
I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...
<details>
<summary>Sightglass SIMD benchmark results</summary>Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm ` compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [73345320 95091825.36 287267718] main.so [70722628 101732805.98 299985933] select.so instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [168256 216817.60 382976] main.so [177600 220696.00 344096] select.so instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [133184 167869.12 416352] main.so [132128 165781.12 341184] select.so compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [709243689 815567491.90 1073223272] main.so [729598384 821242801.38 1027388168] select.so execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [3855579532 3914533632.01 4071619781] main.so [3855359377 3917102361.59 4042283617] select.so execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [439936 513683.52 823392] main.so [439488 513457.28 862848] select.so
</details>
/bench_x64
afonso360 edited a comment on issue #5546:
I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal
icmp eq
. I've left it as a separate commit so that its easier to investigate.I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.
This is what we get now:
test optimize set use_egraphs=true target x86_64 function %icmp_eq(i32, i32) -> i8 { block0(v0: i32, v1: i32): v2 = icmp eq v0, v1 return v2 } ; check: block0(v0: i32, v1: i32): ; check: v8 = icmp eq v0, v1 ; check: v9 -> v8 ; check: v10 -> v9 ; check: v11 -> v10 ; check: v12 -> v11 ; check: v13 -> v12 ; check: v14 -> v13 ; check: return v8
Is there an equivalent floating-point rewrite? NaNs make my head hurt.
So, I don't think there is an exact match for the full
fcmp
space, but thef{min,max}_pseudo
instructions are defined in terms offcmp {lt,gt} + select
, so we can at least get those.
I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...
<details>
<summary>Sightglass SIMD benchmark results</summary>Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm ` compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [73345320 95091825.36 287267718] main.so [70722628 101732805.98 299985933] select.so instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [168256 216817.60 382976] main.so [177600 220696.00 344096] select.so instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [133184 167869.12 416352] main.so [132128 165781.12 341184] select.so compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [709243689 815567491.90 1073223272] main.so [729598384 821242801.38 1027388168] select.so execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [3855579532 3914533632.01 4071619781] main.so [3855359377 3917102361.59 4042283617] select.so execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [439936 513683.52 823392] main.so [439488 513457.28 862848] select.so
</details>
/bench_x64
afonso360 edited a comment on issue #5546:
I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal
icmp eq
. I've left it as a separate commit so that its easier to investigate.I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.
This is what we get now:
test optimize set use_egraphs=true target x86_64 function %icmp_eq(i32, i32) -> i8 { block0(v0: i32, v1: i32): v2 = icmp eq v0, v1 return v2 } ; check: block0(v0: i32, v1: i32): ; check: v8 = icmp eq v0, v1 ; check: v9 -> v8 ; check: v10 -> v9 ; check: v11 -> v10 ; check: v12 -> v11 ; check: v13 -> v12 ; check: v14 -> v13 ; check: return v8
Is there an equivalent floating-point rewrite? NaNs make my head hurt.
So, I don't think there is an exact match for the full
fcmp
space, but thef{min,max}_pseudo
instructions are defined in terms offcmp {lt,gt} + select
, so we can at least get those.We have some docs for those instructions, but they are better described in the Webassembly PR that introduces them
I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...
<details>
<summary>Sightglass SIMD benchmark results</summary>Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm ` compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [73345320 95091825.36 287267718] main.so [70722628 101732805.98 299985933] select.so instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [168256 216817.60 382976] main.so [177600 220696.00 344096] select.so instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [133184 167869.12 416352] main.so [132128 165781.12 341184] select.so compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [709243689 815567491.90 1073223272] main.so [729598384 821242801.38 1027388168] select.so execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [3855579532 3914533632.01 4071619781] main.so [3855359377 3917102361.59 4042283617] select.so execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [439936 513683.52 823392] main.so [439488 513457.28 862848] select.so
</details>
/bench_x64
jlb6740 commented on issue #5546:
instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[195254 206675.84 239848] commit.so
[185986 200724.96 227526] main.soinstantiation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[158588 177232.96 200728] commit.so
[153838 181267.68 359084] main.socompilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[310637436 342353446.56 390973658] commit.so
[295484178 336498505.76 366404166] main.socompilation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[205285250 220044999.92 262015540] commit.so
[202251150 217512094.00 237476954] main.soexecution :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[111173460 114115222.96 119495454] commit.so
[111448476 114811907.92 115933528] main.socompilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[6940653952 7096700981.20 7212066130] commit.so
[6942954936 7124108036.08 7229159218] main.soinstantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[478890 506487.60 662056] commit.so
[485120 504785.28 608956] main.soexecution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[1023014568 1034258955.84 1055892690] commit.so
[1019706902 1030842327.04 1044519376] main.soexecution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[8352392 8496849.04 8691182] commit.so
[8360734 8482021.60 8626984] main.so
cfallin commented on issue #5546:
It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!
Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default? (I'm cooking up a PR to flip that, but it's not done yet.)
afonso360 commented on issue #5546:
It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!
It was also really intuitive to get to this solution which is awesome!
Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?
Oh! I completely forgot about that! :big_smile:
I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do
max
/min
as a single instruction.The
vselect
ones I think might make a difference? Unless existing toolchains already perform this optimization which is somewhat likely.
afonso360 edited a comment on issue #5546:
It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!
It was also really intuitive to get to this solution which is awesome!
Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?
Oh! I completely forgot about that! :big_smile:
I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do
max
/min
as a single instruction.The
vselect
ones I think might make a difference? Unless existing toolchains already perform this optimization when producing wasm which is somewhat likely.
EdorianDark commented on issue #5546:
Egraph based optimizations are now on by default, so I think an rebase should be enough to test if something changes.
bjorn3 commented on issue #5546:
Egraph optimizations are on by default if optimizations are on. Do benchmarks run with optimizations enabled?
afonso360 commented on issue #5546:
Do benchmarks run with optimizations enabled?
I think optimizations are enabled by default, at least on the wasmtime cli that is the case. Although I'm not entirely sure about sightglass.
Rebased this on main and ran sightglass on AArch64, doesen't seem to have much of an impact. I wonder if the results would be any different for something like
cg_clif
that doesen't run on already optimized code.
Unfortunately I can't run this on RISCV which is where I would hope to see some positive effect.<details>
<summary>Scalar Benchmarks AArch64</summary>ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 3282352.64 ± 1066807.74 (confidence = 99%) main.so is 1.02x to 1.04x faster than select.so! [111047763 118113547.11 123328519] main.so [115034752 121395899.75 128445997] select.so execution :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 6938.23 ± 1941.66 (confidence = 99%) select.so is 1.01x to 1.01x faster than main.so! [876198 884956.45 902123] main.so [869354 878018.22 901740] select.so instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [2849 3260.40 12227] main.so [2804 3148.16 4268] select.so compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [4381352 5597680.00 6817168] main.so [4465781 5715523.09 7002568] select.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [2361 2621.85 4044] main.so [2284 2575.55 3465] select.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [2440354 2586186.45 3163455] main.so [2460769 2621895.24 3155162] select.so execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [69303 75400.87 94495] main.so [70033 76346.64 97401] select.so instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [5533 6360.26 18487] main.so [5558 6418.39 19011] select.so execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [8367754 8415382.47 8499654] main.so [8377518 8420745.48 8547947] select.so
</details>
<details>
<summary>SIMD Benchmarks AArch64</summary>ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm Δ = 203985.92 ± 8663.93 (confidence = 99%) select.so is 1.00x to 1.00x faster than main.so! [45561021 45611295.11 45720621] main.so [45363259 45407309.19 45486165] select.so execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [3472 4728.82 13547] main.so [3477 4261.65 15734] select.so instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [1749 2124.65 3341] main.so [1724 2101.14 3474] select.so compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [5448821 5745023.23 6331575] main.so [5487095 5779248.67 6320724] select.so compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [526706 615112.06 1567920] main.so [536207 613883.57 1149844] select.so instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm No difference in performance. [2246 2495.36 3522] main.so [2126 2497.00 4032] select.so
</details>
jameysharp commented on issue #5546:
Looks like s390x doesn't have lowerings for
umin
/umax
on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand
uweigand commented on issue #5546:
Looks like s390x doesn't have lowerings for
umin
/umax
on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigandWell, we don't have any instructions for
umin
/umax
on scalars, so the only possible implementation would be to turn it right back into (the equivalent of)select
+icmp
... Is this what we're supposed to be doing here?
afonso360 commented on issue #5546:
I think that's what x86 does, as well.
uweigand commented on issue #5546:
I've now opened https://github.com/bytecodealliance/wasmtime/pull/5762 to add support for these clif instructions on s390x.
Last updated: Dec 23 2024 at 12:05 UTC