Stream: git-wasmtime

Topic: wasmtime / PR #8785 cranelift: Delete scalar `{u,s}loadNN...


Wasmtime GitHub notifications bot (Jun 12 2024 at 18:42):

afonso360 requested elliottt for a review on PR #8785.

Wasmtime GitHub notifications bot (Jun 12 2024 at 18:42):

afonso360 requested wasmtime-compiler-reviewers for a review on PR #8785.

Wasmtime GitHub notifications bot (Jun 12 2024 at 18:42):

afonso360 opened PR #8785 from afonso360:delete-special-mem to bytecodealliance:main:

:wave: Hey,

This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion in today's meeting regarding this.

I haven't done any benchmarking, but if someone wants to take a stab at it, this PR should be pretty much done for all backends except s390x.

CC: https://github.com/bytecodealliance/wasmtime/issues/6056

Wasmtime GitHub notifications bot (Jun 12 2024 at 18:42):

afonso360 requested wasmtime-core-reviewers for a review on PR #8785.

Wasmtime GitHub notifications bot (Jun 12 2024 at 18:43):

afonso360 edited PR #8785:

:wave: Hey,

This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion in today's meeting regarding this. (Unfortunately the meetings now clash with some other stuff and I'm unable to attend.)

I haven't done any benchmarking, but if someone wants to take a stab at it, this PR should be pretty much done for all backends except s390x.

CC: https://github.com/bytecodealliance/wasmtime/issues/6056

Wasmtime GitHub notifications bot (Jun 12 2024 at 18:51):

afonso360 edited PR #8785:

:wave: Hey,

This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion in today's meeting regarding this. (Unfortunately the meetings now clash with some other stuff and I'm unable to attend.)

I haven't done any benchmarking, and I just noticed there are some cases where we are not correctly fusing the load+extend. Once that is working correctly I'll try to run sightglass on it.

CC: https://github.com/bytecodealliance/wasmtime/issues/6056

Wasmtime GitHub notifications bot (Jun 12 2024 at 20:14):

alexcrichton commented on PR #8785:

For context, I brought this up in the Cranelift meeting today because I was curious about the genesis of these instructions. My assumption was that they were added in the belief that the load+extend and ireduce+store patterns would be so common that having fused ops in Cranelift would reduce the size of the resident IR during compilation, and perhaps bring other memory or compile-time benefits. The conclusion, though, was that while this was probably the intended purpose of the instructions, no one was aware of any benchmarking, one way or the other, that showed the impact.

I think it'd be worthwhile to put this through sightglass to ensure there's not, for example, a 10% slowdown compiling spidermonkey, but otherwise I think that, in the abstract, we're all for cleanups that simplify the IR.

Wasmtime GitHub notifications bot (Jun 12 2024 at 23:15):

afonso360 commented on PR #8785:

I ran some quick benchmarks on this. It doesn't seem to make much of a difference, except for spidermonkey, where there is a slight compile-time regression.

I should also note that this version suffers from the issue described in https://github.com/bytecodealliance/wasmtime/issues/8787, but it looks like that doesn't affect larger programs too much? It does affect the test cases, which slightly bothers me.

<details>
<summary>Sightglass results</summary>

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/sightglass-cli benchmark --engine ./main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 2 -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm`
.

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 250476630.15 ± 157064794.89 (confidence = 99%)

  main.so is 1.01x to 1.03x faster than delete-special-mem.so!

  [14521885186 14938969886.45 15446520641] delete-special-mem.so
  [14558278944 14688493256.30 15396563040] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [528672 687515.20 1726784] delete-special-mem.so
  [528352 761440.00 2136608] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [200704 226676.80 266368] delete-special-mem.so
  [196128 242470.40 326496] main.so

instantiation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [317088 364465.60 469152] delete-special-mem.so
  [314976 380755.20 560736] main.so

compilation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [590328960 645099580.55 788561440] delete-special-mem.so
  [603675135 673634577.60 887706080] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [124808576 134569985.05 199907890] delete-special-mem.so
  [122697376 130622858.45 157331714] main.so

instantiation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [513856 599899.20 1005408] delete-special-mem.so
  [459328 615880.00 1086016] main.so

compilation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [1356884384 1438567822.70 1639913438] delete-special-mem.so
  [1371897536 1467483544.05 1656487360] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [260681040 299165913.20 481644877] delete-special-mem.so
  [261796512 294465843.00 491744160] main.so

execution :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [9724864 10999916.80 13853504] delete-special-mem.so
  [10061056 10919562.45 12255328] main.so

execution :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [284849472 298366723.55 319585712] delete-special-mem.so
  [281981504 296739732.85 332067262] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [1437751471 1491664753.05 1615952144] delete-special-mem.so
  [1434106066 1487088603.50 1577413794] main.so

</details>

Wasmtime GitHub notifications bot (Jun 13 2024 at 01:17):

cfallin commented on PR #8785:

Thanks for running those benchmarks! IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change (others may disagree, happy to discuss; in the CL meeting this morning I gave a clean 1% as an example of a number I'd personally be fine with, 10% as an example clearly not, 3% is somewhere in the middle).

I wonder if it gets any better if we do handle merging better, along the lines of #8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?
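
As a worked check on the "1-3%" figure: the `1.01x to 1.03x` range reported for spidermonkey compilation is consistent with dividing the Δ ± interval by the mean of the faster engine. The sketch below assumes that reading (it is not taken from sightglass's source), and `speedup_range` is a hypothetical helper name used only for illustration.

```python
def speedup_range(delta, half_width, faster_mean):
    """Return (low, high) speedup factors of the faster engine.

    Assumption: the factor is 1 + (delta -/+ half_width) / faster_mean,
    where faster_mean is the mean measurement of the faster engine.
    """
    low = 1 + (delta - half_width) / faster_mean
    high = 1 + (delta + half_width) / faster_mean
    return low, high


# Spidermonkey compilation, cycles, copied from the sightglass output above.
delta = 250476630.15          # difference of the two means
half_width = 157064794.89     # the "±" part at 99% confidence
main_mean = 14688493256.30    # mean for main.so (the faster engine)
branch_mean = 14938969886.45  # mean for delete-special-mem.so

# Sanity check: the reported delta is the difference of the two means.
assert abs((branch_mean - main_mean) - delta) < 0.01

low, high = speedup_range(delta, half_width, main_mean)
print(f"main.so is {low:.2f}x to {high:.2f}x faster")
```

Under this reading the lower bound is only about 0.6%, which rounds up to `1.01x`, so the interval is wide relative to the effect being measured.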

Wasmtime GitHub notifications bot (Jun 13 2024 at 20:31):

afonso360 commented on PR #8785:

> IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change

Yeah, I don't think it's worth it if we have this regression.

> I wonder if it gets any better if we do handle merging better, along the lines of https://github.com/bytecodealliance/wasmtime/issues/8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?

Maybe, I'm not too familiar with how we do elaboration on egraphs so it might be slightly harder for me to pick that up.

Reading https://github.com/bytecodealliance/wasmtime/issues/6154 got me interested in the jump-threading pass idea, even though I think moving the extends next to the loads would be more effective in this case. I probably won't have time to look at it now but maybe later.

Wasmtime GitHub notifications bot (Jun 17 2024 at 13:30):

afonso360 commented on PR #8785:

By the way, I was doing some unrelated benchmarking and found out that sightglass is not consistent enough on my machine to give me confidence that the result above is real / meaningful.

I'm going to try to run the benchmarks again on a more stable setup when I get some time, although I suspect it's probably going to amount to the same.

Wasmtime GitHub notifications bot (Jun 17 2024 at 17:54):

fitzgen commented on PR #8785:

@afonso360 you could try measuring instructions retired instead of cycles by passing something like `-m insts-retired` to sightglass. It should be much more consistent, and this change isn't the kind of thing where instructions retired is a poor proxy for performance (as opposed to when measuring changes that are intended to improve cache locality, for example).

Wasmtime GitHub notifications bot (Jun 20 2024 at 13:21):

afonso360 closed without merge PR #8785.

Wasmtime GitHub notifications bot (Jun 20 2024 at 13:21):

afonso360 commented on PR #8785:

I re-ran the benchmarks with `-m insts-retired` and some of the steps in `cpu-isolation.md`. It's still not 100% reproducible across runs, but it's a lot better.

However, the results are now completely different from the results above.

<details>

<summary>Sightglass Results</summary>

afonso@fedora:~/git/sightglass$ taskset --cpu-list 3 cargo run -- benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.37s
     Running `target/debug/sightglass-cli benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm`

execution :: instructions-retired :: benchmarks/regex/benchmark.wasm

  Δ = 53721614.20 ± 3668.44 (confidence = 99%)

  -main.so is 1.07x to 1.07x faster than .so!

  [737583027 737586621.20 737590770] -main.so
  [791305114 791308235.40 791313975] .so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 15985756.30 ± 2.10 (confidence = 99%)

  -main.so is 1.07x to 1.07x faster than .so!

  [227672476 227672479.20 227672481] -main.so
  [243658234 243658235.50 243658239] .so

instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 2971.00 ± 2002.10 (confidence = 99%)

  -main.so is 1.02x to 1.11x faster than .so!

  [43088 45173.90 49561] -main.so
  [46621 48144.90 49943] .so

execution :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 976271.20 ± 336.81 (confidence = 99%)

  -main.so is 1.05x to 1.05x faster than .so!

  [19771797 19771883.20 19772465] -main.so
  [20747941 20748154.40 20748599] .so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 29346100.10 ± 77173.30 (confidence = 99%)

  -main.so is 1.01x to 1.01x faster than .so!

  [2660295835 2660354378.00 2660420572] -main.so
  [2689612009 2689700478.10 2689780017] .so

compilation :: instructions-retired :: benchmarks/regex/benchmark.wasm

  Δ = 80572.40 ± 75315.03 (confidence = 99%)

  -main.so is 1.00x to 1.00x faster than .so!

  [37695341 37800338.60 37888633] -main.so
  [37747142 37880911.00 37960870] .so

instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [46462 71324.50 168586] -main.so
  [46932 47315.60 47940] .so

instantiation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [37923 60559.60 69184] -main.so
  [63367 68729.70 72940] .so

instantiation :: instructions-retired :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [35647 36528.00 37098] -main.so
  [35728 41068.30 60408] .so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [4079658 4110764.30 4134315] -main.so
  [4079041 4097484.70 4118287] .so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [374729187 375180105.70 375606520] -main.so
  [375041594 375590062.10 376019380] .so

compilation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [15772628 15798740.50 15873683] -main.so
  [15784192 15808475.30 15841131] .so

</details>

Weirdly, there is no longer any compilation difference, but there are now execution differences.

I'm going to close this for now; however, I'm planning to redo these benchmarks once we have a jump-threading pass, which I'm currently working on.
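
To illustrate why the instruction counts were easier to compare than the cycle counts, the following sketch computes the relative spread of two baseline samples quoted in this thread. It assumes each bracketed triple in the sightglass output is `[min mean max]`; `relative_spread` is a hypothetical helper name. The two runs also differ in CPU pinning and process count, so the comparison is only indicative.

```python
def relative_spread(minimum, mean, maximum):
    """Width of the observed range as a fraction of the mean."""
    return (maximum - minimum) / mean


# Spidermonkey compilation on the baseline engine, copied from the two runs.
# Assumption: each triple is (min, mean, max).
samples = {
    "cycles, first run": (14558278944, 14688493256.30, 15396563040),
    "instructions retired, second run": (374729187, 375180105.70, 375606520),
}

for label, (lo, mean, hi) in samples.items():
    print(f"{label}: {relative_spread(lo, mean, hi):.2%} spread")
```

With these numbers the cycle measurements vary by several percent of the mean, while the instruction counts vary by well under one percent, which is in line with the suggestion to measure instructions retired.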

Wasmtime GitHub notifications bot (Jun 20 2024 at 13:22):

afonso360 edited a comment on PR #8785:

I'm going to close this for now; however, I'm planning to redo these benchmarks once we have a jump-threading pass, which I'm currently working on (slowly).


Last updated: Jan 24 2025 at 00:11 UTC