Stream: git-wasmtime

Topic: wasmtime / PR #2051 Aarch64 codegen quality: handle add-n...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:40):

cfallin opened PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2
```

which is just `w0 := w1 - 1`. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

```
    sub w0, w1, #-1
```

We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):

pre:

```
        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed
```

post:

```
        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed
```

In other words, a 2.1% redunction in instruction count on `bz2`.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

- [ ] This has been discussed in issue #..., or if not, please tell us why
  here.
- [ ] A short description of what this does, why it is needed; if the
  description becomes long, the matter should probably be discussed in an issue
  first.
- [ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
  If you don't know who could review this, please indicate so. The list of
  suggested reviewers on the right can help you.

Please ensure all communication adheres to the [code of
conduct](https://github.com/bytecodealliance/wasmtime/blob/master/CODE_OF_CONDUCT.md).
-->

~~~

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:40):

cfallin requested julian-seward1 for a review on PR #2051.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:40):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2
```

which is just `w0 := w1 - 1`. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

```
    sub w0, w1, #-1
```

We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):

pre:

```
        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed
```

post:

```
        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed
```

In other words, a 2.1% redunction in instruction count on `bz2`.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

- [ ] This has been discussed in issue #..., or if not, please tell us why
  here.
- [ ] A short description of what this does, why it is needed; if the
  description becomes long, the matter should probably be discussed in an issue
  first.
- [ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
  If you don't know who could review this, please indicate so. The list of
  suggested reviewers on the right can help you.

Please ensure all communication adheres to the [code of
conduct](https://github.com/bytecodealliance/wasmtime/blob/master/CODE_OF_CONDUCT.md).
-->

~~~

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:40):

cfallin edited PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #-1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% redunction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:41):

cfallin edited PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% redunction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:41):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% redunction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:42):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% redunction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2020 at 20:42):

cfallin edited PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2020 at 19:11):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 23 2020 at 17:12):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 04:14):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:26):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:33):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:41):

cfallin updated PR #2051 from aarch64-add-negative-imm to main:

We often see patterns like:

    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2

which is just w0 := w1 - 1. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

    sub w0, w1, #1

We see this pattern in e.g. bz2, where this commit makes the following
difference (counting instructions with perf stat, filling in the
wasmtime cache first then running again to get just runtime):

pre:

        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed

post:

        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed

In other words, a 2.1% reduction in instruction count on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:26):

cfallin merged PR #2051.


Last updated: Jan 24 2025 at 00:11 UTC