Stream: git-wasmtime

Topic: wasmtime / PR #2061 Aarch64 codegen quality: support more...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 05:54):

cfallin opened PR #2061 from aarch64-amode to main:

Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit iadds, optionally with uextend/sextend from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using perf stat to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts bz2 as follows:

pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed

In other words, this reduces instruction count by 4.1% on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 05:54):

cfallin requested bnjbvr and julian-seward1 for a review on PR #2061.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 05:54):

cfallin requested bnjbvr and julian-seward1 for a review on PR #2061.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 05:54):

cfallin updated PR #2061 from aarch64-amode to main:

Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit iadds, optionally with uextend/sextend from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using perf stat to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts bz2 as follows:

pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed

In other words, this reduces instruction count by 4.1% on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 05:55):

cfallin edited PR #2061 from aarch64-amode to main:

Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit iadds, optionally with uextend/sextend from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using perf stat to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts bz2 as follows:

pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed

In other words, this reduces instruction count by 4.1% on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:36):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:36):

julian-seward1 created PR Review Comment:

Yes. That sounds reasonable.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:41):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:41):

julian-seward1 created PR Review Comment:

Can this case, addends64.len() == 0 && <whatever> actually happen? Wouldn't it imply that whatever created this code knew that it would run in an environment where the max virtual address is 2^32 or some small multiple thereof ?
I guess that's implausible for the general 64-bit target case, but it's not per se invalid.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:42):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:42):

julian-seward1 created PR Review Comment:

And again here -- even more so. The VA is formed only from an immediate ?
Again, I guess that's OK. But -- at least -- do we have tests for this?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:43):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 14:43):

julian-seward1 created PR Review Comment:

wrapping neg needed here?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:18):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:55):

cfallin updated PR #2061 from aarch64-amode to main:

Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit iadds, optionally with uextend/sextend from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using perf stat to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts bz2 as follows:

pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed

In other words, this reduces instruction count by 4.1% on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:56):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 18:56):

cfallin created PR Review Comment:

Done!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:04):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:04):

cfallin created PR Review Comment:

Yup, there's a test (f10 in amodes.clif). It's conceivable that at least in a JIT scenario, one might want to bake an address into the compiled code (e.g., access a global flag somewhere). In any case, it's valid CLIF, so we should support it, I think.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:06):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:06):

cfallin created PR Review Comment:

Indeed, it's valid CLIF, so we need to either handle it here, or else define it as invalid and punt this problem to the CLIF verifier. It's conceivable that (as you say) the implementer might know better and use 32-bit pointers, so I don't think we can rule it out.

(I actually asserted that this case did not happen at first, but caught myself with a test-case I had previously written that used only a zero-extended 32-bit value as an address... so if nothing else, at least mischevous unit-test authors are known to do this.)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 25 2020 at 07:49):

julian-seward1 submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr created PR Review Comment:

nit: while let Some(input) = workqueue.pop()

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr created PR Review Comment:

nit: i thought this was a bug at first, because only the 4 opcodes mentioned here should be caught, and then saw the extra condition on Uextend/Sextend. Instead, would it make sense to split this match into two:

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr created PR Review Comment:

Fyi, the comment line size is 99 chars, so the previous wrapping seemed somewhat correct.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr created PR Review Comment:

maybe trace?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 13:32):

bnjbvr created PR Review Comment:

From just skimming this code, it's hard to believe that this last branch is hit when addends64.len() > 0. Could the whole series be reorganized as if addends64.len() > 0 { /* subchecks, etc. */ } else /*other subchecks*/{}?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:10):

cfallin updated PR #2061 from aarch64-amode to main:

Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit iadds, optionally with uextend/sextend from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using perf stat to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts bz2 as follows:

pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed

In other words, this reduces instruction count by 4.1% on bz2.

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin created PR Review Comment:

Ah, I had my editor set to 80 characters and probably just rewrapped this while nearby. Undid change, thanks.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin created PR Review Comment:

Good idea, thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:11):

cfallin created PR Review Comment:

Done.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:12):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:12):

cfallin created PR Review Comment:

Ah yes, it's much cleaner with that split -- thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:12):

cfallin submitted PR Review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:12):

cfallin created PR Review Comment:

Done.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 20:48):

cfallin merged PR #2061.


Last updated: Jan 24 2025 at 00:11 UTC