cfallin opened PR #2061 from aarch64-amode
to main
:
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bitiadd
s, optionally withuextend
/sextend
from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.Using
perf stat
to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impactsbz2
as follows:pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed
In other words, this reduces instruction count by 4.1% on
bz2
.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
cfallin requested bnjbvr and julian-seward1 for a review on PR #2061.
cfallin requested bnjbvr and julian-seward1 for a review on PR #2061.
cfallin updated PR #2061 from aarch64-amode
to main
:
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bitiadd
s, optionally withuextend
/sextend
from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.Using
perf stat
to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impactsbz2
as follows:pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed
In other words, this reduces instruction count by 4.1% on
bz2
.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
cfallin edited PR #2061 from aarch64-amode
to main
:
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bitiadd
s, optionally withuextend
/sextend
from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.Using
perf stat
to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impactsbz2
as follows:pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed
In other words, this reduces instruction count by 4.1% on
bz2
.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Yes. That sounds reasonable.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Can this case,
addends64.len() == 0 && <whatever>
actually happen? Wouldn't it imply that whatever created this code knew that it would run in an environment where the max virtual address is 2^32 or some small multiple thereof ?
I guess that's implausible for the general 64-bit target case, but it's not per se invalid.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
And again here -- even more so. The VA is formed only from an immediate ?
Again, I guess that's OK. But -- at least -- do we have tests for this?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
wrapping neg needed here?
julian-seward1 submitted PR Review.
cfallin updated PR #2061 from aarch64-amode
to main
:
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bitiadd
s, optionally withuextend
/sextend
from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.Using
perf stat
to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impactsbz2
as follows:pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed
In other words, this reduces instruction count by 4.1% on
bz2
.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Yup, there's a test (
f10
inamodes.clif
). It's conceivable that at least in a JIT scenario, one might want to bake an address into the compiled code (e.g., access a global flag somewhere). In any case, it's valid CLIF, so we should support it, I think.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Indeed, it's valid CLIF, so we need to either handle it here, or else define it as invalid and punt this problem to the CLIF verifier. It's conceivable that (as you say) the implementer might know better and use 32-bit pointers, so I don't think we can rule it out.
(I actually asserted that this case did not happen at first, but caught myself with a test-case I had previously written that used only a zero-extended 32-bit value as an address... so if nothing else, at least mischevous unit-test authors are known to do this.)
julian-seward1 submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
nit:
while let Some(input) = workqueue.pop()
bnjbvr created PR Review Comment:
nit: i thought this was a bug at first, because only the 4 opcodes mentioned here should be caught, and then saw the extra condition on Uextend/Sextend. Instead, would it make sense to split this match into two:
- one for Uextend and Sextend that don't match the condition, so just
Opcode::Uextend | Opcode::Sextend
after the first arm- one
_ => panic!("unexpected opcode")
bnjbvr created PR Review Comment:
Fyi, the comment line size is 99 chars, so the previous wrapping seemed somewhat correct.
bnjbvr created PR Review Comment:
maybe trace?
bnjbvr created PR Review Comment:
From just skimming this code, it's hard to believe that this last branch is hit when
addends64.len() > 0
. Could the whole series be reorganized asif addends64.len() > 0 { /* subchecks, etc. */ } else /*other subchecks*/{}
?
cfallin updated PR #2061 from aarch64-amode
to main
:
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bitiadd
s, optionally withuextend
/sextend
from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.Using
perf stat
to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impactsbz2
as follows:pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed
In other words, this reduces instruction count by 4.1% on
bz2
.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
cfallin submitted PR Review.
cfallin created PR Review Comment:
Ah, I had my editor set to 80 characters and probably just rewrapped this while nearby. Undid change, thanks.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Good idea, thanks!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Ah yes, it's much cleaner with that split -- thanks!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done.
cfallin merged PR #2061.
Last updated: Dec 23 2024 at 12:05 UTC