cfallin opened PR #1718 from machinst-codebuffer to master:
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in BlockLoweringOrder
  computes RPO over the CFG, but with a twist: it merges edge blocks into
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.
- A new MachBuffer that replaces the MachSection. This is a special
  version of a code-sink that is far more than a humble Vec<u8>. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable LabelUse trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  inline in the emission pass, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label); (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The MachBuffer also implements branch-island support. On
  architectures like AArch64, this is needed so that conditional
  branches keep working at plausibly attainable distances (their direct
  range is +/- 1MB on AArch64 specifically). It also does this inline
  while streaming through the emission, without any sort of fixpoint
  algorithm or later moving of code, by simply tracking outstanding
  references and "deadlines" and emitting an island just-in-time when
  we're in danger of going out of range.
- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on bz2.wasm, the results are:
- wasmtime full run (compile + runtime) of bz2:
  - baseline: 9774M insns, 9742M cycles, 3.918s
  - w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
- clif-util wasm compile bz2:
  - baseline: 2633M insns, 3278M cycles, 1.034s
  - w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)

All numbers are averages of two runs on an Ampere eMAG.
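To make the tail-of-buffer branch rewrites (ii) and (iii) from the description concrete, here is a minimal toy sketch. Everything in it (Inst, ToyBuffer, emit, bind_label) is a hypothetical name invented for illustration and is not the MachBuffer API; it works on whole instructions rather than bytes, skips rewrite (i), and assumes no other label is bound between the trailing branches.

```rust
/// Toy instruction set; the names are illustrative, not Cranelift's.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    /// Any non-branch instruction.
    Op(&'static str),
    /// Unconditional jump to a label.
    Jump(u32),
    /// Conditional branch; `negated` flips the condition's polarity.
    CondBr { negated: bool, target: u32 },
}

/// Toy code buffer that only remembers the instructions emitted so far.
struct ToyBuffer {
    insts: Vec<Inst>,
}

impl ToyBuffer {
    fn new() -> Self {
        ToyBuffer { insts: Vec::new() }
    }

    fn emit(&mut self, inst: Inst) {
        self.insts.push(inst);
    }

    /// Bind `label` to the current end of the buffer, first simplifying
    /// any branches at the tail that target it.
    fn bind_label(&mut self, label: u32) {
        loop {
            let n = self.insts.len();
            // Case (iii): a jump to the fallthrough position is redundant.
            if n >= 1 && self.insts[n - 1] == Inst::Jump(label) {
                self.insts.pop();
                continue;
            }
            // Case (ii): `condbr label; jump other; label:` becomes
            // `condbr-negated other; label:`.
            if n >= 2 {
                if let (Inst::CondBr { negated, target }, Inst::Jump(other)) =
                    (self.insts[n - 2], self.insts[n - 1])
                {
                    if target == label {
                        self.insts.truncate(n - 2);
                        self.insts.push(Inst::CondBr {
                            negated: !negated,
                            target: other,
                        });
                        continue;
                    }
                }
            }
            break;
        }
    }
}
```

Because the check runs when a label is bound, the rewrite happens while code is streamed out, with no separate pass over the emitted instructions.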
Creating this PR now to start the review, but I still need to do the following before merging:
- Rebase to latest master
- Update vcode filetests (since isel changed)
- Write tests for the island-emission cases of MachBuffer
- Update the x64 backend to the slightly-changed isel API (it's disabled at the moment to allow this to compile)
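As a rough picture of what the island-emission cases involve, the "deadline" idea from the description can be modeled in a few lines. This is only a sketch under stated assumptions: IslandTracker, PendingUse, emit_branch, emit_code, and the 4-byte VENEER_SIZE are all hypothetical and not taken from the patch, label binding is left out, and an island here simply resolves every pending forward branch.

```rust
/// A forward branch whose target label has not been defined yet.
struct PendingUse {
    /// Offset of the branch instruction in the buffer.
    offset: u32,
    /// Farthest forward distance the branch encoding can reach.
    max_range: u32,
}

/// Assumed size in bytes of one veneer (a longer-range jump) in an island.
const VENEER_SIZE: u32 = 4;

/// Toy model of just-in-time island emission (hypothetical, not MachBuffer).
struct IslandTracker {
    cur_offset: u32,
    pending: Vec<PendingUse>,
    islands_emitted: u32,
}

impl IslandTracker {
    fn new() -> Self {
        IslandTracker { cur_offset: 0, pending: Vec::new(), islands_emitted: 0 }
    }

    /// Earliest offset beyond which some pending branch can no longer reach.
    fn deadline(&self) -> Option<u32> {
        self.pending.iter().map(|u| u.offset + u.max_range).min()
    }

    /// Emit a forward branch of `size` bytes that can reach `max_range` bytes ahead.
    fn emit_branch(&mut self, size: u32, max_range: u32) {
        // Leave room for the veneer this new branch may need later.
        self.maybe_emit_island(size + VENEER_SIZE);
        self.pending.push(PendingUse { offset: self.cur_offset, max_range });
        self.cur_offset += size;
    }

    /// Emit `size` bytes of ordinary code.
    fn emit_code(&mut self, size: u32) {
        self.maybe_emit_island(size);
        self.cur_offset += size;
    }

    /// Emit an island now if `upcoming` more bytes would leave too little
    /// room to place all veneers before the earliest deadline.
    fn maybe_emit_island(&mut self, upcoming: u32) {
        if let Some(deadline) = self.deadline() {
            let island_size = VENEER_SIZE * self.pending.len() as u32;
            if self.cur_offset + upcoming + island_size > deadline {
                self.cur_offset += island_size;
                self.pending.clear();
                self.islands_emitted += 1;
            }
        }
    }
}
```

A test in this style would emit one short-range branch, stream code up to the deadline, and check that exactly one island appears at the last moment it still fits.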
cfallin requested bnjbvr and julian-seward1 for a review on PR #1718.
cfallin edited PR #1718 from machinst-codebuffer to master:
tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).
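The percentages quoted for bz2 can be rechecked from the raw measurements in the PR description. The small sketch below assumes that "faster" means the relative reduction in wall-clock time (not a speedup ratio); the helper names percent_reduction and report are made up for this check.

```rust
/// Percentage by which `changed` is lower than `baseline`.
fn percent_reduction(baseline: f64, changed: f64) -> f64 {
    (baseline - changed) / baseline * 100.0
}

/// The bz2 measurements quoted in the PR description, as
/// (label, baseline, with changes).
const MEASUREMENTS: [(&str, f64, f64); 4] = [
    ("full run, seconds", 3.918, 2.958),
    ("full run, M insns", 9774.0, 7012.0),
    ("compile, seconds", 1.034, 0.923),
    ("compile, M insns", 2633.0, 2366.0),
];

/// One formatted line per measurement, rounded to one decimal place.
fn report() -> Vec<String> {
    MEASUREMENTS
        .iter()
        .map(|(label, base, new)| {
            format!("{}: {:.1}% lower", label, percent_reduction(*base, *new))
        })
        .collect()
}
```

Under that reading, the four results round to 24.5%, 28.3%, 10.7% and 10.1%, which matches the figures in the description and the rounded 24%/28%/10%/10% in the tl;dr.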
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin edited PR #1718 from machinst-codebuffer to master:

tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).

This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in BlockLoweringOrder
  computes RPO over the CFG, but with a twist: it merges edge blocks into
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new MachBuffer that replaces the MachSection. This is a special
  version of a code-sink that is far more than a humble Vec<u8>. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable LabelUse trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  inline in the emission pass, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The MachBuffer also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on bz2.wasm, the results are:

wasmtime full run (compile + runtime) of bz2:
  baseline:   9774M insns, 9742M cycles, 3.918s
  w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
  baseline:   2633M insns, 3278M cycles, 1.034s
  w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)

All numbers are averages of two runs on an Ampere eMAG.
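To make the tail-of-buffer branch rewrites in the description more concrete, here is a minimal toy sketch of cases (ii) and (iii). It is not the MachBuffer code from this PR: ToyBuffer, Inst, bind_label and label_offset are hypothetical names, instructions occupy one slot each instead of real machine bytes, and case (i) (label redirection for branch-only blocks) is left out.

```rust
use std::collections::HashMap;

/// Toy instruction set (hypothetical): one slot per instruction, no real encodings.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Op(&'static str),
    Jmp(u32),
    CondBr { invert: bool, target: u32 },
}

#[derive(Default)]
struct ToyBuffer {
    insts: Vec<Inst>,
    /// Label -> index of the instruction slot it is bound to.
    labels: HashMap<u32, usize>,
}

impl ToyBuffer {
    fn emit(&mut self, inst: Inst) {
        self.insts.push(inst);
    }

    fn label_offset(&self, label: u32) -> Option<usize> {
        self.labels.get(&label).copied()
    }

    /// Bind `label` at the current tail, first simplifying branches that end there.
    fn bind_label(&mut self, label: u32) {
        let n = self.insts.len();
        if n >= 1 && self.insts[n - 1] == Inst::Jmp(label) {
            // Case (iii): an unconditional jump to the fallthrough position is a no-op.
            self.insts.pop();
        } else if n >= 2 && !self.labels.values().any(|&pos| pos == n - 1) {
            // Case (ii): `cond -> label; jmp other; label:` becomes `!cond -> other; label:`.
            // Skipped when some label is bound to the jump itself, since that jump
            // is then a branch target and must stay.
            if let (Inst::CondBr { invert, target }, Inst::Jmp(other)) =
                (self.insts[n - 2].clone(), self.insts[n - 1].clone())
            {
                if target == label {
                    self.insts.truncate(n - 2);
                    self.insts.push(Inst::CondBr { invert: !invert, target: other });
                }
            }
        }
        // Labels that pointed past the (possibly shortened) tail now alias the new tail.
        let tail = self.insts.len();
        for pos in self.labels.values_mut() {
            *pos = (*pos).min(tail);
        }
        self.labels.insert(label, tail);
    }
}
```

The point of the sketch is that both rewrites happen at the moment a label is bound, by editing only the last one or two entries, so no separate pass over already-emitted code is needed.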
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
What does this mean? Can you clarify the semantics? When is it used?
julian-seward1 created PR Review Comment:
nit: it would be better not to use the word Load here since it's not a load. Maybe Compute?
julian-seward1 created PR Review Comment:
Could you add some details to say what the island consists of? More generally, is there a top level description of the islands-and-deadlines algorithm somewhere?
julian-seward1 created PR Review Comment:
What if ty is a vector type? Then Inst::load_constant doesn't sound right to me. Is there some guarantee that this won't get called with such a type? If not, can you assert/panic it out?
julian-seward1 created PR Review Comment:
Is there any way that this can be automatically cross-checked with reality? This sounds to me like something that could be violated somewhere down the line, but that would not break anything except in some extremely rare huge-function input, which will make it hard to track down. So some (any?) kind of cross-check scheme would be a Good Thing.
If not possible, at least add a loud comment at the top of the insn emitter to the effect that it must comply with what is claimed here.
julian-seward1 created PR Review Comment:
Interpreted signed or unsigned?
julian-seward1 created PR Review Comment:
That doesn't read quite right; is it correct?
julian-seward1 created PR Review Comment:
This is a bit unclear; could you make it more precise? Is there a 1:1 mapping from CLIR blocks to VCode blocks? The use of "subgraphs" implies there isn't, but there's no clarification of the meaning of "subgraphs" here.
julian-seward1 created PR Review Comment:
Does (emit island with guard jump if needed) refer to the 6 insns that follow (I think so), or does it denote further insns that need to be emitted (I think not)? It would be good to make this clearer in the comment.
julian-seward1 created PR Review Comment:
"freely permuted" .. surely they'd have to maintain the same data dependency relationships?
julian-seward1 created PR Review Comment:
Since forgetting to do this might be a common mistake, can you say here how the system will fail should one forget to do that?
julian-seward1 created PR Review Comment:
For the sake of clarity, could you add "CLIR" before "instructions" ?
julian-seward1 created PR Review Comment:
Can you add a 1 liner comment saying what this does?
julian-seward1 created PR Review Comment:
tmp doesn't give a big enough hint what this does. A better name would be new_vreg.
julian-seward1 edited PR Review Comment.
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Is this definition of is64 right? That seems like it's an unsigned criterion, but the general rule on Intel for 32-bit immediate fields is that they are sign extended to 64 bits as appropriate. If this logic simply moved from elsewhere in this patch, then leave it as is; but otherwise maybe change to the signed variant, using low32willSXto64?
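The difference between the two criteria can be shown with a small standalone sketch. Both function names below are hypothetical and are not taken from the patch; the sketch assumes the question is whether a 64-bit constant can be represented by a 32-bit immediate that the hardware sign-extends.

```rust
/// Signed criterion: true if sign-extending the low 32 bits of `x`
/// back to 64 bits reproduces `x` exactly.
fn low32_will_sign_extend_to_64(x: u64) -> bool {
    // `as i32` truncates to the low 32 bits; `as i64` then sign-extends.
    (x as i32 as i64 as u64) == x
}

/// Unsigned criterion, for contrast: true if the upper 32 bits are all zero.
fn fits_in_u32(x: u64) -> bool {
    x <= u32::MAX as u64
}
```

The two checks disagree on values such as 0x8000_0000 (zero upper half, but bit 31 set) and 0xffff_ffff_8000_0000 (a small negative number), which is exactly the range where choosing the wrong criterion would produce a wrong immediate.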
julian-seward1 created PR Review Comment:
This change concerns me somewhat. What guarantees that this assertion can't fail now?
julian-seward1 created PR Review Comment:
I feel like it's a shame to lose this, because it means losing the ability to easily differentiate legitimate failures due to non-implementation of a target-independent CLIR insn, vs bugs resulting in machine-specific CLIRs being handed to us.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Oh, these X86* opcodes went away in the latest master, so this is just a rebase-related change. I think the rule is still (as has always been) never have a fallthrough in the big-opcode-match, and handle additions or deletions as they come, indicated by compile errors -- so if any machine-specific ops are added back in the future, we'll figure out what to do then.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Nothing guarantees it, but this form (ResolvedOffset) is used only when the lowering explicitly selects it, now; ordinary branches don't go through this code (the LabelUse handles them instead, and we can implement 64-bit long-form veneers there if we think we need >2GB code-size).
cfallin submitted PR Review.
cfallin created PR Review Comment:
I think so, or at least, this was existing logic in the Iconst lowering prior to this patch:
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Was part of the old API, but no reason not to rename here :-) Now it's alloc_tmp().
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Yep, modulo true deps; clarified.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Was correct but probably too terse; fixed.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hopefully the ASCII art and additional explanation help! This is pretty subtle so I'm happy to clarify further if needed.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Added some more docs on EmitIsland to clarify.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Signed (clarified).
cfallin submitted PR Review.
cfallin created PR Review Comment:
Good idea! I added a debug assert to Inst::emit() that verifies that no more than worst_case_size() bytes were emitted.
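A cross-check of this kind could look roughly like the sketch below. ToyInst, emit_bytes, emit_checked and Nop4 are hypothetical stand-ins rather than the actual Cranelift trait or the code added in the PR; only the idea of comparing emitted bytes against worst_case_size() comes from the comment above.

```rust
/// Hypothetical stand-in for a machine instruction that can emit itself.
trait ToyInst {
    /// Upper bound, in bytes, that the emitter promises not to exceed.
    fn worst_case_size(&self) -> usize;
    /// Append this instruction's encoding to `sink`.
    fn emit_bytes(&self, sink: &mut Vec<u8>);
}

/// Emit `inst` and, in debug builds, verify the promised size bound held.
fn emit_checked<I: ToyInst>(inst: &I, sink: &mut Vec<u8>) {
    let start = sink.len();
    inst.emit_bytes(sink);
    let emitted = sink.len() - start;
    debug_assert!(
        emitted <= inst.worst_case_size(),
        "emitted {} bytes, but worst_case_size() promised at most {}",
        emitted,
        inst.worst_case_size()
    );
}

/// Example instruction: four placeholder bytes.
struct Nop4;

impl ToyInst for Nop4 {
    fn worst_case_size(&self) -> usize {
        4
    }
    fn emit_bytes(&self, sink: &mut Vec<u8>) {
        sink.extend_from_slice(&[0x1f, 0x20, 0x03, 0xd5]);
    }
}
```

Because debug_assert! compiles to nothing in release builds, a check like this costs nothing in production but catches a violated size claim on ordinary test inputs, well before a huge function depends on the bound.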
cfallin submitted PR Review.
cfallin created PR Review Comment:
Added assert.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Added to the top of machinst/buffer.rs.
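For readers without the source at hand, the islands-and-deadlines idea from the PR description can be sketched in a few lines. This is a simplified toy model under stated assumptions, not the algorithm documented in machinst/buffer.rs: ToyEmitter and its methods are hypothetical, the guard jump and each veneer are assumed to be 4 bytes, and veneers are assumed to have unlimited range.

```rust
/// Assumed sizes, in bytes, for this toy model only.
const GUARD_JUMP_SIZE: u32 = 4;
const VENEER_SIZE: u32 = 4;

#[derive(Default)]
struct ToyEmitter {
    /// Current emission offset in bytes.
    offset: u32,
    /// Unresolved forward references: (label, last offset still in range).
    pending: Vec<(u32, u32)>,
    /// Islands emitted so far: (offset of first veneer, labels given veneers).
    islands: Vec<(u32, Vec<u32>)>,
}

impl ToyEmitter {
    /// Record a forward reference made at the current offset with limited range.
    fn use_label(&mut self, label: u32, max_range: u32) {
        self.pending.push((label, self.offset + max_range));
    }

    /// The label is now defined, so references to it no longer have a deadline.
    fn bind_label(&mut self, label: u32) {
        self.pending.retain(|&(l, _)| l != label);
    }

    /// Would an island placed after the next instruction miss the earliest deadline?
    fn island_needed(&self, next_size: u32) -> bool {
        let island_size = GUARD_JUMP_SIZE + VENEER_SIZE * self.pending.len() as u32;
        match self.pending.iter().map(|&(_, deadline)| deadline).min() {
            Some(deadline) => self.offset + next_size + island_size > deadline,
            None => false,
        }
    }

    /// Emit an instruction of `size` bytes, inserting an island first if required.
    fn emit(&mut self, size: u32) {
        if self.island_needed(size) {
            self.emit_island();
        }
        self.offset += size;
    }

    /// Emit a guard jump over the island, then one veneer per pending label.
    fn emit_island(&mut self) {
        let labels: Vec<u32> = self.pending.drain(..).map(|(label, _)| label).collect();
        self.offset += GUARD_JUMP_SIZE;
        let start = self.offset;
        self.offset += VENEER_SIZE * labels.len() as u32;
        self.islands.push((start, labels));
    }
}
```

The check runs once per emitted instruction and only looks at the earliest outstanding deadline, which is what lets the decision be made while streaming, with no fixpoint iteration or later code motion.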
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Clarified; this is the same as the original "lowered" form, but we just call out the actual semantics and purpose a bit more explicitly now rather than conflating it with the branch-lowering process.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Done.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
cfallin updated PR #1718 from machinst-codebuffer to master.
julian-seward1 submitted PR Review.
cfallin merged PR #1718.
Last updated: Jan 24 2025 at 00:11 UTC