wasmtime / PR #1718 Rework of MachInst isel, branch fixup... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #1718 Rework of MachInst isel, branch fixup...

Wasmtime GitHub notifications bot (May 16 2020 at 02:08):

cfallin opened PR #1718 from machinst-codebuffer to master:

This patch includes:

A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in BlockLoweringOrder
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
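
For readers unfamiliar with the term, the following is a minimal sketch of a plain reverse-postorder (RPO) computation over a CFG. It does not reproduce the edge-block merging described above, and the function name `reverse_postorder` and the adjacency-list representation are assumptions made for illustration, not the PR's BlockLoweringOrder API.

```rust
/// Compute a reverse postorder over a CFG given as successor lists.
/// Block 0 is assumed to be the entry block (an assumption of this sketch).
fn reverse_postorder(succs: &[Vec<usize>]) -> Vec<usize> {
    let mut postorder = Vec::with_capacity(succs.len());
    if succs.is_empty() {
        return postorder;
    }
    let mut visited = vec![false; succs.len()];
    // Explicit DFS stack of (block, index of the next successor to try),
    // so deep CFGs do not overflow the call stack.
    let mut stack: Vec<(usize, usize)> = vec![(0, 0)];
    visited[0] = true;
    while let Some(top) = stack.last_mut() {
        let (block, next) = *top;
        if next < succs[block].len() {
            top.1 += 1;
            let s = succs[block][next];
            if !visited[s] {
                visited[s] = true;
                stack.push((s, 0));
            }
        } else {
            // All successors are done: the block is finished in postorder.
            postorder.push(block);
            stack.pop();
        }
    }
    postorder.reverse();
    postorder
}
```

In RPO every block appears before its successors except along back edges, which is why emitting blocks in this order tends to produce useful fallthroughs.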

A new MachBuffer that replaces the MachSection. This is a special
version of a code-sink that is far more than a humble Vec<u8>. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable LabelUse trait that defines various types
of fixups (basically internal relocations).
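
To make the label-definition and label-use bookkeeping concrete, here is a small sketch of a buffer that records uses and patches them at the end. Everything in it is hypothetical: the names `TinyBuffer` and `Label`, and the single fixup kind (a 32-bit little-endian offset relative to the start of the patched field), are assumptions and not the MachBuffer or LabelUse definitions from the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Label(usize);

struct TinyBuffer {
    data: Vec<u8>,
    /// Offset of each label once bound, indexed by label number.
    label_offsets: Vec<Option<u32>>,
    /// Recorded uses: (offset of the 4-byte field to patch, label).
    uses: Vec<(u32, Label)>,
}

impl TinyBuffer {
    fn new() -> Self {
        TinyBuffer { data: Vec::new(), label_offsets: Vec::new(), uses: Vec::new() }
    }

    /// Allocate a fresh, not-yet-bound label.
    fn get_label(&mut self) -> Label {
        self.label_offsets.push(None);
        Label(self.label_offsets.len() - 1)
    }

    /// Define the label at the current end of the buffer.
    fn bind_label(&mut self, label: Label) {
        self.label_offsets[label.0] = Some(self.data.len() as u32);
    }

    fn put_u8(&mut self, byte: u8) {
        self.data.push(byte);
    }

    /// Emit a 4-byte placeholder and remember that it refers to `label`.
    fn put_label_ref(&mut self, label: Label) {
        self.uses.push((self.data.len() as u32, label));
        self.data.extend_from_slice(&[0; 4]);
    }

    /// Patch every recorded use with (label offset - field offset).
    fn finish(mut self) -> Vec<u8> {
        for &(use_offset, label) in &self.uses {
            let target = self.label_offsets[label.0].expect("label used but never bound");
            let rel = (target as i32).wrapping_sub(use_offset as i32);
            let at = use_offset as usize;
            self.data[at..at + 4].copy_from_slice(&rel.to_le_bytes());
        }
        self.data
    }
}
```

A machine-pluggable design would replace the hard-coded 4-byte patch with a per-architecture description of each fixup kind (field width, encoding, reachable range).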

Importantly, it implements some simple peephole-style branch rewrites
inline in the emission pass, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
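
Rewrites (ii) and (iii) can be illustrated on symbolic terminators rather than on encoded bytes. The sketch below is an assumption-laden model: the `Terminator` and `Branch` types and the `emit_terminator` function are invented for this example, and it decides per block using the index of the block laid out next, whereas the PR describes doing this on branches at the tail of the buffer. Rewrite (i), label redirection for empty blocks, is not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Terminator {
    Jump { target: usize },
    CondJump { taken: usize, not_taken: usize },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Branch {
    Jump { target: usize },
    /// `inverted` means the condition's polarity was swapped.
    Cond { target: usize, inverted: bool },
}

/// Choose the branches to emit for a block terminator, given the index of
/// the block that is laid out immediately afterwards.
fn emit_terminator(term: Terminator, next_block: usize) -> Vec<Branch> {
    match term {
        // (iii) A jump to the fallthrough block needs no instruction.
        Terminator::Jump { target } if target == next_block => vec![],
        Terminator::Jump { target } => vec![Branch::Jump { target }],
        Terminator::CondJump { taken, not_taken } => {
            if taken == next_block && not_taken == next_block {
                // Both arms fall through: nothing to emit.
                vec![]
            } else if not_taken == next_block {
                // The not-taken arm falls through; only the conditional remains.
                vec![Branch::Cond { target: taken, inverted: false }]
            } else if taken == next_block {
                // (ii) Swap polarity so the taken arm becomes the fallthrough.
                vec![Branch::Cond { target: not_taken, inverted: true }]
            } else {
                // Neither arm falls through: keep the full pair.
                vec![
                    Branch::Cond { target: taken, inverted: false },
                    Branch::Jump { target: not_taken },
                ]
            }
        }
    }
}
```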

The MachBuffer also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
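
The deadline idea can be sketched as a simple offset simulation. This is only one possible reading of the description above: the `IslandTracker` type, its fields, the fixed veneer size, and the exact check are all assumptions, and the guard jump around an island and the resolution of references when a label is bound are left out.

```rust
struct IslandTracker {
    /// Current end-of-buffer offset in bytes.
    offset: u32,
    /// Furthest forward distance a short-range branch can reach.
    max_range: u32,
    /// Size of one island entry (veneer) in bytes.
    veneer_size: u32,
    /// Offsets of short-range branches whose targets are still unknown.
    pending: Vec<u32>,
    /// Offsets at which islands were emitted (recorded for inspection).
    islands: Vec<u32>,
}

impl IslandTracker {
    fn new(max_range: u32, veneer_size: u32) -> Self {
        IslandTracker { offset: 0, max_range, veneer_size, pending: Vec::new(), islands: Vec::new() }
    }

    /// The earliest offset beyond which some pending branch cannot reach.
    fn earliest_deadline(&self) -> Option<u32> {
        self.pending.iter().min().map(|use_offset| use_offset + self.max_range)
    }

    /// Would emitting `upcoming` more bytes leave too little room for an island?
    fn island_needed(&self, upcoming: u32) -> bool {
        match self.earliest_deadline() {
            None => false,
            Some(deadline) => {
                let island_size = self.veneer_size * self.pending.len() as u32;
                // One extra veneer of slack covers a branch added by the
                // instruction about to be emitted.
                self.offset + upcoming + island_size + self.veneer_size > deadline
            }
        }
    }

    /// Emit `size` bytes, inserting an island first if a deadline is near.
    fn emit(&mut self, size: u32, is_short_forward_branch: bool) {
        if self.island_needed(size) {
            self.islands.push(self.offset);
            self.offset += self.veneer_size * self.pending.len() as u32;
            // Each pending branch now targets its veneer in the island.
            self.pending.clear();
        }
        if is_short_forward_branch {
            self.pending.push(self.offset);
        }
        self.offset += size;
    }
}
```

Because the check runs before each emission and only looks at the earliest deadline, the whole scheme stays single-pass: no code is moved after the fact.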

A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
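
As an illustration of the three input forms, the sketch below shows how a backend might fold a small constant into an add-immediate and otherwise fall back to the register. The `InputForm`, `Reg`, and `AddInst` types and the `lower_add` function are hypothetical names for this example, not the PR's lowering API; the 12-bit unsigned immediate limit reflects the AArch64 ADD (immediate) encoding without the optional shift.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Reg(u32);

/// The three ways an input value might be presented to the backend.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InputForm {
    /// The value is a known constant that could be merged as an immediate.
    Constant(u64),
    /// The value is only available in a register.
    Register(Reg),
    /// The value comes from an instruction (by index) that could be merged.
    ProducedBy(u32),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum AddInst {
    RegImm { rn: Reg, imm12: u16 },
    RegReg { rn: Reg, rm: Reg },
}

/// Lower `lhs + rhs`. `rhs_reg` is the register that holds the right-hand
/// value if the backend decides it needs it in a register after all.
fn lower_add(lhs: Reg, rhs: InputForm, rhs_reg: Reg) -> AddInst {
    match rhs {
        // Small constants fit the 12-bit unsigned immediate field.
        InputForm::Constant(c) if c < 4096 => AddInst::RegImm { rn: lhs, imm12: c as u16 },
        // Larger constants, plain registers, and producers this sketch does
        // not try to merge all use the register form.
        _ => AddInst::RegReg { rn: lhs, rm: rhs_reg },
    }
}
```

Choosing the immediate form means the constant never needs a register of its own, which matches the stated goal of keeping constant live ranges short.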

Overall, on bz2.wasm, the results are:
    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.

Creating this PR now to start the review, but I still need to do the following before merging:

Rebase to latest master

Update vcode filetests (since isel changed)

Write tests for the island-emission cases of MachBuffer

Update the x64 backend to the slightly-changed isel API (it's disabled at the moment to allow this to compile)

Wasmtime GitHub notifications bot (May 16 2020 at 02:08):

cfallin requested bnjbvr and julian-seward1 for a review on PR #1718.

Wasmtime GitHub notifications bot (May 16 2020 at 02:09):

cfallin edited PR #1718 from machinst-codebuffer to master:

tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).

This patch includes:

A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in BlockLoweringOrder
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.

A new MachBuffer that replaces the MachSection. This is a special
version of a code-sink that is far more than a humble Vec<u8>. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable LabelUse trait that defines various types
of fixups (basically internal relocations).

Importantly, it implements some simple peephole-style branch rewrites
inline in the emission pass, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.

The MachBuffer also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.

A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.

Overall, on bz2.wasm, the results are:
    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.

Creating this PR now to start the review, but I still need to do the following before merging:

Rebase to latest master

Update vcode filetests (since isel changed)

Write tests for the island-emission cases of MachBuffer

Update the x64 backend to the slightly-changed isel API (it's disabled at the moment to allow this to compile)

Wasmtime GitHub notifications bot (May 16 2020 at 06:22):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 16 2020 at 08:02):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 17 2020 at 05:03):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 17 2020 at 06:11):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 01:17):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 01:18):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 01:20):

cfallin edited PR #1718 from machinst-codebuffer to master:

tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).

This patch includes:

A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in BlockLoweringOrder
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.

A new MachBuffer that replaces the MachSection. This is a special
version of a code-sink that is far more than a humble Vec<u8>. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable LabelUse trait that defines various types
of fixups (basically internal relocations).

Importantly, it implements some simple peephole-style branch rewrites
inline in the emission pass, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.

The MachBuffer also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.

A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.

Overall, on bz2.wasm, the results are:
    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

What does this mean? Can you clarify the semantics? When is it used?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

nit: it would be better not to use the word Load here since it's not a load. Maybe Compute ?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Could you add some details to say what the island consists of? More generally, is there a top level description of the islands-and-deadlines algorithm somewhere?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

What if ty is a vector type? Then Inst::load_constant doesn't sound right to me. Is there some guarantee that this won't get called with such a type? If not, can you assert/panic it out?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Is there any way that this can be automatically cross-checked with reality? This sounds to me like something that could be violated somewhere down the line, but that would not break anything except in some extremely rare huge-function input, which will make it hard to track down. So some (any?) kind of cross-check scheme would be a Good Thing.

If not possible, at least add a loud comment at the top of the insn emitter to the effect that it must comply with what is claimed here.

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Interpreted signed or unsigned?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

That doesn't read quite right; is it correct?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

This is a bit unclear; could you make it more precise? Is there a 1:1 mapping from CLIR blocks to VCode blocks? The use of "subgraphs" implies there isn't, but there's no clarification of the meaning of "subgraphs" here.

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Does (emit island with guard jump if needed) refer to the 6 insns that follow (I think so), or does it denote further insns that need to be emitted (I think not) ? It would be good to make this clearer in the comment.

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

"freely permuted" .. surely they'd have to maintain the same data dependency relationships?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Since forgetting to do this might be a common mistake, can you say here how the system will fail should one forget to do that?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

For the sake of clarity, could you add "CLIR" before "instructions" ?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

Can you add a 1 liner comment saying what this does?

Wasmtime GitHub notifications bot (May 18 2020 at 11:45):

julian-seward1 created PR Review Comment:

tmp doesn't give a big enough hint what this does. A better name would be new_vreg.

Wasmtime GitHub notifications bot (May 18 2020 at 12:17):

julian-seward1 edited PR Review Comment.

Wasmtime GitHub notifications bot (May 18 2020 at 12:31):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 12:31):

julian-seward1 created PR Review Comment:

Is this definition of is64 right? That seems like it's an unsigned criterion, but the general rule on Intel for 32-bit immediate fields is that they are sign extended to 64 bits as appropriate. If this logic simply moved from elsewhere in this patch, then leave it as is; but otherwise maybe change to the signed variant, using low32willSXto64 ?
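
To illustrate the distinction raised in this comment, here is a minimal sketch contrasting an unsigned "fits in 32 bits" test with a signed "low 32 bits sign-extend back to the full value" test. Both function names are hypothetical and are not taken from the patch; the second only mirrors the idea behind the low32willSXto64 helper mentioned above.

```rust
/// Unsigned criterion: true if `x` fits in 32 bits when zero-extended.
/// (Hypothetical helper for illustration.)
fn fits_in_u32(x: u64) -> bool {
    x <= u32::MAX as u64
}

/// Signed criterion: true if sign-extending the low 32 bits of `x`
/// to 64 bits reproduces `x` exactly. (Hypothetical helper.)
fn low32_will_sign_extend_to_64(x: u64) -> bool {
    let xs = x as i64;
    // Shift the low half up, then arithmetic-shift it back down so the
    // sign bit of the low 32 bits is replicated into the upper 32 bits.
    xs == ((xs << 32) >> 32)
}
```

The two criteria disagree on values such as `0x8000_0000` (fits unsigned, but sign-extends to a different 64-bit value) and `0xFFFF_FFFF_FFFF_FFFF` (does not fit unsigned, but is the sign-extension of `0xFFFF_FFFF`).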

Wasmtime GitHub notifications bot (May 18 2020 at 12:31):

julian-seward1 created PR Review Comment:

This change concerns me somewhat. What guarantees that this assertion can't fail now?

Wasmtime GitHub notifications bot (May 18 2020 at 12:31):

julian-seward1 created PR Review Comment:

I feel like it's a shame to lose this, because it means losing the ability to easily differentiate legitimate failures due to non-implementation of a target-independent CLIR insn, vs bugs resulting in machine-specific CLIRs being handed to us.

Wasmtime GitHub notifications bot (May 18 2020 at 15:52):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 15:52):

cfallin created PR Review Comment:

Oh, these X86* opcodes went away in the latest master, so this is just a rebase-related change. I think the rule is still (as has always been) never have a fallthrough in the big-opcode-match, and handle additions or deletions as they come, indicated by compile errors -- so if any machine-specific ops are added back in the future, we'll figure out what to do then.
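
The "no fallthrough in the big opcode match" rule can be sketched as follows. The `Opcode` enum and `lower` function here are hypothetical stand-ins, not the real Cranelift types; the point is only that a `match` without a wildcard arm stops compiling when a variant is added or removed.

```rust
/// Hypothetical, heavily reduced opcode set for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opcode {
    Iadd,
    Isub,
    Iconst,
}

/// Map each opcode to a made-up mnemonic.
fn lower(op: Opcode) -> &'static str {
    match op {
        Opcode::Iadd => "add",
        Opcode::Isub => "sub",
        Opcode::Iconst => "mov",
        // Deliberately no `_ => ...` arm: adding a new `Opcode` variant
        // makes this match non-exhaustive, which is a compile error.
    }
}
```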

Wasmtime GitHub notifications bot (May 18 2020 at 22:38):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:38):

cfallin created PR Review Comment:

Nothing guarantees it, but this form (ResolvedOffset) is used only when the lowering explicitly selects it, now; ordinary branches don't go through this code (the LabelUse handles them instead, and we can implement 64-bit long-form veneers there if we think we need >2GB code-size).

Wasmtime GitHub notifications bot (May 18 2020 at 22:39):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:39):

cfallin created PR Review Comment:

I think so, or at least, this was existing logic in the Iconst lowering prior to this patch:

https://github.com/bytecodealliance/wasmtime/blob/a75377565f830052094aa8aa72c5e7a6b787fa18/cranelift/codegen/src/isa/x64/lower.rs#L116

Wasmtime GitHub notifications bot (May 18 2020 at 22:40):

cfallin updated PR #1718 from machinst-codebuffer to master:

tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).

This patch includes:

A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in BlockLoweringOrder
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
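
For readers unfamiliar with RPO, the following is a minimal sketch of a plain reverse-postorder traversal over a CFG given as successor lists. It does not model the edge-block merging described above, and the function name and graph representation are assumptions made for illustration.

```rust
/// Compute a reverse postorder of the blocks reachable from `entry`.
/// `succs[b]` lists the successors of block `b`. (Hypothetical helper;
/// it does not merge edge blocks as BlockLoweringOrder is said to do.)
fn reverse_postorder(succs: &[Vec<usize>], entry: usize) -> Vec<usize> {
    let mut visited = vec![false; succs.len()];
    let mut postorder = Vec::with_capacity(succs.len());
    // Explicit DFS stack of (block, index of next successor to try).
    let mut stack = vec![(entry, 0usize)];
    visited[entry] = true;
    while let Some(top) = stack.last_mut() {
        let (block, next) = (top.0, top.1);
        top.1 += 1;
        match succs[block].get(next) {
            Some(&succ) => {
                if !visited[succ] {
                    visited[succ] = true;
                    stack.push((succ, 0));
                }
            }
            None => {
                // All successors handled: the block is finished.
                postorder.push(block);
                stack.pop();
            }
        }
    }
    postorder.reverse();
    postorder
}
```

On a diamond-shaped CFG (block 0 branching to 1 and 2, both joining at 3), every block appears after all of its predecessors in the result.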

A new MachBuffer that replaces the MachSection. This is a special
version of a code-sink that is far more than a humble Vec<u8>. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable LabelUse trait that defines various types
of fixups (basically internal relocations).

Importantly, it implements some simple peephole-style branch rewrites
inline in the emission pass, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
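
Rewrites (ii) and (iii) can be sketched on a toy model of the branches at the tail of a block. The `Branch` type and `simplify_tail` function are hypothetical and much simpler than the real MachBuffer, which works on emitted bytes and labels; rewrite (i), label redirection, is not modeled here.

```rust
/// Toy model of a branch at the tail of a block. Targets are label ids.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Branch {
    /// Conditional branch; `inverted` records a flipped condition.
    Cond { taken: u32, inverted: bool },
    /// Unconditional branch.
    Uncond { target: u32 },
}

/// Simplify the trailing branches of a block, given the label of the
/// block that follows it in layout order (`fallthrough`).
fn simplify_tail(mut tail: Vec<Branch>, fallthrough: u32) -> Vec<Branch> {
    // (ii) `if c goto next; goto other` becomes `if !c goto other`.
    let swapped = match tail.as_slice() {
        [Branch::Cond { taken, inverted }, Branch::Uncond { target }]
            if *taken == fallthrough =>
        {
            Some(Branch::Cond { taken: *target, inverted: !*inverted })
        }
        _ => None,
    };
    if let Some(b) = swapped {
        tail = vec![b];
    }
    // (iii) Drop trailing branches whose target is the fallthrough.
    while let Some(&last) = tail.last() {
        let target = match last {
            Branch::Cond { taken, .. } => taken,
            Branch::Uncond { target } => target,
        };
        if target == fallthrough {
            tail.pop();
        } else {
            break;
        }
    }
    tail
}
```

In this model a conditional branch followed by an unconditional one collapses to a single inverted conditional when the taken target is the next block, and a lone jump to the next block disappears entirely.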

The MachBuffer also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
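
The deadline idea can be sketched as follows: each unresolved forward reference imposes a latest offset by which its target (or an island) must appear, and the emitter checks the earliest such deadline before appending more code. The `IslandTracker` type and its methods are hypothetical; the real MachBuffer tracks considerably more state.

```rust
/// Hypothetical tracker for unresolved forward label references.
struct IslandTracker {
    /// Maximum forward reach of the branch kind, in bytes.
    max_forward_range: u32,
    /// Earliest offset beyond which some pending reference cannot reach.
    deadline: Option<u32>,
}

impl IslandTracker {
    fn new(max_forward_range: u32) -> Self {
        IslandTracker { max_forward_range, deadline: None }
    }

    /// Record a forward reference emitted at `use_offset`.
    fn record_use(&mut self, use_offset: u32) {
        let d = use_offset.saturating_add(self.max_forward_range);
        self.deadline = Some(match self.deadline {
            Some(cur) => cur.min(d),
            None => d,
        });
    }

    /// True if appending `next_bytes` at `cur_offset` could pass the
    /// earliest deadline, so an island should be emitted first.
    fn island_needed(&self, cur_offset: u32, next_bytes: u32) -> bool {
        match self.deadline {
            Some(d) => cur_offset.saturating_add(next_bytes) > d,
            None => false,
        }
    }

    /// Call after an island has resolved all pending references.
    fn island_emitted(&mut self) {
        self.deadline = None;
    }
}
```

With a 1 MB forward range (`1 << 20`), a reference recorded at offset 100 sets the deadline to offset 1048676; no island is needed until the next emission would cross that offset.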

A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.

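The three input forms mentioned above can be modeled roughly as an enum that a backend matches on when choosing an operand. This is a hypothetical sketch: `InputForm`, `AddOperand`, and `select_add_operand` are illustrative names, the 12-bit immediate limit is an assumed example, and the actual API in the patch may be shaped differently.

```rust
/// Hypothetical model of the three forms an input can take.
#[derive(Clone, Copy, Debug, PartialEq)]
enum InputForm {
    /// The value is a known constant.
    Constant(u64),
    /// The value is only available in a (virtual) register.
    Register(u32),
    /// The value is produced by an instruction (identified by an index)
    /// that the backend may merge into the consumer.
    ProducingInst(usize),
}

/// Hypothetical second operand of an add: an immediate or a register.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AddOperand {
    Imm12(u16),
    Reg(u32),
}

/// Pick an operand for an add. Small constants are merged as immediates;
/// everything else falls back to `value_reg`, the register assumed to
/// hold the value.
fn select_add_operand(form: InputForm, value_reg: u32) -> AddOperand {
    match form {
        InputForm::Constant(c) if c < (1 << 12) => AddOperand::Imm12(c as u16),
        InputForm::Register(r) => AddOperand::Reg(r),
        InputForm::Constant(_) | InputForm::ProducingInst(_) => AddOperand::Reg(value_reg),
    }
}
```

A backend that can merge the producing instruction (for example, folding a shift into an add) would add a case for `ProducingInst` instead of falling back to the register.
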
Overall, on bz2.wasm, the results are:
    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.

Wasmtime GitHub notifications bot (May 18 2020 at 22:40):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:40):

cfallin created PR Review Comment:

Was part of the old API, but no reason not to rename here :-) Now it's alloc_tmp().

Wasmtime GitHub notifications bot (May 18 2020 at 22:40):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:40):

cfallin created PR Review Comment:

Done!

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin created PR Review Comment:

Done.

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin created PR Review Comment:

Yep, modulo true deps; clarified.

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:41):

cfallin created PR Review Comment:

Was correct but probably too terse; fixed.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin created PR Review Comment:

Hopefully the ASCII art and additional explanation help! This is pretty subtle so I'm happy to clarify further if needed.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin created PR Review Comment:

Added some more docs on EmitIsland to clarify.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:42):

cfallin created PR Review Comment:

Signed (clarified).

Wasmtime GitHub notifications bot (May 18 2020 at 22:43):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:43):

cfallin created PR Review Comment:

Good idea! I added a debug assert to Inst::emit() that verifies that no more than worst_case_size() bytes were emitted.
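
A check of this kind might look like the sketch below. The `Inst` variants, the byte encodings, and the bound of 5 bytes are all made up for illustration; only the pattern (measure the bytes appended by `emit` and compare against `worst_case_size()`) follows the comment above.

```rust
/// Hypothetical instruction set, for illustration only.
enum Inst {
    Nop,
    LoadImm32(u32),
}

impl Inst {
    /// Upper bound, in bytes, on what `emit` may append for any
    /// instruction. The value 5 is an assumption of this sketch.
    fn worst_case_size() -> usize {
        5
    }

    /// Append this instruction's (toy) encoding to `buf`.
    fn emit(&self, buf: &mut Vec<u8>) {
        let start = buf.len();
        match self {
            Inst::Nop => buf.push(0x90),
            Inst::LoadImm32(imm) => {
                buf.push(0xB8);
                buf.extend_from_slice(&imm.to_le_bytes());
            }
        }
        // Cross-check the claimed bound against what was really emitted.
        let emitted = buf.len() - start;
        debug_assert!(
            emitted <= Self::worst_case_size(),
            "emitted {} bytes, worst case is {}",
            emitted,
            Self::worst_case_size()
        );
    }
}
```

Because `debug_assert!` is compiled out of release builds, the check costs nothing in production while still catching a violated bound in tests and debug runs.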

Wasmtime GitHub notifications bot (May 18 2020 at 22:43):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 22:43):

cfallin created PR Review Comment:

Added assert.

Wasmtime GitHub notifications bot (May 18 2020 at 23:04):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 23:04):

cfallin created PR Review Comment:

Added to the top of machinst/buffer.rs.

Wasmtime GitHub notifications bot (May 18 2020 at 23:04):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 23:04):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 23:04):

cfallin created PR Review Comment:

Done.

Wasmtime GitHub notifications bot (May 18 2020 at 23:05):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 23:05):

cfallin created PR Review Comment:

Clarified; this is the same as the original "lowered" form, but we just call out the actual semantics and purpose a bit more explicitly now rather than conflating it with the branch-lowering process.

Wasmtime GitHub notifications bot (May 18 2020 at 23:05):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 18 2020 at 23:05):

cfallin created PR Review Comment:

Done.

Wasmtime GitHub notifications bot (May 18 2020 at 23:06):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 23:07):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 18 2020 at 23:25):

cfallin updated PR #1718 from machinst-codebuffer to master.

Wasmtime GitHub notifications bot (May 19 2020 at 04:36):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 19 2020 at 14:17):

cfallin merged PR #1718.

Last updated: Apr 17 2025 at 21:03 UTC