cfallin requested fitzgen for a review on PR #10086.
cfallin requested wasmtime-compiler-reviewers for a review on PR #10086.
cfallin opened PR #10086 from cfallin:lets-keep-our-basic-blocks-basic
to bytecodealliance:main
:
In #9980, we saw that code copmiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, between two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend.
When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit two-target conditionals that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the
MachBuffer
: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences.Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of two conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent.
I write "almost" because the register allocator is allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true.
The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with edges; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug.
This PR removes
JmpIf
and converts the br-of-fcmp lowering to useJmpCondOr
instead, which is a pseudoinstruction that doesjcc1; jcc2; jmp
. This maintains the BB invariant and fixes the bug.Note that Winch still uses
JmpIf
, so we cannot remove it entirely: this PR renames it toWinchJmpIf
instead, and adds a mechanism to assert failure if it is ever added toVCode
(rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch'sjmp_if
assembler function in terms ofJmpCond
with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can removeWinchJmpIf
.This is one of two instances of a "one-armed branch"; the other is s390x's
OneWayCondBr
, used inbr_table
lowerings, which we will address separately. Once we do, that will address #9980 entirely.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin updated PR #10086.
cfallin commented on PR #10086:
(Thanks to @alexcrichton and others for bottoming this out into the root cause; discussed earlier today in the Cranelift meeting.)
cfallin edited PR #10086:
In #9980, we saw that code compiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, between two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend.
When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit two-target conditionals that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the
MachBuffer
: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences.Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of two conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent.
I write "almost" because the register allocator is allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true.
The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with edges; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug.
This PR removes
JmpIf
and converts the br-of-fcmp lowering to useJmpCondOr
instead, which is a pseudoinstruction that doesjcc1; jcc2; jmp
. This maintains the BB invariant and fixes the bug.Note that Winch still uses
JmpIf
, so we cannot remove it entirely: this PR renames it toWinchJmpIf
instead, and adds a mechanism to assert failure if it is ever added toVCode
(rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch'sjmp_if
assembler function in terms ofJmpCond
with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can removeWinchJmpIf
.This is one of two instances of a "one-armed branch"; the other is s390x's
OneWayCondBr
, used inbr_table
lowerings, which we will address separately. Once we do, that will address #9980 entirely.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
fitzgen submitted PR review:
Very nice! Thanks for trying to be defensive with the low-level branching instruction bits as well. Kind of annoying that we are dealing with that. Hopefully a separate assembler library will help here...
fitzgen created PR review comment:
Mind writing a test for this case where all three branches can get chomped and replaced with a fallthrough?
github-actions[bot] commented on PR #10086:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "winch"Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin updated PR #10086.
cfallin updated PR #10086.
cfallin submitted PR review.
cfallin created PR review comment:
Done! It ends up chomping five branches in a row due to critical-edge splitting actually (two edges from one block to another; the edge blocks end up empty) which is neat to see.
cfallin updated PR #10086.
cfallin has enabled auto merge for PR #10086.
cfallin merged PR #10086.
Last updated: Jan 24 2025 at 00:11 UTC