jameysharp labeled issue #5022:
.clif
Test CaseThe simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:
v257 = rotl v255, v256 ; v256 = 0 v255 -> v257
Steps to Reproduce
I believe the simplest way to trigger this bug is:
- Use cranelift-frontend to construct SSA from non-SSA
- Create a non-entry block and do the remaining steps in that block
- Declare a new variable
- Add an instruction which uses the value returned by
use_var
on that variable- Define the variable using the result of the instruction
- End the block with a branch to itself
- Seal the block
It also happens if there's a cycle of multiple blocks that each have a single predecessor.
I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.
We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.
Expected Results
CLIF in SSA form.
Actual Results
SSABuilder::finish_predecessors_lookup
sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the originaluse_var
result to an alias for the value given todef_var
. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.
In #5020 this cycle was detected by the
simple_preopt
pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g.
rotr
andishl
are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".Versions and Environment
Cranelift version or commit: on
main
for at least the last month, as observed in these issues:Operating system: observed on Linux (and Windows I think?)
Architecture: observed on aarch64 and x64
Extra Info
I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?
jameysharp labeled issue #5022:
.clif
Test CaseThe simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:
v257 = rotl v255, v256 ; v256 = 0 v255 -> v257
Steps to Reproduce
I believe the simplest way to trigger this bug is:
- Use cranelift-frontend to construct SSA from non-SSA
- Create a non-entry block and do the remaining steps in that block
- Declare a new variable
- Add an instruction which uses the value returned by
use_var
on that variable- Define the variable using the result of the instruction
- End the block with a branch to itself
- Seal the block
It also happens if there's a cycle of multiple blocks that each have a single predecessor.
I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.
We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.
Expected Results
CLIF in SSA form.
Actual Results
SSABuilder::finish_predecessors_lookup
sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the originaluse_var
result to an alias for the value given todef_var
. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.
In #5020 this cycle was detected by the
simple_preopt
pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g.
rotr
andishl
are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".Versions and Environment
Cranelift version or commit: on
main
for at least the last month, as observed in these issues:Operating system: observed on Linux (and Windows I think?)
Architecture: observed on aarch64 and x64
Extra Info
I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?
jameysharp opened issue #5022:
.clif
Test CaseThe simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:
v257 = rotl v255, v256 ; v256 = 0 v255 -> v257
Steps to Reproduce
I believe the simplest way to trigger this bug is:
- Use cranelift-frontend to construct SSA from non-SSA
- Create a non-entry block and do the remaining steps in that block
- Declare a new variable
- Add an instruction which uses the value returned by
use_var
on that variable- Define the variable using the result of the instruction
- End the block with a branch to itself
- Seal the block
It also happens if there's a cycle of multiple blocks that each have a single predecessor.
I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.
We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.
Expected Results
CLIF in SSA form.
Actual Results
SSABuilder::finish_predecessors_lookup
sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the originaluse_var
result to an alias for the value given todef_var
. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.
In #5020 this cycle was detected by the
simple_preopt
pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g.
rotr
andishl
are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".Versions and Environment
Cranelift version or commit: on
main
for at least the last month, as observed in these issues:Operating system: observed on Linux (and Windows I think?)
Architecture: observed on aarch64 and x64
Extra Info
I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?
jameysharp labeled issue #5022:
.clif
Test CaseThe simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:
v257 = rotl v255, v256 ; v256 = 0 v255 -> v257
Steps to Reproduce
I believe the simplest way to trigger this bug is:
- Use cranelift-frontend to construct SSA from non-SSA
- Create a non-entry block and do the remaining steps in that block
- Declare a new variable
- Add an instruction which uses the value returned by
use_var
on that variable- Define the variable using the result of the instruction
- End the block with a branch to itself
- Seal the block
It also happens if there's a cycle of multiple blocks that each have a single predecessor.
I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.
We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.
Expected Results
CLIF in SSA form.
Actual Results
SSABuilder::finish_predecessors_lookup
sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the originaluse_var
result to an alias for the value given todef_var
. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.
In #5020 this cycle was detected by the
simple_preopt
pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g.
rotr
andishl
are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".Versions and Environment
Cranelift version or commit: on
main
for at least the last month, as observed in these issues:Operating system: observed on Linux (and Windows I think?)
Architecture: observed on aarch64 and x64
Extra Info
I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?
cfallin commented on issue #5022:
@jameysharp and I talked offline a bit about this issue yesterday. In the end I think we started to converge on a viewpoint that accepting unreachable code in general creates edge-cases such as this whose resulting complexity may not be worth handling.
From the IR-consumer perspective, handling only inputs with reachable blocks is unambiguously easier: it avoids cases like the one in this issue, and has other nice properties too (one knows for sure that a traversal from the entry will reach all code, etc).
From the IR-producer perspective, it can be somewhat annoying to face this requirement: it seems at odds with a compiler that otherwise accepts arbitrary unoptimized code and promises to optimize it. (Dead code is a kind of unoptimized code!) But it seems a little bit different to me than other sorts of unoptimized code (in a way that's hard to describe exactly, but I'll note that e.g.
cranelift-wasm
goes to some lengths to handle unreachable program points specially).In my experience with regalloc2, as well, accepting unreachable code created a ton of issues and in the end I imposed the requirement that input blocks are reachable. Cranelift already provided this property in its lowered code (even with opts turned off) but this created issues in the fuzz testcase generator. IIUC, this and related issues are being discovered by cranelift-fuzzgen.
My solution in regalloc2's case was to design the fuzz testcase generator in a way that produces reachable code by construction: we construct a "spine" and then create arbitrary branches (forward or back) along that spine. This can, I think, create any CFG shape that we want to test, and also can be configured to produce either reducible control flow only, or allow irreducible backedges. I might suggest taking an approach like this in cranelift-fuzzgen too. An alternative would be to simply reject inputs with unreachable blocks (i.e. silently accept / bail out early) but that probably would have significant impact on fuzzing efficiency.
afonso360 commented on issue #5022:
we construct a "spine" and then create arbitrary branches (forward or back) along that spine.
Yeah, this sounds like its fairly easy to do in fuzzgen.
From the IR-producer perspective
I'm worried if this is going to impact the ease of use of cranelift for smaller users. It sounds like it shouldn't happen if someone is just translating AST -> Cranelift blocks right? I'd like for the use case of someone just building a random language and translating it to clif to be as simple as possible.
If that's not a concern then yeah, I'm all for it. This is the third (I think) issue that fuzzgen finds with unreachable blocks and it sounds like It would simplify the frontend.
afonso360 edited a comment on issue #5022:
we construct a "spine" and then create arbitrary branches (forward or back) along that spine.
Yeah, this sounds like its fairly easy to do in fuzzgen.
From the IR-producer perspective
I'm worried if this is going to impact the ease of use of cranelift for smaller users. It sounds like it shouldn't happen if someone is just translating AST -> Cranelift blocks right? I'd like for the use case of someone just building a random language and translating it to CLIF to be as simple as possible.
If that's not a concern then yeah, I'm all for it. This is the third (I think) issue that fuzzgen finds with unreachable blocks and it sounds like It would simplify the frontend.
bjorn3 commented on issue #5022:
One option would be to let cranelift-frontend remove all unreachable blocks when finalizing.
cfallin commented on issue #5022:
Re: impact on simple AST-to-CLIF translation: I think that this should never produce unreachable code, by construction, for a structured-control-flow AST (if-else nodes, switch nodes, loop nodes, etc).
Translation from an arbitrary CFG into CLIF could potentially run into issues but the producer could do a DFS to set "reachable" bits beforehand; this is a slight step up from "as simple as possible" but IMHO not terrible either: if one already has an arbitrary CFG data structure, one has likely built some utilities up already, and a 20-line reachability filter is not so bad.
bjorn3 commented on issue #5022:
For the record cg_clif currently unconditionallt runs eliminate_unreachable_code: https://github.com/bjorn3/rustc_codegen_cranelift/blob/dae6a30d0b888a02e8b3a450510b8683f920cb16/src/base.rs#L158 This is because the iconst.i128 0 that can be inserted by cranelift-frontend in unreachable blocks weren't handled by Cranelift in the past.
afonso360 commented on issue #5022:
Re: impact on simple AST-to-CLIF translation: I think that this should never produce unreachable code, by construction, for a structured-control-flow AST (if-else nodes, switch nodes, loop nodes, etc).
Translation from an arbitrary CFG into CLIF could potentially run into issues but the producer could do a DFS to set "reachable" bits beforehand; this is a slight step up from "as simple as possible" but IMHO not terrible either: if one already has an arbitrary CFG data structure, one has likely built some utilities up already, and a 20-line reachability filter is not so bad.
Great! That sounds pretty reasonable and solves my concerns. I'm going to update fuzzgen to do the "spine" block construction.
jameysharp commented on issue #5022:
I've been trying to figure out if it's possible to trigger this bug entirely in reachable code:
- Declare a variable, but don't define it in the entry block.
- Jump from the entry block to block1.
- Use the variable in an instruction in block1.
- Define the variable to the result of the instruction.
- Branch back to the beginning of block1.
I guess that case actually works today because when it tries to remove the phi node from block1, it recursively does a
use_var
in block0, doesn't find a definition there, and introduces a 0. Then there are two definitions reaching the phi node, so it can't be removed.Also: One of the approaches that Chris and I had discussed was for cranelift-frontend to raise an error if
use_var
gets called in a block that isn't _currently_ reachable, even if predecessors could be added later that make it reachable. That general idea may turn out to be too restrictive even for Wasmtime. I noticed this week that cranelift-wasm constructs a single exit block for the function before translating the body of the function, and I think it does the same with blocks following conditionals and loops.I'm not sure exactly what invariant we need in SSA construction, so it's hard to say whether we can check that invariant incrementally without breaking reasonable frontends. But we can at least get fuzzgen to not trigger this bug, so that's still worth doing.
Last updated: Jan 24 2025 at 00:11 UTC