Stream: git-wasmtime

Topic: wasmtime / issue #5022 cranelift-frontend: unreachable cy...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 05 2022 at 18:47):

jameysharp labeled issue #5022:

.clif Test Case

The simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:

    v257 = rotl v255, v256  ; v256 = 0
    v255 -> v257

Steps to Reproduce

I believe the simplest way to trigger this bug is:

It also happens if there's a cycle of multiple blocks that each have a single predecessor.

I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.

We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.

Expected Results

CLIF in SSA form.

Actual Results

SSABuilder::finish_predecessors_lookup sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the original use_var result to an alias for the value given to def_var. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.

I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.

In #5020 this cycle was detected by the simple_preopt pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.

The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g. rotr and ishl are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".

Versions and Environment

Cranelift version or commit: on main for at least the last month, as observed in these issues:

Operating system: observed on Linux (and Windows I think?)

Architecture: observed on aarch64 and x64

Extra Info

I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 05 2022 at 18:47):

jameysharp labeled issue #5022:

.clif Test Case

The simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:

    v257 = rotl v255, v256  ; v256 = 0
    v255 -> v257

Steps to Reproduce

I believe the simplest way to trigger this bug is:

It also happens if there's a cycle of multiple blocks that each have a single predecessor.

I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.

We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.

Expected Results

CLIF in SSA form.

Actual Results

SSABuilder::finish_predecessors_lookup sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the original use_var result to an alias for the value given to def_var. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.

I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.

In #5020 this cycle was detected by the simple_preopt pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.

The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g. rotr and ishl are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".

Versions and Environment

Cranelift version or commit: on main for at least the last month, as observed in these issues:

Operating system: observed on Linux (and Windows I think?)

Architecture: observed on aarch64 and x64

Extra Info

I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 05 2022 at 18:47):

jameysharp opened issue #5022:

.clif Test Case

The simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:

    v257 = rotl v255, v256  ; v256 = 0
    v255 -> v257

Steps to Reproduce

I believe the simplest way to trigger this bug is:

It also happens if there's a cycle of multiple blocks that each have a single predecessor.

I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.

We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.

Expected Results

CLIF in SSA form.

Actual Results

SSABuilder::finish_predecessors_lookup sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the original use_var result to an alias for the value given to def_var. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.

I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.

In #5020 this cycle was detected by the simple_preopt pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.

The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g. rotr and ishl are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".

Versions and Environment

Cranelift version or commit: on main for at least the last month, as observed in these issues:

Operating system: observed on Linux (and Windows I think?)

Architecture: observed on aarch64 and x64

Extra Info

I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 05 2022 at 18:47):

jameysharp labeled issue #5022:

.clif Test Case

The simplest case I've seen came from cranelift-fuzzgen in #4875, where the relevant CLIF produced by cranelift-frontend included these two lines:

    v257 = rotl v255, v256  ; v256 = 0
    v255 -> v257

Steps to Reproduce

I believe the simplest way to trigger this bug is:

It also happens if there's a cycle of multiple blocks that each have a single predecessor.

I think it also can happen if there are multiple predecessors for some blocks in the cycle, as long as the only definition which reaches the instruction comes from the same instruction.

We've also seen it happen if there is a dependency cycle across multiple instructions, where each instruction defines a variable used by the next instruction in the cycle. It's not limited to the case where an instruction depends directly on its own results.

Expected Results

CLIF in SSA form.

Actual Results

SSABuilder::finish_predecessors_lookup sees that the variable does have a definition in the cycle, so it doesn't insert a constant zero. It also sees that the variable doesn't have any other definitions reaching the use, so it deletes the phi node and changes the original use_var result to an alias for the value given to def_var. That leads to a cycle through value aliases, and the resulting CLIF is not in SSA form.

I believe the Cranelift validator has run on the cases we've seen, without detecting this issue.

In #5020 this cycle was detected by the simple_preopt pass. There was a series of shifts and rotates where the shift amount was 0 in each case, so the pass tried to replace each result with an alias to the instruction's first argument. However once it had replaced all of them, the alias pointed to itself.

The other way we've detected this bug is when we print the generated CLIF to a file, and trying to parse that file again fails. If no instruction in the cycle constrains the variable's type because they're all polymorphic (as e.g. rotr and ishl are), the parser fails with a message like "type variable required for polymorphic opcode, e.g. 'rotl.i32'; can't infer from v255 which is not yet defined".

Versions and Environment

Cranelift version or commit: on main for at least the last month, as observed in these issues:

Operating system: observed on Linux (and Windows I think?)

Architecture: observed on aarch64 and x64

Extra Info

I'm not sure how to fix this yet, but I want to thank @afonso360 for reminding me to dig into why it's happening. @cfallin, any thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 06 2022 at 18:48):

cfallin commented on issue #5022:

@jameysharp and I talked offline a bit about this issue yesterday. In the end I think we started to converge on a viewpoint that accepting unreachable code in general creates edge-cases such as this whose resulting complexity may not be worth handling.

From the IR-consumer perspective, handling only inputs with reachable blocks is unambiguously easier: it avoids cases like the one in this issue, and has other nice properties too (one knows for sure that a traversal from the entry will reach all code, etc).

From the IR-producer perspective, it can be somewhat annoying to face this requirement: it seems at odds with a compiler that otherwise accepts arbitrary unoptimized code and promises to optimize it. (Dead code is a kind of unoptimized code!) But it seems a little bit different to me than other sorts of unoptimized code (in a way that's hard to describe exactly, but I'll note that e.g. cranelift-wasm goes to some lengths to handle unreachable program points specially).

In my experience with regalloc2, as well, accepting unreachable code created a ton of issues and in the end I imposed the requirement that input blocks are reachable. Cranelift already provided this property in its lowered code (even with opts turned off) but this created issues in the fuzz testcase generator. IIUC, this and related issues are being discovered by cranelift-fuzzgen.

My solution in regalloc2's case was to design the fuzz testcase generator in a way that produces reachable code by construction: we construct a "spine" and then create arbitrary branches (forward or back) along that spine. This can, I think, create any CFG shape that we want to test, and also can be configured to produce either reducible control flow only, or allow irreducible backedges. I might suggest taking an approach like this in cranelift-fuzzgen too. An alternative would be to simply reject inputs with unreachable blocks (i.e. silently accept / bail out early) but that probably would have significant impact on fuzzing efficiency.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 08:45):

afonso360 commented on issue #5022:

we construct a "spine" and then create arbitrary branches (forward or back) along that spine.

Yeah, this sounds like its fairly easy to do in fuzzgen.

From the IR-producer perspective

I'm worried if this is going to impact the ease of use of cranelift for smaller users. It sounds like it shouldn't happen if someone is just translating AST -> Cranelift blocks right? I'd like for the use case of someone just building a random language and translating it to clif to be as simple as possible.

If that's not a concern then yeah, I'm all for it. This is the third (I think) issue that fuzzgen finds with unreachable blocks and it sounds like It would simplify the frontend.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 08:45):

afonso360 edited a comment on issue #5022:

we construct a "spine" and then create arbitrary branches (forward or back) along that spine.

Yeah, this sounds like its fairly easy to do in fuzzgen.

From the IR-producer perspective

I'm worried if this is going to impact the ease of use of cranelift for smaller users. It sounds like it shouldn't happen if someone is just translating AST -> Cranelift blocks right? I'd like for the use case of someone just building a random language and translating it to CLIF to be as simple as possible.

If that's not a concern then yeah, I'm all for it. This is the third (I think) issue that fuzzgen finds with unreachable blocks and it sounds like It would simplify the frontend.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 10:00):

bjorn3 commented on issue #5022:

One option would be to let cranelift-frontend remove all unreachable blocks when finalizing.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 20:41):

cfallin commented on issue #5022:

Re: impact on simple AST-to-CLIF translation: I think that this should never produce unreachable code, by construction, for a structured-control-flow AST (if-else nodes, switch nodes, loop nodes, etc).

Translation from an arbitrary CFG into CLIF could potentially run into issues but the producer could do a DFS to set "reachable" bits beforehand; this is a slight step up from "as simple as possible" but IMHO not terrible either: if one already has an arbitrary CFG data structure, one has likely built some utilities up already, and a 20-line reachability filter is not so bad.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 21:26):

bjorn3 commented on issue #5022:

For the record cg_clif currently unconditionallt runs eliminate_unreachable_code: https://github.com/bjorn3/rustc_codegen_cranelift/blob/dae6a30d0b888a02e8b3a450510b8683f920cb16/src/base.rs#L158 This is because the iconst.i128 0 that can be inserted by cranelift-frontend in unreachable blocks weren't handled by Cranelift in the past.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2022 at 22:28):

afonso360 commented on issue #5022:

Re: impact on simple AST-to-CLIF translation: I think that this should never produce unreachable code, by construction, for a structured-control-flow AST (if-else nodes, switch nodes, loop nodes, etc).

Translation from an arbitrary CFG into CLIF could potentially run into issues but the producer could do a DFS to set "reachable" bits beforehand; this is a slight step up from "as simple as possible" but IMHO not terrible either: if one already has an arbitrary CFG data structure, one has likely built some utilities up already, and a 20-line reachability filter is not so bad.

Great! That sounds pretty reasonable and solves my concerns. I'm going to update fuzzgen to do the "spine" block construction.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2022 at 16:24):

jameysharp commented on issue #5022:

I've been trying to figure out if it's possible to trigger this bug entirely in reachable code:

I guess that case actually works today because when it tries to remove the phi node from block1, it recursively does a use_var in block0, doesn't find a definition there, and introduces a 0. Then there are two definitions reaching the phi node, so it can't be removed.

Also: One of the approaches that Chris and I had discussed was for cranelift-frontend to raise an error if use_var gets called in a block that isn't _currently_ reachable, even if predecessors could be added later that make it reachable. That general idea may turn out to be too restrictive even for Wasmtime. I noticed this week that cranelift-wasm constructs a single exit block for the function before translating the body of the function, and I think it does the same with blocks following conditionals and loops.

I'm not sure exactly what invariant we need in SSA construction, so it's hard to say whether we can check that invariant incrementally without breaking reasonable frontends. But we can at least get fuzzgen to not trigger this bug, so that's still worth doing.


Last updated: Oct 23 2024 at 20:03 UTC