wasmtime / issue #7857 Block order and value number affec... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #7857 Block order and value number affec...

Wasmtime GitHub notifications bot (Feb 01 2024 at 17:57):

alexcrichton opened issue #7857:

Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.

This input:

(module
  (func
    (local f32)
    f32.const 100
    f32.sqrt
    i32.const 0
    if
      f32.const 100
      f32.sqrt
      block ;; label = @5
        i32.const 1
        br_if 0 (;@6;)
        f32.const 0
        local.set 0
      end
      local.get 0
      i32.const 1
      select
      i32.reinterpret_f32
      global.set 0
    end
    i32.reinterpret_f32
    global.set 0
  )
  (global (;0;) (mut i32) i32.const 0)
)

will panic in regalloc

$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat
...
thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14:
register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This surfaces a validation error in CLIF earlier with validation enabled

$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat
...
                                    v19 = f32const +NaN
                                    v4 = select v18, v19, v17  ; v19 = +NaN
@0043                               v12 = select v8, v4, v10  ; v8 = 1
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst12 (v12 = select.f32 v8, v4, v10  ; v8 = 1): uses value arg from non-dominating block4

@0044                               v13 = bitcast.i32 v12
@0049                               store notrap aligned table v13, v0+80
@004b                               jump block1

                                block1:
@004b                               return
}

; 1 verifier error detected (see above). Compilation aborted.

This has been further reduced to this CLIF test case:

test optimize
set enable_verifier=true
set opt_level=speed
target x86_64

function %foo(i64) {
block0(v0: i64):
    v3 = f32const 0x1.900000p6
    v17 = sqrt v3
    v18 = fcmp ne v17, v17
    v19 = f32const +NaN
    v4 = select v18, v19, v17
    v5 = iconst.i32 0
    brif v5, block2, block3

block2:
    v6 = f32const 0x1.900000p6
    v20 = sqrt v6
    v21 = fcmp ne v20, v20
    v22 = f32const +NaN
    v7 = select v21, v22, v20
    v8 = iconst.i32 1
    v2 = f32const 0.0
    brif v8, block4(v2), block5

block5:
    v9 = f32const 0.0
    jump block4(v9)

block4(v10: f32):
    v11 = iconst.i32 1
    v12 = select.f32 v11, v7, v10
    v13 = bitcast.i32 v12
    store notrap aligned table v13, v0+80
    return

block3:
    v15 = bitcast.i32 v4
    store notrap aligned table v15, v0+80
    return
}

which can be reproduced with:

$ cd cranelift && cargo run test ./foo.clif

I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:

One fix is to renumber the original v12 input to v1 in CLIF.
Another fix is to move block3 to be beneath block0.
The final fix is to change this line to (subsume x)

Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.

Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.

Wasmtime GitHub notifications bot (Feb 01 2024 at 17:57):

alexcrichton added the bug label to Issue #7857.

Wasmtime GitHub notifications bot (Feb 01 2024 at 17:57):

alexcrichton added the fuzz-bug label to Issue #7857.

Wasmtime GitHub notifications bot (Feb 01 2024 at 17:57):

alexcrichton added the cranelift label to Issue #7857.

Wasmtime GitHub notifications bot (Feb 01 2024 at 17:57):

alexcrichton edited issue #7857:

Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.

This input:

(module
  (func
    (local f32)
    f32.const 100
    f32.sqrt
    i32.const 0
    if
      f32.const 100
      f32.sqrt
      block
        i32.const 1
        br_if 0
        f32.const 0
        local.set 0
      end
      local.get 0
      i32.const 1
      select
      i32.reinterpret_f32
      global.set 0
    end
    i32.reinterpret_f32
    global.set 0
  )
  (global (;0;) (mut i32) i32.const 0)
)

will panic in regalloc

$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat
...
thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14:
register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This surfaces a validation error in CLIF earlier with validation enabled

$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat
...
                                    v19 = f32const +NaN
                                    v4 = select v18, v19, v17  ; v19 = +NaN
@0043                               v12 = select v8, v4, v10  ; v8 = 1
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst12 (v12 = select.f32 v8, v4, v10  ; v8 = 1): uses value arg from non-dominating block4

@0044                               v13 = bitcast.i32 v12
@0049                               store notrap aligned table v13, v0+80
@004b                               jump block1

                                block1:
@004b                               return
}

; 1 verifier error detected (see above). Compilation aborted.

This has been further reduced to this CLIF test case:

test optimize
set enable_verifier=true
set opt_level=speed
target x86_64

function %foo(i64) {
block0(v0: i64):
    v3 = f32const 0x1.900000p6
    v17 = sqrt v3
    v18 = fcmp ne v17, v17
    v19 = f32const +NaN
    v4 = select v18, v19, v17
    v5 = iconst.i32 0
    brif v5, block2, block3

block2:
    v6 = f32const 0x1.900000p6
    v20 = sqrt v6
    v21 = fcmp ne v20, v20
    v22 = f32const +NaN
    v7 = select v21, v22, v20
    v8 = iconst.i32 1
    v2 = f32const 0.0
    brif v8, block4(v2), block5

block5:
    v9 = f32const 0.0
    jump block4(v9)

block4(v10: f32):
    v11 = iconst.i32 1
    v12 = select.f32 v11, v7, v10
    v13 = bitcast.i32 v12
    store notrap aligned table v13, v0+80
    return

block3:
    v15 = bitcast.i32 v4
    store notrap aligned table v15, v0+80
    return
}

which can be reproduced with:

$ cd cranelift && cargo run test ./foo.clif

I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:

One fix is to renumber the original v12 input to v1 in CLIF.
Another fix is to move block3 to be beneath block0.
The final fix is to change this line to (subsume x)

Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.

Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.

Wasmtime GitHub notifications bot (Feb 01 2024 at 19:24):

bjorn3 commented on issue #7857:

With a small patch to get bugpoint to actually work on this test case and a fair amount of manual work I got it reduced to:

test optimize
set enable_verifier=true
set opt_level=speed
target x86_64

function %foo() fast {
block0:
    v0 = iconst.i64 0
    v2 = f32const 0.0
    v9 = f32const 0.0
    v20 = fneg v2  ; v2 = 0.0
    v18 = fcmp eq v20, v20
    v4 = select v18, v2, v20  ; v2 = 0.0
    v8 = iconst.i32 0
    v11 = iconst.i32 1
    brif v0, block2, block3  ; v0 = 0

block2:
    brif.i32 v8, block4(v2), block4(v9)  ; v8 = 0, v2 = 0.0, v9 = 0.0

block4(v10: f32):
    v12 = select.f32 v11, v4, v10  ; v11 = 1
    v13 = bitcast.i32 v12
    store v13, v0  ; v0 = 0
    trap user0

block3:
    v15 = bitcast.i32 v4
    store v15, v0  ; v0 = 0
    return
}

which gives the following verifier error:

Error: function %foo() fast {
block0:
    v0 = iconst.i64 0
    brif v0, block2, block3  ; v0 = 0

block2:
    v8 = iconst.i32 0
    v23 = f32const 0.0
    brif v8, block4(v23), block4(v23)  ; v8 = 0, v23 = 0.0, v23 = 0.0

block4(v10: f32):
    v24 = iconst.i32 1
    v25 = fneg.f32 v23  ; v23 = 0.0
    v26 = fcmp eq v25, v25
    v27 = select.f32 v26, v23, v25  ; v23 = 0.0
    v28 = select v24, v27, v10  ; v24 = 1
    v29 = bitcast.i32 v28
    v30 = iconst.i64 0
    store v29, v30  ; v30 = 0
    trap user0

block3:
    v11 = iconst.i32 1
    v2 = f32const 0.0
    v20 = fneg v2  ; v2 = 0.0
    v18 = fcmp eq v20, v20
    v4 = select v18, v2, v20  ; v2 = 0.0
    v12 = select v11, v4, v10  ; v11 = 1
;   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst10 (v12 = select.f32 v11, v4, v10  ; v11 = 1): uses value arg from non-dominating block4

    v13 = bitcast.i32 v12
    v22 = iconst.i64 0
    store v13, v22  ; v22 = 0
    return
}

; 1 verifier error detected (see above). Compilation aborted.

Wasmtime GitHub notifications bot (Feb 01 2024 at 20:02):

cfallin commented on issue #7857:

I took a quick look at this (I'm still on leave, only have a brief window here, but figured my input could be useful after seeing this go by).

It seems to me at least that there's an issue with the cost function: the select is union'ing with one of its inputs (because it is being constant-folded), but then another user of that input elsewhere is picking the select instead. (Given x and select true, x, bla one should always pick x.) This is compounded by the fact that the bla is dependent on a blockparam that's not available at the other use's site.

subsume is a bandaid over that but really we should figure out why select true, x, bla doesn't have higher cost than x. Both v12 and v4 have saturated their cost; I didn't get a chance to look into why.

Alternately, in a universe where we have use-site-specific costs, we could push the cost of blockparams that are not available (not in a dominating block) to infinity. But we've avoided use-site-specific costs so far because that removes the ability to have the nice linear-time dynamic programming pass to compute them...

(Why hasn't this come up before? The x -> func(x, other_stuff) rewrite is a violation of the correctness condition for rewrites that we have in opts/README.md -- "using the same inputs as the original, or fewer" -- not in the sense that the original rule is written that way, but in the sense that equivalences are bidirectional and so reading the rule backwards produces that outcome. The cost function should ordinarily prevent the rewrite from being used "backwards" and I suspect our "infinity" encoding/scheme has some issues...)

Wasmtime GitHub notifications bot (Feb 01 2024 at 22:51):

alexcrichton commented on issue #7857:

@elliottt, @fitzgen, and I have been debugging this quite a bit today. Sorry haven't fully caught up on the latest comments. Our findings are:

It's possible to leverage this bug to elaborate effectful instructions, such as function calls, where they shouldn't be. This is a sign that we should assert that during elaboration no effectful instructions should be added, only pure instructions.

One issue here is that compute_best_values only runs one iteration going through the values. The input program here is constructed such that this one pass does not calculate an accurate cost of nodes such as v4 and v7. This explains why v12 is chosen to elaborate because all items in the eclass have near-infinite cost and this heuristic prefers larger nodes.

The "root" issue here feels that there's an eclass which contains v4 and v12. The v12 expression relies on a value not in scope with v4, meaning that it seems like there needs to be some sort of concept of "scope" with eclasses. We had no idea how this would be implemented and it sounds scary.

@fitzgen is going to leave a further comment about the possibility of running compute_best_values to a fixpoint to get a more accurate cost.

Answering some questions reading over your comment now @cfallin

we should figure out why select true, x, bla doesn't have higher cost than x

We've concluded this is due to the numbers of values here. Due to the single pass in compute_best_values everything ends up with near-infinite cost. The values are out-of-order here due to the nan canonicalization pass running after CLIF construction in Wasmtime.

Why hasn't this come up before?

My naive understanding at this point is that if all values are defined with values defined before it then we can correctly calculate cost and will deterministically pick x over f(x, other). This is not a great explanation though and there may be something deeper.

Wasmtime GitHub notifications bot (Feb 01 2024 at 23:03):

fitzgen commented on issue #7857:

One issue here is that compute_best_values only runs one iteration going through the values.

I think we should do a fixed point here, and I know that is a scary thing to just drop in chat, but I think it is actually fine. The maximum number of iterations the fixed-point loop would take is equal to the longest chain of vNN-to-vMM edges in the the dataflow graph where NN > MM. So in the worst case this is O(n^2), yes. But the Wasm frontend doesn't introduce any of these edges. Cranelift itself can introduce them during NaN canonicalization, but this will be chains that are only one or two links long. Therefore, with Wasmtime, Cranelift would _never_ do more than a handful of iterations. Other frontends would however have the responsibility of avoiding the worst case value numbering themselves.

Additionally, this would allow us to remove the funky infinity-minus-one cost clamping as well, which would be a very nice simplification.

I think this might also fix some of the issues discussed in https://github.com/bytecodealliance/wasmtime/issues/6126.

Wasmtime GitHub notifications bot (Feb 02 2024 at 00:59):

cfallin commented on issue #7857:

The "root" issue here feels that there's an eclass which contains v4 and v12. The v12 expression relies on a value not in scope with v4, meaning that it seems like there needs to be some sort of concept of "scope" with eclasses. We had no idea how this would be implemented and it sounds scary.

This (two values in the eclass) is "right and good and normal" I think -- if v12 derives from v4 and we realize that the def of v12 simplifies down to v12 := v4, then there is no other reasonable way to represent this but to put them in the same eclass. Doing otherwise (e.g. having a "copy" operator or something of the sort) then makes each instance unique (and e.g. we couldn't collapse v14 and v15 in v12 := v4; v13 := v4; v14 = f(v12); v15 = f(v13)). Basically, if something is truly equal to a blockparam then we need to represent it as such, or else all of the simplification stops prematurely.

The key I think is the directionality, and for that we need to stick to the "correctness condition" we've stated for rewrite rules -- removing but not adding inputs -- in the extraction (picking nodes for each class) too. If one enode in an eclass depends on v1, and another depends on v1, v2, we need to pick the former. This is because while we may have written the rule to "shrink" the input set, equivalence is bidirectional, so a broken extraction could actually use the bigger input set where we originally had the smaller.

So all this leads me to come again to the cost function, and I agree @fitzgen's approach feels like the right one. Basically we need to (i) define costs correctly, so a rewrite rule that "shrinks" the input set also results in a node with lower cost -- this should already be the case, and is a separate bug if not -- and (ii) compute costs correctly according to their definition, which the fixpoint approach should do.

Wasmtime GitHub notifications bot (Feb 02 2024 at 02:21):

fitzgen edited a comment on issue #7857:

One issue here is that compute_best_values only runs one iteration going through the values.

I think we should do a fixed point here, and I know that is a scary thing to just drop in chat, but I think it is actually fine. The maximum number of iterations the fixed-point loop would take is equal to the longest chain of vNN-to-vMM edges in the the dataflow graph where NN < MM. So in the worst case this is O(n^2), yes. But the Wasm frontend doesn't introduce any of these edges. Cranelift itself can introduce them during NaN canonicalization, but this will be chains that are only one or two links long. Therefore, with Wasmtime, Cranelift would _never_ do more than a handful of iterations. Other frontends would however have the responsibility of avoiding the worst case value numbering themselves.

Additionally, this would allow us to remove the funky infinity-minus-one cost clamping as well, which would be a very nice simplification.

I think this might also fix some of the issues discussed in https://github.com/bytecodealliance/wasmtime/issues/6126.

Wasmtime GitHub notifications bot (Feb 02 2024 at 03:40):

fitzgen commented on issue #7857:

I have a fix in https://github.com/bytecodealliance/wasmtime/pull/7859

Wasmtime GitHub notifications bot (Feb 05 2024 at 15:04):

alexcrichton commented on issue #7857:

(please feel free to respond to this after you're back @cfallin, no urgency on this of course)

If one enode in an eclass depends on v1, and another depends on v1, v2, we need to pick the former.

In the past I've been under the impression that one of the properties of eclasses/elaboration that we rely on is that correctness is guaranteed because no matter what we choose in an eclass it's guaranteed to have the same program. Put another way I was under the impression that various bits and pieces of code relied on the fact that we can choose anything in an eclass to elaborate. What you're saying I agree with, however, and if the above program is correctly putting everything in the same eclass then the I also agree with the consequence, the cost function is quite important.

Is there perhaps other bits and pieces of elaboration that need to be updated/rethought? Or is this perhaps a mistaken impression I've gotten from egraphs/etc? (I couldn't actually find docs in Cranelift itself saying anything like this searching for a moment).

Wasmtime GitHub notifications bot (Feb 05 2024 at 23:22):

fitzgen closed issue #7857:

Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.

This input:

(module
  (func
    (local f32)
    f32.const 100
    f32.sqrt
    i32.const 0
    if
      f32.const 100
      f32.sqrt
      block
        i32.const 1
        br_if 0
        f32.const 0
        local.set 0
      end
      local.get 0
      i32.const 1
      select
      i32.reinterpret_f32
      global.set 0
    end
    i32.reinterpret_f32
    global.set 0
  )
  (global (;0;) (mut i32) i32.const 0)
)

will panic in regalloc

$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat
...
thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14:
register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This surfaces a validation error in CLIF earlier with validation enabled

$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat
...
                                    v19 = f32const +NaN
                                    v4 = select v18, v19, v17  ; v19 = +NaN
@0043                               v12 = select v8, v4, v10  ; v8 = 1
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst12 (v12 = select.f32 v8, v4, v10  ; v8 = 1): uses value arg from non-dominating block4

@0044                               v13 = bitcast.i32 v12
@0049                               store notrap aligned table v13, v0+80
@004b                               jump block1

                                block1:
@004b                               return
}

; 1 verifier error detected (see above). Compilation aborted.

This has been further reduced to this CLIF test case:

test optimize
set enable_verifier=true
set opt_level=speed
target x86_64

function %foo(i64) {
block0(v0: i64):
    v3 = f32const 0x1.900000p6
    v17 = sqrt v3
    v18 = fcmp ne v17, v17
    v19 = f32const +NaN
    v4 = select v18, v19, v17
    v5 = iconst.i32 0
    brif v5, block2, block3

block2:
    v6 = f32const 0x1.900000p6
    v20 = sqrt v6
    v21 = fcmp ne v20, v20
    v22 = f32const +NaN
    v7 = select v21, v22, v20
    v8 = iconst.i32 1
    v2 = f32const 0.0
    brif v8, block4(v2), block5

block5:
    v9 = f32const 0.0
    jump block4(v9)

block4(v10: f32):
    v11 = iconst.i32 1
    v12 = select.f32 v11, v7, v10
    v13 = bitcast.i32 v12
    store notrap aligned table v13, v0+80
    return

block3:
    v15 = bitcast.i32 v4
    store notrap aligned table v15, v0+80
    return
}

which can be reproduced with:

$ cd cranelift && cargo run test ./foo.clif

I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:

One fix is to renumber the original v12 input to v1 in CLIF.
Another fix is to move block3 to be beneath block0.
The final fix is to change this line to (subsume x)

Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.

Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.

Last updated: Apr 18 2025 at 08:04 UTC