alexcrichton opened issue #7857:
Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.
This input:
(module (func (local f32) f32.const 100 f32.sqrt i32.const 0 if f32.const 100 f32.sqrt block ;; label = @5 i32.const 1 br_if 0 (;@6;) f32.const 0 local.set 0 end local.get 0 i32.const 1 select i32.reinterpret_f32 global.set 0 end i32.reinterpret_f32 global.set 0 ) (global (;0;) (mut i32) i32.const 0) )
will panic in regalloc
$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat ... thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14: register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33)) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This surfaces a validation error in CLIF earlier with validation enabled
$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat ... v19 = f32const +NaN v4 = select v18, v19, v17 ; v19 = +NaN @0043 v12 = select v8, v4, v10 ; v8 = 1 ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; error: inst12 (v12 = select.f32 v8, v4, v10 ; v8 = 1): uses value arg from non-dominating block4 @0044 v13 = bitcast.i32 v12 @0049 store notrap aligned table v13, v0+80 @004b jump block1 block1: @004b return } ; 1 verifier error detected (see above). Compilation aborted.
This has been further reduced to this CLIF test case:
test optimize set enable_verifier=true set opt_level=speed target x86_64 function %foo(i64) { block0(v0: i64): v3 = f32const 0x1.900000p6 v17 = sqrt v3 v18 = fcmp ne v17, v17 v19 = f32const +NaN v4 = select v18, v19, v17 v5 = iconst.i32 0 brif v5, block2, block3 block2: v6 = f32const 0x1.900000p6 v20 = sqrt v6 v21 = fcmp ne v20, v20 v22 = f32const +NaN v7 = select v21, v22, v20 v8 = iconst.i32 1 v2 = f32const 0.0 brif v8, block4(v2), block5 block5: v9 = f32const 0.0 jump block4(v9) block4(v10: f32): v11 = iconst.i32 1 v12 = select.f32 v11, v7, v10 v13 = bitcast.i32 v12 store notrap aligned table v13, v0+80 return block3: v15 = bitcast.i32 v4 store notrap aligned table v15, v0+80 return }
which can be reproduced with:
$ cd cranelift && cargo run test ./foo.clif
I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:
- One fix is to renumber the original
v12
input tov1
in CLIF.- Another fix is to move
block3
to be beneathblock0
.- The final fix is to change this line to
(subsume x)
Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.
Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.
alexcrichton added the bug label to Issue #7857.
alexcrichton added the fuzz-bug label to Issue #7857.
alexcrichton added the cranelift label to Issue #7857.
alexcrichton edited issue #7857:
Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.
This input:
(module (func (local f32) f32.const 100 f32.sqrt i32.const 0 if f32.const 100 f32.sqrt block i32.const 1 br_if 0 f32.const 0 local.set 0 end local.get 0 i32.const 1 select i32.reinterpret_f32 global.set 0 end i32.reinterpret_f32 global.set 0 ) (global (;0;) (mut i32) i32.const 0) )
will panic in regalloc
$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat ... thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14: register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33)) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This surfaces a validation error in CLIF earlier with validation enabled
$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat ... v19 = f32const +NaN v4 = select v18, v19, v17 ; v19 = +NaN @0043 v12 = select v8, v4, v10 ; v8 = 1 ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; error: inst12 (v12 = select.f32 v8, v4, v10 ; v8 = 1): uses value arg from non-dominating block4 @0044 v13 = bitcast.i32 v12 @0049 store notrap aligned table v13, v0+80 @004b jump block1 block1: @004b return } ; 1 verifier error detected (see above). Compilation aborted.
This has been further reduced to this CLIF test case:
test optimize set enable_verifier=true set opt_level=speed target x86_64 function %foo(i64) { block0(v0: i64): v3 = f32const 0x1.900000p6 v17 = sqrt v3 v18 = fcmp ne v17, v17 v19 = f32const +NaN v4 = select v18, v19, v17 v5 = iconst.i32 0 brif v5, block2, block3 block2: v6 = f32const 0x1.900000p6 v20 = sqrt v6 v21 = fcmp ne v20, v20 v22 = f32const +NaN v7 = select v21, v22, v20 v8 = iconst.i32 1 v2 = f32const 0.0 brif v8, block4(v2), block5 block5: v9 = f32const 0.0 jump block4(v9) block4(v10: f32): v11 = iconst.i32 1 v12 = select.f32 v11, v7, v10 v13 = bitcast.i32 v12 store notrap aligned table v13, v0+80 return block3: v15 = bitcast.i32 v4 store notrap aligned table v15, v0+80 return }
which can be reproduced with:
$ cd cranelift && cargo run test ./foo.clif
I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:
- One fix is to renumber the original
v12
input tov1
in CLIF.- Another fix is to move
block3
to be beneathblock0
.- The final fix is to change this line to
(subsume x)
Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.
Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.
bjorn3 commented on issue #7857:
With a small patch to get bugpoint to actually work on this test case and a fair amount of manual work I got it reduced to:
test optimize set enable_verifier=true set opt_level=speed target x86_64 function %foo() fast { block0: v0 = iconst.i64 0 v2 = f32const 0.0 v9 = f32const 0.0 v20 = fneg v2 ; v2 = 0.0 v18 = fcmp eq v20, v20 v4 = select v18, v2, v20 ; v2 = 0.0 v8 = iconst.i32 0 v11 = iconst.i32 1 brif v0, block2, block3 ; v0 = 0 block2: brif.i32 v8, block4(v2), block4(v9) ; v8 = 0, v2 = 0.0, v9 = 0.0 block4(v10: f32): v12 = select.f32 v11, v4, v10 ; v11 = 1 v13 = bitcast.i32 v12 store v13, v0 ; v0 = 0 trap user0 block3: v15 = bitcast.i32 v4 store v15, v0 ; v0 = 0 return }
which gives the following verifier error:
Error: function %foo() fast { block0: v0 = iconst.i64 0 brif v0, block2, block3 ; v0 = 0 block2: v8 = iconst.i32 0 v23 = f32const 0.0 brif v8, block4(v23), block4(v23) ; v8 = 0, v23 = 0.0, v23 = 0.0 block4(v10: f32): v24 = iconst.i32 1 v25 = fneg.f32 v23 ; v23 = 0.0 v26 = fcmp eq v25, v25 v27 = select.f32 v26, v23, v25 ; v23 = 0.0 v28 = select v24, v27, v10 ; v24 = 1 v29 = bitcast.i32 v28 v30 = iconst.i64 0 store v29, v30 ; v30 = 0 trap user0 block3: v11 = iconst.i32 1 v2 = f32const 0.0 v20 = fneg v2 ; v2 = 0.0 v18 = fcmp eq v20, v20 v4 = select v18, v2, v20 ; v2 = 0.0 v12 = select v11, v4, v10 ; v11 = 1 ; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; error: inst10 (v12 = select.f32 v11, v4, v10 ; v11 = 1): uses value arg from non-dominating block4 v13 = bitcast.i32 v12 v22 = iconst.i64 0 store v13, v22 ; v22 = 0 return } ; 1 verifier error detected (see above). Compilation aborted.
cfallin commented on issue #7857:
I took a quick look at this (I'm still on leave, only have a brief window here, but figured my input could be useful after seeing this go by).
It seems to me at least that there's an issue with the cost function: the select is union'ing with one of its inputs (because it is being constant-folded), but then another user of that input elsewhere is picking the select instead. (Given
x
andselect true, x, bla
one should always pickx
.) This is compounded by the fact that thebla
is dependent on a blockparam that's not available at the other use's site.
subsume
is a bandaid over that but really we should figure out whyselect true, x, bla
doesn't have higher cost thanx
. Bothv12
andv4
have saturated their cost; I didn't get a chance to look into why.Alternately, in a universe where we have use-site-specific costs, we could push the cost of blockparams that are not available (not in a dominating block) to infinity. But we've avoided use-site-specific costs so far because that removes the ability to have the nice linear-time dynamic programming pass to compute them...
(Why hasn't this come up before? The
x
->func(x, other_stuff)
rewrite is a violation of the correctness condition for rewrites that we have inopts/README.md
-- "using the same inputs as the original, or fewer" -- not in the sense that the original rule is written that way, but in the sense that equivalences are bidirectional and so reading the rule backwards produces that outcome. The cost function should ordinarily prevent the rewrite from being used "backwards" and I suspect our "infinity" encoding/scheme has some issues...)
alexcrichton commented on issue #7857:
@elliottt, @fitzgen, and I have been debugging this quite a bit today. Sorry haven't fully caught up on the latest comments. Our findings are:
- It's possible to leverage this bug to elaborate effectful instructions, such as function calls, where they shouldn't be. This is a sign that we should assert that during elaboration no effectful instructions should be added, only pure instructions.
- One issue here is that
compute_best_values
only runs one iteration going through the values. The input program here is constructed such that this one pass does not calculate an accurate cost of nodes such asv4
andv7
. This explains whyv12
is chosen to elaborate because all items in the eclass have near-infinite cost and this heuristic prefers larger nodes.- The "root" issue here feels that there's an eclass which contains
v4
andv12
. Thev12
expression relies on a value not in scope withv4
, meaning that it seems like there needs to be some sort of concept of "scope" with eclasses. We had no idea how this would be implemented and it sounds scary.@fitzgen is going to leave a further comment about the possibility of running
compute_best_values
to a fixpoint to get a more accurate cost.
Answering some questions reading over your comment now @cfallin
we should figure out why select true, x, bla doesn't have higher cost than x
We've concluded this is due to the numbers of values here. Due to the single pass in
compute_best_values
everything ends up with near-infinite cost. The values are out-of-order here due to the nan canonicalization pass running after CLIF construction in Wasmtime.Why hasn't this come up before?
My naive understanding at this point is that if all values are defined with values defined before it then we can correctly calculate cost and will deterministically pick
x
overf(x, other)
. This is not a great explanation though and there may be something deeper.
fitzgen commented on issue #7857:
One issue here is that
compute_best_values
only runs one iteration going through the values.I think we should do a fixed point here, and I know that is a scary thing to just drop in chat, but I think it is actually fine. The maximum number of iterations the fixed-point loop would take is equal to the longest chain of
vNN
-to-vMM
edges in the the dataflow graph whereNN > MM
. So in the worst case this isO(n^2)
, yes. But the Wasm frontend doesn't introduce any of these edges. Cranelift itself can introduce them during NaN canonicalization, but this will be chains that are only one or two links long. Therefore, with Wasmtime, Cranelift would _never_ do more than a handful of iterations. Other frontends would however have the responsibility of avoiding the worst case value numbering themselves.Additionally, this would allow us to remove the funky infinity-minus-one cost clamping as well, which would be a very nice simplification.
I think this might also fix some of the issues discussed in https://github.com/bytecodealliance/wasmtime/issues/6126.
cfallin commented on issue #7857:
The "root" issue here feels that there's an eclass which contains v4 and v12. The v12 expression relies on a value not in scope with v4, meaning that it seems like there needs to be some sort of concept of "scope" with eclasses. We had no idea how this would be implemented and it sounds scary.
This (two values in the eclass) is "right and good and normal" I think -- if
v12
derives fromv4
and we realize that the def ofv12
simplifies down tov12 := v4
, then there is no other reasonable way to represent this but to put them in the same eclass. Doing otherwise (e.g. having a "copy" operator or something of the sort) then makes each instance unique (and e.g. we couldn't collapsev14
andv15
inv12 := v4; v13 := v4; v14 = f(v12); v15 = f(v13)
). Basically, if something is truly equal to a blockparam then we need to represent it as such, or else all of the simplification stops prematurely.The key I think is the directionality, and for that we need to stick to the "correctness condition" we've stated for rewrite rules -- removing but not adding inputs -- in the extraction (picking nodes for each class) too. If one enode in an eclass depends on
v1
, and another depends onv1, v2
, we need to pick the former. This is because while we may have written the rule to "shrink" the input set, equivalence is bidirectional, so a broken extraction could actually use the bigger input set where we originally had the smaller.So all this leads me to come again to the cost function, and I agree @fitzgen's approach feels like the right one. Basically we need to (i) define costs correctly, so a rewrite rule that "shrinks" the input set also results in a node with lower cost -- this should already be the case, and is a separate bug if not -- and (ii) compute costs correctly according to their definition, which the fixpoint approach should do.
fitzgen edited a comment on issue #7857:
One issue here is that
compute_best_values
only runs one iteration going through the values.I think we should do a fixed point here, and I know that is a scary thing to just drop in chat, but I think it is actually fine. The maximum number of iterations the fixed-point loop would take is equal to the longest chain of
vNN
-to-vMM
edges in the the dataflow graph whereNN < MM
. So in the worst case this isO(n^2)
, yes. But the Wasm frontend doesn't introduce any of these edges. Cranelift itself can introduce them during NaN canonicalization, but this will be chains that are only one or two links long. Therefore, with Wasmtime, Cranelift would _never_ do more than a handful of iterations. Other frontends would however have the responsibility of avoiding the worst case value numbering themselves.Additionally, this would allow us to remove the funky infinity-minus-one cost clamping as well, which would be a very nice simplification.
I think this might also fix some of the issues discussed in https://github.com/bytecodealliance/wasmtime/issues/6126.
fitzgen commented on issue #7857:
I have a fix in https://github.com/bytecodealliance/wasmtime/pull/7859
alexcrichton commented on issue #7857:
(please feel free to respond to this after you're back @cfallin, no urgency on this of course)
If one enode in an eclass depends on v1, and another depends on v1, v2, we need to pick the former.
In the past I've been under the impression that one of the properties of eclasses/elaboration that we rely on is that correctness is guaranteed because no matter what we choose in an eclass it's guaranteed to have the same program. Put another way I was under the impression that various bits and pieces of code relied on the fact that we can choose anything in an eclass to elaborate. What you're saying I agree with, however, and if the above program is correctly putting everything in the same eclass then the I also agree with the consequence, the cost function is quite important.
Is there perhaps other bits and pieces of elaboration that need to be updated/rethought? Or is this perhaps a mistaken impression I've gotten from egraphs/etc? (I couldn't actually find docs in Cranelift itself saying anything like this searching for a moment).
fitzgen closed issue #7857:
Last night we got a fuzz bug. Everything below is relative to Wasmtime at 0d662c9d3dd1db8490e30124c9126bfa00a9a5c6.
This input:
(module (func (local f32) f32.const 100 f32.sqrt i32.const 0 if f32.const 100 f32.sqrt block i32.const 1 br_if 0 f32.const 0 local.set 0 end local.get 0 i32.const 1 select i32.reinterpret_f32 global.set 0 end i32.reinterpret_f32 global.set 0 ) (global (;0;) (mut i32) i32.const 0) )
will panic in regalloc
$ cargo run compile -C cache=n -W nan-canonicalization ./foo.wat ... thread '<unnamed>' panicked at cranelift/codegen/src/machinst/compile.rs:76:14: register allocation: SSA(VReg(vreg = 198, class = Float), Inst(33)) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This surfaces a validation error in CLIF earlier with validation enabled
$ cargo run compile -C cache=n -C cranelift-debug-verifier -W nan-canonicalization ./foo.wat ... v19 = f32const +NaN v4 = select v18, v19, v17 ; v19 = +NaN @0043 v12 = select v8, v4, v10 ; v8 = 1 ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; error: inst12 (v12 = select.f32 v8, v4, v10 ; v8 = 1): uses value arg from non-dominating block4 @0044 v13 = bitcast.i32 v12 @0049 store notrap aligned table v13, v0+80 @004b jump block1 block1: @004b return } ; 1 verifier error detected (see above). Compilation aborted.
This has been further reduced to this CLIF test case:
test optimize set enable_verifier=true set opt_level=speed target x86_64 function %foo(i64) { block0(v0: i64): v3 = f32const 0x1.900000p6 v17 = sqrt v3 v18 = fcmp ne v17, v17 v19 = f32const +NaN v4 = select v18, v19, v17 v5 = iconst.i32 0 brif v5, block2, block3 block2: v6 = f32const 0x1.900000p6 v20 = sqrt v6 v21 = fcmp ne v20, v20 v22 = f32const +NaN v7 = select v21, v22, v20 v8 = iconst.i32 1 v2 = f32const 0.0 brif v8, block4(v2), block5 block5: v9 = f32const 0.0 jump block4(v9) block4(v10: f32): v11 = iconst.i32 1 v12 = select.f32 v11, v7, v10 v13 = bitcast.i32 v12 store notrap aligned table v13, v0+80 return block3: v15 = bitcast.i32 v4 store notrap aligned table v15, v0+80 return }
which can be reproduced with:
$ cd cranelift && cargo run test ./foo.clif
I've been investigating this with @elliottt and @fitzgen in person for a bit now. So far we have concluded a few "fixes" can be applied:
- One fix is to renumber the original
v12
input tov1
in CLIF.- Another fix is to move
block3
to be beneathblock0
.- The final fix is to change this line to
(subsume x)
Naturally none of these are actual fixes but are symptoms of the "real" issue. We're still figuring things out at this time but I wanted to open this up.
Trevor and Nick are telling me as well that this is possibly related to https://github.com/bytecodealliance/wasmtime/issues/6126.
Last updated: Nov 22 2024 at 16:03 UTC