alexcrichton opened issue #7283:
Given this input CLIF:
function %foo(i64, i64) { block0(v0: i64, v1: i64): ;; Create a loop-invariant value `v10` which is some operation which ;; includes a constant somewhere. v8 = load.f64 v0+100 v9 = f64const 0x1.0000000000000p1 v10 = fdiv v8, v9 ;; jump to the loop v3 = iconst.i64 0 jump block2(v3) ; v3 = 0 block2(v11: i64): ;; store the loop-invariant `v10` to memory "somewhere" v15 = iadd v0, v11 store.f64 v10, v15 ;; loop breakout condition v17 = iadd_imm v11, 1 v19 = icmp_imm ne v17, 100 brif v19, block2(v17), block1 block1: return }
this will currently optimize as:
$ cargo run compile -p ./foo.clif --target aarch64 --set opt_level=speed function %foo(i64, i64) fast { block0(v0: i64, v1: i64): v8 = load.f64 v0+100 v3 = iconst.i64 0 jump block2(v3) ; v3 = 0 block2(v11: i64): v9 = f64const 0x1.0000000000000p1 v10 = fdiv.f64 v8, v9 ; v9 = 0x1.0000000000000p1 v15 = iadd.i64 v0, v11 store v10, v15 v20 = iconst.i64 1 v17 = iadd v11, v20 ; v20 = 1 v21 = iconst.i64 100 v19 = icmp ne v17, v21 ; v21 = 100 brif v19, block2(v17), block1 block1: return }
This notably sinks the
fdiv
operation, which is loop invariant and originally outside of the loop, into the loop.After debugging with @elliottt the reason this seems to be happening is that during elaboration to elaborate the
fdiv
instruction it first elaborates the inputs to thefdiv
instruction. When doing so one of the inputs is anf64const
which is rematerializable, meaning that the constant value is materialized inside of the loop. Next when elaboratingfdiv
it sees that one of its inputs is inside of the loop, so it concludes that thefdiv
must also itself be inside of the loop. Put another way the rematerialization of the constant argument additionally forces thefdiv
to go inside of the loop. This hypothesis was tested by commenting out these lines which avoided putting thefdiv
into the loop. This isn't a complete fix, however, because rematerialized integer constants would still cause integer operations to be sunk into loops erroneously.I remember first noticing this behavior in an older issue about false dependencies where a division operation was sunk into a loop, although the performance regression in that issue was due to something else. @elliottt's and my investigation of https://github.com/bytecodealliance/wasmtime/issues/7246 however turned up that this sinking is the cause of the slowdown there. The
vdivsd
instruction is present in the loop when in the original wasm thef64.div
was not present in the loop. The above commenting ofremat.isle
rules improves the runtime performance of that test case to be as expected.cc @elliottt and @cfallin as y'all are likely interested in this.
cfallin commented on issue #7283:
This is fascinating -- thank you both for debugging this!
I agree that this "reverse LICM" should be, well, reversed. (Though I'm tempted to argue that "loop invariant code motion" doesn't specify which way the code should move :sweat_smile: ) Basically the remat should not affect code placement decisions at all.
It seems to me that there should be a relatively small change possible to the remat logic. The part that perplexes me a little bit right now, without going through a tracelog, is why we're remat'ing into
block2
at all, if there are no other uses ofv9
. In any case I should be able to look at this more later today and hopefully fix it -- thank you for the very helpful minimal example!
cfallin closed issue #7283:
Given this input CLIF:
function %foo(i64, i64) { block0(v0: i64, v1: i64): ;; Create a loop-invariant value `v10` which is some operation which ;; includes a constant somewhere. v8 = load.f64 v0+100 v9 = f64const 0x1.0000000000000p1 v10 = fdiv v8, v9 ;; jump to the loop v3 = iconst.i64 0 jump block2(v3) ; v3 = 0 block2(v11: i64): ;; store the loop-invariant `v10` to memory "somewhere" v15 = iadd v0, v11 store.f64 v10, v15 ;; loop breakout condition v17 = iadd_imm v11, 1 v19 = icmp_imm ne v17, 100 brif v19, block2(v17), block1 block1: return }
this will currently optimize as:
$ cargo run compile -p ./foo.clif --target aarch64 --set opt_level=speed function %foo(i64, i64) fast { block0(v0: i64, v1: i64): v8 = load.f64 v0+100 v3 = iconst.i64 0 jump block2(v3) ; v3 = 0 block2(v11: i64): v9 = f64const 0x1.0000000000000p1 v10 = fdiv.f64 v8, v9 ; v9 = 0x1.0000000000000p1 v15 = iadd.i64 v0, v11 store v10, v15 v20 = iconst.i64 1 v17 = iadd v11, v20 ; v20 = 1 v21 = iconst.i64 100 v19 = icmp ne v17, v21 ; v21 = 100 brif v19, block2(v17), block1 block1: return }
This notably sinks the
fdiv
operation, which is loop invariant and originally outside of the loop, into the loop.After debugging with @elliottt the reason this seems to be happening is that during elaboration to elaborate the
fdiv
instruction it first elaborates the inputs to thefdiv
instruction. When doing so one of the inputs is anf64const
which is rematerializable, meaning that the constant value is materialized inside of the loop. Next when elaboratingfdiv
it sees that one of its inputs is inside of the loop, so it concludes that thefdiv
must also itself be inside of the loop. Put another way the rematerialization of the constant argument additionally forces thefdiv
to go inside of the loop. This hypothesis was tested by commenting out these lines which avoided putting thefdiv
into the loop. This isn't a complete fix, however, because rematerialized integer constants would still cause integer operations to be sunk into loops erroneously.I remember first noticing this behavior in an older issue about false dependencies where a division operation was sunk into a loop, although the performance regression in that issue was due to something else. @elliottt's and my investigation of https://github.com/bytecodealliance/wasmtime/issues/7246 however turned up that this sinking is the cause of the slowdown there. The
vdivsd
instruction is present in the loop when in the original wasm thef64.div
was not present in the loop. The above commenting ofremat.isle
rules improves the runtime performance of that test case to be as expected.cc @elliottt and @cfallin as y'all are likely interested in this.
Last updated: Jan 24 2025 at 00:11 UTC