hungryzzz opened issue #9590:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 0.99sWasmEdge
: 1.06sFor
bad.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 6.57sWasmEdge
: 1.05sThe difference between the attached two cases is as follows: changing the stored value in line 8 from
1
to0
, which decreases Wasmtime performance by 5.5s but has no negative effect on WasmEdge.➜ diff diff good.wat bad.wat 8c8 < f64.const 0x1p+0 (;=1;) --- > f64.const 0x0p+0 (;=0;)
More observations & questions:
- The store operation is outside the loop, and the following instructions do not contain any load operations.
Wasmtime
compiles the loop conditions inbad.wasm
andgood.wasm
in different ways, but I don't understand why the changes of an instruction that seems unrelated to the loop can affect the compilation strategies to the loop.# part of machine code of bad.wasm generated by wasmtime ... jne 5c <wasm[0]::function[1]+0x5c> movdqu (%rsp),%xmm1 jmp 60 <wasm[0]::function[1]+0x60> ... je 7f <wasm[0]::function[1]+0x7f> movdqu %xmm1,(%rsp) jmp 46 <wasm[0]::function[1]+0x46> ... je a2 <wasm[0]::function[1]+0xa2> movdqu %xmm1,(%rsp) jmp 46 <wasm[0]::function[1]+0x46> ... # part of machine code of good.wasm generated by wasmtime ... je 5f <wasm[0]::function[1]+0x5f> ... jne 4f <wasm[0]::function[1]+0x4f> ... jne 4f <wasm[0]::function[1]+0x4f> ...
Versions and Environment
- Wasmtime version or commit: c8b136965
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
primoly commented on issue #9590:
Strangely, replacing that constant zero with an equivalent calculation fixes the issue. Cranelift should constant fold these instructions, so why do they affect machine code generation? :thinking:
Replace line 8 in
bad.wasm
with one of the following for equal performance withgood.wasm
:f64.const 0 f64.abs
f64.const 0 f64.const 0 f64.add
f64.const 42 f64.const 42 f64.sub
f64.const 777 f64.const 0 f64.mul
alexcrichton commented on issue #9590:
This is happening because a register is being spilled, but why it's being spilled I'm not sure. Changing from 1 to 0 changes codegen which changes register allocation which causes this. In that sense I think it's worth digging in to why the spill happened here in the 0 case because I'm not sure what's going on.
I've got this CLIF:
function u0:1(i64, i64) { sig0 = (i64, i64) fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v10 = iadd_imm v7, 16 store little heap v2, v10 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) block4(v14: i32, v24: i32, v34: f64): v37 = iconst.i32 0 brif v37, block6, block7(v34) block6: v38 = f64const 0.0 jump block7(v38) block7(v33: f64): v40 = band_imm.i32 v14, 0xffff v42 = icmp_imm ne v40, 1 v43 = iadd_imm.i32 v14, 1 brif v42, block4(v43, v24, v33), block5 block5: v46 = icmp_imm ne v24, 0xffff v48 = iadd_imm.i32 v24, 1 brif v46, block4(v43, v48, v33), block9 block9: trap user11 }
compiled with
cargo run compile out.clif --target x86_64 -D --set opt_level=speed
to generate:Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2
where I've annotated the spill/reload slots.
@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?
cfallin commented on issue #9590:
@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?
Maybe -- FWIW I have zero cycles to dedicate to this right now, so you'll need to dig into RA2 on this, unfortunately. Maybe at some later time I'll have the ability to make another performance push. Sorry!
hungryzzz commented on issue #9590:
compiled with
cargo run compile out.clif --target x86_64 -D --set opt_level=speed
to generate:
Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2
Hi @alexcrichton, I went through again the codes and the following is my heuristic thoughts about this bug:
Is it possible that the SPILL in line 18 is caused by the calling instruction since the
xmm1
is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18).However, in the
good.wasm
, it usesxmm0
to store the value, butxmm0
is not spilled before calling instruction even thoughxmm0
is also one of the caller-saved registers. Therefore, is it possible thatcranelift
uses different strategies forxmm0
andxmm1
before calling instruction?
hungryzzz edited a comment on issue #9590:
compiled with
cargo run compile out.clif --target x86_64 -D --set opt_level=speed
to generate:
Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2
Hi @alexcrichton, I went through again the codes and the following is my heuristic thoughts about this bug:
Is it possible that the SPILL in line 18 is caused by the calling instruction since the
xmm1
is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18)?However, in the
good.wasm
, it usesxmm0
to store the value, butxmm0
is not spilled before calling instruction even thoughxmm0
is also one of the caller-saved registers. Therefore, is it possible thatcranelift
uses different strategies forxmm0
andxmm1
before calling instruction?
hungryzzz edited a comment on issue #9590:
compiled with
cargo run compile out.clif --target x86_64 -D --set opt_level=speed
to generate:
Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2
Hi @alexcrichton, I went through again the codes and the following are my heuristic thoughts about this bug:
Is it possible that the SPILL in line 18 is caused by the calling instruction since the
xmm1
is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18)?However, in the
good.wasm
, it usesxmm0
to store the value, butxmm0
is not spilled before calling instruction even thoughxmm0
is also one of the caller-saved registers. Therefore, is it possible thatcranelift
uses different strategies forxmm0
andxmm1
before calling instruction?
primoly commented on issue #9590:
In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original
bad.wasm
case it had to use different ones for 0 and 1, but you can get cranelift to use two different ones even when they are both zero, with for example the(f64.abs (f64.const 0))
hack i’ve shown above.Generated CLIF after optimisations of @alexcrichton’s example (spills):
(Note how it uses only
v2
for the zero constant)function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v2, v10 ; v2 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v55 = iconst.i32 0 brif v55, block6, block7(v34) ; v55 = 0 block6: v56 = f64const 0.0 jump block7(v56) ; v56 = 0.0 block7(v33: f64): v57 = iconst.i32 0xffff v58 = band.i32 v14, v57 ; v57 = 0xffff v59 = iconst.i32 1 v60 = icmp ne v58, v59 ; v59 = 1 v61 = iadd.i32 v14, v59 ; v59 = 1 brif v60, block4(v61, v24, v33), block5 block5: v62 = iconst.i32 0xffff v63 = icmp.i32 ne v24, v62 ; v62 = 0xffff v64 = iconst.i32 1 v65 = iadd.i32 v24, v64 ; v64 = 1 brif v63, block4(v61, v65, v33), block9 block9: trap user11 }
Example that does not spill:
function u0:1(i64, i64) { sig0 = (i64, i64) fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v6 = fabs.f64 v2 v7 = load.i64 notrap aligned readonly checked v0+120 v10 = iadd_imm v7, 16 store little heap v6, v10 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) block4(v14: i32, v24: i32, v34: f64): v37 = iconst.i32 0 brif v37, block6, block7(v34) block6: v38 = f64const 0.0 jump block7(v38) block7(v33: f64): v40 = band_imm.i32 v14, 0xffff v42 = icmp_imm ne v40, 1 v43 = iadd_imm.i32 v14, 1 brif v42, block4(v43, v24, v33), block5 block5: v46 = icmp_imm ne v24, 0xffff v48 = iadd_imm.i32 v24, 1 brif v46, block4(v43, v48, v33), block9 block9: trap user11 }
</details>
Generated CLIF after optimisations of the above example:
(Note how it uses
v2
andv55
for the zero constant)function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v55 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v55, v10 ; v55 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v2 = f64const 0.0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v56 = iconst.i32 0 brif v56, block6, block7(v34) ; v56 = 0 block6: v57 = f64const 0.0 jump block7(v57) ; v57 = 0.0 block7(v33: f64): v58 = iconst.i32 0xffff v59 = band.i32 v14, v58 ; v58 = 0xffff v60 = iconst.i32 1 v61 = icmp ne v59, v60 ; v60 = 1 v62 = iadd.i32 v14, v60 ; v60 = 1 brif v61, block4(v62, v24, v33), block5 block5: v63 = iconst.i32 0xffff v64 = icmp.i32 ne v24, v63 ; v63 = 0xffff v65 = iconst.i32 1 v66 = iadd.i32 v24, v65 ; v65 = 1 brif v64, block4(v62, v66, v33), block9 block9: trap user11 }
primoly edited a comment on issue #9590:
In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original
good.wasm
case it had to use different ones for 0 and 1, but you can get cranelift to use two different ones even when they are both zero, with for example the(f64.abs (f64.const 0))
hack i’ve shown above.Generated CLIF after optimisations of @alexcrichton’s example (spills):
(Note how it uses only
v2
for the zero constant)function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v2, v10 ; v2 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v55 = iconst.i32 0 brif v55, block6, block7(v34) ; v55 = 0 block6: v56 = f64const 0.0 jump block7(v56) ; v56 = 0.0 block7(v33: f64): v57 = iconst.i32 0xffff v58 = band.i32 v14, v57 ; v57 = 0xffff v59 = iconst.i32 1 v60 = icmp ne v58, v59 ; v59 = 1 v61 = iadd.i32 v14, v59 ; v59 = 1 brif v60, block4(v61, v24, v33), block5 block5: v62 = iconst.i32 0xffff v63 = icmp.i32 ne v24, v62 ; v62 = 0xffff v64 = iconst.i32 1 v65 = iadd.i32 v24, v64 ; v64 = 1 brif v63, block4(v61, v65, v33), block9 block9: trap user11 }
Example that does not spill:
function u0:1(i64, i64) { sig0 = (i64, i64) fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v6 = fabs.f64 v2 v7 = load.i64 notrap aligned readonly checked v0+120 v10 = iadd_imm v7, 16 store little heap v6, v10 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) block4(v14: i32, v24: i32, v34: f64): v37 = iconst.i32 0 brif v37, block6, block7(v34) block6: v38 = f64const 0.0 jump block7(v38) block7(v33: f64): v40 = band_imm.i32 v14, 0xffff v42 = icmp_imm ne v40, 1 v43 = iadd_imm.i32 v14, 1 brif v42, block4(v43, v24, v33), block5 block5: v46 = icmp_imm ne v24, 0xffff v48 = iadd_imm.i32 v24, 1 brif v46, block4(v43, v48, v33), block9 block9: trap user11 }
</details>
Generated CLIF after optimisations of the above example:
(Note how it uses
v2
andv55
for the zero constant)function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v55 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v55, v10 ; v55 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v2 = f64const 0.0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v56 = iconst.i32 0 brif v56, block6, block7(v34) ; v56 = 0 block6: v57 = f64const 0.0 jump block7(v57) ; v57 = 0.0 block7(v33: f64): v58 = iconst.i32 0xffff v59 = band.i32 v14, v58 ; v58 = 0xffff v60 = iconst.i32 1 v61 = icmp ne v59, v60 ; v60 = 1 v62 = iadd.i32 v14, v60 ; v60 = 1 brif v61, block4(v62, v24, v33), block5 block5: v63 = iconst.i32 0xffff v64 = icmp.i32 ne v24, v63 ; v63 = 0xffff v65 = iconst.i32 1 v66 = iadd.i32 v24, v65 ; v65 = 1 brif v64, block4(v62, v66, v33), block9 block9: trap user11 }
primoly edited a comment on issue #9590:
In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original
good.wasm
case it had to use different values for0.0
and1.0
, but you can get Cranelift to use two separate ones even when they are both0.0
, with for example the(f64.abs (f64.const 0))
hack i’ve shown above.Generated CLIF after optimisations of @alexcrichton’s example (spills)
Note how it uses only
v2
for the0.0
constant inblock0
. Thisv2
is then used twice, in thestore
as well as thejump
.function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v2, v10 ; v2 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v55 = iconst.i32 0 brif v55, block6, block7(v34) ; v55 = 0 block6: v56 = f64const 0.0 jump block7(v56) ; v56 = 0.0 block7(v33: f64): v57 = iconst.i32 0xffff v58 = band.i32 v14, v57 ; v57 = 0xffff v59 = iconst.i32 1 v60 = icmp ne v58, v59 ; v59 = 1 v61 = iadd.i32 v14, v59 ; v59 = 1 brif v60, block4(v61, v24, v33), block5 block5: v62 = iconst.i32 0xffff v63 = icmp.i32 ne v24, v62 ; v62 = 0xffff v64 = iconst.i32 1 v65 = iadd.i32 v24, v64 ; v64 = 1 brif v63, block4(v61, v65, v33), block9 block9: trap user11 }
Example that does not spill
Note the additional
v6
that will get constant folded.function u0:1(i64, i64) { sig0 = (i64, i64) fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v2 = f64const 0.0 v6 = fabs.f64 v2 v7 = load.i64 notrap aligned readonly checked v0+120 v10 = iadd_imm v7, 16 store little heap v6, v10 call fn0(v0, v0) v3 = iconst.i32 0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) block4(v14: i32, v24: i32, v34: f64): v37 = iconst.i32 0 brif v37, block6, block7(v34) block6: v38 = f64const 0.0 jump block7(v38) block7(v33: f64): v40 = band_imm.i32 v14, 0xffff v42 = icmp_imm ne v40, 1 v43 = iadd_imm.i32 v14, 1 brif v42, block4(v43, v24, v33), block5 block5: v46 = icmp_imm ne v24, 0xffff v48 = iadd_imm.i32 v24, 1 brif v46, block4(v43, v48, v33), block9 block9: trap user11 }
</details>
Generated CLIF after optimisations of the non-spilling example above
Note how it uses
v2
andv55
for the0.0
constant inblock0
. Thestore
usesv55
while thejump
usesv2
.function u0:1(i64, i64) fast { sig0 = (i64, i64) fast fn0 = colocated u0:2 sig0 block0(v0: i64, v1: i64): v55 = f64const 0.0 v7 = load.i64 notrap aligned readonly checked v0+120 v49 = iconst.i64 16 v10 = iadd v7, v49 ; v49 = 16 store little heap v55, v10 ; v55 = 0.0 call fn0(v0, v0) v3 = iconst.i32 0 v2 = f64const 0.0 v17 = iconst.i32 0xffff v15 = iconst.i32 1 jump block4(v3, v3, v2) ; v3 = 0, v3 = 0, v2 = 0.0 block4(v14: i32, v24: i32, v34: f64): v56 = iconst.i32 0 brif v56, block6, block7(v34) ; v56 = 0 block6: v57 = f64const 0.0 jump block7(v57) ; v57 = 0.0 block7(v33: f64): v58 = iconst.i32 0xffff v59 = band.i32 v14, v58 ; v58 = 0xffff v60 = iconst.i32 1 v61 = icmp ne v59, v60 ; v60 = 1 v62 = iadd.i32 v14, v60 ; v60 = 1 brif v61, block4(v62, v24, v33), block5 block5: v63 = iconst.i32 0xffff v64 = icmp.i32 ne v24, v63 ; v63 = 0xffff v65 = iconst.i32 1 v66 = iadd.i32 v24, v65 ; v65 = 1 brif v64, block4(v62, v66, v33), block9 block9: trap user11 }
hungryzzz commented on issue #9590:
Hi @alexcrichton @primoly, I also find that if I change the
local.get 0
(line 42) in thebad.wasm
tof64.const 0x0p+0
, the performance will also be back to normal.In the loop, the local variable 0 will be set in a branch (line 16), and obtained outside this loop (line 42). is it possible this dependency affects some analysis?
Here are the differences between their machine codes:
0000000000000000 <wasm[0]::function[1]>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 57 08 mov 0x8(%rdi),%r10 8: 4d 8b 12 mov (%r10),%r10 b: 49 83 c2 30 add $0x30,%r10 f: 49 39 e2 cmp %rsp,%r10 12: 0f 87 9b 00 00 00 ja b3 <wasm[0]::function[1]+0xb3> 18: 48 83 ec 20 sub $0x20,%rsp 1c: 48 89 5c 24 10 mov %rbx,0x10(%rsp) 21: c5 d9 57 cc vxorpd %xmm4,%xmm4,%xmm1 25: 48 8b 47 78 mov 0x78(%rdi),%rax 29: 48 89 fe mov %rdi,%rsi 2c: c5 fb 11 48 10 vmovsd %xmm1,0x10(%rax) ;; diff 31: 48 89 f3 mov %rsi,%rbx 34: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) 39: 48 89 df mov %rbx,%rdi 3c: e8 7f 00 00 00 call c0 <wasm[0]::function[2]> 41: 31 ff xor %edi,%edi 43: 49 89 fa mov %rdi,%r10 46: 45 31 c9 xor %r9d,%r9d 49: 45 85 c9 test %r9d,%r9d 4c: 0f 85 0a 00 00 00 jne 5c <wasm[0]::function[1]+0x5c> ;; diff 52: f3 0f 6f 0c 24 movdqu (%rsp),%xmm1 ;; diff 57: e9 04 00 00 00 jmp 60 <wasm[0]::function[1]+0x60> ;; diff 5c: c5 e9 57 ca vxorpd %xmm2,%xmm2,%xmm1 ;; diff 60: 48 89 fe mov %rdi,%rsi 63: 81 e6 ff ff 00 00 and $0xffff,%esi 69: 83 c7 01 add $0x1,%edi 6c: 83 fe 01 cmp $0x1,%esi 6f: 0f 84 0a 00 00 00 je 7f <wasm[0]::function[1]+0x7f> ;; diff 75: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) ;; diff 7a: e9 c7 ff ff ff jmp 46 <wasm[0]::function[1]+0x46> ;; diff 7f: 4c 89 d0 mov %r10,%rax 82: 81 e0 ff ff 00 00 and $0xffff,%eax 88: 41 83 c2 01 add $0x1,%r10d 8c: 81 f8 ff ff 00 00 cmp $0xffff,%eax 92: 0f 84 0a 00 00 00 je a2 <wasm[0]::function[1]+0xa2> ;; diff 98: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) ;; diff 9d: e9 a4 ff ff ff jmp 46 <wasm[0]::function[1]+0x46> ;; diff a2: 48 8b 43 58 mov 0x58(%rbx),%rax a6: 48 8b 7b 68 mov 0x68(%rbx),%rdi aa: 31 d2 xor %edx,%edx ac: 48 89 de mov %rbx,%rsi af: ff d0 call *%rax
hungryzzz edited a comment on issue #9590:
Hi @alexcrichton @primoly, I also find that if I change the
local.get 0
(line 42) in thebad.wasm
tof64.const 0x0p+0
, the performance will also be back to normal. It seems that I can understand something about the register spills and reload.Observations about the register spills and reload
- The
local.set 0
instruction is in a branch within a double loop. (4c
-5f
in machine code) Besides, there is alocal.get 0
outside the loop (even though it will be dropped right away.)- I guess that Wasmtime will store the value of local[0] in address
%rsp
, and usexmm1
register to store the new value of local[0].- When the branch is true,
xmm1
is set to a new value (5c
); while when the branch is false,xmm1
is set to the current value of local[0]. (address52
) This explains the register reload instruction in address52
.- In every iteration, the new value of
xmm1
will be stored back to the address%rsp
(address75
and98
), this explains the register spilling instructions in address75
and98
.- Even though I understand why the register spills and reloads happen in
bad.wasm
, I still do not know why changing thef64.const 0x0p+0
in line 8 fromf64.const 0x1p+0
prevents this situation from happening.Here are the differences between their machine codes:
0000000000000000 <wasm[0]::function[1]>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 57 08 mov 0x8(%rdi),%r10 8: 4d 8b 12 mov (%r10),%r10 b: 49 83 c2 30 add $0x30,%r10 f: 49 39 e2 cmp %rsp,%r10 12: 0f 87 9b 00 00 00 ja b3 <wasm[0]::function[1]+0xb3> 18: 48 83 ec 20 sub $0x20,%rsp 1c: 48 89 5c 24 10 mov %rbx,0x10(%rsp) 21: c5 d9 57 cc vxorpd %xmm4,%xmm4,%xmm1 25: 48 8b 47 78 mov 0x78(%rdi),%rax 29: 48 89 fe mov %rdi,%rsi 2c: c5 fb 11 48 10 vmovsd %xmm1,0x10(%rax) ;; diff 31: 48 89 f3 mov %rsi,%rbx 34: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) 39: 48 89 df mov %rbx,%rdi 3c: e8 7f 00 00 00 call c0 <wasm[0]::function[2]> 41: 31 ff xor %edi,%edi 43: 49 89 fa mov %rdi,%r10 46: 45 31 c9 xor %r9d,%r9d 49: 45 85 c9 test %r9d,%r9d 4c: 0f 85 0a 00 00 00 jne 5c <wasm[0]::function[1]+0x5c> ;; diff 52: f3 0f 6f 0c 24 movdqu (%rsp),%xmm1 ;; diff 57: e9 04 00 00 00 jmp 60 <wasm[0]::function[1]+0x60> ;; diff 5c: c5 e9 57 ca vxorpd %xmm2,%xmm2,%xmm1 ;; diff 60: 48 89 fe mov %rdi,%rsi 63: 81 e6 ff ff 00 00 and $0xffff,%esi 69: 83 c7 01 add $0x1,%edi 6c: 83 fe 01 cmp $0x1,%esi 6f: 0f 84 0a 00 00 00 je 7f <wasm[0]::function[1]+0x7f> ;; diff 75: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) ;; diff 7a: e9 c7 ff ff ff jmp 46 <wasm[0]::function[1]+0x46> ;; diff 7f: 4c 89 d0 mov %r10,%rax 82: 81 e0 ff ff 00 00 and $0xffff,%eax 88: 41 83 c2 01 add $0x1,%r10d 8c: 81 f8 ff ff 00 00 cmp $0xffff,%eax 92: 0f 84 0a 00 00 00 je a2 <wasm[0]::function[1]+0xa2> ;; diff 98: f3 0f 7f 0c 24 movdqu %xmm1,(%rsp) ;; diff 9d: e9 a4 ff ff ff jmp 46 <wasm[0]::function[1]+0x46> ;; diff a2: 48 8b 43 58 mov 0x58(%rbx),%rax a6: 48 8b 7b 68 mov 0x68(%rbx),%rdi aa: 31 d2 xor %edx,%edx ac: 48 89 de mov %rbx,%rsi af: ff d0 call *%rax
alexcrichton commented on issue #9590:
@hungryzzz you might be right yeah! I would have naively expected live range splitting in regalloc2 though to kick in and make it such that the spill, if necessary, only happened around the call rather than through the loop. As for small differences, when it comes to regalloc heuristics AFAIK it's expected that small changes to the input can have subtle changes to the output. Especially with specific decisions like spilling here I'm not entirely surprised. That being said though I don't fully understand everything happening here so I can't really answer with certainty. (e.g. I don't know for certain why @primoly's changes resulted in different register allocation decisions)
primoly commented on issue #9590:
The reason for the spill is the call to another function which leads to Cranelift saving the value of the
xmm
register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the value of two zeros before the call and after as different values, it just writes a fresh0.0
into anotherxmm
register. Notice how the non-spilling cases always use two differentxmm
registers while the spilling ones only use one, matching the non-spilling CLIF usingv2
andv55
while the spilling CLIF uses onlyv2
.
primoly edited a comment on issue #9590:
The reason for the spill is the call to another function which leads to Cranelift saving the value of the
xmm
register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the two zeros before the call and after as different values, it just writes a fresh0.0
into anotherxmm
register. Notice how the non-spilling cases always use two differentxmm
registers while the spilling ones only use one, matching the non-spilling CLIF usingv55
before thecall fn0(v0, v0)
andv2
after while the spilling CLIF usesv2
before and after.
Last updated: Nov 22 2024 at 16:03 UTC