hungryzzz opened issue #9590:

Test Cases

Steps to Reproduce

Hi, I ran the attached two cases (good.wasm & bad.wasm) in Wasmtime and WasmEdge (AOT), and collected their execution times respectively (measured with the time tool).

```shell
# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm
```

Expected Results & Actual Results

For good.wasm, the execution times in the different runtimes are as follows:

- Wasmtime: 0.99s
- WasmEdge: 1.06s

For bad.wasm, the execution times in the different runtimes are as follows:

- Wasmtime: 6.57s
- WasmEdge: 1.05s

The only difference between the attached two cases is the stored value in line 8, changed from 1 to 0, which slows Wasmtime down by 5.5s but has no negative effect on WasmEdge.

```
➜ diff diff good.wat bad.wat
8c8
<     f64.const 0x1p+0 (;=1;)
---
>     f64.const 0x0p+0 (;=0;)
```

More observations & questions:

- The store operation is outside the loop, and the following instructions do not contain any load operations.
- Wasmtime compiles the loop conditions in bad.wasm and good.wasm in different ways, but I don't understand why changing an instruction that seems unrelated to the loop can affect how the loop is compiled.

```
# part of machine code of bad.wasm generated by wasmtime
...
jne 5c <wasm[0]::function[1]+0x5c>
movdqu (%rsp),%xmm1
jmp 60 <wasm[0]::function[1]+0x60>
...
je 7f <wasm[0]::function[1]+0x7f>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...
je a2 <wasm[0]::function[1]+0xa2>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...

# part of machine code of good.wasm generated by wasmtime
...
je 5f <wasm[0]::function[1]+0x5f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...
```

Versions and Environment

- Wasmtime version or commit: c8b136965
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
primoly commented on issue #9590:
Strangely, replacing that constant zero with an equivalent calculation fixes the issue. Cranelift should constant-fold these instructions, so why do they affect machine code generation? :thinking:

Replace line 8 in bad.wasm with one of the following for performance equal to good.wasm:

```wat
f64.const 0
f64.abs
```

```wat
f64.const 0
f64.const 0
f64.add
```

```wat
f64.const 42
f64.const 42
f64.sub
```

```wat
f64.const 777
f64.const 0
f64.mul
```
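The equivalences above are easy to check directly; here is a quick sketch in Rust that just exercises the IEEE-754 arithmetic (it only illustrates why these sequences are all foldable to the same constant, and is not Cranelift's folding code):

```rust
fn main() {
    // Each replacement sequence listed above folds to the same +0.0 as the
    // original `f64.const 0`, bit for bit:
    let variants = [
        (0.0f64).abs(),    // f64.const 0, f64.abs
        0.0f64 + 0.0f64,   // f64.const 0, f64.const 0, f64.add
        42.0f64 - 42.0f64, // f64.const 42, f64.const 42, f64.sub
        777.0f64 * 0.0f64, // f64.const 777, f64.const 0, f64.mul
    ];
    for v in variants {
        // All are exactly +0.0 (sign bit clear), i.e. the very same constant.
        assert_eq!(v.to_bits(), 0.0f64.to_bits());
    }
    println!("all variants fold to +0.0");
}
```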
alexcrichton commented on issue #9590:
This is happening because a register is being spilled, but why it's being spilled I'm not sure. Changing from 1 to 0 changes codegen which changes register allocation which causes this. In that sense I think it's worth digging in to why the spill happened here in the 0 case because I'm not sure what's going on.
I've got this CLIF:
```clif
function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v2, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}
```

compiled with `cargo run compile out.clif --target x86_64 -D --set opt_level=speed` to generate:

```
Disassembly of 128 bytes <u0:1>:
   0: 55                      pushq %rbp
   1: 48 89 e5                movq %rsp, %rbp
   4: 48 83 ec 10             subq $0x10, %rsp
   8: 66 0f 57 c9             xorpd %xmm1, %xmm1
   c: 48 8b 57 78             movq 0x78(%rdi), %rdx
  10: f2 0f 11 4a 10          movsd %xmm1, 0x10(%rdx)
  15: 48 89 fe                movq %rdi, %rsi
  18: f3 0f 7f 0c 24          movdqu %xmm1, (%rsp) ;; SPILL
  1d: e8 00 00 00 00          callq 0x22
  22: 45 31 d2                xorl %r10d, %r10d
  25: 4c 89 d6                movq %r10, %rsi
  28: 45 31 db                xorl %r11d, %r11d
  2b: 45 85 db                testl %r11d, %r11d
  2e: 0f 85 0a 00 00 00       jne 0x3e
  34: f3 0f 6f 0c 24          movdqu (%rsp), %xmm1 ;; RELOAD
  39: e9 04 00 00 00          jmp 0x42
  3e: 66 0f 57 c9             xorpd %xmm1, %xmm1
  42: 4c 89 d7                movq %r10, %rdi
  45: 81 e7 ff ff 00 00       andl $0xffff, %edi
  4b: 41 83 c2 01             addl $1, %r10d
  4f: 83 ff 01                cmpl $1, %edi
  52: 0f 84 0a 00 00 00       je 0x62
  58: f3 0f 7f 0c 24          movdqu %xmm1, (%rsp) ;; SPILL
  5d: e9 c6 ff ff ff          jmp 0x28
  62: 8d 7e 01                leal 1(%rsi), %edi
  65: 81 fe ff ff 00 00       cmpl $0xffff, %esi
  6b: 0f 84 0d 00 00 00       je 0x7e
  71: 48 89 fe                movq %rdi, %rsi
  74: f3 0f 7f 0c 24          movdqu %xmm1, (%rsp) ;; SPILL
  79: e9 aa ff ff ff          jmp 0x28
  7e: 0f 0b                   ud2
```

where I've annotated the spill/reload slots.
@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?
cfallin commented on issue #9590:
@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?
Maybe -- FWIW I have zero cycles to dedicate to this right now, so you'll need to dig into RA2 on this, unfortunately. Maybe at some later time I'll have the ability to make another performance push. Sorry!
hungryzzz commented on issue #9590:

Hi @alexcrichton, I went through the code again, and the following are my heuristic thoughts about this bug:

Is it possible that the SPILL at address 18 is caused by the call instruction, since xmm1 is one of the caller-saved registers? And are the following RELOAD (address 34) and SPILLs (addresses 58 & 74) a chain reaction from that first SPILL (address 18)?

However, in good.wasm, xmm0 is used to store the value, but xmm0 is not spilled before the call instruction even though xmm0 is also one of the caller-saved registers. Therefore, is it possible that Cranelift uses different strategies for xmm0 and xmm1 before a call instruction?
primoly commented on issue #9590:
In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original good.wasm case it had to use different values for 0.0 and 1.0, but you can get Cranelift to use two separate ones even when they are both 0.0, with for example the (f64.abs (f64.const 0)) hack I've shown above.

Generated CLIF after optimisations of @alexcrichton's example (spills):

Note how it uses only v2 for the 0.0 constant in block0. This v2 is then used twice, in the store as well as the jump.

```clif
function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v2, v10  ; v2 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v55 = iconst.i32 0
    brif v55, block6, block7(v34)  ; v55 = 0

block6:
    v56 = f64const 0.0
    jump block7(v56)  ; v56 = 0.0

block7(v33: f64):
    v57 = iconst.i32 0xffff
    v58 = band.i32 v14, v57  ; v57 = 0xffff
    v59 = iconst.i32 1
    v60 = icmp ne v58, v59  ; v59 = 1
    v61 = iadd.i32 v14, v59  ; v59 = 1
    brif v60, block4(v61, v24, v33), block5

block5:
    v62 = iconst.i32 0xffff
    v63 = icmp.i32 ne v24, v62  ; v62 = 0xffff
    v64 = iconst.i32 1
    v65 = iadd.i32 v24, v64  ; v64 = 1
    brif v63, block4(v61, v65, v33), block9

block9:
    trap user11
}
```

Example that does not spill:

Note the additional v6 that will get constant folded.

```clif
function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v6 = fabs.f64 v2
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v6, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}
```

Generated CLIF after optimisations of the non-spilling example above:

Note how it uses v2 and v55 for the 0.0 constant in block0. The store uses v55 while the jump uses v2.

```clif
function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v55 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v55, v10  ; v55 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v2 = f64const 0.0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v56 = iconst.i32 0
    brif v56, block6, block7(v34)  ; v56 = 0

block6:
    v57 = f64const 0.0
    jump block7(v57)  ; v57 = 0.0

block7(v33: f64):
    v58 = iconst.i32 0xffff
    v59 = band.i32 v14, v58  ; v58 = 0xffff
    v60 = iconst.i32 1
    v61 = icmp ne v59, v60  ; v60 = 1
    v62 = iadd.i32 v14, v60  ; v60 = 1
    brif v61, block4(v62, v24, v33), block5

block5:
    v63 = iconst.i32 0xffff
    v64 = icmp.i32 ne v24, v63  ; v63 = 0xffff
    v65 = iconst.i32 1
    v66 = iadd.i32 v24, v65  ; v65 = 1
    brif v64, block4(v62, v66, v33), block9

block9:
    trap user11
}
```
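The effect primoly describes can be sketched with a toy model of value numbering (a hypothetical illustration, not Cranelift's actual GVN): when two identical pure constants are deduplicated into one value, that single value becomes live across everything between its uses, including a call.

```rust
use std::collections::HashMap;

// Toy instruction stream: pure constants plus an opaque call.
#[derive(PartialEq, Eq, Hash, Debug)]
enum Inst {
    F64Const(u64), // constant, keyed by its bit pattern
    Call,          // never deduplicated
}

// For each instruction, return the index of the value it resolves to after
// deduplicating identical pure instructions (first occurrence wins).
fn gvn(insts: &[Inst]) -> Vec<usize> {
    let mut seen: HashMap<&Inst, usize> = HashMap::new();
    let mut value_of = Vec::new();
    for (i, inst) in insts.iter().enumerate() {
        match inst {
            Inst::Call => value_of.push(i),
            pure => value_of.push(*seen.entry(pure).or_insert(i)),
        }
    }
    value_of
}

fn main() {
    // Two identical 0.0 constants separated by a call, as in the spilling CLIF.
    let insts = [Inst::F64Const(0), Inst::Call, Inst::F64Const(0)];
    let values = gvn(&insts);
    // Both constants collapse to the value defined at index 0, so that one
    // value is live across the call at index 1 -- and a caller-saved register
    // holding it must be spilled.
    assert_eq!(values, vec![0, 1, 0]);
    println!("{:?}", values);
}
```

With two distinct values (as the f64.abs hack produces), nothing is live across the call, so no spill is needed.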
hungryzzz edited a comment on issue #9590:
Hi @alexcrichton @primoly, I also find that if I change the local.get 0 (line 42) in bad.wasm to f64.const 0x0p+0, performance also returns to normal. It seems that I can now understand something about the register spills and reloads.

Observations about the register spills and reloads:

- The local.set 0 instruction is in a branch within a doubly nested loop (4c-5f in the machine code). Besides, there is a local.get 0 outside the loop (even though its result is dropped right away).
- I guess that Wasmtime stores the value of local[0] at address %rsp, and uses the xmm1 register to hold the new value of local[0].
- When the branch is taken, xmm1 is set to a new value (5c); when it is not taken, xmm1 is set to the current value of local[0] (address 52). This explains the register reload at address 52.
- In every iteration, the new value of xmm1 is stored back to address %rsp (addresses 75 and 98); this explains the register spills at addresses 75 and 98.
- Even though I understand why the register spills and reloads happen in bad.wasm, I still do not know why changing the f64.const 0x0p+0 in line 8 back to f64.const 0x1p+0 prevents this from happening.

Here are the differences between their machine codes:

```
0000000000000000 <wasm[0]::function[1]>:
   0: 55                      push %rbp
   1: 48 89 e5                mov %rsp,%rbp
   4: 4c 8b 57 08             mov 0x8(%rdi),%r10
   8: 4d 8b 12                mov (%r10),%r10
   b: 49 83 c2 30             add $0x30,%r10
   f: 49 39 e2                cmp %rsp,%r10
  12: 0f 87 9b 00 00 00       ja b3 <wasm[0]::function[1]+0xb3>
  18: 48 83 ec 20             sub $0x20,%rsp
  1c: 48 89 5c 24 10          mov %rbx,0x10(%rsp)
  21: c5 d9 57 cc             vxorpd %xmm4,%xmm4,%xmm1
  25: 48 8b 47 78             mov 0x78(%rdi),%rax
  29: 48 89 fe                mov %rdi,%rsi
  2c: c5 fb 11 48 10          vmovsd %xmm1,0x10(%rax) ;; diff
  31: 48 89 f3                mov %rsi,%rbx
  34: f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)
  39: 48 89 df                mov %rbx,%rdi
  3c: e8 7f 00 00 00          call c0 <wasm[0]::function[2]>
  41: 31 ff                   xor %edi,%edi
  43: 49 89 fa                mov %rdi,%r10
  46: 45 31 c9                xor %r9d,%r9d
  49: 45 85 c9                test %r9d,%r9d
  4c: 0f 85 0a 00 00 00       jne 5c <wasm[0]::function[1]+0x5c> ;; diff
  52: f3 0f 6f 0c 24          movdqu (%rsp),%xmm1 ;; diff
  57: e9 04 00 00 00          jmp 60 <wasm[0]::function[1]+0x60> ;; diff
  5c: c5 e9 57 ca             vxorpd %xmm2,%xmm2,%xmm1 ;; diff
  60: 48 89 fe                mov %rdi,%rsi
  63: 81 e6 ff ff 00 00       and $0xffff,%esi
  69: 83 c7 01                add $0x1,%edi
  6c: 83 fe 01                cmp $0x1,%esi
  6f: 0f 84 0a 00 00 00       je 7f <wasm[0]::function[1]+0x7f> ;; diff
  75: f3 0f 7f 0c 24          movdqu %xmm1,(%rsp) ;; diff
  7a: e9 c7 ff ff ff          jmp 46 <wasm[0]::function[1]+0x46> ;; diff
  7f: 4c 89 d0                mov %r10,%rax
  82: 81 e0 ff ff 00 00       and $0xffff,%eax
  88: 41 83 c2 01             add $0x1,%r10d
  8c: 81 f8 ff ff 00 00       cmp $0xffff,%eax
  92: 0f 84 0a 00 00 00       je a2 <wasm[0]::function[1]+0xa2> ;; diff
  98: f3 0f 7f 0c 24          movdqu %xmm1,(%rsp) ;; diff
  9d: e9 a4 ff ff ff          jmp 46 <wasm[0]::function[1]+0x46> ;; diff
  a2: 48 8b 43 58             mov 0x58(%rbx),%rax
  a6: 48 8b 7b 68             mov 0x68(%rbx),%rdi
  aa: 31 d2                   xor %edx,%edx
  ac: 48 89 de                mov %rbx,%rsi
  af: ff d0                   call *%rax
```
alexcrichton commented on issue #9590:
@hungryzzz you might be right yeah! I would have naively expected live range splitting in regalloc2 though to kick in and make it such that the spill, if necessary, only happened around the call rather than through the loop. As for small differences, when it comes to regalloc heuristics AFAIK it's expected that small changes to the input can have subtle changes to the output. Especially with specific decisions like spilling here I'm not entirely surprised. That being said though I don't fully understand everything happening here so I can't really answer with certainty. (e.g. I don't know for certain why @primoly's changes resulted in different register allocation decisions)
primoly commented on issue #9590:
The reason for the spill is the call to another function, which leads to Cranelift saving the value of the xmm register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the two zeros before and after the call as different values, it just writes a fresh 0.0 into another xmm register. Notice how the non-spilling cases always use two different xmm registers while the spilling ones use only one, matching the non-spilling CLIF using v55 before the call fn0(v0, v0) and v2 after, while the spilling CLIF uses v2 both before and after.
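The trade-off described above can be sketched abstractly. The following is a hypothetical model, not Cranelift code: a value that is live across a call must either be spilled to the stack and reloaded, or, when it is a known constant, simply rematerialized after the call with no stack traffic.

```rust
// Model the register file as 16 xmm slots; a call may clobber all of them,
// since every xmm register is caller-saved in the System V ABI.
fn clobbering_call(regs: &mut [Option<f64>]) {
    regs.iter_mut().for_each(|r| *r = None);
}

// Strategy 1 (what the bad.wasm codegen does): spill before, reload after.
fn spill_and_reload(regs: &mut [Option<f64>], val: f64) -> f64 {
    let stack_slot = val;       // movdqu %xmm1, (%rsp)  ;; SPILL
    clobbering_call(regs);
    regs[1] = Some(stack_slot); // movdqu (%rsp), %xmm1  ;; RELOAD
    regs[1].unwrap()
}

// Strategy 2 (what the good.wasm codegen does): the value is a known
// constant, so just recreate it after the call.
fn rematerialize(regs: &mut [Option<f64>]) -> f64 {
    clobbering_call(regs);
    regs[0] = Some(0.0);        // xorpd %xmm0, %xmm0
    regs[0].unwrap()
}

fn main() {
    let mut regs = vec![None; 16];
    // Both strategies yield the same value; only the first touches the stack.
    assert_eq!(spill_and_reload(&mut regs, 0.0), rematerialize(&mut regs));
}
```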
Last updated: Dec 13 2025 at 21:03 UTC