wasmtime / issue #9590 Changing a stored value from 1 to ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #9590 Changing a stored value from 1 to ...

Wasmtime GitHub notifications bot (Nov 10 2024 at 07:35):

hungryzzz opened issue #9590:

Test Cases

case.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm & bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).
# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

Wasmtime: 0.99s

WasmEdge: 1.06s

For bad.wasm, the execution time in different runtimes are as follows:

Wasmtime: 6.57s

WasmEdge: 1.05s

The difference between the attached two cases is as follows: changing the stored value in line 8 from 1 to 0, which decreases Wasmtime performance by 5.5s but has no negative effect on WasmEdge.
➜  diff diff good.wat bad.wat
8c8
<     f64.const 0x1p+0 (;=1;)
---
>     f64.const 0x0p+0 (;=0;)
More observations & questions:

The store operation is outside the loop, and the following instructions do not contain any load operations.

Wasmtime compiles the loop conditions in bad.wasm and good.wasm in different ways, but I don't understand why the changes of an instruction that seems unrelated to the loop can affect the compilation strategies to the loop.
# part of machine code of bad.wasm generated by wasmtime
...
jne 5c <wasm[0]::function[1]+0x5c>
movdqu (%rsp),%xmm1
jmp 60 <wasm[0]::function[1]+0x60>
...
je 7f <wasm[0]::function[1]+0x7f>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...
je a2 <wasm[0]::function[1]+0xa2>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...


# part of machine code of good.wasm generated by wasmtime
...
je 5f <wasm[0]::function[1]+0x5f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...
Versions and Environment

Wasmtime version or commit: c8b136965

Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic

Architecture: Intel(R) Core(TM) i5-13500

Wasmtime GitHub notifications bot (Nov 10 2024 at 12:15):

primoly commented on issue #9590:

Strangely, replacing that constant zero with an equivalent calculation fixes the issue. Cranelift should constant fold these instructions, so why do they affect machine code generation? :thinking:

Replace line 8 in bad.wasm with one of the following for equal performance with good.wasm:
f64.const 0
f64.abs
f64.const 0
f64.const 0
f64.add
f64.const 42
f64.const 42
f64.sub
f64.const 777
f64.const 0
f64.mul

Wasmtime GitHub notifications bot (Nov 11 2024 at 04:11):

alexcrichton commented on issue #9590:

This is happening because a register is being spilled, but why it's being spilled I'm not sure. Changing from 1 to 0 changes codegen which changes register allocation which causes this. In that sense I think it's worth digging in to why the spill happened here in the 0 case because I'm not sure what's going on.

I've got this CLIF:

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v2, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:

Disassembly of 128 bytes <u0:1>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 10             subq    $0x10, %rsp
   8:   66 0f 57 c9             xorpd   %xmm1, %xmm1
   c:   48 8b 57 78             movq    0x78(%rdi), %rdx
  10:   f2 0f 11 4a 10          movsd   %xmm1, 0x10(%rdx)
  15:   48 89 fe                movq    %rdi, %rsi
  18:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  1d:   e8 00 00 00 00          callq   0x22
  22:   45 31 d2                xorl    %r10d, %r10d
  25:   4c 89 d6                movq    %r10, %rsi
  28:   45 31 db                xorl    %r11d, %r11d
  2b:   45 85 db                testl   %r11d, %r11d
  2e:   0f 85 0a 00 00 00       jne     0x3e
  34:   f3 0f 6f 0c 24          movdqu  (%rsp), %xmm1     ;; RELOAD
  39:   e9 04 00 00 00          jmp     0x42
  3e:   66 0f 57 c9             xorpd   %xmm1, %xmm1
  42:   4c 89 d7                movq    %r10, %rdi
  45:   81 e7 ff ff 00 00       andl    $0xffff, %edi
  4b:   41 83 c2 01             addl    $1, %r10d
  4f:   83 ff 01                cmpl    $1, %edi
  52:   0f 84 0a 00 00 00       je      0x62
  58:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  5d:   e9 c6 ff ff ff          jmp     0x28
  62:   8d 7e 01                leal    1(%rsi), %edi
  65:   81 fe ff ff 00 00       cmpl    $0xffff, %esi
  6b:   0f 84 0d 00 00 00       je      0x7e
  71:   48 89 fe                movq    %rdi, %rsi
  74:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  79:   e9 aa ff ff ff          jmp     0x28
  7e:   0f 0b                   ud2

where I've annotated the spill/reload slots.

@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?

Wasmtime GitHub notifications bot (Nov 11 2024 at 07:05):

cfallin commented on issue #9590:

@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?

Maybe -- FWIW I have zero cycles to dedicate to this right now, so you'll need to dig into RA2 on this, unfortunately. Maybe at some later time I'll have the ability to make another performance push. Sorry!

Wasmtime GitHub notifications bot (Nov 13 2024 at 09:27):

hungryzzz commented on issue #9590:

compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:

Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2

Hi @alexcrichton, I went through again the codes and the following is my heuristic thoughts about this bug:

Is it possible that the SPILL in line 18 is caused by the calling instruction since the xmm1 is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18).

However, in the good.wasm, it uses xmm0 to store the value, but xmm0 is not spilled before calling instruction even though xmm0 is also one of the caller-saved registers. Therefore, is it possible that cranelift uses different strategies for xmm0 and xmm1 before calling instruction?

Wasmtime GitHub notifications bot (Nov 13 2024 at 09:28):

hungryzzz edited a comment on issue #9590:

compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:

Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2

Hi @alexcrichton, I went through again the codes and the following is my heuristic thoughts about this bug:

Is it possible that the SPILL in line 18 is caused by the calling instruction since the xmm1 is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18)?

However, in the good.wasm, it uses xmm0 to store the value, but xmm0 is not spilled before calling instruction even though xmm0 is also one of the caller-saved registers. Therefore, is it possible that cranelift uses different strategies for xmm0 and xmm1 before calling instruction?

Wasmtime GitHub notifications bot (Nov 13 2024 at 09:28):

hungryzzz edited a comment on issue #9590:

compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:

Disassembly of 128 bytes <u0:1>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 10 subq $0x10, %rsp 8: 66 0f 57 c9 xorpd %xmm1, %xmm1 c: 48 8b 57 78 movq 0x78(%rdi), %rdx 10: f2 0f 11 4a 10 movsd %xmm1, 0x10(%rdx) 15: 48 89 fe movq %rdi, %rsi 18: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 1d: e8 00 00 00 00 callq 0x22 22: 45 31 d2 xorl %r10d, %r10d 25: 4c 89 d6 movq %r10, %rsi 28: 45 31 db xorl %r11d, %r11d 2b: 45 85 db testl %r11d, %r11d 2e: 0f 85 0a 00 00 00 jne 0x3e 34: f3 0f 6f 0c 24 movdqu (%rsp), %xmm1 ;; RELOAD 39: e9 04 00 00 00 jmp 0x42 3e: 66 0f 57 c9 xorpd %xmm1, %xmm1 42: 4c 89 d7 movq %r10, %rdi 45: 81 e7 ff ff 00 00 andl $0xffff, %edi 4b: 41 83 c2 01 addl $1, %r10d 4f: 83 ff 01 cmpl $1, %edi 52: 0f 84 0a 00 00 00 je 0x62 58: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 5d: e9 c6 ff ff ff jmp 0x28 62: 8d 7e 01 leal 1(%rsi), %edi 65: 81 fe ff ff 00 00 cmpl $0xffff, %esi 6b: 0f 84 0d 00 00 00 je 0x7e 71: 48 89 fe movq %rdi, %rsi 74: f3 0f 7f 0c 24 movdqu %xmm1, (%rsp) ;; SPILL 79: e9 aa ff ff ff jmp 0x28 7e: 0f 0b ud2

Hi @alexcrichton, I went through again the codes and the following are my heuristic thoughts about this bug:

Is it possible that the SPILL in line 18 is caused by the calling instruction since the xmm1 is one of the caller-saved registers? And the following RELOAD (line34) and SPILL (line58&74) are the chain reaction due to the first SPILL(line18)?

However, in the good.wasm, it uses xmm0 to store the value, but xmm0 is not spilled before calling instruction even though xmm0 is also one of the caller-saved registers. Therefore, is it possible that cranelift uses different strategies for xmm0 and xmm1 before calling instruction?

Wasmtime GitHub notifications bot (Nov 13 2024 at 10:22):

primoly commented on issue #9590:

In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original bad.wasm case it had to use different ones for 0 and 1, but you can get cranelift to use two different ones even when they are both zero, with for example the (f64.abs (f64.const 0)) hack i’ve shown above.

Generated CLIF after optimisations of @alexcrichton’s example (spills):

(Note how it uses only v2 for the zero constant)

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v2, v10  ; v2 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v55 = iconst.i32 0
    brif v55, block6, block7(v34)  ; v55 = 0

block6:
    v56 = f64const 0.0
    jump block7(v56)  ; v56 = 0.0

block7(v33: f64):
    v57 = iconst.i32 0xffff
    v58 = band.i32 v14, v57  ; v57 = 0xffff
    v59 = iconst.i32 1
    v60 = icmp ne v58, v59  ; v59 = 1
    v61 = iadd.i32 v14, v59  ; v59 = 1
    brif v60, block4(v61, v24, v33), block5

block5:
    v62 = iconst.i32 0xffff
    v63 = icmp.i32 ne v24, v62  ; v62 = 0xffff
    v64 = iconst.i32 1
    v65 = iadd.i32 v24, v64  ; v64 = 1
    brif v63, block4(v61, v65, v33), block9

block9:
    trap user11
}

Example that does not spill:

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v6 = fabs.f64 v2
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v6, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

</details>

Generated CLIF after optimisations of the above example:

(Note how it uses v2 and v55 for the zero constant)

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v55 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v55, v10  ; v55 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v2 = f64const 0.0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v56 = iconst.i32 0
    brif v56, block6, block7(v34)  ; v56 = 0

block6:
    v57 = f64const 0.0
    jump block7(v57)  ; v57 = 0.0

block7(v33: f64):
    v58 = iconst.i32 0xffff
    v59 = band.i32 v14, v58  ; v58 = 0xffff
    v60 = iconst.i32 1
    v61 = icmp ne v59, v60  ; v60 = 1
    v62 = iadd.i32 v14, v60  ; v60 = 1
    brif v61, block4(v62, v24, v33), block5

block5:
    v63 = iconst.i32 0xffff
    v64 = icmp.i32 ne v24, v63  ; v63 = 0xffff
    v65 = iconst.i32 1
    v66 = iadd.i32 v24, v65  ; v65 = 1
    brif v64, block4(v62, v66, v33), block9

block9:
    trap user11
}

Wasmtime GitHub notifications bot (Nov 13 2024 at 10:23):

primoly edited a comment on issue #9590:

In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original good.wasm case it had to use different ones for 0 and 1, but you can get cranelift to use two different ones even when they are both zero, with for example the (f64.abs (f64.const 0)) hack i’ve shown above.

Generated CLIF after optimisations of @alexcrichton’s example (spills):

(Note how it uses only v2 for the zero constant)

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v2, v10  ; v2 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v55 = iconst.i32 0
    brif v55, block6, block7(v34)  ; v55 = 0

block6:
    v56 = f64const 0.0
    jump block7(v56)  ; v56 = 0.0

block7(v33: f64):
    v57 = iconst.i32 0xffff
    v58 = band.i32 v14, v57  ; v57 = 0xffff
    v59 = iconst.i32 1
    v60 = icmp ne v58, v59  ; v59 = 1
    v61 = iadd.i32 v14, v59  ; v59 = 1
    brif v60, block4(v61, v24, v33), block5

block5:
    v62 = iconst.i32 0xffff
    v63 = icmp.i32 ne v24, v62  ; v62 = 0xffff
    v64 = iconst.i32 1
    v65 = iadd.i32 v24, v64  ; v64 = 1
    brif v63, block4(v61, v65, v33), block9

block9:
    trap user11
}

Example that does not spill:

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v6 = fabs.f64 v2
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v6, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

</details>

Generated CLIF after optimisations of the above example:

(Note how it uses v2 and v55 for the zero constant)

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v55 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v55, v10  ; v55 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v2 = f64const 0.0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v56 = iconst.i32 0
    brif v56, block6, block7(v34)  ; v56 = 0

block6:
    v57 = f64const 0.0
    jump block7(v57)  ; v57 = 0.0

block7(v33: f64):
    v58 = iconst.i32 0xffff
    v59 = band.i32 v14, v58  ; v58 = 0xffff
    v60 = iconst.i32 1
    v61 = icmp ne v59, v60  ; v60 = 1
    v62 = iadd.i32 v14, v60  ; v60 = 1
    brif v61, block4(v62, v24, v33), block5

block5:
    v63 = iconst.i32 0xffff
    v64 = icmp.i32 ne v24, v63  ; v63 = 0xffff
    v65 = iconst.i32 1
    v66 = iadd.i32 v24, v65  ; v65 = 1
    brif v64, block4(v62, v66, v33), block9

block9:
    trap user11
}

Wasmtime GitHub notifications bot (Nov 13 2024 at 17:50):

primoly edited a comment on issue #9590:

In the spilling case Cranelift uses the same value for multiple zero constants. In the non-spilling case it uses two different ones. In the original good.wasm case it had to use different values for 0.0 and 1.0, but you can get Cranelift to use two separate ones even when they are both 0.0, with for example the (f64.abs (f64.const 0)) hack i’ve shown above.

Generated CLIF after optimisations of @alexcrichton’s example (spills)

Note how it uses only v2 for the 0.0 constant in block0. This v2 is then used twice, in the store as well as the jump.

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v2, v10  ; v2 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v55 = iconst.i32 0
    brif v55, block6, block7(v34)  ; v55 = 0

block6:
    v56 = f64const 0.0
    jump block7(v56)  ; v56 = 0.0

block7(v33: f64):
    v57 = iconst.i32 0xffff
    v58 = band.i32 v14, v57  ; v57 = 0xffff
    v59 = iconst.i32 1
    v60 = icmp ne v58, v59  ; v59 = 1
    v61 = iadd.i32 v14, v59  ; v59 = 1
    brif v60, block4(v61, v24, v33), block5

block5:
    v62 = iconst.i32 0xffff
    v63 = icmp.i32 ne v24, v62  ; v62 = 0xffff
    v64 = iconst.i32 1
    v65 = iadd.i32 v24, v64  ; v64 = 1
    brif v63, block4(v61, v65, v33), block9

block9:
    trap user11
}

Example that does not spill

Note the additional v6 that will get constant folded.

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v6 = fabs.f64 v2
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v6, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

</details>

Generated CLIF after optimisations of the non-spilling example above

Note how it uses v2 and v55 for the 0.0 constant in block0. The store uses v55 while the jump uses v2.

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v55 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v55, v10  ; v55 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v2 = f64const 0.0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v56 = iconst.i32 0
    brif v56, block6, block7(v34)  ; v56 = 0

block6:
    v57 = f64const 0.0
    jump block7(v57)  ; v57 = 0.0

block7(v33: f64):
    v58 = iconst.i32 0xffff
    v59 = band.i32 v14, v58  ; v58 = 0xffff
    v60 = iconst.i32 1
    v61 = icmp ne v59, v60  ; v60 = 1
    v62 = iadd.i32 v14, v60  ; v60 = 1
    brif v61, block4(v62, v24, v33), block5

block5:
    v63 = iconst.i32 0xffff
    v64 = icmp.i32 ne v24, v63  ; v63 = 0xffff
    v65 = iconst.i32 1
    v66 = iadd.i32 v24, v65  ; v65 = 1
    brif v64, block4(v62, v66, v33), block9

block9:
    trap user11
}

Wasmtime GitHub notifications bot (Nov 15 2024 at 09:40):

hungryzzz commented on issue #9590:

Hi @alexcrichton @primoly, I also find that if I change the local.get 0 (line 42) in the bad.wasm to f64.const 0x0p+0, the performance will also be back to normal.

In the loop, the local variable 0 will be set in a branch (line 16), and obtained outside this loop (line 42). is it possible this dependency affects some analysis?

Here are the differences between their machine codes:

0000000000000000 <wasm[0]::function[1]>:
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 83 c2 30             add    $0x30,%r10
       f:   49 39 e2                cmp    %rsp,%r10
      12:   0f 87 9b 00 00 00       ja     b3 <wasm[0]::function[1]+0xb3>
      18:   48 83 ec 20             sub    $0x20,%rsp
      1c:   48 89 5c 24 10          mov    %rbx,0x10(%rsp)
      21:   c5 d9 57 cc             vxorpd %xmm4,%xmm4,%xmm1
      25:   48 8b 47 78             mov    0x78(%rdi),%rax
      29:   48 89 fe                mov    %rdi,%rsi
      2c:   c5 fb 11 48 10          vmovsd %xmm1,0x10(%rax)     ;; diff
      31:   48 89 f3                mov    %rsi,%rbx
      34:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)
      39:   48 89 df                mov    %rbx,%rdi
      3c:   e8 7f 00 00 00          call   c0 <wasm[0]::function[2]>
      41:   31 ff                   xor    %edi,%edi
      43:   49 89 fa                mov    %rdi,%r10
      46:   45 31 c9                xor    %r9d,%r9d
      49:   45 85 c9                test   %r9d,%r9d
      4c:   0f 85 0a 00 00 00       jne    5c <wasm[0]::function[1]+0x5c>     ;; diff
      52:   f3 0f 6f 0c 24          movdqu (%rsp),%xmm1                                      ;; diff
      57:   e9 04 00 00 00          jmp    60 <wasm[0]::function[1]+0x60>     ;; diff
      5c:   c5 e9 57 ca             vxorpd %xmm2,%xmm2,%xmm1     ;; diff
      60:   48 89 fe                mov    %rdi,%rsi
      63:   81 e6 ff ff 00 00       and    $0xffff,%esi
      69:   83 c7 01                add    $0x1,%edi
      6c:   83 fe 01                cmp    $0x1,%esi
      6f:   0f 84 0a 00 00 00       je     7f <wasm[0]::function[1]+0x7f>     ;; diff
      75:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)                                 ;; diff
      7a:   e9 c7 ff ff ff          jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      7f:   4c 89 d0                mov    %r10,%rax
      82:   81 e0 ff ff 00 00       and    $0xffff,%eax
      88:   41 83 c2 01             add    $0x1,%r10d
      8c:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
      92:   0f 84 0a 00 00 00       je     a2 <wasm[0]::function[1]+0xa2>     ;; diff
      98:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)                                   ;; diff
      9d:   e9 a4 ff ff ff          jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      a2:   48 8b 43 58             mov    0x58(%rbx),%rax
      a6:   48 8b 7b 68             mov    0x68(%rbx),%rdi
      aa:   31 d2                   xor    %edx,%edx
      ac:   48 89 de                mov    %rbx,%rsi
      af:   ff d0                   call   *%rax

Wasmtime GitHub notifications bot (Nov 15 2024 at 16:36):

hungryzzz edited a comment on issue #9590:

Hi @alexcrichton @primoly, I also find that if I change the local.get 0 (line 42) in the bad.wasm to f64.const 0x0p+0, the performance will also be back to normal. It seems that I can understand something about the register spills and reload.

Observations about the register spills and reload

The local.set 0 instruction is in a branch within a double loop. (4c - 5f in machine code) Besides, there is a local.get 0 outside the loop (even though it will be dropped right away.)
I guess that Wasmtime will store the value of local[0] in address %rsp, and use xmm1 register to store the new value of local[0].
When the branch is true, xmm1 is set to a new value (5c); while when the branch is false, xmm1 is set to the current value of local[0]. (address 52) This explains the register reload instruction in address52.
In every iteration, the new value of xmm1 will be stored back to the address %rsp (address 75 and 98), this explains the register spilling instructions in address 75 and 98.
Even though I understand why the register spills and reloads happen in bad.wasm, I still do not know why changing the f64.const 0x0p+0 in line 8 from f64.const 0x1p+0 prevents this situation from happening.

Here are the differences between their machine codes:

0000000000000000 <wasm[0]::function[1]>:
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 83 c2 30             add    $0x30,%r10
       f:   49 39 e2                cmp    %rsp,%r10
      12:   0f 87 9b 00 00 00       ja     b3 <wasm[0]::function[1]+0xb3>
      18:   48 83 ec 20             sub    $0x20,%rsp
      1c:   48 89 5c 24 10          mov    %rbx,0x10(%rsp)
      21:   c5 d9 57 cc             vxorpd %xmm4,%xmm4,%xmm1
      25:   48 8b 47 78             mov    0x78(%rdi),%rax
      29:   48 89 fe                mov    %rdi,%rsi
      2c:   c5 fb 11 48 10          vmovsd %xmm1,0x10(%rax)     ;; diff
      31:   48 89 f3                mov    %rsi,%rbx
      34:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)
      39:   48 89 df                mov    %rbx,%rdi
      3c:   e8 7f 00 00 00          call   c0 <wasm[0]::function[2]>
      41:   31 ff                   xor    %edi,%edi
      43:   49 89 fa                mov    %rdi,%r10
      46:   45 31 c9                xor    %r9d,%r9d
      49:   45 85 c9                test   %r9d,%r9d
      4c:   0f 85 0a 00 00 00       jne    5c <wasm[0]::function[1]+0x5c>     ;; diff
      52:   f3 0f 6f 0c 24          movdqu (%rsp),%xmm1                                ;; diff
      57:   e9 04 00 00 00          jmp    60 <wasm[0]::function[1]+0x60>     ;; diff
      5c:   c5 e9 57 ca             vxorpd %xmm2,%xmm2,%xmm1     ;; diff
      60:   48 89 fe                mov    %rdi,%rsi
      63:   81 e6 ff ff 00 00       and    $0xffff,%esi
      69:   83 c7 01                add    $0x1,%edi
      6c:   83 fe 01                cmp    $0x1,%esi
      6f:   0f 84 0a 00 00 00       je     7f <wasm[0]::function[1]+0x7f>     ;; diff
      75:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)                             ;; diff
      7a:   e9 c7 ff ff ff          jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      7f:   4c 89 d0                mov    %r10,%rax
      82:   81 e0 ff ff 00 00       and    $0xffff,%eax
      88:   41 83 c2 01             add    $0x1,%r10d
      8c:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
      92:   0f 84 0a 00 00 00       je     a2 <wasm[0]::function[1]+0xa2>     ;; diff
      98:   f3 0f 7f 0c 24          movdqu %xmm1,(%rsp)                           ;; diff
      9d:   e9 a4 ff ff ff          jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      a2:   48 8b 43 58             mov    0x58(%rbx),%rax
      a6:   48 8b 7b 68             mov    0x68(%rbx),%rdi
      aa:   31 d2                   xor    %edx,%edx
      ac:   48 89 de                mov    %rbx,%rsi
      af:   ff d0                   call   *%rax

Wasmtime GitHub notifications bot (Nov 16 2024 at 14:14):

alexcrichton commented on issue #9590:

@hungryzzz you might be right yeah! I would have naively expected live range splitting in regalloc2 though to kick in and make it such that the spill, if necessary, only happened around the call rather than through the loop. As for small differences, when it comes to regalloc heuristics AFAIK it's expected that small changes to the input can have subtle changes to the output. Especially with specific decisions like spilling here I'm not entirely surprised. That being said though I don't fully understand everything happening here so I can't really answer with certainty. (e.g. I don't know for certain why @primoly's changes resulted in different register allocation decisions)

Wasmtime GitHub notifications bot (Nov 16 2024 at 14:28):

primoly commented on issue #9590:

The reason for the spill is the call to another function which leads to Cranelift saving the value of thexmm register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the value of two zeros before the call and after as different values, it just writes a fresh 0.0 into another xmm register. Notice how the non-spilling cases always use two different xmm registers while the spilling ones only use one, matching the non-spilling CLIF using v2 and v55 while the spilling CLIF uses only v2.

Wasmtime GitHub notifications bot (Nov 16 2024 at 15:26):

primoly edited a comment on issue #9590:

The reason for the spill is the call to another function which leads to Cranelift saving the value of thexmm register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the two zeros before the call and after as different values, it just writes a fresh 0.0 into another xmm register. Notice how the non-spilling cases always use two different xmm registers while the spilling ones only use one, matching the non-spilling CLIF using v55 before the call fn0(v0, v0) and v2 after while the spilling CLIF uses v2 before and after.

Last updated: Apr 18 2025 at 12:05 UTC