wasmtime / issue #7495 Bad performance in Wasmtime when e... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #7495 Bad performance in Wasmtime when e...

Wasmtime GitHub notifications bot (Nov 07 2023 at 18:02):

hungryzzz opened issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

Wasmtime: 4.08s
WasmEdge: 0.66s

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

I dump the corresponding machine code generated by Wasmtime and WasmEdge, I find that, the first loop will be optimized to empty in WasmEdge but Wasmtime cannot do this.

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

And after removing the first loop in the above .wat, the execution time of Wasmtime will back to normal. So I am wandering that whether there is an related optimization in Cranelift can optimize the first loop like LLVM.

Versions and Environment

Wasmtime: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)
WasmEdge: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870
kernel version: Linux 5.15.0-83-generic

Wasmtime GitHub notifications bot (Nov 08 2023 at 12:16):

hungryzzz edited issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

Wasmtime: 4.08s
WasmEdge: 0.66s

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    ;; First Loop
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end

    ;; Second Loop
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

I dump the corresponding machine code generated by Wasmtime and WasmEdge, I find that there is something strange with the machine code generated by Wasmtime,
1. the first loop in the above wat will be compiled to 64-6f(I guess that), but actually the first time the jne instruction is executed, the loop will be jumped out of, but %ecx will be added 1.
2. In the second loop, mov %r8d,(%rax,%rcx,1) is also strange. Because the variables used in these two loops don't have data dependence.

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

I try to remove the first loop in the above wat and the genrated codes are a little different,
1. mov %r8d,(%rax,%rcx,1) in the above version becomes mov %eax,(%rsi)
2. mov (%rax),%r8d in the above version disappeared.

So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.

  64:   31 c9                   xor    %ecx,%ecx
  66:   48 8b b7 80 00 00 00    mov    0x80(%rdi),%rsi

  6d:   8b 87 a0 00 00 00       mov    0xa0(%rdi),%eax
  73:   89 06                   mov    %eax,(%rsi)
  75:   48 89 c8                mov    %rcx,%rax
  78:   81 e0 ff ff ff 3f       and    $0x3fffffff,%eax
  7e:   83 c1 01                add    $0x1,%ecx
  81:   81 f8 ff ff ff 3f       cmp    $0x3fffffff,%eax
  87:   0f 85 e0 ff ff ff       jne    6d <wasm[0]::function[2]+0x1d>

Versions and Environment

Wasmtime: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)
WasmEdge: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870
kernel version: Linux 5.15.0-83-generic

Wasmtime GitHub notifications bot (Nov 08 2023 at 14:31):

hungryzzz edited issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

Wasmtime: 4.08s
WasmEdge: 0.66s

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    ;; First Loop
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end

    ;; Second Loop
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.

  64:   31 c9                   xor    %ecx,%ecx
  66:   48 8b b7 80 00 00 00    mov    0x80(%rdi),%rsi

  6d:   8b 87 a0 00 00 00       mov    0xa0(%rdi),%eax
  73:   89 06                   mov    %eax,(%rsi)
  75:   48 89 c8                mov    %rcx,%rax
  78:   81 e0 ff ff ff 3f       and    $0x3fffffff,%eax
  7e:   83 c1 01                add    $0x1,%ecx
  81:   81 f8 ff ff ff 3f       cmp    $0x3fffffff,%eax
  87:   0f 85 e0 ff ff ff       jne    6d <wasm[0]::function[2]+0x1d>

Versions and Environment

Wasmtime: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)
WasmEdge: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870
kernel version: Linux 5.15.0-83-generic

case.zip

Wasmtime GitHub notifications bot (Nov 08 2023 at 21:06):

alexcrichton commented on issue #7495:

This is the output of VTune for me:

<img width="1193" alt="Screenshot 2023-11-08 at 3 04 06 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c9e0f911-5752-4334-b47b-755e4234537d">

which seems to point the smoking gun at "Loads Blocked by Store Forwarding"

For instructions I got:

<img width="1273" alt="Screenshot 2023-11-08 at 3 04 54 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c60a113e-38c3-423a-925d-82d9e003d7a5">

Cranelift/Wasmtime emit a load-store-load whereas LLVM/WasmEdge seem to emit a load-load-store which probably doesn't break load-store forwarding would be my guess.

I have not dug further yet at this time. I don't think Cranelift is generating wrong code here, and I have not dug into why removing the loop affects things.

Last updated: Apr 18 2025 at 13:08 UTC