Stream: git-wasmtime

Topic: wasmtime / issue #7495 Bad performance in Wasmtime when e...


view this post on Zulip Wasmtime GitHub notifications bot (Nov 07 2023 at 18:02):

hungryzzz opened issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

I dump the corresponding machine code generated by Wasmtime and WasmEdge, I find that, the first loop will be optimized to empty in WasmEdge but Wasmtime cannot do this.

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

And after removing the first loop in the above .wat, the execution time of Wasmtime will back to normal. So I am wandering that whether there is an related optimization in Cranelift can optimize the first loop like LLVM.

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 12:16):

hungryzzz edited issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    ;; First Loop
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end

    ;; Second Loop
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

I dump the corresponding machine code generated by Wasmtime and WasmEdge, I find that there is something strange with the machine code generated by Wasmtime,
1. the first loop in the above wat will be compiled to 64-6f(I guess that), but actually the first time the jne instruction is executed, the loop will be jumped out of, but %ecx will be added 1.
2. In the second loop, mov %r8d,(%rax,%rcx,1) is also strange. Because the variables used in these two loops don't have data dependence.

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

I try to remove the first loop in the above wat and the genrated codes are a little different,
1. mov %r8d,(%rax,%rcx,1) in the above version becomes mov %eax,(%rsi)
2. mov (%rax),%r8d in the above version disappeared.

So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.

  64:   31 c9                   xor    %ecx,%ecx
  66:   48 8b b7 80 00 00 00    mov    0x80(%rdi),%rsi

  6d:   8b 87 a0 00 00 00       mov    0xa0(%rdi),%eax
  73:   89 06                   mov    %eax,(%rsi)
  75:   48 89 c8                mov    %rcx,%rax
  78:   81 e0 ff ff ff 3f       and    $0x3fffffff,%eax
  7e:   83 c1 01                add    $0x1,%ecx
  81:   81 f8 ff ff ff 3f       cmp    $0x3fffffff,%eax
  87:   0f 85 e0 ff ff ff       jne    6d <wasm[0]::function[2]+0x1d>

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 14:31):

hungryzzz edited issue #7495:

Hi, I run the following case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    call 2
    global.get 0
    i32.const 31
    i32.and
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local i32 i32 i32 i32 i32 i32)
    ;; First Loop
    loop  ;; label = @6
      local.get 1
      local.get 2
      local.get 2
      i32.const 1
      i32.add
      local.set 2
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      i32.and
      br_if 0 (;@6;)
    end

    ;; Second Loop
    loop  ;; label = @6
      local.get 2
      i32.const 255
      i32.and
      global.get 1
      i32.store
      local.get 2
      local.get 0
      i32.and
      local.tee 5
      i32.const 255
      i32.and
      i32.load
      local.set 4
      local.get 3
      local.get 3
      i32.const 1
      i32.add
      local.set 3
      i32.const 0x3FFFFFFF
      i32.and
      i32.const 0x3FFFFFFF
      i32.ne
      br_if 0 (;@6;)
    end
  )
  (table (;0;) 6 6 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 5246560))
  (global (;1;) (mut i32) (i32.const 615098409))
  (export "memory" (memory 0))
  (export "_start" (func 1))
  (export "__indirect_function_table" (table 0)))

Try to debug

I dump the corresponding machine code generated by Wasmtime and WasmEdge, I find that there is something strange with the machine code generated by Wasmtime,
1. the first loop in the above wat will be compiled to 64-6f(I guess that), but actually the first time the jne instruction is executed, the loop will be jumped out of, but %ecx will be added 1.
2. In the second loop, mov %r8d,(%rax,%rcx,1) is also strange. Because the variables used in these two loops don't have data dependence.

# Machine code generated by Wasmtime

64: 31 c9                   xor    %ecx,%ecx
66: 45 31 db                xor    %r11d,%r11d
69: 83 c1 01                add    $0x1,%ecx
6c: 45 85 db                test   %r11d,%r11d
6f: 0f 85 f1 ff ff ff       jne    66 <wasm[0]::function[2]+0x16>

75: 31 d2                   xor    %edx,%edx
77: 48 8b 87 80 00 00 00    mov    0x80(%rdi),%rax
7e: 81 e1 ff 00 00 00       and    $0xff,%ecx
84: 44 8b 87 a0 00 00 00    mov    0xa0(%rdi),%r8d
8b: 44 89 04 08             mov    %r8d,(%rax,%rcx,1)
8f: 44 8b 00                mov    (%rax),%r8d
92: 49 89 d0                mov    %rdx,%r8
95: 41 81 e0 ff ff ff 3f    and    $0x3fffffff,%r8d
9c: 83 c2 01                add    $0x1,%edx
9f: 41 81 f8 ff ff ff 3f    cmp    $0x3fffffff,%r8d
a6: 0f 85 d8 ff ff ff       jne    84 <wasm[0]::function[2]+0x34>


# Machine code generated by WasmEdge
60: 48 8b 07                mov    (%rdi),%rax
63: 48 8b 4f 08             mov    0x8(%rdi),%rcx
67: 31 d2                   xor    %edx,%edx
69: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
70: 48 8b 71 08             mov    0x8(%rcx),%rsi
74: 8b 36                   mov    (%rsi),%esi
76: 48 8b 38                mov    (%rax),%rdi
79: 89 77 01                mov    %esi,0x1(%rdi)
7c: 8d 72 01                lea    0x1(%rdx),%esi
7f: 81 e2 ff ff ff 3f       and    $0x3fffffff,%edx
85: 81 fa ff ff ff 3f       cmp    $0x3fffffff,%edx
8b: 89 f2                   mov    %esi,%edx
8d: 75 e1                   jne    70 <f2+0x10>

I try to remove the first loop in the above wat and the genrated codes are a little different,
1. mov %r8d,(%rax,%rcx,1) in the above version becomes mov %eax,(%rsi)
2. mov (%rax),%r8d in the above version disappeared.

So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.

  64:   31 c9                   xor    %ecx,%ecx
  66:   48 8b b7 80 00 00 00    mov    0x80(%rdi),%rsi

  6d:   8b 87 a0 00 00 00       mov    0xa0(%rdi),%eax
  73:   89 06                   mov    %eax,(%rsi)
  75:   48 89 c8                mov    %rcx,%rax
  78:   81 e0 ff ff ff 3f       and    $0x3fffffff,%eax
  7e:   83 c1 01                add    $0x1,%ecx
  81:   81 f8 ff ff ff 3f       cmp    $0x3fffffff,%eax
  87:   0f 85 e0 ff ff ff       jne    6d <wasm[0]::function[2]+0x1d>

Versions and Environment

case.zip

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 21:06):

alexcrichton commented on issue #7495:

This is the output of VTune for me:

<img width="1193" alt="Screenshot 2023-11-08 at 3 04 06 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c9e0f911-5752-4334-b47b-755e4234537d">

which seems to point the smoking gun at "Loads Blocked by Store Forwarding"

For instructions I got:

<img width="1273" alt="Screenshot 2023-11-08 at 3 04 54 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c60a113e-38c3-423a-925d-82d9e003d7a5">

Cranelift/Wasmtime emit a load-store-load whereas LLVM/WasmEdge seem to emit a load-load-store which probably doesn't break load-store forwarding would be my guess.

I have not dug further yet at this time. I don't think Cranelift is generating wrong code here, and I have not dug into why removing the loop affects things.


Last updated: Jan 24 2025 at 00:11 UTC