Stream: git-wasmtime

Topic: wasmtime / issue #8624 Small difference makes suspicion p...


view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 14:03):

hungryzzz opened issue #8624:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the operand of i32.add from i32.const 0 to local.get 1, which can bring 5x performance decreasing on Wasmtime but has no effect on WasmEdge.

  cases diff good.wat bad.wat
62c62
<                       i32.const 0
---
>                       local.get 1
# good.wat
i32.const 0
i32.const 1
i32.add

# bad.wat
local.get 1
i32.const 1
i32.add

At first I thought the performance decreasing was caused by the difference, because the good one uses a constant while the bad one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000 times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same, 1.09s vs. 1.1s. Therefore, I think the above performance decreasing is caused by other reasons.

# experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0
      local.get 1
      i32.const 1
      i32.add
      local.get 0
      i32.add
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0
    end
    i32.const 0
    call 0
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))

Profiling Information

I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.

Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]                                                                                              0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state                                                                                          0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object                                                                                               0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault                                                                                                        0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr                                                                                                 
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 14:38):

hungryzzz edited issue #8624:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the operand of i32.add from i32.const 0 to local.get 1, which can bring 5x performance decreasing on Wasmtime but has no effect on WasmEdge.

  cases diff good.wat bad.wat
62c62
<                       i32.const 0
---
>                       local.get 1
# good.wat
i32.const 0
i32.const 1
i32.add

# bad.wat
local.get 1
i32.const 1
i32.add

At first I thought the performance decreasing was caused by the difference, because the good one uses a constant while the bad one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000 times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same, 1.09s vs. 1.1s. Therefore, I think the above performance decreasing is caused by other reasons.

# experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0
      local.get 1
      i32.const 1
      i32.add
      local.get 0
      i32.add
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0
    end
    i32.const 0
    call 0
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))

Profiling Information

I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.

Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]                                                                                              0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state                                                                                          0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object                                                                                               0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault                                                                                                        0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr                                                                                                 
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:08):

hungryzzz edited issue #8624:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the operand of i32.add from i32.const 0 to local.get 1, which can bring 5x performance decreasing on Wasmtime but has no effect on WasmEdge.

  cases diff good.wat bad.wat
62c62
<                       i32.const 0
---
>                       local.get 1
;; good.wat
i32.const 0
i32.const 1
i32.add

;; bad.wat
local.get 1
i32.const 1
i32.add

At first I thought the performance decreasing was caused by the difference, because the good one uses a constant while the bad one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000 times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same, 1.09s vs. 1.1s. Therefore, I think the above performance decreasing is caused by other reasons.

;; experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0
      local.get 1
      i32.const 1
      i32.add
      local.get 0
      i32.add
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0
    end
    i32.const 0
    call 0
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))

Profiling Information

I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.

Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]                                                                                              0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state                                                                                          0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object                                                                                               0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault                                                                                                        0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr                                                                                                 
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:52):

alexcrichton commented on issue #8624:

Locally on Intel(R) Core(TM) i9-14900K I get 0.9s on bad.wasm and 0.74s on good.wasm

Profiling good.wasm the hot loop is, according to perf:

        d5:┌─→mov    (%rbx),%r10d                                                                                                                                                                                                                                                                                             
             add    $0x1,%r13d                                                                                                                                                                                                                                                                                               
           ├──test   %r13d,%r13d                                                                                                                                                                                                                                                                                              
100.00     └──jne    d5                                                                                                                                                                                                                                                                                                       

For bad.wasm it's

        c5:┌─→mov    (%rbx),%r10d                                                                                                                                                                                                                                                                                             
             mov    (%rsp),%rcx                                                                                                                                                                                                                                                                                              
  0.09       lea    0x1(%rcx),%r10d                                                                                                                                                                                                                                                                                          
 55.32       test   %r10d,%r10d                                                                                                                                                                                                                                                                                              
           │↓ je     e5                                                                                                                                                                                                                                                                                                       
  0.06       mov    %r10,%rcx                                                                                                                                                                                                                                                                                                
  0.03       mov    %rcx,(%rsp)                                                                                                                                                                                                                                                                                              
 44.51     └──jmp    c5                                                                                                                                                                                                                                                                                                       
        e5:   xor    %edi,%edi                                                                                                                                                                                                                                                                                                

I believe WasmEdge is an LLVM-based backend so it probably unrolls this loop or something similar.

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2024 at 14:56):

hungryzzz commented on issue #8624:

I find that there is not difference between the binary generated by WasmEdge before and after changes, so I think maybe the block which contains the small changes is optimized (eliminated) during code generation in LLVM.

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2024 at 15:00):

hungryzzz edited a comment on issue #8624:

I find that there is not difference between the binary generated by WasmEdge before and after changes, so I think maybe the block which contains the small changes is optimized (eliminated) during code generation in LLVM.

So maybe the performance decreasing in Wasmtime is caused by the newly generated instructions in loop, but I am still surprised that only the small changes in machine code can make more than 5 times performance impact...

view this post on Zulip Wasmtime GitHub notifications bot (May 16 2024 at 17:26):

alexcrichton commented on issue #8624:

My guess is you're hitting micro-architectural limits or things like that. Basically various cliffs in the CPU in terms of performance where once you fall off the happy path it's both difficult to explain why and difficult to understand the effects. That's just my best guess though.


Last updated: Oct 23 2024 at 20:03 UTC