hungryzzz opened issue #8624:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.17s
- WasmEdge: 0.59s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 6.01s
- WasmEdge: 0.59s
The difference between the attached two cases is as follow, i.e., changing the operand of
i32.add
fromi32.const 0
tolocal.get 1
, which can bring 5x performance decreasing onWasmtime
but has no effect onWasmEdge
.➜ cases diff good.wat bad.wat 62c62 < i32.const 0 --- > local.get 1
# good.wat i32.const 0 i32.const 1 i32.add # bad.wat local.get 1 i32.const 1 i32.add
At first I thought the performance decreasing was caused by the difference, because the
good
one uses a constant while thebad
one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000
times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same,1.09s
vs.1.1s
. Therefore, I think the above performance decreasing is caused by other reasons.# experiment case (module (type (;0;) (func)) (type (;1;) (func (param i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1))) (func (;1;) (type 0) (local i32 i32) loop ;; i32.const 0 local.get 1 i32.const 1 i32.add local.get 0 i32.add local.tee 0 i32.const 2000000000 i32.ne br_if 0 end i32.const 0 call 0 unreachable) (memory (;0;) 8192 8192) (export "_start" (func 1)) (export "memory" (memory 0)))
Profiling Information
I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.
Samples: 23K of event 'cycles', Event count (approx.): 21853956752 Overhead Command Shared Object Symbol 99.87% wasmtime jitted-93855-1.so [.] wasm[0]::function[2] ▒ 0.02% tokio-runtime-w [kernel.kallsyms] [k] __mod_memcg_lruvec_state ▒ 0.02% wasmtime ld-2.31.so [.] _dl_relocate_object ▒ 0.02% wasmtime [kernel.kallsyms] [k] __do_fault ▒ 0.01% wasmtime [kernel.kallsyms] [k] pmd_page_vaddr ▒
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: d0cf46a098d97bab9
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8624:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured by time tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.17s
- WasmEdge: 0.59s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 6.01s
- WasmEdge: 0.59s
The difference between the attached two cases is as follow, i.e., changing the operand of
i32.add
fromi32.const 0
tolocal.get 1
, which can bring 5x performance decreasing onWasmtime
but has no effect onWasmEdge
.➜ cases diff good.wat bad.wat 62c62 < i32.const 0 --- > local.get 1
# good.wat i32.const 0 i32.const 1 i32.add # bad.wat local.get 1 i32.const 1 i32.add
At first I thought the performance decreasing was caused by the difference, because the
good
one uses a constant while thebad
one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000
times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same,1.09s
vs.1.1s
. Therefore, I think the above performance decreasing is caused by other reasons.# experiment case (module (type (;0;) (func)) (type (;1;) (func (param i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1))) (func (;1;) (type 0) (local i32 i32) loop ;; i32.const 0 local.get 1 i32.const 1 i32.add local.get 0 i32.add local.tee 0 i32.const 2000000000 i32.ne br_if 0 end i32.const 0 call 0 unreachable) (memory (;0;) 8192 8192) (export "_start" (func 1)) (export "memory" (memory 0)))
Profiling Information
I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.
Samples: 23K of event 'cycles', Event count (approx.): 21853956752 Overhead Command Shared Object Symbol 99.87% wasmtime jitted-93855-1.so [.] wasm[0]::function[2] ▒ 0.02% tokio-runtime-w [kernel.kallsyms] [k] __mod_memcg_lruvec_state ▒ 0.02% wasmtime ld-2.31.so [.] _dl_relocate_object ▒ 0.02% wasmtime [kernel.kallsyms] [k] __do_fault ▒ 0.01% wasmtime [kernel.kallsyms] [k] pmd_page_vaddr ▒
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: d0cf46a098d97bab9
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8624:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured by time tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.17s
- WasmEdge: 0.59s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 6.01s
- WasmEdge: 0.59s
The difference between the attached two cases is as follow, i.e., changing the operand of
i32.add
fromi32.const 0
tolocal.get 1
, which can bring 5x performance decreasing onWasmtime
but has no effect onWasmEdge
.➜ cases diff good.wat bad.wat 62c62 < i32.const 0 --- > local.get 1
;; good.wat i32.const 0 i32.const 1 i32.add ;; bad.wat local.get 1 i32.const 1 i32.add
At first I thought the performance decreasing was caused by the difference, because the
good
one uses a constant while thebad
one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000
times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same,1.09s
vs.1.1s
. Therefore, I think the above performance decreasing is caused by other reasons.;; experiment case (module (type (;0;) (func)) (type (;1;) (func (param i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1))) (func (;1;) (type 0) (local i32 i32) loop ;; i32.const 0 local.get 1 i32.const 1 i32.add local.get 0 i32.add local.tee 0 i32.const 2000000000 i32.ne br_if 0 end i32.const 0 call 0 unreachable) (memory (;0;) 8192 8192) (export "_start" (func 1)) (export "memory" (memory 0)))
Profiling Information
I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.
Samples: 23K of event 'cycles', Event count (approx.): 21853956752 Overhead Command Shared Object Symbol 99.87% wasmtime jitted-93855-1.so [.] wasm[0]::function[2] ▒ 0.02% tokio-runtime-w [kernel.kallsyms] [k] __mod_memcg_lruvec_state ▒ 0.02% wasmtime ld-2.31.so [.] _dl_relocate_object ▒ 0.02% wasmtime [kernel.kallsyms] [k] __do_fault ▒ 0.01% wasmtime [kernel.kallsyms] [k] pmd_page_vaddr ▒
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: d0cf46a098d97bab9
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
alexcrichton commented on issue #8624:
Locally on Intel(R) Core(TM) i9-14900K I get 0.9s on bad.wasm and 0.74s on good.wasm
Profiling good.wasm the hot loop is, according to
perf
:│ d5:┌─→mov (%rbx),%r10d ▒ │ │ add $0x1,%r13d ▒ │ ├──test %r13d,%r13d ▒ 100.00 │ └──jne d5 ▒
For
bad.wasm
it's│ c5:┌─→mov (%rbx),%r10d ▒ │ │ mov (%rsp),%rcx ▒ 0.09 │ │ lea 0x1(%rcx),%r10d ▒ 55.32 │ │ test %r10d,%r10d ▒ │ │↓ je e5 ▒ 0.06 │ │ mov %r10,%rcx ▒ 0.03 │ │ mov %rcx,(%rsp) ▒ 44.51 │ └──jmp c5 ▒ │ e5: xor %edi,%edi ▒
I believe WasmEdge is an LLVM-based backend so it probably unrolls this loop or something similar.
hungryzzz commented on issue #8624:
I find that there is not difference between the binary generated by
WasmEdge
before and after changes, so I think maybe the block which contains the small changes is optimized (eliminated) during code generation in LLVM.
hungryzzz edited a comment on issue #8624:
I find that there is not difference between the binary generated by
WasmEdge
before and after changes, so I think maybe the block which contains the small changes is optimized (eliminated) during code generation in LLVM.So maybe the performance decreasing in
Wasmtime
is caused by the newly generated instructions in loop, but I am still surprised that only the small changes in machine code can make more than 5 times performance impact...
alexcrichton commented on issue #8624:
My guess is you're hitting micro-architectural limits or things like that. Basically various cliffs in the CPU in terms of performance where once you fall off the happy path it's both difficult to explain why and difficult to understand the effects. That's just my best guess though.
Last updated: Dec 23 2024 at 12:05 UTC