hungryzzz opened issue #8648:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from
1
to0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.➜ cases diff good.wat bad.wat 23c23 < i32.const 1 --- > i32.const 0
;; part of `good.wat` i32.const 0 i32.const 1 ;; here i32.store i32.const 0 i32.const 0 i32.store ;; part of `bad.wat` i32.const 0 i32.const 0 ;; here i32.store i32.const 0 i32.const 0 i32.store
And I check the machine code generated by
Wasmtime
andWasmEdge
, and find the instruction sequence ofWasmtime
is different before and after change but theWasmEdge
one only changes an immediate of an instruction from1
to0
.I have two confusion of the machine code generatd by
Wasmtime
:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just asWasmEdge
is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm 45c45 < 82: c7 02 01 00 00 00 movl $0x1,(%rdx) --- > 82: c7 02 00 00 00 00 movl $0x0,(%rdx) # difference of machine code generated by wasmtime # good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm │ xor %r10d,%r10d 0.01 │ mov %r10b,(%rcx) 0.01 │ movl $0x0,(%rcx) 20.82 │ movl $0x0,(%rcx) 0.05 │ movl $0x0,(%rcx) 18.60 │ mov $0x1,%r11d 0.03 │ movl $0x1,(%rcx) 20.42 │ movl $0x0,(%rcx) 22.37 │ movl $0x0,(%rcx) 17.68 │ mov %r11b,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx) # bad.wasm │ xor %r10d,%r10d │ mov %r10b,(%rcx) 24.94 │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) 75.04 │ movl $0x0,(%rcx) 0.01 │ mov $0x1,%eax │ mov %al,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8648:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from
1
to0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.➜ cases diff good.wat bad.wat 23c23 < i32.const 1 --- > i32.const 0
;; part of `good.wat` i32.const 0 i32.const 1 ;; here i32.store i32.const 0 i32.const 0 i32.store ;; part of `bad.wat` i32.const 0 i32.const 0 ;; here i32.store i32.const 0 i32.const 0 i32.store
And I check the machine code generated by
Wasmtime
andWasmEdge
, and find the instruction sequence ofWasmtime
is different before and after change but theWasmEdge
one only changes an immediate of an instruction from1
to0
.I have two confusion of the machine code generatd by
Wasmtime
:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just asWasmEdge
is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm 45c45 < 82: c7 02 01 00 00 00 movl $0x1,(%rdx) --- > 82: c7 02 00 00 00 00 movl $0x0,(%rdx) # difference of machine code generated by wasmtime # good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm │ xor %r10d,%r10d 0.01 │ mov %r10b,(%rcx) 0.01 │ movl $0x0,(%rcx) 20.82 │ movl $0x0,(%rcx) 0.05 │ movl $0x0,(%rcx) 18.60 │ mov $0x1,%r11d 0.03 │ movl $0x1,(%rcx) 20.42 │ movl $0x0,(%rcx) 22.37 │ movl $0x0,(%rcx) 17.68 │ mov %r11b,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx) # bad.wasm │ xor %r10d,%r10d │ mov %r10b,(%rcx) 24.94 │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) 75.04 │ movl $0x0,(%rcx) 0.01 │ mov $0x1,%eax │ mov %al,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8648:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from
1
to0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.➜ cases diff good.wat bad.wat 23c23 < i32.const 1 --- > i32.const 0
;; part of `good.wat` i32.const 0 i32.const 1 ;; here i32.store i32.const 0 i32.const 0 i32.store ;; part of `bad.wat` i32.const 0 i32.const 0 ;; here i32.store i32.const 0 i32.const 0 i32.store
And I check the machine code generated by
Wasmtime
andWasmEdge
, and find the instruction sequence ofWasmtime
is different before and after change but theWasmEdge
one only changes an immediate of an instruction from1
to0
.I have two confusion of the machine code generatd by
Wasmtime
:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just asWasmEdge
is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm 45c45 < 82: c7 02 01 00 00 00 movl $0x1,(%rdx) --- > 82: c7 02 00 00 00 00 movl $0x0,(%rdx) # difference of machine code generated by wasmtime # good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm │ xor %r10d,%r10d 0.01 │ mov %r10b,(%rcx) 0.01 │ movl $0x0,(%rcx) 20.82 │ movl $0x0,(%rcx) 0.05 │ movl $0x0,(%rcx) 18.60 │ mov $0x1,%r11d 0.03 │ movl $0x1,(%rcx) 20.42 │ movl $0x0,(%rcx) 22.37 │ movl $0x0,(%rcx) 17.68 │ mov %r11b,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx) # bad.wasm │ xor %r10d,%r10d │ mov %r10b,(%rcx) 24.94 │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) 75.04 │ movl $0x0,(%rcx) 0.01 │ mov $0x1,%eax │ mov %al,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8648:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from
1
to0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.➜ cases diff good.wat bad.wat 23c23 < i32.const 1 --- > i32.const 0
;; part of `good.wat` i32.const 0 i32.const 1 ;; here i32.store i32.const 0 i32.const 0 i32.store ;; part of `bad.wat` i32.const 0 i32.const 0 ;; here i32.store i32.const 0 i32.const 0 i32.store
And I check the machine code generated by
Wasmtime
andWasmEdge
, and find the instruction sequence ofWasmtime
is different before and after change but theWasmEdge
one only changes an immediate of an instruction from1
to0
.I have two confusions to the machine code generatd by
Wasmtime
:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just asWasmEdge
is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm 45c45 < 82: c7 02 01 00 00 00 movl $0x1,(%rdx) --- > 82: c7 02 00 00 00 00 movl $0x0,(%rdx) # difference of machine code generated by wasmtime # good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm │ xor %r10d,%r10d 0.01 │ mov %r10b,(%rcx) 0.01 │ movl $0x0,(%rcx) 20.82 │ movl $0x0,(%rcx) 0.05 │ movl $0x0,(%rcx) 18.60 │ mov $0x1,%r11d 0.03 │ movl $0x1,(%rcx) 20.42 │ movl $0x0,(%rcx) 22.37 │ movl $0x0,(%rcx) 17.68 │ mov %r11b,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx) # bad.wasm │ xor %r10d,%r10d │ mov %r10b,(%rcx) 24.94 │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) 75.04 │ movl $0x0,(%rcx) 0.01 │ mov $0x1,%eax │ mov %al,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
cfallin commented on issue #8648:
# good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in
rcx
to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.I am fascinated by this case because the assembly differs in only two ways: the
mov $1, reg
is shifted downward by three instructions, and the register allocator choserax
rather thanr11
. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.(As to why the instruction moves: in the
good.wasm
case the immediate is shared by two of the instructions, so we only load it once, before its first use. Separately we could probably have another lowering rule formov [byte immediate], (mem)
but that would change both cases here.)
cfallin edited a comment on issue #8648:
# good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in
rcx
to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.I am fascinated by this case because the assembly differs in only two ways: the
mov $1, reg
is shifted downward by three instructions, and the register allocator choserax
rather thanr11
. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.(As to why the instruction moves: in the
good.wasm
case the immediate is shared by two of the instructions [EDIT: or at least would have been if the 32-bit store of constant1
didn't fold the immediate in], so we only load it once, before its first use. Separately we could probably have another lowering rule formov [byte immediate], (mem)
but that would change both cases here.)
hungryzzz closed issue #8648:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from
1
to0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.➜ cases diff good.wat bad.wat 23c23 < i32.const 1 --- > i32.const 0
;; part of `good.wat` i32.const 0 i32.const 1 ;; here i32.store i32.const 0 i32.const 0 i32.store ;; part of `bad.wat` i32.const 0 i32.const 0 ;; here i32.store i32.const 0 i32.const 0 i32.store
And I check the machine code generated by
Wasmtime
andWasmEdge
, and find the instruction sequence ofWasmtime
is different before and after change but theWasmEdge
one only changes an immediate of an instruction from1
to0
.I have two confusions to the machine code generatd by
Wasmtime
:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just asWasmEdge
is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm 45c45 < 82: c7 02 01 00 00 00 movl $0x1,(%rdx) --- > 82: c7 02 00 00 00 00 movl $0x0,(%rdx) # difference of machine code generated by wasmtime # good.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: 41 bb 01 00 00 00 mov $0x1,%r11d 44: c7 01 01 00 00 00 movl $0x1,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: c7 01 00 00 00 00 movl $0x0,(%rcx) 56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx) 5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx) # bad.wasm 29: 44 88 11 mov %r10b,(%rcx) 2c: c7 01 00 00 00 00 movl $0x0,(%rcx) 32: c7 01 00 00 00 00 movl $0x0,(%rcx) 38: c7 01 00 00 00 00 movl $0x0,(%rcx) 3e: c7 01 00 00 00 00 movl $0x0,(%rcx) 44: c7 01 00 00 00 00 movl $0x0,(%rcx) 4a: c7 01 00 00 00 00 movl $0x0,(%rcx) 50: b8 01 00 00 00 mov $0x1,%eax 55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx) 5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm │ xor %r10d,%r10d 0.01 │ mov %r10b,(%rcx) 0.01 │ movl $0x0,(%rcx) 20.82 │ movl $0x0,(%rcx) 0.05 │ movl $0x0,(%rcx) 18.60 │ mov $0x1,%r11d 0.03 │ movl $0x1,(%rcx) 20.42 │ movl $0x0,(%rcx) 22.37 │ movl $0x0,(%rcx) 17.68 │ mov %r11b,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx) # bad.wasm │ xor %r10d,%r10d │ mov %r10b,(%rcx) 24.94 │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) │ movl $0x0,(%rcx) 75.04 │ movl $0x0,(%rcx) 0.01 │ mov $0x1,%eax │ mov %al,0x14c(%rcx) 0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz commented on issue #8648:
@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot replay in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a new generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!
hungryzzz edited a comment on issue #8648:
@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a new generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!
hungryzzz edited a comment on issue #8648:
@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a newer generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!
Last updated: Jan 24 2025 at 00:11 UTC