Stream: git-wasmtime

Topic: wasmtime / issue #8648 Changing an immediate operand from...


view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 16:26):

hungryzzz opened issue #8648:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusion of the machine code generatd by Wasmtime:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82: c7 02 01 00 00 00       movl   $0x1,(%rdx)
---
>   82: c7 02 00 00 00 00       movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11                        mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                        mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
             xor    %r10d,%r10d
  0.01       mov    %r10b,(%rcx)
  0.01       movl   $0x0,(%rcx)
 20.82       movl   $0x0,(%rcx)
  0.05       movl   $0x0,(%rcx)
 18.60       mov    $0x1,%r11d
  0.03       movl   $0x1,(%rcx)
 20.42       movl   $0x0,(%rcx)
 22.37       movl   $0x0,(%rcx)
 17.68       mov    %r11b,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)

# bad.wasm
             xor    %r10d,%r10d
             mov    %r10b,(%rcx)
 24.94       movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
 75.04       movl   $0x0,(%rcx)
  0.01       mov    $0x1,%eax
             mov    %al,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 16:27):

hungryzzz edited issue #8648:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusion of the machine code generatd by Wasmtime:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82: c7 02 01 00 00 00       movl   $0x1,(%rdx)
---
>   82: c7 02 00 00 00 00       movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11             mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11              mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
             xor    %r10d,%r10d
  0.01       mov    %r10b,(%rcx)
  0.01       movl   $0x0,(%rcx)
 20.82       movl   $0x0,(%rcx)
  0.05       movl   $0x0,(%rcx)
 18.60       mov    $0x1,%r11d
  0.03       movl   $0x1,(%rcx)
 20.42       movl   $0x0,(%rcx)
 22.37       movl   $0x0,(%rcx)
 17.68       mov    %r11b,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)

# bad.wasm
             xor    %r10d,%r10d
             mov    %r10b,(%rcx)
 24.94       movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
 75.04       movl   $0x0,(%rcx)
  0.01       mov    $0x1,%eax
             mov    %al,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 16:27):

hungryzzz edited issue #8648:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusion of the machine code generatd by Wasmtime:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82: c7 02 01 00 00 00       movl   $0x1,(%rdx)
---
>   82: c7 02 00 00 00 00       movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
             xor    %r10d,%r10d
  0.01       mov    %r10b,(%rcx)
  0.01       movl   $0x0,(%rcx)
 20.82       movl   $0x0,(%rcx)
  0.05       movl   $0x0,(%rcx)
 18.60       mov    $0x1,%r11d
  0.03       movl   $0x1,(%rcx)
 20.42       movl   $0x0,(%rcx)
 22.37       movl   $0x0,(%rcx)
 17.68       mov    %r11b,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)

# bad.wasm
             xor    %r10d,%r10d
             mov    %r10b,(%rcx)
 24.94       movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
 75.04       movl   $0x0,(%rcx)
  0.01       mov    $0x1,%eax
             mov    %al,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 16:30):

hungryzzz edited issue #8648:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusions to the machine code generatd by Wasmtime:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82: c7 02 01 00 00 00       movl   $0x1,(%rdx)
---
>   82: c7 02 00 00 00 00       movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
             xor    %r10d,%r10d
  0.01       mov    %r10b,(%rcx)
  0.01       movl   $0x0,(%rcx)
 20.82       movl   $0x0,(%rcx)
  0.05       movl   $0x0,(%rcx)
 18.60       mov    $0x1,%r11d
  0.03       movl   $0x1,(%rcx)
 20.42       movl   $0x0,(%rcx)
 22.37       movl   $0x0,(%rcx)
 17.68       mov    %r11b,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)

# bad.wasm
             xor    %r10d,%r10d
             mov    %r10b,(%rcx)
 24.94       movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
 75.04       movl   $0x0,(%rcx)
  0.01       mov    $0x1,%eax
             mov    %al,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 18:00):

cfallin commented on issue #8648:

# good.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in rcx to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.

I am fascinated by this case because the assembly differs in only two ways: the mov $1, reg is shifted downward by three instructions, and the register allocator chose rax rather than r11. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.

(As to why the instruction moves: in the good.wasm case the immediate is shared by two of the instructions, so we only load it once, before its first use. Separately we could probably have another lowering rule for mov [byte immediate], (mem) but that would change both cases here.)

view this post on Zulip Wasmtime GitHub notifications bot (May 17 2024 at 18:02):

cfallin edited a comment on issue #8648:

# good.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

@hungryzzz, would you mind isolating this into a microbenchmark written in assembly and measuring it by itself? (With appropriate values in rcx to make the stores valid and a loop to run a suitable number of iterations, etc...) It would be good to understand if this is really the difference that matters.

I am fascinated by this case because the assembly differs in only two ways: the mov $1, reg is shifted downward by three instructions, and the register allocator chose rax rather than r11. The latter shouldn't matter at all; and the former shouldn't matter on any out-of-order CPU. The alignment of both sequences is the same, too.

(As to why the instruction moves: in the good.wasm case the immediate is shared by two of the instructions [EDIT: or at least would have been if the 32-bit store of constant 1 didn't fold the immediate in], so we only load it once, before its first use. Separately we could probably have another lowering rule for mov [byte immediate], (mem) but that would change both cases here.)

view this post on Zulip Wasmtime GitHub notifications bot (May 29 2024 at 13:06):

hungryzzz closed issue #8648:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusions to the machine code generatd by Wasmtime:
1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?

  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82: c7 02 01 00 00 00       movl   $0x1,(%rdx)
---
>   82: c7 02 00 00 00 00       movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: 41 bb 01 00 00 00       mov    $0x1,%r11d
44: c7 01 01 00 00 00       movl   $0x1,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: c7 01 00 00 00 00       movl   $0x0,(%rcx)
56: 44 88 99 4c 01 00 00    mov    %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

# bad.wasm
29: 44 88 11                mov    %r10b,(%rcx)
2c: c7 01 00 00 00 00       movl   $0x0,(%rcx)
32: c7 01 00 00 00 00       movl   $0x0,(%rcx)
38: c7 01 00 00 00 00       movl   $0x0,(%rcx)
3e: c7 01 00 00 00 00       movl   $0x0,(%rcx)
44: c7 01 00 00 00 00       movl   $0x0,(%rcx)
4a: c7 01 00 00 00 00       movl   $0x0,(%rcx)
50: b8 01 00 00 00          mov    $0x1,%eax
55: 88 81 4c 01 00 00       mov    %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00    mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
             xor    %r10d,%r10d
  0.01       mov    %r10b,(%rcx)
  0.01       movl   $0x0,(%rcx)
 20.82       movl   $0x0,(%rcx)
  0.05       movl   $0x0,(%rcx)
 18.60       mov    $0x1,%r11d
  0.03       movl   $0x1,(%rcx)
 20.42       movl   $0x0,(%rcx)
 22.37       movl   $0x0,(%rcx)
 17.68       mov    %r11b,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)

# bad.wasm
             xor    %r10d,%r10d
             mov    %r10b,(%rcx)
 24.94       movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
             movl   $0x0,(%rcx)
 75.04       movl   $0x0,(%rcx)
  0.01       mov    $0x1,%eax
             mov    %al,0x14c(%rcx)
  0.01       mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

view this post on Zulip Wasmtime GitHub notifications bot (May 29 2024 at 13:06):

hungryzzz commented on issue #8648:

@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot replay in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a new generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!

view this post on Zulip Wasmtime GitHub notifications bot (May 29 2024 at 13:07):

hungryzzz edited a comment on issue #8648:

@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a new generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!

view this post on Zulip Wasmtime GitHub notifications bot (May 29 2024 at 13:08):

hungryzzz edited a comment on issue #8648:

@cfallin Really sorry for my late reply. I found that the above performance decreasing cannot be replayed in another machine so I thought maybe my CPU is too old to run the case well. Therefore I apply a newer generation machine to run experiments. I would close this issue and submit a new one for this case, thank you very much!


Last updated: Dec 23 2024 at 13:07 UTC