hungryzzz opened issue #7495:
Hi, I run the following case in
Wasmtime
andWasmEdge
respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).
Wasmtime
: 4.08sWasmEdge
: 0.66s(module (type (;0;) (func (param i32))) (type (;1;) (func)) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;1;) (type 1) call 2 global.get 0 i32.const 31 i32.and call 0 unreachable) (func (;2;) (type 1) (local i32 i32 i32 i32 i32 i32) loop ;; label = @6 local.get 1 local.get 2 local.get 2 i32.const 1 i32.add local.set 2 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne i32.and br_if 0 (;@6;) end loop ;; label = @6 local.get 2 i32.const 255 i32.and global.get 1 i32.store local.get 2 local.get 0 i32.and local.tee 5 i32.const 255 i32.and i32.load local.set 4 local.get 3 local.get 3 i32.const 1 i32.add local.set 3 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne br_if 0 (;@6;) end ) (table (;0;) 6 6 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 5246560)) (global (;1;) (mut i32) (i32.const 615098409)) (export "memory" (memory 0)) (export "_start" (func 1)) (export "__indirect_function_table" (table 0)))
Try to debug
I dump the corresponding machine code generated by
Wasmtime
andWasmEdge
, I find that, the first loop will be optimized to empty inWasmEdge
butWasmtime
cannot do this.# Machine code generated by Wasmtime 64: 31 c9 xor %ecx,%ecx 66: 45 31 db xor %r11d,%r11d 69: 83 c1 01 add $0x1,%ecx 6c: 45 85 db test %r11d,%r11d 6f: 0f 85 f1 ff ff ff jne 66 <wasm[0]::function[2]+0x16> 75: 31 d2 xor %edx,%edx 77: 48 8b 87 80 00 00 00 mov 0x80(%rdi),%rax 7e: 81 e1 ff 00 00 00 and $0xff,%ecx 84: 44 8b 87 a0 00 00 00 mov 0xa0(%rdi),%r8d 8b: 44 89 04 08 mov %r8d,(%rax,%rcx,1) 8f: 44 8b 00 mov (%rax),%r8d 92: 49 89 d0 mov %rdx,%r8 95: 41 81 e0 ff ff ff 3f and $0x3fffffff,%r8d 9c: 83 c2 01 add $0x1,%edx 9f: 41 81 f8 ff ff ff 3f cmp $0x3fffffff,%r8d a6: 0f 85 d8 ff ff ff jne 84 <wasm[0]::function[2]+0x34> # Machine code generated by WasmEdge 60: 48 8b 07 mov (%rdi),%rax 63: 48 8b 4f 08 mov 0x8(%rdi),%rcx 67: 31 d2 xor %edx,%edx 69: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 70: 48 8b 71 08 mov 0x8(%rcx),%rsi 74: 8b 36 mov (%rsi),%esi 76: 48 8b 38 mov (%rax),%rdi 79: 89 77 01 mov %esi,0x1(%rdi) 7c: 8d 72 01 lea 0x1(%rdx),%esi 7f: 81 e2 ff ff ff 3f and $0x3fffffff,%edx 85: 81 fa ff ff ff 3f cmp $0x3fffffff,%edx 8b: 89 f2 mov %esi,%edx 8d: 75 e1 jne 70 <f2+0x10>
And after removing the first loop in the above
.wat
, the execution time ofWasmtime
will back to normal. So I am wandering that whether there is an related optimization inCranelift
can optimize the first loop likeLLVM
.Versions and Environment
Wasmtime
: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)WasmEdge
: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870- kernel version: Linux 5.15.0-83-generic
hungryzzz edited issue #7495:
Hi, I run the following case in
Wasmtime
andWasmEdge
respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).
Wasmtime
: 4.08sWasmEdge
: 0.66s(module (type (;0;) (func (param i32))) (type (;1;) (func)) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;1;) (type 1) call 2 global.get 0 i32.const 31 i32.and call 0 unreachable) (func (;2;) (type 1) (local i32 i32 i32 i32 i32 i32) ;; First Loop loop ;; label = @6 local.get 1 local.get 2 local.get 2 i32.const 1 i32.add local.set 2 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne i32.and br_if 0 (;@6;) end ;; Second Loop loop ;; label = @6 local.get 2 i32.const 255 i32.and global.get 1 i32.store local.get 2 local.get 0 i32.and local.tee 5 i32.const 255 i32.and i32.load local.set 4 local.get 3 local.get 3 i32.const 1 i32.add local.set 3 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne br_if 0 (;@6;) end ) (table (;0;) 6 6 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 5246560)) (global (;1;) (mut i32) (i32.const 615098409)) (export "memory" (memory 0)) (export "_start" (func 1)) (export "__indirect_function_table" (table 0)))
Try to debug
I dump the corresponding machine code generated by
Wasmtime
andWasmEdge
, I find that there is something strange with the machine code generated byWasmtime
,
1. the first loop in the abovewat
will be compiled to64-6f
(I guess that), but actually the first time thejne
instruction is executed, the loop will be jumped out of, but%ecx
will be added 1.
2. In the second loop,mov %r8d,(%rax,%rcx,1)
is also strange. Because the variables used in these two loops don't have data dependence.# Machine code generated by Wasmtime 64: 31 c9 xor %ecx,%ecx 66: 45 31 db xor %r11d,%r11d 69: 83 c1 01 add $0x1,%ecx 6c: 45 85 db test %r11d,%r11d 6f: 0f 85 f1 ff ff ff jne 66 <wasm[0]::function[2]+0x16> 75: 31 d2 xor %edx,%edx 77: 48 8b 87 80 00 00 00 mov 0x80(%rdi),%rax 7e: 81 e1 ff 00 00 00 and $0xff,%ecx 84: 44 8b 87 a0 00 00 00 mov 0xa0(%rdi),%r8d 8b: 44 89 04 08 mov %r8d,(%rax,%rcx,1) 8f: 44 8b 00 mov (%rax),%r8d 92: 49 89 d0 mov %rdx,%r8 95: 41 81 e0 ff ff ff 3f and $0x3fffffff,%r8d 9c: 83 c2 01 add $0x1,%edx 9f: 41 81 f8 ff ff ff 3f cmp $0x3fffffff,%r8d a6: 0f 85 d8 ff ff ff jne 84 <wasm[0]::function[2]+0x34> # Machine code generated by WasmEdge 60: 48 8b 07 mov (%rdi),%rax 63: 48 8b 4f 08 mov 0x8(%rdi),%rcx 67: 31 d2 xor %edx,%edx 69: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 70: 48 8b 71 08 mov 0x8(%rcx),%rsi 74: 8b 36 mov (%rsi),%esi 76: 48 8b 38 mov (%rax),%rdi 79: 89 77 01 mov %esi,0x1(%rdi) 7c: 8d 72 01 lea 0x1(%rdx),%esi 7f: 81 e2 ff ff ff 3f and $0x3fffffff,%edx 85: 81 fa ff ff ff 3f cmp $0x3fffffff,%edx 8b: 89 f2 mov %esi,%edx 8d: 75 e1 jne 70 <f2+0x10>
I try to remove the first loop in the above
wat
and the genrated codes are a little different,
1.mov %r8d,(%rax,%rcx,1)
in the above version becomesmov %eax,(%rsi)
2.mov (%rax),%r8d
in the above version disappeared.So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.
64: 31 c9 xor %ecx,%ecx 66: 48 8b b7 80 00 00 00 mov 0x80(%rdi),%rsi 6d: 8b 87 a0 00 00 00 mov 0xa0(%rdi),%eax 73: 89 06 mov %eax,(%rsi) 75: 48 89 c8 mov %rcx,%rax 78: 81 e0 ff ff ff 3f and $0x3fffffff,%eax 7e: 83 c1 01 add $0x1,%ecx 81: 81 f8 ff ff ff 3f cmp $0x3fffffff,%eax 87: 0f 85 e0 ff ff ff jne 6d <wasm[0]::function[2]+0x1d>
Versions and Environment
Wasmtime
: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)WasmEdge
: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870- kernel version: Linux 5.15.0-83-generic
hungryzzz edited issue #7495:
Hi, I run the following case in
Wasmtime
andWasmEdge
respectively and I find that the execution time of Wasmtime is almost 6x slower than which of WasmEdge(measured by time tool).
Wasmtime
: 4.08sWasmEdge
: 0.66s(module (type (;0;) (func (param i32))) (type (;1;) (func)) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;1;) (type 1) call 2 global.get 0 i32.const 31 i32.and call 0 unreachable) (func (;2;) (type 1) (local i32 i32 i32 i32 i32 i32) ;; First Loop loop ;; label = @6 local.get 1 local.get 2 local.get 2 i32.const 1 i32.add local.set 2 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne i32.and br_if 0 (;@6;) end ;; Second Loop loop ;; label = @6 local.get 2 i32.const 255 i32.and global.get 1 i32.store local.get 2 local.get 0 i32.and local.tee 5 i32.const 255 i32.and i32.load local.set 4 local.get 3 local.get 3 i32.const 1 i32.add local.set 3 i32.const 0x3FFFFFFF i32.and i32.const 0x3FFFFFFF i32.ne br_if 0 (;@6;) end ) (table (;0;) 6 6 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 5246560)) (global (;1;) (mut i32) (i32.const 615098409)) (export "memory" (memory 0)) (export "_start" (func 1)) (export "__indirect_function_table" (table 0)))
Try to debug
I dump the corresponding machine code generated by
Wasmtime
andWasmEdge
, I find that there is something strange with the machine code generated byWasmtime
,
1. the first loop in the abovewat
will be compiled to64-6f
(I guess that), but actually the first time thejne
instruction is executed, the loop will be jumped out of, but%ecx
will be added 1.
2. In the second loop,mov %r8d,(%rax,%rcx,1)
is also strange. Because the variables used in these two loops don't have data dependence.# Machine code generated by Wasmtime 64: 31 c9 xor %ecx,%ecx 66: 45 31 db xor %r11d,%r11d 69: 83 c1 01 add $0x1,%ecx 6c: 45 85 db test %r11d,%r11d 6f: 0f 85 f1 ff ff ff jne 66 <wasm[0]::function[2]+0x16> 75: 31 d2 xor %edx,%edx 77: 48 8b 87 80 00 00 00 mov 0x80(%rdi),%rax 7e: 81 e1 ff 00 00 00 and $0xff,%ecx 84: 44 8b 87 a0 00 00 00 mov 0xa0(%rdi),%r8d 8b: 44 89 04 08 mov %r8d,(%rax,%rcx,1) 8f: 44 8b 00 mov (%rax),%r8d 92: 49 89 d0 mov %rdx,%r8 95: 41 81 e0 ff ff ff 3f and $0x3fffffff,%r8d 9c: 83 c2 01 add $0x1,%edx 9f: 41 81 f8 ff ff ff 3f cmp $0x3fffffff,%r8d a6: 0f 85 d8 ff ff ff jne 84 <wasm[0]::function[2]+0x34> # Machine code generated by WasmEdge 60: 48 8b 07 mov (%rdi),%rax 63: 48 8b 4f 08 mov 0x8(%rdi),%rcx 67: 31 d2 xor %edx,%edx 69: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 70: 48 8b 71 08 mov 0x8(%rcx),%rsi 74: 8b 36 mov (%rsi),%esi 76: 48 8b 38 mov (%rax),%rdi 79: 89 77 01 mov %esi,0x1(%rdi) 7c: 8d 72 01 lea 0x1(%rdx),%esi 7f: 81 e2 ff ff ff 3f and $0x3fffffff,%edx 85: 81 fa ff ff ff 3f cmp $0x3fffffff,%edx 8b: 89 f2 mov %esi,%edx 8d: 75 e1 jne 70 <f2+0x10>
I try to remove the first loop in the above
wat
and the genrated codes are a little different,
1.mov %r8d,(%rax,%rcx,1)
in the above version becomesmov %eax,(%rsi)
2.mov (%rax),%r8d
in the above version disappeared.So I think Cranelift generates wrong machine code which will result in wrong behaviors and further causes poor performance.
64: 31 c9 xor %ecx,%ecx 66: 48 8b b7 80 00 00 00 mov 0x80(%rdi),%rsi 6d: 8b 87 a0 00 00 00 mov 0xa0(%rdi),%eax 73: 89 06 mov %eax,(%rsi) 75: 48 89 c8 mov %rcx,%rax 78: 81 e0 ff ff ff 3f and $0x3fffffff,%eax 7e: 83 c1 01 add $0x1,%ecx 81: 81 f8 ff ff ff 3f cmp $0x3fffffff,%eax 87: 0f 85 e0 ff ff ff jne 6d <wasm[0]::function[2]+0x1d>
Versions and Environment
Wasmtime
: https://github.com/bytecodealliance/wasmtime/commit/e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)WasmEdge
: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870- kernel version: Linux 5.15.0-83-generic
alexcrichton commented on issue #7495:
This is the output of VTune for me:
<img width="1193" alt="Screenshot 2023-11-08 at 3 04 06 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c9e0f911-5752-4334-b47b-755e4234537d">
which seems to point the smoking gun at "Loads Blocked by Store Forwarding"
For instructions I got:
<img width="1273" alt="Screenshot 2023-11-08 at 3 04 54 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/c60a113e-38c3-423a-925d-82d9e003d7a5">
Cranelift/Wasmtime emit a load-store-load whereas LLVM/WasmEdge seem to emit a load-load-store which probably doesn't break load-store forwarding would be my guess.
I have not dug further yet at this time. I don't think Cranelift is generating wrong code here, and I have not dug into why removing the loop affects things.
Last updated: Nov 22 2024 at 16:03 UTC