hungryzzz opened issue #7492:
Hi, I run the attached case in
Wasmtime
andWasmEdge
respectively and I find that the execution time ofWasmtime
is 4x slower than which ofWasmEdge
(measured bytime
tool).
Wasmtime
: 4.20sWasmEdge
: 1.05sI try to use the
perf
tool to locate the time consuming part of code, the report is as followed.99.66% wasmtime jitted-2354298-2.so [.] wasm[0]::function[3] ▒ 0.03% wasmtime jitted-2354298-1.so [.] wasm[0]::function[2] ▒ 0.03% wasmtime [kernel.kallsyms] [k] update_min_vruntime ▒ 0.02% wasmtime [kernel.kallsyms] [k] zap_pte_range.isra.0 ▒ 0.02% wasmtime libc-2.31.so [.] __memset_avx2_erms ▒ 0.02% wasmtime [kernel.kallsyms] [k] __vma_adjust ◆ 0.02% wasmtime [kernel.kallsyms] [k] error_entry ▒ 0.02% wasmtime [kernel.kallsyms] [k] filemap_map_pages ▒ 0.01% wasmtime [kernel.kallsyms] [k] page_add_file_rmap ▒ 0.01% wasmtime wasmtime [.] wasmtime_cranelift_shared::compiled_function::collect_address_maps::cvt ▒ 0.01% wasmtime [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒ 0.01% wasmtime [kernel.kallsyms] [k] mem_cgroup_from_task ▒ 0.01% wasmtime [kernel.kallsyms] [k] charge_memcg ▒ 0.01% wasmtime [kernel.kallsyms] [k] kmem_cache_alloc ▒ 0.01% wasmtime [kernel.kallsyms] [k] ___slab_alloc ▒ 0.01% wasmtime [kernel.kallsyms] [k] update_cfs_rq_h_load ▒ 0.01% wasmtime [kernel.kallsyms] [k] kmem_cache_free ▒ 0.01% wasmtime libc-2.31.so [.] realloc
<img width="1233" alt="截屏2023-11-07 20 26 59" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/27299d97-fe40-48a0-91c0-fcbefbac746a">
Also, I try to dump the machine code generated by
WasmEdge
(LLVM
) and I get a shorter machine code thanWasmtime
.# Machine code generated by LLVM, only 36 instructions compared with 127 instructions generated by Wasmtime. 0000000000000140 <f3>: 140: 40 f6 c6 20 test $0x20,%sil 144: 75 2b jne 171 <f3+0x31> 146: 4c 8b 07 mov (%rdi),%r8 149: 40 0f b6 c6 movzbl %sil,%eax 14d: 4d 8b 00 mov (%r8),%r8 150: 42 8b 74 00 10 mov 0x10(%rax,%r8,1),%esi 155: 42 2b 74 00 14 sub 0x14(%rax,%r8,1),%esi 15a: 39 ce cmp %ecx,%esi 15c: 73 05 jae 163 <f3+0x23> 15e: e9 00 00 00 00 jmpq 163 <f3+0x23> 163: 66 83 fa ff cmp $0xffff,%dx 167: 75 09 jne 172 <f3+0x32> 169: 42 83 7c 00 50 00 cmpl $0x0,0x50(%rax,%r8,1) 16f: 74 01 je 172 <f3+0x32> 171: c3 retq 172: 41 89 d1 mov %edx,%r9d 175: 89 c8 mov %ecx,%eax 177: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 17e: 00 00 180: 85 c0 test %eax,%eax 182: 75 10 jne 194 <f3+0x54> 184: 89 ce mov %ecx,%esi 186: 81 e6 ff 7f 00 00 and $0x7fff,%esi 18c: 81 fe ff 7f 00 00 cmp $0x7fff,%esi 192: 74 dd je 171 <f3+0x31> 194: 66 41 83 f9 ff cmp $0xffff,%r9w 199: 74 15 je 1b0 <f3+0x70> 19b: ff c1 inc %ecx 19d: 8d 34 02 lea (%rdx,%rax,1),%esi 1a0: ff c8 dec %eax 1a2: 40 0f b6 f6 movzbl %sil,%esi 1a6: 41 ff c1 inc %r9d 1a9: 41 80 3c 30 0a cmpb $0xa,(%r8,%rsi,1) 1ae: 75 d0 jne 180 <f3+0x40> 1b0: ff c2 inc %edx 1b2: e9 00 00 00 00 jmpq 1b7 <f3+0x77>
I think maybe there are some missing optimizations in
Cranelift
soWasmtime
generates a suboptimal machine code and executes longer thanWasmEdge
.Versions and Environment
Wasmtime
: e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)WasmEdge
: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870- kernel version: Linux 5.15.0-83-generic
hungryzzz edited issue #7492:
Hi, I run the attached case in
Wasmtime
andWasmEdge
respectively and I find that the execution time ofWasmtime
is 4x slower than which ofWasmEdge
(measured bytime
tool).
Wasmtime
: 4.20sWasmEdge
: 1.05sI try to use the
perf
tool to locate the time consuming part of code, the report is as followed.99.66% wasmtime jitted-2354298-2.so [.] wasm[0]::function[3] ▒ 0.03% wasmtime jitted-2354298-1.so [.] wasm[0]::function[2] ▒ 0.03% wasmtime [kernel.kallsyms] [k] update_min_vruntime ▒ 0.02% wasmtime [kernel.kallsyms] [k] zap_pte_range.isra.0 ▒ 0.02% wasmtime libc-2.31.so [.] __memset_avx2_erms ▒ 0.02% wasmtime [kernel.kallsyms] [k] __vma_adjust ◆ 0.02% wasmtime [kernel.kallsyms] [k] error_entry ▒ 0.02% wasmtime [kernel.kallsyms] [k] filemap_map_pages ▒ 0.01% wasmtime [kernel.kallsyms] [k] page_add_file_rmap ▒ 0.01% wasmtime wasmtime [.] wasmtime_cranelift_shared::compiled_function::collect_address_maps::cvt ▒ 0.01% wasmtime [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒ 0.01% wasmtime [kernel.kallsyms] [k] mem_cgroup_from_task ▒ 0.01% wasmtime [kernel.kallsyms] [k] charge_memcg ▒ 0.01% wasmtime [kernel.kallsyms] [k] kmem_cache_alloc ▒ 0.01% wasmtime [kernel.kallsyms] [k] ___slab_alloc ▒ 0.01% wasmtime [kernel.kallsyms] [k] update_cfs_rq_h_load ▒ 0.01% wasmtime [kernel.kallsyms] [k] kmem_cache_free ▒ 0.01% wasmtime libc-2.31.so [.] realloc
<img width="1233" alt="截屏2023-11-07 20 26 59" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/27299d97-fe40-48a0-91c0-fcbefbac746a">
Also, I try to dump the machine code generated by
WasmEdge
(LLVM
) and I get a shorter machine code thanWasmtime
.# Machine code generated by LLVM, only 36 instructions compared with 127 instructions generated by Wasmtime. 0000000000000140 <f3>: 140: 40 f6 c6 20 test $0x20,%sil 144: 75 2b jne 171 <f3+0x31> 146: 4c 8b 07 mov (%rdi),%r8 149: 40 0f b6 c6 movzbl %sil,%eax 14d: 4d 8b 00 mov (%r8),%r8 150: 42 8b 74 00 10 mov 0x10(%rax,%r8,1),%esi 155: 42 2b 74 00 14 sub 0x14(%rax,%r8,1),%esi 15a: 39 ce cmp %ecx,%esi 15c: 73 05 jae 163 <f3+0x23> 15e: e9 00 00 00 00 jmpq 163 <f3+0x23> 163: 66 83 fa ff cmp $0xffff,%dx 167: 75 09 jne 172 <f3+0x32> 169: 42 83 7c 00 50 00 cmpl $0x0,0x50(%rax,%r8,1) 16f: 74 01 je 172 <f3+0x32> 171: c3 retq 172: 41 89 d1 mov %edx,%r9d 175: 89 c8 mov %ecx,%eax 177: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 17e: 00 00 180: 85 c0 test %eax,%eax 182: 75 10 jne 194 <f3+0x54> 184: 89 ce mov %ecx,%esi 186: 81 e6 ff 7f 00 00 and $0x7fff,%esi 18c: 81 fe ff 7f 00 00 cmp $0x7fff,%esi 192: 74 dd je 171 <f3+0x31> 194: 66 41 83 f9 ff cmp $0xffff,%r9w 199: 74 15 je 1b0 <f3+0x70> 19b: ff c1 inc %ecx 19d: 8d 34 02 lea (%rdx,%rax,1),%esi 1a0: ff c8 dec %eax 1a2: 40 0f b6 f6 movzbl %sil,%esi 1a6: 41 ff c1 inc %r9d 1a9: 41 80 3c 30 0a cmpb $0xa,(%r8,%rsi,1) 1ae: 75 d0 jne 180 <f3+0x40> 1b0: ff c2 inc %edx 1b2: e9 00 00 00 00 jmpq 1b7 <f3+0x77>
I think maybe there are some missing optimizations in
Cranelift
soWasmtime
generates a suboptimal machine code and gets longer execution time thanWasmEdge
.Versions and Environment
Wasmtime
: e0bfa7336de20f76048edbdc0157ee637a2c5fea (build release version)WasmEdge
: 7d4213b3abd5360c88e9120c3bc68ffd9b8a6870- kernel version: Linux 5.15.0-83-generic
alexcrichton commented on issue #7492:
Thanks for this and https://github.com/bytecodealliance/wasmtime/issues/7495! Would you be able to provide the original source code as well? That'd likely help in investigating this.
hungryzzz commented on issue #7492:
@alexcrichton Hi, does the original source code refer to
C
code? If yes, I don't have the original C code file either unfortunately.
alexcrichton commented on issue #7492:
Ah I was curious about whatever source it came from, not necessarily C, but if you don't have the source then no worries.
hungryzzz commented on issue #7492:
The earilest original source is from C. After compiling it to
wasm
, we mutate it and get the above case.
hungryzzz edited a comment on issue #7492:
The earilest original source is from C. After compiling it to
wasm
, we mutate it and get the above case, so it will be quite different with the original source code. Actually, I have reduced the originalwasm
file, so the size of the buggy case I submmit is quite small(around 120 wasm instructions, only 3 functions) compared to the original one. I hope this will help your further investigating.
fitzgen commented on issue #7492:
Echoing Alex: thanks very much for reporting these perf bugs!
@hungryzzz, out of curiosity, are you fuzzing performance across wasm runtimes? Do you have any additional details you can share about what you're doing and what your goals are?
hungryzzz commented on issue #7492:
@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.
Our ideal pipeline of the fuzzing is, 1) generate lots of wasm test case; 2) differential testing across wasm runtimes and find the suspected performance bugs; 3) reduce buggy cases to a small size one to help further debuging(because the wasm bytecode generated from
emcc
is usually very very long).
hungryzzz deleted a comment on issue #7492:
@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.
Our ideal pipeline of the fuzzing is, 1) generate lots of wasm test case; 2) differential testing across wasm runtimes and find the suspected performance bugs; 3) reduce buggy cases to a small size one to help further debuging(because the wasm bytecode generated from
emcc
is usually very very long).
hungryzzz commented on issue #7492:
@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.
alexcrichton commented on issue #7492:
VTune outputs this:
<img width="1111" alt="Screenshot 2023-11-08 at 3 06 41 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/88e31d0b-9c53-462f-abe1-b94bd5030388">
which seems to point a smoking-ish gun at DSB Coverage/Misses. I just learned about this today and still know very little about it. For individual instructions this is what I'm seeing:
<img width="1856" alt="Screenshot 2023-11-08 at 3 07 27 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/a97b0607-512a-4540-8d63-e6002c496b1e">
I have not yet dug further at this time.
alexcrichton commented on issue #7492:
Sort of where I'm leaning for this and https://github.com/bytecodealliance/wasmtime/issues/7495 is that this may not be too too actionable without the original source.
In general Cranelift is not expected to be able to beat LLVM, which I believe WasmEdge is using. Cranelift is intended to be competitive with "peer JITs" such as v8 and SpiderMonkey but it won't ever be able to reach parity with LLVM on all inputs. So in that sense it's not always a "fair" comparison to compare LLVM and Cranelift.
Now that being said Wasmtime and Cranelift as a whole still want to be fast. Much of the speed of Wasmtime/Cranelift comes from pre-optimized WebAssembly modules. For example Rust/C programs originally go through LLVM's optimizer which takes care of advanced optimizations that Cranelift may not do. Afterwards Cranelift is intended to generate high-quality machine code with a high-quality instruction selector as well as optimizations around Wasmtime's injection of runtime management code (e.g. egraphs plus table bounds checks). So despite LLVM-vs-Cranelift not being "fair" it's something we're still interested in (hence thanks again for these issues!).
It sounds, though, like the pipeline to produce these wasm files was to take an origin C program and mutate it. This starts moving into the realm of the original program no longer being optimized by LLVM, meaning that the onus is on Cranelift to perform more advanced optimizations that LLVM implements. This is unlikely to happen, however, given Cranelift's use case.
So given all this that's what leads me to the conclusion that this may not be actionable. The bugs, from what I can tell, here and on https://github.com/bytecodealliance/wasmtime/issues/7495 rely on optimizations that LLVM does but Cranelift doesn't do. I am guessing, however, that if the original source were C or Rust that when optimized and then run through both wasm runtimes the performance would be similar because LLVM's optimizations would kick in when producing the wasm. So overall my current conclusion is that Cranelift is missing optimizations, but it's intentional.
Now a lot of what I'm saying above is merely conjecture. I haven't bottomed out the exact performance differences and what's going on where. It may very well be a small tweak to Cranelift fixes performance of these cases. That being said I'm personally having a difficult time analyzing this without a comparison of the original source code. So I'd ask again, I know that this is reduced and mutated from an original source program, but is it possible to get the source program and the list of mutations?
hungryzzz commented on issue #7492:
@alexcrichton Hi, the codes are generated by randomizing wasm bytecodes. Thank you for your efforts. I will get back to you later once I have new clues.
alexcrichton commented on issue #7492:
Ok if the original source + mutations aren't available then no worries.
I'll again restate though that what I'm saying above is currently a hypothesis and isn't something I've been able to definitively prove. In the abstract though LLVM is expected to beat Cranelift when fed arbitrary code and there's only so much that we can do about that. Cranelift should be more competitive when fed LLVM-optimized wasm code, however.
Last updated: Nov 22 2024 at 17:03 UTC