Stream: git-wasmtime

Topic: wasmtime / issue #7492 Bad performance in Wasmtime


view this post on Zulip Wasmtime GitHub notifications bot (Nov 07 2023 at 12:45):

hungryzzz opened issue #7492:

Hi, I run the attached case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is 4x slower than which of WasmEdge(measured by time tool).

I try to use the perf tool to locate the time consuming part of code, the report is as followed.

99.66%  wasmtime  jitted-2354298-2.so  [.] wasm[0]::function[3]                                                                                                
0.03%  wasmtime  jitted-2354298-1.so  [.] wasm[0]::function[2]                                                                                                
0.03%  wasmtime  [kernel.kallsyms]    [k] update_min_vruntime                                                                                                 
0.02%  wasmtime  [kernel.kallsyms]    [k] zap_pte_range.isra.0                                                                                                
0.02%  wasmtime  libc-2.31.so         [.] __memset_avx2_erms                                                                                                  
0.02%  wasmtime  [kernel.kallsyms]    [k] __vma_adjust                                                                                                        
0.02%  wasmtime  [kernel.kallsyms]    [k] error_entry                                                                                                         
0.02%  wasmtime  [kernel.kallsyms]    [k] filemap_map_pages                                                                                                   
0.01%  wasmtime  [kernel.kallsyms]    [k] page_add_file_rmap                                                                                                  
0.01%  wasmtime  wasmtime             [.] wasmtime_cranelift_shared::compiled_function::collect_address_maps::cvt                                             
0.01%  wasmtime  [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                                                                                      
0.01%  wasmtime  [kernel.kallsyms]    [k] mem_cgroup_from_task                                                                                                
0.01%  wasmtime  [kernel.kallsyms]    [k] charge_memcg                                                                                                        
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_alloc                                                                                                    
0.01%  wasmtime  [kernel.kallsyms]    [k] ___slab_alloc                                                                                                       
0.01%  wasmtime  [kernel.kallsyms]    [k] update_cfs_rq_h_load                                                                                                
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_free                                                                                                     
0.01%  wasmtime  libc-2.31.so         [.] realloc

<img width="1233" alt="截屏2023-11-07 20 26 59" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/27299d97-fe40-48a0-91c0-fcbefbac746a">

Also, I try to dump the machine code generated by WasmEdge(LLVM) and I get a shorter machine code than Wasmtime.

# Machine code generated by LLVM, only  36 instructions compared with 127 instructions generated by Wasmtime.
0000000000000140 <f3>:
 140:   40 f6 c6 20             test   $0x20,%sil
 144:   75 2b                   jne    171 <f3+0x31>
 146:   4c 8b 07                mov    (%rdi),%r8
 149:   40 0f b6 c6             movzbl %sil,%eax
 14d:   4d 8b 00                mov    (%r8),%r8
 150:   42 8b 74 00 10          mov    0x10(%rax,%r8,1),%esi
 155:   42 2b 74 00 14          sub    0x14(%rax,%r8,1),%esi
 15a:   39 ce                   cmp    %ecx,%esi
 15c:   73 05                   jae    163 <f3+0x23>
 15e:   e9 00 00 00 00          jmpq   163 <f3+0x23>
 163:   66 83 fa ff             cmp    $0xffff,%dx
 167:   75 09                   jne    172 <f3+0x32>
 169:   42 83 7c 00 50 00       cmpl   $0x0,0x50(%rax,%r8,1)
 16f:   74 01                   je     172 <f3+0x32>
 171:   c3                      retq
 172:   41 89 d1                mov    %edx,%r9d
 175:   89 c8                   mov    %ecx,%eax
 177:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 17e:   00 00
 180:   85 c0                   test   %eax,%eax
 182:   75 10                   jne    194 <f3+0x54>
 184:   89 ce                   mov    %ecx,%esi
 186:   81 e6 ff 7f 00 00       and    $0x7fff,%esi
 18c:   81 fe ff 7f 00 00       cmp    $0x7fff,%esi
 192:   74 dd                   je     171 <f3+0x31>
 194:   66 41 83 f9 ff          cmp    $0xffff,%r9w
 199:   74 15                   je     1b0 <f3+0x70>
 19b:   ff c1                   inc    %ecx
 19d:   8d 34 02                lea    (%rdx,%rax,1),%esi
 1a0:   ff c8                   dec    %eax
 1a2:   40 0f b6 f6             movzbl %sil,%esi
 1a6:   41 ff c1                inc    %r9d
 1a9:   41 80 3c 30 0a          cmpb   $0xa,(%r8,%rsi,1)
 1ae:   75 d0                   jne    180 <f3+0x40>
 1b0:   ff c2                   inc    %edx
 1b2:   e9 00 00 00 00          jmpq   1b7 <f3+0x77>

I think maybe there are some missing optimizations in Cranelift so Wasmtime generates a suboptimal machine code and executes longer than WasmEdge.

Versions and Environment

case.zip

view this post on Zulip Wasmtime GitHub notifications bot (Nov 07 2023 at 13:22):

hungryzzz edited issue #7492:

Hi, I run the attached case in Wasmtime and WasmEdge respectively and I find that the execution time of Wasmtime is 4x slower than which of WasmEdge(measured by time tool).

I try to use the perf tool to locate the time consuming part of code, the report is as followed.

99.66%  wasmtime  jitted-2354298-2.so  [.] wasm[0]::function[3]                                                                                                
0.03%  wasmtime  jitted-2354298-1.so  [.] wasm[0]::function[2]                                                                                                
0.03%  wasmtime  [kernel.kallsyms]    [k] update_min_vruntime                                                                                                 
0.02%  wasmtime  [kernel.kallsyms]    [k] zap_pte_range.isra.0                                                                                                
0.02%  wasmtime  libc-2.31.so         [.] __memset_avx2_erms                                                                                                  
0.02%  wasmtime  [kernel.kallsyms]    [k] __vma_adjust                                                                                                        
0.02%  wasmtime  [kernel.kallsyms]    [k] error_entry                                                                                                         
0.02%  wasmtime  [kernel.kallsyms]    [k] filemap_map_pages                                                                                                   
0.01%  wasmtime  [kernel.kallsyms]    [k] page_add_file_rmap                                                                                                  
0.01%  wasmtime  wasmtime             [.] wasmtime_cranelift_shared::compiled_function::collect_address_maps::cvt                                             
0.01%  wasmtime  [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                                                                                      
0.01%  wasmtime  [kernel.kallsyms]    [k] mem_cgroup_from_task                                                                                                
0.01%  wasmtime  [kernel.kallsyms]    [k] charge_memcg                                                                                                        
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_alloc                                                                                                    
0.01%  wasmtime  [kernel.kallsyms]    [k] ___slab_alloc                                                                                                       
0.01%  wasmtime  [kernel.kallsyms]    [k] update_cfs_rq_h_load                                                                                                
0.01%  wasmtime  [kernel.kallsyms]    [k] kmem_cache_free                                                                                                     
0.01%  wasmtime  libc-2.31.so         [.] realloc

<img width="1233" alt="截屏2023-11-07 20 26 59" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/27299d97-fe40-48a0-91c0-fcbefbac746a">

Also, I try to dump the machine code generated by WasmEdge(LLVM) and I get a shorter machine code than Wasmtime.

# Machine code generated by LLVM, only  36 instructions compared with 127 instructions generated by Wasmtime.
0000000000000140 <f3>:
 140:   40 f6 c6 20             test   $0x20,%sil
 144:   75 2b                   jne    171 <f3+0x31>
 146:   4c 8b 07                mov    (%rdi),%r8
 149:   40 0f b6 c6             movzbl %sil,%eax
 14d:   4d 8b 00                mov    (%r8),%r8
 150:   42 8b 74 00 10          mov    0x10(%rax,%r8,1),%esi
 155:   42 2b 74 00 14          sub    0x14(%rax,%r8,1),%esi
 15a:   39 ce                   cmp    %ecx,%esi
 15c:   73 05                   jae    163 <f3+0x23>
 15e:   e9 00 00 00 00          jmpq   163 <f3+0x23>
 163:   66 83 fa ff             cmp    $0xffff,%dx
 167:   75 09                   jne    172 <f3+0x32>
 169:   42 83 7c 00 50 00       cmpl   $0x0,0x50(%rax,%r8,1)
 16f:   74 01                   je     172 <f3+0x32>
 171:   c3                      retq
 172:   41 89 d1                mov    %edx,%r9d
 175:   89 c8                   mov    %ecx,%eax
 177:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 17e:   00 00
 180:   85 c0                   test   %eax,%eax
 182:   75 10                   jne    194 <f3+0x54>
 184:   89 ce                   mov    %ecx,%esi
 186:   81 e6 ff 7f 00 00       and    $0x7fff,%esi
 18c:   81 fe ff 7f 00 00       cmp    $0x7fff,%esi
 192:   74 dd                   je     171 <f3+0x31>
 194:   66 41 83 f9 ff          cmp    $0xffff,%r9w
 199:   74 15                   je     1b0 <f3+0x70>
 19b:   ff c1                   inc    %ecx
 19d:   8d 34 02                lea    (%rdx,%rax,1),%esi
 1a0:   ff c8                   dec    %eax
 1a2:   40 0f b6 f6             movzbl %sil,%esi
 1a6:   41 ff c1                inc    %r9d
 1a9:   41 80 3c 30 0a          cmpb   $0xa,(%r8,%rsi,1)
 1ae:   75 d0                   jne    180 <f3+0x40>
 1b0:   ff c2                   inc    %edx
 1b2:   e9 00 00 00 00          jmpq   1b7 <f3+0x77>

I think maybe there are some missing optimizations in Cranelift so Wasmtime generates a suboptimal machine code and gets longer execution time than WasmEdge.

Versions and Environment

case.zip

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:09):

alexcrichton commented on issue #7492:

Thanks for this and https://github.com/bytecodealliance/wasmtime/issues/7495! Would you be able to provide the original source code as well? That'd likely help in investigating this.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:27):

hungryzzz commented on issue #7492:

@alexcrichton Hi, does the original source code refer to C code? If yes, I don't have the original C code file either unfortunately.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:28):

alexcrichton commented on issue #7492:

Ah I was curious about whatever source it came from, not necessarily C, but if you don't have the source then no worries.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:31):

hungryzzz commented on issue #7492:

The earilest original source is from C. After compiling it to wasm, we mutate it and get the above case.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:37):

hungryzzz edited a comment on issue #7492:

The earilest original source is from C. After compiling it to wasm, we mutate it and get the above case, so it will be quite different with the original source code. Actually, I have reduced the original wasm file, so the size of the buggy case I submmit is quite small(around 120 wasm instructions, only 3 functions) compared to the original one. I hope this will help your further investigating.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 15:56):

fitzgen commented on issue #7492:

Echoing Alex: thanks very much for reporting these perf bugs!

@hungryzzz, out of curiosity, are you fuzzing performance across wasm runtimes? Do you have any additional details you can share about what you're doing and what your goals are?

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 16:21):

hungryzzz commented on issue #7492:

@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.

Our ideal pipeline of the fuzzing is, 1) generate lots of wasm test case; 2) differential testing across wasm runtimes and find the suspected performance bugs; 3) reduce buggy cases to a small size one to help further debuging(because the wasm bytecode generated from emcc is usually very very long).

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 16:26):

hungryzzz deleted a comment on issue #7492:

@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.

Our ideal pipeline of the fuzzing is, 1) generate lots of wasm test case; 2) differential testing across wasm runtimes and find the suspected performance bugs; 3) reduce buggy cases to a small size one to help further debuging(because the wasm bytecode generated from emcc is usually very very long).

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 16:27):

hungryzzz commented on issue #7492:

@fitzgen Yes, I am fuzzing performance using differential testing across wasm runtime. The current exection mode of the most wasm runtimes is AOT(compile the whole module to machine code and execute it), so I think the execution time of them should be close, if not, there maybe something wrong(misoptimization etc.) with the code generation of runtime.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 08 2023 at 21:07):

alexcrichton commented on issue #7492:

VTune outputs this:

<img width="1111" alt="Screenshot 2023-11-08 at 3 06 41 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/88e31d0b-9c53-462f-abe1-b94bd5030388">

which seems to point a smoking-ish gun at DSB Coverage/Misses. I just learned about this today and still know very little about it. For individual instructions this is what I'm seeing:

<img width="1856" alt="Screenshot 2023-11-08 at 3 07 27 PM" src="https://github.com/bytecodealliance/wasmtime/assets/64996/a97b0607-512a-4540-8d63-e6002c496b1e">

I have not yet dug further at this time.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 09 2023 at 19:24):

alexcrichton commented on issue #7492:

Sort of where I'm leaning for this and https://github.com/bytecodealliance/wasmtime/issues/7495 is that this may not be too too actionable without the original source.

In general Cranelift is not expected to be able to beat LLVM, which I believe WasmEdge is using. Cranelift is intended to be competitive with "peer JITs" such as v8 and SpiderMonkey but it won't ever be able to reach parity with LLVM on all inputs. So in that sense it's not always a "fair" comparison to compare LLVM and Cranelift.

Now that being said Wasmtime and Cranelift as a whole still want to be fast. Much of the speed of Wasmtime/Cranelift comes from pre-optimized WebAssembly modules. For example Rust/C programs originally go through LLVM's optimizer which takes care of advanced optimizations that Cranelift may not do. Afterwards Cranelift is intended to generate high-quality machine code with a high-quality instruction selector as well as optimizations around Wasmtime's injection of runtime management code (e.g. egraphs plus table bounds checks). So despite LLVM-vs-Cranelift not being "fair" it's something we're still interested in (hence thanks again for these issues!).

It sounds, though, like the pipeline to produce these wasm files was to take an origin C program and mutate it. This starts moving into the realm of the original program no longer being optimized by LLVM, meaning that the onus is on Cranelift to perform more advanced optimizations that LLVM implements. This is unlikely to happen, however, given Cranelift's use case.

So given all this that's what leads me to the conclusion that this may not be actionable. The bugs, from what I can tell, here and on https://github.com/bytecodealliance/wasmtime/issues/7495 rely on optimizations that LLVM does but Cranelift doesn't do. I am guessing, however, that if the original source were C or Rust that when optimized and then run through both wasm runtimes the performance would be similar because LLVM's optimizations would kick in when producing the wasm. So overall my current conclusion is that Cranelift is missing optimizations, but it's intentional.

Now a lot of what I'm saying above is merely conjecture. I haven't bottomed out the exact performance differences and what's going on where. It may very well be a small tweak to Cranelift fixes performance of these cases. That being said I'm personally having a difficult time analyzing this without a comparison of the original source code. So I'd ask again, I know that this is reduced and mutated from an original source program, but is it possible to get the source program and the list of mutations?

view this post on Zulip Wasmtime GitHub notifications bot (Nov 10 2023 at 05:23):

hungryzzz commented on issue #7492:

@alexcrichton Hi, the codes are generated by randomizing wasm bytecodes. Thank you for your efforts. I will get back to you later once I have new clues.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 13 2023 at 16:54):

alexcrichton commented on issue #7492:

Ok if the original source + mutations aren't available then no worries.

I'll again restate though that what I'm saying above is currently a hypothesis and isn't something I've been able to definitively prove. In the abstract though LLVM is expected to beat Cranelift when fed arbitrary code and there's only so much that we can do about that. Cranelift should be more competitive when fed LLVM-optimized wasm code, however.


Last updated: Nov 22 2024 at 17:03 UTC