Stream: git-wasmtime

Topic: wasmtime / issue #8573 Small difference makes suspicion p...


view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:00):

hungryzzz opened issue #8573:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime, WasmEdge(AOT) and Wamr(AOT), and collect their execution time respectively (measured by time tool).

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the address of i32.load from local.get 0 to i32.const 0, the case with i32.const 0 has the bad performance in Wasmtime, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, the i32.load is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.

![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)

Versions and Environment

Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831

Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic

Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:01):

hungryzzz edited issue #8573:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime, WasmEdge(AOT) and Wamr(AOT), and collect their execution time respectively (measured by time tool).

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the address of i32.load from local.get 0 to i32.const 0, the case with i32.const 0 has the bad performance in Wasmtime, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, the i32.load is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.

![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)

Versions and Environment

Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831

Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic

Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:10):

cfallin commented on issue #8573:

cc @fitzgen , perhaps this is something to do with our bounds-checking cases?

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:15):

fitzgen commented on issue #8573:

@hungryzzz can you share how you are configuring and executing the Wasm in Wasmtime? I.e. the exact flags you are passing to the CLI or the exact Rust code that you are using when embedding Wasmtime if you aren't using the CLI.

cc @fitzgen , perhaps this is something to do with our bounds-checking cases?

With the default configuration of Wasmtime we should be eliding basically all bounds checks, I'd be surprised if this was bounds-check related unless guard pages are being disabled or something.

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:22):

hungryzzz commented on issue #8573:

@fitzgen I use the CLI to run the case, the command I use is wasmtime good.wasm without extra flags.

view this post on Zulip Wasmtime GitHub notifications bot (May 07 2024 at 16:30):

fitzgen commented on issue #8573:

@hungryzzz ah okay, that means that you are measuring compilation time as well, not just execution time.

If you want to measure execution performance specifically, you should instead do something like

$ wasmtime compile test.wasm -o test.cwasm
$ time wasmtime run --allow-precompiled test.cwasm

Wasmtime is technically a JIT in the sense that it can compile code on demand and map it executable into the same process that did the compiling right before executing the code. But it is not a JIT in the same sense that engines like V8 and SpiderMonkey are, where there are speculative optimizations, lazy tiering up based on which functions are hot, dynamic deoptimization when speculative assumptions fail, etc... Given all that, I'd highly recommend the two-phase process I sketched above, where the Wasm is first compiled and then second the compiled code is executed (and this second part is what you are timing).

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 05:42):

hungryzzz commented on issue #8573:

@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows.
<img width="1385" alt="截屏2024-05-08 13 40 58" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/88a176a6-ea67-475d-8474-90d61f0fbed6">

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 06:32):

cfallin commented on issue #8573:

@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?

There are several reasons for this:

It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 06:33):

cfallin edited a comment on issue #8573:

@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?

There are several reasons for this:

It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 06:34):

hungryzzz edited a comment on issue #8573:

@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows. I guess the JITed code would be cached after the first compilation, so the execution results are nearly the same as the above one.

  case1 git:(wasm-mutate)  wasmtime compile good.wasm -o good.cwasm
➜  case1 git:(wasm-mutate)  wasmtime compile bad.wasm -o bad.cwasm
➜  case1 git:(wasm-mutate)  time wasmtime run --allow-precompiled good.cwasm
/home/ringzzz/wasm_runtime/wasmtime/target/release/wasmtime run  good.cwasm  4.67s user 0.00s system 100% cpu 4.674 total
➜  case1 git:(wasm-mutate)  time wasmtime run --allow-precompiled bad.cwasm
/home/ringzzz/wasm_runtime/wasmtime/target/release/wasmtime run  bad.cwasm  9.36s user 0.01s system 100% cpu 9.365 total

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 06:35):

cfallin edited a comment on issue #8573:

@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?

There are several reasons for this:

It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!

(EDIT: thanks for your edit! I'll leave this here for posterity)

view this post on Zulip Wasmtime GitHub notifications bot (May 08 2024 at 06:36):

hungryzzz edited a comment on issue #8573:

@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows. I guess the JITed code would be cached after the first compilation, so the execution results are nearly the same as the above one.

  case  wasmtime compile good.wasm -o good.cwasm
➜  case  wasmtime compile bad.wasm -o bad.cwasm
➜  case  time wasmtime run --allow-precompiled good.cwasm
~/wasmtime/target/release/wasmtime run  good.cwasm  4.67s user 0.00s system 100% cpu 4.674 total
➜  case  time wasmtime run --allow-precompiled bad.cwasm
~/wasmtime/target/release/wasmtime run  bad.cwasm  9.36s user 0.01s system 100% cpu 9.365 total

view this post on Zulip Wasmtime GitHub notifications bot (May 09 2024 at 00:15):

primoly commented on issue #8573:

Probably irrelevant, but could code alignment play a role? I noticed that both modules compile to basically the same instructions, just assigned different registers. Yet for func1, good gets more operations aligned to four bytes than bad.

bad.cwasm func1 x86-64: 60 instructions, 17 start at addresses divisible by four (28%)
good.cwasm func1 x86-64: 62 instructions, 25 start at addresses divisible by four (40%)

Unsurprisingly no difference for func2 since it generates identical instructions for better or worse... I mean good and bad.

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 16:21):

hungryzzz commented on issue #8573:

I use Perf tool to profile the execution time and surprisingly find that the hotspot is not in func1 where the small difference happens, but in the func2. I continue to check the binary compiled by Cranelift and find that the binary of func2 is the same before and after the small difference. It is really confused to me how the difference in func1 could affect the execution of func2.

Samples: 38K of event 'cycles', Event count (approx.): 34572923013
Overhead  Command          Shared Object        Symbol
  74.91%  wasmtime         jitted-4044282-1.so  [.] 000000000000004f                                                                                               24.63%  wasmtime         jitted-4044282-1.so  [.] 0000000000000071                                                                                                0.32%  wasmtime         jitted-4044282-1.so  [.] 0x0000000000000067                                                                                              0.01%  wasmtime         [kernel.kallsyms]    [k] __handle_mm_fault                                                                                               0.01%  wasmtime         [kernel.kallsyms]    [k] sync_regs                                                                                                       0.01%  wasmtime         [kernel.kallsyms]    [k] native_irq_return_iret                                                                                          0.01%  wasmtime         [kernel.kallsyms]    [k] vm_unmapped_area                                                                                             
Samples: 38K of event 'cycles', 4000 Hz, Event count (approx.): 34572923013
000000000000004f  /home/ringzzz/performance_debug/case1-wasmtime/wasmtime-perf/jitted-4044282-1.so [Percent: local period]
Percent│
                          Disassembly of section .text:
                   0000000000000040 <wasm[0]::function[2]>:
            wasm[0]::function[2]():
              push %rbp
              mov  %rsp,%rbp
              xor  %eax,%eax
              mov  %rax,%r8
        3a:   cmp  $0xffff,%eax
             jne  58
              mov  0x80(%rdi),%rdi
              movl $0x1,(%rdi)
              mov  %rbp,%rsp
              pop  %rbp
             retq
        58:   lea  0x1(%r8),%edx
              mov  %r8,%rax
              mov  %rdx,%r8
             jmpq 3a

# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 16:38):

fitzgen commented on issue #8573:

It is really confused to me how the difference in func1 could affect the execution of func2.

Is it changing the alignment of func2? We have both preferred and required alignments inside Cranelift, and it is possible that something is getting bumped from its preferred down to just its required alignment.

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 16:43):

alexcrichton commented on issue #8573:

Testing locally I've got the "good" function 2 starting at 0x140 and "bad" starts at 0x130. That makes sense to me because we align functions to 16-bytes and the "bad" version has more constant propagation so function is a bit smaller in the "bad" version, hence starting a bit earlier.

This program appears to be effectively a benchmark of the second function, which you've pasted the machine code above. In perf locally the jmpq and jne instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.

That being said in general the "function 2" here is a pretty bad layout choice for cranelift. The loop is spread across the epilogue instead of tightly put together which might omit a jump. Cranelift can probably do better here in terms of block layout on the second function and that might help branch predictor heuristics on your CPU as well

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 17:03):

cfallin commented on issue #8573:

FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 17:51):

alexcrichton commented on issue #8573:

Looks like the 16-byte alignment was added in #4826 as the minimum require for functions at the time to properly align constants in the constant pool. I don't think 32-bytes was specifically rejected in favor of a smaller size. Looks. like aarch64 is aligned to 32-bytes.

view this post on Zulip Wasmtime GitHub notifications bot (May 14 2024 at 18:04):

cfallin commented on issue #8573:

Cool -- @hungryzzz, I think we'd be happy to take a PR to change x86-64's function alignment to 32 bytes as well, if you'd like to create one!

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 08:26):

hungryzzz commented on issue #8573:

Is it changing the alignment of func2?

Yes, the machine codes of func2 before and after changes are the same excpet the start address of the func2.

# machine code of `func2` in `bad.wasm` generated by Cranelift
0000000000000130 <wasm[0]::function[2]>:
     130:   55                      push   %rbp
     131:   48 89 e5                mov    %rsp,%rbp
     134:   31 c0                   xor    %eax,%eax
     136:   49 89 c0                mov    %rax,%r8
     139:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
     13f:   0f 85 12 00 00 00       jne    157 <wasm[0]::function[2]+0x27>
     145:   48 8b bf 80 00 00 00    mov    0x80(%rdi),%rdi
     14c:   c7 07 01 00 00 00       movl   $0x1,(%rdi)
     152:   48 89 ec                mov    %rbp,%rsp
     155:   5d                      pop    %rbp
     156:   c3                      retq
     157:   41 8d 50 01             lea    0x1(%r8),%edx
     15b:   4c 89 c0                mov    %r8,%rax
     15e:   49 89 d0                mov    %rdx,%r8
     161:   e9 d3 ff ff ff          jmpq   139 <wasm[0]::function[2]+0x9>

# machine code of `func2` in `good.wasm` generated by Cranelift
0000000000000120 <wasm[0]::function[2]>:
     120:   55                      push   %rbp
     121:   48 89 e5                mov    %rsp,%rbp
     124:   31 c0                   xor    %eax,%eax
     126:   49 89 c0                mov    %rax,%r8
     129:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
     12f:   0f 85 12 00 00 00       jne    147 <wasm[0]::function[2]+0x27>
     135:   48 8b bf 80 00 00 00    mov    0x80(%rdi),%rdi
     13c:   c7 07 01 00 00 00       movl   $0x1,(%rdi)
     142:   48 89 ec                mov    %rbp,%rsp
     145:   5d                      pop    %rbp
     146:   c3                      retq
     147:   41 8d 50 01             lea    0x1(%r8),%edx
     14b:   4c 89 c0                mov    %r8,%rax
     14e:   49 89 d0                mov    %rdx,%r8
     151:   e9 d3 ff ff ff          jmpq   129 <wasm[0]::function[2]+0x9>

In perf locally the jmpq and jne instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.

I am confused that how the function alignment could affect the CPU branch prediction.

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 08:31):

hungryzzz edited a comment on issue #8573:

Is it changing the alignment of func2?

Yes, the machine codes of func2 before and after changes are the same excpet the start address of the func2.

# machine code of `func2` in `bad.wasm` generated by Cranelift
0000000000000130 <wasm[0]::function[2]>:
     130:   55                      push   %rbp
     131:   48 89 e5                mov    %rsp,%rbp
     134:   31 c0                   xor    %eax,%eax
     136:   49 89 c0                mov    %rax,%r8
     139:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
     13f:   0f 85 12 00 00 00       jne    157 <wasm[0]::function[2]+0x27>
     145:   48 8b bf 80 00 00 00    mov    0x80(%rdi),%rdi
     14c:   c7 07 01 00 00 00       movl   $0x1,(%rdi)
     152:   48 89 ec                mov    %rbp,%rsp
     155:   5d                      pop    %rbp
     156:   c3                      retq
     157:   41 8d 50 01             lea    0x1(%r8),%edx
     15b:   4c 89 c0                mov    %r8,%rax
     15e:   49 89 d0                mov    %rdx,%r8
     161:   e9 d3 ff ff ff          jmpq   139 <wasm[0]::function[2]+0x9>

# machine code of `func2` in `good.wasm` generated by Cranelift
0000000000000120 <wasm[0]::function[2]>:
     120:   55                      push   %rbp
     121:   48 89 e5                mov    %rsp,%rbp
     124:   31 c0                   xor    %eax,%eax
     126:   49 89 c0                mov    %rax,%r8
     129:   81 f8 ff ff 00 00       cmp    $0xffff,%eax
     12f:   0f 85 12 00 00 00       jne    147 <wasm[0]::function[2]+0x27>
     135:   48 8b bf 80 00 00 00    mov    0x80(%rdi),%rdi
     13c:   c7 07 01 00 00 00       movl   $0x1,(%rdi)
     142:   48 89 ec                mov    %rbp,%rsp
     145:   5d                      pop    %rbp
     146:   c3                      retq
     147:   41 8d 50 01             lea    0x1(%r8),%edx
     14b:   4c 89 c0                mov    %r8,%rax
     14e:   49 89 d0                mov    %rdx,%r8
     151:   e9 d3 ff ff ff          jmpq   129 <wasm[0]::function[2]+0x9>

In perf locally the jmpq and jne instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.

Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 14:46):

cfallin commented on issue #8573:

Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!

I addressed this a few comments up:

FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines. (In more detail, fetches are big aligned chunks because it's fewer wires and transistors to grab only an aligned chunk from an icache line rather than any arbitrary alignment with arbitrary shifting.)

Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:00):

hungryzzz commented on issue #8573:

Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!

Hi, I am really willing to do it, but I am not familiar with Rust and code generation implementation, could you please give me some references or advices to help me start it?

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:03):

cfallin commented on issue #8573:

I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:06):

hungryzzz commented on issue #8573:

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.

The func2 would be called many times, in my opinion, once the instructions of func2 have been fetched into the icache, wouldn't it be necessary to fetch them again when calling func2 again?

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:07):

hungryzzz commented on issue #8573:

I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?

Thank you very much! I'll try it!

view this post on Zulip Wasmtime GitHub notifications bot (May 15 2024 at 15:14):

cfallin commented on issue #8573:

The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.

The func2 would be called many times, in my opinion, once the instructions of func2 have been fetched into the icache, wouldn't it be necessary to fetch them again when calling func2 again?

Fetch is part of the CPU pipeline, and happens every time an instruction executes: the instructions are fetched from the icache into the pipeline (and then decoded, renamed, scheduled, executed, ...). So fetch bottlenecks affect even inner-loop performance. The step that puts instructions into the icache on a miss is usually called "cache fill" and happens in a separate pipeline.

(This is slightly less true on machines with uop caches and/or loop stream buffers, but those have their own limitations as well)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2024 at 08:30):

hungryzzz closed issue #8573:

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime, WasmEdge(AOT) and Wamr(AOT), and collect their execution time respectively (measured by time tool).

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

For bad.wasm, the execution time in different runtimes are as follows:

The difference between the attached two cases is as follow, i.e., changing the address of i32.load from local.get 0 to i32.const 0, the case with i32.const 0 has the bad performance in Wasmtime, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, the i32.load is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.

![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)

Versions and Environment

Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831

Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic

Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz


Last updated: Dec 23 2024 at 12:05 UTC