Stream: git-wasmtime

Topic: wasmtime / issue #7085 Cranelift: Big performance differe...


view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2023 at 18:31):

hungryzzz opened issue #7085:

Hi, I use Emscripten to compile the attached test case (flops.fast.c) and run it on Wasmtime. I then make a very small change (flops.slow.c, which only moves the location of two printf statements) that does not affect the program semantics. However, the execution time of flops.slow.c is almost 4x that of flops.fast.c (measured with the time tool).

![image](https://github.com/bytecodealliance/wasmtime/assets/32137313/b6ba42cc-d2e7-4bda-83a9-245a4c20c1b8)

I then run these two cases in WasmEdge and Wamr in AOT mode, and there is almost no difference between their execution times, so I think this bug is not caused by Emscripten.

WasmEdge

Wamr

Try to Debug

I tried increasing the number of iterations (the case contains only one loop) from 156250000 to 1562500000 and found that the execution time increases by the same factor (10x), so I suspect that this small change affects the execution time of the loop.

Since Wasmtime compiles the Wasm code to a native binary using Cranelift before execution, I dump the binaries using wasmtime compile flops.fast.wasm -o flops.fast.o and compare the loops in the two binaries, but I don't see any semantic difference.

# flops.fast.c
2f2:    8d 71 01                lea    esi,[rcx+0x1]
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12
302:    c5 1b 2a e1             vcvtsi2sd xmm12,xmm12,ecx
306:    c4 41 03 59 e4          vmulsd xmm12,xmm15,xmm12
30b:    c4 c1 1b 59 fc          vmulsd xmm7,xmm12,xmm12
310:    c5 63 59 e7             vmulsd xmm12,xmm3,xmm7
314:    c5 1b 58 e2             vaddsd xmm12,xmm12,xmm2
318:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
31d:    c5 1b 58 e0             vaddsd xmm12,xmm12,xmm0
321:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
326:    c4 41 1b 58 e6          vaddsd xmm12,xmm12,xmm14
32b:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
330:    c4 41 1b 58 e5          vaddsd xmm12,xmm12,xmm13
335:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
33a:    f3 44 0f 6f 44 24 10    movdqu xmm8,XMMWORD PTR [rsp+0x10]
341:    c4 41 1b 58 e0          vaddsd xmm12,xmm12,xmm8
346:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
34b:    c4 41 1b 58 e1          vaddsd xmm12,xmm12,xmm9
350:    c4 41 23 58 cc          vaddsd xmm9,xmm11,xmm12
355:    81 fe 90 2f 50 09       cmp    esi,0x9502f90
35b:    0f 84 08 00 00 00       je     369 <wasm[0]::function[3]+0x329>
361:    48 89 f1                mov    rcx,rsi
364:    e9 89 ff ff ff          jmp    2f2 <wasm[0]::function[3]+0x2b2>
# flops.slow.c
2c5:    8d 7e 01                lea    edi,[rsi+0x1]
2c8:    c5 1b 2a e6             vcvtsi2sd xmm12,xmm12,esi
2cc:    c5 1b 59 25 8c 04 00    vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
2d3:    00
2d4:    c4 41 1b 59 e4          vmulsd xmm12,xmm12,xmm12
2d9:    c4 41 33 59 ec          vmulsd xmm13,xmm9,xmm12
2de:    c4 41 13 58 e8          vaddsd xmm13,xmm13,xmm8
2e3:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2e8:    c5 13 58 ee             vaddsd xmm13,xmm13,xmm6
2ec:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2f1:    c5 13 58 ed             vaddsd xmm13,xmm13,xmm5
2f5:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2fa:    c5 13 58 ec             vaddsd xmm13,xmm13,xmm4
2fe:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
303:    c5 13 58 eb             vaddsd xmm13,xmm13,xmm3
307:    c4 41 1b 59 e5          vmulsd xmm12,xmm12,xmm13
30c:    c4 41 1b 58 d2          vaddsd xmm10,xmm12,xmm10
311:    c5 2b 58 15 57 04 00    vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
318:    00
319:    81 ff 90 2f 50 09       cmp    edi,0x9502f90
31f:    0f 84 08 00 00 00       je     32d <wasm[0]::function[3]+0x2ed>
325:    48 89 fe                mov    rsi,rdi
328:    e9 98 ff ff ff          jmp    2c5 <wasm[0]::function[3]+0x285>

Steps to Reproduce

  1. Compile the two C cases into Wasm:
    emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.fast.c -o flops.fast.wasm
    emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.slow.c -o flops.slow.wasm

  2. Run the Wasm in Wasmtime and collect the execution time using time:
    time wasmtime flops.fast.wasm
    time wasmtime flops.slow.wasm

Versions and Environment

code.zip

view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2023 at 18:31):

hungryzzz added the bug label to Issue #7085.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2023 at 18:31):

hungryzzz added the cranelift label to Issue #7085.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2023 at 19:16):

cfallin commented on issue #7085:

I notice that the first loop has a divide instruction (vdivsd) in it, while the second doesn't. That could explain a large difference in performance. Surprisingly, the first loop is actually supposedly the fast case.

A few questions (I don't have Emscripten installed and don't have the bandwidth to try to reproduce locally, but want to guide exploration here):

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 07:36):

hungryzzz commented on issue #7085:

@cfallin Thank you very much for your quick reply. I have also attached the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1

I've checked the corresponding wat files and found that moving the printf statements actually causes Emscripten to produce different Wasm bytecode beyond the relocated printfs themselves. Specifically, the Wasm bytecode of the loop that follows the printf statements is slightly different.
The difference is related to the variable x in the source code. In the fast version, x is recomputed at the beginning of every loop iteration even though it is a constant, while in the slow version x is treated as a constant (f64.const 0x1.cc8ff6689d97dp-28). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.
![截屏2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)

![截屏2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)
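
For reference, the two loop shapes in the screenshots boil down to roughly the following (a hand-written C sketch for illustration only; the names piref, three, m and x follow the issue, the iteration count matches the loop bound above, the polynomial body is elided, and a native C compiler would of course hoist x itself):

#include <stdio.h>

/* Hand-written sketch of the two loop shapes; NOT the real flops.c source. */
static double fast_shape(double piref, double three, double m) {
    double s = 0.0;
    for (int i = 1; i != 156250000; i++) {
        double x = piref / (three * m);  /* recomputed every iteration: the vmulsd + vdivsd pair */
        double t = (double)i * x;
        s += t * t;                      /* stand-in for the elided polynomial body */
    }
    return s;
}

static double slow_shape(void) {
    const double x = 0x1.cc8ff6689d97dp-28;  /* the folded f64.const from the slow wat */
    double s = 0.0;
    for (int i = 1; i != 156250000; i++) {
        double t = (double)i * x;
        s += t * t;
    }
    return s;
}

int main(void) {
    printf("%g %g\n", fast_shape(3.14159265358979324, 3.0, 156250000.0), slow_shape());
    return 0;
}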

Question 2

Following on from the answer to question 1, the instructions below (vmulsd, vdivsd) compute the variable x (x = piref / ( three * (double)m );).

# flops.fast.c
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12

As for question 3, I am still working on it.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 07:45):

hungryzzz edited a comment on issue #7085:

@cfallin Thank you very much for your quick reply. I have also attached the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1

I've checked the corresponding wat files and found that moving the printf statements actually causes Emscripten to produce different Wasm bytecode beyond the relocated printfs themselves. Specifically, the Wasm bytecode of the loop that follows the printf statements is slightly different.
The difference is related to the variable x in the source code. In the fast version, x is recomputed at the beginning of every loop iteration even though it is a constant, while in the slow version x is treated as a constant (f64.const 0x1.cc8ff6689d97dp-28). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.
However, even though the Wasm files are slightly different, their execution times in WasmEdge and Wamr are the same, so I think the problem lies in Wasmtime.
![截屏2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)

![截屏2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)

Question 2

Following on from the answer to question 1, the instructions below (vmulsd, vdivsd) compute the variable x (x = piref / ( three * (double)m );).

# flops.fast.c
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12

As for question 3, I am still working on it.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 07:47):

hungryzzz edited a comment on issue #7085:

@cfallin Thank you very much for your quick reply. I have also attached the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1

I've checked the corresponding wat files and found that moving the printf statements actually causes Emscripten to produce different Wasm bytecode beyond the relocated printfs themselves. Specifically, the Wasm bytecode of the loop that follows the printf statements is slightly different.
The difference is related to the variable x in the source code. In the fast version, x is recomputed at the beginning of every loop iteration even though it is a constant, while in the slow version x is treated as a constant (f64.const 0x1.cc8ff6689d97dp-28). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.
However, even though the Wasm files are slightly different, their execution times in WasmEdge and Wamr are the same, so I think the problem lies in Wasmtime.
![截屏2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)

![截屏2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)

Question 2

Following on from the answer to question 1, the instructions below (vmulsd, vdivsd) compute the variable x (x = piref / ( three * (double)m );).

# flops.fast.c
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12

As for question 3, I am still working on it.

flops.fast.wat.txt
flops.slow.wat.txt

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 08:49):

hungryzzz edited a comment on issue #7085:

@cfallin Thank you very much for your quick reply. I have also attached the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1

I've checked the corresponding wat files and found that moving the printf statements actually causes Emscripten to produce different Wasm bytecode beyond the relocated printfs themselves. Specifically, the Wasm bytecode of the loop that follows the printf statements is slightly different.
The difference is related to the variable x in the source code. In the fast version, x is recomputed at the beginning of every loop iteration even though it is a constant, while in the slow version x is treated as a constant (f64.const 0x1.cc8ff6689d97dp-28). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.
However, even though the Wasm files are slightly different, their execution times in WasmEdge and Wamr are the same, so I think the problem lies in Wasmtime.
![截屏2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)

![截屏2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)

Question 2

Following on from the answer to question 1, the instructions below (vmulsd, vdivsd) compute the variable x (x = piref / ( three * (double)m );).

# flops.fast.c
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12

Question 3

I used perf to profile the runs and collect performance counters; the results are as follows.

> sudo perf stat -d wasmtime flops.fast.wasm

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     4          0.0000      0.0000      0.0000

 Performance counter stats for 'wasmtime flops.fast.wasm':

            624.93 msec task-clock                #    1.000 CPUs utilized
                12      context-switches          #   19.202 /sec
                 0      cpu-migrations            #    0.000 /sec
               851      page-faults               #    1.362 K/sec
     2,286,312,939      cycles                    #    3.659 GHz
     3,766,795,833      instructions              #    1.65  insn per cycle
       315,471,390      branches                  #  504.811 M/sec
            85,240      branch-misses             #    0.03% of all branches
       316,824,991      L1-dcache-loads           #  506.977 M/sec
           294,058      L1-dcache-load-misses     #    0.09% of all L1-dcache accesses
            69,425      LLC-loads                 #  111.093 K/sec
            23,580      LLC-load-misses           #   33.96% of all LL-cache accesses

       0.625042358 seconds time elapsed

       0.625360000 seconds user
       0.000000000 seconds sys
> sudo perf stat -d wasmtime flops.slow.wasm

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     4          0.0000      0.0000      0.0000

 Performance counter stats for 'wasmtime flops.slow.wasm':

          2,390.93 msec task-clock                #    1.000 CPUs utilized
                19      context-switches          #    7.947 /sec
                 1      cpu-migrations            #    0.418 /sec
               852      page-faults               #  356.347 /sec
     8,797,759,453      cycles                    #    3.680 GHz
     3,299,916,619      instructions              #    0.38  insn per cycle
       315,808,198      branches                  #  132.086 M/sec
            94,853      branch-misses             #    0.03% of all branches
       317,397,865      L1-dcache-loads           #  132.751 M/sec
           295,131      L1-dcache-load-misses     #    0.09% of all L1-dcache accesses
            67,453      LLC-loads                 #   28.212 K/sec
            22,840      LLC-load-misses           #   33.86% of all LL-cache accesses

       2.391182093 seconds time elapsed

       2.391560000 seconds user
       0.000000000 seconds sys

Actually, I tried to profile only the execution of the JITed code rather than the whole Wasmtime process, but I don't know how to do that. However, following the Profiling with perfmap instructions, I found that the JITed code accounts for 98% of the total execution time, so I think the results above are a reasonably accurate representation of the JITed code's execution.

flops.fast.wat.txt
flops.slow.wat.txt

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 14:11):

alexcrichton commented on issue #7085:

I dug into this with @fitzgen yesterday and we were extremely surprised by everything going on here. Ultimately our investigation bore no fruit but I wanted to summarize and write down what we tried out anyway.

I was able to confirm everything that @hungryzzz is seeing (thanks for the detailed bug and putting in the effort here!). This alone was surprising:

The two wasms are indeed different, so there's not a correctness bug in Cranelift, but there is still seemingly performance left on the table because the two source programs are semantically the same yet perform so differently. As for the source-level differences causing differences in wasm, my assumption is that LLVM on the wasm target assumes that printf may modify all local variables. This means that in the "fast" program the prints sit below some constants, so those values aren't considered constant in the loop. In the "slow" program the values are all constant for the loop and are hoisted out. (This isn't a mistake in my writing; I understand how weird this sounds.)

Nick and I experimented with a few changes to Cranelift to try to see what was going on:

Don't load constants from memory

In the "slow" program constants in floating-point operations are sunk into the constant pool and are memory operands in instructions, for example:

2cc:    c5 1b 59 25 8c 04 00    vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
...
311:    c5 2b 58 15 57 04 00    vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
318:    00

We tried not doing this and instead forcing materialization of the constant. This was a movq of a 64-bit value into a general-purpose register followed by vmovd of that register into an xmm register, followed by use of the xmm register.

This improved performance of the "slow" program to the point of being faster than the "fast" program, as one would expect given that a loop without fdiv should be faster than one with it. Additionally, this approached the performance of the native program compiled with -Os (to thwart vectorization).
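
Concretely, the experiment swapped the RIP-relative memory operand for an explicit materialization, i.e. something along these lines (a hand-written inline-asm sketch in the style of the snippets later in this thread, not Cranelift's actual output; the register choices and the 3.0 constant are illustrative):

static double c = 3.0;

void memory_operand_form(void) {
    /* the constant is read straight out of the constant pool as a memory operand;
       xmm6 stands in for the other multiplicand */
    __asm__ __volatile__(
        "vmulsd %0, %%xmm6, %%xmm12\n\t"
        : : "m"(c) : "xmm12");
}

void materialized_form(void) {
    /* the constant is materialized through a GPR, then moved into an xmm register */
    __asm__ __volatile__(
        "movabs $0x4008000000000000, %%rax\n\t"  /* bit pattern of 3.0 */
        "vmovq  %%rax, %%xmm12\n\t"
        "vmulsd %%xmm12, %%xmm6, %%xmm12\n\t"
        : : : "rax", "xmm12");
}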

Don't rematerialize float constants

Next we thought to improve the "fast" loop by hoisting out its recalculation of the x constant on each loop iteration. We removed the explicit rematerialization of floating-point constants. This, surprisingly, slowed down both programs: both were then just as slow as the original "slow" program. Both source changes resulted in the same loop:

  1.63 2ad:┌─→lea       0x1(%rdi),%esi
             vcvtsi2sd %edi,%xmm2,%xmm2
  6.14       vmulsd    %xmm5,%xmm2,%xmm2
  6.71       vmulsd    %xmm2,%xmm2,%xmm13
  6.32       vmulsd    %xmm13,%xmm11,%xmm2
  6.53       vaddsd    %xmm1,%xmm2,%xmm2
  7.39       vmulsd    %xmm2,%xmm13,%xmm2
  6.92       vaddsd    %xmm7,%xmm2,%xmm2
  6.68       vmulsd    %xmm2,%xmm13,%xmm2
  6.79       vaddsd    %xmm6,%xmm2,%xmm2
  6.05       vmulsd    %xmm2,%xmm13,%xmm2
  6.54       vaddsd    %xmm4,%xmm2,%xmm2
  6.00       vmulsd    %xmm2,%xmm13,%xmm2
  6.90       vaddsd    %xmm3,%xmm2,%xmm2
  6.23       vmulsd    %xmm2,%xmm13,%xmm2
  6.88       vaddsd    %xmm14,%xmm2,%xmm2
  6.29       vaddsd    %xmm0,%xmm2,%xmm14
             cmp       $0x9502f90,%esi
           │↓ je        306
             mov       %rsi,%rdi
           └──jmpq      2ad
       306:   mov       0x20(%rsp),%rcx

What was extremely surprisin
[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 14:13):

alexcrichton commented on issue #7085:

Oh, I should also say we never bottomed out why Cranelift would sink the calculation of x into the loop. In the source program this calculation isn't in the loop, yet somehow Cranelift ended up putting it there, and we never fully worked out why. The reason we didn't dig further is that the program which sank the calculation into the loop was, counter-intuitively, the faster one. This is probably an independent bug, however.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 18:42):

hungryzzz commented on issue #7085:

@alexcrichton @cfallin Hi, I extracted the loop code snippets from the whole binary and wrapped them in inline assembly. I find that there is an execution time difference between the two snippets. I think it will be more convenient to do further experiments on this, so I am pasting it here.

#include <stdint.h>

double three = 3.0;
double one = 1.0;
double B1 = -0.4999999999982;
double B2 = 0.4166666664651E-1;
double B3 = -0.1388888805755E-2;
double B4 = 0.24801428034E-4;
double B5 = -0.2754213324E-6;
double B6 = 0.20189405E-8;
double piref = 3.14159265358979324;
double m = 156250000;
double x = 6.70206E-9;
double s = 1000.0;

uint64_t test_quick(){
  __asm__ __volatile__(
"vmovsd %0,%%xmm5\n\t" // piref
"vmovsd %1,%%xmm6\n\t" // m
"vmovsd %2,%%xmm13\n\t"  // B2
"vmovsd %3,%%xmm14\n\t"  // B3
"vmovsd %4,%%xmm0\n\t"   // B4
"vmovsd %5,%%xmm2\n\t"   // B5
"vmovsd %6,%%xmm3\n\t"   // B6
"vmovsd %7,%%xmm9\n\t"   // s
"mov $1,%%rcx\n\t"
"START:\n\t"
"lea    0x1(%%rcx),%%esi\n\t"
"vmulsd %8,%%xmm6,%%xmm12\n\t"  // three
"vdivsd %%xmm12,%%xmm5,%%xmm15\n\t"
"vcvtsi2sd %%ecx,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm15,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm12,%%xmm7\n\t"
"vmulsd %%xmm7,%%xmm3,%%xmm12\n\t"
"vaddsd %%xmm2,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm0,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm14,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm13,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"movdqu %9,%%xmm8\n\t"   // B1
"vaddsd %%xmm8,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm9,%%xmm12,%%xmm12\n\t"
"vaddsd %%xmm12,%%xmm11,%%xmm9\n\t"
"cmp    $0x9502f90,%%esi\n\t"
"je     EXIT\n\t"
"mov    %%rsi,%%rcx\n\t"
"jmp   START\n\t"
"EXIT:\n\t":"=m"(piref),"=m"(m),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(three),"=m"(B1):
            :"rcx","rsi");
    return 0;
}

uint64_t test_quick_mod(){
  __asm__ __volatile__(
"vmovsd %0,%%xmm5\n\t" // piref
"vmovsd %1,%%xmm6\n\t" // m
"vmovsd %2,%%xmm13\n\t"  // B2
"vmovsd %3,%%xmm14\n\t"  // B3
"vmovsd %4,%%xmm0\n\t"   // B4
"vmovsd %5,%%xmm2\n\t"   // B5
"vmovsd %6,%%xmm3\n\t"   // B6
"vmovsd %7,%%xmm9\n\t"   // s
"mov $1,%%rcx\n\t"
"START1:\n\t"
"lea    0x1(%%rcx),%%esi\n\t"
"vmulsd %8,%%xmm6,%%xmm12\n\t"  // three
"vdivsd %%xmm12,%%xmm5,%%xmm15\n\t"
"vcvtsi2sd %%ecx,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm15,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm12,%%xmm7\n\t"
"vmulsd %%xmm7,%%xmm3,%%xmm12\n\t"
"vaddsd %%xmm2,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm0,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm14,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm13,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"movdqu %9,%%xmm8\n\t"   // B1
"vaddsd %%xmm8,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm9,%%xmm12,%%xmm12\n\t"
"vaddsd %%xmm12,%%xmm11,%%xmm9\n\t"
"cmp    $0x9502f90,%%esi\n\t"
"je     EXIT1\n\t"
"mov    %%rsi,%%rcx\n\t"
"jmp   START1\n\t"
"EXIT1:\n\t":"=m"(piref),"=m"(m),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(three),"=m"(B1):
            :"rcx","rsi");
    return 0;
}

uint64_t test_slow(){
    __asm__ __volatile__(
"vmovsd %0,%%xmm3\n\t"  // B1
"vmovsd %1,%%xmm4\n\t"  // B2
"vmovsd %2,%%xmm5\n\t"  // B3
"vmovsd %3,%%xmm6\n\t"  // B4
"vmovsd %4,%%xmm8\n\t"  // B5
"vmovsd %5,%%xmm9\n\t"  // B6
"vmovsd %6,%%xmm10\n\t"
"mov $1,%%rsi\n\t"
"START2:\n\t"
"lea    0x1(%%rsi),%%edi\n\t"
"vcvtsi2sd %%esi,%%xmm12,%%xmm12\n\t"
"vmulsd %7,%%xmm12,%%xmm12\n\t"     // x
"vmulsd %%xmm12,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm9,%%xmm13\n\t"
"vaddsd %%xmm8,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm6,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm5,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm4,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm3,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm12\n\t"
"vaddsd %%xmm10,%%xmm12,%%xmm10\n\t"
"vaddsd %8,%%xmm10,%%xmm10\n\t"
"cmp    $0x9502f90,%%edi\n\t"
"je     EXIT2\n\t"
"mov    %%rdi,%%rsi\n\t"
"jmp    START2\n\t"
"EXIT2:\n\t":"=m"(B1),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(x),"=m"(one):
            :"rsi","rdi");
    return 0;
}

int main() {

  // test_quick();
  test_slow();

  return 0;
}

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 18:43):

hungryzzz edited a comment on issue #7085:

@alexcrichton @cfallin Hi, I extracted the loop code snippets from the whole binary and wrapped them in inline assembly. I find that there is an execution time difference between the two snippets. I think it will be more convenient to do further experiments on this, so I am pasting it here.

#include <stdint.h>

double three = 3.0;
double one = 1.0;
double B1 = -0.4999999999982;
double B2 = 0.4166666664651E-1;
double B3 = -0.1388888805755E-2;
double B4 = 0.24801428034E-4;
double B5 = -0.2754213324E-6;
double B6 = 0.20189405E-8;
double piref = 3.14159265358979324;
double m = 156250000;
double x = 6.70206E-9;
double s = 1000.0;

uint64_t test_quick(){
  __asm__ __volatile__(
"vmovsd %0,%%xmm5\n\t" // piref
"vmovsd %1,%%xmm6\n\t" // m
"vmovsd %2,%%xmm13\n\t"  // B2
"vmovsd %3,%%xmm14\n\t"  // B3
"vmovsd %4,%%xmm0\n\t"   // B4
"vmovsd %5,%%xmm2\n\t"   // B5
"vmovsd %6,%%xmm3\n\t"   // B6
"vmovsd %7,%%xmm9\n\t"   // s
"mov $1,%%rcx\n\t"
"START:\n\t"
"lea    0x1(%%rcx),%%esi\n\t"
"vmulsd %8,%%xmm6,%%xmm12\n\t"  // three
"vdivsd %%xmm12,%%xmm5,%%xmm15\n\t"
"vcvtsi2sd %%ecx,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm15,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm12,%%xmm7\n\t"
"vmulsd %%xmm7,%%xmm3,%%xmm12\n\t"
"vaddsd %%xmm2,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm0,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm14,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm13,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"movdqu %9,%%xmm8\n\t"   // B1
"vaddsd %%xmm8,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
"vaddsd %%xmm9,%%xmm12,%%xmm12\n\t"
"vaddsd %%xmm12,%%xmm11,%%xmm9\n\t"
"cmp    $0x9502f90,%%esi\n\t"
"je     EXIT\n\t"
"mov    %%rsi,%%rcx\n\t"
"jmp   START\n\t"
"EXIT:\n\t":"=m"(piref),"=m"(m),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(three),"=m"(B1):
            :"rcx","rsi");
    return 0;
}

uint64_t test_slow(){
    __asm__ __volatile__(
"vmovsd %0,%%xmm3\n\t"  // B1
"vmovsd %1,%%xmm4\n\t"  // B2
"vmovsd %2,%%xmm5\n\t"  // B3
"vmovsd %3,%%xmm6\n\t"  // B4
"vmovsd %4,%%xmm8\n\t"  // B5
"vmovsd %5,%%xmm9\n\t"  // B6
"vmovsd %6,%%xmm10\n\t"
"mov $1,%%rsi\n\t"
"START2:\n\t"
"lea    0x1(%%rsi),%%edi\n\t"
"vcvtsi2sd %%esi,%%xmm12,%%xmm12\n\t"
"vmulsd %7,%%xmm12,%%xmm12\n\t"     // x
"vmulsd %%xmm12,%%xmm12,%%xmm12\n\t"
"vmulsd %%xmm12,%%xmm9,%%xmm13\n\t"
"vaddsd %%xmm8,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm6,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm5,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm4,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
"vaddsd %%xmm3,%%xmm13,%%xmm13\n\t"
"vmulsd %%xmm13,%%xmm12,%%xmm12\n\t"
"vaddsd %%xmm10,%%xmm12,%%xmm10\n\t"
"vaddsd %8,%%xmm10,%%xmm10\n\t"
"cmp    $0x9502f90,%%edi\n\t"
"je     EXIT2\n\t"
"mov    %%rdi,%%rsi\n\t"
"jmp    START2\n\t"
"EXIT2:\n\t":"=m"(B1),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(x),"=m"(one):
            :"rsi","rdi");
    return 0;
}

int main() {

  // test_quick();
  test_slow();

  return 0;
}

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 19:48):

alexcrichton commented on issue #7085:

Oh wow that's super helpful, thank you so much for that!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 22:36):

alexcrichton commented on issue #7085:

Aha, we have a fix! It turns out this is a false-dependency issue: vcvtsi2sd preserves the upper bits of one of its source registers when we don't actually want it to. That created a dependency on the previous iteration of the loop when it shouldn't have, which slowed everything down. Inserting a vxorps in front of vcvtsi2sd fixes the issue.
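
In instruction terms, the fix is just a zeroing idiom ahead of the convert, roughly like this (a minimal hand-written inline-asm sketch, not the actual Cranelift output; the register choice is illustrative):

double convert_without_false_dependency(int i) {
    double out;
    __asm__(
        "vxorps    %%xmm12, %%xmm12, %%xmm12\n\t"  /* break the dependency on xmm12's previous value */
        "vcvtsi2sd %1, %%xmm12, %%xmm12\n\t"       /* the convert now depends only on the integer input */
        "vmovsd    %%xmm12, %0\n\t"
        : "=m"(out) : "r"(i) : "xmm12");
    return out;
}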

With all the above learnings I'll look to send more PRs in the near future!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 22:43):

alexcrichton commented on issue #7085:

AKA we are missing this (similar code from spidermonkey)

view this post on Zulip Wasmtime GitHub notifications bot (Sep 26 2023 at 22:54):

cfallin commented on issue #7085:

Wow, that's devious -- thank you @alexcrichton for working that out (and @fitzgen too)!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 07:30):

hungryzzz commented on issue #7085:

@alexcrichton Thank you very much for your quick reply! I have a question about the above solution. I also dumped the loop binary generated by WasmEdge and Wamr (they are the same; both depend on LLVM). I find that vcvtsi2sd is used on its own, yet both run quickly. So I am still confused: when should we insert a vxorps in front of vcvtsi2sd, and when is the slowdown triggered? Thanks again!

# The loop binary generated by Wamr; loop unrolling has been performed.
410:    c5 9b 59 c4             vmulsd %xmm4,%xmm12,%xmm0
414:    c5 fb 59 c0             vmulsd %xmm0,%xmm0,%xmm0
418:    c5 cb 59 c8             vmulsd %xmm0,%xmm6,%xmm1
41c:    c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
420:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
424:    c5 f3 58 cb             vaddsd %xmm3,%xmm1,%xmm1
428:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
42c:    c5 f3 58 ca             vaddsd %xmm2,%xmm1,%xmm1
430:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
434:    c5 8b 58 c9             vaddsd %xmm1,%xmm14,%xmm1
438:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
43c:    c5 93 58 c9             vaddsd %xmm1,%xmm13,%xmm1
440:    c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
444:    c5 fb 58 c7             vaddsd %xmm7,%xmm0,%xmm0
448:    c5 a3 58 f8             vaddsd %xmm0,%xmm11,%xmm7
44c:    3d 90 2f 50 09          cmp    $0x9502f90,%eax
451:    74 4c                   je     49f <aot_func#1+0x47f>
453:    c5 83 2a c0             vcvtsi2sd %eax,%xmm15,%xmm0
457:    c5 9b 59 c0             vmulsd %xmm0,%xmm12,%xmm0
45b:    c5 fb 59 c0             vmulsd %xmm0,%xmm0,%xmm0
45f:    c5 cb 59 c8             vmulsd %xmm0,%xmm6,%xmm1
463:    c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
467:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
46b:    c5 f3 58 cb             vaddsd %xmm3,%xmm1,%xmm1
46f:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
473:    c5 f3 58 ca             vaddsd %xmm2,%xmm1,%xmm1
477:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
47b:    c5 8b 58 c9             vaddsd %xmm1,%xmm14,%xmm1
47f:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
483:    c5 93 58 c9             vaddsd %xmm1,%xmm13,%xmm1
487:    c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
48b:    c5 fb 58 c7             vaddsd %xmm7,%xmm0,%xmm0
48f:    c5 a3 58 f8             vaddsd %xmm0,%xmm11,%xmm7
493:    c5 ab 58 e4             vaddsd %xmm4,%xmm10,%xmm4
497:    83 c0 02                add    $0x2,%eax
49a:    e9 71 ff ff ff          jmpq   410 <aot_func#1+0x3f0>

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 07:30):

hungryzzz edited a comment on issue #7085:

@alexcrichton Thank you very much for your quick reply! I have a question about the above solution. I also dumped the loop binary generated by WasmEdge and Wamr (the results are the same; both depend on LLVM). I find that vcvtsi2sd is used on its own, yet both run quickly. So I am still confused: when should we insert a vxorps in front of vcvtsi2sd, and when is the slowdown triggered? Thanks again!

# The loop binary generated by Wamr; loop unrolling has been performed.
410:    c5 9b 59 c4             vmulsd %xmm4,%xmm12,%xmm0
414:    c5 fb 59 c0             vmulsd %xmm0,%xmm0,%xmm0
418:    c5 cb 59 c8             vmulsd %xmm0,%xmm6,%xmm1
41c:    c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
420:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
424:    c5 f3 58 cb             vaddsd %xmm3,%xmm1,%xmm1
428:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
42c:    c5 f3 58 ca             vaddsd %xmm2,%xmm1,%xmm1
430:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
434:    c5 8b 58 c9             vaddsd %xmm1,%xmm14,%xmm1
438:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
43c:    c5 93 58 c9             vaddsd %xmm1,%xmm13,%xmm1
440:    c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
444:    c5 fb 58 c7             vaddsd %xmm7,%xmm0,%xmm0
448:    c5 a3 58 f8             vaddsd %xmm0,%xmm11,%xmm7
44c:    3d 90 2f 50 09          cmp    $0x9502f90,%eax
451:    74 4c                   je     49f <aot_func#1+0x47f>
453:    c5 83 2a c0             vcvtsi2sd %eax,%xmm15,%xmm0
457:    c5 9b 59 c0             vmulsd %xmm0,%xmm12,%xmm0
45b:    c5 fb 59 c0             vmulsd %xmm0,%xmm0,%xmm0
45f:    c5 cb 59 c8             vmulsd %xmm0,%xmm6,%xmm1
463:    c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
467:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
46b:    c5 f3 58 cb             vaddsd %xmm3,%xmm1,%xmm1
46f:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
473:    c5 f3 58 ca             vaddsd %xmm2,%xmm1,%xmm1
477:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
47b:    c5 8b 58 c9             vaddsd %xmm1,%xmm14,%xmm1
47f:    c5 fb 59 c9             vmulsd %xmm1,%xmm0,%xmm1
483:    c5 93 58 c9             vaddsd %xmm1,%xmm13,%xmm1
487:    c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
48b:    c5 fb 58 c7             vaddsd %xmm7,%xmm0,%xmm0
48f:    c5 a3 58 f8             vaddsd %xmm0,%xmm11,%xmm7
493:    c5 ab 58 e4             vaddsd %xmm4,%xmm10,%xmm4
497:    83 c0 02                add    $0x2,%eax
49a:    e9 71 ff ff ff          jmpq   410 <aot_func#1+0x3f0>

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 14:35):

alexcrichton commented on issue #7085:

Good question! It turns out that LLVM is quite clever about this. The semantics of vcvtsi2sd %eax,%xmm15,%xmm0 is that it converts the 32-bit value in %eax into a 64-bit double and places it in the lower bits of %xmm0. The upper 64-bits of the %xmm15 register are additionally copied to the upper bits of %xmm0. This means that vcvtsi2sd depends on two registers, where the int-to-float operation is typically instead considered a unary operation as only depending on %eax. That's the bug in Cranelift, we don't consider the %xmm15 dependency and filled in %xmm0 there, meaning the instruction depends on something we didn't intend.

LLVM, however, already knows this, and outside the loop you'll find a vxorps (or something similar) to zero out the %xmm15 register. Because %xmm15 isn't written by the loop that means it's zero for the entire loop and the zeroing operation was hoisted out of the loop. Effectively LLVM set aside a dependency chain for the entire loop which is loop invariant for vcvtsi2sd to use.

In Cranelift we're not really positioned to be quite so clever, so our solution is going to be to use vxorps before all int-to-float instructions.
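
Sketched in the same inline-asm style as the snippets earlier in the thread (hand-written for illustration only; the register numbers, labels, and loop bound are illustrative and not the real output of either compiler):

void zeroing_strategies_sketch(void) {
    /* LLVM-style: zero the register that supplies the upper bits once, outside
       the loop, so every vcvtsi2sd inside merges from a loop-invariant value. */
    __asm__ __volatile__(
        "vxorps    %%xmm15, %%xmm15, %%xmm15\n\t"
        "mov       $1, %%ecx\n\t"
        "1:\n\t"
        "vcvtsi2sd %%ecx, %%xmm15, %%xmm0\n\t"   /* upper bits come from the never-written xmm15 */
        "add       $1, %%ecx\n\t"
        "cmp       $156250000, %%ecx\n\t"
        "jne       1b\n\t"
        : : : "rcx", "xmm0", "xmm15", "cc");

    /* Per-convert zeroing (the SpiderMonkey-style approach Cranelift plans to
       take): zero the destination immediately before each convert, so it can
       never depend on the previous iteration's value. */
    __asm__ __volatile__(
        "mov       $1, %%ecx\n\t"
        "2:\n\t"
        "vxorps    %%xmm12, %%xmm12, %%xmm12\n\t"
        "vcvtsi2sd %%ecx, %%xmm12, %%xmm12\n\t"
        "add       $1, %%ecx\n\t"
        "cmp       $156250000, %%ecx\n\t"
        "jne       2b\n\t"
        : : : "rcx", "xmm12", "cc");
}

The trade-off described above is exactly this: the first form needs whole-loop context to know where the vxorps can be hoisted, while the second needs no context at all but pays an extra instruction per convert.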

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 15:11):

hungryzzz commented on issue #7085:

Good question! It turns out that LLVM is quite clever about this. The semantics of vcvtsi2sd %eax,%xmm15,%xmm0 is that it converts the 32-bit value in %eax into a 64-bit double and places it in the lower bits of %xmm0. The upper 64-bits of the %xmm15 register are additionally copied to the upper bits of %xmm0. This means that vcvtsi2sd depends on two registers, where the int-to-float operation is typically instead considered a unary operation as only depending on %eax. That's the bug in Cranelift, we don't consider the %xmm15 dependency and filled in %xmm0 there, meaning the instruction depends on something we didn't intend.

LLVM, however, already knows this, and outside the loop you'll find a vxorps (or something similar) to zero out the %xmm15 register. Because %xmm15 isn't written by the loop that means it's zero for the entire loop and the zeroing operation was hoisted out of the loop. Effectively LLVM set aside a dependency chain for the entire loop which is loop invariant for vcvtsi2sd to use.

In Cranelift we're not really positioned to be quite so clever, so our solution is going to be to use vxorps before all int-to-float instructions.

Hi, I tried to find an instruction outside the loop that zeroes the %xmm15 register, but I found nothing. So I guess the trick of how vcvtsi2sd is used may be related to LLVM's register allocation algorithm?

![截屏2023-09-27 23 11 12](https://github.com/bytecodealliance/wasmtime/assets/32137313/cbf54a77-22cc-46e7-a037-bbb416622f23)

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 15:19):

alexcrichton commented on issue #7085:

Ah your above snippet didn't have the code before the loop and I don't see the loop in your screenshot there, so I can't say for sure. In the native LLVM output, which I suspect is similar to what you're seeing, the vxorps of xmm15 ahead of the loop is what I was seeing. Regardless though xmm15 isn't used during the loop, so there's no dependency chain that the cpu needs to worry about anyway. Technically vxorps for xmm15 isn't necessary given the duration of this loop anyway.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 15:40):

hungryzzz commented on issue #7085:

Hi, the whole binary file generated by Wamr is attached. I find that there is no instruction outside the loop that zeroes the %xmm15 register; as you can see, the only reference to xmm15 in this file is vcvtsi2sd %eax,%xmm15,%xmm0. So I am wondering at which stage LLVM handles this special instruction.
In my understanding, the SpiderMonkey solution (inserting vxorps in front of every vcvtsi2sd) happens at code emission (maybe?). I wonder whether this register dependency problem could instead be solved during register allocation?
Thanks again! (I really hope my questions are not taking up too much of your time...)

![截屏2023-09-27 23 25 49](https://github.com/bytecodealliance/wasmtime/assets/32137313/9971a87b-bb9c-43e0-a1e5-2ce771c7c5c1)
wamr.slow.binary

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 15:46):

alexcrichton commented on issue #7085:

No worries! Indeed you're right in that the entire file doesn't touch xmm15 except for that instruction. From a register-dependency perspective that's ok. The SpiderMonkey solution is a simple "I have no context but want to break dependencies" solution where LLVM is taking a more complicated "I have total context and want to break the dependency" perspective. This can theoretically be solved with register allocation but our register allocator isn't built for that sort of constraint.

Basically there are really low-level details in play for why this is slow in Cranelift and fast in LLVM. I can only guess what heuristics and optimizations are in LLVM; I'm not familiar with LLVM's source code myself. In Cranelift we're likely to take a simple approach of feeding a xorps into cvtsi2s{s,d} to break dependency chains. This is suboptimal since, as LLVM shows, it's not strictly necessary in 100% of cases. That being said, it does break 100% of dependency chains, so this issue won't resurface. The follow-on problem perhaps is that icache will take a slight hit, but that's hopefully not a 4x difference like you're seeing here.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 15:52):

hungryzzz commented on issue #7085:

Got it! Thank you very much, @alexcrichton!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2023 at 18:21):

alexcrichton commented on issue #7085:

I've confirmed that locally at least https://github.com/bytecodealliance/wasmtime/pull/7098 fixes the performance here.

Thanks again for reporting this issue @hungryzzz!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2023 at 15:45):

alexcrichton closed issue #7085:

Hi, I use Emscripten to compile the attached test case (flops.fast.c) and run it on Wasmtime. I then make a very small change (flops.slow.c, which only moves the location of two printf statements) that does not affect the program semantics. However, the execution time of flops.slow.c is almost 4x that of flops.fast.c (measured with the time tool).

![image](https://github.com/bytecodealliance/wasmtime/assets/32137313/b6ba42cc-d2e7-4bda-83a9-245a4c20c1b8)

I then run these two cases in WasmEdge and Wamr in AOT mode, and there is almost no difference between their execution times, so I think this bug is not caused by Emscripten.

WasmEdge

Wamr

Try to Debug

I tried increasing the number of iterations (the case contains only one loop) from 156250000 to 1562500000 and found that the execution time increases by the same factor (10x), so I suspect that this small change affects the execution time of the loop.

Since Wasmtime compiles the Wasm code to a native binary using Cranelift before execution, I dump the binaries using wasmtime compile flops.fast.wasm -o flops.fast.o and compare the loops in the two binaries, but I don't see any semantic difference.

# flops.fast.c
2f2:    8d 71 01                lea    esi,[rcx+0x1]
2f5:    c5 4b 59 25 73 04 00    vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
2fc:    00
2fd:    c4 41 53 5e fc          vdivsd xmm15,xmm5,xmm12
302:    c5 1b 2a e1             vcvtsi2sd xmm12,xmm12,ecx
306:    c4 41 03 59 e4          vmulsd xmm12,xmm15,xmm12
30b:    c4 c1 1b 59 fc          vmulsd xmm7,xmm12,xmm12
310:    c5 63 59 e7             vmulsd xmm12,xmm3,xmm7
314:    c5 1b 58 e2             vaddsd xmm12,xmm12,xmm2
318:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
31d:    c5 1b 58 e0             vaddsd xmm12,xmm12,xmm0
321:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
326:    c4 41 1b 58 e6          vaddsd xmm12,xmm12,xmm14
32b:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
330:    c4 41 1b 58 e5          vaddsd xmm12,xmm12,xmm13
335:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
33a:    f3 44 0f 6f 44 24 10    movdqu xmm8,XMMWORD PTR [rsp+0x10]
341:    c4 41 1b 58 e0          vaddsd xmm12,xmm12,xmm8
346:    c4 41 43 59 e4          vmulsd xmm12,xmm7,xmm12
34b:    c4 41 1b 58 e1          vaddsd xmm12,xmm12,xmm9
350:    c4 41 23 58 cc          vaddsd xmm9,xmm11,xmm12
355:    81 fe 90 2f 50 09       cmp    esi,0x9502f90
35b:    0f 84 08 00 00 00       je     369 <wasm[0]::function[3]+0x329>
361:    48 89 f1                mov    rcx,rsi
364:    e9 89 ff ff ff          jmp    2f2 <wasm[0]::function[3]+0x2b2>
# flops.slow.c
2c5:    8d 7e 01                lea    edi,[rsi+0x1]
2c8:    c5 1b 2a e6             vcvtsi2sd xmm12,xmm12,esi
2cc:    c5 1b 59 25 8c 04 00    vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
2d3:    00
2d4:    c4 41 1b 59 e4          vmulsd xmm12,xmm12,xmm12
2d9:    c4 41 33 59 ec          vmulsd xmm13,xmm9,xmm12
2de:    c4 41 13 58 e8          vaddsd xmm13,xmm13,xmm8
2e3:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2e8:    c5 13 58 ee             vaddsd xmm13,xmm13,xmm6
2ec:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2f1:    c5 13 58 ed             vaddsd xmm13,xmm13,xmm5
2f5:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
2fa:    c5 13 58 ec             vaddsd xmm13,xmm13,xmm4
2fe:    c4 41 1b 59 ed          vmulsd xmm13,xmm12,xmm13
303:    c5 13 58 eb             vaddsd xmm13,xmm13,xmm3
307:    c4 41 1b 59 e5          vmulsd xmm12,xmm12,xmm13
30c:    c4 41 1b 58 d2          vaddsd xmm10,xmm12,xmm10
311:    c5 2b 58 15 57 04 00    vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
318:    00
319:    81 ff 90 2f 50 09       cmp    edi,0x9502f90
31f:    0f 84 08 00 00 00       je     32d <wasm[0]::function[3]+0x2ed>
325:    48 89 fe                mov    rsi,rdi
328:    e9 98 ff ff ff          jmp    2c5 <wasm[0]::function[3]+0x285>

Steps to Reproduce

  1. Compile the two C cases into Wasm:
    emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.fast.c -o flops.fast.wasm
    emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.slow.c -o flops.slow.wasm

  2. Run the Wasm in Wasmtime and collect the execution time using time:
    time wasmtime flops.fast.wasm
    time wasmtime flops.slow.wasm

Versions and Environment

code.zip


Last updated: Oct 23 2024 at 20:03 UTC