hungryzzz opened issue #7085:
Hi, I use Emscripten to compile the attached test case (`flops.fast.c`) and run it on Wasmtime. I then make a very small change (`flops.slow.c`, which only moves the location of two `printf` statements) that does not affect the program's semantics. However, the execution time of `flops.slow.c` is almost 4x slower than that of `flops.fast.c` (measured with the `time` tool).

- `flops.fast.c`: 0.615s
- `flops.slow.c`: 2.395s

![image](https://github.com/bytecodealliance/wasmtime/assets/32137313/b6ba42cc-d2e7-4bda-83a9-245a4c20c1b8)
I then run these two cases in WasmEdge and Wamr in AOT mode, and there is almost no difference between their execution times, so I think this bug is not caused by Emscripten.

WasmEdge
- `flops.fast.c`: 0.43s
- `flops.slow.c`: 0.43s

Wamr
- `flops.fast.c`: 0.42s
- `flops.slow.c`: 0.42s

Try to Debug
I tried increasing the number of iterations (the case only contains one loop) from `156250000` to `1562500000`, and the execution time increases by the same factor (10x), so I guess these small changes affect the execution time of the loop.

Since Wasmtime compiles the wasm code to native code with Cranelift before execution, I dumped the binaries using `wasmtime compile flops.fast.wasm -o flops.fast.o` and compared the loops in the two binaries, but I cannot see any semantic difference between them:

```
# flops.fast.c
 2f2: 8d 71 01              lea    esi,[rcx+0x1]
 2f5: c5 4b 59 25 73 04 00  vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
 2fc: 00
 2fd: c4 41 53 5e fc        vdivsd xmm15,xmm5,xmm12
 302: c5 1b 2a e1           vcvtsi2sd xmm12,xmm12,ecx
 306: c4 41 03 59 e4        vmulsd xmm12,xmm15,xmm12
 30b: c4 c1 1b 59 fc        vmulsd xmm7,xmm12,xmm12
 310: c5 63 59 e7           vmulsd xmm12,xmm3,xmm7
 314: c5 1b 58 e2           vaddsd xmm12,xmm12,xmm2
 318: c4 41 43 59 e4        vmulsd xmm12,xmm7,xmm12
 31d: c5 1b 58 e0           vaddsd xmm12,xmm12,xmm0
 321: c4 41 43 59 e4        vmulsd xmm12,xmm7,xmm12
 326: c4 41 1b 58 e6        vaddsd xmm12,xmm12,xmm14
 32b: c4 41 43 59 e4        vmulsd xmm12,xmm7,xmm12
 330: c4 41 1b 58 e5        vaddsd xmm12,xmm12,xmm13
 335: c4 41 43 59 e4        vmulsd xmm12,xmm7,xmm12
 33a: f3 44 0f 6f 44 24 10  movdqu xmm8,XMMWORD PTR [rsp+0x10]
 341: c4 41 1b 58 e0        vaddsd xmm12,xmm12,xmm8
 346: c4 41 43 59 e4        vmulsd xmm12,xmm7,xmm12
 34b: c4 41 1b 58 e1        vaddsd xmm12,xmm12,xmm9
 350: c4 41 23 58 cc        vaddsd xmm9,xmm11,xmm12
 355: 81 fe 90 2f 50 09     cmp    esi,0x9502f90
 35b: 0f 84 08 00 00 00     je     369 <wasm[0]::function[3]+0x329>
 361: 48 89 f1              mov    rcx,rsi
 364: e9 89 ff ff ff        jmp    2f2 <wasm[0]::function[3]+0x2b2>
```
```
# flops.slow.c
 2c5: 8d 7e 01              lea    edi,[rsi+0x1]
 2c8: c5 1b 2a e6           vcvtsi2sd xmm12,xmm12,esi
 2cc: c5 1b 59 25 8c 04 00  vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
 2d3: 00
 2d4: c4 41 1b 59 e4        vmulsd xmm12,xmm12,xmm12
 2d9: c4 41 33 59 ec        vmulsd xmm13,xmm9,xmm12
 2de: c4 41 13 58 e8        vaddsd xmm13,xmm13,xmm8
 2e3: c4 41 1b 59 ed        vmulsd xmm13,xmm12,xmm13
 2e8: c5 13 58 ee           vaddsd xmm13,xmm13,xmm6
 2ec: c4 41 1b 59 ed        vmulsd xmm13,xmm12,xmm13
 2f1: c5 13 58 ed           vaddsd xmm13,xmm13,xmm5
 2f5: c4 41 1b 59 ed        vmulsd xmm13,xmm12,xmm13
 2fa: c5 13 58 ec           vaddsd xmm13,xmm13,xmm4
 2fe: c4 41 1b 59 ed        vmulsd xmm13,xmm12,xmm13
 303: c5 13 58 eb           vaddsd xmm13,xmm13,xmm3
 307: c4 41 1b 59 e5        vmulsd xmm12,xmm12,xmm13
 30c: c4 41 1b 58 d2        vaddsd xmm10,xmm12,xmm10
 311: c5 2b 58 15 57 04 00  vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
 318: 00
 319: 81 ff 90 2f 50 09     cmp    edi,0x9502f90
 31f: 0f 84 08 00 00 00     je     32d <wasm[0]::function[3]+0x2ed>
 325: 48 89 fe              mov    rsi,rdi
 328: e9 98 ff ff ff        jmp    2c5 <wasm[0]::function[3]+0x285>
```
Steps to Reproduce
Compile the two C cases into Wasm:
```
emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.fast.c -o flops.fast.wasm
emcc -O2 -s WASM=1 -s TOTAL_MEMORY=512MB flops.slow.c -o flops.slow.wasm
```
Run the Wasm in Wasmtime and collect the execution time using `time`:

```
time wasmtime flops.fast.wasm
time wasmtime flops.slow.wasm
```
Versions and Environment
- Wasmtime: f10d665bb9cacceee6051989e99c0934c542ad84
- WasmEdge: 777d8b4ba50567b3b39aa8c9d81a6dfade355af8
- Wamr: 1ff41ebdc257b45c9f574d8d1f7ee6df4be79645
- Emscripten:
  ```
  emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.45 (ef3e4e3b044de98e1811546e0bc605c65d3412f4)
  clang version 18.0.0 (https://github.com/llvm/llvm-project d1e685df45dc5944b43d2547d0138cd4a3ee4efe)
  Target: wasm32-unknown-emscripten
  Thread model: posix
  InstalledDir: /home/ringzzz/emsdk/upstream/bin
  ```
- Kernel version: Linux 5.15.0-83-generic
hungryzzz added the bug label to Issue #7085.
hungryzzz added the cranelift label to Issue #7085.
cfallin commented on issue #7085:
I notice that the first loop has a divide instruction (`vdivsd`) in it, while the second doesn't. That could explain a large difference in performance. Surprisingly, the first loop is actually supposedly the fast case.

A few questions (I don't have Emscripten installed and don't have the bandwidth to try to reproduce locally, but want to guide exploration here):
- Can you check that the loop in the Wasm bytecode is identical, except that the printfs are moved?
- Can you try to deduce why the divide instruction is appearing in only one loop and not the other?
- Can you obtain some performance-counter stats from the runs? Number of retired instructions, number of cache misses, number of branch mispredicts, that sort of thing (see the sketch below for one way to collect these). This could uncover something else going on (e.g. if instruction-retired count is much higher than we expect, there may be a difference elsewhere).
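For reference, a minimal way to gather such counters, assuming `perf` is available (this is the same invocation that appears later in the thread):

```
perf stat -d wasmtime flops.fast.wasm
perf stat -d wasmtime flops.slow.wasm
```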
hungryzzz commented on issue #7085:
@cfallin Thank you very much for your quick reply. I also attach the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1
I've checked the corresponding `wat` files and found that moving the `printf` statements actually causes Emscripten to produce different Wasm bytecode beyond just the moved printfs. Specifically, the Wasm bytecode of the loop that follows the `printf` statements is slightly different.

The difference is related to the variable `x` in the source code. In the fast version, `x` is recalculated at the beginning of every loop iteration even though it is a constant, while in the slow version `x` is treated as a constant (`f64.const 0x1.cc8ff6689d97dp-28`). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.

![Screenshot 2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)
![Screenshot 2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)
Question 2

Based on the answer to question 1, the following instructions (`vmulsd`, `vdivsd`) calculate the variable `x` (`x = piref / ( three * (double)m );`):

```
# flops.fast.c
 2f5: c5 4b 59 25 73 04 00  vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
 2fc: 00
 2fd: c4 41 53 5e fc        vdivsd xmm15,xmm5,xmm12
```

As for question 3, I am still working on it.
hungryzzz edited a comment on issue #7085:
@cfallin Thank you very much for your quick reply. I also attach the corresponding Wasm files so you can run them in Wasmtime without Emscripten if you like.

Question 1

I've checked the corresponding `wat` files and found that moving the `printf` statements actually causes Emscripten to produce different Wasm bytecode beyond just the moved printfs. Specifically, the Wasm bytecode of the loop that follows the `printf` statements is slightly different.

The difference is related to the variable `x` in the source code. In the fast version, `x` is recalculated at the beginning of every loop iteration even though it is a constant, while in the slow version `x` is treated as a constant (`f64.const 0x1.cc8ff6689d97dp-28`). So it is very surprising that the fast version has more instructions in the loop than the slow version, yet the execution time is the opposite.

However, even though the Wasm files are slightly different, their execution times in WasmEdge and Wamr are the same. So I think this is Wasmtime's fault.

![Screenshot 2023-09-26 15 19 54](https://github.com/bytecodealliance/wasmtime/assets/32137313/1d75c95d-cc1f-49ad-9e38-7192308d6309)
![Screenshot 2023-09-26 15 17 21](https://github.com/bytecodealliance/wasmtime/assets/32137313/2465bc24-2d83-48ab-b2bf-a34daed2266f)
Question 2

Based on the answer to question 1, the following instructions (`vmulsd`, `vdivsd`) calculate the variable `x` (`x = piref / ( three * (double)m );`):

```
# flops.fast.c
 2f5: c5 4b 59 25 73 04 00  vmulsd xmm12,xmm6,QWORD PTR [rip+0x473]        # 770 <wasm[0]::function[3]+0x730>
 2fc: 00
 2fd: c4 41 53 5e fc        vdivsd xmm15,xmm5,xmm12
```
Question 3

I use `perf` to profile and collect the performance counters; the results are as follows.

```
> sudo perf stat -d wasmtime flops.fast.wasm

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     4       0.0000        0.0000      0.0000

 Performance counter stats for 'wasmtime flops.fast.wasm':

            624.93 msec task-clock                #    1.000 CPUs utilized
                12      context-switches          #   19.202 /sec
                 0      cpu-migrations            #    0.000 /sec
               851      page-faults               #    1.362 K/sec
     2,286,312,939      cycles                    #    3.659 GHz
     3,766,795,833      instructions              #    1.65  insn per cycle
       315,471,390      branches                  #  504.811 M/sec
            85,240      branch-misses             #    0.03% of all branches
       316,824,991      L1-dcache-loads           #  506.977 M/sec
           294,058      L1-dcache-load-misses     #    0.09% of all L1-dcache accesses
            69,425      LLC-loads                 #  111.093 K/sec
            23,580      LLC-load-misses           #   33.96% of all LL-cache accesses

       0.625042358 seconds time elapsed

       0.625360000 seconds user
       0.000000000 seconds sys
```

```
> sudo perf stat -d wasmtime flops.slow.wasm

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     4       0.0000        0.0000      0.0000

 Performance counter stats for 'wasmtime flops.slow.wasm':

          2,390.93 msec task-clock                #    1.000 CPUs utilized
                19      context-switches          #    7.947 /sec
                 1      cpu-migrations            #    0.418 /sec
               852      page-faults               #  356.347 /sec
     8,797,759,453      cycles                    #    3.680 GHz
     3,299,916,619      instructions              #    0.38  insn per cycle
       315,808,198      branches                  #  132.086 M/sec
            94,853      branch-misses             #    0.03% of all branches
       317,397,865      L1-dcache-loads           #  132.751 M/sec
           295,131      L1-dcache-load-misses     #    0.09% of all L1-dcache accesses
            67,453      LLC-loads                 #   28.212 K/sec
            22,840      LLC-load-misses           #   33.86% of all LL-cache accesses

       2.391182093 seconds time elapsed

       2.391560000 seconds user
       0.000000000 seconds sys
```
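These counters already point to a latency problem rather than extra work: both runs retire a similar number of instructions and branches, but the slow run needs roughly 3.8x the cycles, so the per-cycle throughput collapses:

```
fast: 3,766,795,833 insns / 2,286,312,939 cycles ≈ 1.65 IPC
slow: 3,299,916,619 insns / 8,797,759,453 cycles ≈ 0.38 IPC
```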
Actually, I tried to profile only the JITed code execution instead of the whole Wasmtime process, but I don't know how to do that. However, following the "Profiling with perf" instructions (perfmap), I found that the JITed code accounts for 98% of the total execution time, so I think the above results are a reasonably accurate representation of the JITed code execution.
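For reference, one way to attribute samples to the JITed code is Wasmtime's perfmap support (this assumes a Wasmtime build whose CLI accepts `--profile=perfmap`; the exact flag spelling has varied across releases):

```
perf record -g wasmtime run --profile=perfmap flops.slow.wasm
perf report
```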
alexcrichton commented on issue #7085:
I dug into this with @fitzgen yesterday and we were extremely surprised by everything going on here. Ultimately our investigation bore no fruit but I wanted to summarize and write down what we tried out anyway.
I was able to confirm everything that @hungryzzz is seeing (thanks for the detailed bug and putting in the effort here!). This alone was surprising:
- The loop with `fdiv` was faster
- The "bigger loop" was faster (`flops.fast.wasm` has a larger loop, instruction-encoding-wise, than `flops.slow.wasm`)

The two wasms are indeed different, so it's not like there's a correctness bug in Cranelift, but there still seems to be performance left on the table, because the two source programs are semantically the same yet perform so differently. As for the source-level differences causing differences in the wasm, my assumption is that LLVM on the wasm target assumes that `printf` may modify all local variables. This means that in the "fast" program the prints sit below some constants, so those values aren't considered constant in the loop, while in the "slow" program the values are all constant for the loop and are hoisted out. (This isn't a mistake in my writing; I understand how weird this sounds.)
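As a hypothetical illustration of that assumption (this is not the original flops.c source, just a minimal sketch of the shape being described):

```c
#include <stdio.h>

/* File-scope values standing in for the benchmark's constants. */
double piref = 3.14159265358979324;
double three = 3.0;
long   m     = 156250000;

double kernel(void) {
    double s = 0.0;
    /* "fast" shape: with a printf between the constant setup and the loop,
       the compiler (per the assumption above) can no longer treat these
       globals as unchanged, so x is recomputed -- a mul and a div -- on
       every iteration. */
    printf("starting\n");
    for (long i = 1; i <= m; i++) {
        double x = piref / (three * (double)m);
        s += x * (double)i;   /* stand-in for the real polynomial */
    }
    /* "slow" shape: moving the printf out of this position lets x fold to a
       single f64 constant that is hoisted out of the loop. */
    return s;
}
```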
Nick and I experimented with a few changes to Cranelift to try to see what was going on:
Don't load constants from memory
In the "slow" program constants in floating-point operations are sunk into the constant pool and are memory operands in instructions, for example:
```
 2cc: c5 1b 59 25 8c 04 00  vmulsd xmm12,xmm12,QWORD PTR [rip+0x48c]        # 760 <wasm[0]::function[3]+0x720>
 ...
 311: c5 2b 58 15 57 04 00  vaddsd xmm10,xmm10,QWORD PTR [rip+0x457]        # 770 <wasm[0]::function[3]+0x730>
 318: 00
```
We tried not doing this and instead forcing materialization of the constant. This was a `movq` of a 64-bit value into a general-purpose register, followed by a `vmovd` of that register into an xmm register, followed by use of the xmm register.
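Roughly, the materialized form replaces the memory operand with something like this (a sketch, not the exact Cranelift output; the immediate is the bit pattern of the `f64` constant `0x1.cc8ff6689d97dp-28` quoted earlier):

```asm
movabs $0x3e3cc8ff6689d97d, %rax    # 64-bit immediate: the f64 constant's bits
vmovq  %rax, %xmm15                 # move the bits into an xmm register
vmulsd %xmm15, %xmm12, %xmm12       # use the register instead of a [rip+disp] load
```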
This improved performance of the "slow" program to the point of being faster than the "fast" program, as one would expect, since a loop without `fdiv` should be faster than one with it. Additionally, this approached the performance of the native program compiled with `-Os` (to thwart vectorization).

Don't rematerialize float constants
Next we thought to improve the "fast" loop by hoisting out its recalculation of the `x` constant each time the loop turns. We removed the explicit rematerialization of floating-point constants. This, surprisingly, slowed down both programs: both were then just as slow as the original "slow" program. Both source changes resulted in the same loop:

```
 1.63 │2ad:┌─→lea       0x1(%rdi),%esi
      │    │  vcvtsi2sd %edi,%xmm2,%xmm2
 6.14 │    │  vmulsd    %xmm5,%xmm2,%xmm2
 6.71 │    │  vmulsd    %xmm2,%xmm2,%xmm13
 6.32 │    │  vmulsd    %xmm13,%xmm11,%xmm2
 6.53 │    │  vaddsd    %xmm1,%xmm2,%xmm2
 7.39 │    │  vmulsd    %xmm2,%xmm13,%xmm2
 6.92 │    │  vaddsd    %xmm7,%xmm2,%xmm2
 6.68 │    │  vmulsd    %xmm2,%xmm13,%xmm2
 6.79 │    │  vaddsd    %xmm6,%xmm2,%xmm2
 6.05 │    │  vmulsd    %xmm2,%xmm13,%xmm2
 6.54 │    │  vaddsd    %xmm4,%xmm2,%xmm2
 6.00 │    │  vmulsd    %xmm2,%xmm13,%xmm2
 6.90 │    │  vaddsd    %xmm3,%xmm2,%xmm2
 6.23 │    │  vmulsd    %xmm2,%xmm13,%xmm2
 6.88 │    │  vaddsd    %xmm14,%xmm2,%xmm2
 6.29 │    │  vaddsd    %xmm0,%xmm2,%xmm14
      │    │  cmp       $0x9502f90,%esi
      │    │↓ je        306
      │    │  mov       %rsi,%rdi
      │    └──jmpq      2ad
      │306:   mov       0x20(%rsp),%rcx
```
What was extremely surprisin
[message truncated]
alexcrichton commented on issue #7085:
Oh I should also say we never bottomed out why Cranelift would sink the calculation of `x` into the loop. In the source program this isn't in the loop, so somehow Cranelift ended up putting the calculation into the loop, and we never fully bottomed that out. The reason we didn't try to take a look was that the program which sunk the calculation into the loop was faster (counter-intuitively). This is probably an independent bug, however.
hungryzzz commented on issue #7085:
@alexcrichton @cfallin Hi, I extracted the loop code snippets from the whole binary and wrapped them in inline assembly, and I find the same execution-time difference between the two snippets. I think this makes further experiments more convenient, so I'm pasting it here.

- `test_quick()`: 0.61s
- `test_slow()`: 2.38s

```c
#include <stdint.h>

double three = 3.0;
double one = 1.0;
double B1 = -0.4999999999982;
double B2 = 0.4166666664651E-1;
double B3 = -0.1388888805755E-2;
double B4 = 0.24801428034E-4;
double B5 = -0.2754213324E-6;
double B6 = 0.20189405E-8;
double piref = 3.14159265358979324;
double m = 156250000;
double x = 6.70206E-9;
double s = 1000.0;

uint64_t test_quick(){
    __asm__ __volatile__(
        "vmovsd %0,%%xmm5\n\t"   // piref
        "vmovsd %1,%%xmm6\n\t"   // m
        "vmovsd %2,%%xmm13\n\t"  // B2
        "vmovsd %3,%%xmm14\n\t"  // B3
        "vmovsd %4,%%xmm0\n\t"   // B4
        "vmovsd %5,%%xmm2\n\t"   // B5
        "vmovsd %6,%%xmm3\n\t"   // B6
        "vmovsd %7,%%xmm9\n\t"   // s
        "mov $1,%%rcx\n\t"
        "START:\n\t"
        "lea 0x1(%%rcx),%%esi\n\t"
        "vmulsd %8,%%xmm6,%%xmm12\n\t"   // three
        "vdivsd %%xmm12,%%xmm5,%%xmm15\n\t"
        "vcvtsi2sd %%ecx,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm15,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm12,%%xmm7\n\t"
        "vmulsd %%xmm7,%%xmm3,%%xmm12\n\t"
        "vaddsd %%xmm2,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm0,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm14,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm13,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "movdqu %9,%%xmm8\n\t"           // B1
        "vaddsd %%xmm8,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm9,%%xmm12,%%xmm12\n\t"
        "vaddsd %%xmm12,%%xmm11,%%xmm9\n\t"
        "cmp $0x9502f90,%%esi\n\t"
        "je EXIT\n\t"
        "mov %%rsi,%%rcx\n\t"
        "jmp START\n\t"
        "EXIT:\n\t"
        :"=m"(piref),"=m"(m),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(three),"=m"(B1)
        :
        :"rcx","rsi");
    return 0;
}

uint64_t test_quick_mod(){
    __asm__ __volatile__(
        "vmovsd %0,%%xmm5\n\t"   // piref
        "vmovsd %1,%%xmm6\n\t"   // m
        "vmovsd %2,%%xmm13\n\t"  // B2
        "vmovsd %3,%%xmm14\n\t"  // B3
        "vmovsd %4,%%xmm0\n\t"   // B4
        "vmovsd %5,%%xmm2\n\t"   // B5
        "vmovsd %6,%%xmm3\n\t"   // B6
        "vmovsd %7,%%xmm9\n\t"   // s
        "mov $1,%%rcx\n\t"
        "START1:\n\t"
        "lea 0x1(%%rcx),%%esi\n\t"
        "vmulsd %8,%%xmm6,%%xmm12\n\t"   // three
        "vdivsd %%xmm12,%%xmm5,%%xmm15\n\t"
        "vcvtsi2sd %%ecx,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm15,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm12,%%xmm7\n\t"
        "vmulsd %%xmm7,%%xmm3,%%xmm12\n\t"
        "vaddsd %%xmm2,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm0,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm14,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm13,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "movdqu %9,%%xmm8\n\t"           // B1
        "vaddsd %%xmm8,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm7,%%xmm12\n\t"
        "vaddsd %%xmm9,%%xmm12,%%xmm12\n\t"
        "vaddsd %%xmm12,%%xmm11,%%xmm9\n\t"
        "cmp $0x9502f90,%%esi\n\t"
        "je EXIT1\n\t"
        "mov %%rsi,%%rcx\n\t"
        "jmp START1\n\t"
        "EXIT1:\n\t"
        :"=m"(piref),"=m"(m),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(three),"=m"(B1)
        :
        :"rcx","rsi");
    return 0;
}

uint64_t test_slow(){
    __asm__ __volatile__(
        "vmovsd %0,%%xmm3\n\t"   // B1
        "vmovsd %1,%%xmm4\n\t"   // B2
        "vmovsd %2,%%xmm5\n\t"   // B3
        "vmovsd %3,%%xmm6\n\t"   // B4
        "vmovsd %4,%%xmm8\n\t"   // B5
        "vmovsd %5,%%xmm9\n\t"   // B6
        "vmovsd %6,%%xmm10\n\t"
        "mov $1,%%rsi\n\t"
        "START2:\n\t"
        "lea 0x1(%%rsi),%%edi\n\t"
        "vcvtsi2sd %%esi,%%xmm12,%%xmm12\n\t"
        "vmulsd %7,%%xmm12,%%xmm12\n\t"  // x
        "vmulsd %%xmm12,%%xmm12,%%xmm12\n\t"
        "vmulsd %%xmm12,%%xmm9,%%xmm13\n\t"
        "vaddsd %%xmm8,%%xmm13,%%xmm13\n\t"
        "vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
        "vaddsd %%xmm6,%%xmm13,%%xmm13\n\t"
        "vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
        "vaddsd %%xmm5,%%xmm13,%%xmm13\n\t"
        "vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
        "vaddsd %%xmm4,%%xmm13,%%xmm13\n\t"
        "vmulsd %%xmm13,%%xmm12,%%xmm13\n\t"
        "vaddsd %%xmm3,%%xmm13,%%xmm13\n\t"
        "vmulsd %%xmm13,%%xmm12,%%xmm12\n\t"
        "vaddsd %%xmm10,%%xmm12,%%xmm10\n\t"
        "vaddsd %8,%%xmm10,%%xmm10\n\t"
        "cmp $0x9502f90,%%edi\n\t"
        "je EXIT2\n\t"
        "mov %%rdi,%%rsi\n\t"
        "jmp START2\n\t"
        "EXIT2:\n\t"
        :"=m"(B1),"=m"(B2),"=m"(B3),"=m"(B4),"=m"(B5),"=m"(B6),"=m"(s),"=m"(x),"=m"(one)
        :
        :"rsi","rdi");
    return 0;
}

int main() {
    // test_quick();
    test_slow();
    return 0;
}
```
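If it helps, this reduction can be built and timed natively with something like the following (assuming gcc or clang on x86-64; the file name is just a placeholder). Switching which function `main()` calls selects which loop is timed:

```
gcc -O1 flops_loops.c -o flops_loops
time ./flops_loops
```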
alexcrichton commented on issue #7085:
Oh wow that's super helpful, thank you so much for that!
alexcrichton commented on issue #7085:
Aha, we have a fix! Turns out this is a false-dependency issue where `vcvtsi2sd` preserves the upper bits of one of its source registers when we don't actually want it to. That created a dependency on the previous iteration of the loop when it shouldn't have, which slowed everything down. Inserting a `vxorps` in front of the `vcvtsi2sd` fixes the issue.

With all the above learnings I'll look to send more PRs in the near future!
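Concretely, the fix breaks the loop-carried chain by zeroing the destination before the conversion; in terms of the "slow" loop above, the emitted pattern changes roughly like this (a sketch, not the exact Cranelift output):

```asm
# before: the conversion merges into %xmm12, so it must wait for the
# previous iteration's last write to %xmm12
vcvtsi2sd %esi, %xmm12, %xmm12

# after: zero %xmm12 first, so the conversion only depends on %esi
vxorps    %xmm12, %xmm12, %xmm12
vcvtsi2sd %esi, %xmm12, %xmm12
```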
alexcrichton commented on issue #7085:
AKA we are missing this (similar code from spidermonkey)
cfallin commented on issue #7085:
Wow, that's devious -- thank you @alexcrichton for working that out (and @fitzgen too)!
hungryzzz commented on issue #7085:
@alexcrichton Thank you very much for your quick reply! I have a question about the above solution. I also dumped the loop binaries generated by WasmEdge and Wamr (the results are the same; both depend on LLVM). I find that `vcvtsi2sd` is used alone there, yet their execution times are quick. So I am still confused: when should we insert a `vxorps` in front of `vcvtsi2sd`? Or, when is the slowdown triggered? Thanks again!

```
# The loop binary generated by Wamr. Loop unrolling has been performed.
 410: c5 9b 59 c4        vmulsd %xmm4,%xmm12,%xmm0
 414: c5 fb 59 c0        vmulsd %xmm0,%xmm0,%xmm0
 418: c5 cb 59 c8        vmulsd %xmm0,%xmm6,%xmm1
 41c: c5 f3 58 cd        vaddsd %xmm5,%xmm1,%xmm1
 420: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 424: c5 f3 58 cb        vaddsd %xmm3,%xmm1,%xmm1
 428: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 42c: c5 f3 58 ca        vaddsd %xmm2,%xmm1,%xmm1
 430: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 434: c5 8b 58 c9        vaddsd %xmm1,%xmm14,%xmm1
 438: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 43c: c5 93 58 c9        vaddsd %xmm1,%xmm13,%xmm1
 440: c5 fb 59 c1        vmulsd %xmm1,%xmm0,%xmm0
 444: c5 fb 58 c7        vaddsd %xmm7,%xmm0,%xmm0
 448: c5 a3 58 f8        vaddsd %xmm0,%xmm11,%xmm7
 44c: 3d 90 2f 50 09     cmp    $0x9502f90,%eax
 451: 74 4c              je     49f <aot_func#1+0x47f>
 453: c5 83 2a c0        vcvtsi2sd %eax,%xmm15,%xmm0
 457: c5 9b 59 c0        vmulsd %xmm0,%xmm12,%xmm0
 45b: c5 fb 59 c0        vmulsd %xmm0,%xmm0,%xmm0
 45f: c5 cb 59 c8        vmulsd %xmm0,%xmm6,%xmm1
 463: c5 f3 58 cd        vaddsd %xmm5,%xmm1,%xmm1
 467: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 46b: c5 f3 58 cb        vaddsd %xmm3,%xmm1,%xmm1
 46f: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 473: c5 f3 58 ca        vaddsd %xmm2,%xmm1,%xmm1
 477: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 47b: c5 8b 58 c9        vaddsd %xmm1,%xmm14,%xmm1
 47f: c5 fb 59 c9        vmulsd %xmm1,%xmm0,%xmm1
 483: c5 93 58 c9        vaddsd %xmm1,%xmm13,%xmm1
 487: c5 fb 59 c1        vmulsd %xmm1,%xmm0,%xmm0
 48b: c5 fb 58 c7        vaddsd %xmm7,%xmm0,%xmm0
 48f: c5 a3 58 f8        vaddsd %xmm0,%xmm11,%xmm7
 493: c5 ab 58 e4        vaddsd %xmm4,%xmm10,%xmm4
 497: 83 c0 02           add    $0x2,%eax
 49a: e9 71 ff ff ff     jmpq   410 <aot_func#1+0x3f0>
```
alexcrichton commented on issue #7085:
Good question! It turns out that LLVM is quite clever about this. The semantics of `vcvtsi2sd %eax,%xmm15,%xmm0` are that it converts the 32-bit value in `%eax` into a 64-bit double and places it in the lower bits of `%xmm0`, and the upper 64 bits of the `%xmm15` register are additionally copied into the upper bits of `%xmm0`. This means that `vcvtsi2sd` depends on two registers, whereas the int-to-float operation is typically considered a unary operation depending only on `%eax`. That's the bug in Cranelift: we don't consider the `%xmm15` dependency and filled in `%xmm0` there, meaning the instruction depends on something we didn't intend.

LLVM, however, already knows this, and outside the loop you'll find a `vxorps` (or something similar) to zero out the `%xmm15` register. Because `%xmm15` isn't written by the loop, it stays zero for the entire loop, and the zeroing operation was hoisted out of the loop. Effectively LLVM set aside a loop-invariant dependency chain for `vcvtsi2sd` to use.

In Cranelift we're not really positioned to be quite so clever, so our solution is going to be to use `vxorps` before all int-to-float instructions.
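In loop form, the difference between the two approaches looks roughly like this (a sketch; register choices are illustrative):

```asm
# LLVM: dedicate a register the loop never writes (here %xmm15), so the
# conversion's second input is loop-invariant and carries no
# iteration-to-iteration dependency
head:
    vcvtsi2sd %eax, %xmm15, %xmm0     # upper bits come from the never-written %xmm15
    # ... rest of the loop body, which never touches %xmm15 ...
    jmp head

# Cranelift/SpiderMonkey-style: zero the destination right before every
# conversion, breaking the dependency unconditionally
head2:
    vxorps    %xmm0, %xmm0, %xmm0
    vcvtsi2sd %eax, %xmm0, %xmm0
    # ...
    jmp head2
```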
hungryzzz commented on issue #7085:
> Good question! It turns out that LLVM is quite clever about this. The semantics of `vcvtsi2sd %eax,%xmm15,%xmm0` are that it converts the 32-bit value in `%eax` into a 64-bit double and places it in the lower bits of `%xmm0`, and the upper 64 bits of the `%xmm15` register are additionally copied into the upper bits of `%xmm0`. This means that `vcvtsi2sd` depends on two registers, whereas the int-to-float operation is typically considered a unary operation depending only on `%eax`. That's the bug in Cranelift: we don't consider the `%xmm15` dependency and filled in `%xmm0` there, meaning the instruction depends on something we didn't intend.
>
> LLVM, however, already knows this, and outside the loop you'll find a `vxorps` (or something similar) to zero out the `%xmm15` register. Because `%xmm15` isn't written by the loop, it stays zero for the entire loop, and the zeroing operation was hoisted out of the loop. Effectively LLVM set aside a loop-invariant dependency chain for `vcvtsi2sd` to use.
>
> In Cranelift we're not really positioned to be quite so clever, so our solution is going to be to use `vxorps` before all int-to-float instructions.

Hi, I tried to find the instruction outside the loop that zeroes the `%xmm15` register, but I find nothing. So I guess maybe the trick of using `vcvtsi2sd` this way is related to the register allocation algorithm in LLVM?

![Screenshot 2023-09-27 23 11 12](https://github.com/bytecodealliance/wasmtime/assets/32137313/cbf54a77-22cc-46e7-a037-bbb416622f23)
alexcrichton commented on issue #7085:
Ah your above snippet didn't have the code before the loop and I don't see the loop in your screenshot there, so I can't say for sure. In the native LLVM output, which I suspect is similar to what you're seeing, the vxorps of xmm15 ahead of the loop is what I was seeing. Regardless though xmm15 isn't used during the loop, so there's no dependency chain that the cpu needs to worry about anyway. Technically vxorps for xmm15 isn't necessary given the duration of this loop anyway.
hungryzzz commented on issue #7085:
Hi, the whole binary file generated by Wamr is attached. I find there is no instruction outside the loop that zeroes the %xmm15 register; as you can see, there is only one occurrence of `xmm15` (`vcvtsi2sd %eax,%xmm15,%xmm0`) in the file. So I'm wondering at which stage LLVM handles this special instruction.

In my understanding, the SpiderMonkey solution (insert a `vxorps` in front of every `vcvtsi2sd`) happens at code emission (maybe?). I just wonder whether this register-dependency problem could instead be solved during register allocation?

Thanks again! (I really hope my questions don't take up too much of your time...)

![Screenshot 2023-09-27 23 25 49](https://github.com/bytecodealliance/wasmtime/assets/32137313/9971a87b-bb9c-43e0-a1e5-2ce771c7c5c1)

wamr.slow.binary
alexcrichton commented on issue #7085:
No worries! Indeed you're right in that the entire file doesn't touch xmm15 except for that instruction. From a register-dependency perspective that's ok. The SpiderMonkey solution is a simple "I have no context but want to break dependencies" solution where LLVM is taking a more complicated "I have total context and want to break the dependency" perspective. This can theoretically be solved with register allocation but our register allocator isn't built for that sort of constraint.
Basically there are really low-level details in play for why this is slow in Cranelift and fast in LLVM. I can only guess at the heuristics and optimizations in LLVM; I'm not familiar with LLVM's source code myself. In Cranelift we're likely to take the simple approach of feeding an `xorps` into `cvtsi2s{s,d}` to break dependency chains. This is suboptimal since, as LLVM shows, it's not strictly necessary in 100% of cases. That being said, it does break 100% of the dependency chains, so this issue won't resurface. The follow-on cost is perhaps a slight hit to the instruction cache, but that's hopefully not a 4x difference like you're seeing here.
hungryzzz commented on issue #7085:
Got it! Thank you very, very much @alexcrichton!
alexcrichton commented on issue #7085:
I've confirmed that locally at least https://github.com/bytecodealliance/wasmtime/pull/7098 fixes the performance here.
Thanks again for reporting this issue @hungryzzz!
alexcrichton closed issue #7085.