hungryzzz opened issue #6287:
Description
Hi, I run the following case in different Wasm runtimes(after being compiled by
Emscripten
), and I find some performance differences betweenwasmtime
and other 3 runtimes.// random.c #include <stdio.h> int main() { int N = 10000000, last = 42; while (N--) { last = (last + 33) % 13; // compound operations are + and %. } printf("%d\n", last); return(0); }
Test Case
Steps to Reproduce
Compile the above C case using Emscripten
emcc -sENVIRONMENT=shell -O2 -s WASM=1 -s TOTAL_MEMORY=512MB random.c -o random.wasm
Execute the wasm file in different wasm runtimes and collect the execution time, all the compilation and execution options are default.
Actual Results
The execution time(collected by
perf-tool
, probe begins when starting to execute the wasm code(invoke_func
inwasmtime
) and end insched:sched_process_exit
) inwasmtime
is 3x slower than that in other 3 runtimes.And if I replace compound operations to other combinations, I find that in some situations, the execution time in
wasmtime
is 3x slower than that in other 3 runtimes, but sometimes not. The details are as follow:
\*/ +% +/ +\* wasmtime 79362.79 us 85006.50 us 73508.21 us 1577.37 us wasmer 28642.15 us 28935.57 us 19977.13 us 1965.65 us wasmedge 22462.40 us 24814.40 us 74.27 us 171.18 us wamr 23018.67 us 25328.92 us 741.49 us 856.79 us So I am wondering that there maybe some performance bugs when handling such compound operations.
Versions and Environment
- Hardware Architecture: x86_64
- Operating system: Ubuntu 20.04
- CPU: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
- Memory: 32GB
Emscripten
- emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.24 (68a9f990429e0bcfb63b1cde68bad792554350a5)
clang version 16.0.0 (https://github.com/llvm/llvm-project 277c382760bf9575cfa2eac73d5ad1db91466d3f)
Target: wasm32-unknown-emscripten
Thread model: posixWasm runtime version
- wasmer: wasmer 3.2.0
- wasmtime: wasmtime-cli 8.0.0 (build from commit a3d9c2ca33a44220bbc82488e7064c6ae045a1dd)
- wasmedge: build from commit https://github.com/WasmEdge/WasmEdge/commit/381b7b28049b968297e6a585b92d1cba955def66
- wamr: iwasm 1.1.2
hungryzzz labeled issue #6287:
Description
Hi, I run the following case in different Wasm runtimes(after being compiled by
Emscripten
), and I find some performance differences betweenwasmtime
and other 3 runtimes.// random.c #include <stdio.h> int main() { int N = 10000000, last = 42; while (N--) { last = (last + 33) % 13; // compound operations are + and %. } printf("%d\n", last); return(0); }
Test Case
Steps to Reproduce
Compile the above C case using Emscripten
emcc -sENVIRONMENT=shell -O2 -s WASM=1 -s TOTAL_MEMORY=512MB random.c -o random.wasm
Execute the wasm file in different wasm runtimes and collect the execution time, all the compilation and execution options are default.
Actual Results
The execution time(collected by
perf-tool
, probe begins when starting to execute the wasm code(invoke_func
inwasmtime
) and end insched:sched_process_exit
) inwasmtime
is 3x slower than that in other 3 runtimes.And if I replace compound operations to other combinations, I find that in some situations, the execution time in
wasmtime
is 3x slower than that in other 3 runtimes, but sometimes not. The details are as follow:
\*/ +% +/ +\* wasmtime 79362.79 us 85006.50 us 73508.21 us 1577.37 us wasmer 28642.15 us 28935.57 us 19977.13 us 1965.65 us wasmedge 22462.40 us 24814.40 us 74.27 us 171.18 us wamr 23018.67 us 25328.92 us 741.49 us 856.79 us So I am wondering that there maybe some performance bugs when handling such compound operations.
Versions and Environment
- Hardware Architecture: x86_64
- Operating system: Ubuntu 20.04
- CPU: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
- Memory: 32GB
Emscripten
- emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.24 (68a9f990429e0bcfb63b1cde68bad792554350a5)
clang version 16.0.0 (https://github.com/llvm/llvm-project 277c382760bf9575cfa2eac73d5ad1db91466d3f)
Target: wasm32-unknown-emscripten
Thread model: posixWasm runtime version
- wasmer: wasmer 3.2.0
- wasmtime: wasmtime-cli 8.0.0 (build from commit a3d9c2ca33a44220bbc82488e7064c6ae045a1dd)
- wasmedge: build from commit https://github.com/WasmEdge/WasmEdge/commit/381b7b28049b968297e6a585b92d1cba955def66
- wamr: iwasm 1.1.2
bjorn3 commented on issue #6287:
Division and modulo are relatively slow on most cpu's. For small constant divisors there exist faster methods to perform these operations using multiplication and shifting which Cranelift doesn't currently support. This may be the reason that it is slower. I haven't looked at the disassembly of any of the other wasm engines though. By the way are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too which should perform almost identically to Wasmtime.
jameysharp commented on issue #6287:
I thought I'd take a quick look at what code we generate for this program. It's interesting that LLVM has unrolled the loop four times, but that should be fine I think.
In the wasm, the loop body is four copies of:
i32.const 33 i32.add i32.const 13 i32.rem_u
We compile that to four copies of this sequence:
000000a6 8d 42 21 lea eax, [rdx + 0x21] 000000a9 48 31 d2 xor rdx, rdx 000000ac 41 f7 f3 div r11d
(Except the first copy also loads the constant 13 into
%r11
. In this case, rematerialization is hurting us slightly since we have plenty of free registers.)I don't think we can do better than that except, as @bjorn3 says, by turning the division into multiplication by magic constants.
jameysharp commented on issue #6287:
Okay, that said, the bottom of the loop isn't ideal:
000000c1 41 83 ea 04 sub r10d, 4 000000c5 45 85 d2 test r10d, r10d 000000c8 0f 84 08 00 00 00 je 0xd6 000000ce 49 89 d3 mov r11, rdx 000000d1 e9 c0 ff ff ff jmp 0x96
We have the value of
last
in%r11
on entry to the loop, but during the first of the four unrolled copies of the loop body we switch over to keeping it in%rdx
, due to the register constraint on thediv
instruction on x86. So at the end of the loop we have to copy it back to%r11
if the loop is continuing. Since the move is only necessary on the critical edge, regalloc2 inserts it between the two branch instructions, which inhibits the machbuffer branch optimizations.cc @elliottt who might find this an interesting case to think about.
I don't think this particular issue should have much of any effect on performance in anything other than the tiniest microbenchmarks, though. I suspect the cost of the division instruction is much more significant.
hungryzzz commented on issue #6287:
Division and modulo are relatively slow on most cpu's. For small constant divisors there exist faster methods to perform these operations using multiplication and shifting which Cranelift doesn't currently support. This may be the reason that it is slower. I haven't looked at the disassembly of any of the other wasm engines though. By the way are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too which should perform almost identically to Wasmtime.
Hi, I use the Cranelift backend of Wamser(default option), but the version(0.98) maybe different with Wasmtime. And I also use the LLVM backend of Wamser, I get the execution time is 25228.15 us, which is a little bit quicker than using Cranelift in Wamser.
hungryzzz edited a comment on issue #6287:
Division and modulo are relatively slow on most cpu's. For small constant divisors there exist faster methods to perform these operations using multiplication and shifting which Cranelift doesn't currently support. This may be the reason that it is slower. I haven't looked at the disassembly of any of the other wasm engines though. By the way are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too which should perform almost identically to Wasmtime.
Hi, I use the Cranelift backend of Wamser(default option), but the version(0.98) maybe different with Wasmtime. And I also get the execution time 25228.15 us under LLVM backend of Wamser, which is a little bit quicker than using Cranelift in Wamser.
fitzgen commented on issue #6287:
It seems like we've bottomed this out as an issue in not doing the magic constant for divisions optimization anymore, and adding that back is tracked in https://github.com/bytecodealliance/wasmtime/issues/6049, so I'm going to go ahead and close this issue. Thanks for bringing it to our attention @hungryzzz!
fitzgen closed issue #6287:
Description
Hi, I run the following case in different Wasm runtimes(after being compiled by
Emscripten
), and I find some performance differences betweenwasmtime
and other 3 runtimes.// random.c #include <stdio.h> int main() { int N = 10000000, last = 42; while (N--) { last = (last + 33) % 13; // compound operations are + and %. } printf("%d\n", last); return(0); }
Test Case
Steps to Reproduce
Compile the above C case using Emscripten
emcc -sENVIRONMENT=shell -O2 -s WASM=1 -s TOTAL_MEMORY=512MB random.c -o random.wasm
Execute the wasm file in different wasm runtimes and collect the execution time, all the compilation and execution options are default.
Actual Results
The execution time(collected by
perf-tool
, probe begins when starting to execute the wasm code(invoke_func
inwasmtime
) and end insched:sched_process_exit
) inwasmtime
is 3x slower than that in other 3 runtimes.And if I replace compound operations to other combinations, I find that in some situations, the execution time in
wasmtime
is 3x slower than that in other 3 runtimes, but sometimes not. The details are as follow:
\*/ +% +/ +\* wasmtime 79362.79 us 85006.50 us 73508.21 us 1577.37 us wasmer 28642.15 us 28935.57 us 19977.13 us 1965.65 us wasmedge 22462.40 us 24814.40 us 74.27 us 171.18 us wamr 23018.67 us 25328.92 us 741.49 us 856.79 us So I am wondering that there maybe some performance bugs when handling such compound operations.
Versions and Environment
- Hardware Architecture: x86_64
- Operating system: Ubuntu 20.04
- CPU: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
- Memory: 32GB
Emscripten
- emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.24 (68a9f990429e0bcfb63b1cde68bad792554350a5)
clang version 16.0.0 (https://github.com/llvm/llvm-project 277c382760bf9575cfa2eac73d5ad1db91466d3f)
Target: wasm32-unknown-emscripten
Thread model: posixWasm runtime version
- wasmer: wasmer 3.2.0
- wasmtime: wasmtime-cli 8.0.0 (build from commit a3d9c2ca33a44220bbc82488e7064c6ae045a1dd)
- wasmedge: build from commit https://github.com/WasmEdge/WasmEdge/commit/381b7b28049b968297e6a585b92d1cba955def66
- wamr: iwasm 1.1.2
Last updated: Dec 23 2024 at 13:07 UTC