Stream: git-wasmtime

Topic: wasmtime / issue #6287 Performance issue in some kinds of...


view this post on Zulip Wasmtime GitHub notifications bot (Apr 26 2023 at 13:16):

hungryzzz opened issue #6287:

Description

Hi, I ran the following case in different Wasm runtimes (after compiling it with Emscripten), and I found some performance differences between wasmtime and the other three runtimes.

// random.c
#include <stdio.h>

int main() {
    int N = 10000000, last = 42;
    while (N--) {
        last = (last + 33) % 13;  // compound operations are + and %.
    }
    printf("%d\n", last);
    return 0;
}

Test Case

random.wasm.txt

Steps to Reproduce

  1. Compile the above C case using Emscripten
    emcc -sENVIRONMENT=shell -O2 -s WASM=1 -s TOTAL_MEMORY=512MB random.c -o random.wasm

  2. Execute the wasm file in the different wasm runtimes and collect the execution time; all compilation and execution options are left at their defaults.

Actual Results

The execution time (collected with the perf tool; the probe starts when the wasm code begins executing (invoke_func in wasmtime) and ends at sched:sched_process_exit) in wasmtime is roughly 3x slower than in the other three runtimes.

If I replace the compound operations with other combinations, I find that in some cases the execution time in wasmtime is roughly 3x slower than in the other three runtimes, but in other cases it is not. The details are as follows:

Runtime      */             +%             +/             +*
wasmtime     79362.79 us    85006.50 us    73508.21 us    1577.37 us
wasmer       28642.15 us    28935.57 us    19977.13 us    1965.65 us
wasmedge     22462.40 us    24814.40 us    74.27 us       171.18 us
wamr         23018.67 us    25328.92 us    741.49 us      856.79 us

So I am wondering whether there might be a performance bug in handling such combinations of operations.

Versions and Environment

Emscripten

Wasm runtime version

view this post on Zulip Wasmtime GitHub notifications bot (Apr 26 2023 at 13:16):

hungryzzz labeled issue #6287.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 26 2023 at 14:13):

bjorn3 commented on issue #6287:

Division and modulo are relatively slow on most CPUs. For small constant divisors there are faster ways to perform these operations using multiplication and shifting, which Cranelift doesn't currently support. That may be why it is slower here. I haven't looked at the disassembly from any of the other wasm engines, though. By the way, are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too, which should perform almost identically to Wasmtime.
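
For the curious, here is a minimal sketch of the multiply-and-shift trick described above, applied to the divisor 13 from this benchmark. This is my own illustration, not Cranelift's or any other engine's actual codegen; the constant 0x4EC4EC4F is ceil(2^34 / 13), in the style of Granlund and Montgomery's "Division by Invariant Integers using Multiplication".

// magic13.c - strength-reduced x / 13 and x % 13 for unsigned 32-bit x (illustrative sketch).
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t udiv13(uint32_t x) {
    // floor((x * ceil(2^34 / 13)) / 2^34) equals x / 13 for every 32-bit x:
    // one widening multiply plus a shift instead of a div instruction.
    return (uint32_t)(((uint64_t)x * 0x4EC4EC4Fu) >> 34);
}

static uint32_t umod13(uint32_t x) {
    // The remainder is recovered from the quotient with a multiply and a subtract.
    return x - udiv13(x) * 13;
}

int main(void) {
    // Spot-check the transformation against the hardware division.
    for (uint64_t x = 0; x < (1ull << 32); x += 99991)
        assert(udiv13((uint32_t)x) == (uint32_t)x / 13 &&
               umod13((uint32_t)x) == (uint32_t)x % 13);
    printf("%u\n", umod13(42 + 33));  // (42 + 33) % 13 == 10
    return 0;
}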

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2023 at 00:33):

jameysharp commented on issue #6287:

I thought I'd take a quick look at the code we generate for this program. It's interesting that LLVM has unrolled the loop four times, but I think that should be fine.
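
Roughly, at the source level, the unrolled loop looks like the following (a sketch of my own, not LLVM's actual output): the counter is decremented by 4 per iteration and the add/rem pair is repeated four times. N is 10000000, which is divisible by 4, so no remainder iterations are needed.

// Hypothetical source-level view of the 4x-unrolled loop.
while (N) {
    last = (last + 33) % 13;
    last = (last + 33) % 13;
    last = (last + 33) % 13;
    last = (last + 33) % 13;
    N -= 4;
}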

In the wasm, the loop body is four copies of:

      i32.const 33
      i32.add
      i32.const 13
      i32.rem_u

We compile that to four copies of this sequence:

000000a6    8d 42 21                          lea eax, [rdx + 0x21]
000000a9    48 31 d2                          xor rdx, rdx
000000ac    41 f7 f3                          div r11d

(Except the first copy also loads the constant 13 into %r11. In this case, rematerialization is hurting us slightly since we have plenty of free registers.)

I don't think we can do better than that except, as @bjorn3 says, by turning the division into multiplication by magic constants.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2023 at 00:53):

jameysharp commented on issue #6287:

Okay, that said, the bottom of the loop isn't ideal:

000000c1    41 83 ea 04                       sub r10d, 4
000000c5    45 85 d2                          test r10d, r10d
000000c8    0f 84 08 00 00 00                 je 0xd6
000000ce    49 89 d3                          mov r11, rdx
000000d1    e9 c0 ff ff ff                    jmp 0x96

We have the value of last in %r11 on entry to the loop, but during the first of the four unrolled copies of the loop body we switch over to keeping it in %rdx, due to the register constraint on the div instruction on x86. So at the end of the loop we have to copy it back to %r11 if the loop is continuing. Since the move is only necessary on the critical edge, regalloc2 inserts it between the two branch instructions, which inhibits the machbuffer branch optimizations.

cc @elliottt who might find this an interesting case to think about.

I don't think this particular issue should have much of any effect on performance in anything other than the tiniest microbenchmarks, though. I suspect the cost of the division instruction is much more significant.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2023 at 08:42):

hungryzzz commented on issue #6287:

> Division and modulo are relatively slow on most CPUs. For small constant divisors there are faster ways to perform these operations using multiplication and shifting, which Cranelift doesn't currently support. That may be why it is slower here. I haven't looked at the disassembly from any of the other wasm engines, though. By the way, are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too, which should perform almost identically to Wasmtime.

Hi, I use the Cranelift backend of Wasmer (the default option), but the version (0.98) may be different from Wasmtime's. I also tried the LLVM backend of Wasmer and got an execution time of 25228.15 us, which is a little quicker than using Cranelift in Wasmer.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2023 at 08:44):

hungryzzz edited a comment on issue #6287:

> Division and modulo are relatively slow on most CPUs. For small constant divisors there are faster ways to perform these operations using multiplication and shifting, which Cranelift doesn't currently support. That may be why it is slower here. I haven't looked at the disassembly from any of the other wasm engines, though. By the way, are you using the LLVM backend of Wasmer? Wasmer has a Cranelift backend too, which should perform almost identically to Wasmtime.

Hi, I use the Cranelift backend of Wasmer (the default option), but the version (0.98) may be different from Wasmtime's. I also measured 25228.15 us with the LLVM backend of Wasmer, which is a little quicker than using Cranelift in Wasmer.

view this post on Zulip Wasmtime GitHub notifications bot (May 01 2023 at 18:04):

fitzgen commented on issue #6287:

It seems like we've bottomed this out as an issue of no longer doing the magic-constant optimization for division by constants, and adding that back is tracked in https://github.com/bytecodealliance/wasmtime/issues/6049, so I'm going to go ahead and close this issue. Thanks for bringing it to our attention @hungryzzz!

view this post on Zulip Wasmtime GitHub notifications bot (May 01 2023 at 18:04):

fitzgen closed issue #6287.

