Stream: git-wasmtime

Topic: wasmtime / issue #8178 Call stack performance investigation


Wasmtime GitHub notifications bot (Mar 18 2024 at 23:49):

rahulchaphalkar opened issue #8178:

I am running the Ackermann benchmark with wasmtime, and I noticed a performance delta of approximately 30% compared with native. Profiling with VTune, I see that the wasmtime disassembly contains a lot of call-stack setup/teardown instructions at the beginning and end of the function, while the native build (clang, -O3) does not.
I also used wasmtime explore to correlate the wat with the disassembly. Here are the disassembly snippets:

Wasm Setup of the stack -

Address Source Line Assembly
0x7f2b691d8040  0   push rbp
0x7f2b691d8041  0   mov rbp, rsp
0x7f2b691d8044  0   mov r10, qword ptr [rdi+0x8]
0x7f2b691d8048  0   mov r10, qword ptr [r10]
0x7f2b691d804b  0   cmp r10, rsp
0x7f2b691d804e  0   jnbe 0x7f2b691d80b7 <Block 9>
0x7f2b691d8054  0   Block 2:
0x7f2b691d8054  0   sub rsp, 0x10
0x7f2b691d8058  0   mov qword ptr [rsp], r12
0x7f2b691d805c  0   mov qword ptr [rsp+0x8], r15
0x7f2b691d8061  0   mov r15, rdi
0x7f2b691d8064  0   test edx, edx
0x7f2b691d8066  0   mov r12, rdx
0x7f2b691d8069  0   jz 0x7f2b691d80a2 <Block 8>
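The four instructions after the frame setup (two dependent loads, a compare against rsp, and a jnbe to Block 9) read like a stack-limit check. As a rough sketch only, the C model below shows that comparison. The struct names, field names, and layout are hypothetical and chosen just to mirror the [rdi+0x8] and [r10] loads on a 64-bit target; they are not taken from wasmtime's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not wasmtime's real definitions. */
struct runtime_limits {
    uintptr_t stack_limit;          /* lowest address the stack may reach */
};

struct vm_context {
    void *reserved;                 /* placeholder so `limits` sits at offset 8 */
    struct runtime_limits *limits;  /* models: mov r10, qword ptr [rdi+0x8] */
};

/* Returns true when the call may proceed, false when the prologue would
 * branch to the out-of-line block (Block 9 in the listing above). */
bool stack_check_passes(const struct vm_context *vmctx, uintptr_t sp)
{
    assert(vmctx != NULL && vmctx->limits != NULL);
    uintptr_t limit = vmctx->limits->stack_limit;  /* mov r10, qword ptr [r10] */
    /* cmp r10, rsp ; jnbe: the branch is taken when limit > sp (unsigned). */
    return !(limit > sp);
}
```

Under these assumptions the check costs two loads and a compare-and-branch on every call, which is work the native prologue shown later in this post does not perform.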

I have pasted the wat file of this function below as well for reference.

Wasm Teardown -

Address Source Line Assembly
0x7f2b691d80a2  0   lea eax, ptr [rcx+0x1]
0x7f2b691d80a5  0   mov r12, qword ptr [rsp]
0x7f2b691d80a9  0   mov r15, qword ptr [rsp+0x8]
0x7f2b691d80ae  0   add rsp, 0x10
0x7f2b691d80b2  0   mov rsp, rbp
0x7f2b691d80b5  0   pop rbp
0x7f2b691d80b6  0   ret

wat of the relevant function -

(func (;3;) (type 5) (param i32 i32) (result i32)
    local.get 0
    if ;; label = @1
      loop ;; label = @2
        local.get 1
        if (result i32) ;; label = @3
          local.get 0
          local.get 1
          i32.const 1
          i32.sub
          call 3
        else
          i32.const 1
        end
        local.set 1
        local.get 0
        i32.const 1
        i32.sub
        local.tee 0
        br_if 0 (;@2;)
      end
    end
    local.get 1
    i32.const 1
    i32.add
  )
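For readers less used to wat, here is a hand transliteration of the function above into C. It is a reading aid written for this thread, not compiler output, and the name ackermann_loop is made up. It suggests that the outer recursive call has already been turned into a loop before wasmtime sees the code, leaving a single recursive call per iteration.

```c
#include <assert.h>

/* Hand transliteration of wasm function 3 above (name is hypothetical).
 * m corresponds to local 0 and n to local 1. */
int ackermann_loop(int m, int n)
{
    assert(m >= 0 && n >= 0);           /* negative inputs would not terminate */
    if (m != 0) {                       /* local.get 0 ; if */
        do {                            /* loop */
            if (n != 0) {
                n = ackermann_loop(m, n - 1);   /* call 3 */
            } else {
                n = 1;                  /* i32.const 1 */
            }
            m = m - 1;                  /* local.tee 0 */
        } while (m != 0);               /* br_if 0 */
    }
    return n + 1;                       /* local.get 1 ; i32.const 1 ; i32.add */
}
```

Since the only remaining call is not in tail position (its result is stored into local 1 and the loop continues), each recursive invocation still pays the full prologue and epilogue.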

The native disassembly is pretty short; the entirety of the function is shown below (this is in AT&T syntax, unlike the Intel syntax in the snippets above) -

Address Source Line Assembly
0x1170  0   Block 1:
0x1170  0   pushq  %rbx
0x1171  0   mov %esi, %eax
0x1173  0   test %edi, %edi
0x1175  0   jz 0x119f <Block 8>
0x1177  0   Block 2:
0x1177  0   mov %edi, %ebx
0x1179  0   jmp 0x118a <Block 5>
0x117b  0   Block 3:
0x117b  0   nopl  %eax, (%rax,%rax,1)
0x1180  0   Block 4:
0x1180  0   mov $0x1, %eax
0x1185  0   add $0xffffffff, %ebx
0x1188  0   jz 0x119f <Block 8>
0x118a  0   Block 5:
0x118a  0   test %eax, %eax
0x118c  0   jz 0x1180 <Block 4>
0x118e  0   Block 6:
0x118e  0   add $0xffffffff, %eax
0x1191  0   mov %ebx, %edi
0x1193  0   mov %eax, %esi
0x1195  0   callq  0x1170 <Block 1>
0x119a  0   Block 7:
0x119a  0   add $0xffffffff, %ebx
0x119d  0   jnz 0x118a <Block 5>
0x119f  0   Block 8:
0x119f  0   add $0x1, %eax
0x11a2  0   popq  %rbx
0x11a3  0   retq

The C source function used to generate both the wasm and the native code is -

int ackermann(int M, int N)
{
    if (M == 0)
    {
        return N + 1;
    }
    if (N == 0)
    {
        return ackermann(M - 1, 1);
    }
    return ackermann(M - 1, ackermann(M, (N - 1)));
}
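To illustrate why prologue and epilogue cost matters so much for this benchmark, the sketch below adds a call counter to the same function. The counter and the helper names (g_calls, ackermann_counted, count_calls) are hypothetical additions for illustration and are not part of the benchmark.

```c
#include <assert.h>

/* Hypothetical instrumentation; not part of the original benchmark. */
static unsigned long g_calls;

int ackermann_counted(int M, int N)
{
    assert(M >= 0 && N >= 0);   /* negative inputs would not terminate */
    g_calls++;                  /* one increment per function entry */
    if (M == 0)
    {
        return N + 1;
    }
    if (N == 0)
    {
        return ackermann_counted(M - 1, 1);
    }
    return ackermann_counted(M - 1, ackermann_counted(M, (N - 1)));
}

/* Resets the counter, evaluates the function, stores the value in *result,
 * and returns the number of function entries it took. */
unsigned long count_calls(int M, int N, int *result)
{
    g_calls = 0;
    *result = ackermann_counted(M, N);
    return g_calls;
}
```

Each call does only a couple of compares and an add, so any fixed per-call cost (frame setup, register saves, extra checks) is multiplied by the call count and can account for a noticeable share of the total run time.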

I also tried the --wasm-features tail-call CLI flag; however, that actually made the performance slightly worse.
Any pointers on why the disassembly differs between native and wasm?

Wasmtime GitHub notifications bot (Mar 19 2024 at 02:26):

cfallin commented on issue #8178:

Hi @rahulchaphalkar -- it looks like the difference is down to two fundamental factors:

Wasmtime GitHub notifications bot (Mar 19 2024 at 15:58):

fitzgen commented on issue #8178:

And FWIW, it is known that the tail calling convention can currently lead to some slowdowns, which is why Wasm tail calls aren't enabled by default yet: https://github.com/bytecodealliance/wasmtime/issues/6759


Last updated: Dec 23 2024 at 12:05 UTC