rahulchaphalkar opened issue #8178:
I am running the `ackermann` benchmark with wasmtime, and I noticed a performance delta of approximately 30% compared with native. Profiling with VTune, I see that the wasmtime disassembly contains a lot of call-stack setup/teardown instructions at the beginning and end of the function, while the native build (clang, -O3) does not.
I used `wasmtime explore` to correlate the `wat` with the disassembly as well. Here are the snippets of disassembly -
Wasm setup of the stack -
    Address         Source Line  Assembly
    0x7f2b691d8040  0            push rbp
    0x7f2b691d8041  0            mov rbp, rsp
    0x7f2b691d8044  0            mov r10, qword ptr [rdi+0x8]
    0x7f2b691d8048  0            mov r10, qword ptr [r10]
    0x7f2b691d804b  0            cmp r10, rsp
    0x7f2b691d804e  0            jnbe 0x7f2b691d80b7 <Block 9>
    0x7f2b691d8054  0            Block 2:
    0x7f2b691d8054  0            sub rsp, 0x10
    0x7f2b691d8058  0            mov qword ptr [rsp], r12
    0x7f2b691d805c  0            mov qword ptr [rsp+0x8], r15
    0x7f2b691d8061  0            mov r15, rdi
    0x7f2b691d8064  0            test edx, edx
    0x7f2b691d8066  0            mov r12, rdx
    0x7f2b691d8069  0            jz 0x7f2b691d80a2 <Block 8>
I have pasted the wat file of this function below as well for reference.
Wasm Teardown -
    Address         Source Line  Assembly
    0x7f2b691d80a2  0            lea eax, ptr [rcx+0x1]
    0x7f2b691d80a5  0            mov r12, qword ptr [rsp]
    0x7f2b691d80a9  0            mov r15, qword ptr [rsp+0x8]
    0x7f2b691d80ae  0            add rsp, 0x10
    0x7f2b691d80b2  0            mov rsp, rbp
    0x7f2b691d80b5  0            pop rbp
    0x7f2b691d80b6  0            ret
`wat` of the relevant function -
    (func (;3;) (type 5) (param i32 i32) (result i32)
      local.get 0
      if ;; label = @1
        loop ;; label = @2
          local.get 1
          if (result i32) ;; label = @3
            local.get 0
            local.get 1
            i32.const 1
            i32.sub
            call 3
          else
            i32.const 1
          end
          local.set 1
          local.get 0
          i32.const 1
          i32.sub
          local.tee 0
          br_if 0 (;@2;)
        end
      end
      local.get 1
      i32.const 1
      i32.add
    )
Native disassembly is pretty short, the entirety of the function is as shown below (this is in at&t syntax, unlike Intel syntax in some above snippets) -
    Address  Source Line  Assembly
    0x1170   0            Block 1:
    0x1170   0            pushq %rbx
    0x1171   0            mov %esi, %eax
    0x1173   0            test %edi, %edi
    0x1175   0            jz 0x119f <Block 8>
    0x1177   0            Block 2:
    0x1177   0            mov %edi, %ebx
    0x1179   0            jmp 0x118a <Block 5>
    0x117b   0            Block 3:
    0x117b   0            nopl %eax, (%rax,%rax,1)
    0x1180   0            Block 4:
    0x1180   0            mov $0x1, %eax
    0x1185   0            add $0xffffffff, %ebx
    0x1188   0            jz 0x119f <Block 8>
    0x118a   0            Block 5:
    0x118a   0            test %eax, %eax
    0x118c   0            jz 0x1180 <Block 4>
    0x118e   0            Block 6:
    0x118e   0            add $0xffffffff, %eax
    0x1191   0            mov %ebx, %edi
    0x1193   0            mov %eax, %esi
    0x1195   0            callq 0x1170 <Block 1>
    0x119a   0            Block 7:
    0x119a   0            add $0xffffffff, %ebx
    0x119d   0            jnz 0x118a <Block 5>
    0x119f   0            Block 8:
    0x119f   0            add $0x1, %eax
    0x11a2   0            popq %rbx
    0x11a3   0            retq
and the C source function to generate wasm and native is -
    int ackermann(int M, int N) {
        if (M == 0) {
            return N + 1;
        }
        if (N == 0) {
            return ackermann(M - 1, 1);
        }
        return ackermann(M - 1, ackermann(M, (N - 1)));
    }
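Reading the wat and the native disassembly side by side, both compilers appear to have turned the two outer (tail-position) calls into a loop, leaving only the inner `ackermann(M, N - 1)` as a real call. The sketch below is a hand-written C rendering of that loop shape, not compiler output; the names `ackermann_rec` and `ackermann_loop` are made up for illustration.

```c
#include <assert.h>

/* Reference: the recursive definition from the C source above. */
int ackermann_rec(int m, int n) {
    if (m == 0) return n + 1;
    if (n == 0) return ackermann_rec(m - 1, 1);
    return ackermann_rec(m - 1, ackermann_rec(m, n - 1));
}

/* Hypothetical loop form mirroring the structure of the wat:
 * the outer tail calls become iterations, and only the inner
 * call remains as a genuine recursive call. */
int ackermann_loop(int m, int n) {
    assert(m >= 0 && n >= 0);
    while (m != 0) {
        /* m must survive this call, so it needs a callee-saved
         * register or a stack slot. */
        n = (n != 0) ? ackermann_loop(m, n - 1) : 1;
        m -= 1;
    }
    return n + 1;
}
```

In this form the only value that is live across the recursive call is `m`, which is consistent with the native code saving a single register (`rbx`).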
I also tried with the `--wasm-features tail-call` CLI flag; however, that actually made the perf slightly worse.
Any pointers on the difference in disassembly between native and wasm?
cfallin commented on issue #8178:
Hi @rahulchaphalkar -- it looks like the difference is down to two fundamental factors:
- We have explicit stack checks rather than implicit stack probes and reliance on guard pages. We've actually just been discussing this in #8135. That's the business with `r10` before decrementing `rsp`.
- We have two clobber-saves (`r12` and `r15`), whereas the native code gets away with one (`rbx`). It would be a good exercise to trace through the assembly and see what the registers are used for; perhaps the native compiler's register allocator is able to be a bit smarter about reuse. It is fundamentally necessary to have some state on the stack I think, since there is a recursive call (the one in non-tail position on the second-to-last line of C) and there is at least one word of state (`M`) necessary after it returns.
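To make the first point concrete, here is a rough C model of what the `r10` sequence in the Wasm prologue seems to do: load a stack limit through two pointer hops from a context argument, compare it with the current stack pointer, and bail out if the limit is above it. Everything here is an assumption for illustration: the struct names, the field layout, and the `trapped` out-parameter are hypothetical and are not Wasmtime's actual data structures, and the address of a local variable stands in for `rsp`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the context holds a pointer (at offset 8 on a
 * 64-bit target) to a record whose first word is the stack limit. */
struct runtime_limits { uintptr_t stack_limit; };
struct vm_context { void *reserved; struct runtime_limits *limits; };

/* Models `cmp r10, rsp; jnbe trap`: the check passes when the limit is
 * not above the (approximate) stack pointer. Assumes a downward-growing
 * stack. */
bool stack_check_ok(const struct vm_context *ctx) {
    char marker;                          /* stand-in for rsp */
    uintptr_t sp = (uintptr_t)&marker;
    return ctx->limits->stack_limit <= sp;
}

/* Loop-form Ackermann with an explicit check on every entry. */
int ackermann_checked(struct vm_context *ctx, int m, int n, bool *trapped) {
    assert(ctx && trapped);
    if (!stack_check_ok(ctx)) {           /* explicit check, no guard page */
        *trapped = true;
        return 0;
    }
    while (m != 0) {
        n = (n != 0) ? ackermann_checked(ctx, m, n - 1, trapped) : 1;
        if (*trapped) return 0;
        m -= 1;
    }
    return n + 1;
}
```

The check costs two dependent loads, a compare, and a branch on every call. A native build that relies on guard pages pays none of that on the fast path, which fits the gap being most visible in a call-heavy benchmark like this one.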
fitzgen commented on issue #8178:
And FWIW, it is known that the `tail` calling convention can currently lead to some slowdowns, which is why Wasm tail calls aren't enabled by default yet: https://github.com/bytecodealliance/wasmtime/issues/6759
Last updated: Dec 23 2024 at 12:05 UTC