kubkon commented on Issue #972:
@alexcrichton @rene-fonseca did you guys manage to work out what the cause of the overhead is?
alexcrichton commented on Issue #972:
No there hasn't been any discussion about the generated code itself. Using
--jitdump
, though, I've extracted both the rustc version of the compiledfib
function as well as the cranelift-compiled version:https://gist.github.com/alexcrichton/6d80d25ee6f64857c3388e130dad22ed
bjorn3 commented on Issue #972:
There seems to be more spilling and more register-register moves in the Cranelift version. The work on a new regalloc may help with this.
kubkon commented on Issue #972:
Not sure how useful this will be but I've done some more benchmarks including re-running fibonacci stripped of all recursion (sources and some numbers available at github.com/kubkon/wasmtime-bench). In that particular case, there is a massive speedup. Nonetheless, Cranelift still incurs 2x overhead as seen in a more complicated mandelbrot example.
bjorn3 commented on Issue #972:
How is the performance with the new x64 backend?
cfallin commented on Issue #972:
This is somewhat outdated; closing in favor of our upcoming more general perf-tracking efforts (bytecodealliance/rfcs#3 and bytecodealliance/rfcs#4).
cfallin closed Issue #972:
While a pretty awful benchmark "in the large" I was surprised playing around locally how the naive fibonacci program was so slow relative to native performance. Especially because fibonacci benchmarks don't touch linear memory much, this may at least be a decent benchmark of cranelift and/or the code generator in use.
Given an input file like:
fn main() { std::process::exit(run() as i32); } #[no_mangle] pub extern "C" fn run() -> u32 { fib(43) } fn fib(n: u32) -> u32 { if n <= 1 { 1 } else { fib(n - 1) + fib(n - 2) } }
native execution looks like (for me at least)
$ rustc -O fib.rs $ time ./fib ./fib 0.90s user 0.00s system 99% cpu 0.902 total
Whereas the wasm execution looks like:
$ rustc -O fib.rs --target wasm32-wasi --crate-type cdylib $ time wasmtime --disable-cache -O --invoke run fib.wasm warning: using `--invoke` with a function that returns values is experimental and may break in the future 701408733 wasmtime --disable-cache -O --invoke run fib.wasm 2.20s user 0.02s system 104% cpu 2.123 total
Here the wasm is over 2x slower than native, which was a bit surprising to me!
Some other reference information below is...
<details>
<summary>Native assembly for the
fib
function</summary>0000000000004d40 <_ZN3fib3fib17h2bacf53cb3845acfE>: 4d40: 55 push %rbp 4d41: 53 push %rbx 4d42: 50 push %rax 4d43: bd 01 00 00 00 mov $0x1,%ebp 4d48: 83 ff 02 cmp $0x2,%edi 4d4b: 72 25 jb 4d72 <_ZN3fib3fib17h2bacf53cb3845acfE+0x32> 4d4d: 89 fb mov %edi,%ebx 4d4f: bd 01 00 00 00 mov $0x1,%ebp 4d54: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 4d5b: 00 00 00 4d5e: 66 90 xchg %ax,%ax 4d60: 8d 7b ff lea -0x1(%rbx),%edi 4d63: e8 d8 ff ff ff callq 4d40 <_ZN3fib3fib17h2bacf53cb3845acfE> 4d68: 83 c3 fe add $0xfffffffe,%ebx 4d6b: 01 c5 add %eax,%ebp 4d6d: 83 fb 01 cmp $0x1,%ebx 4d70: 77 ee ja 4d60 <_ZN3fib3fib17h2bacf53cb3845acfE+0x20> 4d72: 89 e8 mov %ebp,%eax 4d74: 48 83 c4 08 add $0x8,%rsp 4d78: 5b pop %rbx 4d79: 5d pop %rbp 4d7a: c3 retq 4d7b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
</details>
<details>
<summary>WebAssembly of the
fib
function</summary>(func $_ZN3fib3fib17hedcc9d2af68c6e00E (type 1) (param i32) (result i32) (local i32) i32.const 1 local.set 1 block ;; label = @1 local.get 0 i32.const 2 i32.lt_u br_if 0 (;@1;) i32.const 1 local.set 1 loop ;; label = @2 local.get 0 i32.const -1 i32.add call $_ZN3fib3fib17hedcc9d2af68c6e00E local.get 1 i32.add local.set 1 local.get 0 i32.const -2 i32.add local.tee 0 i32.const 1 i32.gt_u br_if 0 (;@2;) end end local.get 1)
</details>
At this time I don't know of a great way to get out the assembly generated by cranelift unfortunately, but I'm hoping others may know of an easy way to do so!
Last updated: Dec 23 2024 at 12:05 UTC