Stream: cranelift

Topic: disassemly seems (relatively) unoptimized


view this post on Zulip Terts Diepraam (Apr 30 2024 at 14:01):

Hi all!

I'm working on a language that will use cranelift as a backend. We assumed (perhaps naively) that cranelift would emit fairly optimized code by default, but based on some limited testing that doesn't seem to be the case. This is not really a deal breaker for me, but I still wanted to check whether I might be holding it wrong :)

I'm using the cranelift_jit crate and based the code mostly on the cranelift jit demo, with the settings customized like this:

let mut settings = settings::builder();
settings.set("opt_level", "speed_and_size").unwrap();
let flags = settings::Flags::new(settings);
let isa = cranelift_native::builder().unwrap().finish(flags).unwrap();

My test program then corresponded with a Rust function roughly equivalent to:

pub fn foo(num: i32) -> bool {
    let a = 5;
    if num == a {
        return true;
    } else {
        return false;
    }
}

Which I represented with the following cranelift IR:

function u0:0(i32) -> i8 system_v {
block0(v0: i32):
    v1 = iconst.i32 5
    v2 = icmp eq v1, v0  ; v1 = 5
    v3 = icmp_imm eq v2, 1
    brif v3, block1, block2

block1:
    v4 = iconst.i8 1
    return v4  ; v4 = 1

block2:
    v5 = iconst.i8 0
    return v5  ; v5 = 0
}

Which then gave me the following disassembly:

block0: ; offset 0x0
  pushq %rbp
  movq %rsp, %rbp
  cmpl $5, %edi
  sete %r8b
  cmpb $1, %r8b
  je 0x1c
  xorl %eax, %eax
  movq %rbp, %rsp
  popq %rbp
  retq
  movl $1, %eax
  movq %rbp, %rsp
  popq %rbp
  retq

There were a couple things that surprised me here. First, I assumed that the jump would be compiled away. Second, the stack frame manipulation seems superfluous. Finally, the opt_level did not seem to have any influence on the assembly. All in all, it's a fairly literal translation of the IR (with some stack pointer stuff). So, my question is whether I'm missing some configuration option or that there is some function I should call to further optimize the function.

I'm in no way an expert in assembly, so I might be totally off here with my assumptions.

view this post on Zulip fitzgen (he/him) (Apr 30 2024 at 15:21):

if I remember correctly, we haven't implemented preserve_frame_pointers=false for x86_64 yet, there was some failing test that prevented the PR from landing, and whoever was poking at it didn't have time to investigate. we should circle back on that at some point

in general, Cranelift doesn't do too much with optimizing branches/blocks. it will fold blocks together if A unconditionally jumps to B and A is B's only predecessor, but it won't determine that the if/else is returning the same thing as the condition and rewrite that to returning the condition directly. this is because cranelift is mostly used in wasmtime, where the wasm input has already been optimized by LLVM, and its job is primarily to act as a good instruction selector and register allocator.

view this post on Zulip Terts Diepraam (Apr 30 2024 at 16:15):

That's very helpful, thanks! Out of general curiosity, are more optimizations for branches on the roadmap? It sounds like that would be relevant for use of cranelift as a Rust backend and other languages too.

view this post on Zulip Chris Fallin (Apr 30 2024 at 16:19):

@Terts Diepraam we have thoughts on how we might do it, yes. Unfortunately we don't have a gaggle of engineers waiting to implement this -- we all wear many hats and no one has it as their primary mission to push the optimizer forward at the moment. (Open-source economics and infra and all that...) If you're willing to get involved, or know someone who is, we could use more folks working on it!

view this post on Zulip Chris Fallin (Apr 30 2024 at 16:21):

Also, as a starter task, the extra compare is fairly silly -- (icmp (icmp ...) (iconst 1)) should be rewritten to (icmp ...). I had thought we had a midend rule for that but maybe it's not applicable for some reason here. That'd be possibly an easier place to get started

view this post on Zulip Terts Diepraam (Apr 30 2024 at 16:22):

That's good to hear! Indeed, I in no way intend to increase the pressure. Y'all are doing great work. I might indeed try to get involved. Can't promise I'll be any good though :sweat_smile: That starter task looks fun!

view this post on Zulip Kirp (Apr 30 2024 at 17:53):

Chris Fallin said:

Also, as a starter task, the extra compare is fairly silly -- (icmp (icmp ...) (iconst 1)) should be rewritten to (icmp ...). I had thought we had a midend rule for that but maybe it's not applicable for some reason here. That'd be possibly an easier place to get started

Isn’t that simplification lines 39-43 of icmp.isle https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/codegen/src/opts/icmp.isle

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip Terts Diepraam (Apr 30 2024 at 17:54):

Indeed, that's almost it, but it's only comparisons with 0 not with 1. PR incoming :)

view this post on Zulip Terts Diepraam (Apr 30 2024 at 18:19):

Here's my attempt: https://github.com/bytecodealliance/wasmtime/pull/8510

Now I need to figure out how to write tests

Suggested by @cfallin in this zulip thread: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/disassemly.20seems.20.28relatively.29.20unoptimized The icmp-of-icmp rules o...

view this post on Zulip Afonso Bordado (Apr 30 2024 at 18:27):

We have a few similar tests in this file: https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/filetests/filetests/egraph/icmp.clif

Most of the tests that cover the ISLE optimizations will probably be in that folder. You should be able to write a funcion that represents the unoptimized case, run clif-util and have it tell you what it was optimized to.

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

view this post on Zulip Afonso Bordado (Apr 30 2024 at 18:28):

If you need help with running those tests, or writing them, let me know!

view this post on Zulip Terts Diepraam (Apr 30 2024 at 18:28):

Great! Thanks!

view this post on Zulip Terts Diepraam (Apr 30 2024 at 20:27):

Thanks all! I'm looking forward to contributing more already

view this post on Zulip Chris Fallin (Apr 30 2024 at 20:29):

thanks again for the PR and happy to help think through other suboptimal cases if you run into any!

view this post on Zulip Terts Diepraam (May 01 2024 at 09:39):

I'm looking into omitting the function prologue and epilogue now and I've copied the approach from aarch64, where the frame_layout.setup_area_size is set to 0 if it's a leaf function (and some other conditions). However, that leads to a segfault for a simd test, because the simd swizzle is lowered to a callq which is not taken into account with the archtecture-independent is_leaf calculation. I suppose I'll put up a draft PR to make it easier to discuss.

view this post on Zulip Terts Diepraam (May 01 2024 at 09:59):

Here it is: https://github.com/bytecodealliance/wasmtime/pull/8516

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64. However, this is not quite right yet, because (if I understand co...

Last updated: Oct 23 2024 at 20:03 UTC