disassemly seems (relatively) unoptimized · cranelift

I'm working on a language that will use cranelift as a backend. We assumed (perhaps naively) that cranelift would emit fairly optimized code by default, but based on some limited testing that doesn't seem to be the case. This is not really a deal breaker for me, but I still wanted to check whether I might be holding it wrong :)

I'm using the cranelift_jit crate and based the code mostly on the cranelift jit demo, with the settings customized like this:

let mut settings = settings::builder();
settings.set("opt_level", "speed_and_size").unwrap();
let flags = settings::Flags::new(settings);
let isa = cranelift_native::builder().unwrap().finish(flags).unwrap();

pub fn foo(num: i32) -> bool {
    let a = 5;
    if num == a {
        return true;
    } else {
        return false;
    }
}

function u0:0(i32) -> i8 system_v {
block0(v0: i32):
    v1 = iconst.i32 5
    v2 = icmp eq v1, v0  ; v1 = 5
    v3 = icmp_imm eq v2, 1
    brif v3, block1, block2

block1:
    v4 = iconst.i8 1
    return v4  ; v4 = 1

block2:
    v5 = iconst.i8 0
    return v5  ; v5 = 0
}

block0: ; offset 0x0
  pushq %rbp
  movq %rsp, %rbp
  cmpl $5, %edi
  sete %r8b
  cmpb $1, %r8b
  je 0x1c
  xorl %eax, %eax
  movq %rbp, %rsp
  popq %rbp
  retq
  movl $1, %eax
  movq %rbp, %rsp
  popq %rbp
  retq

There were a couple things that surprised me here. First, I assumed that the jump would be compiled away. Second, the stack frame manipulation seems superfluous. Finally, the opt_level did not seem to have any influence on the assembly. All in all, it's a fairly literal translation of the IR (with some stack pointer stuff). So, my question is whether I'm missing some configuration option or that there is some function I should call to further optimize the function.

I'm in no way an expert in assembly, so I might be totally off here with my assumptions.

fitzgen (he/him) (Apr 30 2024 at 15:21):

if I remember correctly, we haven't implemented preserve_frame_pointers=false for x86_64 yet, there was some failing test that prevented the PR from landing, and whoever was poking at it didn't have time to investigate. we should circle back on that at some point

in general, Cranelift doesn't do too much with optimizing branches/blocks. it will fold blocks together if A unconditionally jumps to B and A is B's only predecessor, but it won't determine that the if/else is returning the same thing as the condition and rewrite that to returning the condition directly. this is because cranelift is mostly used in wasmtime, where the wasm input has already been optimized by LLVM, and its job is primarily to act as a good instruction selector and register allocator.

Terts Diepraam (Apr 30 2024 at 16:15):

That's very helpful, thanks! Out of general curiosity, are more optimizations for branches on the roadmap? It sounds like that would be relevant for use of cranelift as a Rust backend and other languages too.

Chris Fallin (Apr 30 2024 at 16:19):

@Terts Diepraam we have thoughts on how we might do it, yes. Unfortunately we don't have a gaggle of engineers waiting to implement this -- we all wear many hats and no one has it as their primary mission to push the optimizer forward at the moment. (Open-source economics and infra and all that...) If you're willing to get involved, or know someone who is, we could use more folks working on it!

Chris Fallin (Apr 30 2024 at 16:21):

Also, as a starter task, the extra compare is fairly silly -- (icmp (icmp ...) (iconst 1)) should be rewritten to (icmp ...). I had thought we had a midend rule for that but maybe it's not applicable for some reason here. That'd be possibly an easier place to get started

Terts Diepraam (Apr 30 2024 at 16:22):

That's good to hear! Indeed, I in no way intend to increase the pressure. Y'all are doing great work. I might indeed try to get involved. Can't promise I'll be any good though :sweat_smile: That starter task looks fun!

Kirp (Apr 30 2024 at 17:53):

wasmtime/cranelift/codegen/src/opts/icmp.isle at main · bytecodealliance/wasmtime

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

Terts Diepraam (Apr 30 2024 at 17:54):

Indeed, that's almost it, but it's only comparisons with 0 not with 1. PR incoming :)

Terts Diepraam (Apr 30 2024 at 18:19):

cranelift: add icmp-of-icmp rules for comparisons with 1 by tertsdiepraam · Pull Request #8510 · bytecodealliance/wasmtime

Suggested by @cfallin in this zulip thread: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/disassemly.20seems.20.28relatively.29.20unoptimized The icmp-of-icmp rules o...

Afonso Bordado (Apr 30 2024 at 18:27):

Most of the tests that cover the ISLE optimizations will probably be in that folder. You should be able to write a funcion that represents the unoptimized case, run clif-util and have it tell you what it was optimized to.

wasmtime/cranelift/filetests/filetests/egraph/icmp.clif at main · bytecodealliance/wasmtime

A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.

Afonso Bordado (Apr 30 2024 at 18:28):

Terts Diepraam (Apr 30 2024 at 18:28):

Terts Diepraam (Apr 30 2024 at 20:27):

Chris Fallin (Apr 30 2024 at 20:29):

thanks again for the PR and happy to help think through other suboptimal cases if you run into any!

Terts Diepraam (May 01 2024 at 09:39):

I'm looking into omitting the function prologue and epilogue now and I've copied the approach from aarch64, where the frame_layout.setup_area_size is set to 0 if it's a leaf function (and some other conditions). However, that leads to a segfault for a simd test, because the simd swizzle is lowered to a callq which is not taken into account with the archtecture-independent is_leaf calculation. I suppose I'll put up a draft PR to make it easier to discuss.

Terts Diepraam (May 01 2024 at 09:59):

WIP: cranelift: omit function prologue and epilogue on x64 if possible by tertsdiepraam · Pull Request #8516 · bytecodealliance/wasmtime

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64. However, this is not quite right yet, because (if I understand co...

Stream: cranelift

Topic: disassemly seems (relatively) unoptimized

Terts Diepraam (Apr 30 2024 at 14:01):