Hi all!
I'm working on a language that will use cranelift as a backend. We assumed (perhaps naively) that cranelift would emit fairly optimized code by default, but based on some limited testing that doesn't seem to be the case. This is not really a deal breaker for me, but I still wanted to check whether I might be holding it wrong :)
I'm using the cranelift_jit
crate and based the code mostly on the cranelift jit demo, with the settings customized like this:
let mut settings = settings::builder();
settings.set("opt_level", "speed_and_size").unwrap();
let flags = settings::Flags::new(settings);
let isa = cranelift_native::builder().unwrap().finish(flags).unwrap();
My test program then corresponded with a Rust function roughly equivalent to:
pub fn foo(num: i32) -> bool {
let a = 5;
if num == a {
return true;
} else {
return false;
}
}
Which I represented with the following cranelift IR:
function u0:0(i32) -> i8 system_v {
block0(v0: i32):
v1 = iconst.i32 5
v2 = icmp eq v1, v0 ; v1 = 5
v3 = icmp_imm eq v2, 1
brif v3, block1, block2
block1:
v4 = iconst.i8 1
return v4 ; v4 = 1
block2:
v5 = iconst.i8 0
return v5 ; v5 = 0
}
Which then gave me the following disassembly:
block0: ; offset 0x0
pushq %rbp
movq %rsp, %rbp
cmpl $5, %edi
sete %r8b
cmpb $1, %r8b
je 0x1c
xorl %eax, %eax
movq %rbp, %rsp
popq %rbp
retq
movl $1, %eax
movq %rbp, %rsp
popq %rbp
retq
There were a couple things that surprised me here. First, I assumed that the jump would be compiled away. Second, the stack frame manipulation seems superfluous. Finally, the opt_level
did not seem to have any influence on the assembly. All in all, it's a fairly literal translation of the IR (with some stack pointer stuff). So, my question is whether I'm missing some configuration option or that there is some function I should call to further optimize the function.
I'm in no way an expert in assembly, so I might be totally off here with my assumptions.
if I remember correctly, we haven't implemented preserve_frame_pointers=false
for x86_64 yet, there was some failing test that prevented the PR from landing, and whoever was poking at it didn't have time to investigate. we should circle back on that at some point
in general, Cranelift doesn't do too much with optimizing branches/blocks. it will fold blocks together if A unconditionally jumps to B and A is B's only predecessor, but it won't determine that the if/else is returning the same thing as the condition and rewrite that to returning the condition directly. this is because cranelift is mostly used in wasmtime, where the wasm input has already been optimized by LLVM, and its job is primarily to act as a good instruction selector and register allocator.
That's very helpful, thanks! Out of general curiosity, are more optimizations for branches on the roadmap? It sounds like that would be relevant for use of cranelift as a Rust backend and other languages too.
@Terts Diepraam we have thoughts on how we might do it, yes. Unfortunately we don't have a gaggle of engineers waiting to implement this -- we all wear many hats and no one has it as their primary mission to push the optimizer forward at the moment. (Open-source economics and infra and all that...) If you're willing to get involved, or know someone who is, we could use more folks working on it!
Also, as a starter task, the extra compare is fairly silly -- (icmp (icmp ...) (iconst 1))
should be rewritten to (icmp ...)
. I had thought we had a midend rule for that but maybe it's not applicable for some reason here. That'd be possibly an easier place to get started
That's good to hear! Indeed, I in no way intend to increase the pressure. Y'all are doing great work. I might indeed try to get involved. Can't promise I'll be any good though :sweat_smile: That starter task looks fun!
Chris Fallin said:
Also, as a starter task, the extra compare is fairly silly --
(icmp (icmp ...) (iconst 1))
should be rewritten to(icmp ...)
. I had thought we had a midend rule for that but maybe it's not applicable for some reason here. That'd be possibly an easier place to get started
Isn’t that simplification lines 39-43 of icmp.isle https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/codegen/src/opts/icmp.isle
Indeed, that's almost it, but it's only comparisons with 0 not with 1. PR incoming :)
Here's my attempt: https://github.com/bytecodealliance/wasmtime/pull/8510
Now I need to figure out how to write tests
We have a few similar tests in this file: https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/filetests/filetests/egraph/icmp.clif
Most of the tests that cover the ISLE optimizations will probably be in that folder. You should be able to write a funcion that represents the unoptimized case, run clif-util
and have it tell you what it was optimized to.
If you need help with running those tests, or writing them, let me know!
Great! Thanks!
Thanks all! I'm looking forward to contributing more already
thanks again for the PR and happy to help think through other suboptimal cases if you run into any!
I'm looking into omitting the function prologue and epilogue now and I've copied the approach from aarch64, where the frame_layout.setup_area_size
is set to 0
if it's a leaf function (and some other conditions). However, that leads to a segfault for a simd test, because the simd swizzle is lowered to a callq
which is not taken into account with the archtecture-independent is_leaf
calculation. I suppose I'll put up a draft PR to make it easier to discuss.
Here it is: https://github.com/bytecodealliance/wasmtime/pull/8516
Last updated: Nov 22 2024 at 16:03 UTC