Hi,
I'm wondering if there are any updates on this - https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-roadmap-2023.md#cranelift-porting-how-to-write-a-backend.
Context: I am new to compiler engineering (i.e., I've never worked on compiler development before, but I have some experience with Rust). I looked at the Cranelift repository and feel a bit lost. I'm trying to understand how we go from a given .clif file (assuming it's a single function) to actual machine code. So far, my understanding is as follows:
CLIF --> ISLE-based Lowering to arch-specific VCode --> Legalization --> Some e-graph-based mid-end optimizations --> Register allocation via regalloc2 --> Final lowering to VCode --> Emitting machine instructions
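For concreteness, a minimal hand-written .clif function of the kind in question might look like this (illustrative only; this is Cranelift's textual IR syntax):

```clif
function %add(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    return v2
}
```

The repository's clif-util tool has a compile subcommand that accepts a file like this plus a target, which is a convenient way to watch the pipeline run end to end.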
Additional questions: how does the compiler take a .clif file as its input and work through its program execution? And how does this relate to wasmtime?
There are first optimizations on CLIF IR, which include the e-graph-based optimizations; somewhere in the middle of the optimization pipeline, legalizations are performed. Then comes lowering to arch-specific VCode, which does some peephole optimizations at the same time, then register allocation, and finally emitting machine code.
hi @Nihal Pasham -- the pipeline you wrote out is a bit out of order; it goes something like:
CLIF -> legalization -> mid-end egraph rewrites if enabled (rules in ISLE) -> lowering to backend-specific VCode (rules in ISLE) -> regalloc -> binary emission
in general we should have better docs, though the module-level doc comments for certain pieces try to be fairly helpful at the low-to-mid level; I'd encourage reading through the code in machinst/ (and context.rs in the cranelift-codegen toplevel is where the compiler driver is, so you can trace from there too).
The specific "how to write a backend" tutorial idea you link to unfortunately never happened because, well, there's always too much to do; your best bet at this point would be reading through the other backends probably (I'm partial to aarch64 as maybe the cleanest at the moment?).
Finally, there's no reason Cranelift couldn't target Thumb-2 or any other ISA like that as long as it's not too "unusual": machine code with registers and jumps between instructions, etc. (For an example of a little too unusual, we've talked about targeting GPUs before, and there the differences in regalloc and the control flow handling will take some thought; but Thumb-2 is "just" another CPU ISA in comparison)
@bjorn3 and @Chris Fallin - thank you.
- Could you elaborate on the kinds of optimizations or rewrites that the mid-end egraph rewrites cover when enabled (using rules in ISLE)? Specifically, does it include common optimizations such as Global Value Numbering (GVN), Loop Invariant Code Motion (LICM), and Dead Code Elimination (DCE), or does it encompass a more exhaustive list of optimizations?
Chris Fallin said:
For an example of a little too unusual, we've talked about targeting GPUs before, and there the differences in regalloc and the control flow handling will take some thought
Nihal Pasham said:
- Could you elaborate on the kinds of optimizations or rewrites that the
mid-end egraph rewrites
cover when enabled (using rules in ISLE)? Specifically, does it include common optimizations such as Global Value Numbering (GVN), Loop Invariant Code Motion (LICM), and Dead Code Elimination (DCE), or does it encompass a more exhaustive list of optimizations?
Yes, it implements LICM, DCE, GVN, as well as an alias analysis that feeds into redundant-load elimination, and finally (probably most importantly) it runs a bunch of rewrite rules in cranelift/codegen/src/opts/, including constant propagation/constant folding as well as a lot of algebraic identities.
You can find more detail on how the mid-end works in a talk I gave last year (video, slides), though we should probably write a paper or something eventually to describe all the inner workings. And again, we try to leave reasonable comments that you can read (in this case in egraph.rs and egraph/elaborate.rs).
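For a flavor of what those rules look like, here is a sketch in ISLE in the style of the rules in cranelift/codegen/src/opts/ (written from memory, so treat the exact constructor names as approximate): a simplify rule that rewrites x + 0 into x.

```isle
;; Sketch of a mid-end rewrite rule (approximate syntax):
;; adding a zero constant simplifies to the other operand.
(rule (simplify (iadd ty x (iconst ty (u64_from_imm64 0))))
      (subsume x))
```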
Got it. Does the same apply to more specialized hardware such as TPUs, NPUs, and DSPs, which come with specialized instruction sets designed specifically for neural network (tensor) operations or signal processing? Would these architectures be more or less amenable to Cranelift's current design?
DSPs are probably not too bad -- they're "normal" except for the VLIW instruction scheduling that most require, which we could handle in a single pass during instruction emission in the worst case -- but TPUs/NPUs are not really general-purpose processors (more like matrix-multiplication accelerators) and so are not suitable for running the general-purpose code that Cranelift compiles.
AMD's AIE NPU looks much more like a CPU with SIMD to me: it is VLIW with scalar and vector units, and AFAICT it is Turing-complete (within the memory limits, of course): https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Architecture
So Cranelift could probably target it, but you'd want to vectorize most of your code before feeding it in, or you'd end up with a small fraction of the performance due to just using scalar instructions.
VLIW means that the compiler has to do instruction scheduling, which Cranelift doesn't currently do.
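For readers new to the term: a VLIW target issues a fixed-width "bundle" of independent instructions per cycle, so the compiler must decide which instructions can share a bundle. A toy greedy bundler (purely illustrative; nothing like a production scheduler, and not Cranelift code) might look like:

```rust
use std::collections::HashSet;

// Toy instruction: writes one register, reads some registers.
struct Inst {
    def: u32,
    uses: Vec<u32>,
}

// Greedy VLIW bundling sketch: walk the instruction list in order and
// start a new bundle whenever the current one is full or the next
// instruction reads or redefines a register defined in this bundle.
fn bundle(insts: &[Inst], width: usize) -> Vec<Vec<usize>> {
    let mut bundles: Vec<Vec<usize>> = Vec::new();
    let mut cur: Vec<usize> = Vec::new();
    let mut cur_defs: HashSet<u32> = HashSet::new();
    for (i, inst) in insts.iter().enumerate() {
        let conflicts = inst.uses.iter().any(|u| cur_defs.contains(u))
            || cur_defs.contains(&inst.def);
        if cur.len() == width || conflicts {
            bundles.push(std::mem::take(&mut cur));
            cur_defs.clear();
        }
        cur.push(i);
        cur_defs.insert(inst.def);
    }
    if !cur.is_empty() {
        bundles.push(cur);
    }
    bundles
}

fn main() {
    // r2 = r0+r1; r3 = r0; r4 = r2+r3 (third depends on the first two).
    let insts = vec![
        Inst { def: 2, uses: vec![0, 1] },
        Inst { def: 3, uses: vec![0] },
        Inst { def: 4, uses: vec![2, 3] },
    ];
    let b = bundle(&insts, 4);
    // The first two are independent and share a bundle; the third waits.
    assert_eq!(b, vec![vec![0, 1], vec![2]]);
    println!("{:?}", b);
}
```

A real scheduler would also model functional-unit assignment and latencies; this only captures the "pack independent ops together" idea mentioned above.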
Chris Fallin said:
Nihal Pasham said:
- Could you elaborate on the kinds of optimizations or rewrites that the mid-end egraph rewrites cover when enabled (using rules in ISLE)? Specifically, does it include common optimizations such as Global Value Numbering (GVN), Loop Invariant Code Motion (LICM), and Dead Code Elimination (DCE), or does it encompass a more exhaustive list of optimizations?
Yes, it implements LICM, DCE, GVN, as well as an alias analysis that feeds into redundant-load elimination, and finally (probably most importantly) it runs a bunch of rewrite rules in cranelift/codegen/src/opts/, including constant propagation/constant folding as well as a lot of algebraic identities.
I’m probably wrong but don’t DCE and GVN (or equivalents) actually come ‘free’ with the transform to aegraph?
Yes, exactly, hence the mid-end implements them :-)
(LICM falls out of the way we do code placement during aegraph->CFG elaboration too; it requires intentional choices, but it is also not a separate pass)
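The "comes for free" point can be seen in a toy hash-consing sketch (purely illustrative; nothing here is Cranelift's real aegraph representation): because every pure node is deduplicated when interned, two identical expressions get the same id (GVN), and because final code is produced by walking only the nodes reachable from the roots, dead nodes are simply never emitted (DCE).

```rust
use std::collections::HashMap;

// A tiny hash-consed expression pool -- a toy stand-in for an aegraph.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    Const(i64),
    Add(usize, usize), // operands are ids into the pool
}

#[derive(Default)]
struct Pool {
    nodes: Vec<Node>,
    dedup: HashMap<Node, usize>,
}

impl Pool {
    // Interning deduplicates: identical pure expressions get
    // identical ids, which is exactly GVN.
    fn intern(&mut self, n: Node) -> usize {
        if let Some(&id) = self.dedup.get(&n) {
            return id;
        }
        let id = self.nodes.len();
        self.nodes.push(n.clone());
        self.dedup.insert(n, id);
        id
    }

    // "Elaboration": visit only nodes reachable from the roots.
    // Unreachable (dead) nodes are never visited -- DCE for free.
    fn reachable(&self, roots: &[usize]) -> Vec<usize> {
        let mut seen = vec![false; self.nodes.len()];
        let mut stack = roots.to_vec();
        while let Some(id) = stack.pop() {
            if seen[id] {
                continue;
            }
            seen[id] = true;
            if let Node::Add(a, b) = &self.nodes[id] {
                stack.push(*a);
                stack.push(*b);
            }
        }
        (0..self.nodes.len()).filter(|&i| seen[i]).collect()
    }
}

fn main() {
    let mut p = Pool::default();
    let c1 = p.intern(Node::Const(1));
    let c2 = p.intern(Node::Const(2));
    let s1 = p.intern(Node::Add(c1, c2));
    let s2 = p.intern(Node::Add(c1, c2)); // same value -> same id (GVN)
    assert_eq!(s1, s2);
    let dead = p.intern(Node::Const(99)); // never used by any root
    let live = p.reachable(&[s1]);
    assert!(!live.contains(&dead)); // dead node never elaborated (DCE)
    println!("live nodes: {:?}", live);
}
```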
Last updated: Dec 23 2024 at 12:05 UTC