Hi all -- I promised some folks a while back that I'd write up the new backend's design in a blog post or three. I've posted part 1 here if anyone is curious: https://cfallin.org/blog/2020/09/18/cranelift-isel-1/
(cc @Till Schneidereit ; I know I promised this all the way back in April or so; sorry for the delay :-) )
do you have a tweet I can retweet?
I don't do twitter (hi it's me stuck in 2006! may change sometime) but please feel free to tweet it!
@Chris Fallin The new backend is doing greedy search for isel, right? No cost model or dynamic programming, correct?
https://mobile.twitter.com/typesanitizer/status/1307361862325952518
@fitzgen Is there an explicit cost model for the instructions? If I am understanding correctly, this is doing greedy instruction selection (maximal munch), there's no dynamic programming involved.
- Varun Gandhi (@typesanitizer)
So nothing like burg and similar things?
Yes, exactly, the lowering for a given instruction and its operands is deterministic and maximal. We've had some discussions about heuristics to tune this, e.g. one rule that we tried (but quickly discarded) was to avoid merging operations with > 1 use. It turned out to be better at least for our patterns (on AArch64) to always be greedy, because the operations we're merging tend to be free (e.g. built-in register extends/shifts).
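To make that concrete, the shape of the greedy matching is roughly this (a minimal sketch with hypothetical types, not our actual lowering API):

```rust
// A minimal sketch of greedy (maximal-munch) matching, with hypothetical
// types; this is not the actual Cranelift lowering API.
#[derive(PartialEq)]
enum Opcode { IAdd, IShl, Other }

struct Node {
    opcode: Opcode,
    // Index of the instruction producing each input, if known.
    inputs: [Option<usize>; 2],
}

/// When lowering an add, greedily look through its second operand: if it is
/// produced by a shift, emit the ISA's add-with-shifted-operand form and
/// consider the shift merged. No cost model, no dynamic programming: the
/// first (maximal) pattern that matches wins.
fn lower_iadd(nodes: &[Node], idx: usize) -> &'static str {
    if let Some(src) = nodes[idx].inputs[1] {
        if nodes[src].opcode == Opcode::IShl {
            return "add rd, rn, rm, lsl #k"; // merged form
        }
    }
    "add rd, rn, rm" // plain form
}
```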
(it occurs to me that I should perhaps join tech-twitter someday so as not to rely on middlefolks such as yourself. Thanks for relaying the question!)
That's true (re burg). IMO burg is/was useful for very complex insn sets, most particularly VAX, but in the modern era (arm64, riscv, and to a large extent the subset of x64 that executes fast), it's pretty much irrelevant.
Note that the AArch64 built-in shift/extend isn't actually free: it adds an extra cycle of latency to the instruction and reduces the max throughput from 3 to 1 since it needs to use the complex ALU execution unit.
(source, Section 3.4)
So in theory if the result of the shift/extend is used by multiple instructions then you're better off leaving it as a separate instruction.
Interesting, I hadn't seen this before, thank you!
Thinking about this a bit more -- there are implications re: register pressure as well. If we can't spare a temp for the shifted/extended value (and the destination of the add or whatever is also a source, so we can't use it as a temp), then the built-in op saves us. Definitely something to consider more carefully if/when we have a more aggressively-optimized isel mode.
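Concretely, the heuristic we're circling might look something like this (a sketch only; hypothetical helper, not a shipped Cranelift rule):

```rust
/// Sketch of a fold-or-not heuristic combining the points above: with a
/// single use, fusing the shift/extend deletes an instruction outright; with
/// several uses, every fused copy re-pays the extra cycle and competes for
/// the single complex-ALU port, unless register pressure means we can't
/// afford a temp to hold the shifted value anyway.
fn should_fold_shift(use_count: usize, can_spare_temp: bool) -> bool {
    use_count == 1 || !can_spare_temp
}
```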
Another (more common) case is folding an extend into a load: you really don't want to have to duplicate the load! However I will admit that cases where a value is actually used multiple times are quite rare in practice, so it is possible that this won't matter much in real programs.
Could the register pressure issue be somewhat alleviated by adding rematerialization support to the register allocator?
Indeed. Re: loads, we consider those side-effecting now (both because of traps, and because we've punted on the memory-ordering question until we can think more carefully about it) so we won't ever duplicate. But it's certainly a matter to consider once we adapt e.g. x86 to use memory operands directly when it sees add-from-load.
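Roughly, the gating condition for that fold would be something like this (made-up types, just to illustrate):

```rust
// Sketch of the load case under those constraints, with made-up types.
#[derive(PartialEq)]
enum Kind { Load, Other }

struct Value {
    kind: Kind,
    uses: usize,
}

/// Fold `extend(load)` into a single extending load only when the load has
/// exactly one use: loads are treated as side-effecting (traps, and the
/// punted-on memory-ordering question), so they must never be duplicated.
fn can_fold_extend_into_load(load: &Value) -> bool {
    load.kind == Kind::Load && load.uses == 1
}
```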
Re-mat is something that could help in general, yup; another thing on the list that we haven't gotten to quite yet...
(There are two levels, the first being direct-reload from a spillslot when an ISA supports mem operands; the second is the more general case where we track expressions of pure operators to some depth. Want to get to at least the former, and maybe a simple form of the latter)
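In sketch form, the two levels might be classified like this (made-up types; neither level is what regalloc.rs implements today):

```rust
// Sketch of the two rematerialization levels described above, with made-up
// types; neither is what regalloc.rs implements today.
enum RematKind {
    /// Level 1: the ISA accepts memory operands, so a spilled value can be
    /// "reloaded" by folding the spill slot directly into the user
    /// (e.g. `add rax, [rsp + 16]` on x64).
    DirectReload { spill_slot: u32 },
    /// Level 2: the value is a pure, cheap expression (a constant, or a
    /// shallow tree of pure operators), so recompute it at each use instead
    /// of spilling at all.
    Recompute { depth: u8 },
    /// Otherwise, fall back to an ordinary spill/reload pair.
    Spill,
}

fn classify(user_takes_mem_operand: bool, is_pure: bool, expr_depth: u8, slot: u32) -> RematKind {
    if user_takes_mem_operand {
        RematKind::DirectReload { spill_slot: slot }
    } else if is_pure && expr_depth <= 1 {
        RematKind::Recompute { depth: expr_depth }
    } else {
        RematKind::Spill
    }
}
```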
Number 12 on HN right now
I just finished reading the post (I noticed it floating around reddit) and I must say, it's very well done! I very much enjoyed reading it :D
Happy to hear it! Now I have to live up to this on the followups :-)
Nice work, and an excellent write-up to boot. Thanks :)
I bumped into the "linked lists in Rust" problem trying to model backend IR, so I appreciate seeing there's a "better" way for that too. Still new to Rust but from what I've seen cranelift is a work of art. Nice to see it getting even better :heart_eyes:
Use-counting is a nice touch as well... wouldn't mind dropping our own fixed-point backend DCE... (though for our stuff the real issue is RA/scheduling)
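For anyone following along, use-counting DCE amounts to roughly this (a generic sketch, not Cranelift's actual code):

```rust
// Generic sketch of use-counting DCE (not Cranelift's actual code): when an
// instruction's use count hits zero and it has no side effects, delete it
// and decrement the counts of its inputs, cascading in a single worklist
// pass instead of iterating a liveness analysis to a fixed point.
struct DceInst {
    uses: u32,
    side_effecting: bool,
    inputs: Vec<usize>,
}

fn dce(insts: &mut [DceInst], initially_dead: Vec<usize>) {
    let mut worklist = initially_dead;
    while let Some(i) = worklist.pop() {
        if insts[i].uses == 0 && !insts[i].side_effecting {
            // "Delete" the instruction by detaching its inputs, then release
            // each input's use count; anything that drops to zero cascades.
            for input in std::mem::take(&mut insts[i].inputs) {
                insts[input].uses -= 1;
                if insts[input].uses == 0 {
                    worklist.push(input);
                }
            }
        }
    }
}
```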
Thanks!
Eschewing SSA at this level allows us to avoid the overhead of maintaining its invariants, and maps more closely to the real machine. Lowerings for instructions are allowed to, e.g., use a destination register as a temporary before performing a final write into it. If we required SSA form, we would have to allocate a temporary in this case and rely on the register allocator to coalesce it back to the same register, which adds compile-time overhead
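To illustrate the kind of lowering that passage allows (illustrative pseudo-lowering, not actual Cranelift VCode):

```rust
// Illustrative pseudo-lowering (not actual Cranelift VCode) of `y = x * 5`
// on an AArch64-like machine: `dest` is written twice, serving as its own
// temporary, which SSA would forbid.
//
// Under an SSA invariant the first write would need a fresh vreg, and we'd
// lean on the register allocator to coalesce it back into `dest`.
fn lower_mul5(emit: &mut Vec<String>, dest: &str, x: &str) {
    emit.push(format!("lsl {dest}, {x}, #2"));     // dest = x << 2 (dest as temp)
    emit.push(format!("add {dest}, {dest}, {x}")); // dest = dest + x (final write)
}
```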
Part of me was hoping to see one of those fancy polynomial-time SSA-based optimal RA algorithms c:
Furthermore, a greedy allocator that traverses the dominance tree in pre-order always finds an optimal coloring, if there is no aliasing of registers and no pre-colored registers
There's the rub :(
@Alyssa Rosenzweig Actually the old cranelift backend does use SSA-based register allocation.
Eh?
https://github.com/bytecodealliance/wasmtime/tree/main/cranelift/codegen/src/regalloc
As opposed to the new backend, which uses the non-SSA-based https://github.com/bytecodealliance/regalloc.rs.
Looks like I have some "light" reading to do then. Thanks for the links!
There's also something to be said about register pressure / spills having more to do with scheduling than RA, for which the literature isn't quite there (at least on the GPU side). But I digress.
@Amanieu Alright, reading through the first link has successfully (re)convinced me to let other people do out-of-SSA and do RA after, thanks
I'm curious, what in particular convinced you?
How many gotchas there are absolutely everywhere for real machines. Something I was prepared for, but it was underscored by seeing a real production implementation.
I have a regalloc algorithm implemented in a few hundred lines of C, with excellent support for aliasing and precolouring. It can't claim to be optimal in any sense, but the order-of-magnitude difference in complexity is... noted.
(I admit that's an unfair comparison, since out-of-SSA is happening elsewhere with a sufficiently complicated algorithm, and the backend still drives that with its own support code. And liveness analysis is separate since it's used for DCE too. So maybe the gap isn't as dramatic as it looks.)
[Maybe the real takeaway is that RA is complicated no matter how you try to hide the complexity... and the cranelift implementations are just more honest about that :sweat_smile: ]
If nothing else - I appreciate seeing what production Rust compilers look like as I [redacted]