Stream: cranelift

Topic: blog post on new backend


view this post on Zulip Chris Fallin (Sep 18 2020 at 15:55):

Hi all -- I promised some folks a while back that I'd write up the new backend's design in a blog post or three. I've posted part 1 here if anyone is curious: https://cfallin.org/blog/2020/09/18/cranelift-isel-1/

view this post on Zulip Chris Fallin (Sep 18 2020 at 15:55):

(cc @Till Schneidereit ; I know I promised this all the way back in April or so; sorry for the delay :-) )

view this post on Zulip fitzgen (he/him) (Sep 18 2020 at 16:05):

do you have a tweet I can retweet?

view this post on Zulip Chris Fallin (Sep 18 2020 at 16:05):

I don't do twitter (hi it's me stuck in 2006! may change sometime) but please feel free to tweet it!

view this post on Zulip fitzgen (he/him) (Sep 19 2020 at 19:54):

@Chris Fallin The new backend is doing greedy search for isel right? No cost model and dynamic programming correct?

view this post on Zulip fitzgen (he/him) (Sep 19 2020 at 19:55):

https://mobile.twitter.com/typesanitizer/status/1307361862325952518

@fitzgen Is there an explicit cost model for the instructions? If I am understanding correctly, this is doing greedy instruction selection (maximal munch), there's no dynamic programming involved.

- Varun Gandhi (@typesanitizer)

view this post on Zulip fitzgen (he/him) (Sep 19 2020 at 19:55):

So nothing like burg and similar things

view this post on Zulip Chris Fallin (Sep 19 2020 at 20:35):

Yes, exactly, the lowering for a given instruction and its operands is deterministic and maximal. We've had some discussions about heuristics to tune this, e.g. one rule that we tried (but quickly discarded) was to avoid merging operations with > 1 use. It turned out to be better at least for our patterns (on AArch64) to always be greedy, because the operations we're merging tend to be free (e.g. built-in register extends/shifts).

view this post on Zulip Chris Fallin (Sep 19 2020 at 20:36):

(it occurs to me that I should perhaps join tech-twitter someday so as not to rely on middlefolks such as yourself. Thanks for relaying the question!)

view this post on Zulip Julian Seward (Sep 19 2020 at 20:38):

That's true (re burg). IMO burg is/was useful for very complex insn sets, most particularly VAX, but in the modern era (arm64, riscv, and to a large extent the subset of x64 that executes fast), it's pretty much irrelevant.

view this post on Zulip Amanieu (Sep 19 2020 at 21:29):

Note that the AArch64 built-in shift/extend isn't actually free: it adds an extra cycle of latency to the instruction and reduces the max throughput from 3 to 1 since it needs to use the complex ALU execution unit.

view this post on Zulip Amanieu (Sep 19 2020 at 21:29):

source Section 3.4

view this post on Zulip Amanieu (Sep 19 2020 at 21:31):

So in theory if the result of the shift/extend is used by multiple instructions then you're better off leaving it as a separate instruction.

view this post on Zulip Chris Fallin (Sep 19 2020 at 21:48):

Interesting, I hadn't seen this before, thank you!

view this post on Zulip Chris Fallin (Sep 19 2020 at 21:51):

Thinking about this a bit more -- there are implications re: register pressure as well. If we can't spare a temp for the shifted/extended value (and the destination of the add or whatever is also a source, so we can't use it as a temp), then the builtin op saves us. Definitely something to consider more carefully if/when we have a more aggressively-optimized isel mode.

view this post on Zulip Amanieu (Sep 19 2020 at 21:52):

Another (more common) case is folding an extend into a load: you really don't want to have to duplicate the load! However I will admit that cases where a value is actually used multiple times are quite rare in practice, so it is possible that this won't matter much in real programs.

view this post on Zulip Amanieu (Sep 19 2020 at 21:56):

Could the register pressure issue be somewhat alleviated by adding rematerialization support to the register allocator?

view this post on Zulip Chris Fallin (Sep 19 2020 at 21:56):

Indeed. Re: loads, we consider those side-effecting now (both because of traps, and because we've punted on the memory-ordering question until we can think more carefully about it) so we won't ever duplicate. But it's certainly a matter to consider once we adapt e.g. x86 to use memory operands directly when it sees add-from-load.

view this post on Zulip Chris Fallin (Sep 19 2020 at 21:57):

Re-mat is something that could help in general, yup; another thing on the list that we haven't gotten to quiet yet...

view this post on Zulip Chris Fallin (Sep 19 2020 at 21:59):

(There are two levels, the first being direct-reload from a spillslot when an ISA supports mem operands; the second is the more general case where we track expressions of pure operators to some depth. Want to get to at least the former, and maybe a simple form of the latter)

view this post on Zulip bjorn3 (Sep 21 2020 at 12:01):

image.png

Number 12 on HN right now

view this post on Zulip mental (Sep 21 2020 at 20:15):

I just finished reading the post (I noticed it floating around reddit) and I must say, it's very well done! I very much enjoyed reading it :D

view this post on Zulip Chris Fallin (Sep 21 2020 at 20:25):

Happy to hear it! Now I have to live up to this on the followups :-)

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 02:05):

Nice work, and an excellent write-up to boot. Thanks :)

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 02:11):

I bumped into the "linked lists in Rust" problem trying to model backend IR, so I appreciate seeing there's a "better" way for that too. Still new to Rust but from what I've seen cranelift is a work of art. Nice to see it getting even better :heart_eyes:

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 02:15):

Use-counting is a nice touch as well... wouldn't mind dropping our own fixed-point backend DCE... (though for our stuff the real issue is RA/scheduling)

view this post on Zulip Chris Fallin (Oct 10 2020 at 05:09):

Thanks!

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 15:31):

Eschewing SSA at this level allows us to avoid the overhead of maintaining its invariants, and maps more closely to the real machine. Lowerings for instructions are allowed to, e.g., use a destination register as a temporary before performing a final write into it. If we required SSA form, we would have to allocate a temporary in this case and rely on the register allocator to coalesce it back to the same register, which adds compile-time overhead

Part of me was hoping to see one of those fancy polynomial-time SSA-based optimal RA algorithms c:

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 15:41):

Furthermore, a greedy allocator that traverses the dominance tree in pre-order always finds an optimal coloring, if there is no aliasing of registers and no pre-colored registers

There's the rub :(

view this post on Zulip Amanieu (Oct 10 2020 at 16:15):

@Alyssa Rosenzweig Actually the old cranelift backend does use SSA-based register allocation.

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:15):

Eh?

view this post on Zulip Amanieu (Oct 10 2020 at 16:16):

https://github.com/bytecodealliance/wasmtime/tree/main/cranelift/codegen/src/regalloc

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

view this post on Zulip Amanieu (Oct 10 2020 at 16:18):

As opposed to use the new backend that uses https://github.com/bytecodealliance/regalloc.rs which is not SSA-based.

Modular register allocator algorithms. Contribute to bytecodealliance/regalloc.rs development by creating an account on GitHub.

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:18):

Looks like I have some "light" reading to do then. Thanks for the links!

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:27):

There's also something to be said about register pressure / spills having more to do with scheduling than RA, for which the literature isn't quite there (at least on the GPU side). But I digress.

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:49):

@Amanieu --Alright, reading through the first link has successfully (re)convinced me to let other people do out-of-SSA and do RA after, thanks

view this post on Zulip Amanieu (Oct 10 2020 at 16:50):

I'm curious, what in particular convinced you?

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:54):

How many gotchas there are absolutely everywhere for real machines. Something I was prepared for, but underscored seeing a real production implementation.

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:55):

I have a regalloc algorithm implemented in a few hundred lines of C, with excellent support for aliasing and precolouring. It can't claim to be optimal in any sense, but the order-of-magnitude difference in complexity is... noted.

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 16:56):

(I admit that's an unfair comparison, since out-of-SSA is happening elsewhere with a sufficiently complicated algorithm, and the backend still drives that with its own support code. And liveness analysis is separate since it's used for DCE too. So maybe the gap isn't as aggressive as it looks.)

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 17:01):

[Maybe the real takeaway is that RA is complicated no matter how you try to hide the complexity... and the cranelift implementations are just more honest about that :sweat_smile: ]

view this post on Zulip Alyssa Rosenzweig (Oct 10 2020 at 17:08):

If nothing else - I appreciate seeing how production Rust compilers look like as I [redacted]


Last updated: Nov 22 2024 at 16:03 UTC