rfcs / PR #41 Add "A Cranelift Assembler" · rfc-notifications

Stream: rfc-notifications

Topic: rfcs / PR #41 Add "A Cranelift Assembler"

RFC notifications bot (Dec 10 2024 at 23:53):

abrown opened PR #41 from abrown:cranelift-assembler to bytecodealliance:main:

Rendered

RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake submitted PR review.

RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake created PR review comment:

I think it would be useful for these to be trait methods that way generic code isn't required to use the full enum Inst along with the probable code bloat (from having a match arm for every instruction even if we're only using one or a few) that implies.

RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake edited PR review comment.

RFC notifications bot (Dec 11 2024 at 01:36):

abrown has marked PR #41 as ready for review.

RFC notifications bot (Dec 11 2024 at 01:48):

cfallin submitted PR review:

Thanks a ton for writing this up! It's been really neat to see the DSL come together; I think it's a very clean and powerful way of spec'ing the ISA we support, should reduce potential for errors, and will result in a more generic and reusable assembler library.

I may have more thoughts later but I wanted to write out my thoughts on what I see as the most important point below:

RFC notifications bot (Dec 11 2024 at 01:48):

cfallin created PR review comment:

This is a difficult problem to be sure. I want to write out our understanding (or at least "my understanding of our understanding") of the situation here, so the tradeoffs are clear at least:

Cranelift lowers into an IR that looks almost like machine code, except for two aspects: register operands are virtual register numbers, and register operands are in SSA form. This is because our regalloc operates on SSA input with constraints (and this for good reasons -- e.g. it avoids the need for a "defensive copy" on every two-reg instruction's src/dst, which we hope is elided later).

We call this pre-regalloc form "VCode" and, properly speaking, it is a separate IR -- not CLIF, not machine code. (Lowering takes CLIF to VCode and regalloc takes VCode to "machine code", which we represent in the VCode struct as well but that's not important here.)

To avoid significant duplication, because VCode is so close to machine code (aside from those two differences above), we overload our current "machine instruction" definitions to (i) contain Regs that can hold vregs or real regs, and (ii) always be in three-reg form, even on x86, with separate "dest-side" operands for R+W real operands. In other words, we've basically defined an IR that is a union of machine code and VCode so that we can use one representation.

This union-based approach has performance advantages as well: because one IR is almost another, and because of the way the overloading is done (in individual operand slots), we can edit in place to turn one into another.

So here's the difficulty (and here is where my analysis and opinion starts): an assembler library that wants to stay as close as possible to "machine code" will naturally not have an instruction representation that is a superset of machine code and VCode. But if we only have machine instructions, then Cranelift needs to keep VCode around with an instruction enum -- in other words, we aren't deleting any code, we're just adding another instruction representation to maintain, and a lowering step from one to the other, and the performance cost of that too. IMHO this is a nonstarter and means that we need to design the library to accomodate "in-place representation" of pre-regalloc state.

There are two general paths here: generic and specific.

The generic path is to have a pluggable notion of operand, with a trait that allows the assembler library to get the physical register name out of it once it's in its final state. An operand for x86 might actually hold two vreg numbers (for a read-modify-write inst, i.e. one use for the "input side" and one def for the "output side", constrained together) or one; and will hold a sum type that is either a vreg or real reg. But that's the concern of the user of the library; someone else could equally plug in a u8 with a stub trait impl and call it a day (and we can provide this default impl).

The upside of this approach is that it is clean and keeps Cranelift/VCode/regalloc abstractions out of the assembler library, as you state here; and makes the library more reusable for others too. (E.g., Winch probably doesn't care about vreg numbers or SSA use/def distinction.) The downside is that it adds trait/type-param/constraint soup everywhere.

Specific: choose one implementation ("two u32 vregs for R/W, one u32 vreg otherwise" seems OK). Basically this is the above but with one trait impl inlined in, if we decide the trouble isn't worth it.

Those are the two main options I see; I can't see how we are able to delete the existing VCode inst enum otherwise, since fundamentally we need to store the SSA representation of instructions somehow.

I know we explored the generic thing earlier and decided it was getting a little gnarly, but as I write out the above I seem to be convincing myself down that path again -- maybe we could try it now that you've gotten over the hurdle of an initial PoC? Would be interested in what others think as well!

RFC notifications bot (Dec 11 2024 at 21:18):

alexcrichton submitted PR review.

RFC notifications bot (Dec 11 2024 at 21:18):

alexcrichton created PR review comment:

and (ii) always be in three-reg form, even on x86, with separate "dest-side" operands for R+W real operands

To ask a question about this, IIRC this was because register allocation was so much easier when read/write operands were removed. Is that still the case? For example does regalloc2 (and regalloc3?) not support register allocation on the "native x86 form"?

RFC notifications bot (Dec 11 2024 at 21:21):

alexcrichton submitted PR review.

RFC notifications bot (Dec 11 2024 at 21:21):

alexcrichton created PR review comment:

One thing I'd be curiuos to see, especially in connection with the verbosity point called out below, is how this would integrate into the lowering of band itself at a high level. For example the lowerings we have today feel "right" to me in that they use (x64_and ...) as the result. I think it'd be reasonable to rename x64_and to x64_andl_mi_xsbl ... or such to avoid shoehorning so many instructions into one shape (which I think is a good idea), but with this ISLE snippet I don't know what the output of AssemblerGprMem is and how that'd get integrated into the band lowering.

RFC notifications bot (Dec 11 2024 at 21:26):

cfallin submitted PR review.

RFC notifications bot (Dec 11 2024 at 21:26):

cfallin created PR review comment:

Yes, absolutely, and we should not go back. (There was a huge effort, months of work mainly by @elliottt, to finish the SSA transition because of this.)

I touched on the main point above but basically, if we register-allocate on native x86 form, we lean very heavily on bundle-merging and move-elision to get back to ideal form, because we have to make a copy of every src1 operand before the instruction. Moving back to this approach would lead to a significant regression in compile time and also probably code-quality issues where our merging heuristics aren't good enough.

We also simplify many other aspects of the register allocator. For example, with SSA we can now do "overlapping liveranges" (because a value is defined only once, we can reason about multiple copies more easily), which unlocks other performance ideas (already, and more in the future potentially).

For what it's worth, our approach was inspired by IonMonkey's register allocator, which also operates on strict SSA with special constraints to deal with the x86-ism of "the output reg must also be this input reg, and clobbers it". We previously thought we should do the native thing; that was regalloc.rs; we learned from it and then reverted to this approach.

Separately, I think there's a question of how to make an assembler library most useful. It could be very opinionated about what is acceptable in an instruction, or it could offer flexibility in operand formats; I suspect Cranelift is not the only potential user with special requirements around what goes into an operand, or metadata around it. To maximize its usefulness, IMHO it should offer a pluggable T: Operand and then we can do all the SSA stuff on the Cranelift side.

RFC notifications bot (Dec 11 2024 at 21:31):

alexcrichton submitted PR review.

RFC notifications bot (Dec 11 2024 at 21:31):

alexcrichton created PR review comment:

Ok cool mostly wanted to confirm that "must be ssa form" is still a hard constraint.

I understand the desire to have the assembler library be as simple as possible and as close to the spec as possible, but I also agree that this constraint will force our hand in doing something above-and-beyond the "Rust translation of the intel manual". I don't personally have an opinion on T: Operand vs "just inline T = CraneliftX64Operand everywhere".

RFC notifications bot (Dec 11 2024 at 23:02):

cfallin submitted PR review.

RFC notifications bot (Dec 11 2024 at 23:02):

cfallin created PR review comment:

One slight refinement to T: Operand that I want to write down here as it occurs to me: we probably want something like Inst<R: ReadOperand, W: WriteOperand> or Inst<R: ReadOperand, W: WriteOperand, M: ModOperand> so that the user can plug in different things for each: e.g. the Cranelift x64 backend wants one u32 for R and W and two u32s for M. We could also potentially integrate that with the regalloc visitor (trait Visitor has fn visit_read(&mut R), fn visit_write(&mut W), fn visit_mod(&mut M)).

RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen submitted PR review:

I am in favor of the overall plan and goals, but I do have concerns, nitpicks, and bikesheds on some details.

Two things that could be nice to plan for and make sure we aren't accidentally making harder than they otherwise should be, even if we don't intend to implement their support in the DSL initially, are:

Flagging which instructions are available on 32-bit x86, not just x86_64 and whatever ISA extensions, to make an eventual x86 backend that much easier and able to reuse all these instruction definitions, instead of needing to subset them.

A way to attach ISLE specs to the instructions in the DSL so we can better integrate with the ISLE verification effort.

Finally, thanks for investigating and prototyping what a better mach inst looks like! I am excited to clean this stuff up.

RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

One thing I don't see discussed (thus far, at least) is how the DSL is parsed and how this code is generated. We need to be careful about adding compile time overhead to Cranelift (and Wasmtime, and every downstream crate using Cranelift).

For example, we should not encode the DSL in toml or yaml and require a new build-dep for parsing that markup language. Nor should we write our own parser for the DSL that has to be built and run from build.rs.

The only realistic approach that I am aware of that satisfies this constraint is a for_each_instruction!-style macro, or macros, similar to what Pulley does for its instruction set. This reuses the Rust compiler's parsing facilities to avoid spending compile/build time on DSL parsers and we can also generate only what is needed for different use cases by cfging different invocations of the macro on and off as necessary (e.g. if we only need the encoders and not the pretty printers for a particular build configuration).

RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

bikeshed: If andb isn't even enough on its own, and we have to break ties by concatenating additional metadata, then can we use ~intel syntax instead of ~att syntax? Regardless of flavor any one person is used to, using intel syntax makes cross referencing with the official manual that much easier.

RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

I think ideally we want to move everything over towards something like whatever we end up at here. I don't think everything needs to use exactly the same machinery, but the motivation for x64 applies pretty much to every backend.

We also don't have to do the migration en masse, all at once, and before experimenting with just the x64 backend. We can do things in an incremental fashion and learn what works well and what doesn't as we go.

RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

I think this is also touching on the tension/impedance mismatch between "assembler library that directly reflects x86_64" and "final machine-specific backend IR for Cranelift". At the ISLE lowering rules level, I don't want to think about formats at all. I want to think about whole instructions and that's it. At the register operand collection and encoding levels, I don't care so much about whole instructions, and formats can be great shortcuts for deduplicating code.

I think we should prioritize and optimize for ISLE lowering rules. The latter things will ideally be automatically generated from the DSL describing the instructions, and so the duplication aspect doesn't affect our maintenance burden (although it admittedly could affect Cranelift's run time performance, the time it takes to build Cranelift, and the size of a Cranelift binary; nonetheless I'm not too concerned in this case).

RFC notifications bot (Dec 16 2024 at 16:19):

fitzgen submitted PR review.

RFC notifications bot (Dec 16 2024 at 16:19):

fitzgen created PR review comment:

If core assembler library just contains the DSL, and the DSL isn't actually generating an instruction enum itself, then it seems like the assembler library can punt these questions to its downstream users? And then we can generate an instruction enum in Cranelift that supports the SSA stuff, and other users can generate an instruction enum (if they even need one) that does not. In Cranelift, the generated encoding functions will need to dynamically assert that the first operand and destination operand are the same register for the two-operand instructions, but we have to do that today already.

This approach avoids adding generics to the instruction enum itself. (Depending on how we do encoding functions and what not, we may need generics on those functions and we may need to implement one or more traits for the instruction enum, but the enum itself need not be generic over some T: Operand).

RFC notifications bot (Dec 16 2024 at 16:51):

cfallin submitted PR review.

RFC notifications bot (Dec 16 2024 at 16:51):

cfallin created PR review comment:

If core assembler library just contains the DSL, and the DSL isn't actually generating an instruction enum itself, then it seems like the assembler library can punt these questions to its downstream users? And then we can generate an instruction enum in Cranelift that supports the SSA stuff, and other users can generate an instruction enum (if they even need one) that does not. In Cranelift, the generated encoding functions will need to dynamically assert that the first operand and destination operand are the same register for the two-operand instructions, but we have to do that today already.

The way that Andrew's prototype works at least, the instruction enum is generated by the assembler library. And this seems pretty essential to its value IMHO -- we don't want to be in the business of maintaining a separate list of instructions for each target in Cranelift.

Said another way: the underlying situation is fundamentally parameterized: Cranelift wants "an enum for x86 instructions, except with these details for operands". Knowledge of x86 should live in the assembler library (else it would be only a meta-assembler library to "build your own assembler"); and knowledge of operand details should live in the embedder. That calls out for pluggable details, i.e. generics.

Briefly unfolding that "meta-assembler" thought: I think what you might have in mind is a library with the DSL abstractions and framework for emission, but all the instruction definitions living in the embedder? That is also a design point but not one that is globally best even within our own repo IMHO: we want Winch to be able to benefit from the assembler's x86 knowledge (as it does from Cranelift's MachInsts today), so we want the assembler layer itself to be self-contained.

RFC notifications bot (Dec 16 2024 at 16:52):

cfallin edited PR review comment.

RFC notifications bot (Dec 16 2024 at 17:46):

alexcrichton submitted PR review.

RFC notifications bot (Dec 16 2024 at 17:46):

alexcrichton created PR review comment:

For example, we should not encode the DSL in toml or yaml and require a new build-dep for parsing that markup language. Nor should we write our own parser for the DSL that has to be built and run from build.rs.

Personally I'd push back on this and say that whether or not this is acceptable is a function of the various other tradeoffs in play. While we should be cognizant of compile time I don't think we should take a policy of "this can't land unless it has 0 compile-time regressions" and instead balance the compile time with the maintainability win (and other motivations) in this RFC

RFC notifications bot (Dec 16 2024 at 17:50):

alexcrichton submitted PR review.

RFC notifications bot (Dec 16 2024 at 17:50):

alexcrichton created PR review comment:

@fitzgen it sounds like you're advocating for a Pulley-style approach where all instructions are part of a macro invocation so callers can customize the generated code by customizing what the macro generates, is that right? If so personally I don't think that's the right approach for platforms other than Pulley which have non-Rust definitions. There's also tradeoffs to the macro-based approach in Rust so I don't think it's a clear win either. Personally I'd find a "this is just an assembler library" approach with generics more palatable than a for_each_x86_op!(...) macro generated from Intel's definitions of instructions.

RFC notifications bot (Dec 17 2024 at 00:32):

abrown submitted PR review.

RFC notifications bot (Dec 17 2024 at 00:32):

abrown created PR review comment:

Yes, agreed that we want everything to easily cross-reference. If there's anything that isn't like that it's probably due to how I set up Capstone, which we end up mimic-ing. Is there something specific here you were thinking of? Operand ordering? Suffixes?

RFC notifications bot (Dec 17 2024 at 00:36):

abrown submitted PR review.

RFC notifications bot (Dec 17 2024 at 00:36):

abrown created PR review comment:

Maybe it's not super clear in the text, @fitzgen, but there is no new parsing or markup language involved here. I think I would have preferred a true DSL but, based on your initial recommendation, I just use Rust code in a meta crate much like we do in cranelift-codegen-meta. So the DSL is made up of Rust helper functions that create the "AST" — no additional build dependencies required.

RFC notifications bot (Dec 17 2024 at 00:56):

abrown submitted PR review.

RFC notifications bot (Dec 17 2024 at 00:56):

abrown created PR review comment:

I'm not sure I completely understand what you mean: let's talk about "categories." x64_and is a category of instructions, not an x64 instruction. x64_andb, despite being more specific, is also a category of instructions. x64_andb_mi is an actual instruction. I interpret your comment about formats to actually echo something @cfallin and I have talked about, which is that actually in ISLE we _do_ want to use categories, e.g., x64_and. Is that what you mean?

RFC notifications bot (Dec 17 2024 at 01:16):

abrown submitted PR review.

RFC notifications bot (Dec 17 2024 at 01:16):

abrown created PR review comment:

Yes, this is something discussed below in the "verbosity" bullet. Let me provide more detail by just focusing in on the 8-bit version of CLIF and; in cranelift-codegen, we have three possibilities for each of the operands, x and y, to this CLIF instruction. I wrote out all the possible options and thought through how we might lower them:
| size | ~x~ | ~y~ | asm             | notes                      | rule |
|------+-----+-----+-----------------+----------------------------+------|
| I8   | Gpr | Gpr | andb_rm/andb_mr |                            |    0 |
| I8   | Gpr | Mem | andb_rm         |                            |    1 |
| I8   | Gpr | Imm | andb_mi/andb_i? |                            |    2 |
| I8   | Mem | Gpr | andb_rm flipped |                            |    3 |
| I8   | Mem | Mem |                 | No, cannot encode this     |      |
| I8   | Mem | Imm | andb_mi         | No, x's memory is modified |      |
| I8   | Imm | Gpr | andb_mi flipped |                            |    4 |
| I8   | Imm | Mem | andb_mi flipped | No, x's memory is modified |      |
| I8   | Imm | Imm |                 | No, cannot encode this     |      |
Some options we just can't encode in x64. Other's we can't yet (we need to be pattern-match a read+write to a single memory location). But for the five options we _can_ encode, let's write out some ISLE rules:
(rule 0 (lower (has_type $I8 (band (ensure_in_vreg x) (ensure_in_vreg y))))
  (x64_andb_rm x y))
(rule 1 (lower (has_type $I8 (band (ensure_in_vreg x) (sinkable_load y))))
  (x64_andb_rm x y))
(rule 2 (lower (has_type $I8 (band (ensure_in_vreg x) (simm32_from_value y))))
  (x64_andb_mi x y))
(rule 3 (lower (has_type $I8 (band (sinkable_load x) (ensure_in_vreg y))))
  (x64_andb_rm y x))
(rule 4 (lower (has_type $I8 (band (simm32_from_value x) (ensure_in_vreg y))))
  (x64_andb_mi y x))
Let's pretend that ensure_in_vreg means "this is a Gpr," sinkable_load means "this is a Mem," and ~simm32_from_value~ means "this is an Imm" (to reuse some existing ISLE helpers). Turns out there are a few other patterns we can probably encode as well that additionally pattern-match on sign extension.

Two observations from this:

if we lower at this level, it is _very_ verbose (multiply those rules times more sizes!); this is what I'm requesting discussion on in the "verbose" section below

this is also very clear: one can see that we just can't lower certain patterns as well as what is the precise instruction emitted for each pattern that we can

Hope all of that connects the dots for how one might verbosely lower to the assembler instructions. But, of course, if we want to retain the existing meta-rules like x64_and then we can probably generate those somehow as well. I'm fine with either approach.

RFC notifications bot (Dec 17 2024 at 01:16):

abrown edited PR review comment.

RFC notifications bot (Feb 28 2025 at 18:19):

abrown commented on PR #41:

Should we consider merging this (i.e., start the "motion to finalize with a disposition to merge") and, if so, what should I update in the document? I think the details at the ISLE level are different now than when originally written and will likely continue to change, so perhaps this should articulate goals more than details?

RFC notifications bot (Feb 28 2025 at 18:29):

cfallin commented on PR #41:

For what it's worth, for previous large Cranelift features where we wrote RFCs (e.g., ISLE, or the mid-end egraphs framework) the RFC was more of a snapshot of proposed design and we were fine to merge once discussion quelled. If you want to do updates here based on what we've discovered since December then that seems reasonable, I suppose, and probably better than merging something we know is now out-of-date, but this is just to say that it wouldn't be out-of-place if not (in the sense that other RFCs have not been updated since).

Last updated: Apr 14 2025 at 13:10 UTC