Stream: rfc-notifications

Topic: rfcs / PR #41 Add "A Cranelift Assembler"


view this post on Zulip RFC notifications bot (Dec 10 2024 at 23:53):

abrown opened PR #41 from abrown:cranelift-assembler to bytecodealliance:main:

Rendered

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake created PR review comment:

I think it would be useful for these to be trait methods that way generic code isn't required to use the full enum Inst along with the probable code bloat (from having a match arm for every instruction even if we're only using one or a few) that implies.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:16):

programmerjake edited PR review comment.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:36):

abrown has marked PR #41 as ready for review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:48):

cfallin submitted PR review:

Thanks a ton for writing this up! It's been really neat to see the DSL come together; I think it's a very clean and powerful way of spec'ing the ISA we support, should reduce potential for errors, and will result in a more generic and reusable assembler library.

I may have more thoughts later but I wanted to write out my thoughts on what I see as the most important point below:

view this post on Zulip RFC notifications bot (Dec 11 2024 at 01:48):

cfallin created PR review comment:

This is a difficult problem to be sure. I want to write out our understanding (or at least "my understanding of our understanding") of the situation here, so the tradeoffs are clear at least:

So here's the difficulty (and here is where my analysis and opinion starts): an assembler library that wants to stay as close as possible to "machine code" will naturally not have an instruction representation that is a superset of machine code and VCode. But if we only have machine instructions, then Cranelift needs to keep VCode around with an instruction enum -- in other words, we aren't deleting any code, we're just adding another instruction representation to maintain, and a lowering step from one to the other, and the performance cost of that too. IMHO this is a nonstarter and means that we need to design the library to accomodate "in-place representation" of pre-regalloc state.

There are two general paths here: generic and specific.

Those are the two main options I see; I can't see how we are able to delete the existing VCode inst enum otherwise, since fundamentally we need to store the SSA representation of instructions somehow.

I know we explored the generic thing earlier and decided it was getting a little gnarly, but as I write out the above I seem to be convincing myself down that path again -- maybe we could try it now that you've gotten over the hurdle of an initial PoC? Would be interested in what others think as well!

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:18):

alexcrichton submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:18):

alexcrichton created PR review comment:

and (ii) always be in three-reg form, even on x86, with separate "dest-side" operands for R+W real operands

To ask a question about this, IIRC this was because register allocation was so much easier when read/write operands were removed. Is that still the case? For example does regalloc2 (and regalloc3?) not support register allocation on the "native x86 form"?

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:21):

alexcrichton submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:21):

alexcrichton created PR review comment:

One thing I'd be curiuos to see, especially in connection with the verbosity point called out below, is how this would integrate into the lowering of band itself at a high level. For example the lowerings we have today feel "right" to me in that they use (x64_and ...) as the result. I think it'd be reasonable to rename x64_and to x64_andl_mi_xsbl ... or such to avoid shoehorning so many instructions into one shape (which I think is a good idea), but with this ISLE snippet I don't know what the output of AssemblerGprMem is and how that'd get integrated into the band lowering.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:26):

cfallin submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:26):

cfallin created PR review comment:

Yes, absolutely, and we should not go back. (There was a huge effort, months of work mainly by @elliottt, to finish the SSA transition because of this.)

I touched on the main point above but basically, if we register-allocate on native x86 form, we lean very heavily on bundle-merging and move-elision to get back to ideal form, because we have to make a copy of every src1 operand before the instruction. Moving back to this approach would lead to a significant regression in compile time and also probably code-quality issues where our merging heuristics aren't good enough.

We also simplify many other aspects of the register allocator. For example, with SSA we can now do "overlapping liveranges" (because a value is defined only once, we can reason about multiple copies more easily), which unlocks other performance ideas (already, and more in the future potentially).

For what it's worth, our approach was inspired by IonMonkey's register allocator, which also operates on strict SSA with special constraints to deal with the x86-ism of "the output reg must also be this input reg, and clobbers it". We previously thought we should do the native thing; that was regalloc.rs; we learned from it and then reverted to this approach.


Separately, I think there's a question of how to make an assembler library most useful. It could be very opinionated about what is acceptable in an instruction, or it could offer flexibility in operand formats; I suspect Cranelift is not the only potential user with special requirements around what goes into an operand, or metadata around it. To maximize its usefulness, IMHO it should offer a pluggable T: Operand and then we can do all the SSA stuff on the Cranelift side.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:31):

alexcrichton submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 21:31):

alexcrichton created PR review comment:

Ok cool mostly wanted to confirm that "must be ssa form" is still a hard constraint.

I understand the desire to have the assembler library be as simple as possible and as close to the spec as possible, but I also agree that this constraint will force our hand in doing something above-and-beyond the "Rust translation of the intel manual". I don't personally have an opinion on T: Operand vs "just inline T = CraneliftX64Operand everywhere".

view this post on Zulip RFC notifications bot (Dec 11 2024 at 23:02):

cfallin submitted PR review.

view this post on Zulip RFC notifications bot (Dec 11 2024 at 23:02):

cfallin created PR review comment:

One slight refinement to T: Operand that I want to write down here as it occurs to me: we probably want something like Inst<R: ReadOperand, W: WriteOperand> or Inst<R: ReadOperand, W: WriteOperand, M: ModOperand> so that the user can plug in different things for each: e.g. the Cranelift x64 backend wants one u32 for R and W and two u32s for M. We could also potentially integrate that with the regalloc visitor (trait Visitor has fn visit_read(&mut R), fn visit_write(&mut W), fn visit_mod(&mut M)).

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen submitted PR review:

I am in favor of the overall plan and goals, but I do have concerns, nitpicks, and bikesheds on some details.

Two things that could be nice to plan for and make sure we aren't accidentally making harder than they otherwise should be, even if we don't intend to implement their support in the DSL initially, are:

  1. Flagging which instructions are available on 32-bit x86, not just x86_64 and whatever ISA extensions, to make an eventual x86 backend that much easier and able to reuse all these instruction definitions, instead of needing to subset them.

  2. A way to attach ISLE specs to the instructions in the DSL so we can better integrate with the ISLE verification effort.

Finally, thanks for investigating and prototyping what a better mach inst looks like! I am excited to clean this stuff up.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

One thing I don't see discussed (thus far, at least) is how the DSL is parsed and how this code is generated. We need to be careful about adding compile time overhead to Cranelift (and Wasmtime, and every downstream crate using Cranelift).

For example, we should not encode the DSL in toml or yaml and require a new build-dep for parsing that markup language. Nor should we write our own parser for the DSL that has to be built and run from build.rs.

The only realistic approach that I am aware of that satisfies this constraint is a for_each_instruction!-style macro, or macros, similar to what Pulley does for its instruction set. This reuses the Rust compiler's parsing facilities to avoid spending compile/build time on DSL parsers and we can also generate only what is needed for different use cases by cfging different invocations of the macro on and off as necessary (e.g. if we only need the encoders and not the pretty printers for a particular build configuration).

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

bikeshed: If andb isn't even enough on its own, and we have to break ties by concatenating additional metadata, then can we use ~intel syntax instead of ~att syntax? Regardless of flavor any one person is used to, using intel syntax makes cross referencing with the official manual that much easier.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

I think ideally we want to move everything over towards something like whatever we end up at here. I don't think everything needs to use exactly the same machinery, but the motivation for x64 applies pretty much to every backend.

We also don't have to do the migration en masse, all at once, and before experimenting with just the x64 backend. We can do things in an incremental fashion and learn what works well and what doesn't as we go.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:12):

fitzgen created PR review comment:

I think this is also touching on the tension/impedance mismatch between "assembler library that directly reflects x86_64" and "final machine-specific backend IR for Cranelift". At the ISLE lowering rules level, I don't want to think about formats at all. I want to think about whole instructions and that's it. At the register operand collection and encoding levels, I don't care so much about whole instructions, and formats can be great shortcuts for deduplicating code.

I think we should prioritize and optimize for ISLE lowering rules. The latter things will ideally be automatically generated from the DSL describing the instructions, and so the duplication aspect doesn't affect our maintenance burden (although it admittedly could affect Cranelift's run time performance, the time it takes to build Cranelift, and the size of a Cranelift binary; nonetheless I'm not too concerned in this case).

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:19):

fitzgen submitted PR review.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:19):

fitzgen created PR review comment:

If core assembler library just contains the DSL, and the DSL isn't actually generating an instruction enum itself, then it seems like the assembler library can punt these questions to its downstream users? And then we can generate an instruction enum in Cranelift that supports the SSA stuff, and other users can generate an instruction enum (if they even need one) that does not. In Cranelift, the generated encoding functions will need to dynamically assert that the first operand and destination operand are the same register for the two-operand instructions, but we have to do that today already.

This approach avoids adding generics to the instruction enum itself. (Depending on how we do encoding functions and what not, we may need generics on those functions and we may need to implement one or more traits for the instruction enum, but the enum itself need not be generic over some T: Operand).

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:51):

cfallin submitted PR review.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:51):

cfallin created PR review comment:

If core assembler library just contains the DSL, and the DSL isn't actually generating an instruction enum itself, then it seems like the assembler library can punt these questions to its downstream users? And then we can generate an instruction enum in Cranelift that supports the SSA stuff, and other users can generate an instruction enum (if they even need one) that does not. In Cranelift, the generated encoding functions will need to dynamically assert that the first operand and destination operand are the same register for the two-operand instructions, but we have to do that today already.

The way that Andrew's prototype works at least, the instruction enum is generated by the assembler library. And this seems pretty essential to its value IMHO -- we don't want to be in the business of maintaining a separate list of instructions for each target in Cranelift.

Said another way: the underlying situation is fundamentally parameterized: Cranelift wants "an enum for x86 instructions, except with these details for operands". Knowledge of x86 should live in the assembler library (else it would be only a meta-assembler library to "build your own assembler"); and knowledge of operand details should live in the embedder. That calls out for pluggable details, i.e. generics.

Briefly unfolding that "meta-assembler" thought: I think what you might have in mind is a library with the DSL abstractions and framework for emission, but all the instruction definitions living in the embedder? That is also a design point but not one that is globally best even within our own repo IMHO: we want Winch to be able to benefit from the assembler's x86 knowledge (as it does from Cranelift's MachInsts today), so we want the assembler layer itself to be self-contained.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 16:52):

cfallin edited PR review comment.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 17:46):

alexcrichton submitted PR review.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 17:46):

alexcrichton created PR review comment:

For example, we should not encode the DSL in toml or yaml and require a new build-dep for parsing that markup language. Nor should we write our own parser for the DSL that has to be built and run from build.rs.

Personally I'd push back on this and say that whether or not this is acceptable is a function of the various other tradeoffs in play. While we should be cognizant of compile time I don't think we should take a policy of "this can't land unless it has 0 compile-time regressions" and instead balance the compile time with the maintainability win (and other motivations) in this RFC

view this post on Zulip RFC notifications bot (Dec 16 2024 at 17:50):

alexcrichton submitted PR review.

view this post on Zulip RFC notifications bot (Dec 16 2024 at 17:50):

alexcrichton created PR review comment:

@fitzgen it sounds like you're advocating for a Pulley-style approach where all instructions are part of a macro invocation so callers can customize the generated code by customizing what the macro generates, is that right? If so personally I don't think that's the right approach for platforms other than Pulley which have non-Rust definitions. There's also tradeoffs to the macro-based approach in Rust so I don't think it's a clear win either. Personally I'd find a "this is just an assembler library" approach with generics more palatable than a for_each_x86_op!(...) macro generated from Intel's definitions of instructions.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:32):

abrown submitted PR review.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:32):

abrown created PR review comment:

Yes, agreed that we want everything to easily cross-reference. If there's anything that isn't like that it's probably due to how I set up Capstone, which we end up mimic-ing. Is there something specific here you were thinking of? Operand ordering? Suffixes?

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:36):

abrown submitted PR review.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:36):

abrown created PR review comment:

Maybe it's not super clear in the text, @fitzgen, but there is no new parsing or markup language involved here. I think I would have preferred a true DSL but, based on your initial recommendation, I just use Rust code in a meta crate much like we do in cranelift-codegen-meta. So the DSL is made up of Rust helper functions that create the "AST" — no additional build dependencies required.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:56):

abrown submitted PR review.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 00:56):

abrown created PR review comment:

I'm not sure I completely understand what you mean: let's talk about "categories." x64_and is a category of instructions, not an x64 instruction. x64_andb, despite being more specific, is also a category of instructions. x64_andb_mi is an actual instruction. I interpret your comment about formats to actually echo something @cfallin and I have talked about, which is that actually in ISLE we _do_ want to use categories, e.g., x64_and. Is that what you mean?

view this post on Zulip RFC notifications bot (Dec 17 2024 at 01:16):

abrown submitted PR review.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 01:16):

abrown created PR review comment:

Yes, this is something discussed below in the "verbosity" bullet. Let me provide more detail by just focusing in on the 8-bit version of CLIF and; in cranelift-codegen, we have three possibilities for each of the operands, x and y, to this CLIF instruction. I wrote out all the possible options and thought through how we might lower them:

| size | ~x~ | ~y~ | asm             | notes                      | rule |
|------+-----+-----+-----------------+----------------------------+------|
| I8   | Gpr | Gpr | andb_rm/andb_mr |                            |    0 |
| I8   | Gpr | Mem | andb_rm         |                            |    1 |
| I8   | Gpr | Imm | andb_mi/andb_i? |                            |    2 |
| I8   | Mem | Gpr | andb_rm flipped |                            |    3 |
| I8   | Mem | Mem |                 | No, cannot encode this     |      |
| I8   | Mem | Imm | andb_mi         | No, x's memory is modified |      |
| I8   | Imm | Gpr | andb_mi flipped |                            |    4 |
| I8   | Imm | Mem | andb_mi flipped | No, x's memory is modified |      |
| I8   | Imm | Imm |                 | No, cannot encode this     |      |

Some options we just can't encode in x64. Other's we can't yet (we need to be pattern-match a read+write to a single memory location). But for the five options we _can_ encode, let's write out some ISLE rules:

(rule 0 (lower (has_type $I8 (band (ensure_in_vreg x) (ensure_in_vreg y))))
  (x64_andb_rm x y))
(rule 1 (lower (has_type $I8 (band (ensure_in_vreg x) (sinkable_load y))))
  (x64_andb_rm x y))
(rule 2 (lower (has_type $I8 (band (ensure_in_vreg x) (simm32_from_value y))))
  (x64_andb_mi x y))
(rule 3 (lower (has_type $I8 (band (sinkable_load x) (ensure_in_vreg y))))
  (x64_andb_rm y x))
(rule 4 (lower (has_type $I8 (band (simm32_from_value x) (ensure_in_vreg y))))
  (x64_andb_mi y x))

Let's pretend that ensure_in_vreg means "this is a Gpr," sinkable_load means "this is a Mem," and ~simm32_from_value~ means "this is an Imm" (to reuse some existing ISLE helpers). Turns out there are a few other patterns we can probably encode as well that additionally pattern-match on sign extension.

Two observations from this:


Hope all of that connects the dots for how one might verbosely lower to the assembler instructions. But, of course, if we want to retain the existing meta-rules like x64_and then we can probably generate those somehow as well. I'm fine with either approach.

view this post on Zulip RFC notifications bot (Dec 17 2024 at 01:16):

abrown edited PR review comment.


Last updated: Jan 24 2025 at 00:11 UTC