abrown opened PR #10871 from abrown:asm-scalar-moves to bytecodealliance:main:
This change represents a first part toward changing all scalar moves in x64 to use the new assembler. Recall that when moving sub-64-bit values, upper bits of a GPR remain unchanged, so Cranelift chooses to use zero-extending moves for those operations. This change does not touch that: for 64-bit moves, Cranelift _does_ use
movqand this PR only touches that usage. Why?movqis used widely throughout Cranelift and many filetests will change, so it seems safer to change moves part by part.
abrown requested cfallin for a review on PR #10871.
abrown requested wasmtime-compiler-reviewers for a review on PR #10871.
abrown submitted PR review.
abrown created PR review comment:
@cfallin: I expect 3 PCC tests to fail due to this. I tried to look at how I might re-write this based on the assembler but at this point we only have assembler types to deal with. What to do?
abrown submitted PR review.
abrown created PR review comment:
Technically this PR only makes use of this instruction but its seems valuable to add the other variants; I'm surprised we don't use the
MIformat anywhere?
abrown submitted PR review.
abrown created PR review comment:
Here's another Capstone-related weirdness: the assembler emits all immediates the same but this special logic "fixes" things up so we can match against Capstone. I will be very thankful when we can switch to a more regular disassembler.
cfallin submitted PR review:
This looks good overall -- thanks for the continued migration push!
cfallin created PR review comment:
Pre-existing but I'm just noticing it now -- whether in this PR or separately, could we update the printing format for these synthetic amodes so that we don't have "x+-y" to mean "x-y"? (Slightly more complex in the formatting code but hopefully not too bad)
cfallin created PR review comment:
It's fine to remove the newly failing tests for now, I think -- PCC shouldn't get in the way of progress here (I'm happy to update it later when I next get a timeslice for it). Feel free to remove the test directives in the files and leave a note instead "disabled until PCC is migrated to understand new assembler instructions" or similar...
cfallin submitted PR review.
cfallin created PR review comment:
I know we do read-modify-write (load; add; store) in our lowering rules but I too am surprised we don't do store-of-immediate; seems worth adding?
cfallin submitted PR review.
cfallin created PR review comment:
(Not necessarily in this PR to be clear, but in general)
cfallin submitted PR review.
cfallin created PR review comment:
Actually, on second look -- does the
rw(rm8)here mean that we have a read/write (modify) operand? This is somewhat surprising to me, per our Zulip thread -- to use this properly with narrow values we'd have to have an "undefined def" pseudoinst prior to it which would add a lot of bloat/compile-time slowdown otherwise.
abrown submitted PR review.
abrown created PR review comment:
Yeah, exactly: my plan was was to keep an eye out for this as more rules get converted.
abrown updated PR #10871.
abrown submitted PR review.
abrown created PR review comment:
Yeah, this is not great. I've been mulling over what to do here and I also think
<offset:n>isn't the most clear; perhaps we want to refactor this offset table logic to allow naming the underlying offset. I'll add this to the TODO list.
abrown submitted PR review.
abrown created PR review comment:
This bothers me too; we may need some better modeling here. On the
xmmside, we've been usingrwfor operations that only partially write the XMM register. E.g., there are f32 operations that only touch the bottom 32 bits and we need to indicate that those upper bits remain unchanged. Modeling this withrwworks well when it comes to Cranelift because the ISLE rules already pass in the XMM register we are modifying, making the dependencies make sense to regalloc. But, though convenient, this is not technically correct at the assembler level, we are not truly _reading_ the bits that get changed; we're only writing them, just like we do here withmovb. It feels like we need some modeling here at the assembler level: we want to definemovbasw(rm8)but then somehow still use aReadWriteGprwhen interacting with regalloc to keep the data dependencies sane, right?
abrown created PR review comment:
Perhaps the rule is simple for GPR: every time we see
w(r*{8|16|32})defined in the assembler, we actually emit code using the rightReadWriteGprtype.For XMM it might be more complex:
w(xmm)does not tell us that only some of the bits are modified so we may need to examine the other operands?
abrown submitted PR review.
abrown created PR review comment:
Another idea: we add a
partialbit which gets set automatically forw(r*{8|16|32})but set explicitly for XMM, e.g.,w(partial(xmm)). Then when we decide what type to use for an operandop, we would emit theReadWrite*type whenop.is_read_write() || op.is_partial().
abrown submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
It's possible I'm not fully grokking the proposed rules here, but the end state needs to be: writes to GPRs, even narrow writes, must be only
w, notrw. This is a reflection of the semantic gap between CLIF / VCode data types and the machine: when we store ani8(say)-typed vreg in a particular register, it really is only 8 bits wide, and the rest of the machine register is unused. As long as uses and defs of vregs are properly typed, the undefined bits never affect anything.(This is different than the Xmm case, as you note too -- a lane insertion is operating on a 128-bit input and producing a 128-bit output, so we don't have the same "narrow type in wide register" gap.)
We almost certainly don't want a full type system at the assembler layer, but in the absence of that, we do need to represent this gap somehow. There's a bit of tension with the "general assembler library" use-case, where the latter may actually want to represent that the writes are partial, etc. Perhaps we have a mode bit somewhere indicating that partial GPR writes are to be treated as full writes (a
high_gpr_bits_are_undefined()method on the toplevel Regs trait so we can plug in a constanttrueimpl from Cranelift, maybe?). Or there's always the option that we adopt the Cranelift convention and, w.r.t. regalloc metadata, assume that the embedder is only doing things in a properly typed way, so narrow writes leave undefined bits rather than previously-present bits...
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I'm a bit surprised because I thought
rw(rm8)is basically almost exactly what we want here? It matches the raw instruction because it's a single operand, and it matches Cranelift because the actual operand isapi::Registers::ReadWriteGprwhich is modeled as twoVRegs in Cranelift (which during emisson is required to be the same register). Additionally for generic assembler purposes there's nothing saying thatReadWriteGprhas to be different thanWriteGpr, so I'd imagine a generic usage of this crate would have the same type plugged in there.Otherwise I'm curious where your concern about compile-time comes from @cfallin? It seems like we're not actually using
movbanywhere unless we're storing to memory, and register moves generated by regalloc always otherwise usemovqto overwrite the entire destination.
Orthogonally though one possible change we may want to consider: should this instruction be split into two forms? If the destination is memory that's technically not a read/write operand, it's a pure write in this case. That doesn't actually get surfaced anywhere right now but with PCC integration in the future it might? Basically the read/write-ness is a property of the register-to-register encoding, not the register-to-memory encoding.
For the
movssinstruction for example I modeled this where memory-to-xmm is a write of the destination since it zeros upper bits, but xmm-to-xmm is a read/write of the destination as it only overwrites the lower bits. Should something similar be done here where the register-to-memory version ofmovbusesw(m8)while the register-to-register version usesrw(r8)?
cfallin submitted PR review.
cfallin created PR review comment:
Well, it comes down to expected uses. In Cranelift lowerings we make use of narrow-width instructions to perform operations that logically write a new value into a register, and we don't care about the upper bits. It seems we don't use
movbthis way today, but in general this is exposing a footgun and a semantic gap: we aren't careful about distinguishing between "I'm using this instruction that leaves upper bits untouched but I don't care, and it doesn't matter because the value is logically narrow so upper bits are undef" and "I'm using this instruction to insert bits in top of a pre-existing value". We wantWritefor the former andReadWritefor the latter.Otherwise I'm curious where your concern about compile-time comes from @cfallin?
I had thought we were using
movbdirectly for reg-to-reg moves but it seems not. My concern in general comes from the way that one converts arwtow-with-don't-care-bits, by using anUninitializedValuepseudo-op to def the vreg that comes "into" therw. I was worried we'd have to start inserting a bunch of those to retain regalloc correctness.
I think the most reasonable solution may be: define the
r/rw/w's "scope" to be only the width of the instruction. That squares away with how mem ops actually do only load/store the 8/16/32 bits as appropriate. It also matches Cranelift's invariant/expectation already: we store narrow values in the appropriate bits of a register, and don't say anything about the rest. Concretely, that would mean we have onlywhere, because the destination width is8. (pinsrbcould still have anrwoperand because its destination width is128-- so it reads and writes 128 bits with defined results.) If we really wanted to usemovbas an "insert lower byte in register" operation we could also define a variant with a width of64and anrwop. We should then define the width of the write as part of the metadata somehow. What do you think?
cfallin created PR review comment:
We should then define the width of the write as part of the metadata somehow
or rather, I suppose that's just the
8inrm8. Sow(rm8)is correct in this interpretation, and means "we're writing 8 bits" -- we don't read the 8 bits first, so it's notrw(rm8).
cfallin submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I like that, it makes a lot of sense to me :+1:
I particularly like the idea of "ok let's model how some bits are preserved" is a special-case variant of an instruction we may expose in some circumstances but aren't necessarily required to.
abrown submitted PR review.
abrown created PR review comment:
Yeah, after talking about this with @cfallin, I think the right thing to at the assembler level is to use
w, i.e., " define the r/rw/w's "scope" to be only the width of the instruction." What I realized I was trying to do is avoid footguns up at the ISLE level, but we (1) are already relying on Cranelift to use the right widths and (2) we can be safer another way (e.g., we _know_ aw(r8)is actually a partial write so we could still generate aReadWriteGprparameter for that operand... or atype PartialWriteGpr = ReadWriteGpr?). I'll update this PR with allwfor now and follow it up with a different PR that we can throw darts at.
abrown updated PR #10871.
abrown merged PR #10871.
Last updated: Dec 06 2025 at 06:05 UTC