Stream: git-wasmtime

Topic: wasmtime / PR #10871 x64: convert `movq`


view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:21):

abrown opened PR #10871 from abrown:asm-scalar-moves to bytecodealliance:main:

This change represents a first part toward changing all scalar moves in x64 to use the new assembler. Recall that when moving sub-64-bit values, upper bits of a GPR remain unchanged, so Cranelift chooses to use zero-extending moves for those operations. This change does not touch that: for 64-bit moves, Cranelift _does_ use movq and this PR only touches that usage. Why? movq is used widely throughout Cranelift and many filetests will change, so it seems safer to change moves part by part.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:21):

abrown requested cfallin for a review on PR #10871.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:21):

abrown requested wasmtime-compiler-reviewers for a review on PR #10871.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:22):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:22):

abrown created PR review comment:

@cfallin: I expect 3 PCC tests to fail due to this. I tried to look at how I might re-write this based on the assembler but at this point we only have assembler types to deal with. What to do?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:23):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:23):

abrown created PR review comment:

Technically this PR only makes use of this instruction but its seems valuable to add the other variants; I'm surprised we don't use the MI format anywhere?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:26):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:26):

abrown created PR review comment:

Here's another Capstone-related weirdness: the assembler emits all immediates the same but this special logic "fixes" things up so we can match against Capstone. I will be very thankful when we can switch to a more regular disassembler.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:30):

cfallin submitted PR review:

This looks good overall -- thanks for the continued migration push!

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:30):

cfallin created PR review comment:

Pre-existing but I'm just noticing it now -- whether in this PR or separately, could we update the printing format for these synthetic amodes so that we don't have "x+-y" to mean "x-y"? (Slightly more complex in the formatting code but hopefully not too bad)

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:30):

cfallin created PR review comment:

It's fine to remove the newly failing tests for now, I think -- PCC shouldn't get in the way of progress here (I'm happy to update it later when I next get a timeslice for it). Feel free to remove the test directives in the files and leave a note instead "disabled until PCC is migrated to understand new assembler instructions" or similar...

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:31):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:31):

cfallin created PR review comment:

I know we do read-modify-write (load; add; store) in our lowering rules but I too am surprised we don't do store-of-immediate; seems worth adding?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:32):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:32):

cfallin created PR review comment:

(Not necessarily in this PR to be clear, but in general)

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:35):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 16:35):

cfallin created PR review comment:

Actually, on second look -- does the rw(rm8) here mean that we have a read/write (modify) operand? This is somewhat surprising to me, per our Zulip thread -- to use this properly with narrow values we'd have to have an "undefined def" pseudoinst prior to it which would add a lot of bloat/compile-time slowdown otherwise.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:02):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:02):

abrown created PR review comment:

Yeah, exactly: my plan was was to keep an eye out for this as more rules get converted.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:06):

abrown updated PR #10871.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:07):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:07):

abrown created PR review comment:

Yeah, this is not great. I've been mulling over what to do here and I also think <offset:n> isn't the most clear; perhaps we want to refactor this offset table logic to allow naming the underlying offset. I'll add this to the TODO list.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:22):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:22):

abrown created PR review comment:

This bothers me too; we may need some better modeling here. On the xmm side, we've been using rw for operations that only partially write the XMM register. E.g., there are f32 operations that only touch the bottom 32 bits and we need to indicate that those upper bits remain unchanged. Modeling this with rw works well when it comes to Cranelift because the ISLE rules already pass in the XMM register we are modifying, making the dependencies make sense to regalloc. But, though convenient, this is not technically correct at the assembler level, we are not truly _reading_ the bits that get changed; we're only writing them, just like we do here with movb. It feels like we need some modeling here at the assembler level: we want to define movb as w(rm8) but then somehow still use a ReadWriteGpr when interacting with regalloc to keep the data dependencies sane, right?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:25):

abrown created PR review comment:

Perhaps the rule is simple for GPR: every time we see w(r*{8|16|32}) defined in the assembler, we actually emit code using the right ReadWriteGpr type.

For XMM it might be more complex: w(xmm) does not tell us that only some of the bits are modified so we may need to examine the other operands?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:25):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:33):

abrown created PR review comment:

Another idea: we add a partial bit which gets set automatically for w(r*{8|16|32}) but set explicitly for XMM, e.g., w(partial(xmm)). Then when we decide what type to use for an operand op, we would emit the ReadWrite* type when op.is_read_write() || op.is_partial().

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 18:33):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 19:13):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 19:13):

cfallin created PR review comment:

It's possible I'm not fully grokking the proposed rules here, but the end state needs to be: writes to GPRs, even narrow writes, must be only w, not rw. This is a reflection of the semantic gap between CLIF / VCode data types and the machine: when we store an i8 (say)-typed vreg in a particular register, it really is only 8 bits wide, and the rest of the machine register is unused. As long as uses and defs of vregs are properly typed, the undefined bits never affect anything.

(This is different than the Xmm case, as you note too -- a lane insertion is operating on a 128-bit input and producing a 128-bit output, so we don't have the same "narrow type in wide register" gap.)

We almost certainly don't want a full type system at the assembler layer, but in the absence of that, we do need to represent this gap somehow. There's a bit of tension with the "general assembler library" use-case, where the latter may actually want to represent that the writes are partial, etc. Perhaps we have a mode bit somewhere indicating that partial GPR writes are to be treated as full writes (a high_gpr_bits_are_undefined() method on the toplevel Regs trait so we can plug in a constant true impl from Cranelift, maybe?). Or there's always the option that we adopt the Cranelift convention and, w.r.t. regalloc metadata, assume that the embedder is only doing things in a properly typed way, so narrow writes leave undefined bits rather than previously-present bits...

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 21:33):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 21:33):

alexcrichton created PR review comment:

I'm a bit surprised because I thought rw(rm8) is basically almost exactly what we want here? It matches the raw instruction because it's a single operand, and it matches Cranelift because the actual operand is api::Registers::ReadWriteGpr which is modeled as two VRegs in Cranelift (which during emisson is required to be the same register). Additionally for generic assembler purposes there's nothing saying that ReadWriteGpr has to be different than WriteGpr, so I'd imagine a generic usage of this crate would have the same type plugged in there.

Otherwise I'm curious where your concern about compile-time comes from @cfallin? It seems like we're not actually using movb anywhere unless we're storing to memory, and register moves generated by regalloc always otherwise use movq to overwrite the entire destination.


Orthogonally though one possible change we may want to consider: should this instruction be split into two forms? If the destination is memory that's technically not a read/write operand, it's a pure write in this case. That doesn't actually get surfaced anywhere right now but with PCC integration in the future it might? Basically the read/write-ness is a property of the register-to-register encoding, not the register-to-memory encoding.

For the movss instruction for example I modeled this where memory-to-xmm is a write of the destination since it zeros upper bits, but xmm-to-xmm is a read/write of the destination as it only overwrites the lower bits. Should something similar be done here where the register-to-memory version of movb uses w(m8) while the register-to-register version uses rw(r8)?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 21:59):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 21:59):

cfallin created PR review comment:

Well, it comes down to expected uses. In Cranelift lowerings we make use of narrow-width instructions to perform operations that logically write a new value into a register, and we don't care about the upper bits. It seems we don't use movb this way today, but in general this is exposing a footgun and a semantic gap: we aren't careful about distinguishing between "I'm using this instruction that leaves upper bits untouched but I don't care, and it doesn't matter because the value is logically narrow so upper bits are undef" and "I'm using this instruction to insert bits in top of a pre-existing value". We want Write for the former and ReadWrite for the latter.

Otherwise I'm curious where your concern about compile-time comes from @cfallin?

I had thought we were using movb directly for reg-to-reg moves but it seems not. My concern in general comes from the way that one converts a rw to w-with-don't-care-bits, by using an UninitializedValue pseudo-op to def the vreg that comes "into" the rw. I was worried we'd have to start inserting a bunch of those to retain regalloc correctness.


I think the most reasonable solution may be: define the r/rw/w's "scope" to be only the width of the instruction. That squares away with how mem ops actually do only load/store the 8/16/32 bits as appropriate. It also matches Cranelift's invariant/expectation already: we store narrow values in the appropriate bits of a register, and don't say anything about the rest. Concretely, that would mean we have only w here, because the destination width is 8. (pinsrb could still have an rw operand because its destination width is 128 -- so it reads and writes 128 bits with defined results.) If we really wanted to use movb as an "insert lower byte in register" operation we could also define a variant with a width of 64 and an rw op. We should then define the width of the write as part of the metadata somehow. What do you think?

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 22:00):

cfallin created PR review comment:

We should then define the width of the write as part of the metadata somehow

or rather, I suppose that's just the 8 in rm8. So w(rm8) is correct in this interpretation, and means "we're writing 8 bits" -- we don't read the 8 bits first, so it's not rw(rm8).

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 22:00):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 22:12):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 22:12):

alexcrichton created PR review comment:

I like that, it makes a lot of sense to me :+1:

I particularly like the idea of "ok let's model how some bits are preserved" is a special-case variant of an instruction we may expose in some circumstances but aren't necessarily required to.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 23:09):

abrown submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 23:09):

abrown created PR review comment:

Yeah, after talking about this with @cfallin, I think the right thing to at the assembler level is to use w, i.e., " define the r/rw/w's "scope" to be only the width of the instruction." What I realized I was trying to do is avoid footguns up at the ISLE level, but we (1) are already relying on Cranelift to use the right widths and (2) we can be safer another way (e.g., we _know_ a w(r8) is actually a partial write so we could still generate a ReadWriteGpr parameter for that operand... or a type PartialWriteGpr = ReadWriteGpr?). I'll update this PR with all w for now and follow it up with a different PR that we can throw darts at.

view this post on Zulip Wasmtime GitHub notifications bot (May 30 2025 at 23:12):

abrown updated PR #10871.

view this post on Zulip Wasmtime GitHub notifications bot (May 31 2025 at 00:05):

abrown merged PR #10871.


Last updated: Dec 06 2025 at 06:05 UTC