Stream: git-wasmtime

Topic: wasmtime / issue #9942 Extraneous `mov` in Pulley bytecod...


view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2025 at 19:45):

alexcrichton opened issue #9942:

Given this input CLIF:

function u0:0(i32) -> i32 tail {
    sig0 = (i32) -> i32 tail
    fn0 = colocated u0:0 sig0

block0(v2: i32):
    v19 = call fn0(v2)
    v20 = iadd v2, v19
    return v20
}

I'm currently seeing on main

$ cargo run --features pulley compile --target x86_64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 157, 16, 0, 0, 0, 0, 0, 32, 0, 64, 21, 0, 2, 21, 0, 0, 0, 0, 64, 5, 21, 69, 160, 0, 158, 16, 0, 0, 0, 0, 0, 32, 0, 0

Disassembly of 34 bytes <u0:0>:
       0: 9d 10 00 00 00 00 00 20 00         push_frame_save 16, x21
       9: 40 15 00                           xmov x21, x0
       c: 02 15 00 00 00 00                  call1 x21, 0x0    // target = 0xc
      12: 40 05 15                           xmov x5, x21
      15: 45 a0 00                           xadd32 x0, x5, x0
      18: 9e 10 00 00 00 00 00 20 00         pop_frame_restore 16, x21
      21: 00                                 ret

Notably instruction at 12, xmov x5, x21, I'm not sure why that exists. In theory that should not be there and the subsequent xadd32 should be xadd32 x0, x21, x0 and that would remove the need for the insertion of the xmov. This xmov is being inserted as part of register allocation as logging shows that the input program is:

VCode {
  Entry block: 0
Block 0([]):
    (original IR block: block0)
  Inst 0: args v192=x0
  Inst 1: call CallInfo { dest: PulleyCall { name: User(userextname0), args: [XReg(v192)] }, uses: [], defs: [CallRetPair { vreg: Writable { reg: v196 }, preg: p0i }], clobbers: PRegSet { bits: [65534, 65535, 4294967295, 0] }, callee_conv: Tail, caller_conv: Tail, callee_pop_size: 0 }
  Inst 2: xadd32 v195, v192, v196
  Inst 3: rets v195=x0
}

On aarch64 I see:

$ cargo run -q --features pulley compile --target aarch64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 253, 123, 191, 169, 253, 3, 0, 145, 248, 15, 31, 248, 248, 3, 2, 170, 0, 0, 0, 148, 2, 3, 2, 11, 248, 7, 65, 248, 253, 123, 193, 168, 192, 3, 95, 214

Disassembly of 36 bytes <u0:0>:
   0:   fd 7b bf a9             stp     x29, x30, [sp, #-0x10]!
   4:   fd 03 00 91             mov     x29, sp
   8:   f8 0f 1f f8             str     x24, [sp, #-0x10]!
   c:   f8 03 02 aa             mov     x24, x2
  10:   00 00 00 94             bl      #0x10
  14:   02 03 02 0b             add     w2, w24, w2
  18:   f8 07 41 f8             ldr     x24, [sp], #0x10
  1c:   fd 7b c1 a8             ldp     x29, x30, [sp], #0x10
  20:   c0 03 5f d6             ret

which does the right thing after the call instruction at 10 and the add instruction at 14 doesn't have a mov in front. Additionally for x64 it also looks "correct":

$ cargo run -q --features pulley compile --target x86_64 -D ../clif/wasm_func_0.clif --set opt_level=speed
.byte 85, 72, 137, 229, 72, 131, 236, 16, 76, 137, 60, 36, 73, 137, 255, 232, 0, 0, 0, 0, 68, 1, 248, 76, 139, 60, 36, 72, 131, 196, 16, 72, 137, 236, 93, 195

Disassembly of 36 bytes <u0:0>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 10             subq    $0x10, %rsp
   8:   4c 89 3c 24             movq    %r15, (%rsp)
   c:   49 89 ff                movq    %rdi, %r15
   f:   e8 00 00 00 00          callq   0x14
  14:   44 01 f8                addl    %r15d, %eax
  17:   4c 8b 3c 24             movq    (%rsp), %r15
  1b:   48 83 c4 10             addq    $0x10, %rsp
  1f:   48 89 ec                movq    %rbp, %rsp
  22:   5d                      popq    %rbp
  23:   c3                      retq

Notably the addl has no preceding mov instruction.

Is this the Pulley backend perhaps missing something with respect to regalloc metadata?

The generated get_operands method for the instruction here is:

        RawInst::Xadd32 { dst, src1,src2, .. } => {
            collector.reg_use(src1);
collector.reg_use(src2);

            collector.reg_def(dst);

        }

which is in theory pretty similar to how x64/aarch64 both work. I'm not sure if this is falling off a regalloc heuristic or a meaningful difference between x64/aarch64 though.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 13 2025 at 17:46):

alexcrichton added the pulley label to Issue #9942.


Last updated: Jan 24 2025 at 00:11 UTC