Stream: git-wasmtime

Topic: wasmtime / issue #10010 ValueUseState calculation too coa...


view this post on Zulip Wasmtime GitHub notifications bot (Jan 14 2025 at 17:41):

alexcrichton opened issue #10010:

I was profiling some SIMD-using code today and noticed that the WebAssembly simd instruction v128.load32_splat wasn't generating the x64 instruction I thought it was supposed to do. This is what I'd expect but this is what I got:

function u0:0(i64, i32x4) -> i32x4 tail {
block0(v0: i64, v1: i32x4):
    v2 = iadd_imm v0, 100
    jump block2(v0, v1)

block2(v3: i64, v5: i32x4):
    v6 = load.i32 v3
    v7 = splat.i32x4 v6
    v8 = iadd v5, v7
    v11 = iadd_imm v3, 4
    v12 = icmp eq v11, v2
    brif v12, block2(v11, v8), block3

block3:
    return v8
}
$ cargo run compile ../clif/wasm_func_0.clif --target x86_64 --set has_avx2 --set has_avx -D
Disassembly of 41 bytes <u0:0>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 8d 4f 64             leaq    0x64(%rdi), %rcx
   8:   8b 17                   movl    (%rdi), %edx
   a:   c5 f9 6e e2             vmovd   %edx, %xmm4
   e:   c4 e2 79 58 f4          vpbroadcastd    %xmm4, %xmm6
  13:   c5 f9 fe c6             vpaddd  %xmm6, %xmm0, %xmm0
  17:   48 83 c7 04             addq    $4, %rdi
  1b:   48 39 cf                cmpq    %rcx, %rdi
  1e:   0f 84 e4 ff ff ff       je      8
  24:   48 89 ec                movq    %rbp, %rsp
  27:   5d                      popq    %rbp
  28:   c3                      retq

Notably at offset e the it's not using vbroadcastss, but instead a combo of movl, vmovd, and vpbroadcastd. I don't have a great way of measuring the relative performance of one vs the other as my hope was to convince Cranelift to generate one or the other to measure.

The reason that this lowering rule isn't firing is due to:

...
 TRACE cranelift_codegen::machinst::lower      > arg v6 used, old state Unused, new Once
...
 TRACE cranelift_codegen::machinst::lower      > arg v8 used, old state Once, new Multiple
...
 TRACE cranelift_codegen::machinst::lower      >  -> DFS reaches v6
 TRACE cranelift_codegen::machinst::lower      >  -> became Multiple
...
 TRACE cranelift_codegen::machinst::lower      > lowering: inst inst3: v7 = splat.i32x4 v6
 TRACE cranelift_codegen::machinst::lower      > get_input_for_val: val v6 at cur_inst Some(inst3) cur_scan_entry_color Some(InstColor(4))
 TRACE cranelift_codegen::machinst::lower      >  -> src inst inst2
 TRACE cranelift_codegen::machinst::lower      >  -> has lowering side effect: true
 TRACE cranelift_codegen::machinst::lower      >  -> side-effecting op inst2 for val v6: use state Multiple
 TRACE cranelift_codegen::machinst::lower      > put_value_in_regs: val v6

(or at least I think this is why sinkable_load isn't firing)

I know I've run up against compute_use_states in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...

view this post on Zulip Wasmtime GitHub notifications bot (Jan 14 2025 at 18:19):

cfallin commented on issue #10010:

So the issue here is that v8 is indeed used multiple times, and multiplicity is necessarily transitive, because we can't know how deep the rule-matching will go; so we can't allow the load to sink.

Imagine the following (very smart, but possible) instruction selection rules:

This particular example is a bit far-fetched, but it's important that we don't encode even more subtle knowledge about the possible instruction selection combinations into the core algorithms; they have to assume that rule-matching can go arbitrarily deep. We cannot allow the load to sink twice to two different locations, so the multiplicity analysis (correctly) propagates this multiple-use status backward to the load and prevents it from sinking.

I know I've run up against compute_use_states in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...

Believe it or not this was the simplest abstraction I could come up with at the time (I know, I'm not totally happy either...) It certainly errs on the side of safety but it did prevent real bugs we had / kept creating at the time. I know this last came up here and the explanation I gave there is as from-first-principles as I could make it; suggestions welcome for ways to explain it better and/or better algorithms that preserve all the safety properties we want!

view this post on Zulip Wasmtime GitHub notifications bot (Jan 14 2025 at 18:23):

cfallin commented on issue #10010:

I'll note as well that we could relax multiplicity if we "cut" it at certain points by pinning values into registers -- e.g., constrain that v8 must be computed into a register, then allow it to be computed once (and avoid multiplicity flowing backward). However placing those cut-points is a provably NP-hard combinatorial optimization problem, which doesn't play nicely with our general ethos of single-pass algorithms (or at worst fast fixpoint analyses before our main lowering pass), so we haven't really pursued it further...

view this post on Zulip Wasmtime GitHub notifications bot (Jan 16 2025 at 21:39):

alexcrichton closed issue #10010:

I was profiling some SIMD-using code today and noticed that the WebAssembly simd instruction v128.load32_splat wasn't generating the x64 instruction I thought it was supposed to do. This is what I'd expect but this is what I got:

function u0:0(i64, i32x4) -> i32x4 tail {
block0(v0: i64, v1: i32x4):
    v2 = iadd_imm v0, 100
    jump block2(v0, v1)

block2(v3: i64, v5: i32x4):
    v6 = load.i32 v3
    v7 = splat.i32x4 v6
    v8 = iadd v5, v7
    v11 = iadd_imm v3, 4
    v12 = icmp eq v11, v2
    brif v12, block2(v11, v8), block3

block3:
    return v8
}
$ cargo run compile ../clif/wasm_func_0.clif --target x86_64 --set has_avx2 --set has_avx -D
Disassembly of 41 bytes <u0:0>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 8d 4f 64             leaq    0x64(%rdi), %rcx
   8:   8b 17                   movl    (%rdi), %edx
   a:   c5 f9 6e e2             vmovd   %edx, %xmm4
   e:   c4 e2 79 58 f4          vpbroadcastd    %xmm4, %xmm6
  13:   c5 f9 fe c6             vpaddd  %xmm6, %xmm0, %xmm0
  17:   48 83 c7 04             addq    $4, %rdi
  1b:   48 39 cf                cmpq    %rcx, %rdi
  1e:   0f 84 e4 ff ff ff       je      8
  24:   48 89 ec                movq    %rbp, %rsp
  27:   5d                      popq    %rbp
  28:   c3                      retq

Notably at offset e the it's not using vbroadcastss, but instead a combo of movl, vmovd, and vpbroadcastd. I don't have a great way of measuring the relative performance of one vs the other as my hope was to convince Cranelift to generate one or the other to measure.

The reason that this lowering rule isn't firing is due to:

...
 TRACE cranelift_codegen::machinst::lower      > arg v6 used, old state Unused, new Once
...
 TRACE cranelift_codegen::machinst::lower      > arg v8 used, old state Once, new Multiple
...
 TRACE cranelift_codegen::machinst::lower      >  -> DFS reaches v6
 TRACE cranelift_codegen::machinst::lower      >  -> became Multiple
...
 TRACE cranelift_codegen::machinst::lower      > lowering: inst inst3: v7 = splat.i32x4 v6
 TRACE cranelift_codegen::machinst::lower      > get_input_for_val: val v6 at cur_inst Some(inst3) cur_scan_entry_color Some(InstColor(4))
 TRACE cranelift_codegen::machinst::lower      >  -> src inst inst2
 TRACE cranelift_codegen::machinst::lower      >  -> has lowering side effect: true
 TRACE cranelift_codegen::machinst::lower      >  -> side-effecting op inst2 for val v6: use state Multiple
 TRACE cranelift_codegen::machinst::lower      > put_value_in_regs: val v6

(or at least I think this is why sinkable_load isn't firing)

I know I've run up against compute_use_states in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...

view this post on Zulip Wasmtime GitHub notifications bot (Jan 16 2025 at 21:39):

alexcrichton commented on issue #10010:

Oh dear I had a comment mostly typed up in a tab somewhere but seem to have lost it...

In any case this makes more sense now! I was too narrowly focused on the loop that I completely missed the use of v8 in the later return. This has helped me form a better mental model of what's going on here as well with your examples of hypothetical mega-CISC instructions too, thanks!

Regardless I'd agree that it sounds like we probably just can't fix this for now in Cranelift at this time. I'm still not sure how much of a problem that really is. Visually "less code better yay" but I haven't done the legwork to measure one or the other. This is a downside to wasm instructions that fuse a load-and-thing (or thing-then-store) but there aren't all that many of those at this time (and they already have to be exploded on arm64 for example).

Also sorry I didn't mean to say that this was overly complicated or at fault or anything. I was halfway through writing up something initially tracing through what I thought was going on but ended up abandoning it because I was getting lost (and missing the use of v8 later on)

For now I'm going to close this as I don't think there's anything we can do in the near-term and whether or not solving this problem long-term is unclear at this time if it's worth it. I'll defer such a future conclusion to later.


Last updated: Jan 24 2025 at 00:11 UTC