alexcrichton opened issue #10010:
I was profiling some SIMD-using code today and noticed that the WebAssembly simd instruction
v128.load32_splat
wasn't generating the x64 instruction I thought it was supposed to do. This is what I'd expect but this is what I got:function u0:0(i64, i32x4) -> i32x4 tail { block0(v0: i64, v1: i32x4): v2 = iadd_imm v0, 100 jump block2(v0, v1) block2(v3: i64, v5: i32x4): v6 = load.i32 v3 v7 = splat.i32x4 v6 v8 = iadd v5, v7 v11 = iadd_imm v3, 4 v12 = icmp eq v11, v2 brif v12, block2(v11, v8), block3 block3: return v8 }
$ cargo run compile ../clif/wasm_func_0.clif --target x86_64 --set has_avx2 --set has_avx -D Disassembly of 41 bytes <u0:0>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 8d 4f 64 leaq 0x64(%rdi), %rcx 8: 8b 17 movl (%rdi), %edx a: c5 f9 6e e2 vmovd %edx, %xmm4 e: c4 e2 79 58 f4 vpbroadcastd %xmm4, %xmm6 13: c5 f9 fe c6 vpaddd %xmm6, %xmm0, %xmm0 17: 48 83 c7 04 addq $4, %rdi 1b: 48 39 cf cmpq %rcx, %rdi 1e: 0f 84 e4 ff ff ff je 8 24: 48 89 ec movq %rbp, %rsp 27: 5d popq %rbp 28: c3 retq
Notably at offset
e
the it's not usingvbroadcastss
, but instead a combo ofmovl
,vmovd
, andvpbroadcastd
. I don't have a great way of measuring the relative performance of one vs the other as my hope was to convince Cranelift to generate one or the other to measure.The reason that this lowering rule isn't firing is due to:
... TRACE cranelift_codegen::machinst::lower > arg v6 used, old state Unused, new Once ... TRACE cranelift_codegen::machinst::lower > arg v8 used, old state Once, new Multiple ... TRACE cranelift_codegen::machinst::lower > -> DFS reaches v6 TRACE cranelift_codegen::machinst::lower > -> became Multiple ... TRACE cranelift_codegen::machinst::lower > lowering: inst inst3: v7 = splat.i32x4 v6 TRACE cranelift_codegen::machinst::lower > get_input_for_val: val v6 at cur_inst Some(inst3) cur_scan_entry_color Some(InstColor(4)) TRACE cranelift_codegen::machinst::lower > -> src inst inst2 TRACE cranelift_codegen::machinst::lower > -> has lowering side effect: true TRACE cranelift_codegen::machinst::lower > -> side-effecting op inst2 for val v6: use state Multiple TRACE cranelift_codegen::machinst::lower > put_value_in_regs: val v6
(or at least I think this is why
sinkable_load
isn't firing)I know I've run up against
compute_use_states
in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...
cfallin commented on issue #10010:
So the issue here is that
v8
is indeed used multiple times, and multiplicity is necessarily transitive, because we can't know how deep the rule-matching will go; so we can't allow the load to sink.Imagine the following (very smart, but possible) instruction selection rules:
- A branch argument (
v8
in the args toblock2
at thebrif
) has a special mega-CISC instruction that merges an add, a splat, and a load all together with the branch. We thus do the load side-effect here at the branch, if we permit the load to sink.- A return-value sequence has a special mega-CISC instruction that merges an add, a splat and a load all together, puts the result in the return-value register, and returns. We thus do the load side-effect here at the
return
op, if we permit the load to sink.This particular example is a bit far-fetched, but it's important that we don't encode even more subtle knowledge about the possible instruction selection combinations into the core algorithms; they have to assume that rule-matching can go arbitrarily deep. We cannot allow the load to sink twice to two different locations, so the multiplicity analysis (correctly) propagates this multiple-use status backward to the load and prevents it from sinking.
I know I've run up against compute_use_states in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...
Believe it or not this was the simplest abstraction I could come up with at the time (I know, I'm not totally happy either...) It certainly errs on the side of safety but it did prevent real bugs we had / kept creating at the time. I know this last came up here and the explanation I gave there is as from-first-principles as I could make it; suggestions welcome for ways to explain it better and/or better algorithms that preserve all the safety properties we want!
cfallin commented on issue #10010:
I'll note as well that we could relax multiplicity if we "cut" it at certain points by pinning values into registers -- e.g., constrain that
v8
must be computed into a register, then allow it to be computed once (and avoid multiplicity flowing backward). However placing those cut-points is a provably NP-hard combinatorial optimization problem, which doesn't play nicely with our general ethos of single-pass algorithms (or at worst fast fixpoint analyses before our main lowering pass), so we haven't really pursued it further...
alexcrichton closed issue #10010:
I was profiling some SIMD-using code today and noticed that the WebAssembly simd instruction
v128.load32_splat
wasn't generating the x64 instruction I thought it was supposed to do. This is what I'd expect but this is what I got:function u0:0(i64, i32x4) -> i32x4 tail { block0(v0: i64, v1: i32x4): v2 = iadd_imm v0, 100 jump block2(v0, v1) block2(v3: i64, v5: i32x4): v6 = load.i32 v3 v7 = splat.i32x4 v6 v8 = iadd v5, v7 v11 = iadd_imm v3, 4 v12 = icmp eq v11, v2 brif v12, block2(v11, v8), block3 block3: return v8 }
$ cargo run compile ../clif/wasm_func_0.clif --target x86_64 --set has_avx2 --set has_avx -D Disassembly of 41 bytes <u0:0>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 8d 4f 64 leaq 0x64(%rdi), %rcx 8: 8b 17 movl (%rdi), %edx a: c5 f9 6e e2 vmovd %edx, %xmm4 e: c4 e2 79 58 f4 vpbroadcastd %xmm4, %xmm6 13: c5 f9 fe c6 vpaddd %xmm6, %xmm0, %xmm0 17: 48 83 c7 04 addq $4, %rdi 1b: 48 39 cf cmpq %rcx, %rdi 1e: 0f 84 e4 ff ff ff je 8 24: 48 89 ec movq %rbp, %rsp 27: 5d popq %rbp 28: c3 retq
Notably at offset
e
the it's not usingvbroadcastss
, but instead a combo ofmovl
,vmovd
, andvpbroadcastd
. I don't have a great way of measuring the relative performance of one vs the other as my hope was to convince Cranelift to generate one or the other to measure.The reason that this lowering rule isn't firing is due to:
... TRACE cranelift_codegen::machinst::lower > arg v6 used, old state Unused, new Once ... TRACE cranelift_codegen::machinst::lower > arg v8 used, old state Once, new Multiple ... TRACE cranelift_codegen::machinst::lower > -> DFS reaches v6 TRACE cranelift_codegen::machinst::lower > -> became Multiple ... TRACE cranelift_codegen::machinst::lower > lowering: inst inst3: v7 = splat.i32x4 v6 TRACE cranelift_codegen::machinst::lower > get_input_for_val: val v6 at cur_inst Some(inst3) cur_scan_entry_color Some(InstColor(4)) TRACE cranelift_codegen::machinst::lower > -> src inst inst2 TRACE cranelift_codegen::machinst::lower > -> has lowering side effect: true TRACE cranelift_codegen::machinst::lower > -> side-effecting op inst2 for val v6: use state Multiple TRACE cranelift_codegen::machinst::lower > put_value_in_regs: val v6
(or at least I think this is why
sinkable_load
isn't firing)I know I've run up against
compute_use_states
in the past and it's subtle enough that I can't ever seem to keep it in my head. I can't seem to wrap my head around this time either...
alexcrichton commented on issue #10010:
Oh dear I had a comment mostly typed up in a tab somewhere but seem to have lost it...
In any case this makes more sense now! I was too narrowly focused on the loop that I completely missed the use of
v8
in the laterreturn
. This has helped me form a better mental model of what's going on here as well with your examples of hypothetical mega-CISC instructions too, thanks!Regardless I'd agree that it sounds like we probably just can't fix this for now in Cranelift at this time. I'm still not sure how much of a problem that really is. Visually "less code better yay" but I haven't done the legwork to measure one or the other. This is a downside to wasm instructions that fuse a load-and-thing (or thing-then-store) but there aren't all that many of those at this time (and they already have to be exploded on arm64 for example).
Also sorry I didn't mean to say that this was overly complicated or at fault or anything. I was halfway through writing up something initially tracing through what I thought was going on but ended up abandoning it because I was getting lost (and missing the use of
v8
later on)For now I'm going to close this as I don't think there's anything we can do in the near-term and whether or not solving this problem long-term is unclear at this time if it's worth it. I'll defer such a future conclusion to later.
Last updated: Jan 24 2025 at 00:11 UTC