I added an agenda item but as @Chris Fallin mentioned no reason this can't be done async either!
I have a really old branch that uses AVX instructions for ucomis* and ptest, rather than today's behavior where Cranelift unconditionally uses the pre-AVX versions (even when AVX is enabled). I tried to revive it this weekend, and one of the main reasons I never landed it was this comment. The x64_ucomis helper is unusual in that it takes Value arguments as opposed to registers, and that appears to be due to this comment.
I was hoping, while looking at this instruction, to improve the consistency of this helper with the rest of the backend. To that end I added put_in_xmm_mem back in, hoping to see a failure of some kind in the test suite, but nothing arose. Keeping the put_in_xmm_mem there, I tried my best to craft CLIF that would generate a sunk load into the ucomis instruction to see if I could get it duplicated by accident. In the end, though, I couldn't "cause a problem" by reverting back to the "buggy code".
This comment was added in 3934 which references 2576 as well. The test case added in 3934 no longer reproduces the issue (I checked out 3934 as-is and could reproduce the issue, however).
So that leads me to the final question: does anyone know how to trigger the behavior to generate multiple cmp instructions? I'd like to add a test case to make sure I don't accidentally break anything. If we can't find a test case, does anyone know if this comment is still applicable and/or if other changes in the meantime have made it obsolete?
The short answer is that we should still be generating a cmp every time we use the flags; try e.g. two different selects from the same cond, with different data inputs so they don't merge in GVN?
The tradeoff we've made here is that we don't want to deal with "at most one flags value can be alive at once" in regalloc, combined with rematerialization to recreate flags when that's a conflict; that's a much much harder cross-cutting problem
so instead, we generate flags at use, and that's baked in pretty deeply
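to make that concrete, here's a toy model of "generate flags at use" (just an illustration, not the actual Cranelift lowering code; the IR shape and helper names here are made up): every consumer of a compare re-emits the compare itself, so flags never have to be kept alive as a value across other instructions

// Toy model, not Cranelift: flags never exist as a standalone value, so each
// consumer of a compare result re-emits the compare at its own use site.
enum Ir {
    Fcmp { lhs: &'static str, rhs: &'static str },
    // `cond` is the index of the Fcmp this select consumes.
    Select { cond: usize, if_true: &'static str, if_false: &'static str },
}

fn lower(block: &[Ir]) -> Vec<String> {
    let mut out = Vec::new();
    for inst in block {
        if let Ir::Select { cond, if_true, if_false } = inst {
            // Re-emit the compare for *this* consumer rather than trying to
            // keep the flags from an earlier compare alive across other code.
            if let Ir::Fcmp { lhs, rhs } = &block[*cond] {
                out.push(format!("ucomiss {rhs}, {lhs}"));
            }
            out.push(format!("select {if_true}, {if_false} (jnp/je + moves)"));
        }
    }
    out
}

fn main() {
    // One fcmp feeding two selects, the shape suggested above: each select
    // gets its own ucomiss in the lowered code.
    let block = [
        Ir::Fcmp { lhs: "xmm3", rhs: "xmm4" },
        Ir::Select { cond: 0, if_true: "xmm0", if_false: "xmm1" },
        Ir::Select { cond: 0, if_true: "xmm1", if_false: "xmm0" },
    ];
    for line in lower(&block) {
        println!("{line}");
    }
}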
I don't have a chance to look at the test now (very short layover then long flight) but can try to help reproduce this next week if still stuck
FWIW, to your initial point: we could take Xmm args instead of Value args, which would still force the same thing (no load-op merging) while being more in line with other instruction ctors; maybe that would help?
yeah, taking Xmm Xmm would work well, but I was hoping to avoid that since it over-specializes on this case
e.g. if this is still necessary I'd prefer to have the comment for "don't put this in mem" right next to the place that requires it
it feels pretty buried-at-a-distance right now
Currently with this input clif:
function %a(i64, i64, f32, f32) -> f32 {
block0(v0: i64, v1: i64, v2: f32, v3: f32):
    v8 = load.f32 aligned notrap v0
    v9 = load.f32 aligned notrap v1
    v4 = fcmp eq v8, v9
    v5 = select v4, v2, v3
    v6 = select v4, v3, v2
    v7 = fadd v5, v6
    return v7
}
I can get:
$ cargo run -q compile foo.clif --target x86_64 -D --set opt_level=speed
.byte 85, 72, 137, 229, 102, 15, 111, 232, 243, 15, 16, 31, 243, 15, 16, 38, 15, 46, 220, 102, 15, 111, 213, 102, 15, 111, 194, 15, 139, 4, 0, 0, 0, 242, 15, 16, 193, 15, 132, 4, 0, 0, 0, 242, 15, 16, 193, 15, 46, 220, 15, 139, 4, 0, 0, 0, 242, 15, 16, 202, 15, 132, 4, 0, 0, 0, 242, 15, 16, 202, 243, 15, 88, 193, 72, 137, 236, 93, 195
Disassembly of 79 bytes:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 66 0f 6f e8 movdqa %xmm0, %xmm5
8: f3 0f 10 1f movss (%rdi), %xmm3
c: f3 0f 10 26 movss (%rsi), %xmm4
10: 0f 2e dc ucomiss %xmm4, %xmm3
13: 66 0f 6f d5 movdqa %xmm5, %xmm2
17: 66 0f 6f c2 movdqa %xmm2, %xmm0
1b: 0f 8b 04 00 00 00 jnp 0x25
21: f2 0f 10 c1 movsd %xmm1, %xmm0
25: 0f 84 04 00 00 00 je 0x2f
2b: f2 0f 10 c1 movsd %xmm1, %xmm0
2f: 0f 2e dc ucomiss %xmm4, %xmm3
32: 0f 8b 04 00 00 00 jnp 0x3c
38: f2 0f 10 ca movsd %xmm2, %xmm1
3c: 0f 84 04 00 00 00 je 0x46
42: f2 0f 10 ca movsd %xmm2, %xmm1
46: f3 0f 58 c1 addss %xmm1, %xmm0
4a: 48 89 ec movq %rbp, %rsp
4d: 5d popq %rbp
4e: c3 retq
and this is _after_ I remove the put_in_xmm to let the values get naturally coerced to Xmm and XmmMem, so I can see the ucomiss getting duplicated, but I somehow can't get sinking to happen
when I debugged this a bit it seemed like instruction colors were differing, which prevented sinking, so I'm wondering as well if the coloring part (which I don't fully understand) serves as the guard here and the comment in inst.isle isn't necessary
Ah, try with only one load
no dice :(
I tried v4 = fcmp eq v2, v8 and swapped the v2/v8 operands, and still no load sinking
The issue with the comment location is: where else would it go? The basic issue has to do with compares, and there's no other central place for that. Possibly a typesafe encoding could make this aspect of lowering completely separate
oh that's just it, I don't know where to put this
(Boarding doors closing / phone off, more thoughts later)
the current location feels wrong because I can't seem to trigger the bad case
np!
took another look at tracelog output; it seems the multiplicity analysis is doing the right thing here and marking v8 and v9 as Multiple (transitively), because v4 (the cmp) is used multiple times
so... maybe that handles any multi-use case actually? I'm having trouble thinking of a way to get the fcmp to codegen multiple times without actually using the value at the SSA level multiple times
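for anyone following along, the rough idea of that analysis (a toy sketch of the concept, not the actual Cranelift code; the instruction representation here is made up) is a transitive "used more than once" taint, and a load that ends up tainted never becomes a sinkable operand:

// Toy model of the multiplicity idea, not the real analysis: a value is
// Multiple if it has more than one use, or if any transitive user is
// Multiple. A load marked Multiple must not be sunk into its user, since the
// user's lowering could be emitted more than once and duplicate the load.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Multiplicity {
    Once,
    Multiple,
}

// A toy instruction: a result name plus the names of its operands.
struct Inst {
    result: &'static str,
    operands: Vec<&'static str>,
}

fn analyze(insts: &[Inst]) -> HashMap<&'static str, Multiplicity> {
    // Count direct uses of every value.
    let mut uses: HashMap<&str, u32> = HashMap::new();
    for inst in insts {
        for op in &inst.operands {
            *uses.entry(*op).or_insert(0) += 1;
        }
    }
    let mut out = HashMap::new();
    // Walk backwards so each instruction's own multiplicity is known before
    // we propagate it into its operands.
    for inst in insts.iter().rev() {
        let mult = if uses.get(inst.result).copied().unwrap_or(0) > 1
            || out.get(inst.result) == Some(&Multiplicity::Multiple)
        {
            Multiplicity::Multiple
        } else {
            Multiplicity::Once
        };
        out.insert(inst.result, mult);
        if mult == Multiplicity::Multiple {
            // Transitively taint the operands: if the cmp is lowered twice,
            // a load merged into it would execute twice as well.
            for op in &inst.operands {
                out.insert(*op, Multiplicity::Multiple);
            }
        }
    }
    out
}

fn main() {
    // Mirrors the clif above: v8/v9 are loads feeding v4 (the fcmp), and v4
    // feeds two selects.
    let insts = vec![
        Inst { result: "v8", operands: vec!["v0"] },
        Inst { result: "v9", operands: vec!["v1"] },
        Inst { result: "v4", operands: vec!["v8", "v9"] },
        Inst { result: "v5", operands: vec!["v4", "v2", "v3"] },
        Inst { result: "v6", operands: vec!["v4", "v3", "v2"] },
        Inst { result: "v7", operands: vec!["v5", "v6"] },
    ];
    let mult = analyze(&insts);
    // v4 is used twice, so v8 and v9 become Multiple and stay as real loads.
    println!("v4: {:?}, v8: {:?}, v9: {:?}", mult["v4"], mult["v8"], mult["v9"]);
}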
coloring will serve to keep the loads "in order", which is why I had suggested trying one load earlier -- sinking one load over the other might have been problematic -- but doesn't deal with multiplicity
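a toy version of that coloring idea, as described above (again only a sketch of the concept, not the real implementation): bump a "color" at every memory-touching instruction, remember each load's color, and only let a load sink into a use site that still has the same color, so a load can never be moved past another memory operation

// Toy model of the coloring guard, not Cranelift's actual mechanism.
#[derive(Clone, Copy, PartialEq)]
enum Op {
    Load,  // touches memory
    Other, // pure computation
}

// Assign a color to each instruction: memory ops bump the color, so two
// instructions share a color only if no memory op sits between them.
fn colors(block: &[Op]) -> Vec<u32> {
    let mut cur = 0u32;
    let mut out = Vec::with_capacity(block.len());
    for op in block {
        if *op == Op::Load {
            cur += 1;
        }
        out.push(cur);
    }
    out
}

fn main() {
    // load (v8), load (v9), fcmp: the fcmp shares a color only with the
    // second load, so only that load could be sunk into it; sinking the
    // first load would move it past the second one.
    let block = [Op::Load, Op::Load, Op::Other];
    let c = colors(&block);
    for (i, color) in c.iter().enumerate() {
        println!("inst {i}: color {color}");
    }
    println!("load0 sinkable into fcmp: {}", c[0] == c[2]);
    println!("load1 sinkable into fcmp: {}", c[1] == c[2]);
}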
git blame tells me I wrote the multiplicity analysis in Apr 2022 (e4b7c8a7376, https://github.com/bytecodealliance/wasmtime/pull/4061), after the initial bugs above, so maybe we can actually remove this safeguard in inst.isle?
all of my 2021/2022 cranelift hacking is a blur at this point and from a good practices PoV I'd rather lean on trusting the tests and fuzzing than "no wait stop there's a subtle thing only in my brain" so... please do feel free to remove it!
ok, sounds good! While I'm at it I'll try to collect a suite of tests that look like load sinking should occur, and I'll leave a comment in the tests noting that if load sinking does end up happening in the future, more thought needs to be applied to the various rules