Stream: cranelift

Topic: x64, fcmp, and load-op merging


view this post on Zulip Alex Crichton (Apr 15 2024 at 15:18):

I added an agenda item but as @Chris Fallin mentioned no reason this can't be done async either!

I have a really old branch for using AVX instructions for ucomis* and ptest rather than today where Cranelift unconditionally uses the pre-AVX versions (even when AVX is enabled). I tried to revive that this weekend and one of the main reasons I never landed it was this comment. The x64_ucomis is an unusual instruction helper in that it takes Value arguments as opposed to registers, and it appears to be due to this comment.

I was hoping, while looking at this instruction, to improve the consistency of this helper with the rest of the backend. To that effect I added back in put_in_xmm_mem to hope I saw a failure of some kind in the test suite, but nothing arose. Keeping the put_in_xmm_mem there I tried my best to craft CLIF to generate a sunk load into the ucomis instruction to see if I could get it to be duplicated by accident. In the end though I couldn't "cause a problem" by reverting back to the "buggy code".

This comment was added in 3934 which references 2576 as well. The test case added in 3934 no longer reproduces the issue (I checked out 3934 as-is and could reproduce the issue, however).

So that leads me to the final question: does anyone know how to trigger the behavior to generate multiple cmp instructions? I'd like to add a test case to make sure I don't accidentally break anything. If we can't find a test case, does anyone know if this comment is still applicable and/or if other changes in the meantime have made it obsolete?

I'm pretty sure I asked this before but I've forgotten the answer in the meantime.
A fast and secure runtime for WebAssembly. Contribute to bytecodealliance/wasmtime development by creating an account on GitHub.
The fpcmp helper in the x64 backend uses put_in_xmm_mem for one of its operands, which allows the compiler to merge a load with the compare instruction (ucomiss or ucomisd). Unfortunately, as we sa...
On x64, the new backend generates cmp instructions at their use-sites when possible (when the icmp that generates a boolean is known) so that the condition flows directly through flags rather than ...

view this post on Zulip Chris Fallin (Apr 15 2024 at 19:50):

The short answer is that we should still be generating a cmp every time we use the flags; try e.g. two different selects from the same cond, with different data inputs so they don't merge in GVN?

view this post on Zulip Chris Fallin (Apr 15 2024 at 19:51):

The tradeoff we've made here is that we don't want to deal with "at most one flags value can be alive at once" in regalloc, combined with rematerialization to recreate flags when that's a conflict; that's a much much harder cross-cutting problem

view this post on Zulip Chris Fallin (Apr 15 2024 at 19:51):

so instead, we generate flags at use, and that's baked in pretty deeply

view this post on Zulip Chris Fallin (Apr 15 2024 at 19:51):

I don't have a chance to look at the test now (very short layover then long flight) but can try to help reproduce this next week if still stuck

view this post on Zulip Chris Fallin (Apr 15 2024 at 19:53):

FWIW, to your initial point: we could take Xmm args instead of Value args, which would still force the same thing (no load-op merging) while being more in line with other instruction ctors; maybe that would help?

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:12):

yeah taking Xmm Xmm would work well, but I was hoping to avoid that to over-specialize on this case

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:12):

e.g. if this is still necessary I'd prefer to have the comment for "don't put this in mem" right next to the place that requires it

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:13):

it feels pretty buried-at-a-distance right now

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:18):

Currently with this input clif:

function %a(i64, i64, f32, f32) -> f32 {
block0(v0: i64, v1: i64, v2: f32, v3: f32):
  v8 = load.f32 aligned notrap v0
  v9 = load.f32 aligned notrap v1
  v4 = fcmp eq v8, v9

  v5 = select v4, v2, v3
  v6 = select v4, v3, v2
  v7 = fadd v5, v6
  return v7
}

I can get:

$ cargo run -q compile foo.clif --target x86_64 -D --set opt_level=speed
.byte 85, 72, 137, 229, 102, 15, 111, 232, 243, 15, 16, 31, 243, 15, 16, 38, 15, 46, 220, 102, 15, 111, 213, 102, 15, 111, 194, 15, 139, 4, 0, 0, 0, 242, 15, 16, 193, 15, 132, 4, 0, 0, 0, 242, 15, 16, 193, 15, 46, 220, 15, 139, 4, 0, 0, 0, 242, 15, 16, 202, 15, 132, 4, 0, 0, 0, 242, 15, 16, 202, 243, 15, 88, 193, 72, 137, 236, 93, 195

Disassembly of 79 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   66 0f 6f e8             movdqa  %xmm0, %xmm5
   8:   f3 0f 10 1f             movss   (%rdi), %xmm3
   c:   f3 0f 10 26             movss   (%rsi), %xmm4
  10:   0f 2e dc                ucomiss %xmm4, %xmm3
  13:   66 0f 6f d5             movdqa  %xmm5, %xmm2
  17:   66 0f 6f c2             movdqa  %xmm2, %xmm0
  1b:   0f 8b 04 00 00 00       jnp     0x25
  21:   f2 0f 10 c1             movsd   %xmm1, %xmm0
  25:   0f 84 04 00 00 00       je      0x2f
  2b:   f2 0f 10 c1             movsd   %xmm1, %xmm0
  2f:   0f 2e dc                ucomiss %xmm4, %xmm3
  32:   0f 8b 04 00 00 00       jnp     0x3c
  38:   f2 0f 10 ca             movsd   %xmm2, %xmm1
  3c:   0f 84 04 00 00 00       je      0x46
  42:   f2 0f 10 ca             movsd   %xmm2, %xmm1
  46:   f3 0f 58 c1             addss   %xmm1, %xmm0
  4a:   48 89 ec                movq    %rbp, %rsp
  4d:   5d                      popq    %rbp
  4e:   c3                      retq

and this is _after_ I remove the put_in_xmm to let the values get naturally coerced to Xmm and XmmMem

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:19):

so I can see the ucomiss getting duplicated, but I can't somehow get sinking to happen

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:19):

when I debugged this a bit it seemed like instruction colors were differing which prevented sinking, so I'm wondering as well if the coloring part (which I don't fully understand) serves as the guard here and the comment in inst.isle isn't necessary

view this post on Zulip Chris Fallin (Apr 15 2024 at 20:35):

Ah, try with only one load

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:36):

no dice :(

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:36):

I tried v4 = fcmp eq v2, v8 and swapped the v2/v8 operands and still no load sinking

view this post on Zulip Chris Fallin (Apr 15 2024 at 20:36):

The issue with comment location is: where else would it go? The basic issue has to do with compares and there’s no other central place for that. Possibly a typesafe encoding to make this aspect of lowering completely separate

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:37):

oh that's just it, I don't know where to put this

view this post on Zulip Chris Fallin (Apr 15 2024 at 20:37):

(Boarding doors closing / phone off, more thoughts later)

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:37):

the current location feels wrong because I can't seem to trigger the bad case

view this post on Zulip Alex Crichton (Apr 15 2024 at 20:37):

np!

view this post on Zulip Chris Fallin (Apr 16 2024 at 04:01):

took another look at tracelog output; it seems the multiplicity analysis is doing the right thing here and marking v8 and v9 as Multiple (transitively) because v4 (the cmp) is used multiple times

view this post on Zulip Chris Fallin (Apr 16 2024 at 04:03):

so... maybe that handles any multi-use case actually? I'm having trouble thinking of a way to get the fcmp to codegen multiple times without actually using the value at the SSA level multiple times

view this post on Zulip Chris Fallin (Apr 16 2024 at 04:04):

coloring will serve to keep the loads "in order", which is why I had suggested trying one load earlier -- sinking one load over the other might have been problematic -- but doesn't deal with multiplicity

view this post on Zulip Chris Fallin (Apr 16 2024 at 04:07):

git blame tells me I wrote the multiplicity analysis in Apr 2022 (e4b7c8a7376, https://github.com/bytecodealliance/wasmtime/pull/4061), after the initial bugs above, so maybe we can actually remove this safeguard in inst.isle?

This PR addresses the longstanding issue with loads trying to merge into compares on x86-64, and more generally, with the lowering framework falsely recognizing "single uses" of one op by another (...

view this post on Zulip Chris Fallin (Apr 16 2024 at 04:09):

all of my 2021/2022 cranelift hacking is a blur at this point and from a good practices PoV I'd rather lean on trusting the tests and fuzzing than "no wait stop there's a subtle thing only in my brain" so... please do feel free to remove it!

view this post on Zulip Alex Crichton (Apr 16 2024 at 14:18):

ok sounds good! I'll try to collect a suite of tests while I'm at it which look like load sinking maybe should occur and I'll leave a comment in the test that if load sinking does end up happening in the future more thought needs to be applied to the various rules


Last updated: Nov 22 2024 at 16:03 UTC