@Chris Fallin, @Alex Crichton: this morning I paged in more information related to the question Alex asked about alignment and load coalescing. The original patch where I disabled load-coalescing for SIMD operations is this one -- i.e., if an instruction uses vectors and the load is not aligned, forget about load-coalescing.
I think Alex's question was whether this was always the case in x64 and I vaguely mumbled something about classes of instructions.
ah yes that's what I was remembering, thanks for digging that up!
The official answer is this:
From the Intel manual, section 14.9: "Most arithmetic and data processing instructions encoded using the VEX prefix and performing memory accesses have more flexible memory alignment requirements than instructions that are encoded without the VEX prefix. Specifically,
• With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded, arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load operation by default. Memory arguments for most instructions with VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that require explicit memory alignment requirements are listed in Table 14-22."
I think the bottom line is that if we switched to using AVX* encodings (i.e. VEX) for 128-bit SIMD we could re-allow the load coalescing regardless of the alignment of the memory operand
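Concretely, the rule we'd be relying on looks roughly like this (hypothetical helper, not the actual lowering code):
```rust
// Hypothetical helper, not the actual Cranelift lowering code: when is it
// safe to fold ("coalesce") a load into a SIMD instruction's memory operand?
fn can_fold_simd_load(uses_vex_encoding: bool, load_is_16_byte_aligned: bool) -> bool {
    if uses_vex_encoding {
        // Per Intel SDM 14.9, VEX-encoded arithmetic/data-processing
        // instructions accept memory operands at any byte alignment (only the
        // explicitly aligned moves in Table 14-22 still fault), so folding is
        // always fine.
        true
    } else {
        // Legacy SSE memory operands raise #GP(0) unless 16-byte aligned, so
        // we can only fold loads that are known to be aligned.
        load_is_16_byte_aligned
    }
}
```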
that's really good to know! It seems like we'd want to do this for any significant "serious SIMD kernel" -- fusing the load is probably a measurable perf improvement in tight loops?
the downside is compatibility iiuc -- AVX needs Haswell or up, right? (My Ivy Bridge workstation in my closet will want a fallback, as I think we should strive to support any amd64 chip in the limit, but it feels like a reasonable thing to optimize for the common case here and focus on better codegen for AVX-onward)
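(the gate itself is cheap -- at runtime it's basically a CPUID probe; a minimal sketch, where prefer_vex_encodings is a name I made up and the real plumbing would go through the ISA flag settings:)
```rust
// Made-up wrapper name for illustration; x86_64-only sketch. The check is
// just a runtime CPUID probe via std's feature detection.
fn prefer_vex_encodings() -> bool {
    std::is_x86_feature_detected!("avx")
}
```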
Hm, so does this mean we should disable that patch and try to actually deny-list instructions that otherwise require 16-byte alignment? We presumably have enough fuzzing now that we can be somewhat confident if we let the fuzzers run for a while
under the presumption that most VEX things don't generate a fault for unaligned addresses
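fwiw, my reading of Table 14-22 is that the deny-list would basically just be the aligned-move and non-temporal-move forms -- hedged sketch, in case I'm misremembering the table:
```rust
// Hedged sketch of the deny-list, assuming my reading of SDM Table 14-22 is
// right: the only VEX-encoded instructions with explicit 16/32-byte alignment
// requirements are the aligned moves and the non-temporal moves; everything
// else tolerates any alignment.
fn vex_requires_aligned_memory(mnemonic: &str) -> bool {
    matches!(
        mnemonic,
        "vmovaps" | "vmovapd" | "vmovdqa" | "vmovntps" | "vmovntpd" | "vmovntdq" | "vmovntdqa"
    )
}
```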
in an ISLE world, I think we can just switch RegMem operands to Reg for instructions where it isn't safe to fuse the load
for the instruction constructors (in inst.isle)
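roughly this shape (sketched in Rust rather than ISLE, and very simplified -- not the actual backend code):
```rust
// Very simplified model of the idea; the names echo the x64 backend but this
// is not the actual Cranelift code.
enum RegMem {
    Reg { reg: u8 },   // operand is already in a register
    Mem { addr: i64 }, // operand is a memory address the instruction could load from
}

// Constructor where fusing the load is safe: accept RegMem, so a load can be
// folded straight into the instruction's memory operand.
fn fusable_op(_src: RegMem) { /* emit instruction with reg-or-mem operand */ }

// Constructor where it is NOT safe to fuse: accept only a register, forcing
// the caller to emit a separate (alignment-tolerant) load first.
fn non_fusable_op(_src_reg: u8) { /* emit instruction with register operand */ }
```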
@Alex Crichton yeah, I think it makes sense to do this transition; it can be done piecemeal (we're free to mix VEX-encoded and old-style-SSE-encoded instructions in one instruction stream, and I don't think there's a perf penalty for that?)
another cool thing about VEX encodings is that there are three-operand forms (op dest, src1, src2), which plays nicer with regalloc (both old -- no more move insertion -- and new -- fewer constraints)
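e.g. (illustrative inline-asm sketch of the two shapes, not anything we'd generate verbatim; x86_64 only):
```rust
// The destructive two-operand SSE form needs a copy when both sources stay
// live; the three-operand VEX form writes a separate destination, so no move
// insertion is required.
use std::arch::asm;
use std::arch::x86_64::__m128;

unsafe fn add_sse(a: __m128, b: __m128) -> __m128 {
    let dst: __m128;
    asm!(
        "movaps {dst}, {a}", // extra move: `addps` overwrites its destination
        "addps  {dst}, {b}",
        a = in(xmm_reg) a,
        b = in(xmm_reg) b,
        dst = out(xmm_reg) dst,
    );
    dst
}

unsafe fn add_vex(a: __m128, b: __m128) -> __m128 {
    let dst: __m128;
    asm!(
        "vaddps {dst}, {a}, {b}", // non-destructive: no move insertion needed
        a = in(xmm_reg) a,
        b = in(xmm_reg) b,
        dst = out(xmm_reg) dst,
    );
    dst
}
```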
My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.
ah, that's good to know; so it sounds like we'd want to go all-in (only VEX encodings touching xmm/ymm regs)
Yeah. It looks like Intel docs still have the recommendations about using vzeroupper between VEX instructions and SSE instructions. So best not to intermix them.
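i.e. the guidance amounts to something like this at the end of an AVX region (sketch; the function name is made up):
```rust
// Sketch of that guidance. After 256-bit AVX work has dirtied the upper ymm
// halves, `vzeroupper` clears them so that following legacy-SSE code doesn't
// hit the AVX/SSE transition penalty on older microarchitectures.
#[target_feature(enable = "avx")]
unsafe fn end_of_avx_region() {
    // ... 256-bit AVX work would happen here ...
    unsafe { std::arch::x86_64::_mm256_zeroupper() }; // emits `vzeroupper`
    // legacy-SSE code can now follow without the transition stalls
}
```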
@Dan Gohman you probably mean EVEX?
oh, nm
Oh, maybe. Intel's docs call it "AVX code". I don't have all the fine distinctions paged in :-}
Source: 11.3.1 in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
which edition?
oh, sorry... linked
ok, yeah, looks like something to look at some more
and yeah, it is in fact talking about VEX when they say AVX there
looks like the link above is an old version of the doc, but the new version as of June 2021 still has the guidance
(for reference, it is section 15.3.1 in the newest edition)
it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue
It specifically talks about "256-bit Intel AVX", and I'm not 100% sure offhand whether that includes the re-encoded SSE instructions.
yeah
like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems
yeah :-)
Andrew Brown said:
like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems
we should add more than just blake3-simd for SIMD things to sightglass :-p
yup
Dan Gohman said:
My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.
Certain VEX operations could be an issue on some older AVX512-enabled CPUs, where the AVX512 throttling was also applied to 256-bit operations. This has been fixed in Ice Lake, though the story for older CPUs is still somewhat complicated: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking
Andrew Brown said:
it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue
I don't think 128-bit instructions should be an issue in newer CPUs and some/most of the older ones, especially the ones without AVX512. Throttling is based on whether the upper portions of the registers are set ("soft" trigger mentioned in the link above).