Stream: cranelift

Topic: x64 SIMD alignment


Andrew Brown (Nov 16 2021 at 00:26):

@Chris Fallin, @Alex Crichton: this morning I paged in more information related to the question Alex asked about alignment and load coalescing. The original patch where I disabled load coalescing for SIMD operations is this one; i.e., if an instruction uses vectors and the load is not aligned, forget about load coalescing.

Fixes #2943, though not as optimally as may be desired. With x64 SIMD instructions, the memory operand must be aligned--this change adds that check. There are cases, however, where we can do better...
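As a rough illustration of the check described in that patch summary, here is a minimal Rust sketch. It is not the actual Cranelift code: `LoadInfo` and `can_coalesce_sse` are hypothetical names, and the model assumes legacy SSE encodings where a 16-byte memory operand must be 16-byte aligned.

```rust
/// Hypothetical description of a load feeding another instruction.
/// Not a Cranelift type; it only carries what this sketch needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LoadInfo {
    /// Size of the loaded value in bytes (16 for a 128-bit vector).
    pub bytes: u32,
    /// True when the compiler can prove the address is aligned to `bytes`.
    pub known_aligned: bool,
}

/// Decide whether the load may be merged into the memory operand of the
/// consuming instruction, assuming legacy (non-VEX) SSE encodings.
pub fn can_coalesce_sse(load: LoadInfo) -> bool {
    let is_vector = load.bytes >= 16;
    // Scalar loads carry no alignment requirement; vector loads are only
    // merged when alignment is known, otherwise they stay separate loads.
    !is_vector || load.known_aligned
}
```

With this model, an unaligned 16-byte load is kept as a separate (unaligned-tolerant) load instruction, which is the conservative behavior the patch opts for.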

Andrew Brown (Nov 16 2021 at 00:26):

I think Alex's question was whether this was always the case in x64 and I vaguely mumbled something about classes of instructions.

Alex Crichton (Nov 16 2021 at 00:26):

ah yes that's what I was remembering, thanks for digging that up!

Andrew Brown (Nov 16 2021 at 00:27):

The official answer is this:

Andrew Brown (Nov 16 2021 at 00:29):

From the Intel manual, section 14.9: "Most arithmetic and data processing instructions encoded using the VEX prefix and performing memory accesses have more flexible memory alignment requirements than instructions that are encoded without the VEX prefix. Specifically,
• With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded, arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load operation by default. Memory arguments for most instructions with VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that require explicit memory alignment requirements are listed in Table 14-22."

Andrew Brown (Nov 16 2021 at 00:29):

image.png

Andrew Brown (Nov 16 2021 at 00:31):

I think the bottom line is that if we switched to using AVX* encodings (i.e. VEX) for 128-bit SIMD we could re-allow the load coalescing regardless of the alignment of the memory operand
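The rule quoted from the manual can be sketched as a small decision function. This is only a model under stated assumptions: `Encoding` and `can_fuse_load` are hypothetical names, the sketch conservatively treats every legacy SSE 128-bit memory operand as requiring alignment, and `requires_explicit_alignment` stands in for "this instruction appears in the manual's table of explicitly aligned instructions".

```rust
/// Hypothetical tag for how a 128-bit SIMD instruction is encoded.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Encoding {
    /// Legacy SSE encoding (no VEX prefix).
    LegacySse,
    /// VEX-prefixed (AVX) encoding.
    Vex,
}

/// Model of when a 128-bit load may be fused into an instruction's
/// memory operand without risking a #GP(0) fault.
pub fn can_fuse_load(
    enc: Encoding,
    requires_explicit_alignment: bool,
    known_aligned: bool,
) -> bool {
    match enc {
        // Assumption: treat all legacy SSE vector memory operands as
        // alignment-checked, so fusion needs a proven-aligned address.
        Encoding::LegacySse => known_aligned,
        // VEX: unaligned is fine unless the instruction is one of the
        // explicitly aligned load/store forms.
        Encoding::Vex => !requires_explicit_alignment || known_aligned,
    }
}
```

Under this model, switching an ordinary arithmetic instruction from `LegacySse` to `Vex` turns an unfusable unaligned load into a fusable one.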

Chris Fallin (Nov 16 2021 at 00:57):

that's really good to know! It seems like we'd want to do this for any significant "serious SIMD kernel" -- fusing the load is probably a measurable perf improvement in tight loops?

Chris Fallin (Nov 16 2021 at 00:59):

the downside is compatibility iiuc -- AVX needs Haswell or up, right? (My Ivy Bridge workstation in my closet will want a fallback, as I think we should strive to support any amd64 chip in the limit, but it feels like a reasonable thing to optimize for the common case here and focus on better codegen for AVX-onward)
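A fallback of the kind described here is usually keyed off the CPU feature bit rather than a microarchitecture name. The sketch below uses the standard library's `is_x86_feature_detected!` macro; `SimdEncoding`, `choose_encoding`, and `host_has_avx` are hypothetical names for illustration, not Cranelift's actual flag plumbing.

```rust
/// Hypothetical choice between the two encoding families.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdEncoding {
    LegacySse,
    Vex,
}

/// Prefer VEX when AVX is available; otherwise fall back to legacy SSE.
pub fn choose_encoding(has_avx: bool) -> SimdEncoding {
    if has_avx {
        SimdEncoding::Vex
    } else {
        SimdEncoding::LegacySse
    }
}

/// Query the host CPU at runtime (x86 and x86-64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn host_has_avx() -> bool {
    std::is_x86_feature_detected!("avx")
}

/// On other architectures there is no AVX to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn host_has_avx() -> bool {
    false
}
```

Calling `choose_encoding(host_has_avx())` picks the encoding for a JIT targeting the host; an ahead-of-time compiler would take the flag from its target settings instead.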

Alex Crichton (Nov 16 2021 at 15:22):

Hm so does this mean we should disable that patch and try to actually deny-list instructions that otherwise require 16-byte alignment? We presumably have enough fuzzing now that we can be somewhat confident if we let the fuzzers run for a while

Alex Crichton (Nov 16 2021 at 15:22):

under the presumption that most VEX things don't generate a fault for unaligned addresses

fitzgen (he/him) (Nov 16 2021 at 17:03):

in an ISLE world, I think we can just switch RegMem operands to Reg for instructions where it isn't safe to fuse the load

fitzgen (he/him) (Nov 16 2021 at 17:03):

for the instruction constructors (in inst.isle)
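The idea of narrowing an operand from "register or memory" to "register only" can be modeled in plain Rust. This is a loose sketch, not ISLE and not the real `inst.isle` constructors: the `RegMem` and `Inst` types below are simplified stand-ins, and `emit_op`, `load_fusion_safe`, and `tmp` are hypothetical. In ISLE the restriction would live in the constructor's type signature rather than in a runtime flag.

```rust
/// Simplified stand-in for a register-or-memory operand.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RegMem {
    Reg(u8),
    Mem { base: u8, offset: i32 },
}

/// Simplified pseudo-instructions emitted by the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Inst {
    /// A standalone load that tolerates unaligned addresses.
    LoadUnaligned { dst: u8, base: u8, offset: i32 },
    /// The SIMD operation itself.
    Op { dst: u8, src: RegMem },
}

/// Emit `Op`, forcing a memory source into the temporary register `tmp`
/// first when fusing the load is not safe for this instruction.
pub fn emit_op(dst: u8, src: RegMem, load_fusion_safe: bool, tmp: u8) -> Vec<Inst> {
    match src {
        RegMem::Mem { base, offset } if !load_fusion_safe => vec![
            Inst::LoadUnaligned { dst: tmp, base, offset },
            Inst::Op { dst, src: RegMem::Reg(tmp) },
        ],
        // Register sources, and memory sources that are safe to fuse,
        // are passed through unchanged.
        other => vec![Inst::Op { dst, src: other }],
    }
}
```

Instructions that are safe to fuse keep their memory operand; the rest always see a register, so no alignment fault can come from the operation itself.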

Chris Fallin (Nov 16 2021 at 17:16):

@Alex Crichton yeah, I think it makes sense to do this transition; it can be done piecemeal (we're free to mix VEX-encoded and old-style-SSE-encoded instructions in one instruction stream, and I don't think there's a perf penalty for that?)

Chris Fallin (Nov 16 2021 at 17:18):

another cool thing about VEX encodings is that there are three-operand forms (op dest, src1, src2), which plays nicer with regalloc (both old -- no more move insertion -- and new -- fewer constraints)
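The move-insertion point can be seen in a toy lowering of `dst = a op b`. This sketch is illustrative only: `lower_two_operand` and `lower_three_operand` are hypothetical helpers that emit assembly text, and the two-operand version assumes `dst` is not the same register as `b` (otherwise the copy would clobber `b`).

```rust
/// Lower `dst = a op b` to a two-operand SSE form, where the first
/// operand is both a source and the destination.
/// Assumption: `dst` and `b` are different registers.
pub fn lower_two_operand(op: &str, dst: &str, a: &str, b: &str) -> Vec<String> {
    let mut out = Vec::new();
    if dst != a {
        // The destination must already hold `a`, so copy it first.
        out.push(format!("movdqa {}, {}", dst, a));
    }
    out.push(format!("{} {}, {}", op, dst, b));
    out
}

/// Lower the same operation to the three-operand VEX form, which names
/// the destination separately and needs no copy.
pub fn lower_three_operand(op: &str, dst: &str, a: &str, b: &str) -> Vec<String> {
    vec![format!("v{} {}, {}, {}", op, dst, a, b)]
}
```

For example, `lower_two_operand("paddd", "xmm2", "xmm0", "xmm1")` yields a `movdqa` followed by `paddd`, while `lower_three_operand` with the same arguments yields the single instruction `vpaddd xmm2, xmm0, xmm1`.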

Dan Gohman (Nov 16 2021 at 17:25):

My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.

Chris Fallin (Nov 16 2021 at 17:28):

ah, that's good to know; so it sounds like we'd want to go all-in (only VEX encodings touching xmm/ymm regs)

Dan Gohman (Nov 16 2021 at 17:44):

Yeah. It looks like Intel docs still have the recommendations about using vzeroupper between VEX instructions and SSE instructions. So best not to intermix them.

Andrew Brown (Nov 16 2021 at 17:44):

@Dan Gohman you probably mean EVEX?

Andrew Brown (Nov 16 2021 at 17:45):

oh, nm

Dan Gohman (Nov 16 2021 at 17:45):

Oh, maybe. Intel's docs call it "AVX code". I don't have all the fine distinctions in my head paged in :-}

Dan Gohman (Nov 16 2021 at 17:46):

Source: 11.3.1 in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Andrew Brown (Nov 16 2021 at 17:46):

which edition?

Andrew Brown (Nov 16 2021 at 17:47):

oh, sorry... linked

Andrew Brown (Nov 16 2021 at 17:48):

ok, yeah, looks like something to look at some more

Andrew Brown (Nov 16 2021 at 17:49):

and yeah, it is in fact talking about VEX when they say AVX there

Dan Gohman (Nov 16 2021 at 17:49):

looks like the link above is an old version of the doc, but the new version as of June 2021 still has the guidance

Dan Gohman (Nov 16 2021 at 17:50):

https://cdrdv2.intel.com/v1/dl/getContent/671488?explicitVersion=true&wapkw=intel%2064%20and%20ia-32%20architectures%20optimization%20reference%20manual

Andrew Brown (Nov 16 2021 at 17:53):

(for reference, it is section 15.3.1 in the newest edition)

Andrew Brown (Nov 16 2021 at 17:53):

it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue

Dan Gohman (Nov 16 2021 at 17:53):

It specifically talks about "256-bit Intel AVX", and I'm not 100% sure offhand whether that includes the re-encoded SSE instructions.

Dan Gohman (Nov 16 2021 at 17:54):

yeah

Andrew Brown (Nov 16 2021 at 17:54):

like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems

Dan Gohman (Nov 16 2021 at 17:55):

yeah :-)

fitzgen (he/him) (Nov 16 2021 at 18:00):

Andrew Brown said:

like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems

we should add more than just blake3-simd for SIMD things to sightglass :-p

Andrew Brown (Nov 16 2021 at 18:40):

yup

Petr Penzin (Jan 04 2022 at 18:45):

Dan Gohman said:

My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.

Certain VEX operations could be an issue on some older AVX512-enabled CPUs, when the AVX512 throttling was applied to 256-bit operations. This has been fixed in Ice Lake, though the story for older CPUs is still somewhat complicated: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking

Andrew Brown said:

it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue

I don't think 128-bit instructions should be an issue in newer CPUs and some/most of the older ones, especially the ones without AVX512. Throttling is based on whether the upper portions of the registers are set ("soft" trigger mentioned in the link above).


Last updated: Jan 24 2025 at 00:11 UTC