@Chris Fallin, @Alex Crichton: this morning I paged in more information related to the question Alex asked about alignment and load coalescing. The original patch where I disabled load-coalescing for SIMD operations is this one; i.e., if an instruction uses vectors and the load is not aligned, forget about load-coalescing.
I think Alex's question was whether this was always the case in x64 and I vaguely mumbled something about classes of instructions.
ah yes that's what I was remembering, thanks for digging that up!
The official answer is this:
From the Intel manual, section 14.9: "Most arithmetic and data processing instructions encoded using the VEX prefix and performing memory accesses have more flexible memory alignment requirements than instructions that are encoded without the VEX prefix. Specifically,
• With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded, arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load operation by default. Memory arguments for most instructions with VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that require explicit memory alignment requirements are listed in Table 14-22."
I think the bottom line is that if we switched to using AVX* encodings (i.e. VEX) for 128-bit SIMD we could re-allow the load coalescing regardless of the alignment of the memory operand
that's really good to know! It seems like we'd want to do this for any significant "serious SIMD kernel" -- fusing the load is probably a measurable perf improvement in tight loops?
the downside is compatibility iiuc -- AVX needs Haswell or up, right? (My Ivy Bridge workstation in my closet will want a fallback, as I think we should strive to support any amd64 chip in the limit, but it feels like a reasonable thing to optimize for the common case here and focus on better codegen for AVX-onward)
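A minimal sketch of the fallback idea, assuming the choice between VEX and legacy SSE encodings is made from a CPU feature check. `SimdEncoding`, `choose_encoding`, and `host_has_avx` are hypothetical names for illustration, not Cranelift APIs; only `is_x86_feature_detected!` is a real standard-library macro. A real backend would more likely consult its own ISA flags.

```rust
/// Which encoding family a hypothetical backend would emit for 128-bit SIMD.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdEncoding {
    /// Legacy SSE encodings: the baseline fallback for chips without AVX.
    LegacySse,
    /// VEX (AVX) encodings of the same 128-bit operations.
    Vex,
}

/// Pick an encoding from a "has AVX" flag. Kept separate from detection so
/// the policy can be exercised without depending on the host CPU.
pub fn choose_encoding(has_avx: bool) -> SimdEncoding {
    if has_avx {
        SimdEncoding::Vex
    } else {
        SimdEncoding::LegacySse
    }
}

/// Ask the host CPU whether it supports AVX (x86 and x86-64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn host_has_avx() -> bool {
    std::arch::is_x86_feature_detected!("avx")
}

/// On other architectures there is no AVX to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn host_has_avx() -> bool {
    false
}
```

Calling `choose_encoding(host_has_avx())` would then select the fallback on older chips without any compile-time configuration.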
Hm so does this mean we should disable that patch and try to actually deny-list instructions that otherwise require 16-byte alignment? We presumably have enough fuzzing now that we can be somewhat confident if we let the fuzzers run for a while
under the presumption that most VEX things don't generate a fault for unaligned addresses
in an ISLE world, I think we can just switch RegMem operands to Reg for instructions where it isn't safe to fuse the load
for the instruction constructors (in inst.isle)
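The rule being discussed could be sketched roughly as follows, in Rust rather than ISLE. Everything here is an assumption for illustration: `Encoding`, `Operand`, `can_fuse_load`, and `lower_src` are hypothetical and only loosely mirror the RegMem / Reg split. The sketch treats a legacy SSE arithmetic op as needing a 16-byte-aligned memory operand, and a VEX op as tolerating any alignment unless it is one of the explicitly aligned instructions the manual lists.

```rust
/// Encoding family for a 128-bit SIMD instruction (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    LegacySse,
    Vex,
}

/// Hypothetical source operand, loosely mirroring the RegMem / Reg split.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Operand {
    /// An xmm register, by number.
    Reg(u8),
    /// A memory operand, kept as assembly text for simplicity.
    Mem(String),
}

/// Can the load be folded into the instruction's memory operand?
/// `explicitly_aligned_op` marks instructions that fault on unaligned
/// addresses even when VEX-encoded.
pub fn can_fuse_load(enc: Encoding, explicitly_aligned_op: bool, known_aligned_16: bool) -> bool {
    if known_aligned_16 {
        return true; // an aligned address is fine for either encoding
    }
    match enc {
        Encoding::LegacySse => false,
        Encoding::Vex => !explicitly_aligned_op,
    }
}

/// Either fuse the load (memory operand, no extra instruction) or emit a
/// separate unaligned load into `tmp` and use the register form instead.
pub fn lower_src(
    enc: Encoding,
    explicitly_aligned_op: bool,
    known_aligned_16: bool,
    addr: &str,
    tmp: u8,
) -> (Option<String>, Operand) {
    if can_fuse_load(enc, explicitly_aligned_op, known_aligned_16) {
        return (None, Operand::Mem(addr.to_string()));
    }
    let load = match enc {
        Encoding::LegacySse => "movdqu",
        Encoding::Vex => "vmovdqu",
    };
    (Some(format!("{} xmm{}, {}", load, tmp, addr)), Operand::Reg(tmp))
}
```

Restricting a constructor to the register form corresponds to the second branch of `lower_src`: the load always happens separately, so alignment of the address no longer matters for the arithmetic instruction.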
@Alex Crichton yeah, I think it makes sense to do this transition; it can be done piecemeal (we're free to mix VEX-encoded and old-style-SSE-encoded instructions in one instruction stream, and I don't think there's a perf penalty for that?)
another cool thing about VEX encodings is that there are three-operand forms (op dest, src1, src2), which plays nicer with regalloc (both old -- no more move insertion -- and new -- fewer constraints)
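To make the move-insertion point concrete, here is a small sketch that lowers a packed 32-bit add to assembly text under each encoding. `Encoding` and `lower_paddd` are hypothetical helpers, not Cranelift code; the mnemonics (`movdqa`, `paddd`, `vpaddd`) are the real instructions. The legacy form overwrites its first operand, so a copy is needed whenever the destination is neither source.

```rust
/// Encoding family for a 128-bit SIMD instruction (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    LegacySse,
    Vex,
}

/// Lower `dst = src1 + src2` (packed 32-bit add) to assembly text.
/// Arguments are xmm register numbers.
pub fn lower_paddd(enc: Encoding, dst: u8, src1: u8, src2: u8) -> Vec<String> {
    match enc {
        // Three-operand form: no constraint tying dst to a source.
        Encoding::Vex => vec![format!("vpaddd xmm{}, xmm{}, xmm{}", dst, src1, src2)],
        Encoding::LegacySse => {
            if dst == src1 {
                vec![format!("paddd xmm{}, xmm{}", dst, src2)]
            } else if dst == src2 {
                // Addition is commutative, so the operands can be swapped.
                vec![format!("paddd xmm{}, xmm{}", dst, src1)]
            } else {
                // Destructive two-operand form: copy src1 into dst first.
                vec![
                    format!("movdqa xmm{}, xmm{}", dst, src1),
                    format!("paddd xmm{}, xmm{}", dst, src2),
                ]
            }
        }
    }
}
```

With three distinct registers the legacy path emits two instructions and the VEX path emits one, which is the extra move the register allocator otherwise has to insert or avoid through constraints.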
My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.
ah, that's good to know; so it sounds like we'd want to go all-in (only VEX encodings touching xmm/ymm regs)
Yeah. It looks like Intel docs still have the recommendations about using vzeroupper between VEX instructions and SSE instructions. So best not to intermix them.
@Dan Gohman you probably mean EVEX?
oh, nm
Oh, maybe. Intel's docs call it "AVX code". I don't have all the fine distinctions in my head paged in :-}
Source: 11.3.1 in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
which edition?
oh, sorry... linked
ok, yeah, looks like something to look at some more
and yeah, it is in fact talking about VEX when they say AVX there
looks like the link above is an old version of the doc, but the new version as of June 2021 still has the guidance
(for reference, it is section 15.3.1 in the newest edition)
it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue
It specifically talks about "256-bit Intel AVX", and I'm not 100% sure offhand whether that includes the re-encoded SSE instructions.
yeah
like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems
yeah :-)
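As a rough way to reason about the vzeroupper guidance before benchmarking, the following sketch counts legacy SSE instructions that would run while the upper halves of the ymm registers may still be dirty. It is a deliberately simplified model with hypothetical names (`InstKind`, `count_risky_sse`): it assumes only 256-bit VEX instructions dirty the upper state and only `vzeroupper` cleans it, and it leaves 128-bit VEX instructions neutral, which is exactly the open question in this thread.

```rust
/// Coarse instruction classes for the model (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstKind {
    LegacySse,
    Vex128,
    Vex256,
    VZeroUpper,
}

/// Count legacy SSE instructions executed while the upper ymm state is
/// assumed dirty, i.e. after a 256-bit VEX instruction with no
/// intervening vzeroupper.
pub fn count_risky_sse(stream: &[InstKind]) -> usize {
    let mut upper_dirty = false;
    let mut risky = 0;
    for inst in stream {
        match inst {
            InstKind::Vex256 => upper_dirty = true,
            InstKind::VZeroUpper => upper_dirty = false,
            InstKind::LegacySse => {
                if upper_dirty {
                    risky += 1;
                }
            }
            // Assumption: 128-bit VEX neither dirties nor cleans the state.
            InstKind::Vex128 => {}
        }
    }
    risky
}
```

Under this model, a stream that mixes only 128-bit VEX and legacy SSE instructions reports zero risky instructions; whether real hardware agrees is something only measurement can confirm.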
Andrew Brown said:
like with all of these guidance tidbits in the manual, I think I would want to benchmark things a bit on real systems
we should add more than just blake3-simd for SIMD things to sightglass :-p
yup
Dan Gohman said:
My knowledge of VEX encodings is a few years out of date, but last time I worked with them, we had performance drops which we theorized could be related to how the SSE encodings don't zero out the high parts of the registers, leading to spurious pipeline dependencies and partial-register update delays.
Certain VEX operations could be an issue on some older AVX512-enabled CPUs, when the AVX512 throttling was applied to 256-bit operations. This has been fixed in Ice Lake, though the story for older CPUs is still somewhat complicated: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking
Andrew Brown said:
it does talk about the use of 256-bit AVX instructions... not sure if the use of 128-bit AVX instructions has the same issue
I don't think 128-bit instructions should be an issue in newer CPUs and some/most of the older ones, especially the ones without AVX512. Throttling is based on whether the upper portions of the registers are set ("soft" trigger mentioned in the link above).
Last updated: Jan 24 2025 at 00:11 UTC