I posted on the Libre-SOC mailing list a plan to apply for a Rust Foundation grant once those are open again to add a PowerISA backend to Cranelift:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-February/005502.html
there was also some discussion on IRC:
https://libre-soc.org/irclog/%23libre-soc.2023-02-19.log.html#t2023-02-19T18:56:10
also posted on rust's project-portable-simd:
https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/plan.20for.20experimental.20SVP64.20support.20through.20cranelift/near/329064779
You might want to take a look at https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-dynamic-vector.md The Cranelift IR half has been implemented afaik, but the actual backend support for AArch64 SVE never landed despite being planned.
Cranelift indeed doesn't have autovectorization.
For cg_clif specifically, be aware that all SIMD intrinsics are currently implemented entirely using scalar operations.
Dynamic vectors in Cranelift support arbitrary sizes, but static vectors only support power-of-two sizes.
I'm happy to assist with adding PowerISA support on the cg_clif side, but I think proper SIMD support would be a lot more work, and it isn't a priority for me right now. For example, I think I will need to redo how vector types are represented in cg_clif, as the current representation is rather fragile in terms of correctness.
ok, well SVP64 doesn't fit that well into dynamic vectors: in SVP64 all vectors are kinda like `ArrayVec` in that the user selects the capacity, which is a compile-time constant of 1..=64 elements (not limited to powers of 2; future ISA versions may expand that), and then at runtime the length can be dynamically set per-instruction to 0..=capacity by writing to the VL register. there are also some more complex modes where each element is a fixed-width vector of 1..=4 elements (we call that sub-vectors).
for llvm, my plan is to use `llvm.vp.*` ops with fixed-length vectors for an ISA-independent representation -- imho cranelift needs its generic vector ops to become more like llvm's `llvm.vp.*` ops, since they're more general.
svp64 supports an even more general representation of vector ops as essentially auto-vectorized multi-instruction loops, of which the above ops are just the degenerate single-instruction case.
my plan is to support the simple `llvm.vp.*` case and leave the more complex stuff for when we get more funding
basic summary of `llvm.vp.*` ops:

all ops are of the form `%result = call <result_ty> @llvm.vp.the_op.arg_types(<inputs...>, <mask>, <vl>)`, with:
- the result and input types are fixed- or dynamic-length vectors (their length is effectively like `ArrayVec`'s capacity)
- the mask is a fixed- or dynamic-length vector with element type `i1`
- `vl` is an `i32` (effectively `ArrayVec`'s length)
the result elements are calculated using this algorithm:

```
result = poison // skipped for store ops
for el_idx in 0..vl.min(capacity) {
    if mask[el_idx] {
        result[el_idx] = op(input0[el_idx], input1[el_idx], ...);
    }
}
```
all inputs and mask must have matching lengths and fixed/dynamic-ness.
actually, the `llvm.vp.*` ops have UB if vl > capacity, sorry for the mixup
https://llvm.org/docs/LangRef.html#vector-predication-intrinsics
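The semantics above can be sketched as a scalar reference model in Rust (illustrative code only -- `vp_add` and the `Option`-as-poison encoding are hypothetical, not any real API):

```rust
// Scalar reference model of a lane-wise llvm.vp.* op (here: vp.add on i32).
// `None` stands in for poison: lanes at or past vl, or with a false mask
// bit, are never written.
fn vp_add(a: &[i32], b: &[i32], mask: &[bool], vl: usize) -> Vec<Option<i32>> {
    let capacity = a.len();
    assert_eq!(b.len(), capacity);
    assert_eq!(mask.len(), capacity);
    // vl > capacity is UB for the real intrinsics; model it as a hard error.
    assert!(vl <= capacity, "vl > capacity is UB for llvm.vp.* ops");
    let mut result = vec![None; capacity];
    for el_idx in 0..vl {
        if mask[el_idx] {
            result[el_idx] = Some(a[el_idx] + b[el_idx]);
        }
    }
    result
}
```

For example, with capacity 4, vl 3, and a mask disabling lane 1, only lanes 0 and 2 get defined values.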
Doesn't Cranelift's dynamic vector representation allow each dynamic vector to have a different length?
not quite, according to the rfc all dynamic vectors have length `n * global_scale_factor`, where `n` is user-selectable and specific to each vector, and where `global_scale_factor` is a global constant, selected by the cpu the output program is running on, and not user-selectable. so, e.g. if `global_scale_factor` is `2`, then you can't have a dynamic vector with length `5`
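That constraint can be stated in one line (hypothetical helper, just restating the rule above):

```rust
// Under the RFC's scheme a dynamic vector's lane count is
// n * global_scale_factor for some user-chosen n, so a length is only
// representable when it's a multiple of the scale factor.
fn representable(len: usize, global_scale_factor: usize) -> bool {
    len % global_scale_factor == 0
}
```

e.g. `representable(5, 2)` is false, matching the example above.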
a key difference between SVP64 and most other "scalable vector" ISAs is that the capacity of vectors (MAXVL) is selected by the cpu designer on RVV and SVE, but by the programmer/compiler on SVP64. this means that if you want a vector length of 14, on RVV and SVE you always need to write a loop handling the case where the cpu designer decided to limit vector lengths to 4, whereas on SVP64 you just allocate 14 consecutive registers (for 64-bit elements) and use them as a 14-element vector. all cpus implementing SVP64 support that, though they may run at different speeds -- e.g. a cpu could process 4 elements at a time and take multiple clock cycles to process the 14-element vector, but you would still only use 1 instruction.
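The difference can be sketched with two scalar models (illustrative Rust, not real intrinsics): RVV/SVE-style code must strip-mine around a hardware-chosen maximum, while SVP64-style code issues one operation over a compiler-chosen MAXVL.

```rust
// RVV/SVE style: the hardware picks the maximum vector length, so a
// 14-element add must be strip-mined into a loop. Each iteration models
// one "set vl + vector add" step.
fn add_rvv_style(dst: &mut [i64], a: &[i64], b: &[i64], hw_vlmax: usize) {
    let mut i = 0;
    while i < a.len() {
        let vl = hw_vlmax.min(a.len() - i);
        for j in i..i + vl {
            dst[j] = a[j] + b[j];
        }
        i += vl;
    }
}

// SVP64 style: the compiler picks MAXVL, so a 14-element add is a single
// prefixed instruction over 14 consecutive registers; modeled here as one
// loop, i.e. one instruction's worth of semantics.
fn add_svp64_style(dst: &mut [i64], a: &[i64], b: &[i64]) {
    for j in 0..a.len() {
        dst[j] = a[j] + b[j];
    }
}
```

Both compute the same result; the point is that only the first needs a software loop when the length exceeds what one hardware vector can hold.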
The syntax to define a dynamic vector type is `dt0 = i32x4*gv0`. I don't see any restriction that `gv0` has to be a fixed constant for the cpu it runs on, just that it is a `GlobalValue`, which may reference the cpu's vector size or may be a memory load (whose value may not change while the function is executing, but may otherwise change at runtime). Shouldn't be too hard to add a constant value as an additional option to `GlobalValue` either.
so in other words you're saying i should use a dynamic vector but say the scale factor is always 1 on SVP64, basically a fixed-length vector with weird syntax? e.g. `i32x3*1`?
I meant using `i32*gv0` where `gv0` represents the constant 3. Looking at it again it is a bit roundabout. Now that `Type` has been changed from 8 bits to 16 bits in size, maybe non-power-of-two sizes for non-dynamic vector types could work?
The intent of the expression language was indeed to offer the flexibility for a vector length to be defined however the ISA + runtime needs it to be; the global scale factor was just one example. You can go back and read the discussion on the RFC for examples of this :-) I'm confident we can adapt the framework to whatever Power needs as well
SVP64 will need 1024+512 vector types now and more in the future -- 64 different MAXVL settings (1..=64), 4 SUBVL settings (1..=4) and 6 different element types i8/u8, i16/u16, i32/u32, i64/u64, f32, f64 -- we also support bf16 and f16 but those can be added later. future versions of the SVP64 ISA extension are highly likely to support more MAXVL settings
From a blurry memory, the existing CLIF types should be fine with the exception of non-power-of-two lanes, with support for up to 256 lanes. Unless something has changed significantly, I think the bigger concern will be handling fully predicated SIMD opcodes - that is a huge change to CLIF.
Predicated SIMD ops can be handled by emitting an unpredicated op followed by a vselect and then fusing both together into a predicated simd inst in the backend, right?
Yep, that seems like a reasonable solution; it has the advantage that it will capture both code generated in exactly that shape to take advantage of predication, and code that happens to be in that shape from a generic frontend
(i.e., the usual argument about canonicalization: it increases optimization opportunity by funneling through one canonical form)
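The canonical form being discussed can be sketched in scalar Rust (hypothetical helpers, just modeling the lane semantics): an unpredicated op followed by a `vselect`, which a backend could pattern-match and fuse into one predicated instruction.

```rust
// Unpredicated lane-wise add over whole vectors.
fn unpredicated_add(a: &[i32], b: &[i32]) -> Vec<i32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

// vselect: pick if_true lanes where the mask is set, if_false elsewhere.
fn vselect(mask: &[bool], if_true: &[i32], if_false: &[i32]) -> Vec<i32> {
    mask.iter()
        .zip(if_true.iter().zip(if_false))
        .map(|(&m, (&t, &f))| if m { t } else { f })
        .collect()
}

// The canonical two-op form: equivalent to a predicated add that keeps
// `fallback` in masked-off lanes, which a backend can fuse into one inst.
fn predicated_add(a: &[i32], b: &[i32], mask: &[bool], fallback: &[i32]) -> Vec<i32> {
    vselect(mask, &unpredicated_add(a, b), fallback)
}
```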
bjorn3 said:
> Predicated SIMD ops can be handled by emitting an unpredicated op followed by a vselect and then fusing both together into a predicated simd inst in the backend, right?
that doesn't work for operations that would produce UB in masked-off lanes, e.g. division by zero or a load through a null pointer. SVP64 has general enough semantics for vectors (each vector op is basically a full multi-instruction loop, with optional element reordering for the dest register and for each source register independently, as well as masking and a bunch of other options -- this is general enough to support a full FFT in basically one instruction) that it's likely better to canonicalize into SVP64-style IR and lower from that for other backends that aren't quite as general.
when there's a single instruction in a vector op (the common case), SVP64 is basically a prefix setting up looping over vector elements, plus a suffix that is any existing PowerISA scalar instruction, which the prefix converts into a vector op while possibly changing the element size from 32/64 bits to any of 8/16/32/64 bits (for ints) or f16/bf16/f32/f64 (for floats)
https://libre-soc.org/openpower/sv/
Right, for masked loads and stores a separate instruction makes sense. For division you could technically add another select to set the divisor to 1 for masked lanes, but I don't really like that solution either.
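The divisor-select workaround mentioned above can be sketched like this (illustrative scalar Rust): masked-off lanes divide by 1, so the unpredicated divide can never trap, and a final select discards those lanes.

```rust
// Masked division via two selects: select(mask, b, 1) makes every lane's
// divisor safe, the unpredicated divide then can't trap, and a final
// select(mask, quotient, fallback) discards the masked-off lanes.
fn masked_div(a: &[i32], b: &[i32], mask: &[bool], fallback: &[i32]) -> Vec<i32> {
    // masked-off lanes divide by 1, which is always safe
    let safe_b: Vec<i32> = mask
        .iter()
        .zip(b)
        .map(|(&m, &d)| if m { d } else { 1 })
        .collect();
    let q: Vec<i32> = a.iter().zip(&safe_b).map(|(x, y)| x / y).collect();
    mask.iter()
        .zip(q.iter().zip(fallback))
        .map(|(&m, (&v, &f))| if m { v } else { f })
        .collect()
}
```

Note that a lane with divisor 0 never actually divides by 0 as long as its mask bit is clear, which is exactly the UB case the plain op-plus-vselect lowering can't handle.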
Last updated: Nov 22 2024 at 17:03 UTC