Stream: cranelift

Topic: PowerISA backend plans


view this post on Zulip Jacob Lifshay (Feb 20 2023 at 20:32):

I posted on the Libre-SOC mailing list a plan to apply for a Rust Foundation grant once those are open again to add a PowerISA backend to Cranelift:
https://lists.libre-soc.org/pipermail/libre-soc-dev/2023-February/005502.html
there was also some discussion on IRC:
https://libre-soc.org/irclog/%23libre-soc.2023-02-19.log.html#t2023-02-19T18:56:10

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 20:50):

also posted on rust's project-portable-simd:
https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/plan.20for.20experimental.20SVP64.20support.20through.20cranelift/near/329064779

view this post on Zulip bjorn3 (Feb 20 2023 at 20:54):

You might want to take a look at https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-dynamic-vector.md -- the cranelift ir half has been implemented afaik, but the actual backend support for AArch64 SVE didn't get implemented despite being planned.

view this post on Zulip bjorn3 (Feb 20 2023 at 20:55):

Cranelift indeed doesn't have autovectorization.

view this post on Zulip bjorn3 (Feb 20 2023 at 20:56):

For cg_clif specifically be aware that all simd intrinsics are currently implemented entirely using scalar operations.

view this post on Zulip bjorn3 (Feb 20 2023 at 20:58):

Dynamic vectors in Cranelift support arbitrary sizes, but static vectors only support power-of-two sizes.

view this post on Zulip bjorn3 (Feb 20 2023 at 21:03):

I'm happy to assist with adding PowerISA support on the cg_clif side, but I think proper SIMD support would be a lot more work that isn't a priority for me right now. I think I will need to redo how vector types are represented in cg_clif for example as it is rather fragile in terms of correctness.

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:18):

ok, well SVP64 doesn't fit that well into dynamic vectors: in SVP64 all vectors are kinda like ArrayVec in that the user selects the capacity, which is a compile-time constant of 1..=64 elements (not limited to powers of 2; future ISA versions may expand that), and then at runtime the length can be dynamically set per-instruction to 0..=capacity by writing to the VL register. there are also some more complex modes where each element is a fixed-width vector of 1..=4 elements (we call those sub-vectors).

for llvm, my plan is to use llvm.vp.* ops with fixed-length vectors for an ISA-independent representation -- imho cranelift needs its generic vector ops to become more like llvm's llvm.vp.* ops, since they're more general.

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:21):

svp64 supports an even more general representation of vector ops as essentially auto-vectorized multi-instruction loops, of which the above ops are just the degenerate single-instruction case.

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:22):

my plan is to support the simple llvm.vp.* case and leave the more complex stuff for when we get more funding

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:41):

basic summary of llvm.vp.* ops:
all ops are of the form %result = call <result_ty> @llvm.vp.the_op.arg_types(<inputs...>, <mask>, <vl>) with:
result and inputs types are fixed or dynamic length vectors (their length is effectively like ArrayVec's capacity)
mask is a fixed or dynamic length vector with element type i1
vl is an i32 (effectively ArrayVec's length)
the result elements are calculated using this algorithm:

result = poison // skipped for store ops
for el_idx in 0..vl.min(capacity) {
    if mask[el_idx] {
        result[el_idx] = op(input0[el_idx], input1[el_idx], ...);
    }
}

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:43):

all inputs and mask must have matching lengths and fixed/dynamic-ness.

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:47):

actually, the llvm.vp.* ops have UB if vl > capacity, sorry for the mixup

view this post on Zulip Jacob Lifshay (Feb 20 2023 at 21:47):

https://llvm.org/docs/LangRef.html#vector-predication-intrinsics
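
The semantics summarized above, including the correction about vl > capacity, can be sketched in Rust. This is only an illustrative model, not LLVM's or Cranelift's implementation: the function name `vp_binary_op` is hypothetical, poison lanes are represented as `None`, and the undefined behavior for vl > capacity is replaced by a panic.

```rust
/// Hypothetical model of one lane-wise `llvm.vp.*`-style binary op on
/// fixed-capacity vectors. `None` stands in for a poison lane, i.e. a lane
/// that is masked off or whose index is >= vl.
fn vp_binary_op<const CAP: usize>(
    a: [i32; CAP],
    b: [i32; CAP],
    mask: [bool; CAP],
    vl: usize,
    op: impl Fn(i32, i32) -> i32,
) -> [Option<i32>; CAP] {
    // The real intrinsics have undefined behavior when vl > capacity;
    // this model panics instead of silently clamping.
    assert!(vl <= CAP, "vl must not exceed the vector capacity");

    // Every lane starts out as poison.
    let mut result = [None; CAP];
    for el_idx in 0..vl {
        // Only active lanes below vl are computed.
        if mask[el_idx] {
            result[el_idx] = Some(op(a[el_idx], b[el_idx]));
        }
    }
    result
}
```

Calling it with capacity 4, vl 3, and the second lane masked off computes lanes 0 and 2 and leaves lanes 1 and 3 as `None`.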

view this post on Zulip bjorn3 (Feb 20 2023 at 22:29):

Doesn't Cranelift's dynamic vector representation allow each dynamic vector to have a different length?

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 00:15):

not quite, according to the rfc all dynamic vectors have length n * global_scale_factor where n is user selectable and specific to each vector and where global_scale_factor is a global constant, selected by the cpu the output program is running on, and not user-selectable.

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 00:16):

so, e.g. if global_scale_factor is 2, then you can't have a dynamic vector with length 5
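
Under this reading of the RFC, the reachable lengths are exactly the non-zero multiples of the scale factor. A small Rust sketch of that arithmetic follows; the function names are hypothetical and nothing here is Cranelift API.

```rust
/// Lane counts of the form `n * global_scale_factor` for n in 1..=max_n,
/// following the interpretation of the RFC given in this message.
fn reachable_lengths(global_scale_factor: u32, max_n: u32) -> Vec<u32> {
    (1..=max_n).map(|n| n * global_scale_factor).collect()
}

/// Whether `len` lanes can be expressed as `n * global_scale_factor`
/// for some n >= 1.
fn is_reachable(len: u32, global_scale_factor: u32) -> bool {
    // A zero scale factor is meaningless here and would divide by zero.
    assert!(global_scale_factor > 0, "scale factor must be non-zero");
    len != 0 && len % global_scale_factor == 0
}
```

With a scale factor of 2, `is_reachable(5, 2)` is false while `is_reachable(6, 2)` is true, which is the length-5 case mentioned above.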

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 00:31):

a key difference between SVP64 and most other "scalable vector" ISAs is that the capacity of vectors (MAXVL) is selected by the cpu designer on RVV and SVE, but is selected by the programmer/compiler on SVP64. This means that e.g. if you want a vector length of 14 then on RVV and SVE you always need to write a loop handling the case where the cpu designer decided to limit vector lengths to 4, whereas on SVP64 if you want a vector length of 14 then you just allocate 14 consecutive registers (for 64-bit elements) and use them as a 14-element vector, all cpus implementing SVP64 support that, though they may run at different speeds, e.g. a cpu could process 4 elements at a time and take multiple clock cycles to process the 14-element vector, but you would still only use 1 instruction.
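
The contrast can be sketched by counting how many lanes each vector operation covers. This is a rough model under stated assumptions: `strip_mine_chunks` and `svp64_single_op_lanes` are hypothetical names, the first models a loop on an ISA where the designer caps the vector length, and the second models a single SVP64 operation whose MAXVL was chosen by the compiler.

```rust
/// Lanes handled per loop iteration when the hardware caps vectors at
/// `hw_max_lanes` and the program wants `total` lanes overall.
fn strip_mine_chunks(total: usize, hw_max_lanes: usize) -> Vec<usize> {
    assert!(hw_max_lanes > 0, "hardware must support at least one lane");
    let mut chunks = Vec::new();
    let mut remaining = total;
    while remaining > 0 {
        // Each iteration takes as many lanes as the hardware allows.
        let vl = remaining.min(hw_max_lanes);
        chunks.push(vl);
        remaining -= vl;
    }
    chunks
}

/// Lanes handled by one SVP64-style operation: MAXVL is a compile-time
/// choice in 1..=64, so any length up to MAXVL fits in a single op.
fn svp64_single_op_lanes(total: usize, maxvl: usize) -> usize {
    assert!((1..=64).contains(&maxvl), "MAXVL must be in 1..=64");
    assert!(total <= maxvl, "length must fit within MAXVL");
    total
}
```

For 14 lanes on hardware capped at 4, the loop runs with chunks of 4, 4, 4 and 2, whereas the SVP64 model covers all 14 lanes in one operation regardless of how many elements the cpu processes per clock cycle.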

view this post on Zulip bjorn3 (Feb 21 2023 at 07:37):

The syntax to define a dynamic vector type is dt0 = i32x4*gv0. I don't see any restriction that gv0 has to be a fixed constant for the cpu it runs on. Just that it is a GlobalValue which may reference the cpu's vector size or may be a memory load (whose value may not change while the function is executing, but may otherwise change at runtime). Shouldn't be too hard to add a constant value as an additional option to GlobalValue either.

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 09:42):

so in other words you're saying i should use a dynamic vector but say the scale factor is always 1 on SVP64, basically a fixed-length vector with weird syntax?

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 11:47):

e.g. i32x3*1?

view this post on Zulip bjorn3 (Feb 21 2023 at 16:13):

I meant using i32*gv0 where gv0 represents the constant 3.

view this post on Zulip bjorn3 (Feb 21 2023 at 16:15):

Looking at it again it is a bit roundabout. Now that Type has been changed from 8bit to 16bit in size, maybe non-power-of-two sizes for non-dynamic vector types could work?

view this post on Zulip Chris Fallin (Feb 21 2023 at 16:42):

The intent of the expression language was indeed to offer the flexibility for a vector length to be defined however the ISA + runtime needs it to be; the global scale factor was just one example. You can go back and read the discussion on the RFC for examples of this :-) I'm confident we can adapt the framework to whatever Power needs as well

view this post on Zulip Jacob Lifshay (Feb 21 2023 at 19:58):

SVP64 will need 1024+512 vector types now and more in the future -- 64 different MAXVL settings (1..=64), 4 SUBVL settings (1..=4) and 6 different element types i8/u8, i16/u16, i32/u32, i64/u64, f32, f64 -- we also support bf16 and f16 but those can be added later. future versions of the SVP64 ISA extension are highly likely to support more MAXVL settings
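
The 1024+512 figure is the product of the three counts listed above. A quick Rust check of that arithmetic, with constant names chosen here purely for illustration:

```rust
/// MAXVL settings currently available: 1..=64.
const MAXVL_SETTINGS: usize = 64;
/// SUBVL settings: 1..=4.
const SUBVL_SETTINGS: usize = 4;
/// Element types from the message above (signedness pairs count once).
const ELEMENT_TYPES: [&str; 6] = ["i8/u8", "i16/u16", "i32/u32", "i64/u64", "f32", "f64"];

/// Number of distinct (MAXVL, SUBVL, element type) combinations.
fn svp64_vector_type_count() -> usize {
    MAXVL_SETTINGS * SUBVL_SETTINGS * ELEMENT_TYPES.len()
}
```

This gives 64 * 4 * 6 = 1536, which equals 1024 + 512. Adding bf16 and f16 later would raise the element-type count to 8.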

view this post on Zulip Sam Parker (Feb 22 2023 at 09:59):

From a blurry memory, the existing CLIF types should be fine with the exception of non-power-of-two lanes, with support for up to 256 lanes. Unless something has changed significantly, I think the bigger concern will be handling fully predicated SIMD opcodes - that is a huge change to CLIF.

view this post on Zulip bjorn3 (Feb 22 2023 at 12:23):

Predicated simd ops can be handled by emitting an unpredicated op followed by a vselect and then fusing both together into a predicated simd inst in the backend, right?

view this post on Zulip Chris Fallin (Feb 22 2023 at 16:00):

Yep, that seems like a reasonable solution; it has the advantage that it will capture both code generated in exactly that shape to take advantage of predication, and code that happens to be in that shape from a generic frontend

view this post on Zulip Chris Fallin (Feb 22 2023 at 16:01):

(i.e., the usual argument about canonicalization: it increases optimization opportunity by funneling through one canonical form)
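
For an operation that cannot trap, the canonical "unpredicated op followed by a select" shape and a directly predicated op give the same result, which is what would let a backend fuse the pair. A minimal Rust model follows; the function names are hypothetical, plain arrays stand in for vector registers, and this is not CLIF.

```rust
/// Canonical form: compute the add on every lane, then select per lane
/// between the sum and a fallback value.
fn add_then_select<const N: usize>(
    a: [i32; N],
    b: [i32; N],
    mask: [bool; N],
    fallback: [i32; N],
) -> [i32; N] {
    // Unpredicated op: all lanes are computed, including masked-off ones.
    let mut sum = [0i32; N];
    for i in 0..N {
        sum[i] = a[i].wrapping_add(b[i]);
    }
    // Select: keep the sum where the mask is set, else the fallback.
    let mut out = [0i32; N];
    for i in 0..N {
        out[i] = if mask[i] { sum[i] } else { fallback[i] };
    }
    out
}

/// Fused form: only active lanes are computed at all.
fn predicated_add<const N: usize>(
    a: [i32; N],
    b: [i32; N],
    mask: [bool; N],
    fallback: [i32; N],
) -> [i32; N] {
    let mut out = fallback;
    for i in 0..N {
        if mask[i] {
            out[i] = a[i].wrapping_add(b[i]);
        }
    }
    out
}
```

Both functions return the same array for any inputs, because wrapping addition has no side effects on the lanes that end up discarded.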

view this post on Zulip Jacob Lifshay (Feb 22 2023 at 20:47):

bjorn3 said:

Predicated simd ops can be handled by emitting an unpredicated op followed by a vselect and then fusing both together into a predicated simd inst in the backend, right?

that doesn't work for operations that would produce UB in masked-off lanes, e.g. division by zero or load through null pointer. SVP64 has general enough semantics for vectors (each vector op is basically a full multi-instruction loop with optional element reordering for dest register and each separate source register independently as well as masking and a bunch of other options -- this is general enough to support a full FFT in basically one instruction) that it's likely better to canonicalize into SVP64-style IR and lower from that for other backends that aren't quite as general.

when there's a single instruction in a vector op (the common case), SVP64 is basically a prefix setting up looping over vector elements and a suffix that is basically any existing PowerISA scalar instruction that is converted by the prefix into a vector op while possibly changing the element size from 32/64-bits to any of 8/16/32/64-bits (for ints) or f16/bf16/f32/f64 (for floats)

https://libre-soc.org/openpower/sv/

view this post on Zulip bjorn3 (Feb 23 2023 at 10:22):

Right, for masked loads and stores a separate instruction makes sense. For division you could technically add another select to set the divisor to 1 for masked lanes, but I don't really like that solution either.
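
The extra-select workaround for division can be sketched as follows. This is an illustrative Rust model only: `masked_div` is a hypothetical name, and the `passthru` parameter supplying values for masked-off lanes is an assumption made for the example.

```rust
/// Masked division built from an unpredicated divide plus two selects.
fn masked_div<const N: usize>(
    a: [i32; N],
    b: [i32; N],
    mask: [bool; N],
    passthru: [i32; N],
) -> [i32; N] {
    let mut out = passthru;
    for i in 0..N {
        // First select: force the divisor to 1 in masked-off lanes so the
        // unpredicated divide cannot trap there.
        let safe_divisor = if mask[i] { b[i] } else { 1 };
        // Unpredicated divide, executed on every lane. Active lanes keep
        // their normal behavior, so a zero divisor there still panics.
        let quotient = a[i] / safe_divisor;
        // Second select: discard the quotient in masked-off lanes.
        out[i] = if mask[i] { quotient } else { passthru[i] };
    }
    out
}
```

A masked-off lane with divisor 0 no longer traps: `masked_div([10, 7, 9], [2, 0, 3], [true, false, true], [0, -1, 0])` yields `[5, -1, 3]`. The cost is one more select per division than a natively predicated instruction would need.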


Last updated: Nov 22 2024 at 17:03 UTC