Stream: rfc-notifications

Topic: rfcs / issue #19 RFC: Cranelift sizeless vector types


view this post on Zulip RFC notifications bot (Dec 17 2021 at 10:44):

bjorn3 commented on issue #19:

What new instructions will be introduced for creating, loading and storing such sizeless vectors?

view this post on Zulip RFC notifications bot (Dec 17 2021 at 11:21):

sparker-arm commented on issue #19:

What new instructions will be introduced for creating, loading and storing such sizeless vectors?
I think we may need to add stack related instructions, since we have explicit ones, though I'm still not completely sure why. But I expect that the main changes will be encapsulated in new heap_addr and stack_addr operations. We could probably modify the existing ones, but new ones would help isolate the ambiguous semantics of the sizeless nature.

view this post on Zulip RFC notifications bot (Dec 17 2021 at 11:22):

sparker-arm edited a comment on issue #19:

What new instructions will be introduced for creating, loading and storing such sizeless vectors?

I think we may need to add stack related instructions, since we have explicit ones, though I'm still not completely sure why. But I expect that the main changes will be encapsulated in new heap_addr and stack_addr operations. We could probably modify the existing ones, but new ones would help isolate the ambiguous semantics of the sizeless nature.

view this post on Zulip RFC notifications bot (Dec 17 2021 at 21:55):

cfallin commented on issue #19:

One thing to note related to the stack: both stack address computation (stack_addr) and regalloc spilling depend on the frame layout, which we compute by knowing the size of all types. How does the prototype currently work wrt stack layout -- does it assume some maximal size? Or are the types fixed to some (implementation-defined) size at some point between the CLIF generation and lowering?

view this post on Zulip RFC notifications bot (Dec 17 2021 at 22:28):

abrown commented on issue #19:

cc: @penzn

view this post on Zulip RFC notifications bot (Dec 20 2021 at 13:17):

sparker-arm commented on issue #19:

@cfallin I've only just started looking into stack handling, so I don't have any concrete answers for you. For the backend part, my hand wavey plan was to try and do what is done in LLVM for SVE, where fixed-size and sizeless slots are assigned in different areas, using different pointers and the runtime value of 'VL' can be used to scale offsets into sizeless areas. I've only just looked at the machinst ABI layer though, and I will continue with that this week.

At an IR level, the slots would be as sizeless (minimum 128-bits, not sure about maximum) as any other sizeless value, but as I don't know the IR well, this is the main contention that I am concerned about :) From my basic understanding, it looks like it would be easiest to create sizeless-specific instructions to handle anything stack related, so that we can preserve the current semantics for fixed-size objects, and avoid having to worry about immediate offsets into a slot of unknown size. But I'm also assuming that the different slots could be freely mingled because they're just SSA values...? Please feel free to point me at any areas of the code which you think could be problematic.

If the ambiguous size of objects is going to be a serious problem for us, then there is the option of forcing a lowering to a fixed size, though this could be sub-optimal for architectures like SVE when we're not functioning as a JIT.

view this post on Zulip RFC notifications bot (Dec 20 2021 at 13:17):

sparker-arm edited a comment on issue #19:

@cfallin I've only just started looking into stack handling, so I don't have any concrete answers for you. For the backend part, my hand wavey plan was to try and do what is done in LLVM for SVE, where fixed-size and sizeless slots are assigned in different areas, using different pointers and the runtime value of 'VL' can be used to scale offsets into sizeless areas. I've only just looked at the machinst ABI layer though, and I will continue with that this week.

At an IR level, the slots would be as sizeless (minimum 128-bits, not sure about maximum) as any other sizeless value. But, as I don't know the IR well, this is the main contention that I am concerned about :) From my basic understanding, it looks like it would be easiest to create sizeless-specific instructions to handle anything stack related, so that we can preserve the current semantics for fixed-size objects, and avoid having to worry about immediate offsets into a slot of unknown size. But I'm also assuming that the different slots could be freely mingled because they're just SSA values...? Please feel free to point me at any areas of the code which you think could be problematic.

If the ambiguous size of objects is going to be a serious problem for us, then there is the option of forcing a lowering to a fixed size, though this could be sub-optimal for architectures like SVE when we're not functioning as a JIT.
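
To make the "different areas, scaled offsets" idea above concrete, here is a minimal Rust sketch. The names `SlotAddr` and `resolve` are hypothetical and are neither Cranelift nor LLVM API. It assumes fixed-size slots are addressed by a byte offset from one base pointer, and sizeless slots by an index that is multiplied by the runtime vector length in bytes.

```rust
/// Hypothetical slot address; not an existing Cranelift type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SlotAddr {
    /// Fixed-size area: a plain byte offset from the fixed-area base.
    Fixed { offset: u64 },
    /// Sizeless area: an index that is scaled by the runtime vector length.
    Scaled { index: u64 },
}

/// Resolve a slot to an absolute address.
/// `fixed_base` and `scaled_base` are the two (assumed) area pointers and
/// `vl_bytes` is the vector length in bytes, only known at run time.
pub fn resolve(slot: SlotAddr, fixed_base: u64, scaled_base: u64, vl_bytes: u64) -> u64 {
    match slot {
        SlotAddr::Fixed { offset } => fixed_base + offset,
        SlotAddr::Scaled { index } => scaled_base + index * vl_bytes,
    }
}
```

The point of the split is that the same `Scaled { index: 2 }` slot resolves to a different address on a 128-bit and a 256-bit machine, while every `Fixed` offset stays a compile-time constant.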

view this post on Zulip RFC notifications bot (Dec 20 2021 at 22:29):

cfallin commented on issue #19:

@sparker-arm Thanks for the clarifications! I think that we can definitely find a way to accommodate variable-sized types; we just need to consider the problem explicitly and make sure to fix the places it breaks any assumptions :-)

A quick-and-dirty way of getting a feel for that would be to make ty.bits() panic when called on a variable-sized type, then run whatever vector tests have been updated to use the sizeless vectors. I expect that the ABI code is going to be a large part of that. Note that it's not just explicit user loads/stores that we have to worry about: the regalloc, for example, assumes it can spill any register (including the new sizeless-vector ones) and needs to know how large of a spillslot to allocate.

Just for clarity -- I'm not actually sure I could say for certain despite being deep into this conversation! -- when exactly is the vector size made concrete? Do we ultimately compile the code while knowing the size (e.g., we decide we're compiling for microarchitecture X, and so we choose for vectors to be 512-bits wide)? Or are we actually generating code that will only know at runtime? I had been assuming the former, but your mention of computed offsets, etc., makes me think possibly it's the latter.

If we ultimately know the size statically, then it's just a phase-ordering/staging problem: we might need to rework how some of the ABI code operates, but it's not fundamentally incompatible with our model. If the latter, then it sounds like it's actually alloca() of sorts. We know what we need to do to make this work but it's a bit tricky, especially if we're going to have lots of spillslots.
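
The ty.bits() experiment suggested above could look roughly like the following Rust sketch. `Ty` here is a hypothetical stand-in, not Cranelift's real type representation, and `min_bits` is an assumed helper for callers that have already been audited.

```rust
/// Hypothetical stand-in for an IR type; not Cranelift's real `Type`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty {
    /// Size known at compile time, in bits.
    Fixed(u32),
    /// Sizeless vector: only a minimum size is known.
    Sizeless { min_bits: u32 },
}

impl Ty {
    /// Panics on sizeless types, so every caller that assumes a static
    /// size shows up as a failing test rather than a silent miscompile.
    pub fn bits(self) -> u32 {
        match self {
            Ty::Fixed(bits) => bits,
            Ty::Sizeless { .. } => panic!("bits() called on a sizeless type"),
        }
    }

    /// Non-panicking lower bound for callers that only need a minimum.
    pub fn min_bits(self) -> u32 {
        match self {
            Ty::Fixed(bits) => bits,
            Ty::Sizeless { min_bits } => min_bits,
        }
    }
}
```

Running the existing vector tests against such a type would then list, one panic at a time, the places that need an explicit answer for sizeless values.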

view this post on Zulip RFC notifications bot (Dec 20 2021 at 22:59):

bjorn3 commented on issue #19:

It is determined at runtime. For arm I believe it is fixed for each cpu, while for riscv the code can ask for any length (up to a certain maximum) and then the minimum of what the user requested and the cpu supports is chosen as far as I understand. The whole point is to make executables agnostic to the vector size supported by the cpu and thus not require recompilation when a new cpu is released with bigger vectors.

view this post on Zulip RFC notifications bot (Dec 21 2021 at 00:49):

penzn commented on issue #19:

Just for clarity -- I'm not actually sure I could say for certain despite being deep into this conversation! -- when exactly is the vector size made concrete? Do we ultimately compile the code while knowing the size (e.g., we decide we're compiling for microarchitecture X, and so we choose for vectors to be 512-bits wide)? Or are we actually generating code that will only know at runtime? I had been assuming the former, but your mention of computed offsets, etc., makes me think possibly it's the latter.

It should be the former, in a sense that it would be known what SIMD length a given CPU supports and then the code can be generated with the right instructions, and ideally without dynamic handling when the generated code executes. "Determined at runtime" is from the point of view of the developer, not wasm runtime.

view this post on Zulip RFC notifications bot (Dec 21 2021 at 08:08):

bjorn3 commented on issue #19:

That would only work for jit compilation and not aot compilation. With aot compilation it may not be known what cpu it runs on.

view this post on Zulip RFC notifications bot (Dec 21 2021 at 09:47):

sparker-arm commented on issue #19:

worry about: the regalloc, for example, assumes it can spill any register (including the new sizeless-vector ones) and needs to know how large of a spillslot to allocate.

Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.

when exactly is the vector size made concrete?

The whole point is to make executables agnostic to the vector size supported by the cpu and thus not require recompilation when a new cpu is released with bigger vectors.

it would be known what SIMD length a given CPU supports and then the code can be generated with the right instructions, and ideally without dynamic handling when the generated code executes

@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime?

Either way, both these options are available, and will both be implemented. For architectures that support runtime vector lengths, we can generate agnostic code (SVE uses the 'VL' _runtime_ variable for most of this) while Neon, AVX-2 and AVX-512 will have to choose a size during codegen. I'm currently implementing SVE and Neon, in tandem, to support these types, and my idea is, still, that we'd have a legalisation pass that can convert sizeless to fixed sizes which would enable all the SIMD compatible backends to add (basic) support with little or no effort.

view this post on Zulip RFC notifications bot (Dec 21 2021 at 10:35):

sparker-arm commented on issue #19:

A quick-and-dirty way of getting a feel for that would be to make ty.bits() panic when called on a variable-sized type

And thanks for this @cfallin

view this post on Zulip RFC notifications bot (Dec 21 2021 at 18:25):

cfallin commented on issue #19:

Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.

Right, it can be made to work; it's just a bit of an API change that we'll need to design and coordinate. Right now both regalloc.rs and regalloc2 are built around the notion that they manage spill space (in units of slots) and can ask how many slots a given value will need. We'll need to have some sort of notion of a separate user-managed spill space for type/regclass X.

Re: stack layout more generally: none of this is impossible, but it's going to need some careful design rework in the ABI code. Specifically, right now the ABI implementation is mostly shared between aarch64, x64 and s390x, and on all architectures, FP (RBP) is kept at the top of the frame, just below return address/stack args, and used to access stack args; and SP (RSP) is kept at the bottom of the frame, as required (redzone notwithstanding). All stackslots and spillslots are accessed via offsets from SP.

If part of the frame is variable, we'll need to either start accessing the fixed part via negative offsets from FP, or move FP below the fixed part and above the variable part, or keep another base register. The third is not great (extra register pressure). The first was how my original aarch64 ABI impl worked, but we switched from negative-from-FP to positive-from-SP since the encoding is more efficient. So we're left with the second, which is I think how other compilers also handle variable frame sizes on aarch64 (?). The issue with that is that we share the generic ABI code with x64, and on x64, we have to keep RBP at the top of the frame in the Windows Fastcall ABI. Also relevant: there's support for omitting frame pointer setup when unneeded (in leaf functions at least); that interacts with this decision too.

So we can definitely work this out, but we'll need to probably add "modes" to the ABI: where is FP, from which base register are (i) args, (ii) fixed slots, (iii) variable slots accessed, how unwind info is emitted correctly in all cases, etc. Lots of interacting moving parts.

All of the above is needed for e.g. alloca() support as well, and is not unique to runtime-variable-sized types, but this may be the thing that forces the need first; so, thanks for pioneering the trail :-)

view this post on Zulip RFC notifications bot (Dec 21 2021 at 18:27):

cfallin edited a comment on issue #19:

Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.

Right, it can be made to work; it's just a bit of an API change that we'll need to design and coordinate. Right now both regalloc.rs and regalloc2 are built around the notion that they manage spill space (in units of slots) and can ask how many slots a given value will need. We'll need to have some sort of notion of a separate user-managed spill space for type/regclass X.

Re: stack layout more generally: none of this is impossible, but it's going to need some careful design rework in the ABI code. Specifically, right now the ABI implementation is mostly shared between aarch64, x64 and s390x, and on all architectures, FP (RBP) is kept at the top of the frame, just below return address/stack args, and used to access stack args; and SP (RSP) is kept at the bottom of the frame, as required (redzone notwithstanding). All stackslots and spillslots are accessed via offsets from SP.

If part of the frame is variable, we'll need to either start accessing the fixed part via negative offsets from FP, or move FP below the fixed part and above the variable part, or keep another base register. The third is not great (extra register pressure). The first was how my original aarch64 ABI impl worked, but we switched from negative-from-FP to positive-from-SP since the encoding is more efficient. So we're left with the second, which is I think how other compilers also handle variable frame sizes on aarch64 (?). The issue with that is that we share the generic ABI code with x64, and on x64, we have to keep RBP at the top of the frame in the Windows Fastcall ABI (EDIT: and so we always do, for uniformity). Also relevant: there's support for omitting frame pointer setup when unneeded (in leaf functions at least); that interacts with this decision too.

So we can definitely work this out, but we'll need to probably add "modes" to the ABI: where is FP, from which base register are (i) args, (ii) fixed slots, (iii) variable slots accessed, how unwind info is emitted correctly in all cases, etc. Lots of interacting moving parts.

All of the above is needed for e.g. alloca() support as well, and is not unique to runtime-variable-sized types, but this may be the thing that forces the need first; so, thanks for pioneering the trail :-)

view this post on Zulip RFC notifications bot (Dec 21 2021 at 18:38):

cfallin commented on issue #19:

Ah, another option I missed: variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots. I think that satisfies all the constraints we have today (but not alloca(); the thing that makes VST-slots "weaker" than alloca in requirements is that we can know the size at prologue time rather than throughout the function body).
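
A minimal Rust sketch of this last layout (variable part in the middle, fixed part at the bottom, everything below FP addressed from SP). The `Frame` struct and its fields are hypothetical, alignment and callee-saves are ignored, and all variable slots are assumed to be exactly one vector long.

```rust
/// Hypothetical frame description; field names are illustrative only.
pub struct Frame {
    /// Bytes of ordinary stackslots/spillslots, known at compile time.
    pub fixed_bytes: u64,
    /// Number of vector-sized slots in the variable part.
    pub dyn_slots: u64,
}

impl Frame {
    /// Offset of a fixed slot from SP: independent of the vector length.
    pub fn fixed_slot_offset(&self, offset: u64) -> u64 {
        assert!(offset < self.fixed_bytes, "offset outside the fixed part");
        offset
    }

    /// Offset of a variable slot from SP: skip the fixed part, then scale
    /// the slot index by the vector length.
    pub fn dyn_slot_offset(&self, index: u64, vl_bytes: u64) -> u64 {
        assert!(index < self.dyn_slots, "no such variable slot");
        self.fixed_bytes + index * vl_bytes
    }

    /// Amount the prologue subtracts from SP. This needs the vector
    /// length once, at prologue time, and never again in the body.
    pub fn frame_bytes(&self, vl_bytes: u64) -> u64 {
        self.fixed_bytes + self.dyn_slots * vl_bytes
    }
}
```

With `fixed_bytes: 48` and three variable slots, the frame is 96 bytes at a 16-byte vector length and 144 bytes at 32, while every fixed slot keeps the same SP-relative offset in both cases.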

view this post on Zulip RFC notifications bot (Dec 21 2021 at 19:23):

penzn commented on issue #19:

@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime?

@sparker-arm, there is an idea to support setting 'current' length at runtime, RISC-V style, but the maximum available was always meant to be determined before, in the spirit of straight-forward compilation support.

That would only work for jit compilation and not aot compilation. With aot compilation it may not be known what cpu it runs on.

@bjorn3, valid point - how does AOT compilation currently handle target features?

view this post on Zulip RFC notifications bot (Dec 21 2021 at 19:26):

penzn edited a comment on issue #19:

@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime?

@sparker-arm, there is an idea to support setting 'current' length at runtime, RISC-V style, but the maximum available was always meant to be determined before, in the spirit of straight-forward compilation support. Doing something like that isn't impossible, but would most likely require dynamic dispatch, and was considered out of scope.

That would only work for jit compilation and not aot compilation. With aot compilation it may not be known what cpu it runs on.

@bjorn3, valid point - how does AOT compilation currently handle target features?

view this post on Zulip RFC notifications bot (Dec 21 2021 at 19:42):

bjorn3 commented on issue #19:

There is a list of target features. When doing AOT compilation you can choose a list of allowed target features. The produced executable will then run on all cpus supporting all these target features. This list is generally chosen very conservatively to maximize the amount of cpus it works on. For example only sse and sse2 are enabled by default by rustc on x86_64. This allows at most 128 bit vectors despite the fact that modern cpus support 512 bit vectors through avx512. You can enable avx (for 256 bit vectors) or avx512 (for 512 bit vectors) but then it won't run on older cpus. Arm's SVE however allows you to compile once with the SVE target feature enabled and then it will use eg 512 bit vectors on cpus that support them while retaining compatibility with cpus that only support 128 bit vectors.

view this post on Zulip RFC notifications bot (Dec 22 2021 at 07:54):

penzn commented on issue #19:

It sounds like the same would work here, when the user requests AVX instructions, they are going to get 256-bit vectors, SSE - 128-bit and so on.
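
As a rough Rust illustration of that mapping, under the assumption that the widest enabled feature decides the vector width. The `Feature` enum and `vector_bits` function are hypothetical and only cover the three x86_64 features named in this thread.

```rust
/// Hypothetical target-feature list; only the features named in the thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Feature {
    Sse2,
    Avx,
    Avx512,
}

/// Widest vector, in bits, that the enabled features allow, or `None`
/// if no SIMD feature is enabled at all.
pub fn vector_bits(enabled: &[Feature]) -> Option<u32> {
    enabled
        .iter()
        .map(|f| match f {
            Feature::Sse2 => 128,
            Feature::Avx => 256,
            Feature::Avx512 => 512,
        })
        .max()
}
```

An AOT build with only `Sse2` enabled would bind the flexible vectors to 128 bits, which is the conservative default described above.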

view this post on Zulip RFC notifications bot (Dec 22 2021 at 16:44):

sparker-arm commented on issue #19:

variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots.

@cfallin I think you're generally describing how AArch64 is currently handled in LLVM, with a third register used in the presence of dynamic objects: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp

@penzn Does the flexible spec state that the size of all the flexible types are the same? I had assumed not because of the various type.length operations, or are these available to just be more user friendly?

view this post on Zulip RFC notifications bot (Dec 23 2021 at 05:48):

penzn commented on issue #19:

@penzn Does the flexible spec state that the size of all the flexible types are the same? I had assumed not because of the various type.length operations, or are these available to just be more user friendly?

Yes, those are there to express things in number of elements instead of bytes, which would be more natural for loops and arrays (also cuts one instruction out, but that isn't a big win really). Spec does not say whether or not different types can have different byte length, but practically hardware length is the same for architectures that use same SIMD registers for different types.

view this post on Zulip RFC notifications bot (Dec 23 2021 at 10:56):

sparker-arm commented on issue #19:

Thanks @penzn, though I feel having the size defined in the spec would be very useful in reducing the amount of ambiguity for any target independent parts of the compiler; the case of stack objects is a good example where, at the IR level, we can have a homogeneously unsized slot type which can be (re)used by any flexible type. When it comes closer to codegen, it also makes the layout of the frame easier, smaller and more efficient as, if they potentially had different sizes, we'd likely have to pad objects to a fixed alignment or have expensive ways of calculating the address of each object.

view this post on Zulip RFC notifications bot (Dec 23 2021 at 14:35):

sparker-arm commented on issue #19:

But now I remember that, for SVE, the sizeless registers aren't always equal - the predicate registers are x8 smaller than the data regs...

view this post on Zulip RFC notifications bot (Jan 04 2022 at 11:00):

sparker-arm edited a comment on issue #19:

variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots.

@cfallin I think you're generally describing how AArch64 is currently handled in LLVM, with a third register used in the presence of dynamic objects: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp

@penzn Does the flexible spec state that the size of all the flexible types are the same? I had assumed not because of the various type.length operations, or are these available to just be more user friendly?

view this post on Zulip RFC notifications bot (Jan 05 2022 at 09:56):

sparker-arm edited a comment on issue #19:

But now I remember that, for SVE, the sizeless registers aren't always equal - the predicate registers are x8 smaller than the data regs...

EDIT: This actually shouldn't matter at the moment, while comparisons produce vector masks in the vector regs, but is something we should consider if predicate types are introduced.
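
For reference, the 8x relationship can be written down directly: an SVE predicate register holds one bit per byte of a data register. A small Rust sketch, where `predicate_bytes` is a hypothetical helper and the input check assumes the architectural rule that the vector length is a multiple of 128 bits:

```rust
/// Size in bytes of an SVE predicate register for a given data register
/// size. One predicate bit covers one byte of vector data.
pub fn predicate_bytes(vl_bytes: u32) -> u32 {
    // SVE vector lengths are multiples of 128 bits (16 bytes).
    assert!(
        vl_bytes >= 16 && vl_bytes % 16 == 0,
        "vector length must be a multiple of 16 bytes"
    );
    vl_bytes / 8
}
```

If predicate types were ever given their own stack slots, a single per-slot scaling factor would no longer be enough, which is the concern raised above.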

view this post on Zulip RFC notifications bot (Jan 06 2022 at 15:41):

sparker-arm commented on issue #19:

@cfallin I've spent the last few days playing with the ABI layer, and I think I've stumbled upon (again) the limitation with the RegClass types, specifically while looking at callee saves... according to the AAPCS, in many cases SVE regs are treated the same as Neon but there are cases where the whole 'Z' register needs to be saved, and I'm not sure how to handle this while we just have shared V128 registers for Neon and SVE.

So two questions really: can you think of a way around this, and/or do we need to follow the AAPCS? RegClass is used quite a bit in the ABI layer so I'm assuming there's going to be more problems like this.

view this post on Zulip RFC notifications bot (Jan 06 2022 at 21:24):

cfallin commented on issue #19:

Hmm, interesting; can you say more about what factors influence prologue/epilogue register save details? Is it just "if you clobber the upper bits, save them" or something like that? I think it might be reasonable to, as we scan the function body building up the clobber set during ABI processing, look at some information provided by an instruction ("requires special save/restore") and use that info as well.

The basic abstraction at the regalloc layer is that we have disjoint classes of units to allocate; overlapping classes are a major refactor that would require quite a significant investment of time (a month or so in regalloc code, with significant correctness risk); so I think that we likely need to find a way to make this work without two different register classes that mean "Z/vec reg as traditional vec reg" and "Z/vec reg as Z reg". But it's fundamentally a single register, so this doesn't seem wrong to me. In a sense, the way that we use it is type information layered on top of the allocation itself.

view this post on Zulip RFC notifications bot (Jan 07 2022 at 09:02):

sparker-arm commented on issue #19:

It is when the routine receives and/or returns Z or predicate registers that the callee needs to save the whole Z register (z8-z23). From what you said though, I guess we could just grab the type information from the instruction so that we don't have to rely on RegClass? I definitely have no interest in trying to bend/break regalloc to the will of SVE, but if we can pass the high-level type information down then I think this will be fine. I still haven't looked at spill/restores though :)

view this post on Zulip RFC notifications bot (Jan 07 2022 at 09:22):

sparker-arm edited a comment on issue #19:

It is when the routine receives and/or returns Z or predicate registers that the callee needs to save the whole Z register (z8-z23). From what you said though, I guess we could just grab the type information from the instruction so that we don't have to rely on RegClass? I definitely have no interest in trying to bend/break regalloc to the will of SVE, but if we can pass the high-level type information down then I think this will be fine. I still haven't looked at spill/restores though :)

edit: But now from quickly glancing, I see that type information is passed around in the spill/restore APIs so I'm hopeful.

view this post on Zulip RFC notifications bot (Jan 07 2022 at 10:35):

sparker-arm edited a comment on issue #19:

It is when the routine receives and/or returns Z or predicate registers that the callee needs to save the whole Z register (z8-z23). From what you said though, I guess we could just grab the type information from the instruction so that we don't have to rely on RegClass? I definitely have no interest in trying to bend/break regalloc to the will of SVE, but if we can pass the high-level type information down then I think this will be fine. I still haven't looked at spill/restores though :)

edit: But now from quickly glancing, I see that type information is passed around in the spill/restore APIs so I'm hopeful.

edit: And the only place that I've found which doesn't provide a virtual reg for spill/reload is in regalloc's get_stackmap_artefacts_at function...

view this post on Zulip RFC notifications bot (Jan 07 2022 at 16:50):

sparker-arm commented on issue #19:

@cfallin Do you think adding something like 'add_sizeless_def' to RegUsageCollector would be a suitable way to communicate between regalloc and Cranelift? I _think_ this would allow us to convert 'clobbered_registers' to hold a RealReg and bool to represent whether it is sizeless.

view this post on Zulip RFC notifications bot (Jan 07 2022 at 18:37):

cfallin commented on issue #19:

edit: But now from quickly glancing, I see that type information is passed around in the spill/restore APIs so I'm hopeful.

Not anymore actually; see the recent fuzzbug fix (and followup issue #3645) where we determined that this type information can be inaccurate in the presence of moves and move elision.

There's a bit of subtlety here: the register allocator should not know about IR types in a non-opaque way; that's an abstraction leak and a correctness risk (what if it tries to do something with that info?). The allocator is a thing that hands out registers; registers are just black boxes that hold bits; modern aarch64 machines have Z registers; Cranelift can decide to put a Z-register-sized value, or a v128-sized value, or an f64-sized value, or anything else into that register. That's a clear delineation of responsibilities and if we blur that line, I fear we could have bigger correctness problems later (similar to the one CVE and another almost-CVE this area has given us when we do blur the line).

The suggested fix in the issue above involves having the regalloc record type info alongside registers, but only to verify that moves are valid; we don't want it to actually peek into that type info.

But, the ABI code is free to reason about how the function body uses registers. So what I imagine could work is that when we generate prologue/epilogue code, we can scan over parameter/return types ("do we take or return a Z reg" condition above) and determine what we actually clobber and what we need to save. If something also depends on whether instructions actually e.g. touch high bits or whatnot, we can scan instructions for that too. But that all stays within Cranelift; it doesn't involve regalloc changes.

It's possible I'm missing some requirement here though -- is there some reason that we can't determine the info we need from scanning the code on the Cranelift side?
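
A minimal Rust sketch of such a signature scan, staying entirely on the Cranelift side. `AbiTy`, `Signature` and `saves_full_z_regs` are hypothetical names, and the condition is only the one stated earlier in this thread (the routine receives and/or returns Z or predicate registers), not a full reading of the AAPCS.

```rust
/// Hypothetical ABI-level value classes; not Cranelift's real types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AbiTy {
    Int,
    Float,
    Neon128,
    SveVector,
    SvePredicate,
}

/// Hypothetical, simplified function signature.
pub struct Signature {
    pub params: Vec<AbiTy>,
    pub returns: Vec<AbiTy>,
}

/// True if any parameter or return value is an SVE vector or predicate,
/// in which case the prologue would save the whole Z register instead of
/// only the low bits shared with Neon.
pub fn saves_full_z_regs(sig: &Signature) -> bool {
    sig.params
        .iter()
        .chain(sig.returns.iter())
        .any(|t| matches!(t, AbiTy::SveVector | AbiTy::SvePredicate))
}
```

The register allocator never sees `AbiTy` here; the decision is made while generating the prologue and epilogue.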

view this post on Zulip RFC notifications bot (Jan 10 2022 at 11:19):

sparker-arm commented on issue #19:

Yes, I think anything should be possible in cranelift and a scan sounds fine - as long as cranelift still has type info at that point, otherwise manually matching SVE opcodes sounds like a bug waiting to happen too. But maybe it's also fine, for now, to treat all V128 regs like Z-regs.

I agree that the regalloc shouldn't need to know about IR types and my concerns stem from my lack of knowledge and regalloc's interest in sizes, but I think your patch remedies that :) Without IR types, I was imagining spill slots being re-used for RegClass::V128 with potentially mismatching sizes.

Having static sizes for stack objects should make the rest of codegen simpler, though I imagine we may have to be conservative in an AOT setting and only support the minimum width. And, of course, a larger Z-reg will have a greater negative effect on stack usage for scalar FP and Neon too.

view this post on Zulip RFC notifications bot (Jan 10 2022 at 15:02):

sparker-arm edited a comment on issue #19:

Yes, I think anything should be possible in cranelift and a scan sounds fine - as long as cranelift still has type info at that point, otherwise manually matching SVE opcodes sounds like a bug waiting to happen too. But maybe it's also fine, for now, to treat all V128 regs like Z-regs.

I agree that the regalloc shouldn't need to know about IR types and my concerns stem from my lack of knowledge and regalloc's interest in sizes, but I think your patch remedies that :) Without IR types, I was imagining spill slots being re-used for RegClass::V128 with potentially mismatching sizes.

Having static sizes for stack objects should make the rest of codegen simpler, though I imagine we may have to be conservative in an AOT setting and only support the minimum width. And, of course, a larger Z-reg will have a greater negative effect on stack usage for scalar FP and Neon too.

edit: I've just realized that userspace applications don't have the necessary privileges to set the vector length, so I can't see static sizes actually working for spill slots - unless we default to the maximum size and that doesn't seem like a good idea!

view this post on Zulip RFC notifications bot (Feb 14 2022 at 15:58):

sparker-arm commented on issue #19:

Gentle ping on this... the latest change adds sizeless stack instructions and a new slot type to clif and the frame layout has been modified.

view this post on Zulip RFC notifications bot (Feb 14 2022 at 23:03):

cfallin commented on issue #19:

Addendum to above: after reviewing the thread discussion above again, I see that it can actually be both: runtime-determined size, or compile-time-bound size. Things are a little clearer to me now, but then we do need to fill in the details of how the runtime-determined part works: how do we tell the ABI code what the size actually is?

Maybe a useful way to look at this is: what are the semantics of the CLIF, separate from any target architecture? (There must be some such semantics, or else we can't sensibly talk about target-independent codegen and optimizations, and, e.g., the CLIF interpreter.) Perhaps there is some special kind of global value (we could define a new one for symbolic VL on aarch64, or use a "load from this part of vmctx" op) that we can feed to a "sizeless vector" entity that actually gives it its size?

So something like:

    size0 = vl
    dynslot0 = dynamically_sized_slot size0
    dynslot1 = dynamically_sized_slot size0

    ...
    block0:
        v0 = dynamically_sized_vector_load dynslot0

(with more concise keywords as desired)

In that case, I'd prefer to call it a "dynamically-sized" vector, as sketched above, and I'd prefer for the compile-time binding to be thought of as a legalization pass (as you alluded to above) that's sort of like constant propagation: basically we replace all dynamically_sized_slot slots with normal slots with a given type.

Does this seem reasonable?
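The compile-time binding described here could look roughly like the following. This is only a sketch under stated assumptions: `Slot` and `bind_dynamic_slots` are hypothetical names, not Cranelift types, and a real pass would also have to rewrite the instructions that access the slots.

```rust
/// A stack slot as it might appear before legalization (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Slot {
    /// Ordinary slot with a size known at compile time, in bytes.
    Fixed { size_bytes: u32 },
    /// Slot that holds one vector whose length is still symbolic.
    Dynamic,
}

/// Replace every dynamic slot with a fixed slot of `vl_bytes` bytes, in the
/// same spirit as propagating a constant into each of its uses.
fn bind_dynamic_slots(slots: &mut [Slot], vl_bytes: u32) {
    for slot in slots.iter_mut() {
        if *slot == Slot::Dynamic {
            *slot = Slot::Fixed { size_bytes: vl_bytes };
        }
    }
}
```

After this pass runs, nothing downstream sees a dynamic slot, so the existing frame-layout code would apply unchanged.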

view this post on Zulip RFC notifications bot (Feb 15 2022 at 11:33):

sparker-arm commented on issue #19:

@cfallin Many thanks for taking another look!

In other words, at what time do we know the actual size of the vectors, and are they all the same size, or a different size per vector?

This needs to be defined by the flexible vector spec and it's my view that all the current flexible vector types should be specified to have the same size as each other. My current implementation reports a single size for all such vector types and they're all treated the same throughout compilation.

seems unnecessary. In a sense, in this design the CLIF is just "polymorphic on vector type"; we can monomorphize for, say, 256 or 512 bits, and then the rest of the pipeline works as-is today, just with another register class.

Yes, with the current proposal we are unable to provide proper dynamically sized/sizeless support, only partial support. We are still mapping to V128 registers because, without aliasing info, I cannot see how using another register class would be a feasible option for SVE (I don't know about AVX and friends).

if we really do expect this to be a runtime-bound value (say, libc during startup detects CPU features and then the rest of our code uses N-bit vectors accordingly)

To clarify, for proper SVE support, we wouldn't go for the route of querying the vector width and would instead produce a fully generic program. And I assume that passing feature flags to enable SVE, AVX, etc... would be a perfectly reasonable use case.

Perhaps there is some special kind of global value (we could define a new one for symbolic VL on aarch64, or use a "load from this part of vmctx" op) that we can feed to a "sizeless vector" entity that actually gives it its size?

I'm not sure if this would be necessary, it just seems like sugaring to me, as we could surely define the same semantics on the explicit_sizeless_slot entities and their accessing instruction counterparts. After all, the actual semantics, whether on a symbol or instruction, are defined by whatever machine, or interpreter, we're running on. Additionally, we (flexible vector spec) also have instructions to query for the vector length so I would be keen to avoid having a global value which may confuse things. I feel that we should define the semantics of sizeless/dynamic slots in terms of the vec.*.length operations. How does that sound to you?

I'd prefer for the compile-time binding to be thought of as a legalization pass (as you alluded to above) that's sort of like constant propagation: basically we replace all dynamically_sized_slot slots with normal slots with a given type.

For now, this seems more than reasonable enough to me! Once spill slots had a fixed size I figured this would make the most sense, but I still mainly wanted to see how the rest of the ABI would cope with the notion of a sizeless type.

I'm also perfectly happy with using the 'dynamic' nomenclature, and sorry if I've missed any of your other points or questions.

view this post on Zulip RFC notifications bot (Feb 17 2022 at 01:52):

cfallin commented on issue #19:

@cfallin Many thanks for taking another look!

In other words, at what time do we know the actual size of the vectors, and are they all the same size, or a different size per vector?

This needs to be defined by the flexible vector spec and it's my view that all the current flexible vector types should be specified to have the same size as each other. My current implementation reports a single size for all such vector types and they're all treated the same throughout compilation.

FWIW, I'd prefer that we define how our vector mechanism works within CLIF, independently of the "upstream" flexible vector spec; we can always update our semantics if needed, but we should err on the side of explicitness.

I think that "each dynamically-sized vector can have its own size" may actually be more logically consistent (see more below) if things are implemented at the CLIF-primitive level in the way I'm imagining, but I could see either way working.

Perhaps there is some special kind of global value (we could define a new one for symbolic VL on aarch64, or use a "load from this part of vmctx" op) that we can feed to a "sizeless vector" entity that actually gives it its size?

I'm not sure if this would be necessary, it just seems like sugaring to me, as we could surely define the same semantics on the explicit_sizeless_slot entities and their accessing instruction counterparts. After all, the actual semantics, whether on a symbol or instruction, are defined by whatever machine, or interpreter, we're running on.

This I actually disagree more with and I think it's important to spell out the principle at play: the whole point of a machine-independent IR is that the semantics do not depend on the machine or interpreter we're running on. If we have special vector instructions, or types, that behave differently according to the underlying platform, then:

(There are a few places where this general principle isn't true and I want to fix them eventually; e.g. "native endian" loads/stores.)

So I think a better starting point is to define an abstraction, whatever it may look like, that can (i) implement what we need (here, the Wasm flexible vector spec), and (ii) be mapped relatively straightforwardly to the machine architectures we know/care about.

A suggestion below:

Additionally, we (flexible vector spec) also have instructions to query for the vector length so I would be keen to avoid having a global value which may confuse things. I feel that we should define the semantics of sizeless/dynamic slots in terms of the vec.*.length operations. How does that sound to you?

The reason that things like the stack check, or location of heaps, are defined via the "global value" abstraction (really a limited expression language that can express certain inputs like vmctx, constants, adds, and loads) is that we need to know some values before the body of the function starts executing. For example, the prologue code needs to know about stack limits before the first instruction.

The general hierarchy of CLIF is that (i) we have global values, which define "environment/context"; (ii) we have entities, like heaps or slots, and some of these entities are parameterized on the global values; (iii) we have instructions, some of which use entities.

Since a stack slot is an entity used by certain instructions, and since we need to know the size of stack slots to allocate the stack frame in the prologue, it makes more sense to me that we would have a kind of entity that is a dynamic slot, and that entity uses a global value to define its size in bytes. In contrast, trying to recover size of the slot from the instructions that use it doesn't solve the problem of knowing slot size at prologue time, and it feels more fragile in general.

We could define new "roots" in the global-value expression language that refer to a machine register (VL?) or a platform-specific constant. Note that by having these machine-specific inputs reified as global values, we have a single place to encapsulate this nondeterminism/platform-specific behavior (basically an explicit input to the function), so we can reason about the rest of the CLIF semantics in a machine-independent way.
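A minimal sketch of that hierarchy, assuming a much-reduced global-value language: a dynamic slot names a global value for its size, and the frame size falls out once the platform-specific root is resolved. All names here are hypothetical, alignment is ignored, and cycle detection in the expression language is omitted.

```rust
/// Hypothetical global-value expression language, reduced to three forms.
enum GlobalValue {
    /// A compile-time constant.
    Const(u64),
    /// A platform-specific root: the vector length in bytes.
    VectorLengthBytes,
    /// Another global value (by index) plus a constant offset.
    Add(usize, u64),
}

/// Size of a stack-slot entity.
enum SlotSize {
    /// Known statically, in bytes.
    Static(u64),
    /// Given by the global value at this index.
    Dynamic(usize),
}

/// Evaluate global value `idx`, with the platform root resolved to `vl_bytes`.
fn eval(gvs: &[GlobalValue], idx: usize, vl_bytes: u64) -> u64 {
    match &gvs[idx] {
        GlobalValue::Const(c) => *c,
        GlobalValue::VectorLengthBytes => vl_bytes,
        GlobalValue::Add(base, offset) => eval(gvs, *base, vl_bytes) + *offset,
    }
}

/// Total bytes needed for all slots; this is what the prologue has to know.
fn frame_size(gvs: &[GlobalValue], slots: &[SlotSize], vl_bytes: u64) -> u64 {
    slots
        .iter()
        .map(|slot| match slot {
            SlotSize::Static(bytes) => *bytes,
            SlotSize::Dynamic(gv) => eval(gvs, *gv, vl_bytes),
        })
        .sum()
}
```

The only machine-specific input is `vl_bytes`; everything else is computed the same way on every target, which is the point of reifying the input as a global value.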

I think this design fits all of the criteria:

Does that make sense in general? I think we can tweak the design a bit if needed but now that I understand the general outline of what is needed, that seems (to me at least!) like the direction that would fit best with the rest of the compiler...

view this post on Zulip RFC notifications bot (Feb 17 2022 at 11:14):

sparker-arm commented on issue #19:

Does that make sense in general?

Thanks for the clear explanation of the CLIF hierarchy. With that context, I agree that a global VL sounds best.

We could define new "roots" in the global-value expression language that refer to a machine register (VL?) or a platform-specific constant. Note that by having these machine-specific inputs reified as global values, we have a single place to encapsulate this nondeterminism/platform-specific behavior (basically an explicit input to the function)

This is a particularly compelling argument, but I still don't think a global VL solves all our problems...

I think that "each dynamically-sized vector can have its own size" may actually be more logically consistent (see more below) if things are implemented at the CLIF-primitive level in the way I'm imagining

So, my reasoning for having a fixed width for all these vector types is for the same reasons you've raised about avoiding machine dependent semantics in the IR (a big yes to all those). Without one VL to rule them all, it becomes very difficult (impossible?) for us to reason about what is valid. But the root cause of the issue is really the weak nature of the new types that I've proposed, which are no more than the flexible vector types.

So, with a global value for VL (dyn_slot size in bytes) I'd also propose that all our dynamically sized vector types are specified with a shape, not just a type, so we have a new family of types with a dynamic factor, such as TypexLanesxDynFactor, e.g. i32x4xN, where N is another target-defined constant and 32x4xN <= VL. I think this change would enable dynamic and fixed types to play nicely with each other and enable validation.
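As a worked illustration of the constraint on N, here is a small sketch that picks the largest dynamic factor for a given shape. It assumes VL is measured in bits for this comparison; `dyn_factor` is a hypothetical helper, not part of the proposal.

```rust
/// Largest N such that `lane_bits * lanes * N <= vl_bits`.
/// Returns `None` when not even one base vector fits (or the shape is empty).
fn dyn_factor(lane_bits: u32, lanes: u32, vl_bits: u32) -> Option<u32> {
    let base_bits = lane_bits.checked_mul(lanes)?;
    if base_bits == 0 {
        return None;
    }
    let n = vl_bits / base_bits;
    if n == 0 {
        None
    } else {
        Some(n)
    }
}
```

For i32x4xN on a 256-bit implementation this gives N = 2, and on a 128-bit implementation N = 1, in which case the dynamic type has the same size as the fixed i32x4.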

What do you think?

view this post on Zulip RFC notifications bot (Feb 18 2022 at 02:58):

cfallin commented on issue #19:

Ah, that's interesting, I didn't realize before but agree now that the types need to somehow be linked to the dynamic factor as well!

I find myself a little bit allergic to global state or special singletons, though, so I'm not completely sure I like a "there is one global factor that alters the special xN types"; but maybe there's something more general we can do. What if we have a first-class notion of "dynamic type" as a CLIF entity, and then feed this type both into the dynslot, and refer to it with an indexed Type?

So something like:

function %f() {
  gv0 = vector_length_reg  ; VL reg
  dt0 = i32 * gv0          ; "dynamic type 0" -- defines shape/size of `dt0` below
  dynslot0 = slot dt0

block0:
  v0 = load.dt0 dynslot0
  v1 = iadd.dt0 v0, v0
  ; ...
}

In other words, we have a set of types dt0..dtN (pre-define some range of our Type for this; 64 values maybe?). Then we can (i) define slots of this type, and (ii) load, store, do all the other things with this type, just like with values of any other type.

This seems at least to me like an abstraction that can express whatever the higher-level "flexible vector" use-cases require, and we can map it to any hardware with a dynamic fallback and certain combinations (e.g. types known to use vector_length_reg on aarch64) to more efficient ISA extensions. It allows us to typecheck the IR like we do with other types currently, and it allows us to know the stack-frame size at function entry time. It's possible I've missed some requirements though -- thoughts/concerns?

view this post on Zulip RFC notifications bot (Feb 18 2022 at 02:58):

cfallin edited a comment on issue #19:

Ah, that's interesting, I didn't realize before but agree now that the types need to somehow be linked to the dynamic factor as well!

I find myself a little bit allergic to global state or special singletons, though, so I'm not completely sure I like a "there is one global factor that alters the special xN types"; but maybe there's something more general we can do. What if we have a first-class notion of "dynamic type" as a CLIF entity, and then feed this type both into the dynslot, and refer to it with an indexed Type?

So something like:

function %f() {
  gv0 = vector_length_reg  ; VL reg
  dt0 = i32 * gv0          ; "dynamic type 0" -- defines shape/size of `dt0` below
  dynslot0 = slot dt0

block0:
  v0 = load.dt0 dynslot0
  v1 = iadd.dt0 v0, v0
  ; ...
}

In other words, we have a set of types dt0..dtN (pre-define some range of our Type for this; 64 values maybe?). Then we can (i) define slots of this type, and (ii) load, store, do all the other things with this type, just like with values of any other type.

This seems at least to me like an abstraction that can express whatever the higher-level "flexible vector" use-cases require, and we can map it to any hardware with a dynamic fallback and certain combinations (e.g. types known to use vector_length_reg on aarch64) to more efficient ISA extensions. It allows us to typecheck the IR like we do with other types currently, and it allows us to know the stack-frame size at function entry time. It's possible I've missed some requirements though -- thoughts/concerns?

view this post on Zulip RFC notifications bot (Feb 22 2022 at 09:45):

sparker-arm commented on issue #19:

Creating a type in IR sounds rather useful, I had no idea that it was possible! Could you point me at an example if we're already doing this for something else?

So, at least for SVE, vector_length_reg will return the bit/byte width of either the Z-regs, or the predicates - both of which are scalable but are sized differently. So, I think we need to parameterize our special operation with a vector type, and move away from the notion of a VL register:

function %f() {
  gv0 = dyn_scale.i32x4  ; How many i32x4 vectors can fit in a register?
  dt0 = i32x4 * gv0
  dynslot0 = slot dt0

block0:
  v0 = load.dt0 dynslot0
  v1 = iadd.dt0 v0, v0
  ; ...
}

I believe this still provides all the characteristics we're looking for, and should make lowering more efficient with better type info. Does that look okay?

My one concern is passing a type, instead of a size, to the slot as this is very different to the current way of doing things, but I will trust that it's feasible.

view this post on Zulip RFC notifications bot (Feb 22 2022 at 17:25):

cfallin commented on issue #19:

Creating a type in IR sounds rather useful, I had no idea that it was possible! Could you point me at an example if we're already doing this for something else?

I don't think we do anything of the sort currently -- it's a new idea :-) But, yes, it certainly seems possible to me. This will change some of the internal signatures, e.g. Type no longer always knows its size -- it could return an Option<usize> from bits() or it could return a TypeSize with Static(usize) and Dynamic(ir::Value) arms, I suppose, and to get the latter it would need the &ir::Function as an argument. But all of this seems reasonable to me at least.
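A rough sketch of the `TypeSize` variant of that signature change. The types below are simplified stand-ins (an enum `Type` with three cases, a `Function` holding only a hypothetical `dyn_sizes` table), not the real `ir::Type` or `ir::Function`.

```rust
/// Stand-in for `ir::Value`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Value(u32);

/// Size of a type: known statically in bits, or given by an IR value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TypeSize {
    Static(usize),
    Dynamic(Value),
}

/// Stand-in for `ir::Type`, with one dynamic case carrying an index.
#[derive(Debug, Clone, Copy)]
enum Type {
    I32,
    I32X4,
    Dyn(u8),
}

/// Stand-in for `ir::Function`; `dyn_sizes` is a hypothetical side table.
struct Function {
    dyn_sizes: Vec<Value>,
}

impl Type {
    /// The function context is needed only for the dynamic case.
    /// Panics if a dynamic index has no entry in the table.
    fn bits(self, func: &Function) -> TypeSize {
        match self {
            Type::I32 => TypeSize::Static(32),
            Type::I32X4 => TypeSize::Static(128),
            Type::Dyn(index) => TypeSize::Dynamic(func.dyn_sizes[index as usize]),
        }
    }
}
```

Callers that only handle fixed-size types can match on `TypeSize::Static` and reject or defer the dynamic case explicitly.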

So, at least for SVE, vector_length_reg will return the bit/byte width of either the Z-regs, or the predicates - both of which are scalable but are sized differently. So, I think we need to parameterize our special operation with a vector type, and move away from the notion of a VL register:

```
function %f() {
  gv0 = dyn_scale.i32x4  ; How many i32x4 vectors can fit in a register?
  dt0 = i32x4 * gv0
  dynslot0 = slot dt0

block0:
  v0 = load.dt0 dynslot0
  v1 = iadd.dt0 v0, v0
  ; ...
}
```

I believe this still provides all the characteristics we're looking for, and should make lowering more efficient with better type info. Does that look okay?

I think so, yes. Just to make sure I understand, the dyn_scale global value will be determined based on a known target microarchitecture/ISA level? Or read/computed from special register(s) in the function prologue?

My one concern is passing a type, instead of a size, to the slot as this is very different to the current way of doing things, but I will trust that it's feasible.

Yes, I think this actually makes more sense overall: associating the slot with a type feels more appropriate than requiring the IR producer to match the size given to the slot entity with the known size of the loads and stores used to access it.

view this post on Zulip RFC notifications bot (Feb 23 2022 at 08:55):

sparker-arm commented on issue #19:

I don't think we do anything of the sort currently -- it's a new idea :-) But, yes, it certainly seems possible to me.

Okay :) sounds like I'll be kept busy for a while hacking on the type system!

Just to make sure I understand, the dyn_scale global value will be determined based on a known target microarchitecture/ISA level? Or read/computed from special register(s) in the function prologue?

Yes, in most cases it will just be a constant set in the target backend. If/when we add fully dynamic support for SVE, it will be an instruction or two, in the prologue, that reads the runtime VL.

view this post on Zulip RFC notifications bot (Mar 01 2022 at 15:55):

sparker-arm commented on issue #19:

Hi @cfallin, now that I've had some time to revisit the meta level of cranelift, I'm still struggling to see how these dynamically created types would work. It just seems to be too against how types and instructions are currently implemented.

During the parsing we can check that dyn_scale and our dynamic type have been declared with the same base vector type, that's okay...At the IR level, the mechanical bits seem fine too: Types are just u8 values, we reserve some of those for dynamic types, as I have done for the original sizeless types, and these should be named such as I8X16XN. Implementing methods to handle the dynamic addition is not an issue. But this doesn't seem to tie dyn_scale value to the type in the way you would like, these are concrete types and not connected to a global value.

I actually got so lost in the meta layers that I can't remember which part made my mind finally fall over :) I think my problem comes when implementing the polymorphic instruction generation, when we want to iterate through concrete types. From the description above, they seem like concrete types (just a u8 with a bit for dynamic) but from the user perspective they are not. In the meta level, I've tried to generate a set of types that have a reference to an IR entity, which then seems superficial since that entity isn't actually part of the type. Does any of this successfully convey my pain?

view this post on Zulip RFC notifications bot (Mar 01 2022 at 18:31):

cfallin commented on issue #19:

@sparker-arm , I'm happy to help work out implementation/prototype details -- do you have a WIP branch you can point me to that demonstrates what you're thinking / what issues you're running into?

Hi @cfallin, now that I've had some time to revisit the meta level of cranelift, I'm still struggling to see how these dynamically created types would work. It just seems to be too against how types and instructions are currently implemented.

During the parsing we can check that dyn_scale and our dynamic type have been declared with the same base vector type, that's okay...At the IR level, the mechanical bits seem fine too: Types are just u8 values, we reserve some of those for dynamic types, as I have done for the original sizeless types, and these should be named such as I8X16XN. Implementing methods to handle the dynamic addition is not an issue. But this doesn't seem to tie dyn_scale value to the type in the way you would like, these are concrete types and not connected to a global value.

I don't think that the IR literally needs to have a Type that refers to/holds an IR entity, or somesuch; I'm imagining something like:

struct Function {
    // ...
    dyn_types: Vec<DynType>,
}

/// A dynamically-sized type. Type `dtN` refers to the definition in `dyn_types[N]`.
struct DynType {
    /// The dynamically-sized type is defined in terms of a base type and a global value that indicates
    /// how many of those base-type elements this type contains.
    base_type: Type,
    length: GlobalValue,
}

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.
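The lookup could be sketched as below, assuming `Type` stays a plain `u8` with its top 64 values reserved for `dt0..dt63`. The `DYN_BASE` constant and the helper names are hypothetical; only the `DynType` fields come from the struct above.

```rust
/// Hypothetical start of the range reserved for dynamic types: 0xC0..=0xFF,
/// i.e. the top 64 `u8` values.
const DYN_BASE: u8 = 0xC0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Type(u8);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GlobalValue(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DynType {
    base_type: Type,
    length: GlobalValue,
}

struct Function {
    dyn_types: Vec<DynType>,
}

impl Type {
    /// Index N for a type `dtN`, or `None` for an ordinary fixed type.
    fn dyn_index(self) -> Option<usize> {
        if self.0 >= DYN_BASE {
            Some((self.0 - DYN_BASE) as usize)
        } else {
            None
        }
    }
}

impl Function {
    /// Resolve a `Type` to its `DynType` definition, if it is dynamic and
    /// has been declared in this function.
    fn lookup_dyn_type(&self, ty: Type) -> Option<&DynType> {
        self.dyn_types.get(ty.dyn_index()?)
    }
}
```

The `Type` itself stays a small copyable value; anything that needs the base type or length has to be handed the `Function`, which is the extra plumbing mentioned above.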

Possibly I'm missing some other difficulty here though? Are there places where it would be awkward to look up the DynType (I guess we need to think about what additional plumbing this requires)?

I actually got so lost in the meta layers that I can't remember which part made my mind finally fall over :) I think my problem comes when implementing the polymorphic instruction generation, when we want to iterate through concrete types. From the description above, they seem like concrete types (just a u8 with a bit for dynamic) but from the user perspective they are not. In the meta level, I've tried to generate a set of types that have a reference to an IR entity, which then seems superficial since that entity isn't actually part of the type. Does any of this successfully convey my pain?

When generating code, I imagine there would be a case that matches on dynamic types, and then switches on the base type; so just as we today have cases for i32x4 and i64x2, we would have cases for i32xN and i64xN. There's just one indirection to go look up the base type. Does that make sense?
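Roughly, that case structure might look like the sketch below. The enums are simplified stand-ins, and the returned strings are hypothetical labels for lowering rules, not real instruction mnemonics or Cranelift names.

```rust
/// Base element type of a dynamic vector (stand-in).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Base {
    I32,
    I64,
}

/// Stand-in for the controlling type of an instruction.
enum Ty {
    I32X4,
    I64X2,
    /// Dynamic type `dtN`, carrying its index N.
    Dyn(usize),
}

struct DynType {
    base: Base,
}

/// Pick a lowering rule for `iadd`. Fixed types match directly; a dynamic
/// type takes one indirection to find its base type, then switches on it.
fn select_iadd(ty: &Ty, dyn_types: &[DynType]) -> Option<&'static str> {
    match ty {
        Ty::I32X4 => Some("fixed_add_i32x4"),
        Ty::I64X2 => Some("fixed_add_i64x2"),
        Ty::Dyn(index) => match dyn_types.get(*index)?.base {
            Base::I32 => Some("dynamic_add_i32xN"),
            Base::I64 => Some("dynamic_add_i64xN"),
        },
    }
}
```

An undeclared `dtN` yields `None` here; a real implementation would more likely reject it earlier, in the verifier.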

view this post on Zulip RFC notifications bot (Mar 02 2022 at 09:36):

sparker-arm commented on issue #19:

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.

Okay, thanks, this sounds much more feasible and sorry I kinda missed this suggestion in your previous comment.

One thing regarding the global value names, is it a fundamental property of clif that we need a prefix followed by a number? I was wondering whether we could have Nxi32x4 = i32x4 * gv0 instead, which could make the IR more readable.

I'm happy to help work out implementation/prototype details -- do you have a WIP branch you can point me to that demonstrates what you're thinking / what issues you're running into?

I greatly appreciate the offer, but I think your above suggestion is enough to get me moving forward again. I will be on holiday for a couple weeks though, starting Friday, so please don't interpret my silence as being stuck again.

view this post on Zulip RFC notifications bot (Mar 02 2022 at 22:40):

cfallin commented on issue #19:

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.

Okay, thanks, this sounds much more feasible and sorry I kinda missed this suggestion in your previous comment.

One thing regarding the global value names, is it a fundamental property of clif that we need a prefix followed by a number? I was wondering whether we could have Nxi32x4 = i32x4 * gv0 instead, which could make the IR more readable.

The CLIF parser seems to be built assuming the entityN syntax and it does have a nice logical consistency to it, though I see the appeal of arbitrary type names as well... for now I think it's probably simplest to keep the same scheme for this new type of entity, but we can definitely discuss/refine syntax more when there's a prototype to play with, I think.

view this post on Zulip RFC notifications bot (Apr 20 2022 at 08:36):

sparker-arm commented on issue #19:

@cfallin @abrown Would you be able to take a look at this again, please? I'm happy that everything ended up working as we discussed, and I'm kinda keen to move into a code RFC to better illustrate all the moving pieces.

view this post on Zulip RFC notifications bot (Apr 21 2022 at 08:50):

sparker-arm commented on issue #19:

Thanks @cfallin !

The biggest question to me is how the pipeline lowers this to the existing stackslot abstractions; as we discussed earlier if all the sizes are initially compile-time constants then in theory the ABI implementation could remain unchanged, if we legalize beforehand. But for full generality maybe the ABI needs to be aware of the separate category of slots.

Indeed, the current ABI changes that I've made aren't really needed for this initial fixed size implementation, but I've tried to implement it with true dynamic sizes, and general SVE, in mind.

Speaking of which, my current plan for the code RFC is to avoid including any SVE codegen, except some changes to the ABI API, purely to avoid distractions from the IR and ABI changes (it already feels big enough to me). Do you think this is wise, or would you also prefer to see some SVE codegen too?

view this post on Zulip RFC notifications bot (Apr 21 2022 at 16:20):

cfallin commented on issue #19:

Speaking of which, my current plan for the code RFC is to avoid including any SVE codegen, except some changes to the ABI API, purely to avoid distractions from the IR and ABI changes (it already feels big enough to me). Do you think this is wise, or would you also prefer to see some SVE codegen too?

I think your incremental approach sounds good: best to get support for the dynamically-sized types in, handling/storing/spilling them, then we can add ISA support as a followup. Actually maybe there are (at least) three pieces: the dynamically-sized types and ABI support; then a "polyfill" without SVE, just NEON (and others could build the same with Intel SIMD if interested); then actually using the new hardware instructions. But, we don't need to fix the details here, they can change as needed I think.
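For the "polyfill" step, one way to picture it is a dynamic i32 vector handled as a run of 128-bit (four-lane) chunks, each of which a fixed-width SIMD unit could process. This is a scalar sketch of that idea only; the function name is hypothetical and no actual NEON code is generated.

```rust
/// Lane-wise wrapping add of two "dynamic" i32 vectors, processed in
/// 128-bit chunks of four lanes each, as a fixed-width fallback might.
/// Panics if the inputs differ in length or are not a whole number of chunks.
fn iadd_i32_polyfill(a: &[i32], b: &[i32]) -> Vec<i32> {
    assert_eq!(a.len(), b.len(), "operands must have the same lane count");
    assert!(a.len() % 4 == 0, "lane count must be a multiple of 4");

    let mut out = Vec::with_capacity(a.len());
    for (chunk_a, chunk_b) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        // Each iteration stands in for one 128-bit vector add.
        for lane in 0..4 {
            out.push(chunk_a[lane].wrapping_add(chunk_b[lane]));
        }
    }
    out
}
```

The number of chunks corresponds to the dynamic scale factor, so a 256-bit dynamic vector would take two iterations.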

view this post on Zulip RFC notifications bot (Apr 27 2022 at 07:00):

sparker-arm commented on issue #19:

Hi, it's been a week since I updated this so I'd now like to move this RFC into the final comment period. Thanks!

view this post on Zulip RFC notifications bot (Apr 27 2022 at 20:18):

cfallin commented on issue #19:

@sparker-arm could you post a final-comment-period approval checklist? Then we can start to collect approvals and hopefully get this in soon!

(+1 from me as well, i.e. feel free to mark me as checked already)

view this post on Zulip RFC notifications bot (Apr 28 2022 at 07:53):

sparker-arm commented on issue #19:

Stakeholders sign-off

Arm

Fastly

Intel

Unaffiliated

IBM

view this post on Zulip RFC notifications bot (Apr 28 2022 at 14:38):

alexcrichton edited a comment on issue #19:

Stakeholders sign-off

Arm

Fastly

Intel

Unaffiliated

IBM

view this post on Zulip RFC notifications bot (Apr 28 2022 at 16:56):

fitzgen edited a comment on issue #19:

Stakeholders sign-off

Arm

Fastly

Intel

Unaffiliated

IBM

view this post on Zulip RFC notifications bot (May 10 2022 at 11:33):

sparker-arm commented on issue #19:

I believe enough time has now passed? Can this now be merged?

view this post on Zulip RFC notifications bot (May 10 2022 at 15:17):

cfallin commented on issue #19:

Indeed, the FCP has now elapsed with no objections, so this RFC should now be merged! (I'm happy to click the button if you don't have permissions to do so, but we should also fix that if so)

view this post on Zulip RFC notifications bot (May 10 2022 at 17:53):

sparker-arm commented on issue #19:

Great, I didn't think I had permissions! Thanks for all the help with this @cfallin

view this post on Zulip RFC notifications bot (May 10 2022 at 17:59):

cfallin commented on issue #19:

It was just added apparently (thanks to Till); you're part of the Cranelift core org group now so it shouldn't be an issue in the future :-)


Last updated: Oct 23 2024 at 20:03 UTC