wasmtime / issue #4418 cranelift: Truly dynamic vector types · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4418 cranelift: Truly dynamic vector types

Wasmtime GitHub notifications bot (Jul 08 2022 at 13:25):

sparker-arm opened issue #4418:

Current Status

Basic support for dynamic vector types has been committed in preparation to support the Wasm Flexible Vectors extension. The RFC discussion around the design is here. And while we now have support for dynamic/flexible/scalable vectors, it doesn't support proper dynamic types. This is a top-level issue for what needs to be done to get there.

Cranelift's type system now contains a dynamic vector type for each corresponding fixed-width type, with the dynamic type being a fixed-width type scaled by a target-defined factor. Currently, this factor (dyn_scale_target_const) is legalized to a constant. This makes lowering and stack layout simple. Below is an example CLIF function.
function %i8x16_splat_add(i8, i8) -> i8x16 {
  gv0 = dyn_scale_target_const.i8x16
  dt0 = i8x16*gv0

block0(v0: i8, v1: i8):
  v2 = splat.dt0 v0
  v3 = splat.dt0 v1
  v4 = iadd v2, v3
  v5 = extract_vector v4, 0
  return v5
}
Next Steps

The two main areas that need more work are in IR and the MachInst ABI layer. The first part is to modify, or introduce a new GlobalValue, that will be legalized and lowered to a runtime scaling value. Maybe in a similar way to how global values are currently used in the ABI for generating the stack limit.

The second, and bigger issue, is in the ABI layer where we determine the stack layout. A complicating factor here is that the 'Vanilla' layer is mainly shared between the backends and, of course, everything is also designed around known constant sizes. However, I think the biggest challenge is the interface with the register allocator.

Spill Slots
The register allocator currently only supports two register classes, and types are aren't tracked, so there is the potential for a target with wide vector support to use far more stack than necessary. For example, with the current implementation, a target using AVX-512 would require 64-bytes to spill a single precision float.

To enable truly dynamic types, the interface with the register allocator will need to change. If we leave it to return a constant value, we will need to accommodate the maximum possible register size (2KB in the case of SVE) and that is a prohibitive cost for most CPUs.

Also, with wider vectors usually comes predication and predicate registers are unlikely to map to either of the existing regalloc classes either.

Stack Slots
Along with spill slots, our frame layout will need to handle dynamically sized stack slots which are defined in the IR or, possibly, arguments passed on the stack. The current stack layout is as follows (there's currently no distinction between spill slots for fixed and dynamic types):
//!   (high address)
//!
//!                              +---------------------------+
//!                              |          ...              |
//!                              | stack args                |
//!                              | (accessed via FP)         |
//!                              +---------------------------+
//! SP at function entry ----->  | return address            |
//!                              +---------------------------+
//! FP after prologue -------->  | FP (pushed by prologue)   |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//! unwind-frame base     ---->  | (pushed by prologue)      |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | sized stack slots         |
//!                              | dynamic stack slots       |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//! (SP at end of prologue)      +---------------------------+
//!                              | [alignment as needed]     |
//!                              |          ...              |
//!                              | args for call             |
//! SP before making a call -->  | (pushed at callsite)      |
//!                              +---------------------------+
//!
//!   (low address)
We likely want to collect all the dynamically sized stack values and move them up the stack and introduce a new StackAMode to be addressed by FP. Spill and stack slots of compile-time known sizes can accessed as they are now, but the way we calculate at the end of the prologue will need to be modified. The current implementation allows the TargetIsa to report a fixed size for each dynamic type and so stack offsets can be calculated at compile-time. For dynamically-sized objects, _I think_ we'll want to use vmctx to generate our scaling factor from a GlobalValue and then multiple a slot index by the scaling factor to get our address.

It could be that a target wants to specify multiple scaling values though, depending on the type/register that will be used. So, we could group the values so that each group is using the same scale value. The awkward part here is that we won't have a uniform space to scale across, and so we'll need a method to 'jump' over groups.

Wasmtime GitHub notifications bot (Jul 08 2022 at 13:25):

sparker-arm labeled issue #4418:

Current Status

Basic support for dynamic vector types has been committed in preparation to support the Wasm Flexible Vectors extension. The RFC discussion around the design is here. And while we now have support for dynamic/flexible/scalable vectors, it doesn't support proper dynamic types. This is a top-level issue for what needs to be done to get there.

Cranelift's type system now contains a dynamic vector type for each corresponding fixed-width type, with the dynamic type being a fixed-width type scaled by a target-defined factor. Currently, this factor (dyn_scale_target_const) is legalized to a constant. This makes lowering and stack layout simple. Below is an example CLIF function.
function %i8x16_splat_add(i8, i8) -> i8x16 {
  gv0 = dyn_scale_target_const.i8x16
  dt0 = i8x16*gv0

block0(v0: i8, v1: i8):
  v2 = splat.dt0 v0
  v3 = splat.dt0 v1
  v4 = iadd v2, v3
  v5 = extract_vector v4, 0
  return v5
}
Next Steps

The two main areas that need more work are in IR and the MachInst ABI layer. The first part is to modify, or introduce a new GlobalValue, that will be legalized and lowered to a runtime scaling value. Maybe in a similar way to how global values are currently used in the ABI for generating the stack limit.

The second, and bigger issue, is in the ABI layer where we determine the stack layout. A complicating factor here is that the 'Vanilla' layer is mainly shared between the backends and, of course, everything is also designed around known constant sizes. However, I think the biggest challenge is the interface with the register allocator.

Spill Slots
The register allocator currently only supports two register classes, and types are aren't tracked, so there is the potential for a target with wide vector support to use far more stack than necessary. For example, with the current implementation, a target using AVX-512 would require 64-bytes to spill a single precision float.

To enable truly dynamic types, the interface with the register allocator will need to change. If we leave it to return a constant value, we will need to accommodate the maximum possible register size (2KB in the case of SVE) and that is a prohibitive cost for most CPUs.

Also, with wider vectors usually comes predication and predicate registers are unlikely to map to either of the existing regalloc classes either.

Stack Slots
Along with spill slots, our frame layout will need to handle dynamically sized stack slots which are defined in the IR or, possibly, arguments passed on the stack. The current stack layout is as follows (there's currently no distinction between spill slots for fixed and dynamic types):
//!   (high address)
//!
//!                              +---------------------------+
//!                              |          ...              |
//!                              | stack args                |
//!                              | (accessed via FP)         |
//!                              +---------------------------+
//! SP at function entry ----->  | return address            |
//!                              +---------------------------+
//! FP after prologue -------->  | FP (pushed by prologue)   |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//! unwind-frame base     ---->  | (pushed by prologue)      |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | sized stack slots         |
//!                              | dynamic stack slots       |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//! (SP at end of prologue)      +---------------------------+
//!                              | [alignment as needed]     |
//!                              |          ...              |
//!                              | args for call             |
//! SP before making a call -->  | (pushed at callsite)      |
//!                              +---------------------------+
//!
//!   (low address)
We likely want to collect all the dynamically sized stack values and move them up the stack and introduce a new StackAMode to be addressed by FP. Spill and stack slots of compile-time known sizes can accessed as they are now, but the way we calculate at the end of the prologue will need to be modified. The current implementation allows the TargetIsa to report a fixed size for each dynamic type and so stack offsets can be calculated at compile-time. For dynamically-sized objects, _I think_ we'll want to use vmctx to generate our scaling factor from a GlobalValue and then multiple a slot index by the scaling factor to get our address.

It could be that a target wants to specify multiple scaling values though, depending on the type/register that will be used. So, we could group the values so that each group is using the same scale value. The awkward part here is that we won't have a uniform space to scale across, and so we'll need a method to 'jump' over groups.

Wasmtime GitHub notifications bot (Jul 08 2022 at 13:25):

sparker-arm labeled issue #4418:

Current Status

Basic support for dynamic vector types has been committed in preparation to support the Wasm Flexible Vectors extension. The RFC discussion around the design is here. And while we now have support for dynamic/flexible/scalable vectors, it doesn't support proper dynamic types. This is a top-level issue for what needs to be done to get there.

Cranelift's type system now contains a dynamic vector type for each corresponding fixed-width type, with the dynamic type being a fixed-width type scaled by a target-defined factor. Currently, this factor (dyn_scale_target_const) is legalized to a constant. This makes lowering and stack layout simple. Below is an example CLIF function.
function %i8x16_splat_add(i8, i8) -> i8x16 {
  gv0 = dyn_scale_target_const.i8x16
  dt0 = i8x16*gv0

block0(v0: i8, v1: i8):
  v2 = splat.dt0 v0
  v3 = splat.dt0 v1
  v4 = iadd v2, v3
  v5 = extract_vector v4, 0
  return v5
}
Next Steps

The two main areas that need more work are in IR and the MachInst ABI layer. The first part is to modify, or introduce a new GlobalValue, that will be legalized and lowered to a runtime scaling value. Maybe in a similar way to how global values are currently used in the ABI for generating the stack limit.

The second, and bigger issue, is in the ABI layer where we determine the stack layout. A complicating factor here is that the 'Vanilla' layer is mainly shared between the backends and, of course, everything is also designed around known constant sizes. However, I think the biggest challenge is the interface with the register allocator.

Spill Slots
The register allocator currently only supports two register classes, and types are aren't tracked, so there is the potential for a target with wide vector support to use far more stack than necessary. For example, with the current implementation, a target using AVX-512 would require 64-bytes to spill a single precision float.

To enable truly dynamic types, the interface with the register allocator will need to change. If we leave it to return a constant value, we will need to accommodate the maximum possible register size (2KB in the case of SVE) and that is a prohibitive cost for most CPUs.

Also, with wider vectors usually comes predication and predicate registers are unlikely to map to either of the existing regalloc classes either.

Stack Slots
Along with spill slots, our frame layout will need to handle dynamically sized stack slots which are defined in the IR or, possibly, arguments passed on the stack. The current stack layout is as follows (there's currently no distinction between spill slots for fixed and dynamic types):
//!   (high address)
//!
//!                              +---------------------------+
//!                              |          ...              |
//!                              | stack args                |
//!                              | (accessed via FP)         |
//!                              +---------------------------+
//! SP at function entry ----->  | return address            |
//!                              +---------------------------+
//! FP after prologue -------->  | FP (pushed by prologue)   |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//! unwind-frame base     ---->  | (pushed by prologue)      |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | sized stack slots         |
//!                              | dynamic stack slots       |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//! (SP at end of prologue)      +---------------------------+
//!                              | [alignment as needed]     |
//!                              |          ...              |
//!                              | args for call             |
//! SP before making a call -->  | (pushed at callsite)      |
//!                              +---------------------------+
//!
//!   (low address)
We likely want to collect all the dynamically sized stack values and move them up the stack and introduce a new StackAMode to be addressed by FP. Spill and stack slots of compile-time known sizes can accessed as they are now, but the way we calculate at the end of the prologue will need to be modified. The current implementation allows the TargetIsa to report a fixed size for each dynamic type and so stack offsets can be calculated at compile-time. For dynamically-sized objects, _I think_ we'll want to use vmctx to generate our scaling factor from a GlobalValue and then multiple a slot index by the scaling factor to get our address.

It could be that a target wants to specify multiple scaling values though, depending on the type/register that will be used. So, we could group the values so that each group is using the same scale value. The awkward part here is that we won't have a uniform space to scale across, and so we'll need a method to 'jump' over groups.

Wasmtime GitHub notifications bot (Jul 08 2022 at 13:25):

sparker-arm labeled issue #4418:

Current Status

Basic support for dynamic vector types has been committed in preparation to support the Wasm Flexible Vectors extension. The RFC discussion around the design is here. And while we now have support for dynamic/flexible/scalable vectors, it doesn't support proper dynamic types. This is a top-level issue for what needs to be done to get there.

Cranelift's type system now contains a dynamic vector type for each corresponding fixed-width type, with the dynamic type being a fixed-width type scaled by a target-defined factor. Currently, this factor (dyn_scale_target_const) is legalized to a constant. This makes lowering and stack layout simple. Below is an example CLIF function.
function %i8x16_splat_add(i8, i8) -> i8x16 {
  gv0 = dyn_scale_target_const.i8x16
  dt0 = i8x16*gv0

block0(v0: i8, v1: i8):
  v2 = splat.dt0 v0
  v3 = splat.dt0 v1
  v4 = iadd v2, v3
  v5 = extract_vector v4, 0
  return v5
}
Next Steps

The two main areas that need more work are in IR and the MachInst ABI layer. The first part is to modify, or introduce a new GlobalValue, that will be legalized and lowered to a runtime scaling value. Maybe in a similar way to how global values are currently used in the ABI for generating the stack limit.

The second, and bigger issue, is in the ABI layer where we determine the stack layout. A complicating factor here is that the 'Vanilla' layer is mainly shared between the backends and, of course, everything is also designed around known constant sizes. However, I think the biggest challenge is the interface with the register allocator.

Spill Slots
The register allocator currently only supports two register classes, and types are aren't tracked, so there is the potential for a target with wide vector support to use far more stack than necessary. For example, with the current implementation, a target using AVX-512 would require 64-bytes to spill a single precision float.

To enable truly dynamic types, the interface with the register allocator will need to change. If we leave it to return a constant value, we will need to accommodate the maximum possible register size (2KB in the case of SVE) and that is a prohibitive cost for most CPUs.

Also, with wider vectors usually comes predication and predicate registers are unlikely to map to either of the existing regalloc classes either.

Stack Slots
Along with spill slots, our frame layout will need to handle dynamically sized stack slots which are defined in the IR or, possibly, arguments passed on the stack. The current stack layout is as follows (there's currently no distinction between spill slots for fixed and dynamic types):
//!   (high address)
//!
//!                              +---------------------------+
//!                              |          ...              |
//!                              | stack args                |
//!                              | (accessed via FP)         |
//!                              +---------------------------+
//! SP at function entry ----->  | return address            |
//!                              +---------------------------+
//! FP after prologue -------->  | FP (pushed by prologue)   |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//! unwind-frame base     ---->  | (pushed by prologue)      |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | sized stack slots         |
//!                              | dynamic stack slots       |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//! (SP at end of prologue)      +---------------------------+
//!                              | [alignment as needed]     |
//!                              |          ...              |
//!                              | args for call             |
//! SP before making a call -->  | (pushed at callsite)      |
//!                              +---------------------------+
//!
//!   (low address)
We likely want to collect all the dynamically sized stack values and move them up the stack and introduce a new StackAMode to be addressed by FP. Spill and stack slots of compile-time known sizes can accessed as they are now, but the way we calculate at the end of the prologue will need to be modified. The current implementation allows the TargetIsa to report a fixed size for each dynamic type and so stack offsets can be calculated at compile-time. For dynamically-sized objects, _I think_ we'll want to use vmctx to generate our scaling factor from a GlobalValue and then multiple a slot index by the scaling factor to get our address.

It could be that a target wants to specify multiple scaling values though, depending on the type/register that will be used. So, we could group the values so that each group is using the same scale value. The awkward part here is that we won't have a uniform space to scale across, and so we'll need a method to 'jump' over groups.

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:26):

akirilov-arm labeled issue #4418:

Current Status

Basic support for dynamic vector types has been committed in preparation to support the Wasm Flexible Vectors extension. The RFC discussion around the design is here. And while we now have support for dynamic/flexible/scalable vectors, it doesn't support proper dynamic types. This is a top-level issue for what needs to be done to get there.

Cranelift's type system now contains a dynamic vector type for each corresponding fixed-width type, with the dynamic type being a fixed-width type scaled by a target-defined factor. Currently, this factor (dyn_scale_target_const) is legalized to a constant. This makes lowering and stack layout simple. Below is an example CLIF function.
function %i8x16_splat_add(i8, i8) -> i8x16 {
  gv0 = dyn_scale_target_const.i8x16
  dt0 = i8x16*gv0

block0(v0: i8, v1: i8):
  v2 = splat.dt0 v0
  v3 = splat.dt0 v1
  v4 = iadd v2, v3
  v5 = extract_vector v4, 0
  return v5
}
Next Steps

The two main areas that need more work are in IR and the MachInst ABI layer. The first part is to modify, or introduce a new GlobalValue, that will be legalized and lowered to a runtime scaling value. Maybe in a similar way to how global values are currently used in the ABI for generating the stack limit.

The second, and bigger issue, is in the ABI layer where we determine the stack layout. A complicating factor here is that the 'Vanilla' layer is mainly shared between the backends and, of course, everything is also designed around known constant sizes. However, I think the biggest challenge is the interface with the register allocator.

Spill Slots
The register allocator currently only supports two register classes, and types are aren't tracked, so there is the potential for a target with wide vector support to use far more stack than necessary. For example, with the current implementation, a target using AVX-512 would require 64-bytes to spill a single precision float.

To enable truly dynamic types, the interface with the register allocator will need to change. If we leave it to return a constant value, we will need to accommodate the maximum possible register size (2KB in the case of SVE) and that is a prohibitive cost for most CPUs.

Also, with wider vectors usually comes predication and predicate registers are unlikely to map to either of the existing regalloc classes either.

Stack Slots
Along with spill slots, our frame layout will need to handle dynamically sized stack slots which are defined in the IR or, possibly, arguments passed on the stack. The current stack layout is as follows (there's currently no distinction between spill slots for fixed and dynamic types):
//!   (high address)
//!
//!                              +---------------------------+
//!                              |          ...              |
//!                              | stack args                |
//!                              | (accessed via FP)         |
//!                              +---------------------------+
//! SP at function entry ----->  | return address            |
//!                              +---------------------------+
//! FP after prologue -------->  | FP (pushed by prologue)   |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | clobbered callee-saves    |
//! unwind-frame base     ---->  | (pushed by prologue)      |
//!                              +---------------------------+
//!                              |          ...              |
//!                              | spill slots               |
//!                              | (accessed via nominal SP) |
//!                              |          ...              |
//!                              | sized stack slots         |
//!                              | dynamic stack slots       |
//!                              | (accessed via nominal SP) |
//! nominal SP --------------->  | (alloc'd by prologue)     |
//! (SP at end of prologue)      +---------------------------+
//!                              | [alignment as needed]     |
//!                              |          ...              |
//!                              | args for call             |
//! SP before making a call -->  | (pushed at callsite)      |
//!                              +---------------------------+
//!
//!   (low address)
We likely want to collect all the dynamically sized stack values and move them up the stack and introduce a new StackAMode to be addressed by FP. Spill and stack slots of compile-time known sizes can accessed as they are now, but the way we calculate at the end of the prologue will need to be modified. The current implementation allows the TargetIsa to report a fixed size for each dynamic type and so stack offsets can be calculated at compile-time. For dynamically-sized objects, _I think_ we'll want to use vmctx to generate our scaling factor from a GlobalValue and then multiple a slot index by the scaling factor to get our address.

It could be that a target wants to specify multiple scaling values though, depending on the type/register that will be used. So, we could group the values so that each group is using the same scale value. The awkward part here is that we won't have a uniform space to scale across, and so we'll need a method to 'jump' over groups.

Last updated: Apr 17 2025 at 02:30 UTC