wasmtime / PR #6542 riscv64: Implement `iadd` Widening In... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #6542 riscv64: Implement `iadd` Widening In...

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:44):

afonso360 opened PR #6542 from afonso360:riscv-simd-widening-iadd to bytecodealliance:main:

:wave: Hey,

This PR adds the widening add instructions from the V spec. These are vwadd{u,}.{w,v}{v,x}.

This also adds a bunch of rules to try to match these instructions. And some of these end up being quite complex.

Rules that match {u,s}widen_high are the same as their {u,s}widen_low counterparts but they first do a vslidedown of half the vector, to bring the top lanes down.

uwiden_low rules are the same as the swiden_low rules, but they use vwaddu.* instead of vwadd.* which is the unsigned version of the instruction.

Now, in each of these groups of rules we have a few different instructions.

vwadd.wv does 2*SEW = 2*SEW + SEW, which just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2*SEW we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 iadd we need to pass in an i16x4 type as the vstate type.

vwadd.vv does 2*SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. Again we must pass in a type with half the element size.

vwadd.wx and vwadd.vx do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert {u,s}widen_{low,high} into splat+{u,s}extend. This way we only have to try to match the splat version, which reduces the number of rules.

All of these rules use vstate_mf2. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise the CPU could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.

I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for isub and imul, so finding a good reduction here will hopefully make implementing those easier.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:44):

afonso360 requested cfallin for a review on PR #6542.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:44):

afonso360 requested wasmtime-compiler-reviewers for a review on PR #6542.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:44):

afonso360 edited PR #6542:

:wave: Hey,

This PR adds the widening add instructions from the V spec. These are vwadd{u,}.{w,v}{v,x}.

This also adds a bunch of rules to try to match these instructions. And some of these end up being quite complex.

Rules that match {u,s}widen_high are the same as their {u,s}widen_low counterparts but they first do a vslidedown of half the vector, to bring the top lanes down.

uwiden_low rules are the same as the swiden_low rules, but they use vwaddu.* instead of vwadd.* which is the unsigned version of the instruction.

Now, in each of these groups of rules we have a few different instructions.

vwadd.wv does 2 * SEW = 2 * SEW + SEW, which just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2*SEW we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 iadd we need to pass in an i16x4 type as the vstate type.

vwadd.vv does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. Again we must pass in a type with half the element size.

vwadd.wx and vwadd.vx do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert {u,s}widen_{low,high} into splat+{u,s}extend. This way we only have to try to match the splat version, which reduces the number of rules.

All of these rules use vstate_mf2. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise the CPU could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.

I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for isub and imul, so finding a good reduction here will hopefully make implementing those easier.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:47):

afonso360 edited PR #6542:

:wave: Hey,

This PR adds the widening add instructions from the V spec. These are vwadd{u,}.{w,v}{v,x}.

This also adds a bunch of rules to try to match these instructions. And some of these end up being quite complex.

Rules that match {u,s}widen_high are the same as their {u,s}widen_low counterparts but they first do a vslidedown of half the vector, to bring the top lanes down.

uwiden_low rules are the same as the swiden_low rules, but they use vwaddu.* instead of vwadd.* which is the unsigned version of the instruction.

Now, in each of these groups of rules we have a few different instructions.

vwadd.wv does 2 * SEW = 2 * SEW + SEW, which just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2*SEW we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 iadd we need to pass in an i16x4 type as the vstate type.

vwadd.vv does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. Again we must pass in a type with half the element size.

vwadd.wx and vwadd.vx do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert {u,s}widen_{low,high} into splat+{u,s}extend. This way we only have to try to match the splat version, which reduces the number of rules.

All of these rules use vstate_mf2. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise we could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.

I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for isub and imul, so finding a good reduction here will hopefully make implementing those easier.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:51):

afonso360 edited PR #6542:

:wave: Hey,

This PR adds the widening add instructions from the V spec. These are vwadd{u,}.{w,v}{v,x}.

This also adds a bunch of rules to try to match these instructions. And some of these end up being quite complex.

Rules that match {u,s}widen_high are the same as their {u,s}widen_low counterparts but they first do a vslidedown of half the vector, to bring the top lanes down.

uwiden_low rules are the same as the swiden_low rules, but they use vwaddu.* instead of vwadd.* which is the unsigned version of the instruction.

Now, in each of these groups of rules we have a few different instructions.

vwadd.wv does 2 * SEW = 2 * SEW + SEW, which just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2*SEW we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 iadd we need to pass in an i16x4 type as the vstate type.

vwadd.vv does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. We must set the vstate with a type that has half the element size and half the lanes.

vwadd.wx and vwadd.vx do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert {u,s}widen_{low,high} into splat+{u,s}extend. This way we only have to try to match the splat version, which reduces the number of rules.

All of these rules use vstate_mf2. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise we could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.

I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for isub and imul, so finding a good reduction here will hopefully make implementing those easier.

Wasmtime GitHub notifications bot (Jun 08 2023 at 14:51):

afonso360 edited PR #6542:

:wave: Hey,

This PR adds the widening add instructions from the V spec. These are vwadd{u,}.{w,v}{v,x}.

This also adds a bunch of rules to try to match these instructions. And some of these end up being quite complex.

Rules that match {u,s}widen_high are the same as their {u,s}widen_low counterparts but they first do a vslidedown of half the vector, to bring the top lanes down.

uwiden_low rules are the same as the swiden_low rules, but they use vwaddu.* instead of vwadd.* which is the unsigned version of the instruction.
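
To illustrate the relationship between the low/high and signed/unsigned rules, here is a minimal scalar model in Python. It is only a sketch: the helper names (`sext`, `zext`, `vslidedown`, `widen_low`, `widen_high`) are made up for this example and are not the Cranelift or ISLE definitions, and lanes are modelled as raw unsigned bit patterns of `sew` bits.

```python
def sext(value, bits):
    """Interpret the low `bits` bits of `value` as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def zext(value, bits):
    """Interpret the low `bits` bits of `value` as an unsigned integer."""
    return value & ((1 << bits) - 1)


def vslidedown(lanes, offset):
    """Move lane i + offset into lane i; lanes past the end read as 0 in this model."""
    return lanes[offset:] + [0] * offset


def widen_low(lanes, sew, signed):
    """Model of {u,s}widen_low: extend the low half of the lanes to 2 * sew bits."""
    extend = sext if signed else zext
    return [extend(x, sew) for x in lanes[: len(lanes) // 2]]


def widen_high(lanes, sew, signed):
    """Model of {u,s}widen_high: slide the top half down, then reuse widen_low."""
    return widen_low(vslidedown(lanes, len(lanes) // 2), sew, signed)


# An i16x8 input given as raw 16-bit patterns.
v = [0x0001, 0x7FFF, 0x8000, 0xFFFF, 0x0002, 0xFFFE, 0x8001, 0x0000]
print(widen_low(v, 16, signed=True))    # [1, 32767, -32768, -1]
print(widen_high(v, 16, signed=True))   # [2, -2, -32767, 0]
print(widen_high(v, 16, signed=False))  # [2, 65534, 32769, 0]
```

The only difference between the signed and unsigned paths is the choice of extension, and the only difference between low and high is the slide, which mirrors how the rule groups are described above.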

Now, in each of these groups of rules we have a few different instructions.

vwadd.wv does 2 * SEW = 2 * SEW + SEW, which just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2*SEW we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 iadd we need to pass in an i16x4 type as the vstate type.

vwadd.vv does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. We must set the vstate with a type that has half the element size and half the lanes of the final type.

vwadd.wx and vwadd.vx do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert {u,s}widen_{low,high} into splat+{u,s}extend. This way we only have to try to match the splat version, which reduces the number of rules.
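
The four signed forms can be summarised with a small scalar model. This is a sketch under assumptions, not the backend's implementation: the Python function names simply mirror the mnemonics, operands are raw bit patterns, and `sew` is the narrow element width taken from the vstate type (16 for the i16x4 example that yields an i32x4 result).

```python
def sext(value, bits):
    """Interpret the low `bits` bits of `value` as a signed integer (wraps on overflow)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def vwadd_vv(vs2, vs1, sew):
    """2*SEW = SEW + SEW: both operands are narrow and get sign extended."""
    return [sext(sext(a, sew) + sext(b, sew), 2 * sew) for a, b in zip(vs2, vs1)]


def vwadd_wv(vs2, vs1, sew):
    """2*SEW = 2*SEW + SEW: the LHS is already wide, only the RHS is extended."""
    return [sext(sext(a, 2 * sew) + sext(b, sew), 2 * sew) for a, b in zip(vs2, vs1)]


def vwadd_vx(vs2, rs1, sew):
    """Like vwadd_vv, but the RHS is one scalar register applied to every lane."""
    return vwadd_vv(vs2, [rs1] * len(vs2), sew)


def vwadd_wx(vs2, rs1, sew):
    """Like vwadd_wv, but the RHS is one scalar register applied to every lane."""
    return vwadd_wv(vs2, [rs1] * len(vs2), sew)


narrow_a = [0x7FFF, 0x8000, 0xFFFF, 0x0001]    # 32767, -32768, -1, 1 as i16
narrow_b = [0x7FFF, 0x8000, 0x0001, 0xFFFF]    # 32767, -32768, 1, -1 as i16
wide_a = [100000, 0, 0xFFFFFFFF, 70000]        # 100000, 0, -1, 70000 as i32

print(vwadd_vv(narrow_a, narrow_b, 16))  # [65534, -65536, 0, 0]
print(vwadd_wv(wide_a, narrow_b, 16))    # [132767, -32768, 0, 69999]
print(vwadd_vx(narrow_a, 0xFFFF, 16))    # [32766, -32769, -2, 0]
print(vwadd_wx(wide_a, 0xFFFF, 16))      # [99999, -1, -2, 69999]
```

The first result shows why the widening form is needed: 32767 + 32767 does not fit in 16 bits but is exact in the 32-bit result. The unsigned vwaddu.* forms would follow the same shape with zero extension in place of `sext` on the narrow operands.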

All of these rules use vstate_mf2. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise we could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.
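
The register arithmetic behind the LMUL choice can be checked with a few lines of Python. This is a rough sketch: the formulas VLMAX = LMUL * VLEN / SEW and "a widening op writes its destination at 2 * LMUL" come from the V spec, while VLEN = 128 is only an assumption for the example (real hardware may have a larger VLEN) and the function names are invented here.

```python
import math
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Maximum number of elements per operation: LMUL * VLEN / SEW."""
    return int(lmul * vlen_bits / sew_bits)


def widening_dest_registers(lmul):
    """Registers written by a widening op, whose destination uses EMUL = 2 * LMUL."""
    return max(1, math.ceil(2 * lmul))


VLEN = 128  # assumed vector register width in bits, for illustration only

# LMUL = 1/2 with SEW = 16: four i16 lanes are read, one register is written.
print(vlmax(VLEN, 16, Fraction(1, 2)), widening_dest_registers(Fraction(1, 2)))  # 4 1

# LMUL = 1 with SEW = 16: eight i16 lanes are read, two registers are written.
print(vlmax(VLEN, 16, Fraction(1)), widening_dest_registers(Fraction(1)))        # 8 2
```

Under that assumption, LMUL = 1/2 reads exactly the four narrow lanes needed for an i32x4 result and keeps the destination in one register, whereas LMUL = 1 would turn the destination into a two-register group.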

I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for isub and imul, so finding a good reduction here will hopefully make implementing those easier.

Wasmtime GitHub notifications bot (Jun 08 2023 at 15:55):

fitzgen submitted PR review:

I must admit I didn't dig super deep into every single rule/test, but it looks like it has the bits it should have, so LGTM!

Wasmtime GitHub notifications bot (Jun 08 2023 at 19:09):

afonso360 merged PR #6542.

Last updated: Apr 18 2025 at 21:03 UTC