afonso360 opened PR #6542 from `afonso360:riscv-simd-widening-iadd` to `bytecodealliance:main`:
:wave: Hey,
This PR adds the widening add instructions from the V spec. These are `vwadd{u,}.{w,v}{v,x}`.
This also adds a bunch of rules to try to match these instructions, and some of them end up being quite complex.
Rules that match `{u,s}widen_high` are the same as their `{u,s}widen_low` counterparts, but they first do a `vslidedown` of half the vector to bring the top lanes down.
`uwiden_low` rules are the same as the `swiden_low` rules, but they use `vwaddu.*` instead of `vwadd.*`, which is the unsigned version of the instruction.
Now, in each of these groups of rules we have a few different instructions.
`vwadd.wv` does 2 * SEW = 2 * SEW + SEW; this just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2 * SEW, we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 `iadd` we need to pass in an i16x4 type as the vstate type.
`vwadd.vv` does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. Again we must pass in a type with half the element size.
`vwadd.wx` and `vwadd.vx` do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert `{u,s}widen_{low,high}` into `splat+{u,s}extend`; this way we only have to try to match the splat version, which reduces the number of rules.
All of these rules use `vstate_mf2`. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise the CPU could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.
I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for `isub` and `imul`, so finding a good reduction here will hopefully make implementing those easier.
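
As a rough illustration of the lane arithmetic described above (this is not code from the PR), here is a scalar Rust model of the four instruction shapes with SEW = 16 and a 32-bit result. The function names simply mirror the instruction mnemonics and are hypothetical; the real instructions operate on vector registers under the current `vtype`, which this sketch ignores.

```rust
// Scalar model of the widening adds for SEW = 16 (result lanes are 2 * SEW = 32 bits).
// Function names mirror the mnemonics; they are illustrative, not a Cranelift API.

/// vwadd.vv: 2*SEW = SEW + SEW, both operands sign extended.
fn vwadd_vv(a: &[i16], b: &[i16]) -> Vec<i32> {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) + i32::from(y)).collect()
}

/// vwaddu.vv: same shape, but both operands are zero extended.
fn vwaddu_vv(a: &[u16], b: &[u16]) -> Vec<u32> {
    a.iter().zip(b).map(|(&x, &y)| u32::from(x) + u32::from(y)).collect()
}

/// vwadd.wv: 2*SEW = 2*SEW + SEW, only the RHS is sign extended.
fn vwadd_wv(a: &[i32], b: &[i16]) -> Vec<i32> {
    a.iter().zip(b).map(|(&x, &y)| x.wrapping_add(i32::from(y))).collect()
}

/// vwadd.vx: the RHS is a scalar that is sign extended and added to every lane,
/// which is why the rules look for a splat of an extended X register.
fn vwadd_vx(a: &[i16], x: i16) -> Vec<i32> {
    a.iter().map(|&lane| i32::from(lane) + i32::from(x)).collect()
}

fn main() {
    // The sum no longer wraps at 16 bits because the result lanes are 32 bits wide.
    println!("{:?}", vwadd_vv(&[i16::MAX, -1], &[1, -1]));
    println!("{:?}", vwaddu_vv(&[u16::MAX, 1], &[1, 2]));
    println!("{:?}", vwadd_wv(&[100_000, -5], &[-1, 5]));
    println!("{:?}", vwadd_vx(&[1, 2, 3], -4));
}
```

The signed and unsigned variants differ only in how the narrow operand is extended, which matches the `swiden_*` versus `uwiden_*` split in the rules.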
afonso360 requested cfallin for a review on PR #6542.
afonso360 requested wasmtime-compiler-reviewers for a review on PR #6542.
afonso360 edited PR #6542.
afonso360 edited PR #6542.
afonso360 edited PR #6542.
afonso360 edited PR #6542:
:wave: Hey,
This PR adds the widening add instructions from the V spec. These are `vwadd{u,}.{w,v}{v,x}`.
This also adds a bunch of rules to try to match these instructions, and some of them end up being quite complex.
Rules that match `{u,s}widen_high` are the same as their `{u,s}widen_low` counterparts, but they first do a `vslidedown` of half the vector to bring the top lanes down.
`uwiden_low` rules are the same as the `swiden_low` rules, but they use `vwaddu.*` instead of `vwadd.*`, which is the unsigned version of the instruction.
Now, in each of these groups of rules we have a few different instructions.
`vwadd.wv` does 2 * SEW = 2 * SEW + SEW; this just means that the elements in the RHS vector are first sign extended before doing the addition. The only trick here is that since the result is 2 * SEW, we must use a vstate type that has half the element size of the type that we want to end up with. So to end up with an i32x4 `iadd` we need to pass in an i16x4 type as the vstate type.
`vwadd.vv` does 2 * SEW = SEW + SEW, so as long as both sides are extended we can use this instruction. We must set the vstate with a type that has half the element size and half the lanes of the final type.
`vwadd.wx` and `vwadd.vx` do the same thing, but the RHS is expected to be an extended and splatted X register, so we try to match exactly that. To make these rules more applicable I've previously added some egraph rules (#6533) that convert `{u,s}widen_{low,high}` into `splat+{u,s}extend`; this way we only have to try to match the splat version, which reduces the number of rules.
All of these rules use `vstate_mf2`. This sets the LMUL setting to 1/2, meaning that at most we will read half of the source vector registers, and the result is guaranteed to fit in a single destination register. Otherwise we could have to write the result into multiple registers, which is something that the ISA supports, but adds a bunch of constraints that we don't need here.
I would really appreciate it if we could find a way to reduce the number of rules here. I couldn't really come up with anything, but maybe someone else can. We have equivalents to all of these for `isub` and `imul`, so finding a good reduction here will hopefully make implementing those easier.
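
To make the `{u,s}widen_high` point concrete, here is a small Rust sketch (again, not code from the PR) that models an i16x8 input and checks that `iadd(swiden_high(a), swiden_high(b))` matches "slide both inputs down by half the lanes, then do a widening add on the low lanes". The helper names and fixed lane counts are assumptions made for illustration, and `slidedown` fills vacated lanes with zero as a simplification.

```rust
use std::array;

type I16x8 = [i16; 8];
type I32x4 = [i32; 4];

/// CLIF-style swiden_low: sign extend the low four lanes.
fn swiden_low(v: &I16x8) -> I32x4 {
    array::from_fn(|i| i32::from(v[i]))
}

/// CLIF-style swiden_high: sign extend the high four lanes.
fn swiden_high(v: &I16x8) -> I32x4 {
    array::from_fn(|i| i32::from(v[i + 4]))
}

/// Lane-wise wrapping add on the widened type.
fn iadd(a: &I32x4, b: &I32x4) -> I32x4 {
    array::from_fn(|i| a[i].wrapping_add(b[i]))
}

/// Simplified model of vslidedown: lane i takes lane i + n, zero past the end.
fn slidedown(v: &I16x8, n: usize) -> I16x8 {
    array::from_fn(|i| if i + n < 8 { v[i + n] } else { 0 })
}

/// Widening add over the low four lanes only (models vwadd.vv on half the vector).
fn vwadd_vv_low(a: &I16x8, b: &I16x8) -> I32x4 {
    array::from_fn(|i| i32::from(a[i]) + i32::from(b[i]))
}

/// The "high" pattern: bring the top lanes down, then reuse the "low" lowering.
fn lowered_high_add(a: &I16x8, b: &I16x8) -> I32x4 {
    vwadd_vv_low(&slidedown(a, 4), &slidedown(b, 4))
}

fn main() {
    let a: I16x8 = [1, 2, 3, 4, i16::MAX, -6, 7, -8];
    let b: I16x8 = [0, 0, 0, 0, 1, -1, 100, 8];
    // Low pattern: no slide needed.
    assert_eq!(iadd(&swiden_low(&a), &swiden_low(&b)), vwadd_vv_low(&a, &b));
    // High pattern: slide down by half the lanes first.
    assert_eq!(iadd(&swiden_high(&a), &swiden_high(&b)), lowered_high_add(&a, &b));
    println!("{:?}", lowered_high_add(&a, &b));
}
```

The unsigned case would follow the same shape with zero extension in place of sign extension.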
fitzgen submitted PR review:
I must admit I didn't dig super deep into every single rule/test, but it looks like it has the bits it should have, so LGTM!
afonso360 merged PR #6542.
Last updated: Jan 24 2025 at 00:11 UTC