afonso360 opened issue #6828:
:wave: Hey,
Feature
This was pointed out by @jameysharp in https://github.com/bytecodealliance/wasmtime/pull/6815#pullrequestreview-1570118805!
We should try to transform
(op (splat x) (splat y) ...)
into(splat (op x y ...))
for operations that support this.Benefit
This transforms SIMD operations into their scalar counterpart which should be beneficial. We also have better constant propagation on scalars, so this is also an opportunity to do that further.
RISC-V specifically really benefits from this optimization since we have opcodes that can eat the splat on the transformed version.
Implementation
The e-graphs mid end is awesome for stuff like this! Here's one rule:
(rule (simplify (imul ty (splat ty x) (splat ty y))) (splat ty (imul (lane_type ty) x y)))
Pasting this into
cranelift/codegen/src/opts/vector.isle
makes this tranform work onimul
!You can try running this test to verify that it works:
test optimize set opt_level=speed target x86_64 function %imul_splat_into_splat_imul(i64, i64) -> i64x2 { block0(v0: i64, v1: i64): v2 = splat.i64x2 v0 v3 = splat.i64x2 v1 v4 = imul.i64x2 v2, v3 return v4 ; check: v5 = imul v0, v1 ; check: v6 = splat.i64x2 v5 ; check: return v6 }
(Run this with
cargo run -- test ./the-above.clif
from the/cranelift
directory)This is also a good test to add to our testsuite in
cranelift/filetests/filetests/egraph/...
.There are so many opcodes that this optimization works for that I actually can't list them all. Here's a few:
iadd
,isub
,imul
,ineg
,iabs
,umulhi
,smulhi
,... Just look at our opcode list and a lot of them will work out.Where it wont work:
- Trapping operations (i.e.
idiv
/urem
/srem
). This doesn't work because we currently don't perform optimizations on these operations (see #5908)Where it may not work:
- Floating point operations. I'm not 100% sure about this one, it logically makes sense that it wouldn't change the result, but I'm just afraid of doing transforms on floats.
Alternatives
There are so many opcodes for which this rule can be implemented that it may be worth considering auto-generating it. However I doubt it would be worth the effort + maintenance complexity of that. Copy pasting a bunch of times is sometimes better!
afonso360 added the good first issue label to Issue #6828.
afonso360 added the cranelift label to Issue #6828.
afonso360 added the cranelift:E-compiler label to Issue #6828.
afonso360 added the cranelift:goal:optimize-speed label to Issue #6828.
afonso360 added the cranelift:E-easy label to Issue #6828.
gurry commented on issue #6828:
@afonso360 I'd like to pick this one up. Will be doing some reading on ISLE first.
gurry commented on issue #6828:
@afonso360 How do I see the optimized CLIF code for a given func after all the mid-end opts are applied to it? In the case of some operators my test cases are failing. For example for
band_not
if I use this simplify rule:(rule (simplify (band_not ty (splat ty x) (splat ty y))) (splat ty (band_not (lane_type ty) x y)))
the following test is failing:
function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 { block0(v0: i64, v1: i64): v2 = splat.i64x2 v0 v3 = splat.i64x2 v1 v4 = band_not.i64x2 v2, v3 return v4 ; check: v5 = band_not v0, v1 ; check: v6 = splat.i64x2 v5 ; check: return v6 }
with the following message:
FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize Caused by: filecheck failed for function on line 38: #0 check: v5 = band_not v0, v1 #1 check: v6 = splat.i64x2 v5 #2 check: return v6 > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast { Missed #0: \bv5 = band_not v0, v1\b > block0(v0: i64, v1: i64): > v2 = splat.i64x2 v0 > v3 = splat.i64x2 v1 > v5 = bnot v3 > v4 = band v2, v5 > return v4 > } 1356 tests Error: 1 failure error: test failed, to rerun pass `--test filetests`
So I just wanted to debug and see how the generated code differs from the the expectation in the test.
gurry deleted a comment on issue #6828:
@afonso360 How do I see the optimized CLIF code for a given func after all the mid-end opts are applied to it? In the case of some operators my test cases are failing. For example for
band_not
if I use this simplify rule:(rule (simplify (band_not ty (splat ty x) (splat ty y))) (splat ty (band_not (lane_type ty) x y)))
the following test is failing:
function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 { block0(v0: i64, v1: i64): v2 = splat.i64x2 v0 v3 = splat.i64x2 v1 v4 = band_not.i64x2 v2, v3 return v4 ; check: v5 = band_not v0, v1 ; check: v6 = splat.i64x2 v5 ; check: return v6 }
with the following message:
FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize Caused by: filecheck failed for function on line 38: #0 check: v5 = band_not v0, v1 #1 check: v6 = splat.i64x2 v5 #2 check: return v6 > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast { Missed #0: \bv5 = band_not v0, v1\b > block0(v0: i64, v1: i64): > v2 = splat.i64x2 v0 > v3 = splat.i64x2 v1 > v5 = bnot v3 > v4 = band v2, v5 > return v4 > } 1356 tests Error: 1 failure error: test failed, to rerun pass `--test filetests`
So I just wanted to debug and see how the generated code differs from the the expectation in the test.
gurry commented on issue #6828:
@afonso360 How do I see the generated code after all the mid-end opts are applied to a func? Some of my tests are failing for some operators. For example for
band_not
for this simplify rule:(rule (simplify (band_not ty (splat ty x) (splat ty y))) (splat ty (band_not (lane_type ty) x y)))
the following test fails:
function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 { block0(v0: i64, v1: i64): v2 = splat.i64x2 v0 v3 = splat.i64x2 v1 v4 = band_not.i64x2 v2, v3 return v4 ; check: v5 = band_not v0, v1 ; check: v6 = splat.i64x2 v5 ; check: return v6 }
with the following error:
FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize Caused by: filecheck failed for function on line 38: #0 check: v5 = band_not v0, v1 #1 check: v6 = splat.i64x2 v5 #2 check: return v6 > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast { Missed #0: \bv5 = band_not v0, v1\b > block0(v0: i64, v1: i64): > v2 = splat.i64x2 v0 > v3 = splat.i64x2 v1 > v5 = bnot v3 > v4 = band v2, v5 > return v4 > } 1356 tests Error: 1 failure error: test failed, to rerun pass `--test filetests`
So I just wanted to debug and see how the generated code differs from the test expectation.
afonso360 commented on issue #6828:
Oh! Right I forgot that we have a few special opcodes like
band_not
that are transformed into multiple ops.We should be able to ignore those since the mid-end will never see them (I think!) before they are expanded into multiple operations. With everything that you have on that PR, I think you are just missing the transform for
bnot
and that test should start passing.
But to check the optimized code you can use the slightly complicated command:
cargo run -- compile --target x86_64 --set opt_level=speed -p ./the-file.clif
This runs the whole compile pipeline, and
-p
prints the clif file before being sent to the backends for converting into machine code.
afonso360 edited a comment on issue #6828:
Oh! Right I forgot that we have a few special opcodes like
band_not
that are transformed into multiple ops.We should be able to ignore those since the mid-end will never see them (I think!) before they are expanded into multiple operations. With everything that you have on that PR, I think you are just missing the transform for
bnot
and we should be able to transform those two ops.
But to check the optimized code you can use the slightly complicated command:
cargo run -- compile --target x86_64 --set opt_level=speed -p ./the-file.clif
This runs the whole compile pipeline, and
-p
prints the clif file before being sent to the backends for converting into machine code.
gurry commented on issue #6828:
Thanks for the info @afonso360 .
The test for
bnot
was failing for me for some reason. I'll investigate it now with the help of the command you shared.What about the family of shift and rotate ops such as
ishl
androtl
? Will this optimization be applicable to them?
afonso360 commented on issue #6828:
What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?
Yeah, they should be! However note that
ishl
and friends only support a scalar rhs (the number of bits to shift) and reject vectors in that position.So the optimization is slightly different. We only need to ensure that the lhs is splatted. So it would look something like this
(ishl (splat lhs) rhs) -> (splat (ishl lhs rhs))
The verifier shouldn't even let you build a
ishl
with a vector rhs. A good way to check what kinds of values are supported by each operation is to read the documentation for it. We'll usually have a remark stating: "this supports scalar or vectors", or this is scalar only. Something along those lines.
afonso360 edited a comment on issue #6828:
What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?
Yeah, they should be! However note that
ishl
and friends only support a scalar rhs (the number of bits to shift) and reject vectors in that position.So the optimization is slightly different. We only need to ensure that the lhs is splatted. So it would look something like this
(ishl (splat lhs) rhs) -> (splat (ishl lhs rhs))
.It's also probably worth adding a comment with that reasoning in the code! I didn't remember it myself until I tried it.
The verifier shouldn't even let you build a
ishl
with a vector rhs. A good way to check what kinds of values are supported by each operation is to read the documentation for it. We'll usually have a remark stating: "this supports scalar or vectors", or this is scalar only. Something along those lines.
afonso360 closed issue #6828:
:wave: Hey,
Feature
This was pointed out by @jameysharp in https://github.com/bytecodealliance/wasmtime/pull/6815#pullrequestreview-1570118805!
We should try to transform
(op (splat x) (splat y) ...)
into(splat (op x y ...))
for operations that support this.Benefit
This transforms SIMD operations into their scalar counterpart which should be beneficial. We also have better constant propagation on scalars, so this is also an opportunity to do that further.
RISC-V specifically really benefits from this optimization since we have opcodes that can eat the splat on the transformed version.
Implementation
The e-graphs mid end is awesome for stuff like this! Here's one rule:
(rule (simplify (imul ty (splat ty x) (splat ty y))) (splat ty (imul (lane_type ty) x y)))
Pasting this into
cranelift/codegen/src/opts/vector.isle
makes this tranform work onimul
!You can try running this test to verify that it works:
test optimize set opt_level=speed target x86_64 function %imul_splat_into_splat_imul(i64, i64) -> i64x2 { block0(v0: i64, v1: i64): v2 = splat.i64x2 v0 v3 = splat.i64x2 v1 v4 = imul.i64x2 v2, v3 return v4 ; check: v5 = imul v0, v1 ; check: v6 = splat.i64x2 v5 ; check: return v6 }
(Run this with
cargo run -- test ./the-above.clif
from the/cranelift
directory)This is also a good test to add to our testsuite in
cranelift/filetests/filetests/egraph/...
.There are so many opcodes that this optimization works for that I actually can't list them all. Here's a few:
iadd
,isub
,imul
,ineg
,iabs
,umulhi
,smulhi
,... Just look at our opcode list and a lot of them will work out.Where it wont work:
- Trapping operations (i.e.
idiv
/urem
/srem
). This doesn't work because we currently don't perform optimizations on these operations (see #5908)Where it may not work:
- Floating point operations. I'm not 100% sure about this one, it logically makes sense that it wouldn't change the result, but I'm just afraid of doing transforms on floats.
Alternatives
There are so many opcodes for which this rule can be implemented that it may be worth considering auto-generating it. However I doubt it would be worth the effort + maintenance complexity of that. Copy pasting a bunch of times is sometimes better!
Last updated: Dec 23 2024 at 13:07 UTC