Stream: git-wasmtime

Topic: wasmtime / issue #6828 Cranelift: Optimize `op+splat` int...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 opened issue #6828:

:wave: Hey,

Feature

This was pointed out by @jameysharp in https://github.com/bytecodealliance/wasmtime/pull/6815#pullrequestreview-1570118805!

We should try to transform (op (splat x) (splat y) ...) into (splat (op x y ...)) for operations that support this.

Benefit

This transforms SIMD operations into their scalar counterpart which should be beneficial. We also have better constant propagation on scalars, so this is also an opportunity to do that further.

RISC-V specifically really benefits from this optimization since we have opcodes that can eat the splat on the transformed version.

Implementation

The e-graphs mid end is awesome for stuff like this! Here's one rule:

(rule (simplify (imul ty (splat ty x) (splat ty y)))
      (splat ty (imul (lane_type ty) x y)))

Pasting this into cranelift/codegen/src/opts/vector.isle makes this tranform work on imul!

You can try running this test to verify that it works:

test optimize
set opt_level=speed
target x86_64

function %imul_splat_into_splat_imul(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = imul.i64x2 v2, v3
  return v4
  ; check: v5 = imul v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

(Run this with cargo run -- test ./the-above.clif from the /cranelift directory)

This is also a good test to add to our testsuite in cranelift/filetests/filetests/egraph/....

There are so many opcodes that this optimization works for that I actually can't list them all. Here's a few: iadd,isub,imul,ineg,iabs,umulhi,smulhi,... Just look at our opcode list and a lot of them will work out.

Where it wont work:

Where it may not work:

Alternatives

There are so many opcodes for which this rule can be implemented that it may be worth considering auto-generating it. However I doubt it would be worth the effort + maintenance complexity of that. Copy pasting a bunch of times is sometimes better!

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 added the good first issue label to Issue #6828.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 added the cranelift label to Issue #6828.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 added the cranelift:E-compiler label to Issue #6828.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 added the cranelift:goal:optimize-speed label to Issue #6828.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2023 at 21:56):

afonso360 added the cranelift:E-easy label to Issue #6828.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 14 2023 at 11:24):

gurry commented on issue #6828:

@afonso360 I'd like to pick this one up. Will be doing some reading on ISLE first.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 10:47):

gurry commented on issue #6828:

@afonso360 How do I see the optimized CLIF code for a given func after all the mid-end opts are applied to it? In the case of some operators my test cases are failing. For example for band_not if I use this simplify rule:

(rule (simplify (band_not ty (splat ty x) (splat ty y)))
      (splat ty (band_not (lane_type ty) x y)))

the following test is failing:

function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = band_not.i64x2 v2, v3
  return v4
  ; check: v5 = band_not v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

with the following message:

FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize

Caused by:
    filecheck failed for function on line 38:
    #0 check: v5 = band_not v0, v1
    #1 check: v6 = splat.i64x2 v5
    #2 check: return v6
    > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast {
    Missed #0: \bv5 = band_not v0, v1\b
    > block0(v0: i64, v1: i64):
    >     v2 = splat.i64x2 v0
    >     v3 = splat.i64x2 v1
    >     v5 = bnot v3
    >     v4 = band v2, v5
    >     return v4
    > }

1356 tests
Error: 1 failure
error: test failed, to rerun pass `--test filetests`

So I just wanted to debug and see how the generated code differs from the the expectation in the test.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 10:47):

gurry deleted a comment on issue #6828:

@afonso360 How do I see the optimized CLIF code for a given func after all the mid-end opts are applied to it? In the case of some operators my test cases are failing. For example for band_not if I use this simplify rule:

(rule (simplify (band_not ty (splat ty x) (splat ty y)))
      (splat ty (band_not (lane_type ty) x y)))

the following test is failing:

function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = band_not.i64x2 v2, v3
  return v4
  ; check: v5 = band_not v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

with the following message:

FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize

Caused by:
    filecheck failed for function on line 38:
    #0 check: v5 = band_not v0, v1
    #1 check: v6 = splat.i64x2 v5
    #2 check: return v6
    > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast {
    Missed #0: \bv5 = band_not v0, v1\b
    > block0(v0: i64, v1: i64):
    >     v2 = splat.i64x2 v0
    >     v3 = splat.i64x2 v1
    >     v5 = bnot v3
    >     v4 = band v2, v5
    >     return v4
    > }

1356 tests
Error: 1 failure
error: test failed, to rerun pass `--test filetests`

So I just wanted to debug and see how the generated code differs from the the expectation in the test.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 10:53):

gurry commented on issue #6828:

@afonso360 How do I see the generated code after all the mid-end opts are applied to a func? Some of my tests are failing for some operators. For example for band_not for this simplify rule:

(rule (simplify (band_not ty (splat ty x) (splat ty y)))
      (splat ty (band_not (lane_type ty) x y)))

the following test fails:

function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = band_not.i64x2 v2, v3
  return v4
  ; check: v5 = band_not v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

with the following error:

FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize

Caused by:
    filecheck failed for function on line 38:
    #0 check: v5 = band_not v0, v1
    #1 check: v6 = splat.i64x2 v5
    #2 check: return v6
    > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast {
    Missed #0: \bv5 = band_not v0, v1\b
    > block0(v0: i64, v1: i64):
    >     v2 = splat.i64x2 v0
    >     v3 = splat.i64x2 v1
    >     v5 = bnot v3
    >     v4 = band v2, v5
    >     return v4
    > }

1356 tests
Error: 1 failure
error: test failed, to rerun pass `--test filetests`

So I just wanted to debug and see how the generated code differs from the test expectation.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 11:01):

afonso360 commented on issue #6828:

Oh! Right I forgot that we have a few special opcodes like band_not that are transformed into multiple ops.

We should be able to ignore those since the mid-end will never see them (I think!) before they are expanded into multiple operations. With everything that you have on that PR, I think you are just missing the transform for bnot and that test should start passing.


But to check the optimized code you can use the slightly complicated command:

cargo run -- compile --target x86_64 --set opt_level=speed -p ./the-file.clif

This runs the whole compile pipeline, and -p prints the clif file before being sent to the backends for converting into machine code.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 11:02):

afonso360 edited a comment on issue #6828:

Oh! Right I forgot that we have a few special opcodes like band_not that are transformed into multiple ops.

We should be able to ignore those since the mid-end will never see them (I think!) before they are expanded into multiple operations. With everything that you have on that PR, I think you are just missing the transform for bnot and we should be able to transform those two ops.


But to check the optimized code you can use the slightly complicated command:

cargo run -- compile --target x86_64 --set opt_level=speed -p ./the-file.clif

This runs the whole compile pipeline, and -p prints the clif file before being sent to the backends for converting into machine code.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 11:42):

gurry commented on issue #6828:

Thanks for the info @afonso360 .

The test for bnot was failing for me for some reason. I'll investigate it now with the help of the command you shared.

What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 11:53):

afonso360 commented on issue #6828:

What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?

Yeah, they should be! However note that ishl and friends only support a scalar rhs (the number of bits to shift) and reject vectors in that position.

So the optimization is slightly different. We only need to ensure that the lhs is splatted. So it would look something like this
(ishl (splat lhs) rhs) -> (splat (ishl lhs rhs))

The verifier shouldn't even let you build a ishl with a vector rhs. A good way to check what kinds of values are supported by each operation is to read the documentation for it. We'll usually have a remark stating: "this supports scalar or vectors", or this is scalar only. Something along those lines.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 16 2023 at 11:54):

afonso360 edited a comment on issue #6828:

What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?

Yeah, they should be! However note that ishl and friends only support a scalar rhs (the number of bits to shift) and reject vectors in that position.

So the optimization is slightly different. We only need to ensure that the lhs is splatted. So it would look something like this
(ishl (splat lhs) rhs) -> (splat (ishl lhs rhs)).

It's also probably worth adding a comment with that reasoning in the code! I didn't remember it myself until I tried it.

The verifier shouldn't even let you build a ishl with a vector rhs. A good way to check what kinds of values are supported by each operation is to read the documentation for it. We'll usually have a remark stating: "this supports scalar or vectors", or this is scalar only. Something along those lines.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 18 2023 at 09:02):

afonso360 closed issue #6828:

:wave: Hey,

Feature

This was pointed out by @jameysharp in https://github.com/bytecodealliance/wasmtime/pull/6815#pullrequestreview-1570118805!

We should try to transform (op (splat x) (splat y) ...) into (splat (op x y ...)) for operations that support this.

Benefit

This transforms SIMD operations into their scalar counterpart which should be beneficial. We also have better constant propagation on scalars, so this is also an opportunity to do that further.

RISC-V specifically really benefits from this optimization since we have opcodes that can eat the splat on the transformed version.

Implementation

The e-graphs mid end is awesome for stuff like this! Here's one rule:

(rule (simplify (imul ty (splat ty x) (splat ty y)))
      (splat ty (imul (lane_type ty) x y)))

Pasting this into cranelift/codegen/src/opts/vector.isle makes this tranform work on imul!

You can try running this test to verify that it works:

test optimize
set opt_level=speed
target x86_64

function %imul_splat_into_splat_imul(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = imul.i64x2 v2, v3
  return v4
  ; check: v5 = imul v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

(Run this with cargo run -- test ./the-above.clif from the /cranelift directory)

This is also a good test to add to our testsuite in cranelift/filetests/filetests/egraph/....

There are so many opcodes that this optimization works for that I actually can't list them all. Here's a few: iadd,isub,imul,ineg,iabs,umulhi,smulhi,... Just look at our opcode list and a lot of them will work out.

Where it wont work:

Where it may not work:

Alternatives

There are so many opcodes for which this rule can be implemented that it may be worth considering auto-generating it. However I doubt it would be worth the effort + maintenance complexity of that. Copy pasting a bunch of times is sometimes better!


Last updated: Dec 23 2024 at 13:07 UTC