Stream: git-wasmtime

Topic: wasmtime / issue #6623 riscv64: Improve SIMD `icmp` with ...


view this post on Zulip Wasmtime GitHub notifications bot (Jun 22 2023 at 13:10):

afonso360 opened issue #6623:

Feature

We currently implement a bunch of rules that try to lower an icmp CLIF instruction efficiently into a single RISC-V vector instruction. We have a table of all the cases where we know there is an equivalent instruction (see gen_icmp_mask in the RISC-V backend).

However, we are still missing one set of cases. When we have an icmp+splat+iconst sequence, we can usually emit a *.vi opcode. We already do that when the icmp and splat match exactly what the opcode does, but we can go further.

Take this example:

function %simd_icmp_sge_splat_const_rhs_i32(i32x4) -> i32x4 {
block0(v0: i32x4):
    v1 = iconst.i32 10
    v2 = splat.i32x4 v1
    v3 = icmp sge v0, v2
    return v3
}

While we don't have a vmsge.vi instruction, we can use the vmsgt.vi instruction to do essentially the same thing by decrementing the immediate by 1: x >= 10 is the same as x > 9.
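
Roughly, the rewrite looks like this (a minimal Rust sketch with a hypothetical helper name, not the actual backend code); the guard assumes the RVV .vi compares encode a 5-bit signed immediate (simm5, -16..=15):

/// `x >= imm` (signed) is equivalent to `x > imm - 1`, so a missing
/// vmsge.vi can be expressed as vmsgt.vi with a decremented immediate.
/// Returns the adjusted immediate, or `None` when we must fall back to
/// a splat plus a `.vv` comparison.
fn sge_to_sgt_imm(imm: i64) -> Option<i64> {
    let adjusted = imm.checked_sub(1)?;
    // Bail out if the decremented value no longer fits in simm5.
    (-16..=15).contains(&adjusted).then_some(adjusted)
}

fn main() {
    assert_eq!(sge_to_sgt_imm(10), Some(9)); // the example above: sge 10 -> sgt 9
    assert_eq!(sge_to_sgt_imm(-16), None);   // -17 does not fit in simm5
}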

There are 8 cases like this one that we can still improve in the backend (see the godbolt link below).

Benefit

This improves icmp codegen by generating fewer instructions for a select set of cases.

Implementation

Here's a godbolt link showing what LLVM does in each of the missing scenarios in gen_icmp_mask.

For some of them, such as icmp_ule_lhs_splat, it looks like we don't have any efficient implementation other than a separate splat and a .vv instruction. In these cases, we can leave holes in the table and we'll automatically generate that fallback code.
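
To illustrate the "holes in the table" idea (a Rust sketch; the case names and shape are made up for illustration and are not the real gen_icmp_mask), an entry returns the single .vi opcode when one exists, and None marks a hole where the caller falls back to splat + .vv:

// Illustrative lowering table: `None` is a deliberate hole meaning
// "no single vector instruction is known; emit splat + `.vv` instead".
fn icmp_vi_opcode(case: &str) -> Option<&'static str> {
    match case {
        "icmp_eq_splat_const" => Some("vmseq.vi"),
        "icmp_ne_splat_const" => Some("vmsne.vi"),
        "icmp_sle_rhs_splat_const" => Some("vmsle.vi"),
        "icmp_sgt_rhs_splat_const" => Some("vmsgt.vi"),
        // Hole: no efficient immediate form, so we automatically fall
        // back to a separate splat and a `.vv` comparison.
        "icmp_ule_lhs_splat" => None,
        _ => None,
    }
}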

However, the more interesting cases are icmp_ule_lhs_splat_const (and similar), where we generate the reversed IntCC with an immediate of imm-1. That gives us some range in which we can optimize these icmps.
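
A sketch of that rewrite, under the same assumptions as above (hypothetical case names, simm5 immediate range): the condition is flipped to one that does have a .vi encoding and the immediate is decremented, with an extra guard for the unsigned case, where imm = 0 has no "greater than imm - 1" equivalent:

#[derive(Debug, PartialEq)]
enum VICompare {
    Sgt,  // vmsgt.vi
    Sgtu, // vmsgtu.vi
}

// Hypothetical reversed-IntCC rewrites: flip the condition and use imm-1.
fn rewrite_icmp(case: &str, imm: i64) -> Option<(VICompare, i64)> {
    let adjusted = imm.checked_sub(1)?;
    if !(-16..=15).contains(&adjusted) {
        return None; // decremented immediate no longer fits simm5
    }
    match case {
        // sge x, imm  ==  sgt x, imm-1 (the example from the issue)
        "icmp_sge_splat_const_rhs" => Some((VICompare::Sgt, adjusted)),
        // splat(imm) ule x  ==  x uge imm  ==  x ugt imm-1, valid only
        // for imm >= 1: at imm == 0 the compare is trivially true, while
        // the wrapped immediate would sign-extend to the maximum value.
        "icmp_ule_lhs_splat_const" if imm >= 1 => Some((VICompare::Sgtu, adjusted)),
        _ => None, // leave a hole: fall back to splat + `.vv`
    }
}

fn main() {
    assert_eq!(rewrite_icmp("icmp_sge_splat_const_rhs", 10), Some((VICompare::Sgt, 9)));
    assert_eq!(rewrite_icmp("icmp_ule_lhs_splat_const", 0), None);
}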

Alternatives

This is all optional, and these improvements only cover 8 cases (out of 50 possible), so they might be fairly rare.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 22 2023 at 13:10):

afonso360 labeled issue #6623

view this post on Zulip Wasmtime GitHub notifications bot (Jun 22 2023 at 14:08):

afonso360 labeled issue #6623

