Stream: git-wasmtime

Topic: wasmtime / issue #7188 riscv64: Improve codegen for opera...


Wasmtime GitHub notifications bot (Oct 08 2023 at 13:15):

afonso360 edited issue #7188.

Wasmtime GitHub notifications bot (Oct 08 2023 at 13:21):

afonso360 added the cranelift:area:riscv64 label to Issue #7188.

Wasmtime GitHub notifications bot (Oct 08 2023 at 13:23):

afonso360 edited issue #7188:

:wave: Hey,

vslideup.vi is a very neat instruction that merges two vector registers by copying the bottom elements of one register into the top elements of another register.
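To make the merge concrete, here is a minimal scalar sketch of what vslideup.vi does when every element is active. The function name and the use of plain slices are assumptions for illustration only; this is not Cranelift code, and masking, vstart and tail policy are ignored.

```rust
/// Scalar model of `vslideup.vi vd, vs2, offset` with all elements active.
/// Elements of `vd` below `offset` are left untouched; the rest are
/// overwritten with the bottom elements of `vs2`.
fn vslideup(vd: &mut [i32], vs2: &[i32], offset: usize) {
    // Model `vl` as the number of elements both operands can hold.
    let vl = vd.len().min(vs2.len());
    // If `offset >= vl` the range is empty and nothing is written.
    for i in offset..vl {
        vd[i] = vs2[i - offset];
    }
}
```

Sliding a register holding 5, 6, 7, 8 in its bottom four elements up by 4 into a register holding 1, 2, 3, 4 leaves 1 through 8 packed in a single eight-element register.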

We currently only use it when we cannot perform an operation in a single register due to the register not being large enough.

We should create a ty_vec_fits_twice_in_reg extractor that tells us when two vector values of a given type fit in a single register. That way we know we can place twice as many elements in that register.
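The check behind such an extractor could look roughly like the sketch below. It is a standalone model, not the real ISLE/Cranelift plumbing: the VecTy struct, the min_vlen_bits parameter and the Option-returning shape are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a vector type: lane width in bits and lane count.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct VecTy {
    lane_bits: u32,
    lanes: u32,
}

impl VecTy {
    /// Total size of the vector type in bits.
    fn bits(self) -> u32 {
        self.lane_bits * self.lanes
    }
}

/// Returns `Some(ty)` when two values of `ty` fit side by side in one
/// vector register of at least `min_vlen_bits` bits, and `None` otherwise.
fn ty_vec_fits_twice_in_reg(ty: VecTy, min_vlen_bits: u32) -> Option<VecTy> {
    if ty.bits().checked_mul(2)? <= min_vlen_bits {
        Some(ty)
    } else {
        None
    }
}
```

Under this model a 128-bit type such as i32x4 matches when the guaranteed register width is 256 bits, and does not match at 128 bits.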

Instructions that use this operation are:

{s,u,uu}narrow

These operations currently emit two vnclip instructions and one vslideup. We could instead do the vslideup first, merging both values into the same register, and then emit a single vnclip.
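A small scalar model of both orderings for snarrow from i32 to i16 shows why the reordering should be safe: saturation is applied per element, so it does not matter whether the merge happens before or after it. The function names are made up for this sketch, and vnclip is modeled with a shift of zero only.

```rust
/// Scalar model of a signed `vnclip` with shift 0: saturate each i32 to i16.
fn vnclip(src: &[i32]) -> Vec<i16> {
    src.iter()
        .map(|&x| x.clamp(i16::MIN as i32, i16::MAX as i32) as i16)
        .collect()
}

/// Current shape: narrow each input on its own, then merge (2x vnclip + vslideup).
fn snarrow_current(x: &[i32], y: &[i32]) -> Vec<i16> {
    let mut lo = vnclip(x); // first vnclip
    let hi = vnclip(y); // second vnclip
    lo.extend(hi); // vslideup places `hi` above `lo`
    lo
}

/// Proposed shape: merge first, then narrow once (vslideup + 1x vnclip).
fn snarrow_proposed(x: &[i32], y: &[i32]) -> Vec<i16> {
    let mut merged = x.to_vec();
    merged.extend_from_slice(y); // vslideup merges both inputs
    vnclip(&merged) // single vnclip over the merged register
}
```

The proposed shape only works when the merged, un-narrowed inputs fit in one register, which is what the extractor above is meant to guarantee.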

iadd_pairwise

We don't have a dedicated iadd_pairwise instruction. Instead we shuffle the elements in a register and do a regular vadd.vv.

This works, but requires us to perform twice as many instructions as we would otherwise need.

Ideally we would merge both registers in a single vslideup, use vcompress to extract each side of the addition, and emit a single vadd to sum both sides.
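The idea can be sketched in scalar form as follows. This is an illustrative model only: the helper names are hypothetical, vcompress is modeled as simply packing the selected elements (the real instruction also has tail elements to consider), and the merged value is assumed to fit in one register.

```rust
/// Scalar model of `vcompress`: pack the elements whose mask bit is set.
fn vcompress(src: &[i32], mask: &[bool]) -> Vec<i32> {
    let mut out = Vec::new();
    for (v, m) in src.iter().zip(mask.iter()) {
        if *m {
            out.push(*v);
        }
    }
    out
}

/// iadd_pairwise via one merge, two compresses and a single add.
fn iadd_pairwise_merged(x: &[i32], y: &[i32]) -> Vec<i32> {
    // vslideup: merge both inputs into one register.
    let mut merged = x.to_vec();
    merged.extend_from_slice(y);

    // Masks selecting the even-indexed and odd-indexed elements.
    let even_mask: Vec<bool> = (0..merged.len()).map(|i| i % 2 == 0).collect();
    let odd_mask: Vec<bool> = (0..merged.len()).map(|i| i % 2 == 1).collect();

    // vcompress: extract each side of the addition.
    let evens = vcompress(&merged, &even_mask);
    let odds = vcompress(&merged, &odd_mask);

    // vadd: one wrapping add sums every adjacent pair.
    evens
        .iter()
        .zip(odds.iter())
        .map(|(a, b)| a.wrapping_add(*b))
        .collect()
}
```

For inputs 1, 2, 3, 4 and 10, 20, 30, 40 this produces 3, 7, 30, 70: the pairwise sums of the first input followed by those of the second.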

The V8 lowering for i32x4.dot_i16x8_s pulls a similar trick by using LMUL=2, which uses two registers. We can't use LMUL > 1 due to regalloc incompatibilities.

Alternatives

We don't actually need to do this. The RISC-V vector specification allows us to use LMUL > 1 to treat a register as having more space than it actually does.

The way this works is by combining multiple registers into a group and having each instruction work on the whole group at once.

The reason we don't currently do this is that it adds a bunch of register allocation constraints that we can't describe to regalloc2.
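As one example of the kind of constraint involved, the RISC-V vector specification requires that with LMUL > 1 a register group starts at a register number that is a multiple of LMUL. The sketch below models that rule and the resulting element capacity; the function names are hypothetical, fractional LMUL is ignored, and whether this is the exact constraint that regalloc2 cannot express is an assumption.

```rust
use std::ops::RangeInclusive;

/// Returns the registers covered by a group starting at `base` for an
/// integer LMUL, or `None` if the base register is not a legal group start.
fn register_group(base: u8, lmul: u8) -> Option<RangeInclusive<u8>> {
    let valid_lmul = matches!(lmul, 1 | 2 | 4 | 8);
    // There are 32 vector registers, and a group must start at a multiple of LMUL.
    if !valid_lmul || base >= 32 || base % lmul != 0 {
        return None;
    }
    Some(base..=base + lmul - 1)
}

/// Number of elements a register group holds: VLEN * LMUL / SEW.
fn elements_per_group(vlen_bits: u32, sew_bits: u32, lmul: u32) -> u32 {
    vlen_bits * lmul / sew_bits
}
```

With VLEN = 128 and 16-bit elements, LMUL=2 doubles the capacity from 8 to 16 elements, but v3 can no longer be chosen freely: it is only usable as the upper half of the v2 group.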


Last updated: Jan 24 2025 at 00:11 UTC