akirilov-arm labeled issue #2296:
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with
i8x16.bitmask
, since it is the most complicated one - the suggested lowering from the proposal is:sshr vt.16b, vs.16b, #7 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b ext vt2.16b, vt.16b, vt.16b, #8 zip1 vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
where
Vs
is the input vector register,Wd
is the output general-purpose register, andVt
andVt2
- temporary vector registers.The main issue with this instruction sequence is that it uses the horizontal reduction
ADDV
, which is quite expensive on some microarchitectures, so here's an alternative:cmlt vt.16b, vs.16b, #0 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
We also use
CMLT #0
instead ofSSHR
, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would bemaybe_input_insn(ctx, inputs[0], Opcode::Icmp)
and so on).I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a
dup vs.16b, wd
, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):
CPU core Speedup Ampere eMAG 114.71% Arm Cortex-A53 90.49% Arm Cortex-A72 114.35% Arm Cortex-A73 112.50% Arm Neoverse N1 126.66% The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1
cfallin commented on issue #2296:
We pass all SIMD tests on aarch64, so I believe these operations should be supported; closing issue now.
cfallin closed issue #2296:
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with
i8x16.bitmask
, since it is the most complicated one - the suggested lowering from the proposal is:sshr vt.16b, vs.16b, #7 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b ext vt2.16b, vt.16b, vt.16b, #8 zip1 vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
where
Vs
is the input vector register,Wd
is the output general-purpose register, andVt
andVt2
- temporary vector registers.The main issue with this instruction sequence is that it uses the horizontal reduction
ADDV
, which is quite expensive on some microarchitectures, so here's an alternative:cmlt vt.16b, vs.16b, #0 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
We also use
CMLT #0
instead ofSSHR
, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would bemaybe_input_insn(ctx, inputs[0], Opcode::Icmp)
and so on).I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a
dup vs.16b, wd
, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):
CPU core Speedup Ampere eMAG 114.71% Arm Cortex-A53 90.49% Arm Cortex-A72 114.35% Arm Cortex-A73 112.50% Arm Neoverse N1 126.66% The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1
Last updated: Jan 24 2025 at 00:11 UTC