Stream: git-wasmtime

Topic: wasmtime / issue #2296 Cranelift AArch64: Support the SIM...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 01 2021 at 21:00):

akirilov-arm labeled issue #2296:

The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.

Let's start with i8x16.bitmask, since it is the most complicated one - the suggested lowering from the proposal is:

  sshr  vt.16b, vs.16b, #7
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  ext   vt2.16b, vt.16b, vt.16b, #8
  zip1  vt.16b, vt.16b, vt2.16b
  addv  ht, vt.8h
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

where Vs is the input vector register, Wd is the output general-purpose register, and Vt and Vt2 - temporary vector registers.

The main issue with this instruction sequence is that it uses the horizontal reduction ADDV, which is quite expensive on some microarchitectures, so here's an alternative:

  cmlt  vt.16b, vs.16b, #0
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

We also use CMLT #0 instead of SSHR, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would be maybe_input_insn(ctx, inputs[0], Opcode::Icmp) and so on).

I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a dup vs.16b, wd, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):

CPU core Speedup
Ampere eMAG 114.71%
Arm Cortex-A53 90.49%
Arm Cortex-A72 114.35%
Arm Cortex-A73 112.50%
Arm Neoverse N1 126.66%

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.

Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.

Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.

cc @julian-seward1

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2022 at 22:54):

cfallin commented on issue #2296:

We pass all SIMD tests on aarch64, so I believe these operations should be supported; closing issue now.

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2022 at 22:54):

cfallin closed issue #2296:

The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.

Let's start with i8x16.bitmask, since it is the most complicated one - the suggested lowering from the proposal is:

  sshr  vt.16b, vs.16b, #7
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  ext   vt2.16b, vt.16b, vt.16b, #8
  zip1  vt.16b, vt.16b, vt2.16b
  addv  ht, vt.8h
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

where Vs is the input vector register, Wd is the output general-purpose register, and Vt and Vt2 - temporary vector registers.

The main issue with this instruction sequence is that it uses the horizontal reduction ADDV, which is quite expensive on some microarchitectures, so here's an alternative:

  cmlt  vt.16b, vs.16b, #0
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

We also use CMLT #0 instead of SSHR, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would be maybe_input_insn(ctx, inputs[0], Opcode::Icmp) and so on).

I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a dup vs.16b, wd, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):

CPU core Speedup
Ampere eMAG 114.71%
Arm Cortex-A53 90.49%
Arm Cortex-A72 114.35%
Arm Cortex-A73 112.50%
Arm Neoverse N1 126.66%

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.

Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.

Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.

cc @julian-seward1


Last updated: Jan 24 2025 at 00:11 UTC