Stream: git-wasmtime

Topic: wasmtime / Issue #2296 Cranelift AArch64: Support the SIM...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2020 at 00:46):

akirilov-arm opened Issue #2296:

The bitmask extraction operations (WebAssembly/simd#201) have already become part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion). This issue is going to concentrate on implementation approaches, and in particular on the instruction sequences that will be used - that is probably going to be the most challenging part, because these operations don't have a straightforward mapping into the A64 instruction set.
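
For reference, here is a scalar sketch of what i8x16.bitmask computes, based on the proposal's semantics (bit i of the result is the most significant bit of lane i); the function name is purely illustrative:

  #include <stdint.h>

  /* Scalar reference for i8x16.bitmask: collect the most significant bit
   * of each of the 16 byte lanes into a 16-bit mask. */
  uint16_t i8x16_bitmask_ref(const uint8_t lanes[16]) {
      uint16_t mask = 0;
      for (int i = 0; i < 16; i++)
          mask |= (uint16_t)(lanes[i] >> 7) << i;
      return mask;
  }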

Let's start with i8x16.bitmask, since it is the most complicated one - the suggested lowering from the proposal is:

  sshr  vt.16b, vs.16b, #7
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  ext   vt2.16b, vt.16b, vt.16b, #8
  zip1  vt.16b, vt.16b, vt2.16b
  addv  ht, vt.8h
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

where Vs is the input vector register, Wd is the output general-purpose register, and Vt and Vt2 are temporary vector registers.
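
For anyone who wants to prototype or cross-check this lowering outside Cranelift, the same computation can be expressed with NEON intrinsics. This is only a sketch (the compiler handles the constant load and register allocation) and the function name is illustrative:

  #include <arm_neon.h>
  #include <stdint.h>

  // ADDV-based variant, mirroring the proposal's lowering above.
  uint16_t i8x16_bitmask_addv(int8x16_t v) {
      static const uint8_t mask_bytes[16] = {
          1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
      };
      uint8x16_t bits   = vreinterpretq_u8_s8(vshrq_n_s8(v, 7)); // sshr #7
      uint8x16_t masked = vandq_u8(bits, vld1q_u8(mask_bytes));  // ldr + and
      uint8x16_t rot    = vextq_u8(masked, masked, 8);           // ext #8
      uint8x16_t zipped = vzip1q_u8(masked, rot);                // zip1
      // Each 16-bit lane now pairs a low-half bit with the matching
      // high-half bit; addv sums the lanes into the final 16-bit mask.
      return vaddvq_u16(vreinterpretq_u16_u8(zipped));           // addv + umov
  }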

The main issue with this instruction sequence is that it uses the horizontal reduction ADDV, which is quite expensive on some microarchitectures, so here's an alternative:

  cmlt  vt.16b, vs.16b, #0
  ldr   qt2, .Lmask
  and   vt.16b, vt.16b, vt2.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  addp  vt.16b, vt.16b, vt.16b
  umov  wd, vt.h[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

We also use CMLT #0 instead of SSHR, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0s or all 1s), so we could avoid the initial instruction with some clever IR pattern matching (the bare minimum would be maybe_input_insn(ctx, inputs[0], Opcode::Icmp) and so on).
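
Similarly, here is a NEON intrinsics sketch of the ADDP-based alternative (again just for prototyping and verification; the name is illustrative):

  #include <arm_neon.h>
  #include <stdint.h>

  // ADDP-based variant: three pairwise additions replace the ADDV reduction.
  uint16_t i8x16_bitmask_addp(int8x16_t v) {
      static const uint8_t mask_bytes[16] = {
          1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
      };
      uint8x16_t bits   = vcltzq_s8(v);                          // cmlt #0
      uint8x16_t masked = vandq_u8(bits, vld1q_u8(mask_bytes));  // ldr + and
      uint8x16_t sum    = vpaddq_u8(masked, masked);             // addp
      sum = vpaddq_u8(sum, sum);                                 // addp
      sum = vpaddq_u8(sum, sum);                                 // addp
      // Byte 0 now holds mask bits 0-7 and byte 1 holds bits 8-15.
      return vgetq_lane_u16(vreinterpretq_u16_u8(sum), 0);       // umov wd, vt.h[0]
  }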

I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a dup vs.16b, wd, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):

CPU core          Speedup
Ampere eMAG       114.71%
Arm Cortex-A53    90.49%
Arm Cortex-A72    114.35%
Arm Cortex-A73    112.50%
Arm Neoverse N1   126.66%

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
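
The shape of such a latency loop, written in terms of the ADDP intrinsics sketch above, might look roughly like this (an assumed structure for illustration; the actual microbenchmark presumably works directly on the assembly sequences):

  // Feeding the result back into the input with dup creates a loop-carried
  // dependency, so the time per iteration approximates the latency of the
  // whole sequence under test.
  uint16_t bench_latency(int8x16_t v, uint64_t iterations) {
      uint16_t result = 0;
      for (uint64_t i = 0; i < iterations; i++) {
          result = i8x16_bitmask_addp(v);                        // sequence under test
          v = vreinterpretq_s8_u8(vdupq_n_u8((uint8_t)result));  // dup vs.16b, wd
      }
      return result;
  }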

Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.

Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.

cc @julian-seward1

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2020 at 06:44):

julian-seward1 commented on Issue #2296:

Anton, thank you for making these measurements. It's a shame that it isn't a clear win on all targets. One other variant that I'd like to ask about is forming the magic constant without doing a load. Currently SpiderMonkey forms the constant like this:

  mov   tmp_gpr, #0x0201
  movk  tmp_gpr, #0x0804, lsl 16
  movk  tmp_gpr, #0x2010, lsl 32
  movk  tmp_gpr, #0x8040, lsl 48
  dup   tmp_vec.2d, tmp_gpr

I remember reading somewhere (in one of the optimisation guides) that loads have a 3- or 4-cycle latency, whereas mov/movk arranged like this can dual-issue in pairs, hence producing the tmp_gpr value in two cycles. If the dup can execute in one or two cycles, then it's not a loss, and it has the advantage of not causing extra memory traffic. But I'm just guessing here. What do you think?
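
For reference, the mov/movk chain above builds the 64-bit replicated value 0x8040201008040201 (bytes 1, 2, 4, ..., 128 in little-endian order), which dup then broadcasts to both 64-bit halves of the vector; in intrinsics terms (a sketch, names illustrative):

  #include <arm_neon.h>
  #include <stdint.h>

  // Same mask as the literal pool entry above, just materialized without a load.
  static inline uint8x16_t bitmask_constant(void) {
      return vreinterpretq_u8_u64(vdupq_n_u64(0x8040201008040201ULL)); // dup vd.2d, xt
  }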

view this post on Zulip Wasmtime GitHub notifications bot (Oct 19 2020 at 13:49):

akirilov-arm commented on Issue #2296:

I am actually working on improving vector constant materialization right now, in particular for replicated patterns. While I am thinking mostly about using MOVI/MVNI/FMOV (the immediate forms), I can add this to what I am focusing on. You are right about the load timing and the possibility of dual-issue, but that is still a chain of dependent instructions, so I'd like to get some microbenchmark data first. The other issue is that, in contrast with the literal load approach, it is impossible to share the constant value with other instances of the bitmask extraction operation. Granted, we don't support proper constant pools in the AArch64 backend yet (refer to #1549), so sharing is not possible anyway, but that may change in the future.

My suggestion is to use lower_constant_f128() and then we can tackle this separately.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 23 2020 at 20:48):

yurydelendik commented on Issue #2296:

My suggestion is to use lower_constant_f128() and then we can tackle this separately.

My understanding is that lower_constant_f128 involves a jump instruction, which might affect the benchmark mentioned in the description. Is it possible to re-run that benchmark with a variant that includes the PC-relative load plus a B (branch)?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 23 2020 at 21:12):

akirilov-arm commented on Issue #2296:

In PR #2310 lower_constant_f128() no longer generates a literal load for replicated 64-bit patterns and uses the sequence of moves above instead, which is part of the reason I suggested it (I expected to push my changes shortly afterwards, which I did).

The idea is actually to concentrate the handling of constants in a single place, so that we can easily improve code generation in the future. We can also add special cases inside lower_constant_f128() for well-known values like the bitmask extraction ones.

I intend to run some benchmarks again, but they will be slightly different because we are no longer interested in the approach to calculate the bitmask, but in the best way to materialize the magic constants (which is an orthogonal issue).

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2021 at 11:20):

akirilov-arm commented on Issue #2296:

I have finally managed to run several microbenchmarks for the constant materialization. My base case is the sequence of moves above:

  mov xt, #0x0201
  movk xt, #0x0804, lsl #16
  movk xt, #0x2010, lsl #32
  movk xt, #0x8040, lsl #48
  dup vd.2d, xt

The first alternative is a straight literal load, which I'll call literal:

  ldr qd, .Lmask
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

The second one is a literal load, but only of 64 bits (since we are dealing with a replicated pattern), followed by a move that copies the first 64-bit vector element into the second (literal_64):

  ldr dd, .Lmask
  mov vd.d[1], vd.d[0]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128

And the third one is similar to the second one, but uses a replicated load instead (so it's called literal_replicate) and requires a temporary register - the instruction count stays the same because LD1R lacks a literal addressing mode:

  adr xt, .Lmask
  ld1r { vd.2d }, [xt]
  ...
.Lmask:
  .byte 1, 2, 4, 8, 16, 32, 64, 128

Each of the alternatives also has a variant in which the literal is positioned immediately after the load, so it has to be branched over - I'll add a _branch suffix to denote it. For example, here's what literal_64_branch looks like:

  ldr dd, 2f
  b 3f
2:
  .byte 1, 2, 4, 8, 16, 32, 64, 128
3:
  mov vd.d[1], vd.d[0]
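
In intrinsics terms, literal_64 and literal_replicate correspond roughly to the following (illustrative only; the benchmarked code is the assembly above):

  #include <arm_neon.h>
  #include <stdint.h>

  static const uint64_t mask64 = 0x8040201008040201ULL;  // .byte 1, 2, 4, ..., 128

  // literal_64: 64-bit load, then duplicate into the upper half.
  static inline uint8x16_t load_literal_64(void) {
      uint8x8_t lo = vld1_u8((const uint8_t *)&mask64);
      return vcombine_u8(lo, lo);                         // mov vd.d[1], vd.d[0]
  }

  // literal_replicate: LD1R-style replicating load.
  static inline uint8x16_t load_literal_replicate(void) {
      return vreinterpretq_u8_u64(vld1q_dup_u64(&mask64)); // ld1r { vd.2d }, [xt]
  }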

Note that it is not really possible to establish a dependency chain between loop iterations, so this is essentially a throughput (not latency) measurement, especially on out-of-order processors.

Here are the results for various microarchitectures in terms of speedup achieved by the alternatives, i.e. higher numbers mean that the latter are faster (have higher throughput):

CPU core          literal    literal_64   literal_replicate   literal_branch   literal_64_branch   literal_replicate_branch
Ampere eMAG       109.40%    163.48%      164.41%             51.73%           53.59%              54.91%
Arm Cortex-A53    249.95%    125.03%      249.72%             166.17%          110.51%             124.40%
Arm Cortex-A55    150.33%    75.17%       300.00%             99.71%           66.41%              74.92%
Arm Cortex-A72    200.23%    200.23%      150.00%             66.59%           57.10%              57.10%
Arm Neoverse N1   346.51%    346.51%      231.01%             77.00%           65.93%              65.93%

The results are based on median runtimes from 20 runs; standard errors are 0.21% or less.

Overall, the literal load approaches without a branch are a clear win, with some microarchitectures (especially the little ones like A53 and A55) preferring the variants that transfer only 64 bits from memory like literal_replicate (explained by the fact that A53 and A55 have a 64-bit read path from the data L1 cache). On the other hand, if execution must branch over the literal, then there is almost always a (significant) regression, so a sequence of moves is preferable.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 04 2021 at 00:43):

akirilov-arm labeled Issue #2296.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 11 2021 at 21:29):

aqrit commented on Issue #2296:

Shift and accumulate instructions could be used instead of a constant mask.
https://stackoverflow.com/a/58381188/5691983
https://bugs.llvm.org/show_bug.cgi?id=49577
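
A constant-free sequence along those lines might look like the following NEON intrinsics sketch (my reading of the shift-and-accumulate idea, not necessarily the exact code from the linked answer): each USRA step folds neighbouring bit groups together, so no mask constant or literal load is needed.

  #include <arm_neon.h>
  #include <stdint.h>

  uint16_t i8x16_bitmask_usra(int8x16_t v) {
      // ushr #7: each byte becomes 0 or 1 (its former sign bit).
      uint8x16_t b = vshrq_n_u8(vreinterpretq_u8_s8(v), 7);
      // usra #7 on 16-bit lanes: pack 2 bits into the low byte of each lane.
      uint16x8_t h = vsraq_n_u16(vreinterpretq_u16_u8(b), vreinterpretq_u16_u8(b), 7);
      // usra #14 on 32-bit lanes: pack 4 bits.
      uint32x4_t s = vsraq_n_u32(vreinterpretq_u32_u16(h), vreinterpretq_u32_u16(h), 14);
      // usra #28 on 64-bit lanes: pack 8 bits into the low byte of each half.
      uint64x2_t d = vsraq_n_u64(vreinterpretq_u64_u32(s), vreinterpretq_u64_u32(s), 28);
      uint8x16_t packed = vreinterpretq_u8_u64(d);
      // Byte 0 holds mask bits 0-7, byte 8 holds bits 8-15.
      return (uint16_t)(vgetq_lane_u8(packed, 0) | (vgetq_lane_u8(packed, 8) << 8));
  }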

