akirilov-arm opened Issue #2296:
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with
i8x16.bitmask
, since it is the most complicated one - the suggested lowering from the proposal is:sshr vt.16b, vs.16b, #7 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b ext vt2.16b, vt.16b, vt.16b, #8 zip1 vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
where
Vs
is the input vector register,Wd
is the output general-purpose register, andVt
andVt2
- temporary vector registers.The main issue with this instruction sequence is that it uses the horizontal reduction
ADDV
, which is quite expensive on some microarchitectures, so here's an alternative:cmlt vt.16b, vs.16b, #0 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
We also use
CMLT #0
instead ofSSHR
, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would bemaybe_input_insn(ctx, inputs[0], Opcode::Icmp)
and so on).I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a
dup vs.16b, wd
, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):
CPU core Speedup Ampere eMAG 114.71% Arm Cortex-A53 90.49% Arm Cortex-A72 114.35% Arm Cortex-A73 112.50% Arm Neoverse N1 126.66% The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1
julian-seward1 commented on Issue #2296:
Anton, thank you for making these measurements. It's a shame that it isn't a clear win on all targets. One other variant that I'd like to ask about is formation of the magic constant without doing a load. Currently what SpiderMonkey has implemented is formation of the constant like this:
// mov tmp_gpr, #0x0201 // movk tmp_gpr, #0x0804, lsl 16 // movk tmp_gpr, #0x2010, lsl 32 // movk tmp_gpr, #0x8040, lsl 48 // dup tmp_vec.2d, tmp_gpr
I remember reading somewhere (in one of the optimisation guides) that loads have a 3- or 4-cycle latency. Whereas
mov
/movk
arranged like this can dual-issue in pairs, hence producing thetmp_gpr
value in two cycles. If thedup
can execute in one or two cycles then it's not a loss, and has the advantage of not causing extra memory traffic. But I'm just guessing here. What do you think?
akirilov-arm commented on Issue #2296:
I am actually working on improving vector constant materialization right now, and in particular on replicated patterns. While I am thinking mostly about using
MOVI
/MVNI
/FMOV
(immediate form), I can add this to what I am focusing on. You are right about the load timing and the possibility of dual-issue, but that is still a chain of dependent instructions, so I'd like to get some microbenchmark data first. The other issue is that in contrast with the literal load approach it is impossible to share the constant value with other instances of the bitmask extraction operation. Yes, we don't support proper constant pools in the AArch64 backend yet (refer to #1549), so sharing is not possible anyway, but that may change in the future.My suggestion is to use
lower_constant_f128()
and then we can tackle this separately.
yurydelendik commented on Issue #2296:
My suggestion is to use lower_constant_f128() and then we can tackle this separately.
My understanding that the
lower_constant_f128
has jump instruction, which might affect the benchmark mentioned in the description. Is it possible to re-run this benchmark that includes loading frompc +
andB
?
akirilov-arm commented on Issue #2296:
In PR #2310
lower_constant_f128()
no longer generates a literal load for replicated 64-bit patterns and uses the sequence of moves above instead, which is part of the reason I suggested it (I expected to push my changes shortly afterwards, which I did).The idea is actually to concentrate the handling of constants in a single place, so that we can easily improve code generation in the future. We can also add special cases inside
lower_constant_f128()
for well-known values like the bitmask extraction ones.I intend to run some benchmarks again, but they will be slightly different because we are no longer interested in the approach to calculate the bitmask, but in the best way to materialize the magic constants (which is an orthogonal issue).
akirilov-arm edited a comment on Issue #2296:
In PR #2310
lower_constant_f128()
no longer generates a literal load for replicated 64-bit patterns and uses the sequence of moves above instead, which was part of the reason I suggested it (I expected to push my changes shortly afterwards, which I did).The idea is actually to concentrate the handling of constants in a single place, so that we can easily improve code generation in the future. We can also add special cases inside
lower_constant_f128()
for well-known values like the bitmask extraction ones.I intend to run some benchmarks again, but they will be slightly different because we are no longer interested in the approach to calculate the bitmask, but in the best way to materialize the magic constants (which is an orthogonal issue).
akirilov-arm commented on Issue #2296:
I have finally managed to run several microbenchmarks for the constant materialization. My base case is the sequence of moves above:
mov xt, #0x0201 movk xt, #0x0804, lsl #16 movk xt, #0x2010, lsl #32 movk xt, #0x8040, lsl #48 dup vd.2d, xt
The first alternative is a straight literal load, which I'll call
literal
:ldr qd, .Lmask ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
The second one is a literal load, but only of 64 bits (since we are dealing with a replicated pattern), followed by a move from the first vector element to the other (
literal_64
):ldr dd, .Lmask mov vd.d[1], vd.d[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128
And the third one is similar to the second one, but uses a replicated load instead (so it's called
literal_replicate
) and requires a temporary register - the instruction count stays the same becauseLD1R
lacks a literal addressing mode:adr xt, .Lmask ld1r { vd.2d }, [xt] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128
Each of the alternatives also has a variant in which the literal is positioned immediately after the load, so it has to be branched over - I'll add a
_branch
suffix to denote it. For example, here's whatliteral_64_branch
looks like:ldr dd, 2f b 3f 2: .byte 1, 2, 4, 8, 16, 32, 64, 128 3: mov vd.d[1], vd.d[0]
Note that it is not really possible to establish a dependency chain between loop iterations, so I have done an essentially throughput (not latency) measurement, especially on out-or-order processors.
Here are the results for various microarchitectures in terms of speedup achieved by the alternatives, i.e. higher numbers mean that the latter are faster (have higher throughput):
CPU core literal speedup literal_64 speedup literal_replicate speedup literal_branch speedup literal_64_branch speedup literal_replicate_branch speedup Ampere eMAG 109.40% 163.48% 164.41% 51.73% 53.59% 54.91% Arm Cortex-A53 249.95% 125.03% 249.72% 166.17% 110.51% 124.40% Arm Cortex-A55 150.33% 75.17% 300.00% 99.71% 66.41% 74.92% Arm Cortex-A72 200.23% 200.23% 150.00% 66.59% 57.10% 57.10% Arm Neoverse N1 346.51% 346.51% 231.01% 77.00% 65.93% 65.93% The results are based on median runtimes from 20 runs; standard errors are 0.21% or less.
Overall, the literal load approaches without a branch are a clear win, with some microarchitectures (especially the little ones like A53 and A55) preferring the variants that transfer only 64 bits from memory like
literal_replicate
(explained by the fact that A53 and A55 have a 64-bit read path from the data L1 cache). On the other hand, if execution must branch over the literal, then there is almost always a (significant) regression, so a sequence of moves is preferable.
akirilov-arm labeled Issue #2296:
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion), so this issue is going to concentrate on implementation approaches, and in particular the instruction sequences that will be used. That is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with
i8x16.bitmask
, since it is the most complicated one - the suggested lowering from the proposal is:sshr vt.16b, vs.16b, #7 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b ext vt2.16b, vt.16b, vt.16b, #8 zip1 vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
where
Vs
is the input vector register,Wd
is the output general-purpose register, andVt
andVt2
- temporary vector registers.The main issue with this instruction sequence is that it uses the horizontal reduction
ADDV
, which is quite expensive on some microarchitectures, so here's an alternative:cmlt vt.16b, vs.16b, #0 ldr qt2, .Lmask and vt.16b, vt.16b, vt2.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b addp vt.16b, vt.16b, vt.16b umov wd, vt.h[0] ... .Lmask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
We also use
CMLT #0
instead ofSSHR
, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or 1), so we could avoid the initial instruction by clever IR pattern matching (bare minimum would bemaybe_input_insn(ctx, inputs[0], Opcode::Icmp)
and so on).I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a
dup vs.16b, wd
, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):
CPU core Speedup Ampere eMAG 114.71% Arm Cortex-A53 90.49% Arm Cortex-A72 114.35% Arm Cortex-A73 112.50% Arm Neoverse N1 126.66% The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1
aqrit commented on Issue #2296:
Shift and accumulate instructions could be used instead of a constant mask.
https://stackoverflow.com/a/58381188/5691983
https://bugs.llvm.org/show_bug.cgi?id=49577
Last updated: Jan 24 2025 at 00:11 UTC