cfallin opened Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
bnjbvr labeled Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
bnjbvr labeled Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
github-actions[bot] commented on Issue #1537:
Subscribe to Label Action
cc @bnjbvr
<details>
This issue or pull request has been labeled: "cranelift"Thus the following users have been cc'd because of the following labels:
- bnjbvr: cranelift
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin labeled Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
bnjbvr commented on Issue #1537:
For what it's worth, the
x86
legalization used a slightly different sequence that ought to be reusable here. Would it make sense to make it a target-independent legalization, and use it insimple_legalize
then? (doing it at the CLIR level would allow applying optimizations onto the constants, esp. GVN could common out the constants instead of regenerating as many times as the popcnt instruction is used)
akirilov-arm commented on Issue #1537:
Note that the proper implementation for 32- and 64-bit operands may use the Neon
CNT
instruction (for example, this is what GCC and LLVM do), so a target-independent legalization may be counterproductive. However, I am not sure about 8- and 16-bit operands.
akirilov-arm commented on Issue #1537:
I decided to compare the approach using the Neon
CNT
instruction (which I'll call "vector") with the one that just used operations on general-purpose registers (i.e. the existing implementation in Cranelift, which I'll call "scalar"), so I wrote several microbenchmarks.Here's the scalar implementation for 8-bit operands:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0xf
And the vector one:
fmov st, ws cnt vt.8b, vt.8b umov wd, vt.b[0]
16-bit scalar:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0x0f0f0f0f add wd, wd, wd, lsr #8 and wd, wd, #0xff
16-bit vector:
fmov st, ws cnt vt.8b, vt.8b addp vt.8b, vt.8b, vt.8b umov wd, vt.b[0]
32-bit scalar:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0x0f0f0f0f mov wt, #0x01010101 mul wd, wd, wt lsr wd, wd, #24
32-bit vector (almost the same as the 64-bit one - only the first instruction differs):
fmov st, ws cnt vt.8b, vt.8b addv bt, vt.8b umov wd, vt.b[0]
For the 64-bit case I tried 2 scalar variants - one that doesn't use multiplication and another one that does; here's the first one:
lsr xd, xs, #1 and xd, xd, #0x5555555555555555 sub xd, xs, xd and xt, xd, #0x3333333333333333 lsr xd, xd, #2 and xd, xd, #0x3333333333333333 add xt, xd, xt add xt, xt, xt, lsr #4 and xt, xt, #0x0f0f0f0f0f0f0f0f add xt, xt, xt, lsl #8 add xt, xt, xt, lsl #16 add xt, xt, xt, lsl #32 lsr xd, xt, #56
And the second one:
lsr xd, xs, #1 and xd, xd, #0x5555555555555555 sub xd, xs, xd and xt, xd, #0x3333333333333333 lsr xd, xd, #2 and xd, xd, #0x3333333333333333 add xt, xd, xt add xt, xt, xt, lsr #4 and xt, xt, #0x0f0f0f0f0f0f0f0f mov xd, #0x0101010101010101 mul xd, xd, xt lsr xd, xd, #56
64-bit vector:
fmov dt, xs cnt vt.8b, vt.8b addv bt, vt.8b umov wd, vt.b[0]
Here are the results for various microarchitectures in terms of speedup achieved by the vector implementation, i.e. higher numbers mean that the latter is better (the raw values that are measured are actually runtimes, hence speedup):
CPU core 8-bit latency speedup 8-bit throughput speedup 16-bit latency speedup 16-bit throughput speedup 32-bit latency speedup 32-bit throughput speedup 64-bit latency speedup 64-bit throughput speedup Arm Cortex-A55 252.84% 276.55% 224.39% 241.52% 256.10% 282.28% 293.79% 321.77% Arm Neoverse N1 132.56% 348.84% 135.26% 243.04% 96.96% 197.12% 123.08% 213.46% In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs:
CPU core|Latency speedup|Throughput speedup
-|-|-
Arm Cortex-A55|111.81%|109.10%
Arm Neoverse N1|112.60%|110.45%The results demonstrate quite clearly that the vector variant is better than the scalar one; the only regression is in the 32-bit latency value. While the alternative 64-bit scalar implementation is a definite improvement, it still lags behind the vector one.
akirilov-arm edited a comment on Issue #1537:
I decided to compare the approach using the Neon
CNT
instruction (which I'll call "vector") with the one that just used operations on general-purpose registers (i.e. the existing implementation in Cranelift, which I'll call "scalar"), so I wrote several microbenchmarks.Here's the scalar implementation for 8-bit operands:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0xf
And the vector one:
fmov st, ws cnt vt.8b, vt.8b umov wd, vt.b[0]
16-bit scalar:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0x0f0f0f0f add wd, wd, wd, lsr #8 and wd, wd, #0xff
16-bit vector:
fmov st, ws cnt vt.8b, vt.8b addp vt.8b, vt.8b, vt.8b umov wd, vt.b[0]
32-bit scalar:
lsr wd, ws, #1 and wd, wd, #0x55555555 sub wd, ws, wd and wt, wd, #0x33333333 lsr wd, wd, #2 and wd, wd, #0x33333333 add wd, wd, wt add wd, wd, wd, lsr #4 and wd, wd, #0x0f0f0f0f mov wt, #0x01010101 mul wd, wd, wt lsr wd, wd, #24
32-bit vector (almost the same as the 64-bit one - only the first instruction differs):
fmov st, ws cnt vt.8b, vt.8b addv bt, vt.8b umov wd, vt.b[0]
For the 64-bit case I tried 2 scalar variants - one that doesn't use multiplication and another one that does; here's the first one:
lsr xd, xs, #1 and xd, xd, #0x5555555555555555 sub xd, xs, xd and xt, xd, #0x3333333333333333 lsr xd, xd, #2 and xd, xd, #0x3333333333333333 add xt, xd, xt add xt, xt, xt, lsr #4 and xt, xt, #0x0f0f0f0f0f0f0f0f add xt, xt, xt, lsl #8 add xt, xt, xt, lsl #16 add xt, xt, xt, lsl #32 lsr xd, xt, #56
And the second one:
lsr xd, xs, #1 and xd, xd, #0x5555555555555555 sub xd, xs, xd and xt, xd, #0x3333333333333333 lsr xd, xd, #2 and xd, xd, #0x3333333333333333 add xt, xd, xt add xt, xt, xt, lsr #4 and xt, xt, #0x0f0f0f0f0f0f0f0f mov xd, #0x0101010101010101 mul xd, xd, xt lsr xd, xd, #56
64-bit vector:
fmov dt, xs cnt vt.8b, vt.8b addv bt, vt.8b umov wd, vt.b[0]
Here are the results for various microarchitectures in terms of speedup achieved by the vector implementation, i.e. higher numbers mean that the latter is better:
CPU core 8-bit latency speedup 8-bit throughput speedup 16-bit latency speedup 16-bit throughput speedup 32-bit latency speedup 32-bit throughput speedup 64-bit latency speedup 64-bit throughput speedup Arm Cortex-A55 252.84% 276.55% 224.39% 241.52% 256.10% 282.28% 293.79% 321.77% Arm Neoverse N1 132.56% 348.84% 135.26% 243.04% 96.96% 197.12% 123.08% 213.46% In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs:
CPU core|Latency speedup|Throughput speedup
-|-|-
Arm Cortex-A55|111.81%|109.10%
Arm Neoverse N1|112.60%|110.45%The results are based on median runtimes from 20 runs; standard errors were 0.02% or less.
The data demonstrates quite clearly that the vector variant is better than the scalar one; the only regression is in the 32-bit latency value. While the alternative 64-bit scalar implementation is a definite improvement, it still lags behind the vector one.
cfallin closed Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
cfallin commented on Issue #1537:
Thanks very much for this very detailed benchmarking!
Last updated: Jan 24 2025 at 00:11 UTC