cfallin opened Issue #1537:
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
bnjbvr commented on Issue #1537:
For what it's worth, the x86 legalization used a slightly different sequence that ought to be reusable here. Would it make sense to make it a target-independent legalization and use it in simple_legalize then? (Doing it at the Cranelift IR level would allow applying optimizations to the constants; in particular, GVN could common out the constants instead of regenerating them every time the popcnt instruction is used.)
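As a rough illustration of the idea (this sketch is mine, not code from the thread or from Cranelift's x86 legalization), a target-independent expansion of a 32-bit popcnt boils down to the classic constant-mask sequence; the masks are exactly the ones that reappear in the scalar listings below, which is where GVN of the constants would pay off:
```rust
// Sketch of the SWAR ("bit-twiddling") popcount that a target-independent
// legalization could expand a 32-bit `popcnt` into. This is the classic
// constant-mask algorithm, not Cranelift's actual legalization code.
fn popcnt32_swar(x: u32) -> u32 {
    // Count bits within each 2-bit field.
    let x = x - ((x >> 1) & 0x5555_5555);
    // Sum adjacent 2-bit counts into 4-bit fields.
    let x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333);
    // Sum adjacent 4-bit counts into 8-bit fields.
    let x = (x + (x >> 4)) & 0x0f0f_0f0f;
    // Multiply to accumulate all byte counts into the top byte, then shift down.
    x.wrapping_mul(0x0101_0101) >> 24
}

fn main() {
    assert_eq!(popcnt32_swar(0xffff_ffff), 32);
    assert_eq!(popcnt32_swar(0), 0);
    assert_eq!(popcnt32_swar(0b1011), 3);
}
```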
akirilov-arm commented on Issue #1537:
Note that the proper implementation for 32- and 64-bit operands may use the Neon CNT instruction (for example, this is what GCC and LLVM do), so a target-independent legalization may be counterproductive. However, I am not sure about 8- and 16-bit operands.
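For reference (my example, not part of the original comment): in Rust, u64::count_ones() goes through the same LLVM lowering being referred to, so with Neon available it should compile to the cnt-based sequence shown later in this thread rather than to the scalar mask-and-shift code:
```rust
// Minimal illustration (assumption: building for an aarch64 target with
// Neon enabled, which is the default): LLVM lowers this popcount via the
// vector `cnt` instruction, which is the point made about GCC/LLVM above.
pub fn popcnt64(x: u64) -> u32 {
    x.count_ones()
}

fn main() {
    assert_eq!(popcnt64(0xdead_beef_dead_beef), 48);
}
```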
akirilov-arm commented on Issue #1537:
I decided to compare the approach using the Neon CNT instruction (which I'll call "vector") with the one that just uses operations on general-purpose registers, i.e. the existing implementation in Cranelift (which I'll call "scalar"), so I wrote several microbenchmarks.

Here's the scalar implementation for 8-bit operands:
```
lsr wd, ws, #1
and wd, wd, #0x55555555
sub wd, ws, wd
and wt, wd, #0x33333333
lsr wd, wd, #2
and wd, wd, #0x33333333
add wd, wd, wt
add wd, wd, wd, lsr #4
and wd, wd, #0xf
```
And the vector one:
```
fmov st, ws
cnt vt.8b, vt.8b
umov wd, vt.b[0]
```
16-bit scalar:
```
lsr wd, ws, #1
and wd, wd, #0x55555555
sub wd, ws, wd
and wt, wd, #0x33333333
lsr wd, wd, #2
and wd, wd, #0x33333333
add wd, wd, wt
add wd, wd, wd, lsr #4
and wd, wd, #0x0f0f0f0f
add wd, wd, wd, lsr #8
and wd, wd, #0xff
```
16-bit vector:
```
fmov st, ws
cnt vt.8b, vt.8b
addp vt.8b, vt.8b, vt.8b
umov wd, vt.b[0]
```
32-bit scalar:
```
lsr wd, ws, #1
and wd, wd, #0x55555555
sub wd, ws, wd
and wt, wd, #0x33333333
lsr wd, wd, #2
and wd, wd, #0x33333333
add wd, wd, wt
add wd, wd, wd, lsr #4
and wd, wd, #0x0f0f0f0f
mov wt, #0x01010101
mul wd, wd, wt
lsr wd, wd, #24
```
32-bit vector (almost the same as the 64-bit one - only the first instruction differs):
```
fmov st, ws
cnt vt.8b, vt.8b
addv bt, vt.8b
umov wd, vt.b[0]
```
For the 64-bit case I tried 2 scalar variants - one that doesn't use multiplication and another one that does; here's the first one:
```
lsr xd, xs, #1
and xd, xd, #0x5555555555555555
sub xd, xs, xd
and xt, xd, #0x3333333333333333
lsr xd, xd, #2
and xd, xd, #0x3333333333333333
add xt, xd, xt
add xt, xt, xt, lsr #4
and xt, xt, #0x0f0f0f0f0f0f0f0f
add xt, xt, xt, lsl #8
add xt, xt, xt, lsl #16
add xt, xt, xt, lsl #32
lsr xd, xt, #56
```
And the second one:
```
lsr xd, xs, #1
and xd, xd, #0x5555555555555555
sub xd, xs, xd
and xt, xd, #0x3333333333333333
lsr xd, xd, #2
and xd, xd, #0x3333333333333333
add xt, xd, xt
add xt, xt, xt, lsr #4
and xt, xt, #0x0f0f0f0f0f0f0f0f
mov xd, #0x0101010101010101
mul xd, xd, xt
lsr xd, xd, #56
```
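To make it easier to see what the two 64-bit scalar sequences compute, here is a plain-Rust rendering of both variants (my sketch, not code from the thread or from Cranelift); xs/xd/xt in the listings map to the input, destination and temporary below:
```rust
// Plain-Rust rendering of the two 64-bit scalar sequences above (a sketch,
// not Cranelift code): both reduce the bit count with the usual 2-/4-bit
// field sums and differ only in how the final per-byte counts are summed.
fn popcnt64_shift_add(x: u64) -> u64 {
    let x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    let t = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    let t = (t + (t >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Variant 1: accumulate the byte counts with shifted adds...
    let t = t.wrapping_add(t << 8);
    let t = t.wrapping_add(t << 16);
    let t = t.wrapping_add(t << 32);
    t >> 56
}

fn popcnt64_mul(x: u64) -> u64 {
    let x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    let t = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    let t = (t + (t >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Variant 2: ...or sum them with a single multiply by 0x0101...01.
    t.wrapping_mul(0x0101_0101_0101_0101) >> 56
}

fn main() {
    for x in [0u64, 1, u64::MAX, 0xdead_beef_dead_beef] {
        assert_eq!(popcnt64_shift_add(x), u64::from(x.count_ones()));
        assert_eq!(popcnt64_mul(x), u64::from(x.count_ones()));
    }
}
```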
64-bit vector:
```
fmov dt, xs
cnt vt.8b, vt.8b
addv bt, vt.8b
umov wd, vt.b[0]
```
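And for comparison, here is roughly the same 64-bit vector sequence expressed with Neon intrinsics from std::arch::aarch64 (again my sketch, not from the thread); vcreate_u8, vcnt_u8 and vaddv_u8 correspond approximately to the fmov, cnt and addv/umov steps:
```rust
// Sketch of the 64-bit "vector" sequence using AArch64 Neon intrinsics:
// move the value into a vector register, count bits per byte with CNT,
// then sum the eight byte counts with ADDV. Assumes an aarch64 target.
#[cfg(target_arch = "aarch64")]
fn popcnt64_neon(x: u64) -> u32 {
    use std::arch::aarch64::{vaddv_u8, vcnt_u8, vcreate_u8};
    // Safety: these intrinsics only require the `neon` target feature,
    // which is mandatory on aarch64.
    unsafe {
        let v = vcreate_u8(x);       // fmov dt, xs
        let counts = vcnt_u8(v);     // cnt  vt.8b, vt.8b
        u32::from(vaddv_u8(counts))  // addv bt, vt.8b + umov wd, vt.b[0]
    }
}

#[cfg(target_arch = "aarch64")]
fn main() {
    assert_eq!(popcnt64_neon(u64::MAX), 64);
    assert_eq!(popcnt64_neon(0b1011), 3);
}

#[cfg(not(target_arch = "aarch64"))]
fn main() {}
```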
Here are the results for various microarchitectures, expressed as the speedup achieved by the vector implementation, i.e. higher numbers mean that the vector variant is better (the raw measured values are runtimes, hence speedup):

CPU core|8-bit latency speedup|8-bit throughput speedup|16-bit latency speedup|16-bit throughput speedup|32-bit latency speedup|32-bit throughput speedup|64-bit latency speedup|64-bit throughput speedup
-|-|-|-|-|-|-|-|-
Arm Cortex-A55|252.84%|276.55%|224.39%|241.52%|256.10%|282.28%|293.79%|321.77%
Arm Neoverse N1|132.56%|348.84%|135.26%|243.04%|96.96%|197.12%|123.08%|213.46%

In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs:

CPU core|Latency speedup|Throughput speedup
-|-|-
Arm Cortex-A55|111.81%|109.10%
Arm Neoverse N1|112.60%|110.45%

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less.

The results demonstrate quite clearly that the vector variant is better than the scalar one; the only regression is in the 32-bit latency value. While the alternative 64-bit scalar implementation is a definite improvement, it still lags behind the vector one.
cfallin closed Issue #1537.
cfallin commented on Issue #1537:
Thanks very much for this very detailed benchmarking!