Stream: cranelift

Topic: `csdb` overhead on M1 chips (AArch64)


view this post on Zulip Martin Fink (Oct 25 2023 at 12:14):

(Not sure if this is the right channel to post this)

I've done a few experiments on an M1 Mac Mini and found some pretty high performance overheads caused by the csdb instructions generated when lowering select_spectre_guard.
This overhead only exists on the performance cores, not on the efficiency ones. I believe the reason is that the efficiency cores speculate far less aggressively than the performance ones (or perhaps not at all), but I am by no means an expert. :)

I found it pretty surprising that the difference is this large, and since I couldn't find anything else about this on the internet, I thought I'd share my findings. If there is an issue with how I ran these benchmarks, please let me know!

I've benchmarked an unmodified build (commit 270e92225) against one with the following diff applied:

diff --git a/cranelift/codegen/src/isa/aarch64/inst/emit.rs b/cranelift/codegen/src/isa/aarch64/inst/emit.rs
index 43bdd4a3e..c2519885e 100644
--- a/cranelift/codegen/src/isa/aarch64/inst/emit.rs
+++ b/cranelift/codegen/src/isa/aarch64/inst/emit.rs
@@ -1820,7 +1820,7 @@ impl MachInstEmit for Inst {
                 sink.put4(enc_dmb_ish()); // dmb ish
             }
             &Inst::Csdb {} => {
-                sink.put4(0xd503229f);
+                //sink.put4(0xd503229f);
             }
             &Inst::FpuMove64 { rd, rn } => {
                 let rd = allocs.next_writable(rd);

The tests were run on a Mac Mini with an M1 chip running Asahi Linux. I can also check whether there is a difference when running the same thing on macOS, but I believe there shouldn't be.
Before running the tests, I set the governor to performance to force all cores to their maximum frequency.

Performance cores:

taskset --cpu-list 4-7 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags= benchmarks/spidermonkey/benchmark.wasm`

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 8027785.90 ± 160961.98 (confidence = 99%)

  wasmtime/target/release/libwasmtime_bench_api.so is 3.16x to 3.25x faster than tmp/libwasmtime_bench_api.so!

  [11320025 11663170.40 12242229] tmp/libwasmtime_bench_api.so
  [3526618 3635384.50 3832395] wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [2084 2519.50 3329] tmp/libwasmtime_bench_api.so
  [2054 2389.60 3090] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [50306278 52021344.65 55069141] tmp/libwasmtime_bench_api.so
  [50669042 52032507.05 54383128] wasmtime/target/release/libwasmtime_bench_api.so

On the efficiency cores, there is no difference:

taskset --cpu-list 0-3 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
    Finished dev [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags= benchmarks/spidermonkey/benchmark.wasm`

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [5216 6214.50 9408] tmp/libwasmtime_bench_api.so
  [5274 6807.20 10777] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [107328450 109423075.40 114775754] tmp/libwasmtime_bench_api.so
  [108226908 110536060.70 113376486] wasmtime/target/release/libwasmtime_bench_api.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [12505682 12657196.70 12765194] tmp/libwasmtime_bench_api.so
  [12477426 12666983.30 12849489] wasmtime/target/release/libwasmtime_bench_api.so

With dynamic bounds checks, the overhead is even bigger:

Performance cores:

taskset --cpu-list 4-7 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0" benchmarks/spidermonkey/benchmark.wasm
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so '--engine-flags=-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0' benchmarks/spidermonkey/benchmark.wasm`

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 101879699.00 ± 1141736.09 (confidence = 99%)

  wasmtime/target/release/libwasmtime_bench_api.so is 16.46x to 16.81x faster than tmp/libwasmtime_bench_api.so!

  [107106675 108397620.75 112636143] tmp/libwasmtime_bench_api.so
  [6452171 6517921.75 6757069] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2268916.25 ± 1358278.96 (confidence = 99%)

  tmp/libwasmtime_bench_api.so is 1.02x to 1.06x faster than wasmtime/target/release/libwasmtime_bench_api.so!

  [57228321 59363510.15 62661925] tmp/libwasmtime_bench_api.so
  [58325815 61632426.40 64403315] wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [2214 2592.80 3968] tmp/libwasmtime_bench_api.so
  [2162 2441.25 3405] wasmtime/target/release/libwasmtime_bench_api.so

Efficiency cores:

taskset --cpu-list 0-3 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0" benchmarks/spidermonkey/benchmark.wasm
    Finished dev [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so '--engine-flags=-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0' benchmarks/spidermonkey/benchmark.wasm`

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 489379.55 ± 43314.03 (confidence = 99%)

  wasmtime/target/release/libwasmtime_bench_api.so is 1.02x to 1.02x faster than tmp/libwasmtime_bench_api.so!

  [23615521 23708462.10 23789488] tmp/libwasmtime_bench_api.so
  [23161583 23219082.55 23330794] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [127294091 131666671.05 137908998] tmp/libwasmtime_bench_api.so
  [129037489 133240513.70 137623006] wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [5316 6323.80 10262] tmp/libwasmtime_bench_api.so
  [5516 6326.45 9558] wasmtime/target/release/libwasmtime_bench_api.so

view this post on Zulip Alex Crichton (Oct 25 2023 at 15:12):

Fascinating! Thanks for taking the time to collect these numbers and test here.

I dug a bit and found some background information on csdb here and here. I'm not a spectre or aarch64 expert myself, though, so all I can do is act as a go-between with those who are.

In #4555, though, the conclusion at the time was that performance wasn't much affected, but that testing was likely not done on an M1 chip, which may exhibit different performance characteristics.

One thing I've personally found a bit odd is that what you're probably running into here (in the default mode) is the spectre guards for table bounds checks (and presumably spidermonkey does a lot of call_indirect). Our table bounds checks have both an explicit bounds check leading to a trap plus a select_spectre_guard. While this probably achieves spectre hardening, it also feels a bit like overkill IMO and is possibly something we could improve.
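As a rough illustration of the doubled-up pattern described here (plain Rust, not actual Cranelift output; the function, element size, and layout are invented for this sketch):

```rust
// Sketch of the current table bounds-check pattern: an explicit check that
// traps, followed by a redundant compare feeding a select (the
// select_spectre_guard), so a misspeculated fall-through still computes a
// safe address (the table base) instead of an attacker-controlled one.
fn table_elem_addr(base: usize, len: u32, index: u32) -> Option<usize> {
    // explicit bounds check leading to a trap
    if index >= len {
        return None; // trap path
    }
    // address calculation: base + index * 8 (8-byte elements assumed)
    let addr = base + (index as usize) * 8;
    // spectre guard: the same comparison again, written here as an `if`
    // but standing in for a branch-free cmp + csel in the real lowering
    let guarded = if index >= len { base } else { addr };
    Some(guarded)
}
```

The point is that the two `index >= len` comparisons are redundant at the architectural level; only the select-based one is needed for the speculative path.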


view this post on Zulip Alex Crichton (Oct 25 2023 at 15:19):

Also, I'm curious: where did you get taskset from on macOS? I've got an M2 and am curious to see whether the situation is any better with that or if it's the same

view this post on Zulip Alex Crichton (Oct 25 2023 at 15:23):

Running sightglass locally yields:

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 282449.80 ± 76316.41 (confidence = 99%)

  no-csdb.dylib is 1.02x to 1.03x faster than csdb.dylib!

  [10772130 10871102.65 11116720] csdb.dylib
  [10483782 10588652.85 10753478] no-csdb.dylib

although this isn't using taskset for pinning. I'm not sure if it would by default put everything on the performance cores.

view this post on Zulip Chris Fallin (Oct 25 2023 at 16:12):

Wow, that's a wild perf delta, @Martin Fink! I was indeed worried about this when the original PR came up; @Anton Kirilov benchmarked on an Ampere Altra and found no impact

view this post on Zulip Chris Fallin (Oct 25 2023 at 16:13):

re: taskset, I wonder, are you running Linux (Asahi or similar) on your M1 by chance? Could that have an impact (different system config register settings, etc.)?

view this post on Zulip Anton Kirilov (Oct 25 2023 at 16:56):

In retrospect I should have worded my statement in issue #1032 a bit better, though in pull request #4555 I got it right, I think. Also, just to be clear, I am unable to comment on any microarchitectural details of any processor that has not been designed by Arm (beyond any publicly accessible documentation, if any exists), including how susceptible it is to Spectre-style attacks.

view this post on Zulip Martin Fink (Oct 25 2023 at 19:43):

Hi! Yes, this was running on Asahi. I just tried the same benchmark again on my personal MacBook (an M1 Pro, this time with macOS) and got essentially the same results.
AFAIK you can't pin a process to the performance cores on macOS, but you can push it to the efficiency cores using taskpolicy -c background command args.... The macOS scheduler should prefer the performance cores for this kind of workload anyway, I believe.

Without taskpolicy:

cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags= benchmarks/spidermonkey/benchmark.wasm`

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 7875436.90 ± 76393.16 (confidence = 99%)

  wasmtime/target/release/libwasmtime_bench_api.dylib is 3.15x to 3.20x faster than tmp/libwasmtime_bench_api.dylib!

  [11307195 11495795.05 11729345] tmp/libwasmtime_bench_api.dylib
  [3532096 3620358.15 3726905] wasmtime/target/release/libwasmtime_bench_api.dylib

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [2262 2763.85 4202] tmp/libwasmtime_bench_api.dylib
  [2274 2556.15 3046] wasmtime/target/release/libwasmtime_bench_api.dylib

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [147601278 149324679.95 151225801] tmp/libwasmtime_bench_api.dylib
  [146173673 149144740.10 155132568] wasmtime/target/release/libwasmtime_bench_api.dylib

With taskpolicy:

taskpolicy -c background cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
    Finished dev [unoptimized + debuginfo] target(s) in 2.16s
     Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags= benchmarks/spidermonkey/benchmark.wasm`

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [6124 27213.70 111028] tmp/libwasmtime_bench_api.dylib
  [5381 40362.40 295461] wasmtime/target/release/libwasmtime_bench_api.dylib

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [23995605 30439809.15 38672919] tmp/libwasmtime_bench_api.dylib
  [24360364 28723230.40 36000666] wasmtime/target/release/libwasmtime_bench_api.dylib

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [401455183 549387423.20 1355357035] tmp/libwasmtime_bench_api.dylib
  [394497975 538719528.60 858583052] wasmtime/target/release/libwasmtime_bench_api.dylib

view this post on Zulip Alex Crichton (Oct 25 2023 at 19:45):

Whoa! Without taskpolicy you're showing a 3x slowdown on M1, but on an M2 I got a 2% slowdown, so now I'm curious whether someone can reproduce that, and whether this is perhaps something that changed with the M2

view this post on Zulip Martin Fink (Oct 25 2023 at 19:46):

Yes, and if I turn on dynamic bounds checks the slowdown gets worse (as there are far more spectre guards)

view this post on Zulip Alex Crichton (Oct 25 2023 at 19:48):

yeah that part definitely makes sense

view this post on Zulip Alex Crichton (Oct 25 2023 at 19:54):

FWIW this wasm:

(module
  (table 1 funcref)
  (func (param i32)
    (call_indirect (local.get 0))
  )
)

has this codegen for me locally (annotated)

                        ;; frame setup
       0: d503237f      pacibsp
       4: a9bf7bfd      stp     x29, x30, [sp, #-0x10]!
       8: 910003fd      mov     x29, sp
                        ;; stack overflow check
       c: f8408010      ldur    x16, [x0, #0x8]
      10: f8400210      ldur    x16, [x16]
      14: eb3063ff      cmp     sp, x16
      18: 54000543      b.lo    0xc0 <wasm[0]::function[0]+0xc0>
                        ;; register spills
      1c: f81f0ff3      str     x19, [sp, #-0x10]!
                        ;; branch-to-trap if table index is out of bounds
      20: aa0003e6      mov     x6, x0
      24: b9405005      ldr     w5, [x0, #0x50]
      28: 6b05005f      cmp     w2, w5
      2c: 54000482      b.hs    0xbc <wasm[0]::function[0]+0xbc>
                        ;; calculate table element address
      30: aa0603e0      mov     x0, x6
      34: f9402407      ldr     x7, [x0, #0x48]
      38: 2a0203e8      mov     w8, w2
      3c: 8b080ce8      add     x8, x7, x8, lsl #3
                        ;; spectre guard to use table address or table base
      40: 6b05005f      cmp     w2, w5
      44: 9a8820e8      csel    x8, x7, x8, hs
      48: d503229f      csdb
                        ;; load the table element, branch to init if it's not
                        ;; already initialized
      4c: f9400109      ldr     x9, [x8]
      50: 927ff920      and     x0, x9, #0xfffffffffffffffe
      54: b4000209      cbz     x9, 0x94 <wasm[0]::function[0]+0x94>
      58: aa0603f3      mov     x19, x6
                        ;; check if table element is null
      5c: b40002c0      cbz     x0, 0xb4 <wasm[0]::function[0]+0xb4>
                        ;; load metadata from the table
      60: b940180d      ldr     w13, [x0, #0x18]
      64: f940226e      ldr     x14, [x19, #0x40]
      68: aa1303e6      mov     x6, x19
                        ;; type-check and trap if function doesn't have the
                        ;; desired type
      6c: b94005ce      ldr     w14, [x14, #0x4]
      70: 6b0e01bf      cmp     w13, w14
      74: 54000221      b.ne    0xb8 <wasm[0]::function[0]+0xb8>
                        ;; perform the actual call
      78: f9400802      ldr     x2, [x0, #0x10]
      7c: f9401000      ldr     x0, [x0, #0x20]
      80: aa0603e1      mov     x1, x6
      84: d63f0040      blr     x2
                        ;; return
      88: f84107f3      ldr     x19, [sp], #0x10
      8c: a8c17bfd      ldp     x29, x30, [sp], #0x10
      90: d65f0fff      retab

                        ;; lazy initialization of table entries
      94: aa0603e0      mov     x0, x6
      98: f9401c0a      ldr     x10, [x0, #0x38]
      9c: f940254a      ldr     x10, [x10, #0x48]
      a0: 52800001      mov     w1, #0x0                // =0
      a4: aa0603f3      mov     x19, x6
      a8: aa1303e0      mov     x0, x19
      ac: d63f0140      blr     x10
      b0: 17ffffeb      b       0x5c <wasm[0]::function[0]+0x5c>

                        ;; various traps
      b4: 0000c11f      udf     #0xc11f
      b8: 0000c11f      udf     #0xc11f
      bc: 0000c11f      udf     #0xc11f
      c0: 0000c11f      udf     #0xc11f

view this post on Zulip Alex Crichton (Oct 25 2023 at 19:55):

so you can see our call_indirect is not exactly optimal as-is and generates a lot of code

view this post on Zulip Alex Crichton (Oct 25 2023 at 19:55):

but it's a bit odd to me that there are two length comparisons where we should probably only have one with spectre things enabled

view this post on Zulip Martin Fink (Oct 25 2023 at 20:00):

Yeah, that's how I found this oddity -- I was trying to look at the call_indirect code, looking for a way to optimize it. But even after removing the signature check completely, there was no performance difference (because csdb ate all the gains one might get from removing signature checks entirely).

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:01):

if you haven't already found it, -Ccranelift-enable_table_access_spectre_mitigation=n will disable the csdb bits

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:01):

basically disabling the enable_table_access_spectre_mitigation setting in Cranelift

view this post on Zulip Martin Fink (Oct 25 2023 at 20:01):

Yes, with that I don't have the slowdown

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:02):

you may also be interested in this issue which is probably a good starting point for optimizing this

Similar to what we did with heaps in #5283

view this post on Zulip fitzgen (he/him) (Oct 25 2023 at 20:02):

Alex Crichton said:

but it's a bit odd to me that there are two length comparisons where we should probably only have one with spectre things enabled

we did the optimization to dedupe bounds checks and spectre guards for memories, but never did that for tables

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:02):

since some of the optimizations are best done at the producer level rather than in CLIF; for example, when a table's min and max size are the same, we can avoid a dynamic load of the table's length since it's statically known
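For example, a module along these lines declares a table whose minimum and maximum sizes are equal, so its length is a compile-time constant (hypothetical illustration, mirroring the earlier example):

```wat
(module
  ;; table with min = max = 1: the length can never change, so the
  ;; bounds-check limit is statically known and need not be loaded
  (table 1 1 funcref)
  (func (param i32)
    (call_indirect (local.get 0))
  )
)
```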

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:03):

fitzgen (he/him) said:

Alex Crichton said:

but it's a bit odd to me that there are two length comparisons where we should probably only have one with spectre things enabled

we did the optimization to dedupe bounds checks and spectre guards for memories, but never did that for tables

oh right!

view this post on Zulip fitzgen (he/him) (Oct 25 2023 at 20:03):

ah yeah that issue would be a pre-req for doing the same optimization for table accesses

view this post on Zulip Martin Fink (Oct 25 2023 at 20:05):

If you don't mind, I might have some free time in the coming days/weeks to look at this, since it sounds like an interesting issue. (And I'm a bit familiar with the pieces that compute indirect calls and table addresses now :) )

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:07):

that'd be great! Many of us can help out with the issue, so don't hesitate to ask questions here, on GitHub, or elsewhere

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:09):

I think this conversation may be getting a bit sidetracked by the above -- the main slowdown is the csdb, so we should work out a way to avoid that if we can, right?

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:09):

I just confirmed on my M1 on macOS that I see a ~2x speedup on spidermonkey.wasm when removing csdb

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:09):

so my guess is that it's doing a full pipeline flush / barrier

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:10):

I know the aarch64 spec says that in theory we need to future-proof against value speculation with csdb but if it carries this much penalty on M1, maybe we should reconsider...

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:12):

oh sorry yeah I didn't mean to sidetrack too much, but wanted to pull on the thread of "let's optimize call_indirect" a bit

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:13):

given the data on M2 it may mean that we could consider a possible M1-specific rule along these lines

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:13):

it seems that a concrete change could be: put the use of the instruction under another flag, call it spectre_aarch64_value_speculation_barrier or something similar, and have it off by default

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:13):

although with this just coming out today we should probably be somewhat certain heh

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:13):

if this only affects M1 though should it be off-by-default?

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:14):

or on-by-default, the thing I was about to say is we should then have an allowlist (or denylist, depending) of uarches

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:14):

if we know it's slow just on M1, let's turn it off on M1, with the reasoning that we suspect M1 doesn't do value speculation

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:15):

(linked paper seems to be about timer resolution, that's orthogonal to speculation kinds I guess?)

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:16):

oh no it's all about speculation, they showed that without low-resolution timers they successfully launched spectre stuff on all macOS M1/M2 devices

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:16):

they achieved an arbitrary read of any 64-bit address within the process via speculation

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:17):

right, the key contribution seems to be about timer resolution afaict though

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:18):

(it's already known that M1/M2/... speculate, since they're OoO; they seem to measure the depth, but it was already known that they have big OoO windows, e.g. the M1 has, I think, 512 instructions in the ROB)

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:19):

dougallj's uarch tables seem to confirm that CSDB is a full fence: 27-cycle latency (https://dougallj.github.io/applecpu/firestorm-int.html)

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:26):

I'm not able to find anything that better documents the limits of value speculation in the M1. The specific guarantee we'd need to be 100% sure here is that the M1 doesn't speculate the flag values produced by the compare that feeds the csel. What I do know is: (i) value speculation has historically been considered extremely difficult and complex, and not likely to be deployed; (ii) where it would be deployed, it would likely be used to predict the result of a long-latency op, such as a cache-missing load, early (at least that was the original academic paper's proposal), but the address we're bounds-checking is usually not that

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:28):

again normally I'd say we should follow ISA spec to the letter, but a 2x slowdown on SpiderMonkey for a theoretical issue that likely doesn't exist on this uarch is... hard to accept :-)

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:31):

to make sure I understand, the worry is that csel speculates the result not based on the actual inputs but on something conjured out of thin air?

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:31):

e.g. it may not speculate based on the two possible inputs, but actually something totally different?

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:31):

(I'm naive and know very little about mitigating spectre apart from the high-level bits)

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:34):

right, say there is a value predictor that fires when an instruction's inputs are not ready. It says "execute this csel as if the flags are <no overflow>", the load speculatively executes, and we affect the cache. The value-speculation idea is that we do that, remember what we speculated, and when the real input resolves, keep the work we did if we got it right

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:34):

the original paper on value speculation was, iirc, from 1996; it's a neat idea, but afaik no one has actually built it

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:35):

csdb is described as "future proofing", which reading between the lines a bit (opinion follows!) tells me the ISA folks want it in software now so they have the freedom to do these tricks later

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:35):

wow that's wild

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:36):

it definitely makes sense to me then at the very least that if it's purely future-proofing then we can probably safely say no new M1s will ever be designed so we can disable it there

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:36):

at least on x86, everyone seems to have tacitly accepted that it's not done, at least for cmovs (or maybe there was a statement about this from Intel that I missed?), so we can use a cmp+cmov as a mask on the misspeculated path
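The cmp+cmov masking idiom can be sketched like this (illustrative Rust; the arithmetic mask stands in for the conditional move, and the function is invented for this sketch):

```rust
// Branch-free masking: when the index is out of bounds, the mask forces the
// access to element 0 instead of the attacker-chosen offset, so even a
// misspeculated path cannot read out of bounds. Assumes a non-empty slice;
// a real lowering pairs this with the architectural bounds check and trap.
fn masked_load(data: &[u8], index: usize) -> u8 {
    let in_bounds = (index < data.len()) as usize; // 1 or 0
    let mask = in_bounds.wrapping_neg(); // all-ones or all-zeros
    data[index & mask]
}
```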

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:36):

rephrased: disabling on M1 makes sense if we know M1 doesn't do value speculation, disabling elsewhere is still somewhat iffy but perhaps not necessary if it's not much of a perf impact

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:36):

interesting yeah

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:37):

do JS engines use csdb, do you know?

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:37):

and, right, the extra evidence I'd offer is that the barrier needed to protect against it is expensive, so they didn't optimize it, so it probably isn't needed

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:37):

b/c if they don't I doubt that us doing so would really move the needle

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:37):

interesting Q, I was gonna check that next! Apparently SM recently removed all spectre hardening because they now put trust domains in separate processes (FF finished Project Fission)

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:37):

but historically...

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:37):

oh damn

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:38):

I wonder if we should disable spectre stuff by default then and document accordingly?

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:38):

anyone who runs multi-tenant in a single process probably still cares!

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:38):

it is a valid question for the CLI though

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:39):

(and I guess there are interesting questions around component model + multi-module too -- if you run mutually mistrusting modules in a component you probably still want the hardening)

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:41):

good point yeah

view this post on Zulip Alex Crichton (Oct 25 2023 at 20:41):

I suppose this means that we basically can't use browsers as "prior art" if they're removing spectre things

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:47):

so I'm looking through SM's Wasm implementation prior to their removal of spectre-protection-by-default (bug 1837602, Jun 23 of this year, commit b7d1824da842b37c4e0089ded35f9fb7b089fa60 in the github.com/mozilla/gecko-dev Git mirror of the Mercurial repo)


view this post on Zulip Chris Fallin (Oct 25 2023 at 20:47):

and... surprisingly enough, I'm not able to find anything, at least in WasmBaselineCompile.cpp

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:48):

(masm.speculationBarrier() is the magical incantation that gives csdb on aarch64; it's still there in some hostcalls)

view this post on Zulip Afonso Bordado (Oct 25 2023 at 20:50):

Would something like what @Anton Kirilov mentioned in #1032 also solve this? Setting a software context id when transitioning to the guest, and removing the csdb?

(I'm not sure if they solve the same spectre issue, or if it's a different one. I'm not too familiar with any of this)

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:54):

that's a useful feature (separate predictor contexts), but unfortunately it looks like it relies on a fairly recent ISA extension...

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:55):

given that I'm not finding this in FF's codebase, I'd support removing csdb on M1 on that basis

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:55):

if it's cheap/free as it is on M2, why not, we can future-proof

view this post on Zulip Chris Fallin (Oct 25 2023 at 20:55):

(about to start another run of meetings so there's my time-boxed investigation result :-) )

view this post on Zulip Till Schneidereit (Oct 26 2023 at 10:05):

if it's cheap/free as it is on M2, why not, we can future-proof

I guess a question is how high the risk is of other AArch64 implementations behaving like the M1 does here, rather than like the M2. If there's some risk that more CPUs are negatively affected the way the M1 is, then I think we should disable this by default and only enable it on CPUs that we know to be OK.

view this post on Zulip Anton Kirilov (Oct 26 2023 at 11:07):

@Afonso Bordado The software context number is a mitigation for a different Spectre variant (Spectre-V2 instead of Spectre-V1), which in the Wasm case would involve cross-module instance attacks, I think. In this case any potential attack happens within a single module, so the context ID would stay the same. And yes, it requires hardware support, but crucially kernel support as well.

view this post on Zulip Anton Kirilov (Oct 26 2023 at 11:20):

Also, to play devil's advocate - if the CSDB instruction is expensive on a certain microarchitecture, perhaps it is actually necessary? Unless your threat model allows you to avoid it, of course, as in the browser case where different security contexts are isolated in their own processes.

view this post on Zulip Chris Fallin (Oct 26 2023 at 15:59):

Hmm, yeah; fwiw csdb was not used by Firefox for Wasm indirect calls even prior to their multiprocess split, as far as I can tell. Perhaps that was a missing mitigation, I dunno

view this post on Zulip Chris Fallin (Oct 26 2023 at 16:00):

if anyone has time to dig around in the Safari (JavaScriptCore) source to see what's done, that would be an interesting data point too...


Last updated: Nov 22 2024 at 17:03 UTC