(Not sure if this is the right channel to post this)
I've done a few experiments on an M1 Mac Mini and found some pretty high-performance overheads caused by the csdb
instructions generated when lowering select_spectre_guard
.
This overhead only exists on the performance cores, not on the efficiency ones. I believe the reason for this is the efficiency cores don't speculate nearly as much (or even not at all?) as the performance ones do, but then I am also by no means an expert. :)
I found it pretty surprising that the difference is that large, and since I didn't find anything else about this on the internet, I thought I'd share my findings. If there is an issue with how I ran these benchmarks, please let me know!
I've benchmarked an unmodified build (commit 270e92225
) against one with the following diff applied:
diff --git a/cranelift/codegen/src/isa/aarch64/inst/emit.rs b/cranelift/codegen/src/isa/aarch64/inst/emit.rs
index 43bdd4a3e..c2519885e 100644
--- a/cranelift/codegen/src/isa/aarch64/inst/emit.rs
+++ b/cranelift/codegen/src/isa/aarch64/inst/emit.rs
@@ -1820,7 +1820,7 @@ impl MachInstEmit for Inst {
sink.put4(enc_dmb_ish()); // dmb ish
}
&Inst::Csdb {} => {
- sink.put4(0xd503229f);
+ //sink.put4(0xd503229f);
}
&Inst::FpuMove64 { rd, rn } => {
let rd = allocs.next_writable(rd);
The tests were run on a Mac Mini with a M1 chip running Asahi Linux. I can also check if there is a difference when running the same thing on macOS, but I believe there shouldn't be.
Before running the tests, I set the governor to performance to force all cores to their maximum frequency.
Performance cores:
taskset --cpu-list 4-7 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 0.04s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags= benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 8027785.90 ± 160961.98 (confidence = 99%)
wasmtime/target/release/libwasmtime_bench_api.so is 3.16x to 3.25x faster than tmp/libwasmtime_bench_api.so!
[11320025 11663170.40 12242229] tmp/libwasmtime_bench_api.so
[3526618 3635384.50 3832395] wasmtime/target/release/libwasmtime_bench_api.so
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[2084 2519.50 3329] tmp/libwasmtime_bench_api.so
[2054 2389.60 3090] wasmtime/target/release/libwasmtime_bench_api.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[50306278 52021344.65 55069141] tmp/libwasmtime_bench_api.so
[50669042 52032507.05 54383128] wasmtime/target/release/libwasmtime_bench_api.so
On the efficiency cores, there is no difference:
taskset --cpu-list 0-3 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 0.11s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags= benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[5216 6214.50 9408] tmp/libwasmtime_bench_api.so
[5274 6807.20 10777] wasmtime/target/release/libwasmtime_bench_api.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[107328450 109423075.40 114775754] tmp/libwasmtime_bench_api.so
[108226908 110536060.70 113376486] wasmtime/target/release/libwasmtime_bench_api.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[12505682 12657196.70 12765194] tmp/libwasmtime_bench_api.so
[12477426 12666983.30 12849489] wasmtime/target/release/libwasmtime_bench_api.so
With dynamic bounds checks, the overhead is even bigger:
Performance cores:
taskset --cpu-list 4-7 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 0.04s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so '--engine-flags=-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0' benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 101879699.00 ± 1141736.09 (confidence = 99%)
wasmtime/target/release/libwasmtime_bench_api.so is 16.46x to 16.81x faster than tmp/libwasmtime_bench_api.so!
[107106675 108397620.75 112636143] tmp/libwasmtime_bench_api.so
[6452171 6517921.75 6757069] wasmtime/target/release/libwasmtime_bench_api.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 2268916.25 ± 1358278.96 (confidence = 99%)
tmp/libwasmtime_bench_api.so is 1.02x to 1.06x faster than wasmtime/target/release/libwasmtime_bench_api.so!
[57228321 59363510.15 62661925] tmp/libwasmtime_bench_api.so
[58325815 61632426.40 64403315] wasmtime/target/release/libwasmtime_bench_api.so
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[2214 2592.80 3968] tmp/libwasmtime_bench_api.so
[2162 2441.25 3405] wasmtime/target/release/libwasmtime_bench_api.so
Efficiency cores:
taskset --cpu-list 0-3 cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so --engine-flags="-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 0.11s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.so --engine ../wasmtime/target/release/libwasmtime_bench_api.so '--engine-flags=-O dynamic-memory-guard-size=0 -O static-memory-guard-size=0 -O static-memory-maximum-size=0' benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 489379.55 ± 43314.03 (confidence = 99%)
wasmtime/target/release/libwasmtime_bench_api.so is 1.02x to 1.02x faster than tmp/libwasmtime_bench_api.so!
[23615521 23708462.10 23789488] tmp/libwasmtime_bench_api.so
[23161583 23219082.55 23330794] wasmtime/target/release/libwasmtime_bench_api.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[127294091 131666671.05 137908998] tmp/libwasmtime_bench_api.so
[129037489 133240513.70 137623006] wasmtime/target/release/libwasmtime_bench_api.so
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[5316 6323.80 10262] tmp/libwasmtime_bench_api.so
[5516 6326.45 9558] wasmtime/target/release/libwasmtime_bench_api.so
Fascinating! Thanks for taking the time to collect these numbers and test here.
I dug a bit and found some background information on csdb here and here. I'm not spectre or aarch64 expert myself though so all I can be is a go-between with those.
In 4555 though it looked like at the time performance was concluded to not be affected much, but that was likely not performed on an M1 chip which may be exhibiting different performance characteristics.
One thing I've personally found a bit odd is that what you're probably running into here (in default mode) is the spectre guards for table bounds checks (and presumably spidermonkey has a lot of call_indirect
). Our table bounds checks have both an explicit bounds check leading to a trap plus a select_spectre_guard. While this probably achieves spectre-hardening it also feels a bit overkill IMO and is possibly something we could improve.
Also I'm curious, where did you get taskset
from on macos? I've got an M2 and am curious to see if the situation is any better with that or if it's the same
Running sightglass locally yields:
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 282449.80 ± 76316.41 (confidence = 99%)
no-csdb.dylib is 1.02x to 1.03x faster than csdb.dylib!
[10772130 10871102.65 11116720] csdb.dylib
[10483782 10588652.85 10753478] no-csdb.dylib
although this isn't using taskset
for pinning. I'm not sure if it would by default put everything on the performance cores.
Wow, that's a wild perf delta, @Martin Fink ! I was indeed worried about this when the original PR came up; @Anton Kirilov benchmarked on an Ampere Altra and found no impact
re: taskset
, I wonder, are you running Linux (Asahi or whatever) on your M1 by chance? could that have an impact (different system config register settings, etc)?
In retrospect I should have worded my statement in issue #1032 a bit better, though in pull request #4555 I got it right, I think. Also, just to be clear, I am unable to comment on any microarchitectural details of any processor that has not been designed by Arm (beyond any publicly accessible documentation, if any exists), including how susceptible it is to Spectre-style attacks.
Hi! yes, this was running on Asahi. I just tried the same benchmark again on my personal Macbook (M1 Pro; this time with macOS) and got essentially the same results.
AFAIK, you can't pin a process to the performance cores, but you can pin it to the efficiency cores using taskset -c background command args...
. The macOS scheduler should prefer the performance cores for this kind of workload anyways I believe.
Without taskpolicy
:
cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 0.10s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags= benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 7875436.90 ± 76393.16 (confidence = 99%)
wasmtime/target/release/libwasmtime_bench_api.dylib is 3.15x to 3.20x faster than tmp/libwasmtime_bench_api.dylib!
[11307195 11495795.05 11729345] tmp/libwasmtime_bench_api.dylib
[3532096 3620358.15 3726905] wasmtime/target/release/libwasmtime_bench_api.dylib
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[2262 2763.85 4202] tmp/libwasmtime_bench_api.dylib
[2274 2556.15 3046] wasmtime/target/release/libwasmtime_bench_api.dylib
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[147601278 149324679.95 151225801] tmp/libwasmtime_bench_api.dylib
[146173673 149144740.10 155132568] wasmtime/target/release/libwasmtime_bench_api.dylib
With taskpolicy
:
taskpolicy -c background cargo run -- benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags="" benchmarks/spidermonkey/benchmark.wasm
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
Finished dev [unoptimized + debuginfo] target(s) in 2.16s
Running `target/debug/sightglass-cli benchmark --iterations-per-process 2 --engine ../tmp/libwasmtime_bench_api.dylib --engine ../wasmtime/target/release/libwasmtime_bench_api.dylib --engine-flags= benchmarks/spidermonkey/benchmark.wasm`
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[6124 27213.70 111028] tmp/libwasmtime_bench_api.dylib
[5381 40362.40 295461] wasmtime/target/release/libwasmtime_bench_api.dylib
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[23995605 30439809.15 38672919] tmp/libwasmtime_bench_api.dylib
[24360364 28723230.40 36000666] wasmtime/target/release/libwasmtime_bench_api.dylib
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[401455183 549387423.20 1355357035] tmp/libwasmtime_bench_api.dylib
[394497975 538719528.60 858583052] wasmtime/target/release/libwasmtime_bench_api.dylib
Whoa! Without taskpolicy
you're showing a 3x slowdown on M1, but on an M2 I got a 2% slowdown, so now I'm curious if someone can reproduce that and if this is perhaps something that was updated for the M2
Yes, and if I turn on dynamic bounds checks the slowdown gets worse (as there is way more spectre guards)
yeah that part definitely makes sense
FWIW this wasm:
(module
(table 1 funcref)
(func (param i32)
(call_indirect (local.get 0))
)
)
has this codegen for me locally (annotated)
;; frame setup
0: d503237f pacibsp
4: a9bf7bfd stp x29, x30, [sp, #-0x10]!
8: 910003fd mov x29, sp
;; stack overflow check
c: f8408010 ldur x16, [x0, #0x8]
10: f8400210 ldur x16, [x16]
14: eb3063ff cmp sp, x16
18: 54000543 b.lo 0xc0 <wasm[0]::function[0]+0xc0>
;; register spills
1c: f81f0ff3 str x19, [sp, #-0x10]!
;; branch-to-trap if table index is out of bounds
20: aa0003e6 mov x6, x0
24: b9405005 ldr w5, [x0, #0x50]
28: 6b05005f cmp w2, w5
2c: 54000482 b.hs 0xbc <wasm[0]::function[0]+0xbc>
;; calculate table element address
30: aa0603e0 mov x0, x6
34: f9402407 ldr x7, [x0, #0x48]
38: 2a0203e8 mov w8, w2
3c: 8b080ce8 add x8, x7, x8, lsl #3
;; spectre guard to use table address or table base
40: 6b05005f cmp w2, w5
44: 9a8820e8 csel x8, x7, x8, hs
48: d503229f csdb
;; load the table element, branch to init if it's not
;; already initialized
4c: f9400109 ldr x9, [x8]
50: 927ff920 and x0, x9, #0xfffffffffffffffe
54: b4000209 cbz x9, 0x94 <wasm[0]::function[0]+0x94>
58: aa0603f3 mov x19, x6
;; check if table element is null
5c: b40002c0 cbz x0, 0xb4 <wasm[0]::function[0]+0xb4>
;; load metadata from the table
60: b940180d ldr w13, [x0, #0x18]
64: f940226e ldr x14, [x19, #0x40]
68: aa1303e6 mov x6, x19
;; type-check and trap if function doesn't have the
;; desired type
6c: b94005ce ldr w14, [x14, #0x4]
70: 6b0e01bf cmp w13, w14
74: 54000221 b.ne 0xb8 <wasm[0]::function[0]+0xb8>
;; perform the actual call
78: f9400802 ldr x2, [x0, #0x10]
7c: f9401000 ldr x0, [x0, #0x20]
80: aa0603e1 mov x1, x6
84: d63f0040 blr x2
;; return
88: f84107f3 ldr x19, [sp], #0x10
8c: a8c17bfd ldp x29, x30, [sp], #0x10
90: d65f0fff retab
;; lazy initialization of table entries
94: aa0603e0 mov x0, x6
98: f9401c0a ldr x10, [x0, #0x38]
9c: f940254a ldr x10, [x10, #0x48]
a0: 52800001 mov w1, #0x0 // =0
a4: aa0603f3 mov x19, x6
a8: aa1303e0 mov x0, x19
ac: d63f0140 blr x10
b0: 17ffffeb b 0x5c <wasm[0]::function[0]+0x5c>
;; various traps
b4: 0000c11f udf #0xc11f
b8: 0000c11f udf #0xc11f
bc: 0000c11f udf #0xc11f
c0: 0000c11f udf #0xc11f
so you can see our call_indirect
is not exactly the most optimal as-is and generates a lot of code
but it's a bit odd to me that there's two length comparisons where we should probably only have one with spectre things enabled
Yeah that's how I found this oddity -- I was trying to look at the indirect_call
code and looking for a possibility to optimize it. But then even when removing the signature check completely, there was no performance difference (because csdb
ate all performance gains one might gain from completely removing signature checks).
if you haven't already found it, -Ccranelift-enable_table_access_spectre_mitigation=n
will disable the csdb
bits
basically disabling the enable_table_access_spectre_mitigation
setting in Cranelift
Yes, with that I don't have the slowdown
you may also be interested in this issue which is probably a good starting point for optimizing this
Alex Crichton said:
but it's a bit odd to me that there's two length comparisons where we should probably only have one with spectre things enabled
we did the optimization to dedupe bounds checks and spectre guards for memories, but never did that for tables
since some of the optimizations are best done at the producer level rather than in CLIF, for example we know when a table has a min/max size that's the same we could avoid a dynamic load of the table's length since it's statically known
fitzgen (he/him) said:
Alex Crichton said:
but it's a bit odd to me that there's two length comparisons where we should probably only have one with spectre things enabled
we did the optimization to dedupe bounds checks and spectre guards for memories, but never did that for tables
oh right!
ah yeah that issue would be a pre-req for doing the same optimization for table accesses
If you don't mind, I might have some free time in the coming days/weeks to look at this, since it sounds like an interesting issue. (And I'm a bit familiar with the pieces that compute indirect calls and table addresses now :) )
that'd be great! Many of us can help out with the issue so don't hesistate to ask questions here or on github or similar
I think this conversation may be getting a bit sidetracked by the above -- the main slowdown is the csdb
, so we should work out a way to avoid that if we can, right?
I just confirmed on my M1 on macOS that I see a ~2x speedup on spidermonkey.wasm when removing csdb
so my guess is that it's doing a full pipeline flush / barrier
I know the aarch64 spec says that in theory we need to future-proof against value speculation with csdb
but if it carries this much penalty on M1, maybe we should reconsider...
oh sorry yeah I didn't mean to sidetrack too much, but wanted to pull on the thread of "let's optimize call_indirect" a bit
given the data on M2 it may mean that we could consider a possible M1-specific rule along these lines
it seems that a concrete change could be: put the use of the instruction under another flag, call it spectre_aarch64_value_speculation_barrier
or something similar, and have it off by default
although with this just coming out today we should probably be somewhat certain heh
if this only affects M1 though should it be off-by-default?
or on-by-default, the thing I was about to say is we should then have an allowlist (or denylist, depending) of uarches
if we know it's slow just on M1, let's turn it off on M1, with the reasoning that we suspect M1 doesn't do value speculation
(linked paper seems to be about timer resolution, that's orthogonal to speculation kinds I guess?)
oh no it's all about speculation, they showed that without low-resolution timers they successfully launched spectre stuff on all macOS M1/M2 devices
they achieved an arbitrary read of any 64-bit address within the process via speculation
right, the key contribution seems to be about timer resolution afaict though
(it's already known that m1/m2/... speculate since they're OoO, they seem to measure the depth but it was already known they have big OoO windows e.g. M1 is I think 512 insts in the ROB)
dougallj's uarch tables seem to confirm that CSDB is a full fence: 27-cycle latency (https://dougallj.github.io/applecpu/firestorm-int.html)
I'm not able to find anything that documents better what the limits of value speculation are in M1. The specific guarantee we'd need to be 100% sure about this is that M1 doesn't speculate flag values on the result of the compare that feed into the csel
. What I do know is (i) value speculation has historically been considered extremely difficult and complex, and not likely to be deployed; (ii) in cases where it would be, it would likely be used to predict the result of a long-latency op, like a cache-miss load, early (at least that was the original academic paper's proposal); but the address we're bounds-checking is usually not that
again normally I'd say we should follow ISA spec to the letter, but a 2x slowdown on SpiderMonkey for a theoretical issue that likely doesn't exist on this uarch is... hard to accept :-)
to make sure I understand, the worry is that csel
speculates the result not based on the actual inputs but on something conjured out of thin air?
e.g. it may not speculate based on the two possible inputs, but actually something totally different?
(I'm naive and know very little about mitigating spectre apart from the high-level bits)
right, say that there is a value predictor that fires when an instruction's inputs are not ready; it says "execute this csel as if the flags are <no overflow>", and the load speculatively executes and we affect the cache. The value speculation idea is that we could do that, remember what we speculated, and when the real input resolves, keep the work we did if we got it right
the original paper on value speculation was in iirc 1996, it's a neat idea, but afaik no one has actually built it
csdb is described as "future proofing", which reading between the lines a bit (opinion follows!) tells me the ISA folks want it in software now so they have the freedom to do these tricks later
wow that's wild
it definitely makes sense to me then at the very least that if it's purely future-proofing then we can probably safely say no new M1s will ever be designed so we can disable it there
at least on x86, everyone has seemed to just tacitly accept that it's not done, at least for cmov's (or maybe there was a statement about this from Intel that I missed?), so we can use a cmp+cmov as a mask on the misspeculated path
rephrased: disabling on M1 makes sense if we know M1 doesn't do value speculation, disabling elsewhere is still somewhat iffy but perhaps not necessary if it's not much of a perf impact
interesting yeah
do js engines use csdb do you know?
and, right, the extra evidence I'd offer is that the barrier needed to protect against it is expensive, so they didn't optimize it, so it probably isn't needed
b/c if they don't I doubt that us doing so would really move the needle
interesting Q, I was gonna check that next! apparently SM recently removed all spectre hardening because they just put trust domains in separate processes now (FF finished Project Fission)
but historically...
oh damn
I wonder if we should disable spectre stuff by defaut then and document accordingly?
anyone who runs multi-tenant in a single process probably still cares!
it is a valid question for the CLI though
(and I guess there are interesting questions around component model + multi-module too -- if you run mutually mistrusting modules in a component you probably still want the hardening)
good point yeah
I suppose this means that we basically can't use browsers as "prior art" if they're removing spectre things
so I'm looking through SM's Wasm implementation prior to their removal of spectre-protection-by-default (bug 1837602, Jun 23 of this year, commit b7d1824da842b37c4e0089ded35f9fb7b089fa60
in the github.com/mozilla/gecko-dev mercurial mirror)
and... surprisingly enough, I'm not able to find anything, at least in WasmBaselineCompile.cpp
(masm.speculationBarrier()
is the magical incantation that gives csdb
on aarch64; it's still there in some hostcalls)
Would something like what @Anton Kirilov mentioned in #1032 also solve this? Setting a software context id when transitioning to the guest, and removing the csdb
?
(I'm not sure if they solve the same spectre issue, or if it's a different one. I'm not too familiar with any of this)
that's a useful feature (separate predictor contexts) but unfortunately looks like it relies on an ISA extension that's fairly recent...
given that I'm not finding this in FF's codebase, I'd support removing csdb
on M1 on that basis
if it's cheap/free as it is on M2, why not, we can future-proof
(about to start another run of meetings so there's my time-boxed investigation result :-) )
if it's cheap/free as it is on M2, why not, we can future-proof
I guess a question is how high the risk is of other Aarch64 implementations behaving like M1 does here, instead of M2. If there's some amount of risk that more CPUs are negatively affected like M1 is, then I think we should disable this by default, and only enable it on CPUs that we know to be ok.
@Afonso Bordado The software context number is a mitigation for a different Spectre variant (Spectre-V2 instead of Spectre-V1), which in the Wasm case would involve cross-module instance attacks, I think. In this case any potential attack happens within a single module, so the context ID would stay the same. And yes, it requires hardware support, but crucially kernel support as well.
Also, to play devil's advocate - if the CSDB
instruction is expensive on a certain microarchitecture, perhaps it is actually necessary? Unless your threat model allows you to avoid it, of course, as in the browser case where different security contexts are isolated in their own processes.
Hmm, yeah; fwiw csdb
was not used by Firefox for Wasm indirect calls even prior to their multiprocess split, as far as I can tell. Perhaps that was a missing mitigation, I dunno
if anyone has time to dig around in the Safari (JavaScriptCore) source to see what's done, that would be an interesting data point too...
Last updated: Jan 24 2025 at 00:11 UTC