jameysharp labeled issue #5206:
In #5201, @afonso360 observed that currently, every Cranelift backend can only lower
select_spectre_guard
when paired withicmp
, and asked if there are any plans to implement it more generally.This instruction is currently only used in
cranelift/codegen/src/legalizer/heap.rs
andcranelift/codegen/src/legalizer/table.rs
as part of the implementation of theheap_addr
andtable_addr
instructions, respectively, when Spectre mitigations are enabled in theTargetIsa
flags. And indeed, its first input is always produced byicmp
.I think we should consider fusing the
icmp
semantics into the definition ofselect_spectre_guard
's behavior. That would clarify this question and also should give us a small compile-time speedup, due to processing fewer instructions and doing less work in instruction selection pattern-matching.I believe every use of
select_spectre_guard
is also preceded by an additional copy of the exact sameicmp
, to feed intotrapnz
. Maybe we can generate better code at the backends if we fuse that sequence into this instruction as well?The "obvious" way to fuse these instructions would be to make
select_spectre_guard
take two pairs of values, along with the condition code and maybe trap code. However, I believe that would makeInstructionData
bigger than 16 bytes, which we don't want to do.The
heap_addr
legalization always uses aniconst 0
for the mispredict output, whiletable_addr
uses the table base. Can we use a constant 0 address in both cases? Then we could define the fused instruction as taking only threeValue
s, which I think should fit alongside a condition code. If we also have a trap code the story is trickier since those are four bytes as well.This might also be part of, or an alternative to, #5190, so cc: @cfallin @fitzgen @alexcrichton
jameysharp opened issue #5206:
In #5201, @afonso360 observed that currently, every Cranelift backend can only lower
select_spectre_guard
when paired withicmp
, and asked if there are any plans to implement it more generally.This instruction is currently only used in
cranelift/codegen/src/legalizer/heap.rs
andcranelift/codegen/src/legalizer/table.rs
as part of the implementation of theheap_addr
andtable_addr
instructions, respectively, when Spectre mitigations are enabled in theTargetIsa
flags. And indeed, its first input is always produced byicmp
.I think we should consider fusing the
icmp
semantics into the definition ofselect_spectre_guard
's behavior. That would clarify this question and also should give us a small compile-time speedup, due to processing fewer instructions and doing less work in instruction selection pattern-matching.I believe every use of
select_spectre_guard
is also preceded by an additional copy of the exact sameicmp
, to feed intotrapnz
. Maybe we can generate better code at the backends if we fuse that sequence into this instruction as well?The "obvious" way to fuse these instructions would be to make
select_spectre_guard
take two pairs of values, along with the condition code and maybe trap code. However, I believe that would makeInstructionData
bigger than 16 bytes, which we don't want to do.The
heap_addr
legalization always uses aniconst 0
for the mispredict output, whiletable_addr
uses the table base. Can we use a constant 0 address in both cases? Then we could define the fused instruction as taking only threeValue
s, which I think should fit alongside a condition code. If we also have a trap code the story is trickier since those are four bytes as well.This might also be part of, or an alternative to, #5190, so cc: @cfallin @fitzgen @alexcrichton
alexcrichton commented on issue #5206:
If we can I think it would be ideal to fit at least one bit for the trap code since wasm traps need to ideally distinguish between
MemoryOutOfBounds
andTableOutOfBounds
here. The full-blown 4-byte trap code is definitely not needed for wasm, however.
fitzgen commented on issue #5206:
The "obvious" way to fuse these instructions would be to make
select_spectre_guard
take two pairs of values, along with the condition code and maybe trap code. However, I believe that would makeInstructionData
bigger than 16 bytes, which we don't want to do.We could move all this stuff into a side table and have the
InstructionData
just reference the side table by entity/index.
cfallin commented on issue #5206:
Hmm -- adding a side-table to hold some instruction data out-of-band, or picking just a single bit of trapcode to store, both seem somewhat suboptimal to me. We could do either if we needed this, but I'm trying to understand the need: is the argument for this fusion based only on the question of what
select_spectre_guard
means otherwise? Because answering that question seems much easier to me; in fact the answer is basically "justselect
, but always a cmove in machine code / never folded away".
jameysharp commented on issue #5206:
I guess my main question is, why make the backends handle the general case if we don't need it? My impression is that would be a little annoying if we kept all the iflags infrastructure, and worse as we're removing it.
Besides, we've been talking about smaller numbers of more complex instructions as one of the tools in our toolbox for reducing time spent in Cranelift. Since these two instructions are always paired, I figured they seem like a natural candidate for that. In the most extreme case, we could stop legalizing
heap_addr
and add corresponding pseudo-instructions in the backends. I don't currently have an intuition for how the trade-offs work out among these options.I've considered the counter-argument that the
icmp
might be optimized away if it's the same as a preceding one. I don't think that's currently possible, because the backends always lower this pair of instructions together, regardless of whether theicmp
was already lowered for another use. At this point we'd need machine-code peephole optimization for that, I think.I don't see any other way that this split has a performance advantage in either compile time or run time, and some chance that a combined instruction might have an advantage in both.
In short: I don't know if this is a good idea, but I don't currently understand why we'd prefer the status quo.
fitzgen commented on issue #5206:
One baby step simplification we could make to
select_spectre_guard
: make the zero operand implicit.It would have the following semantics (plus the speculation sandboxing guarantees):
spectre_guard 0, value = value spectre_guard n, _ = 0 if n != 0
(Since it isn't a select at the clif level anymore, I removed the "select" from the instruction name.)
That is, this sequence today
v0 = ... v1 = iconst.i32 0 v2 = icmp ugt ... v3 = select_spectre_guard v2, v1, v0
becomes
v0 = ... v1 = icmp ugt ... v2 = spectre_guard v1, v0
jameysharp commented on issue #5206:
Would it be useful to merge
select_spectre_guard
into a new variant oftrapz
? Sospectre_trapz v0, heap_oob, v1
would trap withheap_oob
ifv0
is zero, and otherwise it would evaluate tov1
. But in case of a branch mispredict, it would evaluate to zero on the speculated path.This would work safely with the idempotent side-effects machinery: as long as an identical instruction dominates the current one, you can reuse its result instead of evaluating it again, and be assured that both the trapping side-effect and the Spectre guard are preserved.
It also fits into the approach we're discussing in #5908: we'd be able to write ISLE rules that rewrite this instruction to
v1
ifv0
is known to be non-zero. That's safe as long asv0
doesn't depend on an earlier unguarded branch, which we've been discussing is something we need to think about regardless.With #6080, this could also be rewritten to an unconditional trap if
v0
is known to be zero. This is always safe no matter what earlier branches we've seen.With #6055, during elaboration, we could rewrite this to a plain
trapz
if its result is unused.
cfallin commented on issue #5206:
I like this! In conjunction with a "no path-sensitive deductions" rule that we hold throughout optimizations, I think the constant-folding you're describing would be safe too. To elaborate on that a bit (@jameysharp and I talked offline a bit about this with others a few days ago):
- We need to be safe even in misspeculated executions of bounds-checks (this is "Spectre safety");
- In such an execution, we can ensure safety by short-circuiting an accessed memory address to NULL or another safe value based only on a comparison result, and not a branch (this relies on a microarchitecture guarantee that no value speculation is occurring, which we generally have, either explicitly via
csdb
barriers on aarch64 or via Spectre-safety recommendations on other CPUs);- If we're going to optimize the cmove used for that short-circuiting, the "choice" input can't have the wrong value, even speculatively, relative to other values in the executed path;
- So we can't make any deductions and do constant propagation based on conditional or indirect branches (which can be speculated).
Concretely this would mean that if we have a rule that sees
brif v0, truthyblock(...), falsyblock(...)
, we cannot const-fold any Spectre-guards based onv0
being truthy intruthyblock
. So if we want to foldspectre_trapz v0, ...
withv0
known to be0
or1
, we have to avoid such reasoning. But we don't currently have any such deductions, and we can keep it that way; then this unlocks some other nice optimizations, unless we've missed something...
fitzgen commented on issue #5206:
We would have to be careful that we still get the lowering we want, where the
cmov
produces a zero that is what gets loaded. This seems kind of tricky while thespectre_trapnz
is not fused as a single instruction with the load.For example this clif:
v0 = icmp ugt index, heap_bound v1 = iadd index, heap_base v2 = spectre_trapnz v0, heap_oob, v1 v3 = load.i32 v2
we would want to make sure does not expand into something like
mov v1, index add v1, heap_base mov v2, 0 test index, heap_bound cmova v2, v1 ja trap mov v3, [v2] trap: ud2
where there is an unnecessary jump-to-
ud2
(created byMInst::TrapIf
) because of the way thatspectre_trapnz
is lowered independently of theload.i32
.It seems like we could match
load(spectre_trapnz(...))
in the backends to make sure we generate the right code here, but this also seems a bit fragile and not obviously better than thespectre_guard
variant I proposed above to me.
fitzgen edited a comment on issue #5206:
We would have to be careful that we still get the lowering we want, where the
cmov
produces a zero that is what gets loaded, and loading this zero value is what produces the trap. This seems kind of tricky while thespectre_trapnz
is not fused as a single instruction with the load.For example this clif:
v0 = icmp ugt index, heap_bound v1 = iadd index, heap_base v2 = spectre_trapnz v0, heap_oob, v1 v3 = load.i32 v2
we would want to make sure does not expand into something like
mov v1, index add v1, heap_base mov v2, 0 test index, heap_bound cmova v2, v1 ja trap mov v3, [v2] trap: ud2
where there is an unnecessary jump-to-
ud2
(created byMInst::TrapIf
) because of the way thatspectre_trapnz
is lowered independently of theload.i32
.It seems like we could match
load(spectre_trapnz(...))
in the backends to make sure we generate the right code here, but this also seems a bit fragile and not obviously better than thespectre_guard
variant I proposed above to me.
jameysharp commented on issue #5206:
I think what happened here is that I forgot that it's the
load
/store
that traps, not a preceding conditional-trap instruction. Which means I'm not sure we have the pattern anywhere that a conditional trap precedes aselect_spectre_guard
, in which case the instruction I'm suggesting isn't one that we would use anywhere. So… nevermind?We probably still do want the kind of "no path-sensitive deductions" rule that Chris described, though.
Last updated: Nov 22 2024 at 16:03 UTC