cfallin opened issue #8175:
In discussion today with @fitzgen, @jameysharp, @elliottt and @lpereira, we were considering the idea to dynamically monitor branch mispredictions and isolate execution of any Wasm instance that had used up a "misspeculation quota". I realized that actually what we could do is (effectively) turn off speculation -- you run out, you can't use it anymore! -- by dynamically inserting `lfence`s.
Specifically: the (one?) neat thing about fully coherent icaches on x86 is that we can switch out the code that's running, on the fly, even if other threads are in the middle of functions we're switching out, as long as we're very careful to do it atomically (state between any two stores is valid code).
Consider the case where we want an `lfence` before every indirect branch (say; or before every branch; orthogonal detail) and we have:
    ...
    mov rax, ...   # compute branch target (e.g. from br_table)
    nop            # space for `lfence` (3 bytes)
    nop
    nop
    jmp rax
we can replace the three bytes of `nop` (`0x90`, `0x90`, `0x90`) with `lfence` (`0x0f`, `0xae`, `0xe8`) if we want to "turn off speculation" for this module for a little bit.
There are at least three ways to do that on an x86 machine (with coherent icaches):
- Do an atomic store to code memory. For this we'd need W+X mappings temporarily, and an extra `nop` to make this a 32-bit region we could overwrite with one 32-bit store.
- Above, but switch from R+X to R+W; take the SIGBUS from any running thread, temporarily hold, and release when we switch the mapping back (via a futex?).
- The one I like best: keep another version of the code segment around, and mmap it over the first.
The last one is pretty neat: `mmap` is atomic with respect to every other thread (appears as a single store in the total store order; it must, because if another thread had it mapped, it would receive an IPI, which is a synchronizing edge). So we basically "yank out the code ROM and replace it" in between instructions, and the new code doesn't speculate.
Using this, we can build a control loop in a separate thread that monitors mispredict counters, and can flip the switch at will for any module that has excessive counts. It doesn't have to be a one-way trapdoor: a module could have a "mispredict quota" per time unit, and could reset to the fast code (no `lfence`s) after a set period. There is no impact on other modules -- it only impacts the module with the mispredicts.
Finally, I suspect this will be a bit harder on non-coherent-icache architectures (aarch64, riscv64), but actually maybe the "mmap a new thing on top of running code" is enough of a jolt to yoink all other cores into coherent happiness again. Note that I haven't tested that!
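The "keep another version around and mmap it over the first" option can be sketched roughly as below. This is a minimal illustration, not Wasmtime code: `code_region`, `region_init`, `region_set_fenced`, `region_slot`, and the slot offset are all hypothetical names and values, the two images are single pages of `int3` padding with a 3-byte slot, and the fast image assumes a single 3-byte `nop` (`0x0f 0x1f 0x00`) in the slot. The caller passes the protection bits so the sketch can be exercised with `PROT_READ` alone; real use would need `PROT_READ | PROT_EXEC`. Whether the remap is atomic for other threads is the claim under discussion and is not established by this code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { SLOT_OFF = 16, SLOT_LEN = 3 };   /* hypothetical slot position */
static const uint8_t NOP3[SLOT_LEN]   = {0x0f, 0x1f, 0x00}; /* 3-byte nop */
static const uint8_t LFENCE[SLOT_LEN] = {0x0f, 0xae, 0xe8}; /* lfence */

typedef struct {
    void  *base;      /* fixed address the module's code is mapped at */
    size_t len;       /* one page in this sketch */
    int    prot;      /* PROT_READ | PROT_EXEC in real use */
    int    fast_fd;   /* image with the nop in the slot */
    int    fenced_fd; /* image with lfence in the slot */
} code_region;

/* Build an unlinked temp file holding one page: int3 padding plus the slot. */
static int make_image(const uint8_t slot[SLOT_LEN], size_t len) {
    FILE *f = tmpfile();
    if (!f) return -1;
    int fd = dup(fileno(f));   /* keep a descriptor that outlives the FILE */
    fclose(f);
    uint8_t *buf = malloc(len);
    if (fd < 0 || !buf) {
        free(buf);
        if (fd >= 0) close(fd);
        return -1;
    }
    memset(buf, 0xcc, len);
    memcpy(buf + SLOT_OFF, slot, SLOT_LEN);
    ssize_t n = write(fd, buf, len);
    free(buf);
    if (n != (ssize_t)len) { close(fd); return -1; }
    return fd;
}

/* Create both images and map the fast one at a kernel-chosen address. */
int region_init(code_region *r, int prot) {
    r->len = (size_t)sysconf(_SC_PAGESIZE);
    r->prot = prot;
    r->fast_fd = make_image(NOP3, r->len);
    r->fenced_fd = make_image(LFENCE, r->len);
    if (r->fast_fd < 0 || r->fenced_fd < 0) return -1;
    r->base = mmap(NULL, r->len, prot, MAP_PRIVATE, r->fast_fd, 0);
    return r->base == MAP_FAILED ? -1 : 0;
}

/* Replace whatever is mapped at r->base with the chosen image, using a
   single mmap call with MAP_FIXED so the address does not change. */
int region_set_fenced(code_region *r, int fenced) {
    int fd = fenced ? r->fenced_fd : r->fast_fd;
    void *p = mmap(r->base, r->len, r->prot, MAP_PRIVATE | MAP_FIXED, fd, 0);
    return p == r->base ? 0 : -1;
}

/* Pointer to the 3-byte slot inside the currently mapped image. */
const uint8_t *region_slot(const code_region *r) {
    return (const uint8_t *)r->base + SLOT_OFF;
}
```

A monitoring thread would call `region_set_fenced(&r, 1)` when a module exceeds its quota and `region_set_fenced(&r, 0)` to restore the fast code; the code address stays the same, so nothing that points into the region needs to be updated.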
cfallin commented on issue #8175:
One slight tweak: those three `nop`s need to be one 3-byte `nop` for the atomic "store" of the new code with re-`mmap` to work safely; otherwise RIP might be right in the middle of where the `lfence` is about to spontaneously appear.
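To make the RIP concern concrete, here is a small sketch that models a patch slot as a list of instruction lengths and checks whether every instruction start in the old layout is still an instruction start in the new one. The function names are hypothetical, and the model only looks at where a paused thread's RIP could point; it says nothing about fetch or decode behavior on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if byte offset `off` is the start of an instruction in a
   slot whose instructions have the given lengths, in order. */
bool is_boundary(const size_t *lens, size_t n, size_t off) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (pos == off) return true;
        pos += lens[i];
    }
    return false;
}

/* A swap is considered safe here if every offset at which a thread could
   be stopped in the old slot (each instruction start) is also an
   instruction start in the new slot. */
bool swap_is_safe(const size_t *old_lens, size_t n_old,
                  const size_t *new_lens, size_t n_new) {
    size_t pos = 0;
    for (size_t i = 0; i < n_old; i++) {
        if (!is_boundary(new_lens, n_new, pos)) return false;
        pos += old_lens[i];
    }
    return true;
}
```

With three 1-byte `nop`s (lengths `{1, 1, 1}`) replaced by one `lfence` (length `{3}`), a thread stopped at offset 1 or 2 would resume in the middle of the `lfence`, so the check fails; with one 3-byte `nop` (length `{3}`) the only possible stop is offset 0 and the check passes.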
sunfishcode commented on issue #8175:
Is a single `mmap` that spans multiple pages guaranteed to be entirely atomic?
cfallin commented on issue #8175:
I think so? At the very least, in the Linux implementation, the memory-map changes are made under one lock, and one IPI is performed to other cores if needed; it'd be neat to find something in the POSIX spec either way to cite though.
bjorn3 commented on issue #8175:
Even with low amounts of mispredicted branches it would be possible to (slowly) leak data, right?
cfallin commented on issue #8175:
The idea is that one would set the quota according to the desired probability (leak bit-rate bound). I haven't thought too much about the control algorithm here but perhaps one puts a module in "non-speculative mode" for the remaining duration of any individual instance alive at the time of the heightened branch mispredict rate (one could implement this with epochs, labeling instances at startup and keeping a count of active instances in epoch N-1 and N). Or something like that.
I should also note that this can be layered with existing mitigations: so e.g. any explicit bounds checks are protected already (cannot read others' heaps even in misspeculation) and this technique is mainly to address the "indirect branches can jump anywhere and find a read gadget" problem, which itself should have a lower effective bit-rate...
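One possible shape for the quota logic, as a rough sketch only: it implements the simpler "quota per time unit, then reset after a set period" variant from the issue text rather than the epoch scheme, and every name and field here (`module_budget`, `budget_sample`, the tick units) is made up for illustration. How the mispredict counts are obtained (for example from hardware performance counters) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t quota;         /* mispredicts allowed per accounting window */
    uint64_t window_ticks;  /* length of one accounting window */
    uint64_t fenced_ticks;  /* how long to stay fenced once tripped */
    uint64_t window_start;  /* tick at which the current window began */
    uint64_t used;          /* mispredicts counted in the current window */
    uint64_t fenced_until;  /* tick at which fast code may resume */
    bool     fenced;        /* true while the lfence version is active */
} module_budget;

/* Feed one sample: `now` is the current tick, `mispredicts` is the count
   observed since the previous sample. Returns true when the module should
   be running the lfence version of its code. */
bool budget_sample(module_budget *b, uint64_t now, uint64_t mispredicts) {
    if (b->fenced && now >= b->fenced_until) {
        b->fenced = false;          /* cool-down over: back to fast code */
        b->window_start = now;
        b->used = 0;
    }
    if (now - b->window_start >= b->window_ticks) {
        b->window_start = now;      /* start a new accounting window */
        b->used = 0;
    }
    b->used += mispredicts;
    if (!b->fenced && b->used > b->quota) {
        b->fenced = true;           /* quota exhausted: turn speculation off */
        b->fenced_until = now + b->fenced_ticks;
    }
    return b->fenced;
}
```

The control thread would call `budget_sample` periodically for each module and remap that module's code whenever the returned value changes. Choosing `quota` and `window_ticks` is where the desired leak bit-rate bound would come in.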
cfallin commented on issue #8175:
I just experimented a bit with this idea by writing a little program that mmaps two assembly routines over the top of each other -- identical except for LFENCEs vs. 3-byte NOPs -- while running, and observing the effective timing difference. (The second thread can actually mmap back and forth with different duty cycles, and one can observe the runtime changing smoothly as the amount of speculation varies -- a very weird sort of PWM.) Here is the gist. Note that this doesn't verify the page-crossing behavior (the little snippet lives on one page); it just shows that the remap-it-live action does work.
Last updated: Jan 24 2025 at 00:11 UTC