wasmtime / issue #8175 Spectre mitigations: add a mode th... · git-wasmtime

Wasmtime GitHub notifications bot (Mar 18 2024 at 21:02):

In discussion today with @fitzgen, @jameysharp, @elliottt and @lpereira, we were considering the idea to dynamically monitor branch mispredictions and isolate execution of any Wasm instance that had used up a "misspeculation quota". I realized that actually what we could do is (effectively) turn off speculation -- you run out, you can't use it anymore! -- by dynamically inserting lfences.

Specifically: the (one?) neat thing about fully coherent icaches on x86 is that we can switch out the code that's running, on the fly, even if other threads are in the middle of functions we're switching out, as long as we're very careful to do it atomically (state between any two stores is valid code).

Consider the case where we want an lfence before every indirect branch (say; or before every branch; orthogonal detail) and we have:
    ...
    mov rax, ... # compute branch target (e.g. from br_table)
    nop # space for `lfence` (3 bytes)
    nop
    nop
    jmp rax
we can replace the three bytes of nop (0x90, 0x90, 0x90) with lfence (0x0f, 0xae, 0xe8) if we want to "turn off speculation" for this module for a little bit.
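
As a rough illustration of the byte swap just described, here is a minimal Python sketch that builds the "fenced" and "fast" variants of a code buffer. The function name `build_variant`, the `slots` list of patch offsets, and the sample `mov rax, rcx` prologue are assumptions made for the example, not anything Wasmtime or Cranelift provides; a real compiler would presumably record the slot offsets at code-generation time.

```python
NOP3 = bytes([0x90, 0x90, 0x90])    # three 1-byte nops, as in the listing above
LFENCE = bytes([0x0F, 0xAE, 0xE8])  # lfence
JMP_RAX = bytes([0xFF, 0xE0])       # jmp rax


def build_variant(code: bytes, slots: list[int], fenced: bool) -> bytes:
    """Return a copy of `code` with every 3-byte slot set to lfence or nops.

    `slots` is a hypothetical list of offsets of the reserved slots.
    """
    out = bytearray(code)
    new = LFENCE if fenced else NOP3
    for off in slots:
        if bytes(out[off:off + 3]) not in (NOP3, LFENCE):
            raise ValueError(f"offset {off} is not a reserved slot")
        out[off:off + 3] = new
    return bytes(out)


# mov rax, rcx ; nop ; nop ; nop ; jmp rax
code = bytes([0x48, 0x89, 0xC8]) + NOP3 + JMP_RAX
fenced_code = build_variant(code, [3], fenced=True)
fast_code = build_variant(fenced_code, [3], fenced=False)
```

Both variants have the same length and the same instruction layout outside the slot, which is what lets one be substituted for the other in place.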

There are at least three ways to do that on an x86 machine (with coherent icaches):

1. Do an atomic store to code memory. For this we'd need W+X mappings temporarily, and an extra nop to make this a 32-bit region we could overwrite with one 32-bit store.
2. Above, but switch from R+X to R+W; take the SIGBUS from any running thread, temporarily hold, and release when we switch the mapping back (via a futex?).
3. The one I like best: keep another version of the code segment around, and mmap it over the first.

The last one is pretty neat: mmap is atomic with respect to every other thread (appears as a single store in the total store order; it must, because if another thread had it mapped, it would receive an IPI, which is a synchronizing edge). So we basically "yank out the code ROM and replace it" in between instructions, and the new code doesn't speculate.
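
The remap-over-the-top mechanics can be sketched in Python via `ctypes`. This is only an illustration under several assumptions: it targets Linux, hardcodes `MAP_FIXED = 0x10`, uses hypothetical helpers (`image_file`, `map_image`), and maps the pages read-only as data rather than executing them. It shows that a second `mmap` at the same address swaps the contents seen there; it says nothing about atomicity with respect to threads running the code.

```python
import ctypes
import mmap
import tempfile

PAGE = mmap.PAGESIZE
MAP_FIXED = 0x10  # assumption: hardcoded Linux value

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
MAP_FAILED = ctypes.c_void_p(-1).value


def image_file(slot: bytes):
    """Backing file holding one page-sized 'code image' (hypothetical helper)."""
    f = tempfile.TemporaryFile()
    f.write(slot.ljust(PAGE, b"\xcc"))  # pad the rest of the page with int3
    f.flush()
    return f


def map_image(f, addr=None):
    """Map the image; if `addr` is given, replace whatever is mapped there."""
    flags = mmap.MAP_PRIVATE | (MAP_FIXED if addr is not None else 0)
    # Real code would need PROT_READ | PROT_EXEC; data-only here.
    p = libc.mmap(addr, PAGE, mmap.PROT_READ, flags, f.fileno(), 0)
    if p == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    return p


fast = image_file(bytes([0x0F, 0x1F, 0x00]))    # 3-byte nop in the slot
fenced = image_file(bytes([0x0F, 0xAE, 0xE8]))  # lfence in the slot

base = map_image(fast)
before = ctypes.string_at(base, 3)
map_image(fenced, base)  # remap in place: same address, new contents
after = ctypes.string_at(base, 3)
libc.munmap(base, PAGE)
```

Mapping the fast image back at `base` would switch in the other direction in the same way.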

Using this, we can build a control loop in a separate thread that monitors mispredict counters, and can flip the switch at will for any module that has excessive counts. It doesn't have to be a one-way trapdoor: a module could have a "mispredict quota" per time unit, and could reset to the fast code (no lfences) after a set period. There is no impact on other modules -- it only impacts the module with the mispredicts.
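
One possible shape for that control loop's per-module decision logic, as a hedged Python sketch. The class name `QuotaController`, the `quota` and `cooldown` parameters, and the idea of sampling in fixed windows are all assumptions for illustration; reading the hardware mispredict counters and performing the actual code switch are left out.

```python
from dataclasses import dataclass


@dataclass
class QuotaController:
    quota: int            # allowed mispredicts per sampling window (hypothetical)
    cooldown: int         # windows to stay fenced after the quota is exceeded
    fenced_left: int = 0  # remaining fenced windows

    def sample(self, mispredicts: int) -> str:
        """Feed one window's mispredict count; return the variant to run next."""
        if mispredicts > self.quota:
            self.fenced_left = self.cooldown
        elif self.fenced_left > 0:
            self.fenced_left -= 1
        return "fenced" if self.fenced_left > 0 else "fast"


ctl = QuotaController(quota=1000, cooldown=2)
modes = [ctl.sample(n) for n in (10, 5000, 20, 30, 40)]
# modes == ["fast", "fenced", "fenced", "fast", "fast"]
```

Keeping one controller per module matches the point above that only the module with excessive mispredicts is affected.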

Finally, I suspect this will be a bit harder on non-coherent-icache architectures (aarch64, riscv64), but actually maybe the "mmap a new thing on top of running code" is enough of a jolt to yoink all other cores into coherent happiness again. Note that I haven't tested that!

Wasmtime GitHub notifications bot (Mar 18 2024 at 21:06):

cfallin edited issue #8175:

Wasmtime GitHub notifications bot (Mar 18 2024 at 21:11):

cfallin commented on issue #8175:

One slight tweak: those three nops need to be one 3-byte nop for the atomic "store" of the new code with re-mmap to work safely; otherwise RIP might be right in the middle of where the lfence is about to spontaneously appear.
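
To make the hazard concrete, here is a small Python sketch that compares instruction boundaries inside the 3-byte slot before and after the swap. The helper names are made up for the example; the encodings are the standard ones (`90` for a 1-byte nop, `0F 1F 00` for the 3-byte `nop dword ptr [rax]`, `0F AE E8` for `lfence`).

```python
LFENCE = bytes([0x0F, 0xAE, 0xE8])
THREE_NOPS = [bytes([0x90])] * 3        # three 1-byte nops
ONE_NOP3 = [bytes([0x0F, 0x1F, 0x00])]  # one 3-byte nop


def boundaries(instrs):
    """Slot offsets where an instruction starts, plus the offset just past the slot."""
    offs, pos = set(), 0
    for ins in instrs:
        offs.add(pos)
        pos += len(ins)
    offs.add(pos)
    return offs


def unsafe_rips(old_instrs, new_instrs):
    """Offsets where a thread may be paused in the old code that land mid-instruction in the new code."""
    return sorted(boundaries(old_instrs) - boundaries(new_instrs))


with_single_byte_nops = unsafe_rips(THREE_NOPS, [LFENCE])  # [1, 2]
with_three_byte_nop = unsafe_rips(ONE_NOP3, [LFENCE])      # []
```

With three separate nops, a thread whose RIP is at offset 1 or 2 would resume in the middle of the new `lfence`; with the single 3-byte nop there is no such offset.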

Wasmtime GitHub notifications bot (Mar 18 2024 at 21:48):

sunfishcode commented on issue #8175:

Is a single mmap that spans multiple pages guaranteed to be entirely atomic?

Wasmtime GitHub notifications bot (Mar 18 2024 at 21:53):

cfallin commented on issue #8175:

I think so? At the very least, in the Linux implementation, the memory-map changes are made under one lock, and one IPI is performed to other cores if needed; it'd be neat to find something in the POSIX spec either way to cite though.

Wasmtime GitHub notifications bot (Mar 18 2024 at 22:32):

bjorn3 commented on issue #8175:

Even with low amounts of mispredicted branches it would be possible to (slowly) leak data, right?

Wasmtime GitHub notifications bot (Mar 19 2024 at 02:37):

cfallin commented on issue #8175:

The idea is that one would set the quota according to the desired probability (leak bit-rate bound). I haven't thought too much about the control algorithm here but perhaps one puts a module in "non-speculative mode" for the remaining duration of any individual instance alive at the time of the heightened branch mispredict rate (one could implement this with epochs, labeling instances at startup and keeping a count of active instances in epoch N-1 and N). Or something like that.
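
A minimal Python sketch of that epoch idea, under the assumption that each instance is labeled with the current epoch at startup and that a mispredict spike opens a new epoch. The class and method names are hypothetical, and this generalizes the "N-1 and N" counts to a dictionary for simplicity.

```python
class EpochTracker:
    def __init__(self):
        self.epoch = 0
        self.live = {0: 0}  # epoch label -> number of live instances
        self.tainted = -1   # highest epoch that was alive during a spike

    def start_instance(self) -> int:
        self.live[self.epoch] += 1
        return self.epoch   # label to store with the instance

    def end_instance(self, label: int) -> None:
        self.live[label] -= 1

    def on_spike(self) -> None:
        """Mispredict rate exceeded: taint everything alive now, open a new epoch."""
        self.tainted = self.epoch
        self.epoch += 1
        self.live[self.epoch] = 0

    def must_fence(self) -> bool:
        """Stay non-speculative while any instance from a tainted epoch is alive."""
        return any(self.live[e] > 0 for e in range(self.tainted + 1))


t = EpochTracker()
a = t.start_instance()   # labeled epoch 0
t.on_spike()
b = t.start_instance()   # labeled epoch 1
during = t.must_fence()  # True: instance `a` is still alive
t.end_instance(a)
after = t.must_fence()   # False: only post-spike instances remain
```

Since the code is shared per module, instance `b` also runs the fenced code while `a` is alive; the tracker only decides when the module may return to the fast variant.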

I should also note that this can be layered with existing mitigations: so e.g. any explicit bounds checks are protected already (cannot read others' heaps even in misspeculation) and this technique is mainly to address the "indirect branches can jump anywhere and find a read gadget" problem, which itself should have a lower effective bit-rate...

Wasmtime GitHub notifications bot (Mar 19 2024 at 06:25):

cfallin commented on issue #8175:

I just experimented a bit with this idea by writing a little program that mmaps two assembly routines over the top of each other -- identical except for LFENCEs vs. 3-byte NOPs -- while running, and observing the effective timing difference. (The second thread can actually mmap back and forth with different duty cycles, and one can observe the runtime changing smoothly as the amount of speculation varies -- a very weird sort of PWM.) Here is the gist. Note that this doesn't verify the page-crossing behavior (the little snippet lives on one page); it just shows that the remap-it-live action does work.
