wasmtime / Issue #2344 Wasm native stack limit protection... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #2344 Wasm native stack limit protection...

Wasmtime GitHub notifications bot (Nov 01 2020 at 22:45):

alexcrichton opened Issue #2344:

Recently we've received a number of fuzz bugs all about stack overflows in native code. Each fuzz bug has a module which contains an infinitely recursive set of functions when instantiated/called, and the stack overflow happens in the native code of Wasmtime, and the exact native code changes depending on what the module is doing. This situation is supposed to be mitigated by limiting the amount of stack space wasm is allowed to take. After some investigation, however, this limiting isn't happening.

When ASan is run with ASAN_OPTIONS="detect_stack_use_after_return=1" then it allocates a "shadow stack" for each function frame. This means that our guess of what the stack pointer is is actually pointing into the shadow stack area instead of the real native stack. This means that we install a stack limit very far away from the actual stack itself, rendering the stack limit useless since stack overflow is hit long before we get to the shadow stack.

To be clear I don't think this is necessarily a but in Wasmtime, this only comes up with fuzzing under AddressSanitizer. That being said that fuzz mode is very important and we don't want to lose that! I think there's a few possible fixes we could do in this situation:

Disable fuzzing with detect_stack_use_after_return. From what I can tell I think this is for having better error reports, and it may not be that critical for Wasmtime, especially with how rare use-after-return is in Rust. That being said I would prefer to take this route as a last resort.

Fix our "educated guess" about what the stack pointer is. This would probably involve us compiling an asm stub which only says mov %rsp, %rax; ret or something like that. This I think should work since it won't be instrumented by ASan and point into the shadow stack. It's a bummer to maintain custom assembly, however.

Refactor how we account for native stack space in wasm modules. Right now this is a usize where if %rsp is ever beneath it we trap. Instead we could change it to a usize of "you have this many bytes left" where each function subtracts and traps on overflow. Additionally, interrupts, which use this same flag, would simply set the remaining bytes to a negative number so the next comparison traps.

I would personally lean towards the third option since I think it also reflects more accurately how much stack space wasm can use. It means we don't ever need the stack pointer itself in native code and interleaving small wasm stacks with larger native stacks will actually work, where it doesn't work today since wasm isn't allowed to go beyond the maximum stack size of the first wasm module on the stack.

I'm curious what others think though!

Wasmtime GitHub notifications bot (Nov 02 2020 at 06:05):

cfallin commented on Issue #2344:

+1 to option 3; getting and reasoning about the stack pointer directly (at least in a platform-independent way by taking the address of an on-stack object...) seems quite fragile, whereas a countdown is completely portable.

Also, if I understand correctly, this serves as a resource limit, but we still presumably have guard page(s) to protect against e.g. overflow in native code?

Wasmtime GitHub notifications bot (Nov 02 2020 at 17:38):

alexcrichton commented on Issue #2344:

Right yeah we should always have a guard page, and that's what's being hit on CI where native code hits the guard page. We'd just rather not have native code hit the guard page for misbehaving wasm code!

Wasmtime GitHub notifications bot (Nov 02 2020 at 23:19):

alexcrichton commented on Issue #2344:

Hm ok so I dug into this today, and I almost got it working. I realized, however, that option 3 works well for stack overflow but does not work well for interrupting execution of wasm modules. Currently, in an effort to make the function prologue efficient, we do both checks in one by updating the stack limit dynamically to kill any functions that later check the stack. This technique doesn't work when the functions themselves are updating the stack limit as well, since there's no synchronization with each function call.

My best idea of how to handle this is that instead of this:
sub $stack_size, (%stack_limit)
jns ok
ud2
ok:
...
we instead do something like this:
movq (%stack_limit), %64_bit_scratch
sub $stack_limit, %64_bit_scratch
jns ok
ud2
ok:
movl %64_bit_scratch, (%stack_limit)
...
where the idea is that we load 64-bits but only store 32-bits. That way the upper 32-bits can in theory be atomically modified since we're only updating the lower 32-bits on each frame. This would also cap native stack sizes for wasm at 4G, but that seems pretty reasonable.

Is that a workable instruction sequence, though? That seems wonky to load both with one 64-bit load and expect that we'll just happen to see remote updates of the upper 32-bits, but I think that at least for x86 it should work?

Wasmtime GitHub notifications bot (Nov 02 2020 at 23:42):

cfallin commented on Issue #2344:

That's a clever solution! I think that should work, but (puts on computer-architect hat) I think the performance might not be great on modern out-of-order machines...

... specifically, we have to worry about partial-store stalls in the load/store queue. In particular, if we have one of these sequences in each prologue, and we have call-heavy code, we could have a number of prologues in-flight in the instruction window; each of the 64-bit loads will stall (probably until the store retires?) because the half-forward, half-not-forward case is generally not in the happy path.

I think we can actually just have a separate interrupt-flag and do the two checks in four instructions, i.e. no more expensive that the second sequence above:
  subq $imm, stack_limit_field(%vmctx)
  js bad
  cmp $0, interrupt_field(%vmctx)
  jne bad
  ...

bad:
  ud2
(I think it is probably better to put the ud2 at the end of the function as well -- it's cold, so we don't want to burn icache space, and it'll be absorbed into alignment padding with 15/16 probability anyway)

Thoughts?

Wasmtime GitHub notifications bot (Nov 02 2020 at 23:47):

cfallin edited a comment on Issue #2344:

That's a clever solution! I think that should work, but (puts on computer-architect hat) I think the performance might not be great on modern out-of-order machines...

... specifically, we have to worry about partial-store stalls in the load/store queue. In particular, if we have one of these sequences in each prologue, and we have call-heavy code, we could have a number of prologues in-flight in the instruction window; each of the 64-bit loads will stall (probably until the store retires?) because the half-forward, half-not-forward case is generally not in the happy path.

I think we can actually just have a separate interrupt-flag and do the two checks in four instructions, i.e. no more expensive than the second sequence above:
  subq $imm, stack_limit_field(%vmctx)
  js bad
  cmp $0, interrupt_field(%vmctx)
  jne bad
  ...

bad:
  ud2
(I think it is probably better to put the ud2 at the end of the function as well -- it's cold, so we don't want to burn icache space, and it'll be absorbed into alignment padding with 15/16 probability anyway)

Thoughts?

Wasmtime GitHub notifications bot (Nov 03 2020 at 00:50):

alexcrichton commented on Issue #2344:

Ok this took way too long to dawn on me, but it turns out if you subtract at the start you also need to add back in the allocation at the end. This means that if we want to have a dynamically updated stack limit we need to instrument all exits from the function as well as the prologue.

That I think places this strategy into a realm of "maybe not as feasible as originally though" and makes me lean towards option 2 above where we just implement a way to get the real and actual stack pointer.

Wasmtime GitHub notifications bot (Nov 04 2020 at 15:29):

alexcrichton commented on Issue #2344:

Ok I've just gone ahead and made a small patch to use psm at https://github.com/bytecodealliance/wasmtime/pull/2358

Wasmtime GitHub notifications bot (Nov 05 2020 at 13:29):

yurydelendik closed Issue #2344:

Recently we've received a number of fuzz bugs all about stack overflows in native code. Each fuzz bug has a module which contains an infinitely recursive set of functions when instantiated/called, and the stack overflow happens in the native code of Wasmtime, and the exact native code changes depending on what the module is doing. This situation is supposed to be mitigated by limiting the amount of stack space wasm is allowed to take. After some investigation, however, this limiting isn't happening.

When ASan is run with ASAN_OPTIONS="detect_stack_use_after_return=1" then it allocates a "shadow stack" for each function frame. This means that our guess of what the stack pointer is is actually pointing into the shadow stack area instead of the real native stack. This means that we install a stack limit very far away from the actual stack itself, rendering the stack limit useless since stack overflow is hit long before we get to the shadow stack.

To be clear I don't think this is necessarily a but in Wasmtime, this only comes up with fuzzing under AddressSanitizer. That being said that fuzz mode is very important and we don't want to lose that! I think there's a few possible fixes we could do in this situation:

Disable fuzzing with detect_stack_use_after_return. From what I can tell I think this is for having better error reports, and it may not be that critical for Wasmtime, especially with how rare use-after-return is in Rust. That being said I would prefer to take this route as a last resort.

Fix our "educated guess" about what the stack pointer is. This would probably involve us compiling an asm stub which only says mov %rsp, %rax; ret or something like that. This I think should work since it won't be instrumented by ASan and point into the shadow stack. It's a bummer to maintain custom assembly, however.

Refactor how we account for native stack space in wasm modules. Right now this is a usize where if %rsp is ever beneath it we trap. Instead we could change it to a usize of "you have this many bytes left" where each function subtracts and traps on overflow. Additionally, interrupts, which use this same flag, would simply set the remaining bytes to a negative number so the next comparison traps.

I would personally lean towards the third option since I think it also reflects more accurately how much stack space wasm can use. It means we don't ever need the stack pointer itself in native code and interleaving small wasm stacks with larger native stacks will actually work, where it doesn't work today since wasm isn't allowed to go beyond the maximum stack size of the first wasm module on the stack.

I'm curious what others think though!

Last updated: Apr 17 2025 at 10:03 UTC