Hello, would wasmtime be open to an option that disables the use of signals for implementing traps (e.g. the unreachable wasm instruction generating a SIGILL)? We run in an environment that has some complex interactions with signal handlers and signal blocking/unblocking. There are cases where I'm running into the process aborting or hanging due to these signals, and I was wondering if a config option would be something the project would consider.
set these options to zero and you should only use explicit bounds checks for memories:
(instead of relying on signals)
Note that we still rely on signals for floating point errors on ISAs that support that (x86 does, aarch64 doesn't, for example)
Currently we don't have a codegen option to not rely on this, it seems; I think we might have in the past and it was removed (someone would have to do some more digging here)
(this is for div-by-0 at least; we seem to still have explicit checks for INT_MIN / -1)
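To make the distinction above concrete, here is a minimal sketch of what "explicit checks" mean for signed division: both trapping conditions are tested before the divide, so the hardware instruction can never fault. The `Trap` enum and `div_s_checked` function are hypothetical names for illustration, not Wasmtime or Cranelift APIs.

```rust
/// Hypothetical trap codes for this sketch (not Wasmtime's actual types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trap {
    IntegerDivisionByZero,
    IntegerOverflow,
}

/// Signed 32-bit division with both trapping conditions checked explicitly,
/// so no SIGFPE (or any other signal) is needed to detect them.
pub fn div_s_checked(lhs: i32, rhs: i32) -> Result<i32, Trap> {
    // Explicit divide-by-zero check (the case that currently relies on a signal on x86).
    if rhs == 0 {
        return Err(Trap::IntegerDivisionByZero);
    }
    // Explicit INT_MIN / -1 check: the quotient does not fit in an i32.
    if lhs == i32::MIN && rhs == -1 {
        return Err(Trap::IntegerOverflow);
    }
    // Safe to divide: neither trapping condition holds.
    Ok(lhs / rhs)
}
```

In generated machine code the `Err` branches would be jumps to a trap site rather than returned values; the sketch only shows the order of the checks.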
Thank you, yes I'm thinking about the SIGFPE for division by zero on x86 and not using ud2 and similar instructions for unreachable.
Would having a codegen option to not rely on this be something that would be considered?
I think it's reasonable to at least consider. One thing we always worry about with more configuration options is the testing and maintenance overhead; but if this is localized to a few operations (div, rem) maybe it's not so bad.
Getting from a ud2 to the trap handler may be a bit trickier though: right now our "explicit checks" still rely on that opcode to exit the Wasm. We'd need an alternate mechanism (e.g. a jump to some address that we patch in, or provide in the vmctx) for this to work.
If we do have this mode, we'd want to test it with an integration test that sets a new option on Engine to not register any signal handlers, then run either the Wasm test suite or at least specific tests we know to rely on trapping.
Especially given the difficulties we've had with signal handling on macOS, I think this could have real value; it's not just a "support some weird use-case we've never heard of" sort of PR. But starting the discussion here and then sketching it out in an issue and maybe a prototype to show the extent of the changes would help us decide for sure
I think many of our remaining limitations in Cranelift's test suite and fuzzing are due to not being able to handle traps, so maybe we'd want to use this mode for all Cranelift testing? That would help ensure that it's well-exercised.
Understand completely on the maintenance + testing bits.
Getting from a ud2 to the trap handler may be a bit trickier though: right now our "explicit checks" still rely on that opcode to exit the Wasm. We'd need an alternate mechanism (e.g. a jump to some address that we patch in, or provide in the vmctx) for this to work.
Yes exactly, I'm thinking the easiest thing is to always jump to a predefined function, we could pass in a parameter for which trap. It would increase the size of the generated code, but not by much?
If we do have this mode, we'd want to test it with an integration test that sets a new option on Engine to not register any signal handlers, then run either the Wasm test suite or at least specific tests we know to rely on trapping.
You can also just use pthread_sigmask to block that signal from being handled by the thread (assuming tests are single threaded).
I think this could have real value
That's good to know! It seems there are a bunch of places that use ud2 from a quick code search so I assume this would be a largish change...
Three things to consider with the code to replace ud2:
(i) we have a bunch of them; disassembly of some functions will show a whole stream of ud2 ops at the end of the function, each a specific trap-point. ud2 is two bytes (0x0f 0x0b); we'd have to take some inflation but every byte counts here. So e.g. jmp *offset(%rN) where %rN is the register holding vmctx is 8 bytes, but that's better than a full callsite with moves into registers, a call, and cleanup. (We also can't represent "non-returning callsite" and optimize based on that currently.)
(ii) we try to avoid having any relocations in the code (i.e., emit PIC where possible): this makes loading precompiled .cwasms much faster, as it lets us mmap straight from disk. So we'd want to go through a pointer in vmctx.
(iii) we can see the return address if we jump to a trampoline, so we could just use that; no need to pass other args. Basically we want a "fake ud2" that jumps to a little handwritten assembly trampoline that then calls into the runtime and never returns.
disassembly of some functions will show a whole stream of ud2 ops at the end of the function, each a specific trap-point.
I don't quite follow here - are you saying that cranelift injects a stream of ud2 ops at the end of a function? Or that upstream wasm toolchains (e.g. clang) output unreachable instructions in batches at the end of functions? If cranelift is doing this - why? alignment?
Just to make sure I follow, the pointer in vmctx would point to the "fake ud2" right? And the assembly trampoline essentially reads the return address, and passes that into the runtime, and the runtime looks up the trapcode for that location?
I don't quite follow here - are you saying that cranelift injects a stream of ud2 ops at the end of a function? Or that upstream wasm toolchains (e.g. clang) outputs unreachable instructions in batches at the end of functions? If cranelift is doing this - why? alignment?
The former; and not as a direct action, but as a consequence of compilation. The IR has a bunch of basic blocks, each of which ends with an unreachable. We compile unreachable as ud2, and we also mark these blocks as cold so they are sunk to the bottom of the function. So the effect is a bunch of ud2s; the identity of the trapsite is determined by the address of the specific one that traps.
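The "sunk to the bottom" effect described above can be sketched as a stable partition of blocks by a cold flag. This is only an illustration of the idea, assuming a hypothetical `Block` type; it is not Cranelift's actual block-layout code.

```rust
/// Hypothetical basic block for this sketch: a name plus a "cold" hint.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Block {
    pub name: &'static str,
    pub cold: bool,
}

/// Lay out blocks so that hot blocks keep their relative order at the top
/// and cold blocks (e.g. ones ending in a trap) end up at the bottom.
pub fn layout(blocks: &[Block]) -> Vec<Block> {
    let mut ordered = blocks.to_vec();
    // `sort_by_key` is a stable sort and `false < true`, so this moves cold
    // blocks after hot ones without reordering within either group.
    ordered.sort_by_key(|b| b.cold);
    ordered
}
```

With several trap blocks in a function, this ordering is what produces the run of trap instructions at the end of the disassembly.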
The pointer in vmctx would point to trampoline code; the "fake ud2" is what we generate instead of ud2 in the function body itself. The address of that code (which becomes the return address in the trampoline) serves as the identity of the particular trap we took, just as the address of the ud2 did.
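The lookup step described above (an address inside the compiled code identifies which trap was taken) could be sketched as a sorted table searched by code offset. `TrapCode`, `TrapTable`, and the idea that the trampoline hands the runtime a raw address are assumptions made for this sketch; the real runtime's data structures may differ.

```rust
/// Hypothetical trap codes for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrapCode {
    UnreachableCodeReached,
    IntegerDivisionByZero,
    HeapOutOfBounds,
}

/// Maps offsets within a compiled code region to the trap they represent.
pub struct TrapTable {
    /// (offset from start of code, trap code), kept sorted by offset.
    entries: Vec<(usize, TrapCode)>,
}

impl TrapTable {
    pub fn new(mut entries: Vec<(usize, TrapCode)>) -> Self {
        entries.sort_by_key(|e| e.0);
        Self { entries }
    }

    /// `code_start` is the base address of the compiled code and `addr` is
    /// the address that identifies the trap site (the faulting PC for a real
    /// ud2, or the address the trampoline observes for a "fake ud2").
    pub fn lookup(&self, code_start: usize, addr: usize) -> Option<TrapCode> {
        // Addresses below the code region cannot be trap sites.
        let offset = addr.checked_sub(code_start)?;
        // Only an exact match counts: each trap site has its own address.
        let idx = self.entries.binary_search_by_key(&offset, |e| e.0).ok()?;
        Some(self.entries[idx].1)
    }
}
```

Because the table is keyed by offset rather than absolute address, it stays valid wherever the code is mapped, which fits the goal of avoiding relocations.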
which ends with an unreachable.
mark these blocks as cold
Would the size bloat of replacing these ud2s with jumps to the fake ud2 be acceptable? Or would there need to be work to optimize away those ud2s? (assuming that is possible)
That's what I was getting at above about 2 vs. 8 bytes; I'm not sure, but I guess the real question is just "is it still reasonable / workable" (since this mode wouldn't be on by default) and it seems likely
(likely to be acceptable that is)
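As a rough back-of-the-envelope for the "2 vs. 8 bytes" question, the extra code size scales linearly with the number of trap sites. The byte counts below are simply the figures quoted in this thread, and the function name is made up for the sketch.

```rust
/// Size of a ud2 instruction, as stated in the thread.
pub const UD2_BYTES: usize = 2;
/// Size of the indirect jump through vmctx, using the figure quoted in the thread.
pub const INDIRECT_JUMP_BYTES: usize = 8;

/// Extra bytes of generated code if every trap site switches from ud2
/// to the indirect jump.
pub fn extra_bytes(trap_sites: usize) -> usize {
    trap_sites * (INDIRECT_JUMP_BYTES - UD2_BYTES)
}
```

For example, a module with 1,000 trap sites would grow by about 6,000 bytes under these assumptions.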
I tried to distill this thread into https://github.com/bytecodealliance/wasmtime/issues/6926 - please feel free to chime in if there is anything incorrect or you have other thoughts. Otherwise it's probably better to consolidate the discussion there? Thanks for this!
Tyler Rockwood has marked this topic as resolved.
Last updated: Dec 23 2024 at 13:07 UTC