alexcrichton commented on issue #400:
Lightbeam was removed in https://github.com/bytecodealliance/wasmtime/pull/3390 as explained in RFC 14, so I'm going to close this.
alexcrichton closed issue #400:
So for months now we've had a miscompilation that is preventing us from integrating Lightbeam into Substrate. What we know about what causes the bug:
- It appears to manifest inside the Rust generated runtime for the `wasm32-unknown-unknown` target
- It appears _not_ to manifest inside any of the testcases created by the fuzzer, but I think that's because the fuzzer isn't very good at generating WebAssembly modules with valid export sections. Even though I wrote a script to export every function with the correct signature from every module in the corpus, it seems like the fuzzer tries to remove the export section relatively quickly.
The bug itself appears to be that something (unknown what) causes the stack pointer differential (i.e. the difference between the stack pointer and the stack pointer that the function started with) tracked by the backend to be misaligned with the actual stack pointer differential, so when we try to emit an instruction to put the stack pointer back to where it should be for the `ret` instruction we add the wrong amount to the pointer and `ret` jumps to a garbage location, which appears to currently always be `0x10`. As far as I know, `0x10` isn't special, it's just that the function just happens to always have that on the stack because the compilation and execution is deterministic on a per-function basis. It's not clear what function is actually causing this, I'm looking into getting `wasmi` or another interpreter to print out a call tree during execution (far better than a statically-generated calltree because of the kooky semantics of `call_indirect`) so that we can do a kind of binary search using `wasm-snip` or a similar tool. Alternatively, we can just implement emitting debuginfo in Lightbeam so that `rr` can print a backtrace. The latter is probably more work, but it's something we should do anyway. Once we have it down to a single function we might be able to get a better idea of what is causing this.
My guess is that it's related to one of the following:
- Our calling convention code with either `br_table` or `br_if`. Both have quite complex code to handle calling conventions, and although both have assertions to ensure that we never accidentally ignore the calling convention of a block, it's possible that something is going wrong with it anyway
- Code related to passing arguments, either to functions or to blocks - they use similar, but not identical, code (although the similarities are extracted into separate functions so there shouldn't be unintended differences). One possible suspect is the code to handle cycles in the register allocator, which can push items to the physical stack. This shouldn't be a problem, as we directly set the stack depth after handling cycles, but it's possible that something is wrong here.
- Any generated control flow that isn't reflected by control flow in the Microwasm. Although the calling conventions of the Microwasm control flow should be pretty well-handled, the internal control flow in `div` and `call_indirect` (plus maybe some other instructions) can lead to complexity where two control paths that end at the same point after converging could change the stack depths by different amounts, with only one of them (or even neither of them!) being correctly reflected in the `stack_depth` field. I did a pass that tried to fix any lingering issues with this a month or two ago, but I don't know for sure if I caught everything or if my fixes were bug-free.
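To make the suspected failure mode concrete, here is a minimal sketch of how a tracked stack depth that misses one push on an internal control path would make `ret` read the wrong slot. This is a toy model and not Lightbeam's actual code: `simulate_ret`, `RETURN_ADDRESS`, `tracked_depth` and the pushed values are all hypothetical, and the `0x10` value is chosen only to echo the symptom described above, not to explain where it really comes from.

```rust
/// Hypothetical return address placed on the stack by the caller.
pub const RETURN_ADDRESS: u64 = 0xdead_beef;

/// Toy model of a function body followed by its epilogue and `ret`.
/// Returns the value that `ret` would pop as the return address.
///
/// `take_slow_path` selects an internal control path (think of the
/// branches emitted inside a single `div` or `call_indirect`) that
/// pushes a temporary without updating the tracked depth.
pub fn simulate_ret(take_slow_path: bool) -> u64 {
    // The physical stack; the last element is the top of the stack.
    let mut stack: Vec<u64> = vec![RETURN_ADDRESS];
    // What the backend believes it has pushed since function entry.
    let mut tracked_depth: usize = 0;

    // A push that the backend records correctly.
    stack.push(0x10);
    tracked_depth += 1;

    if take_slow_path {
        // Assumed bug: this push is never reflected in `tracked_depth`.
        stack.push(0x2a);
    }

    // The two paths converge here with the same tracked depth but
    // different actual depths. The epilogue restores the stack
    // pointer using only the tracked value.
    for _ in 0..tracked_depth {
        stack.pop();
    }

    // `ret` pops whatever is now on top and jumps to it.
    stack.pop().expect("stack always holds at least the return address")
}
```

In this model `simulate_ret(false)` yields `RETURN_ADDRESS`, while `simulate_ret(true)` yields the stale `0x10` left on the stack. An assertion that compares tracked depths at the merge point would not fire, since both paths agree on `tracked_depth`; only the real stack differs.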
Last updated: Nov 22 2024 at 16:03 UTC