Stream: git-wasmtime

Topic: wasmtime / Issue #2710 Rework/simplify unwind infrastruct...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2021 at 09:23):

bjorn3 commented on Issue #2710:

You are missing debuginfo for the mov %rsp,%rbp instruction in the prologue.

Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2021 at 09:30):

bjorn3 edited a comment on Issue #2710:

You are missing unwindinfo for the mov %rsp,%rbp instruction in the prologue.

Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2021 at 01:05):

cfallin commented on Issue #2710:

You are missing unwindinfo for the mov %rsp,%rbp instruction in the prologue.

Can you be more specific about what you believe is missing?

The unwind directives in the Windows case are: PushRegister with rbp (this corresponds with the push rbp); SetFPReg with rbp (this corresponds with the mov rbp, rsp); StackAlloc if a frame is allocated; and a SaveRegister for each clobber-save. (Code here.) The CFI directives in the SysV case are: CfaOffset and Offset for the push rbp, indicating that the CFA is at RSP+16 and that RBP is saved at CFA-16; Offset for LR on aarch64 with CFA-8; CfaRegister with RBP on the mov rbp, rsp indicating that the CFA is now defined in terms of rbp; and Offset for each saved clobbered register. (Code here.)

In the latter case (SysV), I looked at the output of gcc -g -S and clang -g -S to match the CFI directives they were generating. On all platforms tests are passing (and a number of unit tests actually exercise unwind, as I can attest from debugging!), so it would seem that whatever directives that are needed are present.

But if there's something non-idiomatic about the output in some case, then I'm happy to fix it!

Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)

That's interesting. I expect that we'll need to tackle landingpads at some point, driven at least by Wasm exceptions (when that proposal advances) as well as by other use-cases like this one. The complexity around them, though (correctly modeling control-flow, including during optimizations) is something we would need to carefully study, and fully asynchronous exceptions would also require CFI info for epilogues (we don't generate this today because we never unwind in the middle of an epilogue -- there's simply no way for the CLIF to insert any user action there, and there are no asynchronous interrupts). So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2021 at 07:09):

bjorn3 commented on Issue #2710:

Can you be more specific about what you believe is missing?

I only saw the push rbp and sub rsp, .... instructions have a corresponding unwindinfo entry. In addition if I stepped to the mov rbp, rsp there was an extra ??? frame in the backtrace produced by gdb just below the top frame.

So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2021 at 07:39):

cfallin commented on Issue #2710:

Can you be more specific about what you believe is missing?

I only saw the push rbp and sub rsp, .... instructions have a corresponding unwindinfo entry. In addition if I stepped to the mov rbp, rsp there was an extra ??? frame in the backtrace produced by gdb just below the top frame.

Hmm. Possibly this has to do with the placement of the unwindinfo opcodes; the Push and SetFP are both after the push rbp / mov rbp, rsp sequence, whereas if a debugger is stopped in between the two, it should see the Push but not the SetFP. I'll fix that tomorrow and try stepping through it. Thanks for the report!

(I did check with clang on Msys/MinGW just now and the unwind opcodes we're generating are exactly the same as what it generates for its prologue, modulo the above placement issue, so I hope this does it...)

So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.

Ah, I see; if anything it will be easier with a more direct approach, I think. Most of the work for landingpads will be dealing with the control-flow implications in the rest of the compiler; adding the exception handler info to the tables will be relatively straightforward, I think.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 08 2021 at 23:12):

cfallin commented on Issue #2710:

@bjorn3, I've updated and verified now that stepping through instruction-by-instruction with gdb, the backtrace/unwind is precise at every program point. I went ahead and added epilogue info too since this was not much more work. (Note also that the old impl was missing epilogue info because it was tricky to work out exactly where the epilogues were and plumb this through to the right place.)

I also ripped out the old SysV unwind infra in the new-x64 and aarch64 backends; this was dead code, replaced by this unified new implementation, but I had not removed it previously. Happy to see diff-stats show net-negative lines of code after doing that even with Fastcall and epilogues added!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 09 2021 at 20:09):

cfallin commented on Issue #2710:

Updated again; I realized that we need a few more ops around mid-function epilogues (RememberState / RestoreState) because the CFA parser doesn't trace control flow, but rather just scans the function linearly.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 11 2021 at 02:52):

cfallin commented on Issue #2710:

So it turns out that the epilogue CFA info was causing segfaults on CI but not locally, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the mov rsp, rbp / pop rbp / ret sequence, or if one wants to instruction-step through that sequence and have gdb print a nice backtrace while in the middle), and since the current state of both the new x64 backend and the aarch64 backend is that no epilogue CFA info is provided, I've gone ahead and re-simplified to provide only prologue info.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 11 2021 at 02:53):

cfallin edited a comment on Issue #2710:

So it turns out that the epilogue CFA info was causing segfaults on CI but not locally when switching to the new backend in #2718, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the mov rsp, rbp / pop rbp / ret sequence, or if one wants to instruction-step through that sequence and have gdb print a nice backtrace while in the middle), and since the current state of both the new x64 backend and the aarch64 backend is that no epilogue CFA info is provided, I've gone ahead and re-simplified to provide only prologue info.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 11 2021 at 17:23):

cfallin commented on Issue #2710:

@alexcrichton would you be willing to review this given that you took a look above already?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 12 2021 at 00:07):

cfallin commented on Issue #2710:

Updated based on comments; thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 12 2021 at 04:09):

cfallin commented on Issue #2710:

@sunfishcode it looks like there might be an intermittent test with wasi_cap_std_sync on Windows, namely poll_oneoff_stdio -- see logs. Watching the latest run to see if it fares better this time...

view this post on Zulip Wasmtime GitHub notifications bot (Apr 13 2021 at 08:49):

uweigand commented on Issue #2710:

Hi @cfallin, I finally got around to updating my s390x patch to this new unwind method. This ran into a couple of problems, so I'm wondering whether I'm on the right track here. I'd appreciate your input ...

  1. Frame size definition. The total_frame_size value returned form frame_size is documented to equal the "distance between FP and nominal SP", and users (even outside of my new back-end) appear to rely on that fact, e.g. it is ultimately used to implement the get_nominal_sp_to_fp routine used in spillslots_to_stack_map. However, with the new approach, the stack space used for clobbers (clobber_size) now lies between the nominal SP and the FP, so I believe this value now needs to be added to total_frame_size. (At least in the non-baldrdash case. I'm not really familiar with how baldrdash works, in particular in the new approach ...)
  2. Accumulate outgoing args. We've added this feature a while back since I'm using it in my back-end. This patch completely removed the feature again, breaking my back-end. Would you accept a patch adding it back in the new approach?
  3. Operation without frame pointer. I don't really need or want a frame pointer, but in the new approach I'm basically forced to define a FP register. Would it be OK to add support for non-FP operation? One option would be to make the FP register optional (like the LR currently is) and update the handling of UnwindInst::PushFrameRegs and UnwindInst::DefineNewFrame if no FP is present -- this is what I've implemented so far. Another option would be to define a new UnwindInst::AllocStack that I could use instead of those two if there is no FP.
  4. Negative offsets to UnwindInst::SaveReg. In particular in the absence of a FP (where I'm not using DefineNewFrame), the offsets to SaveReg are more naturally expressed relative to the incoming SP (or CFA). But then they will be negative, which makes it a bit awkward to use the current interface which defines the offsets as u32.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 13 2021 at 17:54):

cfallin commented on Issue #2710:

Hi @uweigand -- thanks for the efforts to update and sorry about the churn!

I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).

In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/

To the specific questions:

Frame size definition ... [add clobber_size to nominal-SP-to-FP distance]

Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using abi_impl.rs or implementing the ABICaller / ABICallee traits directly?

Accumulate outgoing args

Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?

Operation without frame pointer

We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add AllocStack as you suggest and then omit PushFrameRegs. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.

Negative offsets to UnwindInst::SaveReg

To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make DefineNewFrame work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?

(This is useful pathfinding for our eventual omit-FP optimization on other architectures too, by the way, so thank you for probing these issues now!)

view this post on Zulip Wasmtime GitHub notifications bot (Apr 14 2021 at 12:22):

uweigand commented on Issue #2710:

Hi @uweigand -- thanks for the efforts to update and sorry about the churn!

No problem, thanks for the reply!

I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).

In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/

I agree, in particular because there now also seems to be renewed interest in s390x support in wasmtime due to additional use cases. I'll try to get things ready for an initial merge soon, hopefully within the next few weeks.

To the specific questions:

Frame size definition ... [add clobber_size to nominal-SP-to-FP distance]

Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using abi_impl.rs or implementing the ABICaller / ABICallee traits directly?

I'm using abi_impl.rs just like x64 and aarch64 do. I had thought that the problem with the get_nominal_sp_to_fp routine should also occur on those platforms, but at a second glance, it probably does not: the only user of this routine in common code is
spillslots_to_stack_map, but this uses the value only as an upper bound of the stack map size. Given that there are no stack slots in the clobber area, for this particular use case the difference probably does not matter.

On my platform, I have another use case in target-specific code. This again has to do with the fact that I am not using a frame pointer register, and therefore the code generator has to rewrite accesses to the incoming argument slots (which are expressed by common code via offsets relative to the FP) in terms of offsets relative to the (nominal) SP. This is done by simply adding the nominal_sp_to_fp offset, which used to work fine, but has now stopped working.

So we can either fix the frame_size implementation so that nominal_sp_to_fp again returns the actual offset between those two points in the stack, or else I'll have to find some other way to pass the correct offset including clobbers to the back-end codegen (e.g. via a pseudo instruction).

In case you agree with changing frame_size, here's a PR to do so: https://github.com/bytecodealliance/wasmtime/pull/2836

Accumulate outgoing args

Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?

Here's a PR to add this support back: https://github.com/bytecodealliance/wasmtime/pull/2837

I'd be happy to implement this solely in the back-end, but the main problem I was having with this is that there does not appear to be any back-end specific (mutable) state associated with an ABICallee. I need this state to track the required outgoing argument space as we go through the whole function and lower it; each call instruction that is found will have to update the required size accordingly. Finally, in the prologue and epilogue code generator I need to access that accumulated value so I can alloc/dealloc the correct amount of extra stack space.

Without the common code implementation, do you have a suggestion where I should hold this extra state?

Operation without frame pointer

We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add AllocStack as you suggest and then omit PushFrameRegs. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.

Negative offsets to UnwindInst::SaveReg

To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make DefineNewFrame work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?

Here's a PR to add support for operation without frame pointer along the lines suggested above: https://github.com/bytecodealliance/wasmtime/pull/2838

In particular, this extends DefineNewFrame as you suggest for FP-less operation, and adds a new StackAlloc operation. With those two, I can define the unwind information in my backend in a pretty straightforward manner.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 14 2021 at 18:03):

cfallin commented on Issue #2710:

The PRs all look good -- will land all three. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Apr 14 2021 at 18:46):

uweigand commented on Issue #2710:

Thank you!


Last updated: Dec 23 2024 at 12:05 UTC