bjorn3 commented on Issue #2710:
You are missing debuginfo for the
mov %rsp,%rbp
instruction in the prologue.Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)
bjorn3 edited a comment on Issue #2710:
You are missing unwindinfo for the
mov %rsp,%rbp
instruction in the prologue.Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)
cfallin commented on Issue #2710:
You are missing unwindinfo for the
mov %rsp,%rbp
instruction in the prologue.Can you be more specific about what you believe is missing?
The unwind directives in the Windows case are:
PushRegister
withrbp
(this corresponds with thepush rbp
);SetFPReg
withrbp
(this corresponds with themov rbp, rsp
);StackAlloc
if a frame is allocated; and aSaveRegister
for each clobber-save. (Code here.) The CFI directives in the SysV case are:CfaOffset
andOffset
for thepush rbp
, indicating that the CFA is at RSP+16 and that RBP is saved atCFA-16
;Offset
forLR
on aarch64 withCFA-8
;CfaRegister
withRBP
on themov rbp, rsp
indicating that the CFA is now defined in terms of rbp; andOffset
for each saved clobbered register. (Code here.)In the latter case (SysV), I looked at the output of
gcc -g -S
andclang -g -S
to match the CFI directives they were generating. On all platforms tests are passing (and a number of unit tests actually exercise unwind, as I can attest from debugging!), so it would seem that whatever directives that are needed are present.But if there's something non-idiomatic about the output in some case, then I'm happy to fix it!
Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)
That's interesting. I expect that we'll need to tackle landingpads at some point, driven at least by Wasm exceptions (when that proposal advances) as well as by other use-cases like this one. The complexity around them, though (correctly modeling control-flow, including during optimizations) is something we would need to carefully study, and fully asynchronous exceptions would also require CFI info for epilogues (we don't generate this today because we never unwind in the middle of an epilogue -- there's simply no way for the CLIF to insert any user action there, and there are no asynchronous interrupts). So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.
bjorn3 commented on Issue #2710:
Can you be more specific about what you believe is missing?
I only saw the
push rbp
andsub rsp, ....
instructions have a corresponding unwindinfo entry. In addition if I stepped to themov rbp, rsp
there was an extra???
frame in the backtrace produced by gdb just below the top frame.So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.
I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.
cfallin commented on Issue #2710:
Can you be more specific about what you believe is missing?
I only saw the
push rbp
andsub rsp, ....
instructions have a corresponding unwindinfo entry. In addition if I stepped to themov rbp, rsp
there was an extra???
frame in the backtrace produced by gdb just below the top frame.Hmm. Possibly this has to do with the placement of the unwindinfo opcodes; the Push and SetFP are both after the
push rbp / mov rbp, rsp
sequence, whereas if a debugger is stopped in between the two, it should see the Push but not the SetFP. I'll fix that tomorrow and try stepping through it. Thanks for the report!(I did check with clang on Msys/MinGW just now and the unwind opcodes we're generating are exactly the same as what it generates for its prologue, modulo the above placement issue, so I hope this does it...)
So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.
I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.
Ah, I see; if anything it will be easier with a more direct approach, I think. Most of the work for landingpads will be dealing with the control-flow implications in the rest of the compiler; adding the exception handler info to the tables will be relatively straightforward, I think.
cfallin commented on Issue #2710:
@bjorn3, I've updated and verified now that stepping through instruction-by-instruction with
gdb
, the backtrace/unwind is precise at every program point. I went ahead and added epilogue info too since this was not much more work. (Note also that the old impl was missing epilogue info because it was tricky to work out exactly where the epilogues were and plumb this through to the right place.)I also ripped out the old SysV unwind infra in the new-x64 and aarch64 backends; this was dead code, replaced by this unified new implementation, but I had not removed it previously. Happy to see diff-stats show net-negative lines of code after doing that even with Fastcall and epilogues added!
cfallin commented on Issue #2710:
Updated again; I realized that we need a few more ops around mid-function epilogues (
RememberState
/RestoreState
) because the CFA parser doesn't trace control flow, but rather just scans the function linearly.
cfallin commented on Issue #2710:
So it turns out that the epilogue CFA info was causing segfaults on CI but not locally, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the
mov rsp, rbp / pop rbp / ret
sequence, or if one wants to instruction-step through that sequence and have gdb print a nice backtrace while in the middle), and since the current state of both the new x64 backend and the aarch64 backend is that no epilogue CFA info is provided, I've gone ahead and re-simplified to provide only prologue info.
cfallin edited a comment on Issue #2710:
So it turns out that the epilogue CFA info was causing segfaults on CI but not locally when switching to the new backend in #2718, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the
mov rsp, rbp / pop rbp / ret
sequence, or if one wants to instruction-step through that sequence and have gdb print a nice backtrace while in the middle), and since the current state of both the new x64 backend and the aarch64 backend is that no epilogue CFA info is provided, I've gone ahead and re-simplified to provide only prologue info.
cfallin commented on Issue #2710:
@alexcrichton would you be willing to review this given that you took a look above already?
cfallin commented on Issue #2710:
Updated based on comments; thanks!
cfallin commented on Issue #2710:
@sunfishcode it looks like there might be an intermittent test with
wasi_cap_std_sync
on Windows, namelypoll_oneoff_stdio
-- see logs. Watching the latest run to see if it fares better this time...
uweigand commented on Issue #2710:
Hi @cfallin, I finally got around to updating my s390x patch to this new unwind method. This ran into a couple of problems, so I'm wondering whether I'm on the right track here. I'd appreciate your input ...
- Frame size definition. The
total_frame_size
value returned formframe_size
is documented to equal the "distance between FP and nominal SP", and users (even outside of my new back-end) appear to rely on that fact, e.g. it is ultimately used to implement theget_nominal_sp_to_fp
routine used inspillslots_to_stack_map
. However, with the new approach, the stack space used for clobbers (clobber_size
) now lies between the nominal SP and the FP, so I believe this value now needs to be added tototal_frame_size
. (At least in the non-baldrdash case. I'm not really familiar with how baldrdash works, in particular in the new approach ...)- Accumulate outgoing args. We've added this feature a while back since I'm using it in my back-end. This patch completely removed the feature again, breaking my back-end. Would you accept a patch adding it back in the new approach?
- Operation without frame pointer. I don't really need or want a frame pointer, but in the new approach I'm basically forced to define a FP register. Would it be OK to add support for non-FP operation? One option would be to make the FP register optional (like the LR currently is) and update the handling of
UnwindInst::PushFrameRegs
andUnwindInst::DefineNewFrame
if no FP is present -- this is what I've implemented so far. Another option would be to define a newUnwindInst::AllocStack
that I could use instead of those two if there is no FP.- Negative offsets to
UnwindInst::SaveReg
. In particular in the absence of a FP (where I'm not usingDefineNewFrame
), the offsets toSaveReg
are more naturally expressed relative to the incoming SP (or CFA). But then they will be negative, which makes it a bit awkward to use the current interface which defines the offsets asu32
.
cfallin commented on Issue #2710:
Hi @uweigand -- thanks for the efforts to update and sorry about the churn!
I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).
In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/
To the specific questions:
Frame size definition ... [add
clobber_size
to nominal-SP-to-FP distance]Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using
abi_impl.rs
or implementing theABICaller
/ABICallee
traits directly?Accumulate outgoing args
Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?
Operation without frame pointer
We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add
AllocStack
as you suggest and then omitPushFrameRegs
. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.Negative offsets to
UnwindInst::SaveReg
To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make
DefineNewFrame
work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?(This is useful pathfinding for our eventual omit-FP optimization on other architectures too, by the way, so thank you for probing these issues now!)
uweigand commented on Issue #2710:
Hi @uweigand -- thanks for the efforts to update and sorry about the churn!
No problem, thanks for the reply!
I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).
In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/
I agree, in particular because there now also seems to be renewed interest in s390x support in wasmtime due to additional use cases. I'll try to get things ready for an initial merge soon, hopefully within the next few weeks.
To the specific questions:
Frame size definition ... [add
clobber_size
to nominal-SP-to-FP distance]Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using
abi_impl.rs
or implementing theABICaller
/ABICallee
traits directly?I'm using
abi_impl.rs
just like x64 and aarch64 do. I had thought that the problem with the get_nominal_sp_to_fp routine should also occur on those platforms, but at a second glance, it probably does not: the only user of this routine in common code is
spillslots_to_stack_map, but this uses the value only as an upper bound of the stack map size. Given that there are no stack slots in the clobber area, for this particular use case the difference probably does not matter.On my platform, I have another use case in target-specific code. This again has to do with the fact that I am not using a frame pointer register, and therefore the code generator has to rewrite accesses to the incoming argument slots (which are expressed by common code via offsets relative to the FP) in terms of offsets relative to the (nominal) SP. This is done by simply adding the nominal_sp_to_fp offset, which used to work fine, but has now stopped working.
So we can either fix the frame_size implementation so that nominal_sp_to_fp again returns the actual offset between those two points in the stack, or else I'll have to find some other way to pass the correct offset including clobbers to the back-end codegen (e.g. via a pseudo instruction).
In case you agree with changing frame_size, here's a PR to do so: https://github.com/bytecodealliance/wasmtime/pull/2836
Accumulate outgoing args
Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?
Here's a PR to add this support back: https://github.com/bytecodealliance/wasmtime/pull/2837
I'd be happy to implement this solely in the back-end, but the main problem I was having with this is that there does not appear to be any back-end specific (mutable) state associated with an ABICallee. I need this state to track the required outgoing argument space as we go through the whole function and lower it; each call instruction that is found will have to update the required size accordingly. Finally, in the prologue and epilogue code generator I need to access that accumulated value so I can alloc/dealloc the correct amount of extra stack space.
Without the common code implementation, do you have a suggestion where I should hold this extra state?
Operation without frame pointer
We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add
AllocStack
as you suggest and then omitPushFrameRegs
. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.Negative offsets to
UnwindInst::SaveReg
To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make
DefineNewFrame
work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?Here's a PR to add support for operation without frame pointer along the lines suggested above: https://github.com/bytecodealliance/wasmtime/pull/2838
In particular, this extends DefineNewFrame as you suggest for FP-less operation, and adds a new StackAlloc operation. With those two, I can define the unwind information in my backend in a pretty straightforward manner.
cfallin commented on Issue #2710:
The PRs all look good -- will land all three. Thanks!
uweigand commented on Issue #2710:
Thank you!
Last updated: Nov 22 2024 at 16:03 UTC