In the process of working on https://github.com/bytecodealliance/wasmtime/pull/3180 I'm hitting a bizarre bug that I'm hoping folks from ARM or someone else who's knowledgeable can help out with
As some basic background, we generate .eh_frame tables for JIT code and then register them with libgcc on Linux arm64 machines. This .eh_frame business is meant for unwinding, which is how we capture backtraces of wasm and native frames together. My PR updates when/where the .eh_frame is generated: previously the .eh_frame would be dynamically generated when we load a module into memory, but now I'm updating it such that it's embedded into the module itself, precomputed ahead of time.
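(For illustration only, a minimal sketch of the registration step being described; this isn't wasmtime's actual plumbing, just the shape of handing generated .eh_frame data to libgcc's __register_frame at runtime, with lifetime handling simplified.)

```rust
// Sketch: hand a precomputed .eh_frame to libgcc so its unwinder can find
// frame descriptions for JIT code. Wasmtime's real code has more bookkeeping.
extern "C" {
    // Provided by libgcc: registers .eh_frame unwinding data at runtime.
    fn __register_frame(eh_frame: *const u8);
}

fn register_jit_unwind_info(eh_frame: &'static [u8]) {
    // The bytes must stay valid (and not move) for as long as the JIT code
    // they describe can appear on the stack.
    unsafe { __register_frame(eh_frame.as_ptr()) }
}
```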
The surprising behavior that I'm seeing is that my PR is segfaulting on CI, specifically in the emulated aarch64 builder. This aarch64 builder is running on an x86_64 host and is built with a pinned qemu 6.0.0 version. I cannot reproduce the segfault, however, on a bare-metal aarch64 host (the one that ARM donated to us). Interestingly enough, though, I can reproduce the segfault in qemu running on the arm64 host. That's allowed me to dig in further.
After what was probably too much gdb, I and others have reached the following conclusions:
- If I pass -cpu cortex-{a53,a57,a72} to qemu, the test passes. If I pass -cpu max the test fails. (these were all the cpus my local qemu build supported)
- The segfault happens on an autia1716 instruction.
- I've confirmed in a debugger that the autia1716 instruction is indeed where native execution and emulated execution start to diverge.
Now all of this is somewhat confusing to me for a number of reasons. First off, though, my assumption is that the bare-metal linux server probably has pointer auth turned off entirely, which is why things appear to work for me. I'm assuming that QEMU with -cpu max turns on pointer auth for one reason or another, which is why things fail. Consequently, for cortex-a72 and below I'm assuming it's also turned off, which is why things start passing.
We ran into pointer auth bits for aarch64 support on macOS. Cranelift added support for specifically disabling pointer auth for JIT frames, however: to do this, the FDE for JIT frames specifies that the pseudo-register for pointer auth is set to 0, using DW_OP_lit0.
I can confirm that libgcc is indeed decoding this FDE opcode correctly and registering it. Specifically, this code gets hit, and register 34, the pseudo-register, indeed has REG_SAVED_VAL_EXP saved into it.
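(To make the encoding concrete, here's a small sketch of the CFI instruction being described, using opcode values from the DWARF spec; the actual emission in Cranelift goes through its DWARF-writing machinery rather than hand-written bytes.)

```rust
// "The value of pseudo-register 34 (RA_SIGN_STATE on AArch64) is the
// expression DW_OP_lit0", i.e. return addresses in this frame are unsigned.
const DW_CFA_VAL_EXPRESSION: u8 = 0x16;
const RA_SIGN_STATE_REG: u8 = 34; // small enough to be a single ULEB128 byte
const DW_OP_LIT0: u8 = 0x30;

fn ra_sign_state_is_zero() -> [u8; 4] {
    [
        DW_CFA_VAL_EXPRESSION, // DW_CFA_val_expression opcode
        RA_SIGN_STATE_REG,     // ULEB128 register number
        1,                     // ULEB128 length of the expression in bytes
        DW_OP_LIT0,            // the expression itself: push the literal 0
    ]
}
```

These four bytes end up inside the FDE, which is what matters for the alignment discussion further down.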
So all of that is fine, except that our disabling of pointer auth doesn't seem to be doing anything. It's still trying to run autia1716 when unwinding on addresses, and that's why I think things break in qemu but not on bare metal (assuming autia1716 is basically a no-op if pointer auth is turned off).
One suspicion of mine is that the guard for the autia1716 instruction is testing the lower bit of the offset field in libgcc. I think this may be a bug? The corresponding code in LLVM's libunwind seems to disagree: LLVM checks the value of the register itself (which is how I think things work on arm64).
This suspicion, however, still doesn't really explain why my PR is actually "breaking" things. My PR, AFAIK, doesn't actually change unwinding information at all, it just moves where it's generated and how it's managed. My best guess, though, is that for some reason the way we generated it before magically always produced an even value in libgcc's offset field for the register. I don't know how this could be the case, and it's something I wanted to investigate next.
I wanted to write all this up, though, to see if other arm folks can help out or perhaps forward this along; I'm specifically wondering if gcc is indeed buggy in this regard.
Also FWIW the server we're using has gcc 8.3.0, and looking at gcc's main branch it appears a number of changes have happened since 8.3.0. I don't think, though, that it necessarily fundamentally changes anything: it appears that the main condition for pointer auth is still based on the lowest bit of the offset field, which continues to confuse me in the sense that it seems to disagree with LLVM's libunwind.
CC @Anton Kirilov & @Sam Parker for potential insights here
Ok so on the offset bits, I think this may indeed be a bug in libgcc. My PR slightly changed how the FDE is encoded: the DW_OP_lit0 happened to always be at least 2-byte aligned before, and after my PR it's only 1-byte aligned. That, I think, explains why the offset field is different before/after my PR.
The reason for that is that the offset and exp fields are stored in a union. The DW_CFA_val_expression opcode that we use only sets the exp field, so the offset test that later guards the autia1716 is actually testing exp, the address of the expression for the register, not the actual register value itself.
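(A tiny standalone illustration of that aliasing, with made-up types rather than libgcc's actual structures: store a pointer in the exp side of a union, then test the low bit of the offset side, and the answer depends only on where the expression bytes happened to land.)

```rust
// Stand-in for libgcc's register-rule storage: `offset` and `exp` overlap.
union RuleValue {
    offset: usize,
    exp: *const u8, // where the DWARF expression (e.g. DW_OP_lit0) lives
}

fn main() {
    // Fake FDE fragment: DW_CFA_val_expression, reg 34, length 1, DW_OP_lit0,
    // plus one padding byte so we can point at two adjacent addresses.
    let fde = [0x16u8, 34, 1, 0x30, 0x00];
    for start in [3usize, 4] {
        let rule = RuleValue { exp: fde[start..].as_ptr() };
        // The shape of the buggy check: reading `offset` after writing `exp`
        // just reinterprets the pointer, so the result flips with alignment.
        let treated_as_signed = (unsafe { rule.offset } & 1) == 1;
        println!("expression at byte {start}: treated as signed = {treated_as_signed}");
    }
}
```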
I think that means everything makes sense to me now. In summary:
Since PAC is typically turned off for Linux, I think I'll probably just pass a -cpu argument to qemu on CI to fix this for now, but it'd probably be good to fix this in libgcc at some point as well.
Nice find! FWIW, it seems reasonable to me for now to 2-byte-align the lit0 opcode as a workaround; there's plenty of precedent (in the wider world if not in our codebase) for "this consumer has this weird bug so we generate metadata in this particular way".
If we disable PAC in the CI, we should file an issue to re-enable it at some point in the future, as I can imagine a future where it'll be common on Linux too and we want to make sure it's tested
hm true. This isn't super easy to align though, I think, b/c it's buried in the rest of the encoding of the FDE, which doesn't have lots of room for padding. I'll see if I can find "the leb" though perhaps
I opted to just use a different CPU in qemu for now; I couldn't actually figure out how to pad things easily such that the expressions always showed up on 2-byte-aligned boundaries.
Just to note, I'm not sure what microarchitecture you're running on, but Arm haven't yet released a core with pointer authentication - they will come when the Armv9 cores arrive. I was under the impression that nobody had a Linux CPU supporting it. The reference manual also doesn't state anything about it being a NOP if unsupported (it's not a hint), so I think we'd want to ensure it's not generated if not supported! Have you raised a bug against libgcc? I will let the internal GCC people know here.
Oh the failure was only happening in QEMU, which may be doing pointer auth stuff by default or something like that? (it failed with -cpu max, but I don't know what that actually corresponds to). Otherwise though I haven't raised the bug with libgcc; I'm relatively certain this is a bug there, but I'm also uncertain enough about all this that I'm not sure I'd personally be so comfortable raising an issue there.
@Alex Crichton I am a bit late to the party, but I intend to work on adding proper pointer authentication support to Cranelift and Wasmtime soon, so this is of interest to me - indeed, when I did the initial brainstorming of the necessary changes, unwinding seemed to be the main challenge (modifying the function prologues and epilogues should be pretty straightforward). I intend to start with posting an RFC before proceeding with any patches.
oh nice! TBH though I suspect unwinding will "just work" if we do pointer authentication stuff in each jit function; the unwinding bits so far have only been complicated insofar as we've had to tell the unwinder to ignore auth bits (and libgcc here isn't really reading our request for ignoring)
To be honest, I am not up to speed on DWARF and the finer details of how unwinding works, but I am working on it, and I am using PR #2960 as a stepping stone.
I'm happy to help out where I can
I've done enough debugging at this point that I probably know at least the basics or can help with further debugging
tbh unwinding issues are 90% of the time "holy cow this is nigh impossible to debug and it's blind trust between the compiler and the unwinder"
the actual bits and pieces are all bite-sized but nothing is tested until the whole system is smooshed together, which makes it much harder to see what's happening
@Anton Kirilov I'm happy to help as well -- some distance now from having written the original aarch64 ABI code and unwinding but I'm sure the right neurons are still up here somewhere
As for reporting the bug in libgcc - I could check with one of the GCC maintainers that are at Arm if that would be helpful (once I have a good enough understanding of the issue as well).
As for QEMU's -cpu max option - that should enable support for all optional features of the Arm architecture that have been implemented in QEMU.
As @Sam Parker mentioned, right now there is no physical CPU core that provides the same functionality.
@Alex Crichton I believe that the bare-metal server you are using is based on the Ampere eMAG, right? If that is the case, it supports just the base 64-bit Arm architecture, if I am not mistaken.
uh... do you have a command for how I can check?
lscpu, I suppose?
that yields:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Stepping: r3p1
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
NUMA node0 CPU(s): 0-79
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
not that I know what to do with this information
Ah, that's not the Ampere eMAG. Probably an Ampere Altra because I can't think of any other server CPU with 80 cores that uses an Arm core.
Anyway, there is no server CPU on the market right now that supports pointer authentication.
Indeed, it's an Altra
(we recently upgraded to this from the eMAG)
Nice, so this provides a good test environment for the atomic instructions.
So, Arm's approach has been to ensure that there is support for the architectural extensions in the most critical software components way before there is actual hardware on the market with the same support.
Pointer authentication probably has the highest chance of breaking something, but QEMU supports memory tagging and the Scalable Vector Extension as well, for example (and they are enabled with -cpu max).
Actually, memory tagging could potentially lead to breakages as well, so avoiding -cpu max seems to be the right call.
On the flip side, this is the only advantage of using QEMU in CI for AArch64 testing that I can think of - I already have ideas to use SVE instructions for some SIMD functionality (no, I am not talking about the Wasm flexible vectors proposal), and that would be perfectly testable because of QEMU.
FWIW we aren't proactively passing -cpu max to QEMU, it seems like that may just be the default?
Ah, OK, that's possible.
@Alex Crichton I had the chance to dig a bit deeper into this, and IMHO the line that sets the RA_SIGN_STATE value is unnecessary (and removing it should fix the issue). The DWARF specification states that the value is initialized to 0 (which is what we want), and then after a signing operation (e.g. PACIASP) the DW_CFA_AARCH64_negate_ra_state operation is used to change the value. In fact, that is what LLVM does when -mbranch-protection=standard is passed (check with readelf -wf).
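(For comparison with the DW_OP_lit0 encoding sketched earlier, a hedged sketch of what this scheme looks like at the CFI level; 0x2d is the vendor opcode AArch64 reuses for DW_CFA_AARCH64_negate_ra_state, which older tools may print as DW_CFA_GNU_window_save.)

```rust
// In this scheme RA_SIGN_STATE starts at 0 per the DWARF spec, and the
// compiler flips it with a single-byte CFI instruction right after the
// signing instruction (PACIASP) and again after the authenticating one,
// instead of forcing it to a constant with DW_CFA_val_expression.
const DW_CFA_AARCH64_NEGATE_RA_STATE: u8 = 0x2d;

fn toggle_ra_sign_state() -> [u8; 1] {
    [DW_CFA_AARCH64_NEGATE_RA_STATE]
}
```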
I noticed that actually @Benjamin Bouvier added that code, so he might have some additional insight.
Now, something fishy might still be going on in libgcc, but at least we could avoid the immediate issue.
IIRC it was added to get compat with libunwind and Apple's implementation of arm64 unwinding, so maybe Apple implemented a different default? If that's the case, though, it should be easy enough to make it Apple-specific and avoid setting it on Linux.
Thanks for looking into this though!
Confirming that I specifically did hit some pointer auth failures on the Mac M1, and I had to turn it off to make it work.
As Alex says, it's a lot of wild-guessing -- I was happy enough to stumble into a talk where Apple engineers talked about the pointer auth instructions and demo'd the code sequence that checked them... which was what I was seeing failing in LLDB.
OK, so this makes me wonder though - do you know how this works with unwinding right before the signing operation? AFAIK there are no instructions that call a subroutine and then leave a signed address in the return address register, i.e. LR/X30. The signing operation must happen with an explicit instruction such as PACIASP, which means that there is a small time window after a subroutine starts executing, but before the signing instruction completes, where the return address is in a clear form. If the default is that return addresses are signed, then the unwinder will be operating under the wrong assumption during that time frame. Now, I am not familiar with the details of Apple's ABI, but it sounds like a problem.
It's also wasteful in case of leaf functions that don't save LR on the stack.
Otherwise I agree that there is a simple fix - execute the code that sets the RA_SIGN_STATE value only if the current calling convention is CallConv::AppleAarch64 or CallConv::WasmtimeAppleAarch64; don't do anything otherwise (until we support pointer authentication properly, which I plan to tackle).
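(A hypothetical sketch of that fix, with a stand-in CallConv enum; Cranelift's real types and emission path differ, this just shows the proposed control flow.)

```rust
// Stand-in for Cranelift's calling-convention enum; only the relevant variants.
enum CallConv {
    AppleAarch64,
    WasmtimeAppleAarch64,
    Other,
}

// Only the Apple AArch64 conventions get the "RA_SIGN_STATE = 0" expression;
// everywhere else we emit nothing and rely on the DWARF default of 0.
fn maybe_disable_pointer_auth(call_conv: &CallConv, fde_instructions: &mut Vec<u8>) {
    match call_conv {
        CallConv::AppleAarch64 | CallConv::WasmtimeAppleAarch64 => {
            // DW_CFA_val_expression, reg 34 (RA_SIGN_STATE), 1-byte expr, DW_OP_lit0
            fde_instructions.extend_from_slice(&[0x16, 34, 1, 0x30]);
        }
        CallConv::Other => {}
    }
}
```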
execute the code that sets the RA_SIGN_STATE value only if the current calling convention is CallConv::AppleAarch64 or CallConv::WasmtimeAppleAarch64
Is this not what we're currently doing? Or is this predicated on all aarch64? In that case, yeah, that sounds like the right fix.
@Benjamin Bouvier Sorry, I was away on vacation, so I just read your message - the RA_SIGN_STATE value is set unconditionally on AArch64.
Ah right, my bad. Then yes, we could enable it only for aarch64-darwin for the time being if that helps, and disable it entirely when we get pointer authentication.
A bit of an update - we have an RFC proposal now that presents a couple of control flow integrity enhancements using pointer authentication. I have also published a prototype implementation of the proposal.
Really excited to see this moving forward; thanks Anton!
Another update - I have also published a prototype of the forward-edge CFI implementation, and I have added the fiber changes to the original prototype. I also discussed the issue that @Alex Crichton reported originally with one of Arm's GCC developers. The problem has been acknowledged, but at the moment it is not clear what the proper way forward is; for instance, the DWARF spec might be worded too liberally, and the intention might be that the RA_SIGN_STATE pseudo-register would be manipulated in much more specific ways.