Stream: wasmtime

Topic: arm64, pointer auth, unwinding


view this post on Zulip Alex Crichton (Aug 12 2021 at 19:00):

In the process of working on https://github.com/bytecodealliance/wasmtime/pull/3180 I'm hitting a bizarre bug that I'm hoping folks from ARM or someone else who's knowledgeable can help out with

This commit is a major refactoring of how unwind information is stored after compilation of a function has finished. Previously we would store the raw UnwindInfo as a result of compilation and this...

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:02):

As some basic background, we generate .eh_frame tables for JIT code and then register them with libgcc on Linux arm64 machines. This .eh_frame business is meant for unwinding, which is how we capture backtraces of both wasm/native together. My PR changes when/where the .eh_frame is generated. Previously the .eh_frame would be dynamically generated when we load a module into memory, but now I'm updating it such that it's embedded into the module itself, precomputed ahead of time.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:03):

The surprising behavior that I'm seeing is that my PR is segfaulting on CI, specifically in the emulated aarch64 builder. This aarch64 builder is running on an x86_64 host and is built with a pinned qemu 6.0.0 version. I cannot reproduce the segfault, however, on a bare-metal aarch64 host (the one that ARM donated to us). Interestingly enough, though, I can reproduce the segfault in qemu running on the arm64 host. That's allowed me to dig in further

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:04):

After what was probably too much gdb, I and others have reached the following conclusions:

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:05):

Contribute to gcc-mirror/gcc development by creating an account on GitHub.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:05):

I've confirmed in a debugger that the autia1716 instruction is indeed where native execution and emulated execution start to diverge.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:06):

Now all of this is somewhat confusing to me for a number of reasons. First off though my assumption is that the bare-metal linux server probably has pointer auth turned off entirely which is why things appear to work for me. I'm assuming that QEMU with -cpu max turns on pointer auth for one reason or another, which is why things fail. Consequently for cortex-a72 and below I'm assuming it's also turned off which is why things start passing.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:07):

We ran into pointer auth bits for aarch64 support on macOS. Cranelift added support for specifically disabling pointer auth for jit frames, however. To do this the FDE for jit frames specifies that the pseudo-register for pointer auth is specifically set to 0, using DW_OP_lit0

Standalone JIT-style runtime for WebAssembly, using Cranelift - wasmtime/systemv.rs at 2da1b9d3759bdaecac41d7bb2bc738c113a563b8 · bytecodealliance/wasmtime
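
As a rough illustration of the FDE instruction described above, here is a minimal sketch that encodes "set pseudo-register 34 to the value of the expression DW_OP_lit0" using the standard DWARF opcode values. The function names are hypothetical, and this is not Cranelift's actual code path, only the byte-level shape of such an instruction.

```python
# Standard DWARF constants.
DW_CFA_VAL_EXPRESSION = 0x16
DW_OP_LIT0 = 0x30
# AArch64 pseudo-register for pointer auth (register 34 in the thread).
RA_SIGN_STATE = 34


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def clear_ra_sign_state() -> bytes:
    """Hypothetical helper: DW_CFA_val_expression(reg 34, [DW_OP_lit0])."""
    expr = bytes([DW_OP_LIT0])
    return (
        bytes([DW_CFA_VAL_EXPRESSION])
        + uleb128(RA_SIGN_STATE)
        + uleb128(len(expr))
        + expr
    )


if __name__ == "__main__":
    print(clear_ra_sign_state().hex(" "))
```

The expression byte is the last of the four bytes, so the address the unwinder records for it depends on where the instruction happens to land inside the FDE.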

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:09):

I can confirm that libgcc is indeed decoding this FDE opcode correctly and it's registering that. Specifically this code gets hit and register 34, the pseudo-register, indeed has REG_SAVED_VAL_EXP saved into it

Contribute to gcc-mirror/gcc development by creating an account on GitHub.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:10):

So all of that is fine except that our disabling of pointer auth doesn't seem to be doing anything. It's still trying to run autia1716 on addresses when unwinding, and that's why I think it's failing in qemu but not on bare metal (assuming autia1716 is basically a noop if pointer auth is turned off)

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:11):

One suspicion of mine is that the guard for the autia1716 instruction is testing the lower bit of the offset field in libgcc. I think this may be a bug? Corresponding code in LLVM's libunwind seems to disagree with this where LLVM checks the value of the register itself (which is how I think things work on arm64)

Contribute to gcc-mirror/gcc development by creating an account on GitHub.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. Note: the repository does not accept github pull requests at this moment. Please submit your patches at...

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:13):

This suspicion, however, still doesn't really explain why my PR is actually "breaking" things. My PR, AFAIK, doesn't actually change unwinding information at all, it just moves where it's generated and how it's managed. My best guess, though, is that for some reason the way we generated it before it magically always has an even value for libgcc's offset field for the register. I don't know how this could be the case, and is something I wanted to investigate next.

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:14):

I wanted to write all this up though to see if other arm folks can help out or forward this along perhaps, I'm specifically wondering if gcc is indeed buggy in this regard.

Also FWIW the server we're using has gcc 8.3.0, and looking at gcc's main branch it appears a number of changes have happened since 8.3.0. I don't think, though, that it necessarily fundamentally changes anything, it appears that the main condition for pointer auth is still based on the lowest bit of the offset field, which continues to confuse me in the sense that it seems to disagree with LLVM's libunwind

view this post on Zulip Till Schneidereit (Aug 12 2021 at 19:14):

CC @Anton Kirilov & @Sam Parker for potential insights here

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:25):

Ok so on the offset bits, I think this may indeed be a bug in libgcc. My PR slightly changed how the FDE is encoded: before, the DW_OP_lit0 happened to always be at least 2-byte aligned, and after my PR it's only 1-byte aligned. That I think explains why the offset field is different before/after my PR.

The reason for that is that the offset and exp fields are stored in a union. The DW_CFA_val_expression opcode that we use only sets the exp field, so the offset test later when executing autia1716 is actually testing exp, the address of the expression for the register, not the actual register value itself
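
The aliasing described above can be modeled with a small sketch. `RegisterSlot` is a hypothetical stand-in for the unwinder's per-register slot (it is not libgcc's real type), and the two check functions only mimic the two behaviors discussed in the thread: testing the low bit of `offset` versus testing the register's evaluated value.

```python
import ctypes


class RegisterSlot(ctypes.Union):
    """Hypothetical per-register slot: both fields share the same bytes."""

    _fields_ = [
        ("offset", ctypes.c_ssize_t),  # meaningful for "saved at offset" rules
        ("exp", ctypes.c_void_p),      # meaningful for expression rules
    ]


def looks_signed_via_offset(expr_addr: int) -> bool:
    """Mimic the suspected check: record an expression pointer, test offset."""
    slot = RegisterSlot()
    slot.exp = expr_addr          # what a val_expression rule would store
    return bool(slot.offset & 1)  # same bytes, reinterpreted as an integer


def looks_signed_via_value(register_value: int) -> bool:
    """Mimic the alternative: test the evaluated register value itself."""
    return bool(register_value & 1)


if __name__ == "__main__":
    # Made-up addresses: the same expression at an even and an odd address.
    for addr in (0x1000, 0x1001):
        print(hex(addr), looks_signed_via_offset(addr), looks_signed_via_value(0))
```

Under this model the "signed" decision follows the parity of the expression's address, which would match the observation that a 2-byte-aligned DW_OP_lit0 happened to work while a 1-byte-aligned one did not.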

I think that means everything makes sense to me now. In summary:

Since PAC is typically turned off for Linux I think I'll probably just pass a -cpu argument to qemu on CI to fix this for now, but it'd probably be good to fix this in libgcc at some point as well

Contribute to gcc-mirror/gcc development by creating an account on GitHub.

view this post on Zulip Chris Fallin (Aug 12 2021 at 19:27):

Nice find! FWIW, it seems reasonable to me for now to 2-byte-align the lit0 opcode as a workaround; there's plenty of precedent (in the wider world if not in our codebase) for "this consumer has this weird bug so we generate metadata in this particular way"

view this post on Zulip Chris Fallin (Aug 12 2021 at 19:28):

If we disable PAC in the CI, we should file an issue to re-enable it at some point in the future, as I can imagine a future where it'll be common on Linux too and we want to make sure it's tested

view this post on Zulip Alex Crichton (Aug 12 2021 at 19:31):

hm true. This isn't super easy to align though I think b/c it's buried in the rest of the encoding of the FDE which doesn't have lots of room for padding I think. I'll see if I can find "the leb" though perhaps

view this post on Zulip Alex Crichton (Aug 12 2021 at 20:17):

I opted to just use a different CPU in qemu for now, I couldn't actually figure out how to pad things easily such that the expressions always showed up on 2-byte aligned boundaries

view this post on Zulip Sam Parker (Aug 16 2021 at 08:47):

Just to note, I'm not sure what microarchitecture you're running on but, Arm haven't yet released a core with pointer authentication - they will come when the Armv9 cores arrive. I was under the impression that nobody had a Linux CPU supporting it. The reference manual also doesn't state anything about it being a NOP if unsupported (it's not a hint) so I think we'd want to ensure it's not generated if not supported! Have you raised a bug against libgcc? I will let the internal GCC people know here.

view this post on Zulip Alex Crichton (Aug 16 2021 at 15:03):

Oh the failure was only happening in QEMU, which may be doing pointer auth stuff by default or something like that? (it failed with -cpu max, but I don't know what that actually corresponds to). Otherwise though I haven't raised the bug with libgcc, I'm relatively certain this is a bug there but I'm also uncertain enough about all this that I'm not sure I'd personally be so comfortable raising an issue there.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:20):

@Alex Crichton I am a bit late to the party, but I intend to work on adding proper pointer authentication support to Cranelift and Wasmtime soon, so this is of interest to me - indeed, when I did the initial brainstorming of the necessary changes, unwinding seemed to be the main challenge (modifying the function prologues and epilogues should be pretty straightforward). I intend to start with posting a RFC before proceeding with any patches.

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:21):

oh nice! TBH though I suspect unwinding will "just work" if we do pointer authentication stuff in each jit function, the unwinding bits so far have only been complicated insofar that we've had to tell the unwinder to ignore auth bits (and libgcc here isn't really reading our request for ignoring)

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:25):

To be honest, I am not up to speed on DWARF and the finer details of how unwinding works, but I am working on it, and I am using PR #2960 as a stepping stone.

Leaf functions that do not use the stack (e.g. do not clobber any callee-saved registers) do not need a frame record; this has been discussed in issue #1148. I am not familiar with the ABIs of othe...

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:29):

I'm happy to help out where I can

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:29):

I've done enough debugging at this point that I probably know at least the basics, or can help with further debugging

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:30):

tbh unwinding issues are 90% of the time "holy cow this is nigh impossible to debug and it's blind trust between the compiler and the unwinder"

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:30):

the actual bits and pieces are all bite-sized but nothing is tested until the whole system is smooshed together, which makes it much harder to see what's happening

view this post on Zulip Chris Fallin (Aug 16 2021 at 16:31):

@Anton Kirilov I'm happy to help as well -- some distance now from having written the original aarch64 ABI code and unwinding but I'm sure the right neurons are still up here somewhere

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:31):

As for reporting the bug in libgcc - I could check with one of the GCC maintainers that are at Arm if that would be helpful (once I have a good enough understanding of the issue as well).

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:34):

As for QEMU's -cpu max option - that should enable support for all optional features of the Arm architecture that have been implemented in QEMU.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:35):

As @Sam Parker mentioned, right now there is no physical CPU core that provides the same functionality.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:37):

@Alex Crichton I believe that the bare-metal server you are using is based on the Ampere eMAG, right? If that is the case, it supports just the base 64-bit Arm architecture, if I am not mistaken.

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:37):

uh... do you have a command for how I can check?

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:38):

lscpu, I suppose?

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:39):

that yields:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  1
Core(s) per socket:  80
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Stepping:            r3p1
CPU max MHz:         3000.0000
CPU min MHz:         1000.0000
BogoMIPS:            50.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
NUMA node0 CPU(s):   0-79
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
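
One way to read the Flags line, as a minimal sketch: to my knowledge Linux reports pointer authentication through the `paca` and `pacg` feature flags, and the Armv8.1 atomic instructions through `atomics`. The helper names below are hypothetical, and the flags string is copied from the output above.

```python
# Flags line copied from the lscpu output above.
LSCPU_FLAGS = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp "
    "cpuid asimdrdm lrcpc dcpop asimddp ssbs"
)


def has_feature(flags: str, name: str) -> bool:
    """Return True if `name` appears as a whole word in the flags string."""
    return name in flags.split()


def has_pointer_auth(flags: str) -> bool:
    """Assumes pointer auth is advertised as `paca` and/or `pacg`."""
    return has_feature(flags, "paca") or has_feature(flags, "pacg")


if __name__ == "__main__":
    print("pointer auth:", has_pointer_auth(LSCPU_FLAGS))
    print("atomics:", has_feature(LSCPU_FLAGS, "atomics"))
```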

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:39):

not that I know what to do with this information

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:41):

Ah, that's not the Ampere eMAG. Probably an Ampere Altra because I can't think of any other server CPU with 80 cores that uses an Arm core.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:42):

Anyway, there is no server CPU on the market right now that supports pointer authentication.

view this post on Zulip Chris Fallin (Aug 16 2021 at 16:42):

Indeed, it's an Altra

view this post on Zulip Chris Fallin (Aug 16 2021 at 16:42):

(we recently upgraded to this from the eMAG)

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:43):

Nice, so this provides a good test environment for the atomic instructions.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:46):

So, Arm's approach has been to ensure that there is support for the architectural extensions in the most critical software components way before there is actual hardware on the market with the same support.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:47):

Pointer authentication probably has the highest chance of breaking something, but QEMU supports memory tagging and the Scalable Vector Extension as well, for example (and they are enabled with -cpu max).

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:48):

Actually, memory tagging could potentially lead to breakages as well, so avoiding -cpu max seems to be the right call.

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:52):

On the flip side, this is the only advantage of using QEMU in CI for AArch64 testing that I can think of - I already have ideas to use SVE instructions for some SIMD functionality (no, I am not talking about the Wasm flexible vectors proposal), and that would be perfectly testable because of QEMU.

view this post on Zulip Alex Crichton (Aug 16 2021 at 16:53):

FWIW we aren't proactively passing -cpu max to QEMU, it seems like that may just be the default?

view this post on Zulip Anton Kirilov (Aug 16 2021 at 16:54):

Ah, OK, that's possible.

view this post on Zulip Anton Kirilov (Aug 26 2021 at 12:40):

@Alex Crichton I had the chance to dig a bit deeper into this, and IMHO the line that sets the RA_SIGN_STATE value is unnecessary (and removing it should fix the issue). The DWARF specification states that the value is initialized to 0 (which is what we want), and then after a signing operation (e.g. PACIASP) the DW_CFA_AARCH64_negate_ra_state operation is used to change the value. In fact, that is what LLVM does when -mbranch-protection=standard is passed (check with readelf -wf).
I noticed that actually @Benjamin Bouvier added that code, so he might have some additional insight.

Standalone JIT-style runtime for WebAssembly, using Cranelift - wasmtime/abi.rs at 2da1b9d3759bdaecac41d7bb2bc738c113a563b8 · bytecodealliance/wasmtime
Application Binary Interface for the Arm® Architecture - abi-aa/aadwarf64.rst at 2bcab1e3b22d55170c563c3c7940134089176746 · ARM-software/abi-aa
int foo(void) { void bar(void); bar(); return 42; }
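
The behavior described above (RA_SIGN_STATE starts at 0 and is flipped by DW_CFA_AARCH64_negate_ra_state) can be sketched with a toy CFA interpreter. The tuple-based program format, the function name, and the instruction offsets are all made up for illustration; only the "starts at 0, negate toggles it" rule comes from the discussion.

```python
def ra_sign_state_at(program, pc_offset):
    """Return RA_SIGN_STATE (0 or 1) at `pc_offset` bytes into a function.

    `program` is a hypothetical list of ("advance_loc", n) and
    ("negate_ra_state",) tuples, applied in order.
    """
    loc = 0
    state = 0  # initial value: return address not signed
    for op, *args in program:
        if op == "advance_loc":
            loc += args[0]
            if loc > pc_offset:
                break  # later rows apply only to later instructions
        elif op == "negate_ra_state":
            state ^= 1  # flip between "not signed" and "signed"
        else:
            raise ValueError(f"unknown op: {op}")
    return state


# Made-up layout: a signing instruction (e.g. PACIASP) at offset 0 and the
# matching authentication instruction at offset 20, each 4 bytes long.
PROGRAM = [
    ("advance_loc", 4),    # row starting after the signing instruction
    ("negate_ra_state",),
    ("advance_loc", 20),   # row starting after the authenticating instruction
    ("negate_ra_state",),
]

if __name__ == "__main__":
    for pc in (0, 4, 23, 24):
        print(pc, ra_sign_state_at(PROGRAM, pc))
```

At offset 0 the return address is still in clear form, so the state is 0 there even in a function that signs it one instruction later.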

view this post on Zulip Anton Kirilov (Aug 26 2021 at 12:46):

Now, something fishy might still be going on in libgcc, but at least we could avoid the immediate issue.

view this post on Zulip Alex Crichton (Aug 26 2021 at 14:29):

IIRC it was added to get compat with libunwind and Apple's implementation of arm64 unwinding, so maybe apple implemented a different default? If that's the case though it should be easy enough to have it apple-specific and avoid setting it on Linux

view this post on Zulip Alex Crichton (Aug 26 2021 at 14:29):

Thanks for looking into this though!

view this post on Zulip Benjamin Bouvier (Aug 26 2021 at 15:12):

Confirming that i specifically did hit some pointer auth failures on the Mac M1, and i had to turn it off to make it work.

As Alex says, it's a lot of wild-guessing -- i was happy enough to stumble into a talk where Apple engineers talked about the pointer auth instructions, where they demo'd the code sequence that checked them, ... which was what i was seeing fail in LLDB.

view this post on Zulip Anton Kirilov (Aug 26 2021 at 16:00):

OK, so this makes me wonder though - do you know how this works with unwinding right before the signing operation? AFAIK there are no instructions that call a subroutine and then leave a signed address in the return address register, i.e. LR/X30. The signing operation must happen with an explicit instruction such as PACIASP, which means that there is a small time window after a subroutine starts executing, but before the signing instruction completes, where the return address is in a clear form. If the default is that return addresses are signed, then the unwinder will be operating under the wrong assumption during that time frame. Now, I am not familiar with the details of Apple's ABI, but it sounds like a problem.

view this post on Zulip Anton Kirilov (Aug 26 2021 at 16:07):

It's also wasteful in case of leaf functions that don't save LR on the stack.

view this post on Zulip Anton Kirilov (Aug 26 2021 at 16:13):

Otherwise I agree that there is a simple fix - execute the code that sets the RA_SIGN_STATE value only if the current calling convention is CallConv::AppleAarch64 or CallConv::WasmtimeAppleAarch64; don't do anything otherwise (until we support pointer authentication properly, which I plan to tackle).
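
The proposed gating could look roughly like the sketch below. It is written in Python rather than Cranelift's Rust, the `CallConv` enum here is a hypothetical subset named after the variants mentioned above, and `should_emit_ra_sign_state` is an invented helper, not an existing function.

```python
from enum import Enum, auto


class CallConv(Enum):
    """Hypothetical subset of calling conventions, for illustration only."""

    SYSTEM_V = auto()
    APPLE_AARCH64 = auto()
    WASMTIME_APPLE_AARCH64 = auto()


# Conventions for which the unwind info would explicitly clear RA_SIGN_STATE.
_APPLE_CONVENTIONS = frozenset(
    {CallConv.APPLE_AARCH64, CallConv.WASMTIME_APPLE_AARCH64}
)


def should_emit_ra_sign_state(call_conv: CallConv) -> bool:
    """Emit the explicit RA_SIGN_STATE rule only for Apple conventions."""
    return call_conv in _APPLE_CONVENTIONS


if __name__ == "__main__":
    for conv in CallConv:
        print(conv.name, should_emit_ra_sign_state(conv))
```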

view this post on Zulip Benjamin Bouvier (Sep 01 2021 at 09:39):

execute the code that sets the RA_SIGN_STATE value only if the current calling convention is CallConv::AppleAarch64 or CallConv::WasmtimeAppleAarch64

Is this not what we're currently doing? Or is this predicated on all aarch64? In that case, yeah, that sounds like the right fix.

view this post on Zulip Anton Kirilov (Sep 06 2021 at 09:15):

@Benjamin Bouvier Sorry, I was away on vacation, so I just read your message - the RA_SIGN_STATE value is set unconditionally on AArch64.

view this post on Zulip Benjamin Bouvier (Sep 06 2021 at 12:43):

Ah right, my bad. Then yes, we could enable it only for aarch64-darwin for the time being if that helps, and disable it entirely when we get pointer authentication.

view this post on Zulip Anton Kirilov (Dec 16 2021 at 17:46):

A bit of an update - we have a RFC proposal now that presents a couple of control flow integrity enhancements using pointer authentication. I have also published a prototype implementation of the proposal.

This RFC proposes to improve control flow integrity for compiled WebAssembly code by utilizing two technologies from the Arm instruction set architecture - Pointer Authentication and Branch Target ...
This pull request is meant to illustrate the RFC proposal to improve control flow integrity for compiled WebAssembly code by using the Pointer Authentication extension to the Arm instruction set ar...

view this post on Zulip Chris Fallin (Dec 16 2021 at 18:11):

Really excited to see this moving forward; thanks Anton!

view this post on Zulip Anton Kirilov (Jan 19 2022 at 11:21):

Another update - I have also published a prototype of the forward-edge CFI implementation, and I have added the fiber changes to the original prototype. I also discussed the issue that @Alex Crichton reported originally with one of Arm's GCC developers. The problem has been acknowledged, but at the moment it is not clear what the proper way forward is; for instance, the DWARF spec might be worded too liberally, and the intention might be that the RA_SIGN_STATE pseudo-register would be manipulated in much more specific ways.

This pull request is meant to illustrate the RFC proposal to improve control flow integrity for compiled WebAssembly code by using the Branch Target Identification extension to the Arm instruction ...

Last updated: Nov 22 2024 at 16:03 UTC