wasmtime / Issue #1914 Register individual FDEs for musl ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #1914 Register individual FDEs for musl ...

Wasmtime GitHub notifications bot (Jun 23 2020 at 21:04):

alexcrichton commented on Issue #1914:

FWIW I did a mild amount of poking with this, but I'm not sure if this works for the purposes of backtrace because cargo test --target x86_64-unknown-linux-musl was still failing.

Also I think the logic may be a bit clearer inverted for glibc-specific bits perhaps? I'm not sure how prevalent libgcc's logic is, but the "write a 0 length at the end of the table" on line 94 above these changes I think is only relevant for glibc.

Wasmtime GitHub notifications bot (Jun 23 2020 at 21:16):

peterhuene commented on Issue #1914:

It is only relevant for libgcc's implementation, but doesn't really hurt to put it in the entire frame table and just skip it for the libunwind iteration.

You're right, it appears the tracing isn't working even with __register_frame appearing to properly register the individual FDEs with these changes.

Let me dig into that before we merge this.

Wasmtime GitHub notifications bot (Jun 23 2020 at 21:19):

peterhuene edited a comment on Issue #1914:

It is only relevant for libgcc's implementation, but doesn't really hurt to put it in the entire frame table and just skip it for the libunwind iteration.

You're right, it appears stack tracing isn't working even with __register_frame appearing to properly register the individual FDEs with these changes.

Let me dig into that before we merge this.

Wasmtime GitHub notifications bot (Jun 23 2020 at 23:10):

peterhuene commented on Issue #1914:

I've verified that each frame does get added to libunwind's DwarfFDECache:

For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.

What I don't see is any calls to libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE from _Unwind_Backtrace. I see calls to libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE which are presumably for the ELF .eh_frame data.

Still debugging.

Wasmtime GitHub notifications bot (Jun 23 2020 at 23:10):

peterhuene edited a comment on Issue #1914:

I've verified that each frame does get added to libunwind's DwarfFDECache:

For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.

What I don't see is any calls to libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE from _Unwind_Backtrace. I see calls to libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE which are presumably for the ELF .eh_frame data.

Still debugging.

Wasmtime GitHub notifications bot (Jun 23 2020 at 23:29):

peterhuene edited a comment on Issue #1914:

I've verified that each frame does get added to libunwind's DwarfFDECache:

For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.

~~What I don't see is any calls to libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE from _Unwind_Backtrace~~ Function was inlined hence no breakpoint. I see calls to libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE which are presumably for the ELF .eh_frame data.

Still debugging.

Wasmtime GitHub notifications bot (Jun 24 2020 at 01:11):

peterhuene commented on Issue #1914:

It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.

From lldb trace:
(lldb) bt
...
    frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt
    frame #18: 0x00007ffff7de0075  <--- faulting wasm frame with a simple unreachable trap
...
(lldb) di -a 0x0000000001af8ca0
wasmtime`__restore_rt:
    0x1af8ca0 <+0>: movq   $0xf, %rax
    0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.

However, libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister with pc=0x0000000001af8ca0 results in _unwindInfoMissing being set to true for the frame. Note that it did search our dynamically registered frames for this one, except it isn't one of ours so that search was fruitless.

There's a nop instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.

I think this may work from glib rather than musl because it appears an FDE is emitted for __restore_rt.

I'm leaning towards this being a musl bug where no FDE is emitted for __restore_rt.

Wasmtime GitHub notifications bot (Jun 24 2020 at 01:12):

peterhuene commented on Issue #1914:

I'd thus expect libunwind::_Unwind_Backtrace to be busted on musl for any walk from a signal handler. I'll see if I can create a small repro in C.

Wasmtime GitHub notifications bot (Jun 24 2020 at 01:13):

peterhuene edited a comment on Issue #1914:

It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.

From lldb trace:
(lldb) bt
...
    frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt
    frame #18: 0x00007ffff7de0075  <--- faulting wasm frame with a simple unreachable trap
...
(lldb) di -a 0x0000000001af8ca0
wasmtime`__restore_rt:
    0x1af8ca0 <+0>: movq   $0xf, %rax
    0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.

However, libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister with pc=0x0000000001af8ca0 results in _unwindInfoMissing being set to true for the frame. Note that it did search Wasmtime's dynamically registered JIT frames for this one, except it isn't one of ours so that search was fruitless.

There's a nop instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.

I think this may work from glib rather than musl because it appears an FDE is emitted for __restore_rt.

I'm leaning towards this being a musl bug where no FDE is emitted for __restore_rt.

Wasmtime GitHub notifications bot (Jun 24 2020 at 01:14):

peterhuene deleted a comment on Issue #1914:

I've verified that each frame does get added to libunwind's DwarfFDECache:

For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.

~~What I don't see is any calls to libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE from _Unwind_Backtrace~~ Function was inlined hence no breakpoint. I see calls to libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE which are presumably for the ELF .eh_frame data.

Still debugging.

Wasmtime GitHub notifications bot (Jun 24 2020 at 01:15):

peterhuene edited a comment on Issue #1914:

It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.

From lldb trace:
(lldb) bt
...
    frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt
    frame #18: 0x00007ffff7de0075  <--- faulting wasm frame with a simple unreachable trap
...
(lldb) di -a 0x0000000001af8ca0
wasmtime`__restore_rt:
    0x1af8ca0 <+0>: movq   $0xf, %rax
    0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.

However, libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister with pc=0x0000000001af8ca0 results in _unwindInfoMissing being set to true for the frame. Note that it did search Wasmtime's dynamically registered JIT frames for this one, except it isn't one of ours so that search was fruitless.

There's a nop instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.

I think this may work from glib rather than musl because it appears an FDE is emitted for __restore_rt.

I'm leaning towards this being a musl bug where no FDE is emitted for __restore_rt.

Wasmtime GitHub notifications bot (Jun 24 2020 at 06:26):

peterhuene commented on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma:

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt, but it doesn't start have a preceding nop instruction:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction contains PC 0000000000413599 (recall that the FDE search is PC-1).

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
    let p: *mut i32 = std::ptr::null_mut();
    *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}


fn main() {
    unsafe {
        signal::signal(signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler));
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

Wasmtime GitHub notifications bot (Jun 24 2020 at 06:29):

peterhuene edited a comment on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma:

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction contains PC 0000000000413599 (recall that the FDE search is PC-1).

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
    let p: *mut i32 = std::ptr::null_mut();
    *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}


fn main() {
    unsafe {
        signal::signal(signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler));
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

Wasmtime GitHub notifications bot (Jun 24 2020 at 06:38):

peterhuene edited a comment on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma (clang 10):

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction contains PC 0000000000413599 (recall that the FDE search is PC-1).

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
        let p: *mut i32 = std::ptr::null_mut();
        *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}

fn main() {
    unsafe {
        signal::signal(
            signal::Signal::SIGSEGV,
            signal::SigHandler::Handler(handler),
        );
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).

Wasmtime GitHub notifications bot (Jun 24 2020 at 06:50):

peterhuene edited a comment on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma (clang 10):

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction is the "nearest" function to PC 0000000000413599, which is the nop instruction (recall that the FDE search is PC-1).

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
        let p: *mut i32 = std::ptr::null_mut();
        *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}

fn main() {
    unsafe {
        signal::signal(
            signal::Signal::SIGSEGV,
            signal::SigHandler::Handler(handler),
        );
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).

Wasmtime GitHub notifications bot (Jun 24 2020 at 06:59):

peterhuene edited a comment on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma (clang 10):

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt, but the starting PC is not the nop instruction:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
        let p: *mut i32 = std::ptr::null_mut();
        *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}

fn main() {
    unsafe {
        signal::signal(
            signal::Signal::SIGSEGV,
            signal::SigHandler::Handler(handler),
        );
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).

Wasmtime GitHub notifications bot (Jun 24 2020 at 07:17):

peterhuene edited a comment on Issue #1914:

A simple test program compiled with clang test.c -static -lunwind -llzma (clang 10):

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt, but the starting PC is not the nop instruction:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction is the nearest non-hidden symbol.

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
        let p: *mut i32 = std::ptr::null_mut();
        *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}

fn main() {
    unsafe {
        signal::signal(
            signal::Signal::SIGSEGV,
            signal::SigHandler::Handler(handler),
        );
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).

Wasmtime GitHub notifications bot (Jun 24 2020 at 14:47):

alexcrichton commented on Issue #1914:

Oh wow awesome investigation!

In your C example you're compiling with clang but are you sure that's linking to musl? By default I'd imagine that links to glibc and I think -static works for simple-ish programs with glibc.

I tried your program in an alpine container and I can't seem to get good results, but I'm not the best at investigating this too. There's two libunwind implementations (libunwind-dev and llvm-libunwind-dev), and I'm pretty sure the one we ship with Rust is the latter (the LLVM one). AFAICT the Rust toolchain ships with a musl built with musl-cross-make and it may be pulling in a too-old version? In any case in Alpine the results I'm getting are:

libunwind - backtrace works

libunwind plus -static - backtrace works

llvm-libunwind - backtrace doesn't get past the first frame

llvm-libunwind plus -static - doesn't get past the first frame

The odd part is that with llvm-libunwind I can see a CIE for the backtrace function, so I don't know why it can't get past the first frame. I wonder if this is maybe just a bug with llvm-libunwind?

In any case I think this is pretty far afield from what changes we should make to wasmtime itself. It seems clear here that we should register individual FDEs rather than the whole set at once like we do with glibc. In terms of changes to this PR, it seems like we now have 2/3 platforms (macos, musl) which require individual FDEs so it seems like glibc is the outlier taking the whole set at once. In that case could the code be restructured to have the individual FDE be the general case and the glibc path is just an optimization for glibc? For example the zero-length push at the end is only needed for glibc, and I think the cfg_if! can probably be removed in favor of just a simple if cfg!(linux_gnu) { ... }

Wasmtime GitHub notifications bot (Jun 24 2020 at 14:48):

alexcrichton commented on Issue #1914:

Oh and to make matters worse I don't know how the source of __restore_rt uses eh_frame anywhere...

Wasmtime GitHub notifications bot (Jun 24 2020 at 20:16):

peterhuene commented on Issue #1914:

I can confirm that my test cases are linking against musl libc (it's an alpine image using a musl-libc targeted clang).

I've been comparing the implementations between llvm-libunwind and libunwind for stack walking. Both support having an S augmentation in the CIE to mark related FDE entries as "signal handler frames". Unfortunately, the CIE for __restore_rt has no augmentations in these test executables.

For llvm-libunwind, that appears to be the only way to determine signal handler frames. Even so, the walking algorithm doesn't seem to do anything for special for these frames (such as skipping over them). This appears to be the crux of Rust's inability to walk the stack from a signal handler.

On the other hand, libunwind, about 10 years ago, used to detect __restore_rt by checking the bytes of the trampoline directly. It would explicitly know how to unwind that frame and skip over it, hence giving more reliable walks. Today a comment claims it uses "kernel provided information for the trampoline", but I haven't yet figured out where it does that.

Regarding if cfg!(linux_gnu) { ... } (are there docs on this variable somewhere? my googlefu is weak), my concern is that there might be non-linux glibc platforms (e.g. kfreebsd, etc.) we'd be doing the wrong thing for in the future. But if you're comfortable with libunwind's __register_frame implementation being the default, I'm good with it too.

Wasmtime GitHub notifications bot (Jun 24 2020 at 20:18):

peterhuene edited a comment on Issue #1914:

I can confirm that my test cases are linking against musl libc (it's an alpine image using a musl-libc targeted clang).

I've been comparing the implementations between llvm-libunwind and libunwind for stack walking. Both support having an S augmentation in the CIE to mark related FDE entries as "signal handler frames". Unfortunately, the CIE for __restore_rt has no augmentations in these test executables.

For llvm-libunwind, that appears to be the only way to determine signal handler frames. Even so, the walking algorithm doesn't seem to do anything for special for these frames (such as skipping over them). This appears to be the crux of Rust's (or the C program linked against llvm-libunwind's) inability to walk the stack from a signal handler.

On the other hand, libunwind, about 10 years ago, used to detect __restore_rt by checking the bytes of the trampoline directly. It would explicitly know how to unwind that frame and skip over it, hence giving more reliable walks. Today a comment claims it uses "kernel provided information for the trampoline", but I haven't yet figured out where it does that.

Regarding if cfg!(linux_gnu) { ... } (are there docs on this variable somewhere? my googlefu is weak), my concern is that there might be non-linux glibc platforms (e.g. kfreebsd, etc.) we'd be doing the wrong thing for in the future. But if you're comfortable with libunwind's __register_frame implementation being the default, I'm good with it too.

Wasmtime GitHub notifications bot (Jun 24 2020 at 20:33):

bjorn3 commented on Issue #1914:

That comment was introduced in https://github.com/libunwind/libunwind/commit/dac2d001afb1fa7040ca7d8ae57032f684d7023e

Wasmtime GitHub notifications bot (Jun 24 2020 at 21:26):

alexcrichton commented on Issue #1914:

Oh geez that's quite a lot of history... Anyway definitely sounds like fuel for the fire of "not something we can fix in wasmtime". That being said this is also more fuel to the fire of "we should not rely on the native system unwinder to generate a backtrace on every single platform" since this mean wasm traps on musl will always be wrong.

@peterhuene were you able to reproduce how libunwind was working better than llvm-libunwind?

Oh also for cfg!(linux_gnu) that's not actually a thing, it's bad shorthand for cfg!(all(target_os = "linux", target_env = "gnu")). In general it just seems like frame registration is pretty likely to be different per platform and we just have it where 2/3 platforms (musl/macos out of those plus linux) require iteration rather than whole-chunk registration. Ideally we'd have a clause here which says "you're on an unknown platform, please test this and see what's right"

Wasmtime GitHub notifications bot (Jun 24 2020 at 22:08):

peterhuene commented on Issue #1914:

I'll update the PR to use cfg!(all(target_os = "linux", target_env = "gnu")) and limit the push of the ending record marker in that case too.

I agree we can declare the inability to walk the stack for musl targets using llvm-libunwind an external issue to Wasmtime. I think we should be somehow limiting our walks to sections of the stack that contain Wasm frames (as discussed in #1832) as it would both address this issue and #1845.

Wasmtime GitHub notifications bot (Jun 24 2020 at 23:27):

peterhuene commented on Issue #1914:

@alexcrichton PR updated.

Wasmtime GitHub notifications bot (Jun 24 2020 at 23:28):

peterhuene edited a comment on Issue #1914:

@alexcrichton PR updated.

I added a check for target_env = "" based on the docs. I can remove it if that's not correct.

Last updated: Oct 23 2024 at 20:03 UTC