alexcrichton commented on Issue #1914:
FWIW I did a mild amount of poking with this, but I'm not sure if this works for the purposes of backtrace because
cargo test --target x86_64-unknown-linux-musl
was still failing.Also I think the logic may be a bit clearer inverted for glibc-specific bits perhaps? I'm not sure how prevalent libgcc's logic is, but the "write a 0 length at the end of the table" on line 94 above these changes I think is only relevant for glibc.
peterhuene commented on Issue #1914:
It is only relevant for libgcc's implementation, but doesn't really hurt to put it in the entire frame table and just skip it for the libunwind iteration.
You're right, it appears the tracing isn't working even with
__register_frame
appearing to properly register the individual FDEs with these changes.Let me dig into that before we merge this.
peterhuene edited a comment on Issue #1914:
It is only relevant for libgcc's implementation, but doesn't really hurt to put it in the entire frame table and just skip it for the libunwind iteration.
You're right, it appears stack tracing isn't working even with
__register_frame
appearing to properly register the individual FDEs with these changes.Let me dig into that before we merge this.
peterhuene commented on Issue #1914:
I've verified that each frame does get added to libunwind's DwarfFDECache:
For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.
What I don't see is any calls to
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE
from_Unwind_Backtrace
. I see calls tolibunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
which are presumably for the ELF .eh_frame data.Still debugging.
peterhuene edited a comment on Issue #1914:
I've verified that each frame does get added to libunwind's
DwarfFDECache
:For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.
What I don't see is any calls to
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE
from_Unwind_Backtrace
. I see calls tolibunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
which are presumably for the ELF .eh_frame data.Still debugging.
peterhuene edited a comment on Issue #1914:
I've verified that each frame does get added to libunwind's
DwarfFDECache
:For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.
What I don't see is any calls toFunction was inlined hence no breakpoint. I see calls tolibunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE
from_Unwind_Backtrace
libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
which are presumably for the ELF .eh_frame data.Still debugging.
peterhuene commented on Issue #1914:
It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.
From lldb trace:
(lldb) bt ... frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt frame #18: 0x00007ffff7de0075 <--- faulting wasm frame with a simple unreachable trap ... (lldb) di -a 0x0000000001af8ca0 wasmtime`__restore_rt: 0x1af8ca0 <+0>: movq $0xf, %rax 0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.
However,
libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister
withpc=0x0000000001af8ca0
results in _unwindInfoMissing being set to true for the frame. Note that it did search our dynamically registered frames for this one, except it isn't one of ours so that search was fruitless.There's a
nop
instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.I think this may work from glib rather than musl because it appears an FDE is emitted for
__restore_rt
.I'm leaning towards this being a musl bug where no FDE is emitted for
__restore_rt
.
peterhuene commented on Issue #1914:
I'd thus expect
libunwind::_Unwind_Backtrace
to be busted on musl for any walk from a signal handler. I'll see if I can create a small repro in C.
peterhuene edited a comment on Issue #1914:
It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.
From lldb trace:
(lldb) bt ... frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt frame #18: 0x00007ffff7de0075 <--- faulting wasm frame with a simple unreachable trap ... (lldb) di -a 0x0000000001af8ca0 wasmtime`__restore_rt: 0x1af8ca0 <+0>: movq $0xf, %rax 0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.
However,
libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister
withpc=0x0000000001af8ca0
results in _unwindInfoMissing being set to true for the frame. Note that it did search Wasmtime's dynamically registered JIT frames for this one, except it isn't one of ours so that search was fruitless.There's a
nop
instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.I think this may work from glib rather than musl because it appears an FDE is emitted for
__restore_rt
.I'm leaning towards this being a musl bug where no FDE is emitted for
__restore_rt
.
peterhuene deleted a comment on Issue #1914:
I've verified that each frame does get added to libunwind's
DwarfFDECache
:For example:
libunwind::DwarfFDECache<libunwind::LocalAddressSpace>::add(mh=37745720, ip_start=140737351909383, ip_end=140737351909406, fde=37745720)
Those are the valid ranges of one of our function JITs and the FDE pointer is correct.
What I don't see is any calls toFunction was inlined hence no breakpoint. I see calls tolibunwind::DwarfFDECache<libunwind::LocalAddressSpace>::findFDE
from_Unwind_Backtrace
libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
which are presumably for the ELF .eh_frame data.Still debugging.
peterhuene edited a comment on Issue #1914:
It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.
From lldb trace:
(lldb) bt ... frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt frame #18: 0x00007ffff7de0075 <--- faulting wasm frame with a simple unreachable trap ... (lldb) di -a 0x0000000001af8ca0 wasmtime`__restore_rt: 0x1af8ca0 <+0>: movq $0xf, %rax 0x1af8ca7 <+7>: syscall
lldb appears to special case the signal handler function names and hence can walk past this frame.
However,
libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister
withpc=0x0000000001af8ca0
results in _unwindInfoMissing being set to true for the frame. Note that it did search Wasmtime's dynamically registered JIT frames for this one, except it isn't one of ours so that search was fruitless.There's a
nop
instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.I think this may work from glib rather than musl because it appears an FDE is emitted for
__restore_rt
.I'm leaning towards this being a musl bug where no FDE is emitted for
__restore_rt
.
peterhuene commented on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
:#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
, but it doesn't start have a precedingnop
instruction:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
contains PC0000000000413599
(recall that the FDE search is PC-1).$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal(signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler)); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
peterhuene edited a comment on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
:#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
contains PC0000000000413599
(recall that the FDE search is PC-1).$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal(signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler)); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
peterhuene edited a comment on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
(clang 10):#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
contains PC0000000000413599
(recall that the FDE search is PC-1).$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal( signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler), ); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).
peterhuene edited a comment on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
(clang 10):#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
is the "nearest" function to PC0000000000413599
, which is thenop
instruction (recall that the FDE search is PC-1).$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal( signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler), ); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).
peterhuene edited a comment on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
(clang 10):#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
, but the starting PC is not thenop
instruction:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
is the "nearest" function to PC0000000000413599
, which is thenop
instruction (recall that the FDE search is PC-1).$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal( signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler), ); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).
peterhuene edited a comment on Issue #1914:
A simple test program compiled with
clang test.c -static -lunwind -llzma
(clang 10):#include <stdio.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #define UNW_LOCAL_ONLY #include <libunwind.h> void backtrace() { unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; unw_word_t offset; char name[256]; unw_getcontext(&uc); unw_init_local(&cursor, &uc); while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_get_proc_name(&cursor, name, sizeof(name), &offset); printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset); } } void handler(int sig) { backtrace(); exit(1); } void crash() { int* p = (int*) 0; *p = 1; } void bar() { crash(); } void foo() { bar(); } int main() { signal(SIGSEGV, handler); foo(); }
Interestingly, this does create an FDE for
__restore_rt
, but the starting PC is not thenop
instruction:$ ldd a.out /lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program $ readelf --debug-dump=frames -a a.out | grep 000000000041359a 471: 000000000041359a 9 FUNC GLOBAL HIDDEN 2 __restore_rt 00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3
As a result, this program is able to walk the stack, albeit it showing
sigaction+27
for__restore_rt
assigaction
is the nearest non-hidden symbol.$ ./a.out ip = 401270, sp = 7ffc962fbba0: handler + 10 ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27 ip = 4012a9, sp = 7ffc962fbbd0: bar + 9 ip = 4012b9, sp = 7ffc962fc180: foo + 9 ip = 4012e2, sp = 7ffc962fc190: main + 22 ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27
A Rust port of the program:
use backtrace::Backtrace; use nix::sys::signal; extern "C" fn handler(_: nix::libc::c_int) { let bt = Backtrace::new(); println!("{:?}", bt); std::process::exit(1); } fn crash() { unsafe { let p: *mut i32 = std::ptr::null_mut(); *p = 1; } } fn bar() { crash(); } fn foo() { bar(); } fn main() { unsafe { signal::signal( signal::Signal::SIGSEGV, signal::SigHandler::Handler(handler), ); } foo(); }
However, no FDE was emitted for
__restore_rt
:$ ldd test /lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program $ readelf --debug-dump=frames -a test | grep 0000000000507047 6473: 0000000000507047 9 FUNC GLOBAL HIDDEN 2 __restore_rt
As a result, it only prints the top frame:
$ ./test 0: test::handler at /test/src/main.rs:5
@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).
alexcrichton commented on Issue #1914:
Oh wow awesome investigation!
In your C example you're compiling with
clang
but are you sure that's linking to musl? By default I'd imagine that links to glibc and I think-static
works for simple-ish programs with glibc.I tried your program in an alpine container and I can't seem to get good results, but I'm not the best at investigating this too. There's two libunwind implementations (
libunwind-dev
andllvm-libunwind-dev
), and I'm pretty sure the one we ship with Rust is the latter (the LLVM one). AFAICT the Rust toolchain ships with a musl built with musl-cross-make and it may be pulling in a too-old version? In any case in Alpine the results I'm getting are:
libunwind
- backtrace workslibunwind
plus-static
- backtrace worksllvm-libunwind
- backtrace doesn't get past the first framellvm-libunwind
plus-static
- doesn't get past the first frameThe odd part is that with llvm-libunwind I can see a CIE for the
backtrace
function, so I don't know why it can't get past the first frame. I wonder if this is maybe just a bug with llvm-libunwind?In any case I think this is pretty far afield from what changes we should make to wasmtime itself. It seems clear here that we should register individual FDEs rather than the whole set at once like we do with glibc. In terms of changes to this PR, it seems like we now have 2/3 platforms (macos, musl) which require individual FDEs so it seems like glibc is the outlier taking the whole set at once. In that case could the code be restructured to have the individual FDE be the general case and the glibc path is just an optimization for glibc? For example the zero-length push at the end is only needed for glibc, and I think the
cfg_if!
can probably be removed in favor of just a simpleif cfg!(linux_gnu) { ... }
alexcrichton commented on Issue #1914:
Oh and to make matters worse I don't know how the source of
__restore_rt
useseh_frame
anywhere...
peterhuene commented on Issue #1914:
I can confirm that my test cases are linking against musl libc (it's an alpine image using a musl-libc targeted clang).
I've been comparing the implementations between
llvm-libunwind
andlibunwind
for stack walking. Both support having anS
augmentation in the CIE to mark related FDE entries as "signal handler frames". Unfortunately, the CIE for__restore_rt
has no augmentations in these test executables.For
llvm-libunwind
, that appears to be the only way to determine signal handler frames. Even so, the walking algorithm doesn't seem to do anything for special for these frames (such as skipping over them). This appears to be the crux of Rust's inability to walk the stack from a signal handler.On the other hand,
libunwind
, about 10 years ago, used to detect__restore_rt
by checking the bytes of the trampoline directly. It would explicitly know how to unwind that frame and skip over it, hence giving more reliable walks. Today a comment claims it uses "kernel provided information for the trampoline", but I haven't yet figured out where it does that.Regarding
if cfg!(linux_gnu) { ... }
(are there docs on this variable somewhere? my googlefu is weak), my concern is that there might be non-linux glibc platforms (e.g. kfreebsd, etc.) we'd be doing the wrong thing for in the future. But if you're comfortable with libunwind's__register_frame
implementation being the default, I'm good with it too.
peterhuene edited a comment on Issue #1914:
I can confirm that my test cases are linking against musl libc (it's an alpine image using a musl-libc targeted clang).
I've been comparing the implementations between
llvm-libunwind
andlibunwind
for stack walking. Both support having anS
augmentation in the CIE to mark related FDE entries as "signal handler frames". Unfortunately, the CIE for__restore_rt
has no augmentations in these test executables.For
llvm-libunwind
, that appears to be the only way to determine signal handler frames. Even so, the walking algorithm doesn't seem to do anything for special for these frames (such as skipping over them). This appears to be the crux of Rust's (or the C program linked againstllvm-libunwind
's) inability to walk the stack from a signal handler.On the other hand,
libunwind
, about 10 years ago, used to detect__restore_rt
by checking the bytes of the trampoline directly. It would explicitly know how to unwind that frame and skip over it, hence giving more reliable walks. Today a comment claims it uses "kernel provided information for the trampoline", but I haven't yet figured out where it does that.Regarding
if cfg!(linux_gnu) { ... }
(are there docs on this variable somewhere? my googlefu is weak), my concern is that there might be non-linux glibc platforms (e.g. kfreebsd, etc.) we'd be doing the wrong thing for in the future. But if you're comfortable with libunwind's__register_frame
implementation being the default, I'm good with it too.
bjorn3 commented on Issue #1914:
That comment was introduced in https://github.com/libunwind/libunwind/commit/dac2d001afb1fa7040ca7d8ae57032f684d7023e
alexcrichton commented on Issue #1914:
Oh geez that's quite a lot of history... Anyway definitely sounds like fuel for the fire of "not something we can fix in wasmtime". That being said this is also more fuel to the fire of "we should not rely on the native system unwinder to generate a backtrace on every single platform" since this mean wasm traps on musl will always be wrong.
@peterhuene were you able to reproduce how libunwind was working better than llvm-libunwind?
Oh also for
cfg!(linux_gnu)
that's not actually a thing, it's bad shorthand forcfg!(all(target_os = "linux", target_env = "gnu"))
. In general it just seems like frame registration is pretty likely to be different per platform and we just have it where 2/3 platforms (musl/macos out of those plus linux) require iteration rather than whole-chunk registration. Ideally we'd have a clause here which says "you're on an unknown platform, please test this and see what's right"
peterhuene commented on Issue #1914:
I'll update the PR to use
cfg!(all(target_os = "linux", target_env = "gnu"))
and limit the push of the ending record marker in that case too.I agree we can declare the inability to walk the stack for musl targets using llvm-libunwind an external issue to Wasmtime. I think we should be somehow limiting our walks to sections of the stack that contain Wasm frames (as discussed in #1832) as it would both address this issue and #1845.
peterhuene commented on Issue #1914:
@alexcrichton PR updated.
peterhuene edited a comment on Issue #1914:
@alexcrichton PR updated.
I added a check for
target_env = ""
based on the docs. I can remove it if that's not correct.
Last updated: Oct 23 2024 at 20:03 UTC