alexcrichton commented on issue #2991:
I was curious so I ran the test suite in qemu and I ran into an issue that looks like:
---- wasi_cap_std_sync::fd_readdir stdout ----
preopen: "/tmp/wasi_common_fd_readdirfr5CGv"
guest stderr:
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `2`: expected two entries in an empty directory', src/bin/fd_readdir.rs:76:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

=== Error: error while testing Wasm module 'fd_readdir'

Caused by:
    wasm trap: call stack exhausted
    wasm backtrace:
        0: 0xaafb - <unknown>!<std::sys::wasi::stdio::Stderr as std::io::Write>::is_write_vectored::hf152121ba89ed5c9
        1: 0xa867 - <unknown>!rust_panic
        2: 0xa3fd - <unknown>!std::panicking::rust_panic_with_hook::hf735cc98c0f3e6f4
        3: 0x9b35 - <unknown>!std::panicking::begin_panic_handler::{{closure}}::hb082d09953c1ceec
        4: 0x9a76 - <unknown>!std::sys_common::backtrace::__rust_end_short_backtrace::hac58197bca415fd5
        5: 0xa2a1 - <unknown>!rust_begin_unwind
        6: 0xff0b - <unknown>!core::panicking::panic_fmt::hf8b3045973a2d1f9
        7: 0x10b73 - <unknown>!core::panicking::assert_failed::inner::h4a10935a4d4a4d0d
        8: 0x3581 - <unknown>!core::panicking::assert_failed::hf88aca872cdb2b11
        9: 0x167b - <unknown>!fd_readdir::main::hb00a7e1c801281d4
       10: 0x2dcf - <unknown>!std::sys_common::backtrace::__rust_begin_short_backtrace::hf92d88d850fed84d
       11: 0x2e06 - <unknown>!std::rt::lang_start::{{closure}}::hc83ff37db6c562d6
       12: 0xa912 - <unknown>!std::rt::lang_start_internal::h3bc712c5a299b4e4
       13: 0x1ec4 - <unknown>!__original_main
       14: 0x545 - <unknown>!_start
       15: 0x13c08 - <unknown>!_start.command_export
    note: run with `WASMTIME_BACKTRACE_DETAILS=1` environment variable to display more information
Something seems off there if it's reporting that the call stack is exhausted. Perhaps a qemu bug? Maybe a backend bug? In any case I was just curious how qemu would run, although it unfortunately didn't make it to the meat of the tests.
I was also a little surprised at how slow the compile was: our aarch64 build finishes building tests in ~18m, but the s390x tests took ~27m. Presumably this is the speed of the LLVM backend for s390x, so it's nothing related to Wasmtime -- just curious!
uweigand commented on issue #2991:
Note that mainline qemu doesn't quite support z14 yet. Support has been merged into the qemu s390x maintainer repo (branch s390-next in https://gitlab.com/cohuck/qemu.git) but not yet mainline. Not sure if this explains this particular crash.
> I was also a little surprised at how slow the compile was: our aarch64 build finishes building tests in ~18m, but the s390x tests took ~27m. Presumably this is the speed of the LLVM backend for s390x, so it's nothing related to Wasmtime -- just curious!
Is this running as a cross-compiler, or running the native LLVM under qemu? I don't see any particular reason why the s390x back-end should be significantly slower than the aarch64 back-end when running as a cross-compiler ...
uweigand commented on issue #2991:
Also, please hold off merging this a bit -- I just noticed that there seems to be a bug in the auxv crate that causes getauxval to sometimes return a wrong value, so the native platform is mis-detected. I'm currently testing a fix to just use getauxval from the libc crate, which works correctly (and seems more straightforward anyway).
alexcrichton commented on issue #2991:
Ah yeah, it was using stock qemu 6.0.0, and "stack overflow" is also reported for illegal instructions, so that would indeed explain that!
For the slowness, it's LLVM running natively but compiling to s390x. It could also just be variance in GitHub Actions, but afaik the only thing affecting the speed of compiling the test suite in this case would be the s390x backend in LLVM. In any case it's not something we'll fix here, just something I was curious about.
uweigand commented on issue #2991:
> Ah yeah, it was using stock qemu 6.0.0, and "stack overflow" is also reported for illegal instructions, so that would indeed explain that!
> For the slowness, it's LLVM running natively but compiling to s390x. It could also just be variance in GitHub Actions, but afaik the only thing affecting the speed of compiling the test suite in this case would be the s390x backend in LLVM. In any case it's not something we'll fix here, just something I was curious about.
Is there a simple way to reproduce this process outside of GitHub actions? I could have a look ...
alexcrichton commented on issue #2991:
While not exactly easy, one possible way to reproduce is to run the same steps locally that CI does: it basically just downloads QEMU, builds it, and then configures some env vars for cargo's build.
uweigand commented on issue #2991:
> Also, please hold off merging this a bit -- I just noticed that there seems to be a bug in the auxv crate that causes getauxval to sometimes return a wrong value, so the native platform is mis-detected. I'm currently testing a fix to just use getauxval from the libc crate, which works correctly (and seems more straightforward anyway).
OK, this is fixed now. The current version passes the full test suite on both z14 and z15, and it will indeed use the z15 instructions on the latter. As far as I can see, this should be good to merge now. FYI @cfallin .
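For illustration, here is a minimal sketch of what feature detection through the libc crate's getauxval can look like. This is not the exact Cranelift code; the function name is made up for this example, and the HWCAP constant value is taken from the Linux kernel's HWCAP_S390_VXRS_EXT2 bit (the z15 "vector enhancements facility 2") as an assumption:

```rust
// Minimal sketch: query the kernel's auxiliary vector through the libc crate
// instead of the standalone auxv crate.
#[cfg(all(target_arch = "s390x", target_os = "linux"))]
fn has_vxrs_ext2() -> bool {
    // HWCAP bit for the "vector enhancements facility 2" (z15); the value is
    // taken from the Linux kernel headers and is an assumption of this sketch.
    const HWCAP_S390_VXRS_EXT2: libc::c_ulong = 32768;
    // getauxval(AT_HWCAP) returns the hardware-capability bitmask of the
    // machine we are actually running on.
    let hwcap = unsafe { libc::getauxval(libc::AT_HWCAP) };
    hwcap & HWCAP_S390_VXRS_EXT2 != 0
}

#[cfg(not(all(target_arch = "s390x", target_os = "linux")))]
fn has_vxrs_ext2() -> bool {
    false
}

fn main() {
    println!("z15 vector extensions detected: {}", has_vxrs_ext2());
}
```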
uweigand commented on issue #2991:
> While not exactly easy, one possible way to reproduce is to run the same steps locally that CI does: it basically just downloads QEMU, builds it, and then configures some env vars for cargo's build.
Turns out this has nothing to do with qemu; I'm seeing the same failure natively. This is related to the --features "test-programs/test_programs" argument used by ./ci/run-tests.sh -- I hadn't been using this argument in my testing, which means I've apparently never even attempted to execute some of those tests. I'll have a look at why those tests are failing.
uweigand commented on issue #2991:
Turns out this was an endian bug in the handling of the Dirent data type: https://github.com/bytecodealliance/wasmtime/pull/3016
With this, I can now successfully run ./ci/run-tests.sh (at least natively).
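To illustrate the class of bug, here is a minimal sketch (not the actual wasi-common code; the struct layout and field names are simplified assumptions): copying the host's native-endian integers into guest memory gives a little-endian Wasm guest byte-swapped values on s390x, so each field has to be serialized explicitly as little-endian.

```rust
// Minimal sketch, not the actual wasi-common code: the layout and field
// names below are simplified assumptions for illustration only.
struct DirentHeader {
    d_next: u64,   // cookie of the next directory entry
    d_ino: u64,    // inode number
    d_namlen: u32, // length of the name that follows the header
    d_type: u8,    // file type
}

impl DirentHeader {
    // Buggy on big-endian hosts: native-endian bytes are big-endian on
    // s390x, but the Wasm/WASI ABI expects little-endian fields.
    fn to_bytes_native(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(&self.d_next.to_ne_bytes());
        buf.extend_from_slice(&self.d_ino.to_ne_bytes());
        buf.extend_from_slice(&self.d_namlen.to_ne_bytes());
        buf.push(self.d_type);
        buf
    }

    // Endian-safe: serialize every field explicitly as little-endian, so the
    // result is identical on little- and big-endian hosts.
    fn to_bytes_le(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(&self.d_next.to_le_bytes());
        buf.extend_from_slice(&self.d_ino.to_le_bytes());
        buf.extend_from_slice(&self.d_namlen.to_le_bytes());
        buf.push(self.d_type);
        buf
    }
}

fn main() {
    let entry = DirentHeader { d_next: 1, d_ino: 42, d_namlen: 5, d_type: 3 };
    // On a little-endian host the two encodings agree; on s390x only the
    // explicit little-endian form matches what the guest expects.
    println!("native: {:?}", entry.to_bytes_native());
    println!("le:     {:?}", entry.to_bytes_le());
}
```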
alexcrichton commented on issue #2991:
The trap was originally reported as call-stack exhaustion, but given the wasm stack that doesn't seem to be the case; was the trap classification fixed by https://github.com/bytecodealliance/wasmtime/pull/3014? I could definitely imagine that switching endianness would cause some random traps on reads/writes in wasm, though...
uweigand commented on issue #2991:
> The trap was originally reported as call-stack exhaustion, but given the wasm stack that doesn't seem to be the case; was the trap classification fixed by #3014?
Looks like this is indeed the case! I now get
wasm trap: unreachable
which seems reasonable for a rust_panic.
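For reference, a minimal sketch of how this classification surfaces through the embedding API -- written against a current wasmtime release, so the exact Trap API differs from the version under review in this thread:

```rust
use wasmtime::{Engine, Instance, Module, Store, Trap};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A guest function whose body is just the `unreachable` instruction.
    let module = Module::new(
        &engine,
        r#"(module (func (export "boom") unreachable))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let boom = instance.get_typed_func::<(), ()>(&mut store, "boom")?;

    // Calling it traps; the error should classify as `unreachable`,
    // not as call-stack exhaustion.
    let err = boom.call(&mut store, ()).unwrap_err();
    let trap = err.downcast_ref::<Trap>().expect("expected a wasm trap");
    assert_eq!(*trap, Trap::UnreachableCodeReached);
    println!("got trap: {trap}");
    Ok(())
}
```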