Stream: git-wasmtime

Topic: wasmtime / Issue #1893 AArch64 CI test failure


view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2020 at 19:40):

akirilov-arm opened Issue #1893:

The AArch64 CI test run that uses QEMU fails consistently for PR #1871, and the reasons are not clear - here is the relevant excerpt from the log:

2020-06-13T16:29:49.3730503Z test wast::Cranelift::spec::simd::simd_i32x4_cmp ... ok
2020-06-13T16:29:57.9345959Z test wast::Cranelift::spec::simd::simd_i8x16_sat_arith ... ignored
2020-06-13T16:30:08.5287111Z test wast::Cranelift::spec::simd::simd_lane ... ignored
2020-06-13T16:30:15.8261749Z test wast::Cranelift::spec::simd::simd_load ... ignored
2020-06-13T16:49:23.7624987Z error: test failed, to rerun pass '-p wasmtime-cli --test all'
2020-06-13T16:49:23.7648421Z
2020-06-13T16:49:23.7651248Z Caused by:
2020-06-13T16:49:23.7664954Z   process didn't exit successfully: `/home/runner/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu /home/runner/work/wasmtime/wasmtime/target/aarch64-unknown-linux-gnu/release/deps/all-0af4aa3748ec4770` (signal: 9, SIGKILL: kill)
2020-06-13T16:49:24.0613948Z ##[error]Process completed with exit code 101.
2020-06-13T16:49:25.4620071Z Post job cleanup.

I have reproduced the test environment locally using the following commands:

rm -rf qemu-5.0.0 ${HOME}/qemu
curl https://download.qemu.org/qemu-5.0.0.tar.xz | tar xJf -
cd qemu-5.0.0
./configure --target-list=aarch64-linux-user --prefix=${HOME}/qemu --disable-tools --disable-slirp --disable-fdt --disable-capstone --disable-docs
make -j$(nproc) install
cd ..
RUSTFLAGS="-D warnings" \
  CARGO_INCREMENTAL=0 \
  CARGO_PROFILE_DEV_DEBUG=1 \
  CARGO_PROFILE_TEST_DEBUG=1 \
  CARGO_BUILD_TARGET=aarch64-unknown-linux-gnu \
  CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_RUNNER="${HOME}/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu" \
  CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
  RUST_BACKTRACE=1 \
  cargo test \
  --features test-programs/test_programs \
  --release \
  --all \
  --exclude lightbeam \
  --exclude peepmatic \
  --exclude peepmatic-automata \
  --exclude peepmatic-fuzzing \
  --exclude peepmatic-macro \
  --exclude peepmatic-runtime \
  --exclude peepmatic-test \
  --exclude wasmtime-fuzz

However, I don't experience any test failures. I also don't see any issues when I run the tests natively in an AArch64 environment, in which case the list of commands can be simplified to:

cargo test --release --all --exclude lightbeam

Note that the --features test-programs/test_programs parameter is omitted because it requires rust-lld, which does not appear to be part of the native AArch64 toolchain.
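
As an aside, one quick way to check whether rust-lld is present in a given toolchain (assuming the usual rustup sysroot layout, where the self-contained linker tools live under lib/rustlib/<triple>/bin/) is:

ls "$(rustc --print sysroot)"/lib/rustlib/*/bin/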

This issue has also been discussed in PR #1802.

cc @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2020 at 20:00):

cfallin commented on Issue #1893:

I suspect a qemu issue, as @alexcrichton had said earlier; it's too bad that upgrading to 5.0.0 didn't fix it.

I wonder if we could transition to running CI jobs on our native aarch64 machine, now that we have one -- @alexcrichton, thoughts (I think GitHub has a native-CI-runner feature)?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2020 at 21:23):

alexcrichton commented on Issue #1893:

Locally I ran the test suite under qemu 5.0.0 and saw the peak memory usage jump by ~1 GB after applying https://github.com/bytecodealliance/wasmtime/pull/1871. (This is the peak memory usage of the QEMU process itself while running the test suite.) A baseline of roughly 10 GB is already pretty huge; for comparison, running the all-* test suite natively takes about 200 MB.

I ran a small test on GitHub Actions CI and found that a program could allocate a 10687086592-byte (9.95 GiB) vector but would fail to allocate 10791944192 bytes (10.05 GiB). Similarly, in local testing (according to /usr/bin/time) the all-* test suite in qemu peaked at 10129944 KiB (9.6 GiB) before and at 11286384 KiB (10.7 GiB) after enabling this test. My test program was killed with SIGKILL on GitHub Actions as well.
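
For reference, the measurement here is essentially a /usr/bin/time wrapper around the QEMU invocation that cargo would otherwise run itself - a sketch, with the test binary hash left as a placeholder for whatever the local build produces:

/usr/bin/time -v ${HOME}/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu \
  target/aarch64-unknown-linux-gnu/release/deps/all-<hash>

The "Maximum resident set size (kbytes)" line in the -v output is the peak figure quoted above.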

This doesn't feel like a bug in QEMU, other than "maybe too much memory is used?"; it seems like we're simply hitting an OOM limit on CI. It appears that if we cross a ~10 GiB threshold of allocated memory we get OOM-killed. That would also explain why it's not an issue locally, since local machines presumably have a lot more RAM and/or less aggressive OOM killers.
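
One way to double-check the OOM-killer theory (assuming the runner image allows sudo, as the GitHub-hosted Linux runners do) would be to dump the kernel log right after the failing step, e.g.:

sudo dmesg | grep -iE 'out of memory|killed process'

A "Killed process" line naming qemu-aarch64 there would line up with the SIGKILL in the CI log.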

Fixing this may be a bit harder, though. Some options include:

None of these AFAIK are easy-ish things to do, unfortunately... I suppose there's the option of writing fewer tests :)

view this post on Zulip Wasmtime GitHub notifications bot (Jun 17 2020 at 21:54):

cfallin commented on Issue #1893:

Hmm. Just now I went down a small rabbit-hole trying to work out if there's a way to reduce the translation cache size for qemu's JIT, in case that's the issue. Unfortunately it seems there's only -accel tb-size=... for system-mode qemu, but not user-mode qemu. (Anyone else know another option?)
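
For reference, the system-mode knob takes roughly the following form (assuming qemu 5.0's accelerator-property syntax, with the size given in MiB), and unfortunately has no user-mode equivalent:

qemu-system-aarch64 -accel tcg,tb-size=512 ...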

Another option to add to the above list would be "fix qemu's memory blowup". Unfortunately that doesn't seem a whole lot easier than the other options, but who knows, maybe it's a quick fix once found.

@akirilov-arm: for now, while we develop aarch64 SIMD support, I think it's reasonable to keep the SIMD tests specifically disabled in-tree, in the absence of better options. (We should be careful to run tests locally on a native aarch64 machine, of course.) We'll have to find a better solution before declaring SIMD "done", though.

I'll go ahead and rename this issue to track the qemu memory blowup (which is the root problem), if you don't mind. Sorry again about our CI wonkiness!

