akirilov-arm opened Issue #1893:
The AArch64 CI test that runs using QEMU fails consistently for PR #1871 and the reasons are not clear - here's the relevant excerpt from the log:
```
2020-06-13T16:29:49.3730503Z test wast::Cranelift::spec::simd::simd_i32x4_cmp ... ok
2020-06-13T16:29:57.9345959Z test wast::Cranelift::spec::simd::simd_i8x16_sat_arith ... ignored
2020-06-13T16:30:08.5287111Z test wast::Cranelift::spec::simd::simd_lane ... ignored
2020-06-13T16:30:15.8261749Z test wast::Cranelift::spec::simd::simd_load ... ignored
2020-06-13T16:49:23.7624987Z error: test failed, to rerun pass '-p wasmtime-cli --test all'
2020-06-13T16:49:23.7648421Z
2020-06-13T16:49:23.7651248Z Caused by:
2020-06-13T16:49:23.7664954Z   process didn't exit successfully: `/home/runner/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu /home/runner/work/wasmtime/wasmtime/target/aarch64-unknown-linux-gnu/release/deps/all-0af4aa3748ec4770` (signal: 9, SIGKILL: kill)
2020-06-13T16:49:24.0613948Z ##[error]Process completed with exit code 101.
2020-06-13T16:49:25.4620071Z Post job cleanup.
```
I have reproduced the test environment locally using the following commands:
```
rm -rf qemu-5.0.0 ${HOME}/qemu
curl https://download.qemu.org/qemu-5.0.0.tar.xz | tar xJf -
cd qemu-5.0.0
./configure --target-list=aarch64-linux-user --prefix=${HOME}/qemu --disable-tools --disable-slirp --disable-fdt --disable-capstone --disable-docs
make -j$(nproc) install
cd ..
RUSTFLAGS="-D warnings" \
CARGO_INCREMENTAL=0 \
CARGO_PROFILE_DEV_DEBUG=1 \
CARGO_PROFILE_TEST_DEBUG=1 \
CARGO_BUILD_TARGET=aarch64-unknown-linux-gnu \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_RUNNER="${HOME}/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu" \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
RUST_BACKTRACE=1 \
cargo test \
    --features test-programs/test_programs \
    --release \
    --all \
    --exclude lightbeam \
    --exclude peepmatic \
    --exclude peepmatic-automata \
    --exclude peepmatic-fuzzing \
    --exclude peepmatic-macro \
    --exclude peepmatic-runtime \
    --exclude peepmatic-test \
    --exclude wasmtime-fuzz
```
However, I don't experience any test failures. In addition to that, I don't see any issues either when I run the test natively in an AArch64 environment. In that case the list of commands can be simplified to:
```
cargo test --release --all --exclude lightbeam
```
Note that the `--features test-programs/test_programs` parameter is omitted because it requires `rust-lld`, which appears not to be a part of the native AArch64 toolchain.

This issue has also been discussed in PR #1802.
cc @cfallin
cfallin commented on Issue #1893:
I suspect a qemu issue, as @alexcrichton had said earlier; it's too bad that upgrading to 5.0.0 didn't fix it.
I wonder if we could transition to running CI jobs on our native aarch64 machine, now that we have one -- @alexcrichton, thoughts? (I think GitHub has a native-CI-runner feature.)
alexcrichton commented on Issue #1893:
Locally I ran the test suite in qemu 5.0.0 and I saw the peak memory usage jump by ~1GB after applying https://github.com/bytecodealliance/wasmtime/pull/1871. This is the peak memory usage of QEMU itself when running the test suite. Already 10GB is pretty huge; for comparison it takes 200MB on native to run the `all-*` test suite.

I ran a small test on GitHub Actions CI and found that a program could allocate a 10687086592-byte (9.95 GiB) vector but would fail to allocate 10791944192 bytes (10.05 GiB). Similarly, in local testing (according to `/usr/bin/time`) the `all-*` test suite in qemu took 10129944k (9.6 GiB) before this change and went to 11286384k (10.7 GiB) after enabling this test. My test program was killed by SIGKILL on GitHub Actions as well.

This doesn't feel like a bug in QEMU other than "maybe too much memory is used?", so it seems like we're just hitting the OOM killer on CI: it appears that if we cross the 10GiB threshold for allocated memory we get OOM-killed. That would also explain why it's not an issue locally, since our machines presumably have lots more RAM and/or less aggressive OOM killers.
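For reference, here's a minimal sketch of the kind of probe program described above (a reconstruction, not the exact program used; the sizes are the ones quoted in this comment, and under QEMU the allocation counts toward the emulator's own footprint):

```rust
// Hypothetical reconstruction of the allocation probe: allocate and
// touch a vector of a given size, printing a marker once the memory is
// actually committed. On the CI runner the second probe never printed
// its "ok" line because the process was SIGKILLed by the OOM killer.
fn probe(bytes: usize) {
    println!(
        "allocating {} bytes ({:.2} GiB)...",
        bytes,
        bytes as f64 / (1u64 << 30) as f64
    );
    // vec! writes every element (non-zero fill), so the memory is
    // actually committed rather than just reserved.
    let v = vec![1u8; bytes];
    println!("ok (v[0] = {}, v[last] = {})", v[0], v[bytes - 1]);
}

fn main() {
    probe(10_687_086_592); // 9.95 GiB: succeeded on the runner
    probe(10_791_944_192); // 10.05 GiB: OOM-killed on the runner
}
```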
In terms of fixing this, that may be a bit harder. Some options include:
- Move to native AArch64 CI. This is unfortunately pretty tricky to do, and boils down to the fact that GitHub recommends we don't do this. There are possible workarounds we could apply (rust-lang/rust is pioneering this, and we'll likely just copy them). This will take some time though, and rust-lang/rust is still in the process of working out all the various issues.
- Split apart our test suite. I suspect the issue is that QEMU isn't freeing something it should, so we could run fewer tests inside a single QEMU process. Unfortunately I don't know of a great way to do this automatically. Ironically, we actually unified our test suite to fix other CI-related issues: our binaries are quite large, so we can't have dozens of test binaries since that would blow our disk limit.
  - There's experimental support on nightly where each test is run in a forked process, which we may be able to try out. I'm not holding my breath for this, though.
- Split just the execution of the test suite by having a "driver program" which runs the test binary with `--list`, manually splits that list into shards, and then runs the test executable multiple times with `--exact` options and a list of test names (a rough sketch follows below).

None of these, AFAIK, are easy-ish things to do, unfortunately... I suppose there's always the option of writing fewer tests :)
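As a very rough sketch of that last option (nothing like this exists in-tree; the binary path and shard size below are made up for illustration), such a driver could lean on libtest's existing `--list` and `--exact` flags:

```rust
use std::process::Command;

// Hypothetical sharding driver: list every test in the binary, split the
// list into fixed-size shards, and run each shard in a fresh process so
// that QEMU's memory usage resets between shards. The binary path and
// shard size are placeholders; on CI the binary would be invoked through
// the qemu-aarch64 runner rather than directly.
fn main() {
    let test_bin = "target/aarch64-unknown-linux-gnu/release/deps/all-0af4aa3748ec4770";
    let shard_size = 50;

    // `<bin> --list` prints one `test_name: test` line per test.
    let out = Command::new(test_bin)
        .arg("--list")
        .output()
        .expect("failed to list tests");
    let names: Vec<String> = String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter_map(|l| l.strip_suffix(": test").map(str::to_owned))
        .collect();

    for shard in names.chunks(shard_size) {
        // With `--exact`, each positional argument must match a full test
        // name (newer versions of libtest accept multiple filters).
        let status = Command::new(test_bin)
            .arg("--exact")
            .args(shard)
            .status()
            .expect("failed to run test shard");
        assert!(status.success(), "a test in this shard failed");
    }
}
```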
cfallin commented on Issue #1893:
Hmm. Just now I went down a small rabbit-hole trying to work out if there's a way to reduce the translation cache size for qemu's JIT, in case that's the issue. Unfortunately it seems there's only `-accel tb-size=...` for system-mode qemu, but not user-mode qemu. (Anyone else know another option?)

Another option to add to the above list would be "fix qemu's memory blowup". Unfortunately that doesn't seem a whole lot easier than the other options, but who knows, maybe it's a quick fix once found.
@akirilov-arm: for now, while we develop aarch64 SIMD support, I think it's reasonable to keep the SIMD tests specifically disabled in-tree, in the absence of better options. (We should be careful to run tests locally on a native aarch64 machine, of course.) We'll have to find a better solution before declaring SIMD "done", though.
I'll go ahead and rename this issue to track the qemu memory blowup (which is the root problem), if you don't mind. Sorry again about our CI wonkiness!