wasmtime / PR #5766 Update CI to use GitHub's Merge Queue · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #5766 Update CI to use GitHub's Merge Queue

Wasmtime GitHub notifications bot (Feb 13 2023 at 15:41):

alexcrichton opened PR #5766 from merge-queue to main:

GitHub recently made its merge queue feature available for use in public repositories owned by organizations meaning that the Wasmtime repository is a candidate for using this. GitHub's Merge Queue feature is a system that's similar to Rust's bors integration where PRs are tested before merging and only passing PRs are merged. This implements the "not rocket science" rule where the main branch of Wasmtime, for example, is always tested and passes CI. This is in contrast to our current implementation of CI where PRs are merged when they pass their own CI, but the code that was tested is not guaranteed to be the state of main when the PR is merged, meaning that we're at risk now of a failing main branch despite all merged PRs being green. While this has happened with Wasmtime this is not a common occurrence, however.

The main motivation, instead, to use GitHub's Merge Queue feature is that it will enable Wasmtime to greatly reduce the amount of CI running on PRs themselves. Currently the full test suite runs on every push to every PR, meaning that our workers on GitHub Actions are frequently clogged throughout weekdays and PRs can take quite some time to come back with a successful run. Through the use of a Merge Queue, however, we're able to configure only a small handful of checks to run on PRs while deferring the main body of checks to happening on the merge-via-the-queue itself. This is hoped to free up capacity on CI and overall improve CI times for Wasmtime and Cranelift developers.

The implementation of all of this required quite a lot of plumbing and retooling of our CI. I've been testing this in an [external repository][testrepo] and I think everything is working now. A list of changes made in this PR are:

The build.yml workflow is merged back into the main.yml workflow as the original reason to split it out is not longer applicable (it'll run on all merges). This was also done to fit in the dependency graph of jobs of one workflow.

Publication of the gh-pages branch, the dev tag artifacts, and release artifacts have been moved to a separate publish-artifacts.yml workflow. This workflow runs on all pushes to main and all tags. This workflow no longer actually preforms any builds, however, and relies on a merge queue or similar being used for branches/tags where artifacts are downloaded from the workflow run to be uploaded. For pushes to main this works because a merge queue is run meaning that by the time the push happens all artifacts are ready. For release branches this is handled by..

The push-tag.yml workflow is subsumed by the main.yml workflow. CI for a tag being pushed will upload artifacts to a release in GitHub, meaning that all builds must finish first for the commit. The main.yml workflow at the end now scans commits for the preexisting magical marker and pushes a tag if necessary.

CI is currently a flat list of "run all these jobs" and this is now rearchitected to a "fan out" approach where some jobs run to determine the next jobs to run which then get "joined" into a finish step. The purpose for this is somewhat nuanced and this has implications for CI runtime as well. The Merge Queue feature requires branches to be protected with "these checks must pass" and then the same checks are gates both to enter the merge queue as well as pass the merge queue. The saving grace, however, is that a "skipped" check counts as passing, meaning checks can be skipped on PRs but run to completion on the merge queue. A problem with this though is the build matrix used for tests where PRs want to only run one element of the build matrix ideally but there's no means on GitHub Actions right now for the skipped entries to show up as skipped easily (or not that I know of). This means that the "join" step serves the purpose of being the single gate for both PR and merge queue CI and there's just more inputs to it for merge queue CI. The major consequence of this decision is that GitHub's actions scheduling doesn't work out well here. Jobs are scheduled in a FIFO order meaning that the job for "ok complete the CI run" is queued up after everything else has completed, possibly after lots of other CI requests in the middle for other PRs. The hope here is that by using a merge queue we can keep CI relatively under control and this won't affect merge times too much.

All jobs in the main.yml workflow will not automatically cancel the entire run if they fail. Previously this fail-fast behavior was only part of the matrix runs (and just for that matrix), but this is required to make the merge queue expedient. The gate of the merge queue is the final "join" step which is only executed once all dependencies have finished. This means, for example, that if rustfmt fails quickly then the tests which take longer might run for quite awhile before the join step reports failure, meaning that the PR sits in the queue for longer than needed being tested when we know it's already going to fail. By having all jobs cancel the run this means that failures immediately bail out and mark the whole job as cancelled.

A new "determine" CI job was added to determine what CI actually needs to run. This is a "choke point" which is scheduled at the start of CI that quickly figures out what else needs to be run. This notably indicates whether large swaths of ci (the run-full flag) like the build matrix are executed. Additionally this dynamically calculates a matrix of tests to run based on a new ./ci/build-test-matrix.js script. Various inputs are considered for this such as:

All pushes, meaning merge queue branches or release-branch merges, will run full CI.

PRs to release branches will run full CI.

PRs to main, the most common, determine what to run based on what's modified and what's in the commit message.

Some examples for (3) above are if modifications are made to
cranelift/codegen/src/isa/* then that corresponding builder is
executed on CI. If the crates/c-api directory is modified then the
CMake-based tests are run on PRs but are otherwise skipped.
Annotations in commit messages such as prtest:s390x can be used to
explicitly request testing, notably using prtest:full to explicitly
request the full test suite is run.

Before this PR merges to main would perform two full runs of CI: one on the PR itself and one on the merge to main. Note that the one as a merge to main was quite frequently cancelled due to a merge happening later. Additionally before this PR there was always the risk of a bad merge where what was merged ended up creating a main that failed CI to to a non-code-related merge conflict.

After this PR merges to main will perform one full run of CI, the one as part of the merge queue. PRs themselves will perform one test job most of the time otherwise. The main branch is additionally always guaranteed to pass tests via the merge queue feature.

For release branches, before this PR merges would perform two full builds - one for the PR and one for the merge. A third build was then required for the release tag itself. This is now cut down to two full builds, one for the PR and one for the merge. The reason for this is that the merge queue feature currently can't be used for our wildcard-based release-* branch protections. It is now possible, however, to turn on required CI checks for the release-* branch PRs so we can at least have a "hit the button and forget" strategy for merging PRs now.

Note that this change to CI is not without its risks. The Merge Queue feature is still in beta and is quite new for GitHub. One bug that Trevor and I uncovered is that if a PR is being tested in the merge queue and a contributor pushes to their PR then the PR isn't removed from the merge queue but is instead merged when CI is successful, losing the changes that the contributor pushed (what's merged is what was tested). We suspect that GitHub will fix this, however.

Additionally though there's the risk that this may increase merge time for PRs to Wasmtime in practice. The Merge Queue feature has the ability to "batch" PRs together for a merge but this is only done if concurrent builds are allowed. This means that if 5 PRs are batched together then 5 separate merges would be created for the stack of 5 PRs. If the CI for all 5 merged together passes then everything is merged, otherwise a PR is kicked out. We can't easily do this, however, since a major purpose for the merge queue for us would be to cut down on usage of CI builders meaning the max concurrency would be set to 1 meaning that only one PR at a time will be merged. This means PRs may sit in the queue for awhile since previously many main-based builds are cancelled due to subsequent merges of other PRs, but now they must all run to 100% completion.

[testrepo]: https://github.com/bytecodealliance/wasmtime-merge-queue-testing

Wasmtime GitHub notifications bot (Feb 13 2023 at 15:43):

alexcrichton requested elliottt for a review on PR #5766.

Wasmtime GitHub notifications bot (Feb 13 2023 at 19:22):

alexcrichton updated PR #5766 from merge-queue to main.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt submitted PR review.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt submitted PR review.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt created PR review comment:

I really like that all the logic for determining what to run is inlined in this job!

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt created PR review comment:

It might be worth adding a comment here that the intent is not that we re-run full CI for merges to main, but only for release branch and merge-queue builds.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt created PR review comment:

Worth adding a -F here to ensure that the . isn't interpreted as a wildcard?

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt created PR review comment:

If I write a commit message about being fuzzy on some details, will that cause the fuzz tests to run?

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:43):

elliottt created PR review comment:

What was this change for?

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:49):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:49):

alexcrichton created PR review comment:

Ah this was just to make this more easily testable in external repositories. Otherwise I couldn't, as written, test it anywhere but the wasmtime repo. The condition here allows explicit requests for this workflow via the github UI which in theory no one will ever do but me

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:49):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:49):

alexcrichton created PR review comment:

In this specific case this is checking names.log which is just the list of edited files in the PR, so I don't think so. Otherwise though the grep-based filters are indeed highly simplistic and will probably fall into traps such as that.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:51):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:51):

alexcrichton created PR review comment:

Oh I actually intended this to be a wildcard since I don't want to deal with / vs \ for Windows. I don't think it actually matters though because the modified files I think always use /. Other than that though this is just my supreme laziness of "I don't want to have to remember if / is valid for a regex so I'll just put a wildcard there and accept false positives"

Wasmtime GitHub notifications bot (Feb 16 2023 at 18:54):

alexcrichton updated PR #5766 from merge-queue to main.

Wasmtime GitHub notifications bot (Feb 16 2023 at 19:56):

alexcrichton merged PR #5766.

Last updated: Apr 16 2025 at 21:03 UTC