alexcrichton opened PR #5766 from merge-queue
to main
:
GitHub recently made its merge queue feature available for use in public repositories owned by organizations meaning that the Wasmtime repository is a candidate for using this. GitHub's Merge Queue feature is a system that's similar to Rust's bors integration where PRs are tested before merging and only passing PRs are merged. This implements the "not rocket science" rule where the
main
branch of Wasmtime, for example, is always tested and passes CI. This is in contrast to our current implementation of CI where PRs are merged when they pass their own CI, but the code that was tested is not guaranteed to be the state ofmain
when the PR is merged, meaning that we're at risk now of a failingmain
branch despite all merged PRs being green. While this has happened with Wasmtime this is not a common occurrence, however.The main motivation, instead, to use GitHub's Merge Queue feature is that it will enable Wasmtime to greatly reduce the amount of CI running on PRs themselves. Currently the full test suite runs on every push to every PR, meaning that our workers on GitHub Actions are frequently clogged throughout weekdays and PRs can take quite some time to come back with a successful run. Through the use of a Merge Queue, however, we're able to configure only a small handful of checks to run on PRs while deferring the main body of checks to happening on the merge-via-the-queue itself. This is hoped to free up capacity on CI and overall improve CI times for Wasmtime and Cranelift developers.
The implementation of all of this required quite a lot of plumbing and retooling of our CI. I've been testing this in an [external repository][testrepo] and I think everything is working now. A list of changes made in this PR are:
The
build.yml
workflow is merged back into themain.yml
workflow as the original reason to split it out is not longer applicable (it'll run on all merges). This was also done to fit in the dependency graph of jobs of one workflow.Publication of the
gh-pages
branch, thedev
tag artifacts, and release artifacts have been moved to a separatepublish-artifacts.yml
workflow. This workflow runs on all pushes tomain
and all tags. This workflow no longer actually preforms any builds, however, and relies on a merge queue or similar being used for branches/tags where artifacts are downloaded from the workflow run to be uploaded. For pushes tomain
this works because a merge queue is run meaning that by the time the push happens all artifacts are ready. For release branches this is handled by..The
push-tag.yml
workflow is subsumed by themain.yml
workflow. CI for a tag being pushed will upload artifacts to a release in GitHub, meaning that all builds must finish first for the commit. Themain.yml
workflow at the end now scans commits for the preexisting magical marker and pushes a tag if necessary.CI is currently a flat list of "run all these jobs" and this is now rearchitected to a "fan out" approach where some jobs run to determine the next jobs to run which then get "joined" into a finish step. The purpose for this is somewhat nuanced and this has implications for CI runtime as well. The Merge Queue feature requires branches to be protected with "these checks must pass" and then the same checks are gates both to enter the merge queue as well as pass the merge queue. The saving grace, however, is that a "skipped" check counts as passing, meaning checks can be skipped on PRs but run to completion on the merge queue. A problem with this though is the build matrix used for tests where PRs want to only run one element of the build matrix ideally but there's no means on GitHub Actions right now for the skipped entries to show up as skipped easily (or not that I know of). This means that the "join" step serves the purpose of being the single gate for both PR and merge queue CI and there's just more inputs to it for merge queue CI. The major consequence of this decision is that GitHub's actions scheduling doesn't work out well here. Jobs are scheduled in a FIFO order meaning that the job for "ok complete the CI run" is queued up after everything else has completed, possibly after lots of other CI requests in the middle for other PRs. The hope here is that by using a merge queue we can keep CI relatively under control and this won't affect merge times too much.
All jobs in the
main.yml
workflow will not automatically cancel the entire run if they fail. Previously this fail-fast behavior was only part of the matrix runs (and just for that matrix), but this is required to make the merge queue expedient. The gate of the merge queue is the final "join" step which is only executed once all dependencies have finished. This means, for example, that if rustfmt fails quickly then the tests which take longer might run for quite awhile before the join step reports failure, meaning that the PR sits in the queue for longer than needed being tested when we know it's already going to fail. By having all jobs cancel the run this means that failures immediately bail out and mark the whole job as cancelled.A new "determine" CI job was added to determine what CI actually needs to run. This is a "choke point" which is scheduled at the start of CI that quickly figures out what else needs to be run. This notably indicates whether large swaths of ci (the
run-full
flag) like the build matrix are executed. Additionally this dynamically calculates a matrix of tests to run based on a new./ci/build-test-matrix.js
script. Various inputs are considered for this such as:
- All pushes, meaning merge queue branches or release-branch merges, will run full CI.
- PRs to release branches will run full CI.
- PRs to
main
, the most common, determine what to run based on what's modified and what's in the commit message.Some examples for (3) above are if modifications are made to
cranelift/codegen/src/isa/*
then that corresponding builder is
executed on CI. If thecrates/c-api
directory is modified then the
CMake-based tests are run on PRs but are otherwise skipped.
Annotations in commit messages such asprtest:s390x
can be used to
explicitly request testing, notably usingprtest:full
to explicitly
request the full test suite is run.Before this PR merges to
main
would perform two full runs of CI: one on the PR itself and one on the merge tomain
. Note that the one as a merge tomain
was quite frequently cancelled due to a merge happening later. Additionally before this PR there was always the risk of a bad merge where what was merged ended up creating amain
that failed CI to to a non-code-related merge conflict.After this PR merges to
main
will perform one full run of CI, the one as part of the merge queue. PRs themselves will perform one test job most of the time otherwise. Themain
branch is additionally always guaranteed to pass tests via the merge queue feature.For release branches, before this PR merges would perform two full builds - one for the PR and one for the merge. A third build was then required for the release tag itself. This is now cut down to two full builds, one for the PR and one for the merge. The reason for this is that the merge queue feature currently can't be used for our wildcard-based
release-*
branch protections. It is now possible, however, to turn on required CI checks for therelease-*
branch PRs so we can at least have a "hit the button and forget" strategy for merging PRs now.Note that this change to CI is not without its risks. The Merge Queue feature is still in beta and is quite new for GitHub. One bug that Trevor and I uncovered is that if a PR is being tested in the merge queue and a contributor pushes to their PR then the PR isn't removed from the merge queue but is instead merged when CI is successful, losing the changes that the contributor pushed (what's merged is what was tested). We suspect that GitHub will fix this, however.
Additionally though there's the risk that this may increase merge time for PRs to Wasmtime in practice. The Merge Queue feature has the ability to "batch" PRs together for a merge but this is only done if concurrent builds are allowed. This means that if 5 PRs are batched together then 5 separate merges would be created for the stack of 5 PRs. If the CI for all 5 merged together passes then everything is merged, otherwise a PR is kicked out. We can't easily do this, however, since a major purpose for the merge queue for us would be to cut down on usage of CI builders meaning the max concurrency would be set to 1 meaning that only one PR at a time will be merged. This means PRs may sit in the queue for awhile since previously many
main
-based builds are cancelled due to subsequent merges of other PRs, but now they must all run to 100% completion.[testrepo]: https://github.com/bytecodealliance/wasmtime-merge-queue-testing
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
alexcrichton requested elliottt for a review on PR #5766.
alexcrichton updated PR #5766 from merge-queue
to main
.
elliottt submitted PR review.
elliottt submitted PR review.
elliottt created PR review comment:
I really like that all the logic for determining what to run is inlined in this job!
elliottt created PR review comment:
It might be worth adding a comment here that the intent is not that we re-run full CI for merges to
main
, but only for release branch and merge-queue builds.
elliottt created PR review comment:
Worth adding a
-F
here to ensure that the.
isn't interpreted as a wildcard?
elliottt created PR review comment:
If I write a commit message about being
fuzzy on some details
, will that cause the fuzz tests to run?
elliottt created PR review comment:
What was this change for?
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah this was just to make this more easily testable in external repositories. Otherwise I couldn't, as written, test it anywhere but the wasmtime repo. The condition here allows explicit requests for this workflow via the github UI which in theory no one will ever do but me
alexcrichton submitted PR review.
alexcrichton created PR review comment:
In this specific case this is checking
names.log
which is just the list of edited files in the PR, so I don't think so. Otherwise though the grep-based filters are indeed highly simplistic and will probably fall into traps such as that.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Oh I actually intended this to be a wildcard since I don't want to deal with
/
vs\
for Windows. I don't think it actually matters though because the modified files I think always use/
. Other than that though this is just my supreme laziness of "I don't want to have to remember if/
is valid for a regex so I'll just put a wildcard there and accept false positives"
alexcrichton updated PR #5766 from merge-queue
to main
.
alexcrichton merged PR #5766.
Last updated: Dec 23 2024 at 12:05 UTC