cfallin opened issue #2933:
While waiting for CI jobs during the release process on Friday, a few of us started discussing CI runtimes; and again today, while watching a "Publish" job remain queued, it's very much on my mind.
CI time is important to reduce both because it's in the critical path of lots of things -- especially, but not only, when making an urgent release -- and because it costs resources. (E.g., if we decide to pay for more GitHub runners some day, it looks like we would pay 0.8 cents per minute of CI time.)
A few thoughts occur to me:
The "Publish" job depends on the others and hence starts after they finish. Because it starts much later, it sometimes gets stuck behind another PR's jobs in the global run-queue, and sits for a long time waiting to start. This is a prioritization failure (the jobs that let an approved PR merge should run before the initial CI of a speculative PR, etc.), but we can also exert some control over this problem by avoiding the need for additional job-starts.
Specifically, could we incorporate a "slice" of the publish task at the end of each build job? That way, instead of using a dedicated job to upload build artifacts once all individual parts are built, we just upload as we go. If we did this, we would have a single-depth critical path.
It's possible that we want to think about how test failures in one configuration would or wouldn't gate uploads from another; but it seems to me that if uploads are keyed by commit hash, or if we are just careful about concurrent runs when tagged releases are involved, then this could be avoided.
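To make the idea concrete, here's a rough sketch of what an inline upload step could look like, assuming hypothetical job names, targets, and paths; the real workflow would differ:

```yaml
# Hypothetical sketch: each release-build job uploads its own slice of the
# release as it finishes, instead of a separate "Publish" job collecting
# everything afterwards. Job names, targets, and paths are illustrative.
jobs:
  build:
    strategy:
      matrix:
        target: [x86_64-unknown-linux-gnu, x86_64-apple-darwin, x86_64-pc-windows-msvc]
    runs-on: ubuntu-latest   # per-target runners elided for brevity
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release --target ${{ matrix.target }}
      # Upload immediately; the artifact name includes the commit SHA so
      # concurrent runs can't stomp on each other.
      - uses: actions/upload-artifact@v4
        with:
          name: wasmtime-${{ matrix.target }}-${{ github.sha }}
          path: target/${{ matrix.target }}/release/wasmtime*
```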
The release-build jobs are ~always the long tail, and they run tests as well after the build is complete. We thus run tests in both debug mode (the "Test ({stable,beta,nightly})" jobs) and release mode. This is nice for coverage -- there are certainly times when issues occur only as a result of certain optimizations -- but most of the time this is not the case, and we would save significant resources and wait-time by running the tests (in debug mode only) and the release build in parallel.
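For illustration, a minimal sketch of that split, with the debug-mode test matrix and a build-only release job running as siblings (job names and commands here are assumptions, not the actual workflow):

```yaml
# Hypothetical sketch: debug-mode tests and the release build run as sibling
# jobs, rather than the release builds re-running the whole test suite.
jobs:
  test:
    strategy:
      matrix:
        rust: [stable, beta, nightly]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: rustup toolchain install ${{ matrix.rust }} && rustup default ${{ matrix.rust }}
      - run: cargo test --workspace          # debug-mode tests only

  build-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release           # build only; no release-mode test run
```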
We have some other tests that are (IMHO) nice-to-have, but not critical for our current release configuration. E.g., "Rebuild Peephole Optimizers" takes ~15-20 minutes of CI time per run. This is nice to have as a part of Peepmatic, for sure; but if development is not currently ongoing on that project, we could potentially "pause" the jobs until it is, and save CI time and resources.
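One lightweight way to "pause" such a job is to gate it on the relevant paths (or a manual trigger) rather than running it on every PR. A hypothetical sketch, with made-up path globs and package name:

```yaml
# Hypothetical sketch: only rebuild the peephole optimizers when the relevant
# sources change, or when triggered manually. Paths and package name are
# illustrative, not the real repository layout.
on:
  workflow_dispatch: {}
  pull_request:
    paths:
      - 'cranelift/peepmatic/**'

jobs:
  rebuild-peephole-optimizers:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo run -p rebuild-peephole-optimizers   # hypothetical package name
```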
Thoughts? I hope the general topic of reducing CI time is not controversial, though I recognize some of the above ideas could be; hoping to spawn discussion about our explicit needs and resources, nonetheless!
cfallin commented on issue #2933:
Along with the above, I should note that if we have tests that we want to put into a "second tier" bucket, we could potentially find a way to run them, e.g., nightly. Release-build tests could fall in that category, for example. This would ensure we catch the long tail of bugs "eventually" without waiting for the long tail of runtimes on every CI re-spin.
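A minimal sketch of such a nightly "second tier" workflow, assuming a scheduled trigger; the job contents are illustrative:

```yaml
# Hypothetical sketch: run the expensive release-mode tests on a nightly
# schedule instead of on every PR.
name: nightly-tests
on:
  schedule:
    - cron: '0 4 * * *'    # once a day; the time is arbitrary
  workflow_dispatch: {}    # allow manual runs too

jobs:
  test-release-mode:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace --release
```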
alexcrichton commented on issue #2933:
I'm all for making CI faster! In addition to the "do less work" angle you've mentioned above, the other major route to speed things up is to optimize what's already there, some possibilities being:
- We've never dug into `sccache`, but that's probably one of the easiest things to set up and get a speedup. This isn't a silver bullet but should help at least a little bit. This would require an AWS account to get set up with an S3 bucket, however. Note that I think we would quickly blow the 5GB limit for GitHub `actions/cache` caching, so I think for caching between builds `sccache` would be our best bet (see the sketch below).
- Try to avoid redundant builds.
- Reuse as much of the cache as possible between runs. The "Test" jobs do a `cargo run -p run-examples` followed by the real test, and I think we're recompiling things a lot of times between these runs. There's likely a lot of room for optimization here.

The downside of "just make things faster", though, is that it's a constant uphill battle. Things always regress accidentally because you can't get precise timing from CI. Additionally it's very easy to add more things and generally quite difficult to take them out. While I think this is worth mentioning, I think what you've outlined above is perhaps a better route forward.
For "do less work" as you've mentioned, this is always tricky. For example, I'm less confident about removing the release-mode tests. We've had some tricky/subtle bugs show up rarely in the past, and while I definitely agree that 95% of the time these builders will never fail, having us catch those bugs instead of users is generally much better. I personally don't know how to weigh "this is an expensive CI job" against "here's the hypothetical failure rate and the benefits it brings us". Ideally we could put numbers on that and have a literal threshold, but it seems somewhat far-fetched.
To answer your ideas:
- The publish job is intended to be a mostly-atomic "release everything" moment. It collects documentation artifacts from a number of builders as well as release artifacts from various builders, and then shoves them all into gh-pages and releases. The semi-atomic part here prevents weird state where we have multiple CI jobs for the `main` branch running around stomping all over each other, but https://github.com/bytecodealliance/wasmtime/pull/2932 may alleviate this to the point where we should just ditch the release job entirely and push things up inline.
- For skipping release build tests, as mentioned above, I'd ideally prefer that this was a last resort after we've exhausted other avenues.
- For non-critical things that aren't seeing active development, like peepmatic, I agree it seems reasonable to cut them from CI for now.
One final option is to look to integrate bors in one way or another. This is unfortunately a very significant investment because AFAIK there's no really easy and nice integration with bors right now (typically things involve a lot of wonky permissions, setup, servers you run, etc). The primary benefit of this, I think, is that we could defer "heavy" work to serial one-at-a-time testing and only do light "likely to fail" testing on PRs. For example, PRs might build docs quickly, run `cargo deny`, and run linux tests (but that's it). Merges to the `main` branch would run nothing, and only merges through `auto` would do the full build and produce artifacts.

I think this would also be a serious undertaking because it would require us to redesign the release process. The model rust-lang/rust uses doesn't super-cleanly apply here, so we might need to consider some alternatives for how to do release artifacts and such with a model like this.
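As a rough illustration of that split (branch names and job contents are assumptions, not an actual bors setup), light checks could run on pull requests while the full build only triggers on pushes to a staging branch:

```yaml
# Hypothetical sketch of a bors-style split: quick checks on PRs, full build
# only on the staging branch that a merge bot pushes to.
on:
  pull_request: {}
  push:
    branches: [auto, try]

jobs:
  quick-checks:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo doc --workspace --no-deps
      - run: cargo install cargo-deny --locked && cargo deny check
      - run: cargo test --workspace          # linux-only smoke tests

  full-build:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release           # full artifact build goes here
```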
bjorn3 commented on issue #2933:
bors-ng is a GitHub app. It is used by rust-analyzer, for example.
alexcrichton commented on issue #2933:
CI has been serving us pretty well since the last round of significant changes, so I'm going to close this. If there are remaining issues to tackle, they're probably best done through follow-ups.
alexcrichton closed issue #2933.