Stream: git-wasmtime

Topic: wasmtime / issue #2933 Optimize CI runtime by examining i...


view this post on Zulip Wasmtime GitHub notifications bot (May 24 2021 at 20:18):

cfallin opened issue #2933:

While waiting for CI jobs during the release process on Friday, a few of us started discussing CI runtimes; and again today, while watching a "Publish" job remain queued, it's very much on my mind.

CI time is important to reduce both because it's in the critical path of lots of things -- especially, but not only, when making an urgent release -- and because it costs resources. (E.g., if we decide to pay for more GitHub runners some day, it looks like we would pay 0.8 cents per minute of CI time.)

A few thoughts occur to me:

  1. The "Publish" job depends on the others and hence starts after they finish. Because it starts much later, it sometimes gets stuck behind another PR's jobs in the global run-queue, and sits for a long time waiting to start. This is a prioritization failure (runner tasks that allow an approved PR to merge should go before initial CI runs on a speculative PR, etc.), but we can also exert some control over this problem by avoiding the need for additional job-starts.

    Specifically, could we incorporate a "slice" of the publish task at the end of each build job? That way, instead of using a dedicated job to upload build artifacts once all individual parts are built, we just upload as we go. If we did this, we would have a single-depth critical path. (A rough sketch of such a per-job publish step follows at the end of this comment.)

    It's possible that we want to think about how test failures in one configuration would or wouldn't gate uploads from another; but it seems to me that if uploads are keyed by commit hash, or if we are just careful about concurrent runs when tagged releases are involved, then this could be avoided.

  2. The release-build jobs are ~always the long tail, and they run tests as well after the build is complete. We thus run tests in both debug mode ("Test ({stable,beta,nightly})" jobs) and release mode. This is nice for coverage -- there are certainly times when issues occur only as a result of certain optimizations -- but most of the time this is not the case, and we would save significant resources and wait-time by running the debug-mode tests and the release build (build only, without release-mode tests) in parallel.

  3. We have some other tests that are (IMHO) nice-to-have, but not critical for our current release configuration. E.g., "Rebuild Peephole Optimizers" takes ~15-20 minutes of CI time per run. This is nice to have as a part of Peepmatic, for sure; but if development is not currently ongoing on that project, we could potentially "pause" the jobs until it is, and save CI time and resources.

Thoughts? I hope the general topic of reducing CI time is not controversial, though I recognize some of the above ideas could be; hoping to spawn discussion about our explicit needs and resources, nonetheless!
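
To make the per-job "publish slice" from item 1 concrete, here is a minimal sketch of a script that each build job could run right after its build, uploading its own artifact keyed by commit hash instead of waiting on a dedicated "Publish" job. The paths, artifact names, environment variables, and the use of the GitHub CLI are assumptions for illustration, not the actual wasmtime CI layout.

```python
#!/usr/bin/env python3
"""Hypothetical per-job "publish slice": each build job uploads the artifact it
just produced, keyed by commit SHA, instead of deferring to a separate Publish
job. All names and paths here are illustrative."""

import os
import subprocess
import sys


def main() -> int:
    sha = os.environ.get("GITHUB_SHA", "local")[:12]
    target = os.environ.get("BUILD_TARGET", "x86_64-unknown-linux-gnu")
    tarball = f"wasmtime-{sha}-{target}.tar.gz"

    # Package whatever this job just built (the layout below is an assumption).
    subprocess.run(
        ["tar", "czf", tarball, "-C", f"target/{target}/release", "wasmtime"],
        check=True,
    )

    ref = os.environ.get("GITHUB_REF", "")
    if ref.startswith("refs/tags/"):
        # Tagged release: attach the artifact to the release for this tag.
        # Assumes the GitHub CLI is installed and authenticated on the runner.
        tag = ref[len("refs/tags/"):]
        subprocess.run(["gh", "release", "upload", tag, tarball, "--clobber"], check=True)
    else:
        # Plain CI run: a short-lived artifact keyed by the same SHA would do.
        print(f"not a tag build; would upload {tarball} as a CI artifact")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because every upload is keyed by the commit hash, concurrent runs on different commits cannot overwrite each other's artifacts, which is one way to sidestep the gating question in item 1.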

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2021 at 20:21):

cfallin commented on issue #2933:

Along with the above, I should note that if we have tests that we want to put into a "second tier" bucket, we could potentially find a way to run them, e.g., nightly. Release-build tests could fall in that category, for example. This would ensure we catch the long tail of bugs "eventually" without waiting for the long tail of runtimes on every CI re-spin.
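
A minimal sketch of how such a tier split could be wired up, assuming a nightly scheduled workflow exports a hypothetical CI_TIER2 variable that ordinary runs leave unset:

```python
# Hypothetical test-tier gate: every CI run executes the fast debug-mode suite,
# while a nightly scheduled job sets CI_TIER2=1 (an illustrative name, not an
# existing knob) to additionally run the slow release-mode tests.
import os
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


run(["cargo", "test", "--workspace"])  # first tier: every CI run
if os.environ.get("CI_TIER2") == "1":
    run(["cargo", "test", "--workspace", "--release"])  # second tier: nightly only
```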

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2021 at 21:06):

alexcrichton commented on issue #2933:

I'm all for making CI faster! In addition to the "do less work" angle you've mentioned above, the other major route to speed things up is to optimize what's already there, some possibilities being:

The downside of "just make things faster", though, is that it's a constant uphill battle: things always regress accidentally because you can't get precise timing from CI, and it's very easy to add more things but generally quite difficult to take them out. While this is worth mentioning, I think what you've outlined above is perhaps a better route forward.

For "do less work", as you've mentioned, this is always tricky. For example I'm less confident about removing the release-mode tests. We've had some tricky/subtle bugs show up rarely in the past, and while I definitely agree that 95% of the time these builders will never fail, having us catch the failures instead of users is generally much better. I personally don't know how to weigh "this is an expensive CI job" against "here's the hypothetical failure rate and the benefits it brings us". Ideally we could put numbers on that and have a literal threshold, but it seems somewhat far-fetched.

To answer your ideas:


One final option is to look at integrating bors in one way or another. This is unfortunately a very significant investment because AFAIK there's no really easy and nice integration with bors right now (typically things involve a lot of wonky permissions, setup, servers you run, etc). The primary benefit of this, I think, is that we could defer "heavy" work to serial one-at-a-time testing and only do light "likely to fail" testing on PRs. For example PRs might build docs quickly, run cargo deny, and run linux tests (but that's it). Merges to the main branch would run nothing, and only merges through the bors auto branch would do the full build and produce artifacts.

I think this would also be a serious undertaking because it would require us to redesign the release process. The model rust-lang/rust uses doesn't apply super cleanly here, so we might need to consider some alternatives for how to do release artifacts and such with a model like this.
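
As a rough illustration of the "light testing on PRs" split described above, the PR-side gate under such a model might be as small as the following; the exact command set is an assumption, not a proposal for wasmtime's actual workflows.

```python
# Hypothetical "light" PR gate under a bors-style model: PRs run only quick
# checks, while the full matrix and artifact builds run when bors tests the
# candidate merge on its staging branch.
import subprocess
import sys

QUICK_CHECKS = [
    ["cargo", "deny", "check"],        # license/advisory audit
    ["cargo", "doc", "--no-deps"],     # make sure docs still build
    ["cargo", "test", "--workspace"],  # quick (linux-only) test run
]

for cmd in QUICK_CHECKS:
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)
```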

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2021 at 21:23):

bjorn3 commented on issue #2933:

bors-ng is a GitHub app. It is used by, for example, rust-analyzer.

view this post on Zulip Wasmtime GitHub notifications bot (May 05 2022 at 16:33):

alexcrichton commented on issue #2933:

CI has been serving us pretty well since the last round of significant changes, so I'm going to close this. If there are remaining issues to tackle, they're probably best done through follow-ups.

view this post on Zulip Wasmtime GitHub notifications bot (May 05 2022 at 16:33):

alexcrichton closed issue #2933.

