in the spirit of starting a thread on this, I've noticed notification emails being received very late today, I'm only just now getting notifications in email for stuff that happened hours ago
Continuing in that spirit, GH Actions and GH as a whole have been slow/buggy this week (and a little of last week). My emails have also been coming through slowly, but I haven't seen Actions failures as a symptom just yet
pages are loading slowly and getting intermittent errors for me rn
https://www.githubstatus.com/ says "notifications are delayed" but it seems like it's their whole system that is being sluggish
aaand now I'm getting unicorns
git pushes are failing now too
man things fell over fast
Screenshot 2026-02-09 at 10.46.38.jpg
it's like it's christmas!
so colorful!
when I do manage to load a PR or something, it seems like actions are not getting scheduled for the PR at all
going through a VPN with a European exit node might help? https://eu.githubstatus.com/
there was a snow hour over here, too
my US team seems like it's working again?
Till Schneidereit said:
going through a VPN with a European exit node might help? https://eu.githubstatus.com/
routing through germany I'm getting extremely slow git pushes right now as well as unicorns -- my guess is this is more of a backend thing than a frontend
everything got green for a bit and it's all back to very red
We're not really keeping track per se, but at some point we're going to cross the threshold of "it would be cheaper to hire someone to maintain self-hosted CI infrastructure"
Three Mondays in a row with major outages; perhaps all BA member companies should adopt 4-day workweeks Tue-Fri ¯\_(ツ)_/¯
Till Schneidereit said:
going through a VPN with a European exit node might help? https://eu.githubstatus.com/
Wouldn't that only help if the bytecodealliance enterprise account itself is a European account?
I'm fairly sure there was an underlying resource failure of some sort; it absolutely went out here in the EU as well, but was back up in about 15 minutes or so......
FWIW, all of gh is on a stability freeze -- no new rollouts or config changes of any sort -- to fully understand and rectify what happened particularly this week.
Screenshot 2026-02-10 at 09.34.58.jpg
so it begins anew...
good god
are you running again, or is it STILL there?
I have done much GitHub myself this morning and the page is all green now so hopefully fine...
Screenshot 2026-02-11 at 10.17.07.jpg
Another day, more errors. I'm seeing a lot of delayed notifications this morning as well as a lot of spurious failures in this CI run
all I can do here is listen to the pain and pass it along to Ben
Oh that's understandable yeah, this is primarily a heads-up channel for us so we can share what we're seeing and be aware of outages/problems on our end
totes git it; I'm just letting you know that I'm backchanneling but also that I can't do more than that
that's also much appreciated too!
Maybe we can get the CEO of GraphQL on the phone
(apologies, couldn't resist)
hey, any port in a storm, right?
as it happens, the ex CEO of GH is starting his own new GH, so maybe we can all move there while they don't charge anything? :-)
meanwhile, the poor pm who has to deal with all this from customers:
image.png
due diligence: he IS kidding, painfully
We've talked about retries and such before, but here's an example of an exponential backoff and it just fails every time...
Could be hitting the rate limit for unauthenticated requests...
Could try using the gh CLI which can download via authenticated API calls, e.g. for the example you linked this seems to work: gh release download --repo bytecodealliance/wasm-tools wasm-tools-1.0.27 -p wasm-tools-1.0.27-x86_64-linux.tar.gz
I believe gh is preinstalled for standard actions runners but it might require a bit more config to make it authenticate as the action: https://docs.github.com/en/actions/tutorials/authenticate-with-github_token#example-1-passing-the-github_token-as-an-input
Alternatively: https://github.com/marketplace/actions/release-downloader
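A minimal sketch of the authenticated-download idea above, wrapping the same `gh release download` command in Python. The helper names are hypothetical. It assumes `gh` is on the PATH and that a token is supplied via the `GH_TOKEN` environment variable (in a workflow, typically the job's `GITHUB_TOKEN`); none of this is wired into our CI.

```python
import os
import subprocess
from typing import List, Optional


def gh_release_download_cmd(repo: str, tag: str, pattern: str, dest: str = ".") -> List[str]:
    """Build the `gh release download` invocation for one release asset."""
    return [
        "gh", "release", "download", tag,
        "--repo", repo,        # owner/name of the repository
        "--pattern", pattern,  # glob matching the asset file name
        "--dir", dest,         # directory to write the asset into
    ]


def download_release_asset(repo: str, tag: str, pattern: str,
                           dest: str = ".", token: Optional[str] = None) -> None:
    """Run the download; gh authenticates with GH_TOKEN when it is set."""
    env = dict(os.environ)
    if token:
        env["GH_TOKEN"] = token
    # check=True raises CalledProcessError if gh exits non-zero.
    subprocess.run(gh_release_download_cmd(repo, tag, pattern, dest), env=env, check=True)


if __name__ == "__main__":
    download_release_asset(
        "bytecodealliance/wasm-tools",
        "wasm-tools-1.0.27",
        "wasm-tools-1.0.27-x86_64-linux.tar.gz",
        token=os.environ.get("GH_TOKEN"),
    )
```

Authenticated calls only help if the failures really are rate limiting; they would not fix a stalled CDN download.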
It looks like each attempt there downloads ~55kB then stalls -- I'd expect a rate limit to immediately return a 429 or 500 or whatever. Looks like maybe a CDN/cache problem as each download stalls at the same chunk? In any case, points more to "flaky platform" than "problem that we can solve easily" IMHO
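For reference, a rough sketch of the retry-with-exponential-backoff pattern being discussed. This is not the code our CI uses: `retry_with_backoff`, its parameters, and the short-read check are all assumptions for illustration. The `fetch` callable stands in for whatever performs the download.

```python
import random
import time
from typing import Callable, Optional


def retry_with_backoff(
    fetch: Callable[[], bytes],
    expected_size: Optional[int] = None,
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it succeeds, doubling the delay cap after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            data = fetch()
            # Treat a truncated body (e.g. a stalled download) as a failure.
            if expected_size is not None and len(data) != expected_size:
                raise IOError(f"short read: got {len(data)} of {expected_size} bytes")
            return data
        except Exception as err:
            last_error = err
            if attempt == attempts - 1:
                break
            # Cap grows 1s, 2s, 4s, ...; "full jitter" picks a random delay below it.
            delay_cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay_cap))
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

Backoff only papers over transient failures. If every attempt stalls at the same chunk, as described above, each retry fails the same way and the loop just takes longer to give up.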
we could also try to cache tool downloads, so we presumably at least are closer to the storage the bits come from, and they all come from the same storage?
https://www.githubstatus.com/ is green but I'm getting intermittent unicorns rn
If you see
---- cli_tests::test_programs::p3_cli_serve_hello_world_many_no_concurrent_reuse stdout ----
failed to wait for child or read stdio: child failed Output { status: ExitStatus(ExitStatus(1)), stdout: "", stderr: "\nthread 'tokio-rt-worker' (8740) panicked at C:\\Users\\runneradmin\\.cargo\\registry\\src\\index.crates.io-1949cf8c6b5b557f\\tokio-1.51.1\\src\\sync\\mpsc\\list.rs:278:9:\nattempt to subtract with overflow\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n" }
Error: failed to read body
Caused by:
0: error reading a body from connection
1: unexpected EOF during chunk size line
in CI logs it's a spurious failure. This is https://github.com/tokio-rs/tokio/issues/8061 and while this has been a bug in Tokio for a long time it seems the tokio update in https://github.com/bytecodealliance/wasmtime/pull/13104 caused scheduling changes such that it happens more frequently now.
CI is broken until https://github.com/bytecodealliance/wasmtime/pull/13150 lands
sorry about that; did I miss something on the add-a-new-crate checklist? annoying that this doesn't surface until a release
last time I dug into this it's actually impossible to prevent this from happening, we're forced to, when adding a new crate, accept that CI will be broken on the next publication
I forget exactly why though, and things have changed where we publish things ahead-of-time now, and we add crates rarely enough I never bothered to re-check
so, no, no mistake on your part and our docs don't mention this, it's just always a fun surprise on the next publish heh
https://www.githubstatus.com/incidents/myrbk7jvvs6p it's another day ending in y
Ok this is a first I think -- github seems to have corrupted a merge to the main branch
https://github.com/bytecodealliance/wasmtime/pull/13180 just landed on the tip of tree, and the diff there looks as-expected
However the squashed commit -- https://github.com/bytecodealliance/wasmtime/commit/0c3a69f18df3e6939048b68e9d0dcb5a4d4518f3 -- seems to additionally include a revert of the parent commit -- https://github.com/bytecodealliance/wasmtime/commit/54929c175c1249b8d1978a76c54f92c0317b0181
so github has helpfully reverted a commit for us
I've... never seen data corruption before
how many other PRs have landed and been silently reverted.... I have no idea
that's... extremely odd? race condition wrt base branch maybe? (clearly a GitHub bug)
according to https://www.githubstatus.com/incidents/zsg1lk7w13cf
We have identified a regression in merge queue behavior present when squash merging or rebasing. We have identified the root-cause and are in the process of reverting the change.
the perils of opaque SaaS providers
(I say as an employee of a SaaS provider, speaking with other employees of other SaaS providers)
well, I'll just reland Nick's patch and pray that's the only victim
it's ... kind of insanely lucky that I caught this
I just happened to want to do a small follow-up and couldn't find the code when I was trying to do that
otherwise we never would have noticed this
at the risk of re-igniting the "should we be on GitHub" question, silently losing a commit is kind of the worst sin that a git host could commit
I don't know what to do about that but just want to say it out loud
status page now says:
Update - We have resolved a regression present when using merge queue with either squash merges or rebases. If you use merge queue in this configuration, some pull requests may have been merged incorrectly between 2026-04-23 16:05-20:43 UTC.
I can only hope they realize how utterly serious this is
uptime is almost nothing compared to data loss
whelp it happened again
Let's not land anything else today...
Maybe others did as well, but FYI I got an email from GitHub notifying us of two PRs being dropped instead of merged, so they at least seem to realize that yes, this is quite terrible
I also got an email yeah, I'll mitigate this morning
Yes, I received that email notification from GitHub as well, identifying the PRs impacted and providing follow-up details. Thanks, everyone.
update: yes, they realize it was quite terrible. :-/
AGAIN????
fyi that (i) it seems we missed patching v24 (LTS) from our January CVE (https://github.com/bytecodealliance/wasmtime/issues/13211 just reported); and (ii) I will not do a patch-release for this today, because GitHub Status is red. Another point for "what the fuck, we need a different repository host"
according to https://github.blog/news-insights/company-news/an-update-on-github-availability/ silently reverting commits is not data loss since the previous commits were still in the history
also as news to all it's a day ending in 'y' so there's another github outage today
jesus, what a mess
wish I could help, but I can't
The blog post also describes the merge-queue bug as affecting "merge groups" with more than one PR; that's not us, but we were still affected. Even aside from the PR spin about "no data loss" (sure, if you want to call it data corruption instead, we can), that's concerning from an accurate-postmortem point of view
in all this, I do want to give credit for the fact that we were notified via email within a few hours, and the email included the affected PRs. In combination with the commits still being addressable, that at least meant that even in the extremely unlikely scenario where no other copies would've existed, we could've restored them, and we knew we'd have to pretty quickly
speaking or trying to speak objectively, it sucks and if it were normally the case I'd never put my work there; it hasn't been that bad before, and I don't have insight into what is the issue now (I could ask, but I already know the people I know are underwater as you might imagine trying to stabilize things), but hey -- make it work or lose the user is pretty much the name of the game.
there are other things going on of course, including the yearly rate of growth that I'm not at liberty to discuss but that is absolutely insane and which makes my megacorp gasp. But again, none of that matters if they corrupt my stuff, let alone block working each week a bunch of times.
https://mitchellh.com/writing/ghostty-leaving-github pretty much sums up most of the people I know:
Lately, I've been very publicly critical of GitHub. I've been mean about it. I've been angry about it. I've hurt people's feelings. I've been lashing out. Because GitHub is failing me, every single day, and it is personal. It is irrationally personal. I love GitHub more than a person should love a thing, and I'm mad at it. I'm sorry about the hurt feelings to the people working on it.
I've felt this way for a long time, but for the past month I've kept a journal where I put an "X" next to every date where a GitHub outage has negatively impacted my ability to work. Almost every day has an X. On the day I am writing this post, I've been unable to do any PR review for ~2 hours because there is a GitHub Actions outage. This is no longer a place for serious work if it just blocks you out for hours per day, every day.
It's not a fun place for me to be anymore. I want to be there but it doesn't want me to be there. I want to get work done and it doesn't want me to get work done. I want to ship software and it doesn't want me to ship software.
lotsa fun
From the MS people in runtime, it seems the amount of work github is having to do has increased significantly from AI
https://github.blog/news-insights/company-news/an-update-on-github-availability/
OH YES
another data point: over the past two years, roughly 90% of the data in the entire world was created by AI
and we can guess how much of that was worthless
so... if you throw in the GH "universal user" bug they had to fix the past two months and the AI scale up and the human scale up it's a hard job. That said, uptime and consistency are their raison d'etre, as they say......
so.... do they have a raison?
most of my page loads right now don't have css and stop loading halfway through the page, no current incident and iunno if it's my internet, but wanted to note
yeah I can't load github at all so I'm limited to work currently where it doesn't involve the web ui...
or I just needed to restart my browser?! sometimes you never know...
currently loading fine for me fwiw; different ISP and geo so who knows what network weather looks like for you of course
well, there is now https://www.githubstatus.com/incidents/72q3n8yxthcy
(comments aren't or are slow to go through)
time for the Daily Incident: https://www.githubstatus.com/incidents/1j40g94rn22j (currently blocking the patch-release merge)
my two thoughts are "what the fuck" and "we should have the platform-move discussion again"
from what im seeing, many projects are "having the platform-move discussion", most prominently on orange site the dude who started vagrant or something has committed to move but where to is unknown
my best guess is we need to sit tight for 6 months or more, still, until places worth moving to get more capable of taking on projects like ours
Given the scale of our CI and donation from msft to host BA projects I'm not even sure where it would be possible to move to.
yeah the code hosting is basically inconsequential, its the CI thats a pretty massive engineering undertaking
like, I'm just as unhappy about this as anyone else, but I don't think we have any other options available to us which don't start with "where do we move our 200+ concurrent runners to for free"
and if theyre offering any nontrivial number of concurrent runners to any random account that signs up for free, theyre just as unsustainable as github, so it needs to be somewhere that would sponsor a project like ours but make actual money off the lots of other people migrating off github... doesnt seem very likely
or the BA could pay for it in theory, but it really needs to be a service, not something that we have to sys admin ourselves and build out CI infra from scratch on top of AWS or whatever
agreed with alex that we really have no choice but to grit our teeth and hope that either github gets their shit together or somehow an alternative appears with an on-ramp both technically (running our existing actions stuff with a minimum of finagling) and financially (could we afford 20k/yr? very likely. 100k/yr? not gonna happen)
I have not even attempted to pencil out what our runners would cost at e.g. ec2 list price
(could someone do that exercise? just for fun?)
Pat Hickey said:
(could someone do that exercise? just for fun?)
I don't think we can just get off-the-shelf quotes with our scale, or at least not with circle CI. requires reaching out to their sales team. FWIW, a single CI run for us seems like it would eat all the monthly credits of their $15/month tier. gotta talk to their sales team for info on larger plans than that.
circle ci just got bought as well :grimacing:
oh yeah i just meant a back of the envelope in terms of ec2 prices and figure best case if youre paying a CI provider you're paying 2x ec2 list price
maybe more like 10x, idk
but ive never even priced it out in terms of ec2
I wholeheartedly support anything you all want to do in order to get work done. No question there. The only thing I have done is make sure through my connections that they hear your conversations and they definitely know you aren't the only ones. Just the ones I know best.
whatever you want to do, we're in.
(missed the convo I kicked off, oops) yeah, I generally agree that we don't have a ready-made realistic option; but "having the conversation again" is exactly evaluating where we are on the tradeoff axis right now.
back of the envelope: per https://github.com/bytecodealliance/wasmtime/actions/metrics/usage?dateRangeType=DATE_RANGE_TYPE_PREVIOUS_MONTH, in April we spent 777,773 CPU-minutes of CI time. That's 777773 * 60 / (30 * 86400) = 18.004 CPU-seconds per second, or 18 cores steady-state
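The conversion above, written out as a small sketch. The function name is made up, and the 30-day month is the same assumption the message uses.

```python
def steady_state_cores(cpu_minutes: float, days: int = 30) -> float:
    """Average number of cores kept busy if the CPU time were spread evenly."""
    cpu_seconds = cpu_minutes * 60
    wall_seconds = days * 86_400  # seconds in the billing period
    return cpu_seconds / wall_seconds


print(round(steady_state_cores(777_773), 3))  # 18.004
```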
of course the other high-order bits are (i) macOS and Windows too; (ii) load is spiky of course, we want high parallelism then have long periods of zero load. Our peak load for one full CI run is a few hundred jobs, probably something like 500 cores? (Above stats give total number of job runs but there's no way to know total number of CI triggers and distribution of job-count per trigger)
Let's say we provision 96 Linux/x86-64 CPU cores; that's 12 t3.2xlarge, each with 8 cores / 32GiB RAM; on a 1-year committed rate in US-East (Ohio) or US-West (Oregon) that's $0.2399/hr per machine, or $18300/year for the 12 machines. (https://aws.amazon.com/savingsplans/compute-pricing/)
Windows machines only go to t3.xlarge on that table (4 cores / 16 GiB) but taking 4 of those at $0.1732/hr is $4400/year.
macOS machines are $1.97/hour for bare-metal M4 Pro (14 cores, 48GiB); $12.5k/yr for that alone. M4s are pretty fast so that probably still wouldn't be the bottleneck.
So that's $35k/year for a little CI fleet with pretty good Linux capacity and acceptable Windows and macOS, as a floor. Plus sysadmin overhead (no idea what software exists, but I know that there are ready-made open source CI packages out there)
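A sketch for redoing this arithmetic with different machine counts or rates. It uses the hourly rates quoted above at face value and assumes every machine is billed 24x7 (8,760 hours/year); the dictionary layout and function names are made up for illustration. Under that always-on assumption the totals come out higher than the rounded figures in the thread, so those figures may rest on an hours or discount assumption that isn't stated.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760, assuming machines are reserved around the clock

# (machine count, hourly rate in USD) as quoted in the thread
FLEET = {
    "linux t3.2xlarge": (12, 0.2399),
    "windows t3.xlarge": (4, 0.1732),
    "macos M4 Pro": (1, 1.97),
}


def annual_cost(count: int, hourly_rate: float, hours: float = HOURS_PER_YEAR) -> float:
    """Yearly cost of `count` machines billed at `hourly_rate` for `hours` hours."""
    return count * hourly_rate * hours


def fleet_costs(fleet, hours: float = HOURS_PER_YEAR):
    """Per-group yearly cost plus a 'total' entry."""
    costs = {name: annual_cost(n, rate, hours) for name, (n, rate) in fleet.items()}
    costs["total"] = sum(costs.values())
    return costs


for name, cost in fleet_costs(FLEET).items():
    print(f"{name:>18}: ${cost:,.0f}/yr")
```

Passing a smaller `hours` value models machines that are only paid for part of the time.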
(and that single little mac mini is ouch; I wonder if the hardware can be rented cheaper elsewhere)
(to be clear I'm not volunteering to maintain that! but good to know what self-hosting cost would be, at a minimum)
Thanks Chris thats super helpful! so the minimum back of envelope a for-profit CI service that took our actions yamls and made it all go brr could conceivably charge us would be, say, 70k
thats definitely getting into the range of "the BA cant really afford that unless github is down to like 90% uptime"
yeah, maybe less with benefit of scale; down to maybe $20k compute costs if one gets closer to the 18-core true steady-state cost, with perfect binpacking across customers, so 40-50k As A Service
Pat Hickey said:
thats definitely getting into the range of "the BA cant really afford that unless github is down to like 90% uptime"
the, uh, "good news" is that depending on how you count, GitHub is currently at 84.88% uptime
im not going to really look into exactly how thats calculated there except a gut feel that its a little bit pessimistic. not that the actual situation is good, but
Yeah, it's "any active incident" I think, which includes both "CI isn't running at all" and e.g. "search might not be working". 97% for Actions is still not great ("one 9") but not yet truly catastrophic
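To put those percentages in terms of hours, a quick sketch (the function name is made up, and a 30-day month is assumed):

```python
def downtime_hours(uptime_percent: float, period_hours: float = 30 * 24) -> float:
    """Hours of downtime implied by an uptime percentage over the period."""
    return (100 - uptime_percent) / 100 * period_hours


print(round(downtime_hours(97.0), 1))   # 21.6 hours per 30 days
print(round(downtime_hours(84.88), 1))  # 108.9 hours per 30 days
```

The 84.88% figure counts any active incident, so most of those hours are degraded service rather than a full outage.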
if we want to have a chance to make any of this actually work in a sustainable way, I don't think we could do it without a full-time ops person. And that's on top of the investment needed to port everything we have to this new setup that we'd be setting up. Which for the record my team will definitely not be able to prioritize at all
which is to say, I think these numbers are highly optimistic when it comes to what we actually have to invest including paid time investment, and that I'm also skeptical we could really do better when it comes to reliability etc without investing several times as much as what we're getting from GitHub for free
So the more realistic option if we have to go there is another CI provider -- the aws self-hosted option above is both a datapoint and (as Pat said) a floor-with-some-profit-multiplier for what this would likely cost via SaaS. That at least cuts out the need for ops folks and (hopefully, if a provider is competitive) has better reliability, though the porting cost is still huge. (Business opportunity for someone to build a GHA-compatible CI host though!)
Agreed this is not the practical choice today given all the above. I just wish we had better options!
AWS is pretty expensive, for comparison renting 500 cores worth of 12-core VPS instances from Contabo (what I'm using for my own web server, though I have a 6-core VPS) comes out to $12k/yr
Yeah agree with Till we should not consider running our own CI, whether on a cloud or renting bare metal. we'd need a fully managed offering, and it would have to be very substantially compatible with github actions and all the related machinery required to make releases.
And thats a big barrier. I assume that somewhere at some other hyperscaler there is a team vibe coding at that goal as hard as they possibly can right now to make hay while the sun shines, but we dont want to be the alpha testers for that solution either. which is why my "wait and see what the world looks like in 6 months" estimate is probably itself unrealistic. we are stuck with what we have got for a while, best we can do is re-evaluate what the world looks like this winter
mail is coming in slow today -- https://www.githubstatus.com/incidents/z3jhyg3l0dvx
A message was moved here from #wasmtime > Should we rethink our disclosure policy? by Alex Crichton.
Last updated: May 26 2026 at 09:09 UTC