GitHub Promises ‘Additional Guardrails’ After Wednesday’s Update Triggers Short Outage

Posted by EditorDavid on Saturday August 17, 2024 @10:34AM from the issue-tracking dept.

On Wednesday GitHub “broke itself,” reports the Register, writing that “the Microsoft-owned code-hosting outfit says it made a change involving its database infrastructure, which sparked a global outage of its various services.”

Or, as the Verge puts it, GitHub experienced “some major issues” which apparently lasted for 36 minutes: “When we first published this story, navigating to the main GitHub website showed an error message that said ‘no server is currently available to service your request,’ but the website was working again soon after. (The error message also featured an image of an angry unicorn.) GitHub’s report of the incident also listed problems with things like pull requests, GitHub Pages, Copilot, and the GitHub API.”
GitHub attributed the downtime to “an erroneous configuration change rolled out to all GitHub.com databases that impacted the ability of the database to respond to health check pings from the routing service. As a result, the routing service could not detect healthy databases to route application traffic to. This led to widespread impact on GitHub.com starting at 23:02 UTC.” (Downdetector showed “more than 10,000 user reports of problems,” according to the Verge, “and that the problems were reported quite suddenly.”)
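
To illustrate the failure mode GitHub describes, here is a minimal Python sketch of a router that only sends traffic to databases that answer a health check. It is not GitHub's code: the names Database, answers_health_checks, pick_healthy and route are all hypothetical, and the error string is simply reused from the message quoted above for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Database:
    """Hypothetical stand-in for one database host."""
    name: str
    # Assumption: a single flag models whether the current configuration
    # lets the database answer health check pings.
    answers_health_checks: bool = True

    def ping(self) -> bool:
        return self.answers_health_checks


def pick_healthy(databases: list[Database]) -> Optional[Database]:
    """Return the first database that answers its health check, if any."""
    for db in databases:
        if db.ping():
            return db
    return None


def route(databases: list[Database]) -> str:
    """Route a request, or fail when no database looks healthy."""
    target = pick_healthy(databases)
    if target is None:
        # The databases may be running fine; the router just cannot tell.
        return "no server is currently available to service your request"
    return f"routed to {target.name}"
```

In this sketch, flipping answers_health_checks to False on every Database at once makes route fail for all requests, even though no database has actually stopped. That is why a change rolled out to every database simultaneously is riskier than one applied to a few hosts at a time.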

GitHub’s incident report adds that “Given the severity of this incident, follow-up items are the highest priority work for teams at this time. To prevent recurrence we are implementing additional guardrails in our database change management process. We are also prioritizing several repair items such as faster rollback functionality and more resilience to dependency failures.”