I didn’t notice any issues with GitHub, but others did. The majority of my interaction is through the git protocol, so things tend to work fast, and little of it touches a database. I rarely use Issues and the other parts of GitHub that were affected when GitHub had a MySQL cluster fail over. There’s a good write-up of the post-incident analysis that’s worth reading from a database perspective.
I’m not a big MySQL guy, running only an instance to power T-SQL Tuesday. The structure GitHub describes, a single write primary with many read replicas, makes sense. It’s similar to what I’ve done in SQL Server, and certainly the idea of quorum management, handled at GitHub by the Orchestrator software, is something that needs to be configured properly. Allan Hirt has talked about the complexities of quorum in large installations, and it’s not a simple thing to get right.
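To make the primary/replica structure concrete, here’s a minimal routing sketch. The hostnames and the list of write verbs are my own illustrative assumptions, not GitHub’s actual topology or Orchestrator’s behavior; the point is only that writes have exactly one destination while reads can spread across replicas.

```python
import itertools

# Hypothetical endpoints -- illustrative names only.
PRIMARY = "mysql-primary.east.example.com"
REPLICAS = ["mysql-replica-1.east.example.com",
            "mysql-replica-2.west.example.com"]

_replica_cycle = itertools.cycle(REPLICAS)

def route(statement: str) -> str:
    """Send writes to the single primary; round-robin reads across replicas."""
    verb = statement.lstrip().split()[0].upper()
    if verb in ("INSERT", "UPDATE", "DELETE", "REPLACE"):
        return PRIMARY
    return next(_replica_cycle)

print(route("INSERT INTO posts VALUES (1)"))  # always the primary
print(route("SELECT * FROM posts"))           # one of the replicas
```

The fragile part, of course, is everything this sketch leaves out: deciding which node *is* the primary, and moving that role safely when it fails. That’s the job Orchestrator does, and it’s where the quorum complexity lives.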
In reading about this, there are a couple things that strike me. First, the analysis describes a degradation of service because East Coast applications had to send their writes to West Coast database servers. There were some problems with the way the database servers were behaving, but it seems to me that some sort of application failover should be possible. If you can’t have an application and a database fail separately without customer impact, then there should be some way to fail the applications over as well. Perhaps not, but if you’re responsible for designing HA for the database, make sure you talk to the application people and test for these issues.
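The idea above can be sketched in a few lines: if the database primary moves to the West Coast, the application tier should follow it rather than send every write cross-country. The regions, endpoints, and lookup function here are all hypothetical; real failover would also involve health checks, DNS or load-balancer changes, and session handling.

```python
# Illustrative topology: each region has its own app tier and database endpoint.
TOPOLOGY = {
    "east": {"db": "mysql.east.example.com", "app": "app.east.example.com"},
    "west": {"db": "mysql.west.example.com", "app": "app.west.example.com"},
}

def active_app(primary_region: str) -> str:
    """Serve traffic from the app tier in the same region as the DB primary."""
    return TOPOLOGY[primary_region]["app"]

# Normal operation: database primary in the East, so apps run in the East.
print(active_app("east"))

# After a database failover to the West, fail the application tier over too,
# so writes stay local instead of crossing the country on every request.
print(active_app("west"))
```

Trivial as it is, this is the decision the incident exposed: the database moved and the applications didn’t, so every write paid the cross-country latency.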
The second thing for me is that somehow there was a period of time when writes were accepted on the East Coast system but never sent to the West Coast. My ignorance of how this HA machinery works in MySQL prevents me from making a big deal of it, but this isn’t something that should happen. If the quorum promotes another node, it must first stop writes to the old primary. This could happen in SQL Server as well, and for me, this is the kind of data loss I’d need to account for in my RPO.
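Here’s a toy model of that failure mode, under my own simplified assumptions (two nodes, synchronous copying, no real replication protocol). Writes accepted by a node after it has lost the primary role never reach the new primary, and that gap is exactly the data loss you’d count against your RPO. Fencing the old primary before promoting the new one closes the window.

```python
# Toy split-brain demonstration -- names and mechanics are illustrative only.

class Node:
    def __init__(self, name):
        self.name = name
        self.rows = []
        self.writable = True

    def write(self, row):
        if not self.writable:
            raise RuntimeError(f"{self.name} is fenced (read-only)")
        self.rows.append(row)

east = Node("east-primary")
west = Node("west-replica")

east.write("row-1")              # replicated before the failure
west.rows = list(east.rows)

# Failover WITHOUT fencing: east keeps accepting writes after west is promoted.
east.write("row-2")              # this write exists only on east
lost = [r for r in east.rows if r not in west.rows]
print("lost writes:", lost)      # these never reach the new primary

# Correct sequence: fence the old primary first, then promote the new one.
east.writable = False
try:
    east.write("row-3")
except RuntimeError as e:
    print(e)                     # write rejected, so nothing can be lost
west.write("row-3")              # all new writes land on the new primary
```

The real systems are far more subtle (semi-sync replication, GTIDs, network partitions), but the invariant is the same one the paragraph above states: promotion without fencing means some writes land where no one will ever read them back.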