Amazon had a load balancer failure in 2012. The analysis of the event shows that data was missing from the devices, which caused the issues. Restoring the data for these devices is complex, far more complex, and with less mature tools, than on most database platforms. The result was a period of nearly 10 hours when some customers were experiencing issues.
In 2016, Gliffy had three days of downtime from a database error. In this case, an admin was updating a replicated system, but failed to sever a link with the primary node. Forgetting this step caused a data removal on the node, which replicated to the secondary nodes. They discovered that the restore and replay of logs would take many days due to the size. They hadn’t practiced a disaster recovery (DR) situation in some time and were not prepared for the delays.
DigitalOcean received alerts earlier in 2017 that some services were not functioning. They traced this down to the primary database being deleted. The issue was that a process used the wrong credentials for automated testing, and I’m guessing that part of the testing was removing and rebuilding a database. It took five hours across the middle of the night to restore the main database, and a couple more hours to get the replicas caught up.
In the first two cases, there were issues with the deployment of changes to systems, as well as inadequate backup and restore processes. In both of these cases, I would argue that a good DevOps process would have automated the way the code was deployed, ensuring that steps weren’t forgotten and that predeployment backups captured the state of the configuration. DevOps includes the “Ops” changes and should ensure that all state information is captured and stored in a version control system (VCS). If this had been done, it’s possible that these companies wouldn’t have had these issues.
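As a rough illustration of what "ensuring that steps weren’t forgotten" could look like, here is a minimal Python sketch. It is not how any of these companies deploy. The step names, `REQUIRED_STEPS`, and `run_deployment` are all hypothetical, and the actions are placeholders for whatever your environment really does.

```python
from typing import Callable, Dict, List

# Hypothetical step names; a real pipeline would define its own.
REQUIRED_STEPS = ["backup_configuration", "sever_replication_link"]


def run_deployment(plan: List[str],
                   actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the plan in order, refusing to start if a required step is absent."""
    missing = [step for step in REQUIRED_STEPS if step not in plan]
    if missing:
        # Fail before anything is changed, rather than halfway through.
        raise ValueError(f"plan is missing required steps: {missing}")

    completed: List[str] = []
    for step in plan:
        actions[step]()  # an exception here aborts the remaining steps
        completed.append(step)
    return completed
```

The point of the sketch is only that the checklist lives in code and in the VCS, so a forgotten step stops the deployment instead of depending on someone’s memory.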
In the last case, whoever sets up a system is certainly responsible for using the correct credentials. It’s easy to say that a developer or tester shouldn’t know the production credentials, but it’s entirely possible that the person who configured the process would have them. I don’t know what to do here, as the first test of this might cause the issue. Maybe a second set of eyes is important for security changes in automated systems? That certainly could be part of your DevOps process. What I’d like here is two-factor authentication for all security setup, including for SQL Server.
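The "second set of eyes" idea can be sketched in a few lines of Python. This is only one possible shape for such a rule, under the assumption that security changes are tracked as objects your pipeline can inspect. The `CredentialChange` class and its fields are hypothetical and not part of any real tool.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CredentialChange:
    """A hypothetical record of a security change awaiting review."""
    author: str
    description: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The second set of eyes must belong to someone else.
        if reviewer == self.author:
            raise ValueError("author cannot approve their own change")
        self.approvals.add(reviewer)

    def can_apply(self) -> bool:
        # At least one independent reviewer has signed off.
        return len(self.approvals) >= 1
```

A pipeline could call `can_apply()` before it lets an automated process pick up new credentials. Whether that would have caught the mistake in this incident is an open question.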
DevOps isn’t a prescriptive set of things that someone does. Whenever I talk with people about DevOps and they give reasons why a particular step I’ve demonstrated won’t work for them, I tell them to stop doing that step. After all, the way you implement DevOps doesn’t have to match what I did. We each need to do what works for our environment, and ensure we have some consistency and repeatability in our process. Hopefully that prevents downtime from simple mistakes.