GitLab had a database problem recently. I’m sure you read about it. There has been commentary from many people, including Brent Ozar and Mike Walsh. There are many ways to look at this outage and data loss (the extent of which is not known), but I’d like to stop and focus on a couple of items that I think stand out: competence and care. I don’t know how we prevent problems, but I certainly think these items are worth pondering.
First, there is the question of competence. I have no idea what skills or experience the GitLab staff who responded to the event have. They certainly seem to understand something about replication or backup, but do they understand the mechanics of PostgreSQL (or their own scripting) deeply enough to determine where things were broken? I have no idea, and without more information I won’t question their competence. The thing to ask, whether for this incident or your own, is whether the people working the problem are trained well enough to deal with the issues. Perhaps most important, do they realize when they have reached the limit of their expertise? Do they know when to call in someone else or contact a support resource, and are they willing to do so?
I saw a note from Brent Ozar that the GitLab job description for a database specialist doesn’t mention backups. It does mention a solid understanding of the parts of the database, which should include backups. I’d hope that anyone hiring a database specialist would ask how the candidate deals with backups, especially in a distributed environment. It’s great to give database staff a chance to work on the application, tune code, and build interesting solutions to help the company, but their core responsibility and focus need to be keeping the database stable, which includes DR situations.
The second item that I worry about is the care someone takes when performing a task. In this case, any of us might have been tired at 9pm, especially if we’d spent the day working on a replication setup, which can be frustrating. Responding to a page, especially for a security incident, can be stressful. Solving an issue like that, and then having performance problems crop up, is disturbing. Anyone might question their actions, wondering if they had made a mistake and caused the issue. I know that when multiple problems appear in a short time, many of us would struggle to decide whether two issues are coincidental or correlated. I’m glad that after the mistakes, the individual responsible handed off control to others. As with any job, once you’ve made a serious mistake, you may not perform at the level you normally do, and it’s good to step back. Kudos, once again.
The ultimate mistake, and one that many of us have made, was running a command on the wrong server. Whether you use a GUI or a command line, it’s easy to mistake db1 for db2. I’ve tried color coding for connections, separate accounts for production, even getting in the habit of looking at the connection string before running a command, but in the heat of the moment, nothing really works. People will make mistakes, which is why it is dangerous to let any one person respond alone in a production crisis. As a manager, I’ve wanted employees to take care and to use a partner to double-check code before they actually execute anything.
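No script replaces a second pair of eyes, but a little friction before a destructive command can help. What follows is only a rough sketch of one such guard, not anything GitLab uses: the function name confirm_target and its parameters are my own invention for illustration, and the only real library call is Python's socket.gethostname(). It refuses to continue when the machine is not the one you meant, and otherwise makes you retype the hostname.

```python
import socket


def confirm_target(expected_host, actual_host=None, prompt=input):
    """Return True only if this is the intended host AND the operator retypes its name.

    confirm_target, expected_host, actual_host and prompt are hypothetical names
    used for illustration. actual_host and prompt exist so the check can be
    exercised without a real server or keyboard.
    """
    # Default to the name of the machine the script is running on.
    actual = actual_host if actual_host is not None else socket.gethostname()

    # First gate: the host we are on must be the host we meant to touch.
    if actual != expected_host:
        print(f"Refusing to continue: this is {actual!r}, not {expected_host!r}.")
        return False

    # Second gate: make the operator read and retype the hostname.
    typed = prompt(f"You are on {actual}. Retype the hostname to continue: ")
    return typed.strip() == actual


# Example of how a cleanup script might use it (db2 is the example name from the text):
#     if confirm_target("db2"):
#         ...run the destructive step...
```

A guard like this only adds a pause. It will not stop someone who passes the wrong expected host, which is why the partner check still matters.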
And above all, log your actions. I have to say I’m very impressed with GitLab’s handling of the incident and their live disclosure. This is what I like to see during a war room: lots of notes, open disclosure, and a timeline that allows us to re-examine the incident later and learn from the response. This is an area that too few companies want to spend resources on, but learning from good and bad choices helps distribute knowledge and prepare more people for the future. I’d like to see more disclosure of post-incident reviews from many companies, especially cloud vendors. I can understand not disclosing too much information while the crisis is underway, as I’d worry some security-related information might be released, but afterwards, I think customers deserve to know just how well their vendor deals with issues.