Disaster recovery is one of the core tasks that many DBAs think about on a regular basis. Ensuring that we can get our data back online, available, accessible, and intact is important. More than a few DBAs who haven’t been able to recover systems have found themselves seeking new employment.
That’s not to say that most DBAs perform perfectly under pressure. Plenty make mistakes, and there may be times when they can’t recover all data. There does seem to be a correlation between how often DBAs practice recovery skills and how well they perform in an actual emergency. I know that at a few companies, we scheduled regular disaster tests, though often with simulated recovery of systems that weren’t expected to actually take over a workload. Arguably not a good test, but better than nothing.
Google takes things a step further. They have annual, company-wide, multi-day DiRT (Disaster Recovery Testing) events. These span many departments and can cause substantial disruption to their infrastructure. This is a way for the various individuals responsible for infrastructure to actually evaluate whether they are prepared for potential issues.
If you read the article, you find that Google started small with these and progressed to larger, more inclusive tests, like taking down a data center. They also whitelist some servers that they know cannot pass a test, so there is no reason to actually take them down. After all, the business still needs to work.
It’s good to have tests and walk through the pieces of an actual event, like call lists and bridges, to be sure that communication and documentation work. This might be especially important since teams often assume that all their written procedures will be available. I went through an audit with one company where we failed immediately because all our DR plans were on a network share. In this simulation, we had experienced a network failure and servers had crashed. We were supposed to bring up the systems on spare hardware, but some critical documentation wasn’t available without a network. We started printing things out right away so that we could continue on with the simulation (as well as have this in a binder in our office).
Not everyone can schedule large-scale tests, and certainly many managers don’t see the point. They’ll often want to gamble that staff will “figure things out” if there is an incident. That doesn’t mean that DBAs and sysadmins can afford to wait for a disaster to practice some skills. Be sure that everyone on your team can recover databases, that they know where backups are (or how to determine this), and that multiple people have access to resources. The last thing you want is for a disaster to occur during your vacation and have managers calling you to cut short your holiday because you’re the only one who knows where something is or has the authority to access a resource.
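One low-effort way to check for that "only one person knows" problem is to write down who can reach each recovery resource and look for the gaps. Here is a minimal sketch in Python; the resource names, the people, and the `single_points_of_failure` helper are all made up for illustration, and the threshold of two people is just an assumption you can change.

```python
from typing import Dict, List, Set


def single_points_of_failure(access: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return resources that fewer than `minimum` people can reach, sorted by name."""
    return sorted(name for name, people in access.items() if len(people) < minimum)


# Hypothetical inventory: resource -> people who know where it is and can access it.
access = {
    "backup share": {"alice", "bob"},
    "DR runbook": {"alice"},
    "offsite storage account": {"carol"},
    "restore service account": {"alice", "bob", "carol"},
}

gaps = single_points_of_failure(access)
for resource in gaps:
    print(f"Fewer than two people can reach: {resource}")
```

Anything this flags is a candidate for cross-training or a second set of credentials before someone goes on vacation.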
Think about this ahead of time and prepare.