Expect the Unexpected with DiRT

Disaster recovery is one of the core tasks that many DBAs think about on a regular basis. Ensuring that we can get our data back online, available, accessible, and intact is important. More than a few DBAs who couldn’t recover their systems have found themselves seeking new employment.

That’s not to say that most DBAs perform perfectly under pressure. Plenty make mistakes, and there may be times when they can’t recover all data. There does seem to be a correlation between how often DBAs practice recovery skills and how well they perform in an actual emergency. At a few companies where I worked, we scheduled regular disaster tests, though often as simulated recoveries onto systems that weren’t expected to actually take over a workload. Arguably not a good test, but better than nothing.

Google takes things a step further. They hold annual, company-wide, multi-day DiRT (Disaster Recovery Testing) events. These span many departments and can cause substantial disruption to their infrastructure. This is a way for the people responsible for infrastructure to actually evaluate whether they are prepared for potential issues.

If you read the article, you find that Google started small with these and gradually expanded them into larger, more inclusive tests, like taking down a data center. They also whitelist some servers that they know cannot pass a test, so there is no reason to actually take them down. After all, the business still needs to run.

It’s good to have tests that walk through the mechanics of an actual event, like call lists and bridges, to be sure that communication and documentation work. This is especially important because teams often assume that all their written procedures will be available. I went through an audit with one company where we failed immediately because all our DR plans were on a network share. In this simulation, we had experienced a network failure and servers had crashed. We were supposed to bring up the systems on spare hardware, but some critical documentation wasn’t available without a network. We started printing things out right away so that we could continue with the simulation (and so that we would have a copy in a binder in our office).

Not everyone can schedule large-scale tests, and certainly many managers don’t see the point. They’ll often want to gamble that staff will “figure things out” if there is an incident. That doesn’t mean that DBAs and sysadmins can afford to wait for a disaster to practice their skills. Be sure that everyone on your team can recover databases, that they know where backups are (or how to find out), and that multiple people have access to resources. The last thing you want is for a disaster to occur during your vacation and for managers to call and cut short your holiday because you’re the only one who knows where something is or has the authority to access a resource.
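One low-cost way to check the "multiple people have access" point is to keep a simple inventory and flag anything that only one person (or nobody) can reach. The sketch below is only an illustration: the function name, the resource names, and the people listed are all hypothetical, and a real inventory would come from whatever your team already uses to track access.

```python
from typing import Dict, List, Set


def single_points_of_failure(access: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return the resources that fewer than `minimum` people can reach, sorted by name."""
    return sorted(name for name, people in access.items() if len(people) < minimum)


# Hypothetical inventory: resource -> people who can access it.
access = {
    "backup share": {"alice", "bob"},
    "encryption key vault": {"alice"},
    "DR runbook binder": {"alice", "bob", "carol"},
    "spare hardware room": set(),
}

# Anything listed here depends on a single person, or on no one at all.
for resource in single_points_of_failure(access):
    print(f"Needs another person with access: {resource}")
```

With this made-up inventory, the key vault and the spare hardware room are flagged. Those are the items that would get someone called back from vacation.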

Think about this ahead of time and prepare.

Steve Jones

Listen to the podcast at Libsyn, Stitcher or iTunes.

About way0utwest

Editor, SQLServerCentral

3 Responses to Expect the Unexpected with DiRT

  1. pianorayk says:

    I once got into an argument with someone who attended my “Disaster Documents” presentation. He kept arguing that paper was dead, and there was no reason to keep hardcopy documents. I kept saying, what if you can’t get to your online documentation?

    Not surprisingly, he gave me a poor evaluation.

  2. pianorayk says:

    Reblogged this on Welcome to Ray Kim's 'blog and commented:
    Steve’s article reminded me about the first time I gave my Disaster Documents presentation at a SQL Saturday.

    At the end of my presentation, one attendee started an argument with me. He kept saying that paper was dead, everything was online, and there was no reason to keep hardcopy documents. I argued, what if you can’t get to your online documentation?

    Not surprisingly, he gave me a poor evaluation.

    The bottom line is this: even documentation needs a backup. Other than, say, getting lost in a fire, paper documents can’t break. At a minimum, have hardcopy documents that instruct how to get minimal services back up and running, and back up other recovery documentation so you can recover it later.
