The British Airways computer failure has been on my mind for a few weeks. It seems like such an epic failure in 2017 from a high-profile public company that many, many people depend on. I still struggle to believe the power explanation, and really, I’m not sure there’s any explanation that I would accept. At this point in our industry, there’s no good reason for any large, global company to be unable to restart services at a DR site within a couple of hours. In fact, it really shouldn’t even take that long.
However, many of us will have a failure or disaster at some point. It might not even be a hardware failure or system crash. It’s much more likely that a user will cause an issue. As our systems grow larger, and perhaps more heavily loaded with transactions, we might not always be able to easily separate good data from bad, and I would expect we’ll need to perform a restore.
For many of us, this will mean losing some data from the system. Even with frequent log backups, we might end up with a short period where we can’t recover data. Most of us should have conversations with business stakeholders about what level of data loss is acceptable, and plan to meet those requirements. We should also have plans for how to rebuild data. I wouldn’t recommend a full test on a system, but it might be worth a few conversations with those who deal with transactional data to discuss how the latest data might be recreated.
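To make that conversation with stakeholders concrete, here is a rough sketch of the arithmetic behind the unrecoverable window. It assumes the worst case is a failure just before the next log backup lands safely off the server, so the exposure is the backup interval plus the time it takes to finish and copy that backup. The function names and the numbers are hypothetical, chosen only for illustration.

```python
from datetime import timedelta


def worst_case_loss(log_backup_interval: timedelta,
                    copy_time: timedelta = timedelta(0)) -> timedelta:
    """Estimate the largest window of data that could be unrecoverable.

    Assumption: a failure strikes just before the next log backup is
    safely stored, so everything since the previous one is at risk.
    """
    return log_backup_interval + copy_time


def meets_requirement(log_backup_interval: timedelta,
                      copy_time: timedelta,
                      acceptable_loss: timedelta) -> bool:
    """Compare the estimated exposure with what the business will accept."""
    return worst_case_loss(log_backup_interval, copy_time) <= acceptable_loss


if __name__ == "__main__":
    # Hypothetical schedule: log backups every 15 minutes, 2 minutes to copy.
    exposure = worst_case_loss(timedelta(minutes=15), timedelta(minutes=2))
    print(f"Worst-case exposure: {exposure}")
    print("Meets a 15-minute target:",
          meets_requirement(timedelta(minutes=15), timedelta(minutes=2),
                            timedelta(minutes=15)))
```

If the estimate exceeds what the business will accept, the options are a tighter backup schedule or an agreed manual process for recreating the data in that window.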
No one wants to lose data, and in many cases, there are ways to rebuild or recover the data with manual effort. Perhaps your company has paper records, or maybe there’s an audit trail that could be used to reconstruct actions. Maybe you rely on memory or even customers to provide information again. Today I’m wondering if you’ve thought about how you might recover data in a non-technical way and what methods you’d use.