Are you going to start seeing more pressure to prevent outages in your applications? I suspect more outages are caused by application issues than by database ones, but the two are becoming very tightly linked as we look to deploy features and enhancements in our applications more rapidly, and those deployments often include database changes.
This outage from United shows that the impact can be huge: not only a financial cost, but also inconvenience to clients and potentially lost future business. A company might struggle with future business after a large outage, especially when there are so many other choices easily available to consumers across the Internet.
That brings to mind a very interesting problem as companies grow and look to build scalable systems. Large groups of servers require some level of standardization, mostly for ease of management by IT workers and to make it possible to train future workers to understand the systems. However, that standardization becomes a point of failure when there is a problem during an upgrade, or even an attack from some type of malware.
I saw an interesting piece on how Netflix has tried to anticipate and handle failures in the cloud, and a comment from Jeff Atwood that you ought to have your own chaos monkey to regularly test your systems. In many cases, it’s probably good advice, since it ensures that both your systems and your people know how to deal with outages.
You probably cannot eliminate outages, as Netflix and many other companies have learned. However, you can work to ensure your people and systems know how to respond. I also wonder if running (at least) two versions of your systems at all times, both working in a similar way, might provide some tolerance against a single point of failure. I don’t know how you might implement this, but it might offer some protection against a failure in a completely homogeneous environment.