I’ve always worked on the notion that my hardware rarely fails, and while I protect it with backups, RAID, etc., I don’t expect failures. That seems to be how most RDBMSes are structured: with the idea that things will work well, and with protection mechanisms for when they don’t.
Contrast that with the way Google views the world. To them, failure is inevitable at scale, as noted by Jeremiah Peschaka in The Promise and Failure of Federated Data. At large scales of deployment, Google and other companies assume that there will be a percentage of failures, and they have to account for them. This is the same type of accounting that restaurants (spoilage) and retailers (shrinkage) use to allow for some amount of loss.
In technology, we account for potential failures with RAID, with HA or DR technologies, and hopefully with substantial testing to ensure that we have properly accounted for potential failures. However, it seems that most technology people treat failure as a possibility rather than a probability. Many people seem to assume that a serious disaster is not likely in their career.
I think that a catastrophic, “we lost the whole data center” event is unlikely for most of us. As Hurricane Katrina and the recent earthquake in Japan have shown, it is possible. However, it’s unlikely for most locations, which is a good thing.
Failures are inevitable, and whether it’s disk corruption, a server crash, or a building power failure, we have to assume we will experience one and plan for the event. We also have to expect hardware will fail, which means regular checks and monitoring to detect these failures as soon as possible.
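As a minimal sketch of that “assume failure” mindset, here is a hypothetical monitoring loop in Python: it runs a set of named health checks and reports every one that fails, treating an exception from a check as a failure too, since an unreachable resource is itself a failure. The check names and thresholds are illustrations, not a real monitoring product.

```python
import shutil

def evaluate_checks(checks):
    """Run each named check; return the names of any that fail.

    `checks` maps a check name to a zero-argument callable returning
    True (healthy) or False (failed). A check that raises is also
    counted as failed -- an unreachable resource is a failure too.
    """
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    return failed

def disk_has_headroom(path="/", min_free_fraction=0.10):
    """Hypothetical check: fail when free disk space drops below 10%."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

# Example run: one real disk check, one simulated crashed service.
checks = {
    "disk_space": disk_has_headroom,
    "backup_service": lambda: False,  # simulate a failed service
}
print(evaluate_checks(checks))
```

In practice a loop like this would run on a schedule and page someone as soon as the failed list is non-empty; the point is that the code expects failures as routine output, not as an exceptional case.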
Hope for the best, but plan for the worst.