Cascading Human Error

One mistake could cascade

I caught this short piece on the Amazon EC2 outage, and while it does highlight a potential problem in the world of the cloud, one part of it really stuck out to me. This quote in particular has me worried about the future:

“It’s the kind of error an operator could make as a wrong choice on a menu or the entry of the name of the last network worked on instead of the one needed. In short, it was a human error that’s all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare up with a spouse.”

That’s a little scary, and while it’s not necessarily a reason to avoid the cloud, it is potentially a reason to avoid the extremely large companies like Amazon and Microsoft. They have built these large infrastructures using lots of standardization and, hopefully, automation. In the places where humans do have to type commands or make configuration changes, it’s possible that they make a mistake. While that can happen at a company of any size, when it happens at Amazon, Microsoft, or Google, a large number of people can be affected.

In many ways I think we have a problem with extremely large, standardized, and highly interdependent infrastructures. While being able to deploy a configuration change to 1,000 servers at once or reboot 800 for a patch is cool, it’s also potentially a problem. We used to manage nearly 1,000 Windows servers and patch the large majority at one time with SMS, but we also held our collective breath whenever hundreds of them rebooted at once. One mistake, and it makes for a very long night. And a very long day when you are explaining the issues to management.

I don’t think we will ever eliminate human error, but we can definitely minimize it. Strong QA processes should require automated deployments and ensure that the exact same deployment run in QA is run in production. If we can stick to that type of process, regardless of the delays it might create, we can prevent a lot of human error.

Steve Jones

