Last year SolarWinds was hacked and blamed an intern for a security lapse. When Equifax was hacked, the former CEO, in testimony to the US Congress, blamed a specific, though unnamed, person for not patching a system. British Airways blamed its USD $200+ million IT issue on an engineer who rebooted a system too quickly.
I don’t know that any large company from my younger days, say before 1990, would have blamed a massive failure on a single person. While any single person can influence more systems in the age of technology, no one should have the power to cause such a massive failure. If they do, I think I’d look towards poor system design, rather than individuals.
These aren’t the only examples of management trying to scapegoat an IT worker, and I suspect we’ll see more examples in the future. However, I hope that governments and shareholders start to demand better management from management. If you don’t understand how IT works, get auditors or consultants to evaluate things and explain them to you. If you don’t think that your systems are well put together without single points of failure, address that. If you worry about security, make that a priority. Microsoft did after the Slammer worm, and arguably they have a difficult job where most employees want to control their laptops and workstations entirely and run them in their individual manner. Microsoft built better controls into infrastructure and software development, and everyone else should as well. Management needs to own their responsibility for failures.
We should expect mistakes in security, in design, in coding, and more. We should also be placing guardrails, tests, and limits inside our environments to ensure that we catch most of the issues. Software development and system design have improved dramatically over the last decade to help us raise quality and security, but we have to embrace the knowledge that’s been gained, as well as ensure we have circuit breakers to prevent runaway failures. If a sysadmin can alter a Chef script to set the max memory in SQL Server to 1 MB, that change shouldn’t get deployed to all instances. Moreover, we ought to be testing for all sorts of potential changes that can cause issues.
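To make the guardrail and circuit breaker idea concrete, here is a minimal Python sketch of what I have in mind. It is not Chef's API or anyone's real deployment tool. Every name in it (`validate_max_memory`, `staged_rollout`, `RolloutHalted`) is hypothetical, the 1024 MB floor is an assumed value chosen only for illustration, and the `apply_change` and `is_healthy` callables stand in for whatever your own tooling does.

```python
from typing import Callable, List, Sequence

# Hypothetical floor for illustration; choose one that fits your servers.
MIN_MAX_MEMORY_MB = 1024


class GuardrailError(Exception):
    """A proposed setting failed a pre-deployment sanity check."""


class RolloutHalted(Exception):
    """The circuit breaker tripped; `updated` lists instances already changed."""

    def __init__(self, updated: List[str]) -> None:
        super().__init__(f"rollout halted after {len(updated)} instance(s)")
        self.updated = updated


def validate_max_memory(proposed_mb: int, floor_mb: int = MIN_MAX_MEMORY_MB) -> None:
    """Guardrail: reject a max memory value below the agreed floor."""
    if proposed_mb < floor_mb:
        raise GuardrailError(
            f"max memory {proposed_mb} MB is below the floor of {floor_mb} MB"
        )


def staged_rollout(
    instances: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
    max_failures: int = 0,
) -> List[str]:
    """Circuit breaker: apply a change batch by batch, halting on failures."""
    updated: List[str] = []
    failures = 0
    for start in range(0, len(instances), batch_size):
        for name in instances[start:start + batch_size]:
            apply_change(name)
            updated.append(name)
            if not is_healthy(name):
                failures += 1
        # Check between batches so a bad change never reaches the next one.
        if failures > max_failures:
            raise RolloutHalted(updated)
    return updated
```

With something like this in the pipeline, `validate_max_memory(1)` raises before anything is deployed, and even if a bad value slipped past that check, `staged_rollout` would stop after the first unhealthy batch instead of touching every instance.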
To me, this is the area where DevOps, GitOps, or anything Ops, automated, or at scale needs to mature. We need to allow for, expect, and assume that mistakes and failures will happen, and build controls into our build and test systems. Once we better understand how someone can make simple mistakes, we can attach more checks and balances to ensure that we continue to improve quality without sacrificing speed or lowering security.