I remember the first time I worked in a large company, one with more than 10,000 employees, and we had a crisis with our systems. A number of us crowded into the cold computer room, gathering around a few workstations and trying to solve the issues. We worked furiously to restore service, with various managers and executives periodically knocking on the locked door wanting status updates, unaware they were slowing us down. Eventually we stabilized things, but it was a chaotic and inefficient environment, with too many people involved and more time spent discussing problems than solving them.
Later I worked in a similarly sized, but more mature, company. We had various virus issues, including SQL Slammer. When we realized there was a crisis, we’d contact a director who would convene a crisis team. There were designated individuals from various groups (network, security, database, etc.), each of whom had a backup, but only one representative from each area was part of the team. Those were the only people who participated in the meetings, giving status updates and taking away actions to distribute to their teams. Each was responsible for coordinating the activities of their area with the others. Status updates were scheduled regularly, with a specific individual posting them. The director leading the crisis would update executives.
I thought back on these experiences (I was the main person in the database area) when I read about the Amazon war room experiences around launching one of their products. While our crisis management wasn’t quite like this, it was fairly well scripted. There were times when the process didn’t proceed smoothly, but it worked well overall for problem situations. I wish that we had handled deployments a little more formally, though not as strictly as Amazon did. Our deployments didn’t have as large an impact as a product launch, but we certainly could have used more coordination between different groups. I remember no shortage of networking/firewall issues, security mismatches, or missed communications with customers from deployments.
I’d like to see software deployment become an easier and simpler process. My hope is that more people learn to code better and implement unit tests to ensure they meet requirements and prevent regressions. I want to see automated deployments into staging environments to catch potential issues and, eventually, smooth execution from the client perspective. I want these things to happen for both database and application software.
A better development, test, and deployment process doesn’t remove the need for strong coordination among everyone involved, and it certainly doesn’t mean a crisis team shouldn’t be prepared to respond if there are issues. Thinking ahead to potential issues and ensuring everyone is on the same page helps to smooth out the bumps that will occasionally crop up. At least, I expect they are only occasional if you follow a good CI/CD process.