Recently British Airways had a massive computer failure and had to cancel hundreds of flights over a couple of days. This was a large disruption to travel, as tens of thousands of passengers were stranded around the world. In addition to the concerns many of us have about security, I also worry about large scale systems that continue to grow, with more users depending on them, while not necessarily being updated with modern technologies. While I know many of the customer facing systems at companies like airlines have been enhanced, I'm not sure the core infrastructure that backs these systems has changed.
British Airways has blamed the outage on a power failure, with some finger pointing between the airline and their outsourcing company. BA appears to say a power surge caused problems with its UPSes and batteries, which resulted in a situation where "the controlled contingency migration to other facilities could not be applied."
I'm sure some of the information being published has been carefully vetted by lawyers and doesn't necessarily reflect the actual technical issue, but it doesn't matter. A well designed HA solution shouldn't care whether a primary system drops offline, falters, or fails in any other way. In the worst case, any IT system failover can be forced, accepting a potential loss of data. In the SQL world, we could always remove a primary and just go with whatever data is on the secondary. Certainly within a few hours we could be up and running.
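For SQL Server Availability Groups, that worst-case forced failover is a single statement run on the secondary replica. This is only a sketch; the Availability Group and database names here are invented for illustration:

```sql
-- Run on the secondary replica when the primary is unreachable.
-- FORCE_FAILOVER_ALLOW_DATA_LOSS promotes this replica to primary,
-- accepting that any unsynchronized transactions are lost.
ALTER AVAILABILITY GROUP [SalesAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- When the old primary comes back, its databases are suspended and
-- must be resumed to rejoin the group as secondaries.
ALTER DATABASE [Sales] SET HADR RESUME;
```

The trade-off is explicit in the command name: you get the service back quickly, and you reconcile or write off the lost transactions afterward.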
I suspect that the architecture of the BA system is not well designed for large failures. I know that dealing with power is tricky. While working for a large company (10k+ employees), our data center went down one day, which affected our public presence, customer support, and plenty of internal systems. We had UPS service underway, which had disconnected main power for some reason, and we couldn't quickly switch to the public grid when a system failed. It took a few hours to reroute power, during which our irate CTO paced the data center floor. Certainly that didn't speed things along.
As noted in the piece I've linked, there are some strange inconsistencies with the explanation. It's entirely likely that a data quality issue caused problems that took hours to rectify. I've been on the wrong side of those, working through queries and attempting to piece together some version of the correct data from various sources. If that's the case, then I'd really like to know how the failure of their systems allowed bad data to disrupt operations. There are learning opportunities here for us data professionals.
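That reconciliation work usually starts with queries comparing two sources for missing or conflicting rows. A minimal sketch, with table and column names invented for illustration:

```sql
-- Compare the booking system against the operations feed: find rows
-- that exist in only one source, or that disagree between the two.
SELECT COALESCE(b.BookingID, o.BookingID) AS BookingID,
       b.FlightNo AS BookingFlight,
       o.FlightNo AS OpsFlight
FROM dbo.Bookings AS b
FULL OUTER JOIN dbo.OpsFeed AS o
    ON b.BookingID = o.BookingID
WHERE b.BookingID IS NULL      -- only in the ops feed
   OR o.BookingID IS NULL      -- only in the booking system
   OR b.FlightNo <> o.FlightNo; -- present in both, but conflicting
```

Multiply that by dozens of tables and hours of judgment calls about which source to trust, and it's easy to see how a data quality problem stretches an outage.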
I do hope that some of the details of how BA architected their system get shared among technical staff. At the least, I'd like to see large companies like UPS, Wal-Mart, and British Airways publish and share information about their technical details and architectures. Some of the tech companies do this already (Amazon, Google, etc.), but it's good for other industries to share their information as well. We owe it to the industry to learn what works well and what fails as we grow systems to larger scales and more complex interactions. Certainly I'm proud that many SQL Server experts share their experiences, helping others learn and make fewer mistakes in the future.