It was bound to happen. I’m sure it’s happened before, but this event was interesting to me because it wasn’t a customer issue but a cloud vendor problem. There was an outage in Azure on Jan 29, which happens, but in this case data was lost. A problem in internal Azure code dropped some customer databases that were using Azure Key Vault with TDE. That was slightly disconcerting for me, as I was setting up and testing Azure Key Vault this week.
There are frequent snapshots and Microsoft was able to restore the databases from one that was about five minutes old. Microsoft acknowledges that five minutes of data loss might be an issue and is asking customers that lost business or were affected by the drops to raise a support ticket. I have found Azure support to be pretty good about crediting my account when issues occur, and I hope they do the same here, though I wonder if they’ll compensate anything beyond the charges normally assessed to customers. They are offering credit for the restored and original databases for a few months as well.
Plenty of people are upset, and with good reason. There should not be Azure management code that drops databases. Or should there be? Would this be any different on premises?
I’ve had cleanup code that removed resources after some time. I don’t work at the scale of Azure, so I usually have things removed after a month or a quarter. With the scale of Azure and the potential costs, they might remove resources more quickly, but I have certainly seen similar homegrown, if-this-then-that code that does x when y occurs. In this case, I remove encryption keys, which might not actually be removed for a month or more. At that point, there is code that triggers a drop of databases. I’ve certainly seen users in various organizations drop, or restore over, the wrong database. In some cases they don’t realize it within five minutes, and often they don’t have a way to restore from a backup taken five minutes earlier. Actually, relatively few people I know have RPOs under five minutes.
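The kind of delayed cleanup rule described above can be sketched roughly as follows. This is only an illustration, not anything Azure runs: the `Resource` class, the `resources_to_purge` function, and the 30-day grace period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed grace period for illustration; a real system would pick its own.
GRACE_PERIOD = timedelta(days=30)


@dataclass
class Resource:
    """Hypothetical record for something a cleanup job might remove."""
    name: str
    marked_for_removal_at: Optional[datetime] = None


def resources_to_purge(
    resources: List[Resource],
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> List[Resource]:
    """Return only resources that were marked for removal and whose
    grace period has fully elapsed. Unmarked resources are never returned."""
    return [
        r
        for r in resources
        if r.marked_for_removal_at is not None
        and now - r.marked_for_removal_at >= grace
    ]
```

The point of separating "marked" from "purged" is that the destructive step only ever sees items that have waited out the delay, which leaves a window to notice a mistake.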
This is bad, but it isn’t necessarily out of the ordinary for complex IT environments. If this happened in an organization, the IT staff would be worried and hoping for forgiveness. Some people would want others fired, but most of the time management would understand that these things happen. Perhaps not if this isn’t the first time, but usually we accept that people make mistakes. Most of us don’t have complete control of all aspects of our environment. We depend on network staff, storage people, employees who manage hardware, and more. It’s possible that any one of these people could destroy data inadvertently.
That happened here, though I don’t want to make excuses for Microsoft. They’re supposed to hire the best people and build processes that are better than what I’d expect inside an organization. Events will cascade into different areas, and there should be circuit breakers that prevent anything that could cause data loss in those events. Protect other people’s data with more care than you think you need. It’s your responsibility to do so.
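As a rough sketch of what a circuit breaker around destructive automation might look like, the function below refuses to drop anything when a single run wants to drop more databases than expected. The names `guarded_drop` and `DropLimitExceeded` and the threshold of three are hypothetical, picked only to illustrate the idea; this is not how Microsoft's tooling works.

```python
from typing import Callable, List


class DropLimitExceeded(Exception):
    """Raised when a cleanup run wants to drop more than the allowed number."""


def guarded_drop(
    candidates: List[str],
    drop: Callable[[str], None],
    max_drops: int = 3,  # assumed threshold; tune to what is normal for you
) -> List[str]:
    """Drop the candidate databases only if the batch is within the limit.

    If the batch is larger than max_drops, nothing is dropped and the
    run stops so a person can review it.
    """
    if len(candidates) > max_drops:
        raise DropLimitExceeded(
            f"{len(candidates)} drops requested, limit is {max_drops}"
        )
    dropped = []
    for name in candidates:
        drop(name)  # the caller supplies the actual drop action
        dropped.append(name)
    return dropped
```

A check like this does not prevent every mistake, but it turns an unusually large cascade of drops into a stopped job and a question for a human rather than lost data.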