I use Libsyn to store podcast files. They’ve been a reliable service for me for years. Once in a while uploads are slow, but things seem to work. The other day I went to upload a file and got a status message that there was an issue with database maintenance. When I looked at the status page, I saw a set of updates for a platform issue. Here are the timings as I saw them on the page, with my own comments added:
- 1139 – reports of issues. We’re looking into this
- Three hours ago – Identified the issue related to db maintenance, working on it, other services affected.
- Three hours ago – all services affected, working, update soon
- Three hours ago – emergency maintenance on db systems, we will provide regular updates
- One hour ago – db maintenance in progress, now healthy nodes, turning things on.
That was what I saw at around 3:30 my time. I went back the next day and saw a more detailed set of times listed and a note that the cluster was fixed and then all services were restored. While I couldn’t upload things that day, I did check that downloads for listeners were working, and they were at that time.
I have no idea what happened, but I did appreciate an email the next day that apologized and noted this outage was not the result of malicious attacks and that no data breach had taken place. The latter item hadn’t occurred to me, but I thought it was a reassuring sentence to include in the email. I’m sure it was a rough day for the DBAs and the Ops staff, and hopefully they were able to restore all data.
My concern, however, was that they noted multiple times that they would post updates soon, yet there were some fairly long gaps in the status messages. While I liked getting a few quick messages together (3 in 30 minutes), the long gaps were disconcerting to me as a customer. I expect management would feel the same way, and hopefully management was updated more often.
If you’ve been in an outage, you know that sometimes there isn’t a change in status. A long restore or rebuild of some sort can take time, and platforms don’t always report progress or an estimate of the time remaining. Even when you do get a progress figure, we all know that the time to go from 25% to 50% could be shorter than the time to go from 90% to 95%.
When I have had to report to management, or to an incident team, we have usually had regular updates. Even if these are “no change, we’re still working,” it’s good to let others know what you know. I think that’s important for customers as well, especially those who might have time-sensitive expectations for using your application. Without an update, anyone checking the status page can’t tell whether anything has changed, whether things are worse, or whether you simply forgot to post an update.
My recommendation is to dedicate someone to logging what is happening and taking notes for later review. This person should also be responsible for updating others on a regular basis: every hour, every two hours, something regular. If you have external customers, they should expect and get regular updates, even if these are “no change, the cluster is still rebuilding.”
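If you want to hold yourself to a cadence like this, a small script can help. The following is only a minimal sketch in Python, assuming you keep a list of the times you posted updates. The function names `find_gaps` and `update_overdue`, the one-hour cadence, and the sample timestamps are all invented for illustration; they are not the actual times from the outage described above.

```python
from datetime import datetime, timedelta


def find_gaps(update_times, max_interval):
    """Return (earlier, later, gap) for each pair of consecutive
    updates that are more than max_interval apart."""
    ordered = sorted(update_times)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap > max_interval:
            gaps.append((earlier, later, gap))
    return gaps


def update_overdue(last_update, now, max_interval):
    """True when the agreed cadence has elapsed since the last update."""
    return now - last_update >= max_interval


# Hypothetical timestamps, made up for this example.
updates = [
    datetime(2024, 1, 1, 11, 39),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 15),
    datetime(2024, 1, 1, 12, 30),
    datetime(2024, 1, 1, 14, 30),
]
cadence = timedelta(hours=1)  # assumed target: at least one update per hour

for earlier, later, gap in find_gaps(updates, cadence):
    print(f"No update between {earlier:%H:%M} and {later:%H:%M} ({gap})")
```

During an incident, the person keeping the log could call `update_overdue` with the time of the last post and the current time, and post a “no change” message whenever it returns true.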
A little transparency goes a long way for your customers.