There was a time when I worked for a company that sold products online. Since our wares could be purchased at any time of the day or night, we wanted to ensure that our systems were running all the time. This led us to build some monitoring of our own, which we tried. That led us to buy some monitoring software, which we did. This led us to build more tools, and for a while it felt like we were in an endless loop.
Eventually we stepped back and tried to answer the question that many Operations people have asked themselves and others: what is downtime?
It’s a tough question, and I want to give you a few examples of how I’ve viewed things, and debates I’ve had. For example, we had a database server and a web server. We used a simple script to ensure that the services (IIS and SQL) were running on both machines. If they weren’t, we received a page. Is that sufficient to detect if our system is working?
We also had a process that would ping our web server from outside the data center, using a public machine. If that works, is the system working?
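To make the distinction concrete, here is a minimal sketch of how checks like these might be combined into a single status. It is not what we actually ran. The check names and the `run_checks` and `summarize` helpers are hypothetical, and the lambdas are stubs standing in for real probes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; each check is any zero-argument callable
# that returns True when that layer looks healthy.
Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> List[Tuple[str, bool]]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append((name, passed))
    return results


def summarize(results: List[Tuple[str, bool]]) -> str:
    """Collapse results into 'up', 'down', or 'degraded: <failed checks>'."""
    failed = [name for name, passed in results if not passed]
    if not failed:
        return "up"
    if len(failed) == len(results):
        return "down"
    return "degraded: " + ", ".join(failed)


# Stubbed checks standing in for real probes: a service status query,
# an external ping, and a request that inspects the returned page.
checks = {
    "services_running": lambda: True,
    "external_ping": lambda: True,
    "order_page_renders": lambda: False,
}
status = summarize(run_checks(checks))
print(status)
```

With these stubs the services are running and the ping succeeds, yet the result is `degraded: order_page_renders`. Neither of the first two checks says anything about whether a customer can actually use the site.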
In this job, we deployed new code every week, in a DevOps-style process that existed before anyone had ever uttered the term. These updates sometimes included schema changes, but almost always included application changes. If a page on our website broke after a deployment, was our system up or down?
We integrated with some third-party software to perform various tasks. There were times that we couldn’t communicate with the third party, or received broken communications. In those cases, were we up or down?
We built our application to work with multiple browsers, but at times there would be a new piece of functionality that didn’t render or work correctly on either a new browser (Firefox) or an old one (IE6). Did that mean the application was down?
Determining uptime isn’t a single thing. Even when you provide mechanisms that ensure all parts of your application are working, are they working for everyone? Many of us might see this in various online calls, where a system like GoToMeeting or Skype might work for some of the audience and not others. I see this at times with Microsoft sites where some of us can use one of their online systems, but others can’t, sometimes because of the browser of the end user.
I was thinking about this while researching zero-downtime deployments, which can be hard for database changes. There are people that have success, but many others don’t. At Redgate Software, we are trying to build tools to make this easier for everyone, but there seem to be plenty of edge cases that cause issues. There are also many different processes and flows that groups use to perform database development, which often affects the final deployments. It is hard to build a general solution that needs to apply to specific environments.
I tend to lean towards measuring uptime of the systems I’m responsible for and letting others worry about intermediate infrastructure. I’ll caveat that by noting that I sometimes only worry about particular sections of the system and whether those are broken. It’s good to be clear about this when discussing the topic with others. For example, we might be able to take orders, but can’t report on them, or can’t add new customers. That’s downtime for some sections of our application, but less stressful than if we couldn’t take orders.
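One way to put numbers on that per-section view is to track outages separately for each part of the application. This is only a rough sketch under my own assumptions: the section names, the `availability_pct` helper, and the sample outage times are all made up for illustration, and it assumes the outages recorded for a section don't overlap each other.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Outage = Tuple[datetime, datetime]  # (start, end) of one outage


def availability_pct(window_start: datetime, window_end: datetime,
                     outages: List[Outage]) -> float:
    """Percent of the window a section was up.

    Assumes outages for a section do not overlap one another.
    Outages are clipped to the reporting window before being counted.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


# Illustrative data only: a ten-hour window and invented outages.
window_start = datetime(2024, 1, 1, 0, 0)
window_end = datetime(2024, 1, 1, 10, 0)
outages_by_section: Dict[str, List[Outage]] = {
    "orders": [],
    "reporting": [(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0))],
    "new_customers": [(datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 11, 0))],
}

report = {
    section: availability_pct(window_start, window_end, outages)
    for section, outages in outages_by_section.items()
}
for section, pct in report.items():
    print(f"{section}: {pct:.1f}%")
```

Here orders come out at 100%, reporting at 90%, and new customers at 95%, since only the half hour of that last outage inside the window counts. A single "uptime" figure for the whole application would hide the fact that the section that matters most never went down.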
Let us know today. How do you measure downtime or uptime, and where is your responsibility?