This is a great postmortem from Basecamp with a detailed explanation of a problem and how they are hoping to avoid issues with their service. Basecamp is a Software as a Service tool from 37 Signals that had an outage recently. Actually, the service was up and could be read, but new items couldn’t be added and existing items couldn’t be changed. That somewhat defeats the purpose of the tool. 37 Signals recognizes this and notes that they’re calling this downtime and not trying to get pedantic about the state of the service.
I have sympathy here because I’ve gotten the same error in SQL Sever. I’ve reached the point where I had 2,147,483,647 items in a table. Any additional items caused an error, which some of you might guess. This is the largest positive INT value and if you’ve started at 0 and incremented all new rows, when you reach this mark, you can’t add more data. There are a few fixes for this, including moving to negative numbers or changing to a bigint type.
The staff at 37 Signals knew this was an issue and had actually set a new default for Basecamp that used a bigint in MySQL and PostgreSQL. What they didn’t do is actually make the changes in their hosted Basecamp database because they thought they had more time. I’ve seen that before, often when an application runs at a higher workload than expected for a few years and DBAs aren’t paying attention. When staff least expects it, the INT will reach the max value an clients start complaining. I’ve had to respond to an urgent phone call about this in a few companies.
If you use INT values that are being incremented in some standard way, I’d likely set some alert process that triggers when the value gets to 2 billion or so, depending on how fast you use the numbers. SQL Monitor can do this with a custom metric, and I’m sure other software can as well, including SQL Agent. Having some monitoring in place will help you plan and be prepared to address issues before they cause problems.