It’s the first T-SQL Tuesday of 2018, being brought to you by Arun Sirpal. His invitation this month asks for you to talk about some technical challenge that you conquered in your career. Some type of issue that you had to troubleshoot and discover what the issue was, as well as the item you corrected.
I’ve written about lots of items in my career, so I decided to reach back to pick one that wasn’t too hard to diagnose, but was hard to solve.
A long, long time ago, in a company far, far, away, I was a DBA. Actually, not that far away. Just down the road in Englewood, CO, and it wasn’t that long ago, but it was prior to the year 2000, which was it’s own adventure (and non-adventure).
In any case, I was a DBA for a company that had a fairly large and active application in use by hundreds of clients. There were a number of issues I had to solve here, but one of the most memorable came when a customer called and said they were getting an error in our application. We tended to hide most errors and return generic ones to the user, so I had a developer get some debug logs and we discovered a severe error from SQL Server that would drop the connection.
This was annoying since the VB6 app wouldn’t reconnect by itself and the user had to close it and restart. What was interesting was that the user didn’t get this for all activity, just a few items.
I tested a few queries on dev systems and they seemed to work from isql/w. We tried the application and that worked. That led me down the path of a data issue. I suspected some data might have had strange characters that the app couldn’t handle.
As we dug into this on production, it seemed our test accounts and various others we tried would work, even for this customer data. Along the way we were running SQL Trace (this was v6.5) and app logging to debug. A few hours in, I stumbled on a few pieces of data for this customer that caused a broken connection.
The error led me to believe there was corruption in the database and I immediately opened a call to Microsoft while alerting our management. Since we were a company in the financial area, this was a big deal. Fortunately it was near the end of the day and we could take some emergency downtime on the system. Since we traded mutual funds at the time, we weren’t involved in real time activity outside of NY business hours.
At this point I’d been working on this a few hours, and in talking with Microsoft, we started some diagnostics, including CHECKDB work. We ran this periodically, but if I recall, this was weekly.
Eventually we discovered that there was corruption in part of one table. With out indexes and a spread out client load, many queries read around the corrupt section, which explained the behavior we saw. Unfortunately, backups wouldn’t help here, nor could we select out the table to get data. As I was handed off around the world to different customer service centers, my apprehension grew.
I was told that we’d need to move data out of the table and rebuild it, no easy (or quick) task. Since we didn’t know exactly where the corruption was, I was given quite a few queries to slowly work through sections of the clustered index and find what was readable and what wasn’t. Eventually we boxed in the bad sections and moved good data from other areas into a new table.
When we thought we had it all, we dropped the table and renamed a new one. However, I wasn’t done. While this was going on, I was also restoring a few other backups to try and find out when the corruption started and hopefully recover other data.
I worked all night, and into the next day. A couple cat naps while some things ran, but I’d been up close to 40 hours by the time I could leave. I’d recovered most data in the corrupt areas from backups, leaving notes for our service people to try and recreate the rest. We could enable our FKs again, though a few dummy records were needed in places where we weren’t sure what the data should read.
Most customers never knew about the issue and only a few were upset. We replaced a lot of disks and rechecked server hardware, planning on moving to new metal as soon as we could, though the nature of v6.5 made this a challenge in getting disk systems setup correctly.
That was one of my more memorable days at the office, though not the only overnight session. I kept a pillow and blanket in my desk, sleeping on the floor 4 or 5 times that year. Eventually I moved on, and was glad to do so.