Both as a DBA and developer, I’ve had plenty of immediate, this-is-broken, fix-it-quickly issues. Usually, I, or someone else, wrote some bad code and somehow got it deployed. I mean, I do test things, and I would (probably) never change code after I’d tested it to fix that one little annoying thing, like the formatting. I’d (almost) never do that, and I’m sure you wouldn’t either.
Yet somehow bugs slip in at times.
Those are the acute issues, and they can be hard to fix at times, but often we can reproduce the problem in development and build a fix. Sometimes we even spot the issue quickly and just fix it in production. I’m sure you never do that, but I have had that experience myself a few times.
However, in the database world, we can have other, slow-growing problems. I saw this post from Jacob Sebastian about production issues that don’t trigger alarms. There are just slowdowns that trickle across multiple systems and cause issues for clients. These aren’t things you instrument for, as a slowdown isn’t necessarily an issue. These things can resolve themselves, or they can develop into a major issue.
I think about this like vehicle traffic. A minor fender bender on a highway might not be a problem, but it can become one. Cars don’t get out of traffic quickly enough, or traffic police don’t arrive soon enough to move the cars. Traffic starts to back up, which slows down the response, including that important tow truck that might keep things moving. Suddenly, it’s not a few people inconvenienced by an accident, but thousands.
There are likely signals in your environment that would let you know about a potential issue coming soon. These are subtle and not always indicative of a problem individually, but taken together, they indicate a production issue is going to occur. To me, this is a place where AI can be taught to look for these signals and then happily keep looking for them every day.
The future of monitoring is the active examination of correlated data that precede an issue, hopefully giving humans, or other AIs, enough time to respond and prevent customers from experiencing a slowdown.
I’d certainly welcome this on both motorways and in database systems.
Steve Jones


