I was watching a PoSh session at the PowerShell + DevOps Global Summit recently where a sysadmin had a series of scripts to run when there was a problem. One of these was Rapid Response, which gathers information from one or more machines and stores it in a series of files. It’s a grab bag of various items, but the data can be used to help determine what’s wrong.
Some of us have monitoring tools for our databases, and some don’t. I’m wondering, in each case, is there a set of data you want or need when an incident occurs? Do you have separate types of incidents that require disparate data? Perhaps you respond differently to performance issues than to security incidents or hardware problems, and want different types of data gathered for each.
I know that in the past, I’ve often had scripts I ran to respond to some issues, but not others. I’ve also depended on monitoring systems (bought or built), but usually they don’t have all the information I need when something goes wrong. The full set of data I need in an incident is often too much to capture and store all the time, but it is data that I need for specific issues. Having a series of automated processes that start collecting data when an incident occurs, perhaps filtered by instance, database, user, or some other value, would be helpful. However, I think I’d need to go through a lot of incidents to build the list of scripts myself for different issues.
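As a rough illustration of what such an automated process might look like, here is a minimal sketch. It is written in Python purely for illustration, and it is not how Rapid Response or any particular tool works. The incident types, the `Scope` filter, and the `collect_*` functions are all hypothetical placeholders; a real collector would query the server instead of returning a stub.

```python
import json
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical filters that narrow what the collectors gather."""
    instance: Optional[str] = None
    database: Optional[str] = None
    user: Optional[str] = None


# A collector takes a Scope and returns JSON-serializable data.
Collector = Callable[[Scope], dict]


# Placeholder collectors: real ones would run queries or OS commands.
def collect_blocking(scope: Scope) -> dict:
    return {"collector": "blocking", "instance": scope.instance,
            "database": scope.database}


def collect_logins(scope: Scope) -> dict:
    return {"collector": "logins", "instance": scope.instance,
            "user": scope.user}


def collect_disk(scope: Scope) -> dict:
    return {"collector": "disk", "instance": scope.instance}


# Assumed mapping of incident type to the data worth gathering for it.
COLLECTORS: Dict[str, List[Collector]] = {
    "performance": [collect_blocking, collect_disk],
    "security": [collect_logins],
    "hardware": [collect_disk],
}


def respond(incident_type: str, scope: Scope, root: Path) -> List[Path]:
    """Run each collector registered for the incident type, one file each."""
    collectors = COLLECTORS.get(incident_type)
    if collectors is None:
        raise ValueError(f"no collectors registered for {incident_type!r}")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = root / f"{incident_type}-{stamp}"
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for collector in collectors:
        path = out_dir / f"{collector.__name__}.json"
        path.write_text(json.dumps(collector(scope), indent=2),
                        encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Example: a performance incident scoped to one instance and database.
    scope = Scope(instance="SQL01", database="Sales")
    for f in respond("performance", scope, Path(tempfile.mkdtemp())):
        print(f)
```

Keeping the mapping from incident type to collectors in one place means that each new incident can add a script to the list without changing how the response is triggered.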
A crowdsourced series of scripts, developed by people responding to different problems, would likely be the best way to capture this information. I do see some good resources (PDF, GH) for certain types of problems, but for these to be useful to you, some knowledge and familiarity are needed. You need to know which scripts are useful in which situations.
Building that knowledge is really the best reason for blameless RCA (root cause analysis) work after problems occur. If you have runaway blocking, constant security probes from unknown clients, or any other issue, it becomes important to analyze what happened and how people responded. Build up a protocol for how to respond and ensure that the knowledge is distributed to others who might need it. Practice running scripts and looking at information, perhaps even in a controlled replay of the problem.
When an incident takes place, you’ll be glad you are prepared. Whether it’s small or large, a little practice will help you get through things more efficiently, and likely with less stress.