Incident Review

Something will always go wrong with our software systems. Issues might happen regularly or rarely, but something is going to go wrong at some point and a client will complain. Often the way we handle these situations determines whether our customers continue to do business with us or look for an alternative.

In my career, I’ve been a part of many incidents, lots of which were service outages for customers. I’ve attended some incidents as a technical person working to diagnose the issue. In other incidents, I’ve been a developer who had to fix code. During many, my role was as a manager trying to ensure information moved smoothly between people and that someone double-checked that our “fix” wouldn’t cause another problem. After a problem, I’ve also had to take part in a post-incident review. Unfortunately, this has happened far less often than it should.

When I read this description of a post-incident review, it is unlike many of the after-action meetings I’ve attended. In my experience, there is one meeting and someone is being blamed. Senior management is often there, putting pressure on everyone in attendance to “never let this happen again.” I haven’t known anyone who wanted to go through another outage or another post-incident meeting, but with complex systems, and humans managing them, something is bound to go wrong. We should expect to have incidents in the future.

What we want is for the same type of incident not to be repeated. That comes about if we learn from our mistakes and design better protocols that help us catch them, often by ensuring a human doesn’t make a simple error. We should accept that mistakes will happen and try to find ways to detect the problem quickly, limit the scope of impact, and share this knowledge with other workers. Depending on humans to be more perfect in the future isn’t likely to be successful.

These days I read the post-incident reviews that my company publishes after an outage. I find them fact-based, focusing on what went wrong without blaming a person. They include analysis not only of the actual issue but also of the conditions that led to the hardware or software failure, or to the decision that was made. There are learnings about how we might have prevented something with a time machine, while still assuming that humans would make mistakes or a component might fail. There are also suggestions for improvements in hardware, software, training, or monitoring that might assist in quicker recovery in the future.

Coming out of an incident with a positive mindset is the best way to try and prevent a repeat of the same incident in the future. This requires that we not only avoid blaming someone for an error but that we also take steps to limit the potential for future errors. If the issue is someone clicking the wrong selection in a drop-down or pressing “OK” when they meant to press “Cancel”, there are limited ways to prevent those issues. However, we can adopt the mindset that an outage is a team failure and build a habit of double-checking each other. That’s much better than blaming one person and giving the job to another human, who might easily make the same mistake.

Many humans struggle to avoid placing blame on others and to just accept that some mistakes will happen. A DevOps mindset, with blameless reviews, instead focuses on how we can do better as a group, rather than how we failed as individuals. This little change helps us build a better team, one that often performs better in the future.

Steve Jones

Listen to the podcast at Libsyn, Stitcher, Spotify, or iTunes.

Daily Coping 30 Sep 2021

I started to add a daily coping tip to the SQLServerCentral newsletter and to the Community Circle, which is helping me deal with the issues in the world. I’m adding my responses for each day here. All my coping tips are under this tag.

Today’s tip is to find a moment to sit quietly and breathe for a few minutes.

I’m traveling today. Up early, I’m going to Pasadena to our office there to see a few people and get away from my small space. I have really missed this, and I’m excited to be going.

I’m adding a note in my calendar for this trip to follow this tip twice. First, when I get to the airport, which I know will be slightly rushed, I’m going to take a minute after purchasing food and getting to the gate to stop and sit and breathe. Usually I’m checking email or reading, but here I’m going to sit quietly. That will be my 7:05a meeting.

I know when I get to LA, I’ll be riding up to Pasadena. I’m going to avoid music and talking for the first part of the trip, just looking out at LA and letting my mind wander. Think, breathe, and relax. The rest of the day will be busy.

Goal Progress for September 2021

I set goals at the beginning of the year, and I’m tracking my progress in these updates during 2021.

Not a lot of progress in September. I think this is a bit of a symptom of life for me, as I find myself in a career lull as I get near the end of summer each year. Perhaps a learning item for me is to tackle more things earlier in the year, or later, as I find the end of the year slow.

I’m also starting to think that my goals should cover only part of a year, with a re-assessment at some point. Perhaps 2022 ought to have six-month goals, or even four-month goals, after which I create new ones or leave some off.

Current Grade: C-

I’m leaving the same grade. Minor things accomplished. Mostly I think September was time away from work spent on life. A new car, chores before winter, and some time with my wife. No regrets here, other than my career stalled.

Reading

I did finish Lights Out, and I purchased Thanks for the Feedback. I opened it and got started. I also wrote a Lights Out review.

Here’s the current progress:

Technical Skills

No real change here. I haven’t done much coding outside of work.

Projects

No real changes here. Minor updates, but not enough time spent on these items.

Daily Coping 29 Sep 2021

Today’s tip is to try something different for exercise today, however small.

My exercise routine for the last year has been roughly this:

  • Yoga 2-3 times a week
  • Weightlifting 1-2 times a week
  • Walking 2-3 times a week, often with weightlifting
  • Swimming once a week
  • Snowboarding 1-2 times every couple weeks in winter
  • Hiking at times
  • Yardwork/chores

As a change, I decided to do some Pilates. I had done some years ago, but I haven’t done a class at all. My boss used to talk about doing some YouTube classes while traveling, so I decided to try this.

I found Move with Nicole online, which was similar to yoga, but not quite the same. I did this 30-minute workout. Holy @#$!@#@!#, that was hard. I had to take a break a few times, and I found some muscles I don’t use enough.

I may try to add one day a week of Pilates to my movement, but certainly I’ll be sticking with 20-30 minute sessions.
