Date Published May 21, 2019
Somewhere along the way, birds got a bad reputation. Neither as cute as kittens nor as playful as puppies, birds are...boring. Before the ornithologists out there get too upset, I want to point out that “boring” isn’t bad. In fact, for incident response, “boring” is actually a pretty desirable characteristic because the last thing your business needs is an encounter with a black swan.
How do you get from “black swan” to “boring”? Many companies pursue an aggressive strategy for modernization, bringing in modern tools like Slack and PagerDuty and expecting transformation to just happen. And while many people, including some ITIL supporters, view modernization as a worthless exercise (i.e., it’s “for the birds”), I cling to the Oxford definition of modernization:
The process of adapting something to modern needs or habits.
In other words, if your business isn’t being impacted by modern needs or habits, don’t modernize.
On the other hand, if your customer demands are increasing, or your organization is transforming to reach customers differently, or your operational complexity is rising as more developers get involved, then let me share a couple of important ways to keep incident response as boring as possible. These techniques are based on PagerDuty’s open-sourced incident response documentation.
Dave Cliffe presented a session on Modernizing Incident Response at HDI Conference & Expo.
When US Airways Flight 1549 struck a flock of geese leaving LaGuardia airport, Captain Chesley “Sully” Sullenberger had to make a quick decision based on how bad things were. As the dialog goes:
Captain Sully: Mayday, mayday, mayday, this is Cactus 1549. Hit birds, we’ve lost thrust in both engines. We’re turning back towards LaGuardia.
Air Traffic Controller: Okay, you need to return to LaGuardia?
Captain Sully: We’re unable [to land]. We may end up on the Hudson.
Note that it wasn’t the air traffic controller making that call; Captain Sully was the expert on the situation at hand. This is similar to DevOps culture, which advocates for service ownership and makes the experts (the developers themselves) accountable for the customer’s experience with their service. The DevOps model of ownership greatly improves triage and tightens feedback loops when responding to customer-impacting or business-impacting issues. In contrast, in IT operations, there often seems to be a desire to put triage teams or Level 1 support teams as the first point of contact.
Another element to simplifying triage is measuring what impacts your business. Developers and operations engineers often get caught responding to metrics that lack outside-in context, such as CPU load or API responsiveness. While those can be helpful in diagnosing a problem, they won’t help you triage.
At PagerDuty, one of our most critical capabilities is ensuring that customers receive timely notifications. As a result, one of our key metrics is measuring end-to-end notification latency. If our notification latency starts to creep up, we jump into incident response (SEV-2) immediately, even when all of our servers seem to be behaving normally. Meanwhile, for Amazon, that relevant business metric might be “orders per second” instead of notifications.
Your metric might be different—the key thing is to choose a metric that reflects your business and your customers’ expectations. When you can glance at that metric in your triage process and get a quick gauge of what's going on, it will speed up your ability to answer questions about an incident’s business impact.
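The idea of triaging on a single business metric can be sketched as a simple threshold check. This is a hypothetical illustration, not PagerDuty’s actual tooling; the metric name and threshold values are assumptions you would replace with ones that reflect your own customers’ expectations:

```python
def triage_severity(p95_notification_latency_s: float) -> str:
    """Map an end-to-end business metric reading to an incident severity.

    The metric (p95 notification latency, in seconds) and the thresholds
    below are illustrative assumptions, not real PagerDuty values.
    """
    if p95_notification_latency_s >= 300:
        # Customers are clearly impacted: treat as a major incident.
        return "SEV-1"
    if p95_notification_latency_s >= 60:
        # Latency is creeping up: start incident response immediately,
        # even if host-level metrics (CPU, API response times) look normal.
        return "SEV-2"
    return "OK"

print(triage_severity(15))   # within expectations
print(triage_severity(90))   # latency creeping up
print(triage_severity(600))  # clear customer impact
```

The point of keeping the check this simple is that anyone on the response call can glance at one number and answer “what is the business impact?” without debating it.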
While triage is a nuanced process with many inputs, don’t forget that the output should be a decision on how to respond. If the business metrics say it’s a SEV-2 but it feels like a SEV-1, then respond as if it’s a SEV-1. Take command by keeping everyone focused on mitigating the customer impact. Additionally, when training incident commanders, decision-making is a critical point of emphasis—one of the easiest ways to waste time on a response call is to discuss incident severities.
You may have noticed that I haven’t used the term “incident manager” here, which I see frequently in centralized operational models. Instead, PagerDuty took inspiration from the National Incident Management System (NIMS) model, which is considered state of the art when it comes to incident response, albeit in a rather different vertical than IT operations. Using that model, we defined our roles to ensure that the incident commander could be most effective; in contrast, most other models typically have incident managers focus more on scribe and communications liaison activities.
In addition to keeping everyone focused, incident commanders drive decision-making, take input from subject matter experts, and quickly establish consensus to keep things moving toward resolution. Incident commanders also assign tasks to specific people with a specific time allotment and follow up when tasks are completed—this helps avoid the bystander effect that so often plagues emergency responders.
Not everyone is cut out to be an incident commander. You need to be as cool as a penguin, with a bird’s eye view of your systems, and be able to maintain authority in the face of stressed-out executives who have the tendency to try and “swoop and poop” on your response team. The good news is that the process can be practiced.
We take the opportunity to practice our incident response process through our chaos engineering experiments—what we call “Failure Fridays.” In fact, you can even make it fun by using games such as Keep Talking and Nobody Explodes! Just remember: After you make it fun, make sure you make it boring! Incident response is best for you and your business when it’s boring.
Dave Cliffe is a bird-brained software engineer who has led various product management and product marketing initiatives at PagerDuty. Before PagerDuty, he filled a variety of roles at Microsoft (on the Azure team) and Amazon (launching their grocery business).