We’ve all been there. A major incident happens. Email goes down or a major server crashes or any number of other fires. Everyone pulls together and fixes the immediate incident, putting the fire out. Often, we lose sleep, stay up all night and work way beyond our normal hours. When it is over, we pat ourselves on the back and give accolades and celebrate what a great team we are…and then we forget it.
It’s easy to do. Some other fire starts, or another project needs our attention, and we are tempted to never circle back and figure out exactly what happened or why, and we take few, if any, steps to make sure it doesn’t happen again. And then when it does happen again in a few weeks or months we scratch our heads and wonder why we are doing it all over again.
Implementing a root cause analysis (RCA) process in your organization helps close that loop. If you are aspiring to ITIL best practices, root cause analysis is a part of problem management. It is also a big part of change management and risk management. But root cause analysis has deep roots (pun intended) outside of IT, going back to well before we ever thought about IT best practices.
Root cause analysis is commonly traced back to Sakichi Toyoda, who is credited with the 5 whys method as a way to get to the real problem. The method later became a part of Toyota’s manufacturing process. Today, root cause analysis is used in almost every industry, from publishing to engineering. It’s helpful in any field where complex systems or cause-and-effect relationships exist.
Regardless of what methodology your organization is using, root cause analysis can play an important role in continuous improvement and general operations, especially in IT, and especially if you are concerned about customer experience. Root cause analysis keeps us from making the same mistakes, which customers notice and which lead to distrust.
So how do you go about doing a root cause analysis? Most experts agree on the following steps:
1. Define the Problem
It seems really simple, but defining the problem might not be as obvious as it looks. It is not uncommon to begin the process defining the problem a certain way only to realize the problem is not what you originally thought. Be careful not to jump to conclusions too quickly. It is also a good idea to keep this description simple and in words that even non-technical people will understand.
2. Gather Data
This step is where it gets really fun. It makes me feel like a crime scene investigator. There are several techniques for gathering data, but the key point is to keep looking. The 5 whys technique is a great way to do that. In its simplest form, you keep asking why in response to every answer, which takes you deeper and deeper into the issue. For example:
Why did you fall? The floor was wet
Why was the floor wet? Something spilled
Why did something spill? People take water from one part of the room to another
Why do people take water from one part of the room to another? To water plants across the room
Why are the plants across the room or even why are the plants not closer to the water? Because that’s where they’ve always been (Now we’re getting somewhere.)
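If it helps to capture the chain in an RCA record, the questions and answers above can be stored as ordered pairs, with the deepest answer treated as the candidate root cause. This is only a minimal sketch: the `WhyChain` class and its method names are made up for illustration and do not come from any standard RCA tool.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """An ordered list of (question, answer) pairs for one problem."""
    problem: str
    steps: list = field(default_factory=list)

    def ask(self, question, answer):
        # Record one "why" and its answer, then return self to allow chaining.
        self.steps.append((question, answer))
        return self

    def candidate_root_cause(self):
        # The deepest answer reached so far, or None if nothing was asked yet.
        return self.steps[-1][1] if self.steps else None


chain = (
    WhyChain("Someone fell")
    .ask("Why did you fall?", "The floor was wet")
    .ask("Why was the floor wet?", "Something spilled")
    .ask("Why did something spill?", "People carry water across the room")
    .ask("Why do people carry water across the room?", "To water the plants")
    .ask("Why are the plants across the room?", "That's where they've always been")
)

print(f"{len(chain.steps)} whys asked")
print("Candidate root cause:", chain.candidate_root_cause())
```

Keeping the whole chain, rather than only the final answer, lets a reviewer see how the conclusion was reached and spot a step where the reasoning jumped too far.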
Gathering data also means interviewing everyone involved, and I do mean everyone. I’ve seen many IT RCAs done that included lots of technical information without talking to the service desk or remembering to gather data around customer impact. It is important to get everyone’s perspective, as each of them might help uncover part of the problem. Look for ways to quantify the problem when possible in dollars, time, and effect. Be sure that you have a good timeline/chronology of events.
In addition to the 5 whys, there are other tools and techniques that you can use. Fishbone diagrams are helpful when the root cause is completely unknown. They guide you to brainstorm. I’d also recommend Pareto analysis, based on what most call the “80/20 rule,” which says that roughly 80% of effects come from 20% of the causes. Pareto analysis will specifically help with the next step.
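As a rough illustration of Pareto analysis, the sketch below tallies incidents by cause category and returns the smallest set of top categories that accounts for about 80% of them. The incident data is invented for the example, and the function name `pareto_vital_few` and the 0.8 default threshold are assumptions rather than part of any standard tool.

```python
from collections import Counter


def pareto_vital_few(causes, threshold=0.8):
    """Return (cause, count, cumulative_share) rows, most frequent first,
    stopping once the cumulative share reaches the threshold."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, running / total))
        if running / total >= threshold:
            break
    return rows


# Hypothetical incident log: one entry per incident, labeled by cause category.
incidents = (
    ["missing process"] * 11
    + ["hardware failure"] * 6
    + ["expired certificate"] * 2
    + ["human error"] * 1
)

for cause, count, share in pareto_vital_few(incidents):
    print(f"{cause:20} {count:3d}  {share:.0%}")
```

With this made-up data, two of the four categories cover 85% of the incidents, which is where you would focus first when looking for causes.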
3. Find the Cause(s)
Finding the cause is the reason for all of this. If you’ve thoroughly defined the problem and exhausted all the possible data sources, the cause or causes should be evident. These causes are usually classified as technical, human, or procedural. In my experience, it is more often multiple things than a single source. In the organizations I have worked in, lack of process has overwhelmingly been the biggest cause, with technical running a close second. I’ve found most human errors that occur can be prevented, caught, or corrected with good process.
In our example from the last step, I’d suggest the causes are technical, human, and procedural. The technical cause of the fall was that water was on the floor. The human cause is that someone spilled the water. The procedural cause is that the water has to be carried from one side of the room to the other.
One thing to note about RCAs versus other methods of seeking solutions to problems is that the focus is less on individual people and more on process. So, while the cause might be human, it should not be presented as an individual being at fault, but rather as a human error that happened. An RCA is not a place to call an individual out for neglect or poor performance. There are human resource procedures and policies that should deal with that. RCAs seek to correct problems and improve the organization, not address employee performance.
4. Find Solutions
Just knowing what the problem and the cause of the problem are is not enough. A critical part of the RCA process is to seek solutions. Solutions are often classified as corrective (a short-term or stop-gap solution that addresses the direct cause of the issue) or preventive (a long-term resolution that targets the root cause and prevents it in the future). Sometimes the solution involves spending money, but, as noted above, frequently the solution is process. Can you tell I believe in process?
In our example, the solution is not to clean up the water. That’s the resolution to the incident. The corrective solution could be to move the plants closer to the water. The preventive solutions might be to develop a process for watering the plants that will prevent water from being spilled, or even to run a water line to the plants.
5. Develop Strategies to Correct/Prevent
This step takes the causes you identified and the solutions you found and begins the hard task of strategizing to get better. It isn’t about correcting the initial presenting issue. It is about thinking, from a 30,000-foot view, about how we can correct the overarching issue and/or how we can prevent it from happening again.
In our example, the strategies would be to outline what you’ll do from the suggestions in step 4. These are the same in our simple example, but strategies should be more detailed and often involve more steps than simply naming the solutions.
6. Report Out
I hate to admit this, but there have been times that I or the organization I have worked with has done all these steps only to then do nothing. All of your work to this point is wasted and moot if you do not share this information with those who need to see it. Back to my crime scene investigator role… What if a crime was committed, the medical team came out and addressed the initial injuries of those involved, law enforcement came and gathered evidence, statements were taken, a suspect was identified, and then nothing happened? If the evidence was never given to the proper authorities for an arrest to be made or if all that evidence was just thrown in the trash, wouldn’t that be an awful waste of time and a miscarriage of justice? Well, yes.
And so why do all this work and not report out on it? There are tons of RCA templates out there that you can borrow for your report. I don’t think one is necessarily superior to another. I think what is important is that you cover these steps. I lean on the side of brevity in the solutions and strategies sections. In general, these reports go to senior leadership, and I find that they have neither the time nor the attention to read the details. Always consider your audience.
7. Monitor the Solutions and Close the Loop
Finally, when an RCA is done, it isn’t the end. The work is just beginning. Make the strategies you’ve outlined actionable. Assign them to specific teams and/or individuals. Give them deadlines and outline how you’ll recognize success. A quarterly or annual review of RCAs, actionable items, accomplishments, etc. is a great idea. Close the loop. Make the work meaningful.
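One lightweight way to keep those actions from slipping is to record each one with an owner, a deadline, and a success measure, then list what is overdue at each review. The sketch below assumes a hypothetical `ActionItem` record; the field names, owners, and dates are illustrative only and would normally live in whatever ticketing or tracking tool you already use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    success_measure: str
    done: bool = False


def overdue(items, today):
    """Return open items whose deadline has passed, oldest deadline first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Hypothetical follow-up actions from the wet-floor example.
items = [
    ActionItem("Move the plants closer to the water", "Facilities",
               date(2024, 3, 1), "No water carried across the room"),
    ActionItem("Write a plant-watering procedure", "Office manager",
               date(2024, 4, 15), "Procedure published and followed", done=True),
    ActionItem("Get a quote for a water line", "Facilities",
               date(2024, 2, 1), "Quote received"),
]

for item in overdue(items, today=date(2024, 3, 15)):
    print(f"{item.due}  {item.owner}: {item.description}")
```

Passing `today` in explicitly, instead of reading the system clock inside the function, makes the same check easy to rerun for a quarterly or annual review date.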
Revisit Over Time
As with everything, review your RCA process from time to time. Make sure it is working for you and that it is producing what you want and need. What works in one place might not work in another. Constantly look for ways to improve. And don’t forget to celebrate your success. It matters.
Vicki Rogers has more than 20 years of IT experience and is currently the Senior Manager of Change at Georgia Tech. Previously, she was a senior IT manager at Amtrak and the service desk manager at the University of West Georgia. She has expertise in service management, change management, leadership development, and diversity in IT. She has been involved in service desk creation, implementation, and adoption of ITSM best practices, as well as insourcing IT. Vicki has a BBA in Business Management, an MBA, and an EdS in Learning, Leadership, and Organizational Development. Her graduate research involved cultivating and developing women leaders in higher education IT divisions. Vicki is a regular national speaker on leadership, service management, diversity, and change. Outside of work and school, Vicki is the mom of two brilliant and successful college girls and one very spoiled schnauzer. Follow Vicki on Twitter @vickirogers.