Let’s assume you already understand the importance and value of a strong problem management process. Perhaps you are getting ready to build your process and team, already have one, or are trying to evolve your current program and are looking for different perspectives.
The intent of this article is to describe the mechanics of implementing a problem management function. The information I share is based on my experiences. There is no wrong way of doing problem management, as long as it functions well within your organization and provides for the outcomes expected.
As with every new process, it is paramount to ensure you have identified all the appropriate stakeholders and are involving them in the process design. Before you can begin designing, however, you need to make an important decision: Will you have one process owner or will you have many process owners? A centralized model will provide for one process owner that designs the process for the entire company, again making sure to gain buy-in from all stakeholders. The decentralized model will provide for many independent process owners, each typically in a different organization throughout the company.
Regardless of the size of your company, your design process is relatively similar. At its highest level, you will want to make sure you are speaking to the model, any policies inherent to the process, the inputs and outputs of the process (or in other words, how you expect people to engage the process and what they can expect from it), roles and responsibilities, and metrics (remember, without metrics you don’t have a way to measure the effectiveness of your process). Once you have this foundation then you can start filling in gaps as needed, including RACI charts, process flow diagrams, glossaries, or appendices to applicable procedures or other documents.
Take the time to design your own process. Do not take a process designed for another company and assume it will work for yours. Every company is different. It never hurts to get different perspectives. But remember there is no one right way to do problem management.
With the centralized approach mentioned above, you typically have one process owner. They will work with other stakeholders to make sure they capture the most ideal process possible, and they will also be responsible for facilitating any process meetings, holding teams accountable to the process, reporting metrics, and ensuring the process is fit for purpose. This approach ensures there is one process (or source of truth), so that everyone is performing tasks or actions the same way with consistency. You may encounter a challenge with this model because you elected for consistency in approach over the entire lifecycle; the process doesn’t always allow for flexibility within the individual teams that make up the company, which may have different goals, objectives, and policies. With this approach, you might consider having this role exist in a “neutral” group like a PMO to avoid any feeling of bias. In a smaller company that may not have a PMO, it could be a role that reports directly to an executive.
In a decentralized model, you have many different process owners, and these process owners will design the process that works for their part of the organization, building upon a framework that defines what has to be consistent between the organizations. These consistencies can be priority/severity matrices and definitions, rules for team engagement, and meeting and reporting facilitation. This gives the individual process owners the ability to design the model that works for their organization, including roles and responsibilities, process flows, techniques and data sources for trending analysis, root cause analysis methodologies, and metrics. This approach gives you a high degree of flexibility. The challenges you can face include a lack of consistency and an inability to speak to the process holistically. From an auditing perspective, who’s right? What is the source of truth? As long as it is documented thoroughly and spoken to, it shouldn’t matter.
Team Structure and Skillset
After your process is documented and published (regardless of whether it is centralized or decentralized), you need to decide what your team structure will look like. In this section, I will discuss a centralized versus decentralized team structure, or dedicated positions versus roles. I will also discuss the different skillset needs for each structure.
A centralized team generally has dedicated positions whose sole purpose is to manage the process, identify problems, facilitate root cause analysis discussions, and engage teams to establish preventative and corrective actions. This team will also host and facilitate problem review meetings and report out on metrics. A centralized team will also have only one process owner. In a centralized team, the skillset can vary depending on the needs of your organization. In the most ideal circumstances, you can identify individuals who have the technical skillset and are able to facilitate and manage process. These individuals are a little harder to find “out of the box,” but if you hire for technical skill, you can usually teach process facilitation.
There are many merits in having a centralized team:
- One stop for all problem management efforts
- One accountable party
- One set of “best practices” used across the company
- Complete visibility on all problems that span the company
- One focus on underlying problems
- Understanding any dependencies that exist
By having a dedicated team, you lessen the workload on other organizations, allowing for other teams to focus on enhancements or “net new” projects.
The potential downside to this model is that you are now paying for dedicated headcount and resources to support this team that wouldn’t necessarily exist with a decentralized structure.
A decentralized team structure is usually one of two options:
- In the first option, the process is owned, managed, and facilitated by one process owner within each organization. Teams will report problems up to the process owner and look for that person to help represent those problems to the problem review board and prioritize the resources to investigate and correct problems. This can be ideal for companies that can’t afford or choose not to spend the additional operating expense on hiring for dedicated positions.
- The second option can be more of a hybrid structure. The team can be technically oriented, or process oriented, or both. The advantage of having a team structure like this is that it is more self-contained. The potential downside to this model is the limited visibility you have to problems throughout the rest of the company.
Problem Review Board
Every problem management function should have a problem review board, regardless of what you call it. You typically have some latitude as to how you structure or define your problem review board, which will be guided by the size or structure of the company. Problem review boards, much like change advisory boards, can be one single group of individuals or many smaller regionalized or product/function-based groups.
The role of the problem review board is to review problems, approve resource utilization, prioritize what to correct, and in some cases approve the cost/budget to correct. This board may meet once per week, or more or less often, depending on the number of problems that exist and the bandwidth and commitment to correct them. Multiple boards might be required in much larger, global companies or in companies that keep their products in silos, with operations, development, product management, and support all dedicated to one product and not to other products or services.
Some companies associate problem management with post-major-incident root cause analysis, and that’s where they stop. That only tells a portion of the story.
If you really want to be effective with problem management, you have to employ trending analysis in your process. Many refer to this as proactive problem management, although as some will contend it’s still reactive as the incidents have already occurred. Regardless of what you choose to name it, the value you can derive from trending analysis can be tremendous, depending on how you choose to examine the data.
It can become habit to only look at trends in technology when examining data. While these are likely to be the majority of your trends, make sure that you are examining every causal area. You must consider trending and bucketing the data based on people, process, and technology.
Any person who has responsibility for problem management should consider themselves an investigator, and any good investigator will tell you to examine all the details and facts, not only one investigative path. Focus your attention and efforts on the areas that will produce the most value. But, if you have the bandwidth, you should occasionally question what would otherwise be overlooked. As an example, if you notice that support calls always get routed to a particular person or that changes always fall to a particular technician, you may find through root cause analysis that there is an issue with the support ACD routing application, or that changes always fall to this technician because other organizations only submit changes for Monday nights which happens to be when this person works.
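As a rough illustration of bucketing incident data by people, process, and technology, here is a minimal Python sketch. The incident records and the field names `causal_area` and `cause` are hypothetical; in practice the data would come from your own ticketing system, and the threshold for what counts as a trend is an assumption you would tune.

```python
from collections import Counter

# Hypothetical incident records; field names are assumptions for illustration.
incidents = [
    {"id": "INC-1", "causal_area": "technology", "cause": "disk full"},
    {"id": "INC-2", "causal_area": "process", "cause": "change not reviewed"},
    {"id": "INC-3", "causal_area": "technology", "cause": "disk full"},
    {"id": "INC-4", "causal_area": "people", "cause": "single on-call technician"},
    {"id": "INC-5", "causal_area": "process", "cause": "change not reviewed"},
    {"id": "INC-6", "causal_area": "technology", "cause": "failed failover"},
]


def bucket_by_area(records):
    """Count incidents per causal area (people, process, technology)."""
    return Counter(r["causal_area"] for r in records)


def recurring_causes(records, min_count=2):
    """Return ((area, cause), count) pairs seen at least min_count times."""
    counts = Counter((r["causal_area"], r["cause"]) for r in records)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


area_counts = bucket_by_area(incidents)
trends = recurring_causes(incidents)

print(area_counts)
print(trends)
```

Counting by causal area first keeps people and process trends visible even when technology causes dominate the data.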
Root Cause Analysis
Over the years there have been quite a few RCA techniques introduced into the market with varying levels of benefit. My advice is to always employ the technique that will give you the best outcomes. The techniques we will examine are the 5 Whys, the Ishikawa (fishbone) diagram, brainstorming, and the Sologic method.
5 Whys. The 5 Whys technique asks why five times and at the end assumes the final why is the root cause. Under this notion, it is a rather simplistic method. However, if you expand the 5 Whys by establishing categories of investigative path, such as “Why did the technology fail?” “Why did it take so long to restore?” and “Why didn’t we catch it before users noticed it?” and then ask the five whys within each category, you may get a lot closer to establishing not only the root cause, but all the other causes that contributed.
This technique’s primary value is its simplicity; you don’t have to be very experienced to apply it, and it helps facilitate RCA discussions. The challenge with this technique is that you have no way of tying together causes that reside in separate categories, if you choose to categorize your paths at all.
Ishikawa Diagram. This technique, typically known as the fishbone diagram, has as its core purpose to establish multiple categories to investigate, thus providing for a more thorough examination of the issue. A set of categories could include management, process, environment, people, hardware, and software.
The value with this approach is the visual depiction of the issue and all the causes leading to the focal point. This has great value in helping to identify preventive actions, as my focus is not just on root cause but all the causes that may have led to an issue. This method still has no clear way of tying causes together that reside in separate categories. When I use this technique, I might be challenged to visually tie “Failover Software Not Installed” to “Server A Failed” because one is in a hardware category and the other in a software category. The other challenge (much like the 5 Whys) is that there is no notion of providing evidence for the causes you are noting.
Brainstorming. Brainstorming provides a forum where you can have a somewhat “controlled chaos” discussion. The goal is not to examine the causes but rather to state what you believe might be the root cause and then rule out or prove each cause until you come to the one that everyone can agree is the “right one.”
The value brainstorming provides—regardless of whether you use it as a standalone or with another technique—is that it doesn’t seek to prove. This means that you aren’t yet challenging people’s thoughts and ideas. When you take this approach, you are more likely to elicit full participation because no one has a fear of failing. The challenge you may face with this technique is that you can get off track as people share ideas.
The Sologic Method. Sologic root cause analysis is my favorite RCA technique. Whether I am doing this in software or with Post-it notes, I start with the brainstorming technique and then move into the Sologic method of charting, which consists of some simple concepts:
- Provide evidence for my causes. Lack of evidence doesn’t mean I don’t chart it, but it does mean that I might not apply preventative actions or expend large amounts of time on it, since I can’t prove it.
- Tie all causes to each other, regardless of category. This means that if I can’t tie a cause to the chart, then it’s likely not applicable.
- Ensure I have a holistic picture of all causes that contributed to a particular issue. This is how I will address not only the root cause but also the causes that, once corrected, can prevent future incidents.
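The concepts above can be loosely sketched as a small cause-and-effect graph. This is not Sologic’s software or notation, only an illustrative Python sketch under my own assumptions: each cause records the effect it leads to and the evidence supporting it, and the cause names, the `leads_to` and `evidence` fields, and the `focal_point` label are all hypothetical.

```python
# Each cause points at the effect it contributed to; "focal_point" is the
# issue under analysis. All names and evidence strings are made up.
causes = {
    "defective line of code": {"leads_to": "focal_point", "evidence": "stack trace"},
    "code not peer reviewed": {"leads_to": "defective line of code", "evidence": "no review record"},
    "not tested preproduction": {"leads_to": "defective line of code", "evidence": None},
    "vendor outage last year": {"leads_to": None, "evidence": "vendor report"},
}


def ties_to_focal_point(name, chart, focal="focal_point"):
    """Follow leads_to links; True if the chain ends at the focal point."""
    seen = set()  # guards against accidental cycles in the chart
    while name is not None and name not in seen:
        if name == focal:
            return True
        seen.add(name)
        name = chart.get(name, {}).get("leads_to")
    return False


# Causes that connect to the focal point stay on the chart.
connected = [c for c in causes if ties_to_focal_point(c, causes)]
# Causes that cannot be tied in are likely not applicable.
untied = [c for c in causes if c not in connected]
# Charted causes without evidence: keep them, but be cautious about acting on them.
unproven = [c for c in connected if causes[c]["evidence"] is None]

print(connected, untied, unproven)
```

Keeping unproven causes on the chart while flagging them separately mirrors the first concept: lack of evidence does not remove a cause, but it does affect how much effort goes into it.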
Using this method, I identify not only the line of code that may be causing a software defect, but also that the code wasn’t peer reviewed, wasn’t tested in preproduction, and wasn’t validated post-implementation. The code is the only piece I have to fix in order to correct that defect, but addressing the other causes I identified will help ensure other defects are caught before ever impacting a user.
In reviewing how this chart is visually represented, I can see every cause that led to this issue, and I can see how they tie together and how they tie to the focal point. I can also tie in evidence and categorize the causes. This allows me to utilize aspects of all three techniques and provide for a visual representation that end users can follow, regardless of lack of technical expertise.
An example of a Sologic chart is below.
The final standout aspect of this technique that sets it apart from brainstorming, Ishikawa, and 5 Whys is that this method not only speaks to root cause analysis, but also to identifying preventative actions. The Sologic method will have me examining an action against a set of criteria to ensure it’s a valid action. Criteria may include:
- Ease of Implementation
- Return on Investment
- Potential Negative Impacts
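One way to picture evaluating actions against criteria like these is a simple scorecard. The sketch below is my own illustration, not part of the Sologic method: the 1-to-5 rating scale, the scoring rule, and the example actions are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CandidateAction:
    name: str
    ease: int             # 1 (hard) to 5 (easy) to implement
    roi: int              # 1 (low) to 5 (high) return on investment
    negative_impact: int  # 1 (little risk) to 5 (high risk of side effects)

    def score(self) -> int:
        # Assumed rule: reward ease and ROI, penalize potential negative impacts.
        return self.ease + self.roi - self.negative_impact


# Hypothetical candidate actions with made-up ratings.
actions = [
    CandidateAction("Add peer review step", ease=4, roi=4, negative_impact=1),
    CandidateAction("Rewrite the module", ease=1, roi=3, negative_impact=4),
    CandidateAction("Add post-implementation validation", ease=3, roi=4, negative_impact=1),
]

# Highest-scoring actions first.
ranked = sorted(actions, key=lambda a: a.score(), reverse=True)
for action in ranked:
    print(action.score(), action.name)
```

A scorecard like this does not replace judgment; it only makes the comparison between candidate actions explicit enough to discuss with a problem review board.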
One challenge some face with this technique is the time involved to do it correctly. While the technique is not difficult, it is very thorough, as it examines all aspects of the problem. As such, the time investment can be higher than that of the previously mentioned RCA techniques.
Consider the Costs
If you are considering whether or not to establish a problem management function, take the time to examine the operational cost of supporting incidents and then get a sense for the cost to implement a problem management function. Then determine ROI, and your decision should be a lot easier. As an example, at one previous company, our problem management efforts reduced major incidents by 70 percent. If you treat that as a possibility and assume you have 10 major incidents per year, you could potentially reduce that to 3 major incidents per year. The IT Process Institute estimates that the average major incident is around 200 minutes in duration, and at an estimated $5,000/min, that’s $1 million per outage. Avoiding 7 of those outages means there is a real possibility of reducing your major incident cost by $7 million per year. That more than pays for any additional cost incurred by your problem management efforts.
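The arithmetic in this example can be reproduced with a short Python sketch so you can substitute your own figures. The incident count, duration, cost per minute, and reduction percentage come from the example above; the `program_cost` value is a purely hypothetical placeholder, since the cost of a problem management function will differ for every company.

```python
def annual_incident_cost(incidents_per_year, minutes_per_incident, cost_per_minute):
    """Total yearly cost of major incidents."""
    return incidents_per_year * minutes_per_incident * cost_per_minute


incidents_per_year = 10
minutes_per_incident = 200
cost_per_minute = 5_000
reduction_pct = 70  # percent reduction in major incidents

# Integer arithmetic avoids floating-point rounding surprises.
remaining_incidents = incidents_per_year * (100 - reduction_pct) // 100

cost_per_outage = minutes_per_incident * cost_per_minute
baseline_cost = annual_incident_cost(incidents_per_year, minutes_per_incident, cost_per_minute)
reduced_cost = annual_incident_cost(remaining_incidents, minutes_per_incident, cost_per_minute)
savings = baseline_cost - reduced_cost

# Hypothetical placeholder: replace with your own estimate of yearly program cost.
program_cost = 500_000
net_benefit = savings - program_cost

print(f"Cost per outage: ${cost_per_outage:,}")
print(f"Remaining incidents: {remaining_incidents}")
print(f"Annual savings: ${savings:,}")
print(f"Net benefit after program cost: ${net_benefit:,}")
```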
Problem Management Done Right
Please know that problem management doesn’t have to be complicated, and every organization, regardless of size, can have an effective problem management function, as long as you take the time to identify what’s consumable for your organization. Remember that for this to achieve maximum effectiveness, all organizations that have to participate must agree that the need exists and, most importantly, must understand or agree on the value that problem management can provide to your company. Without this agreement and consensus, it can be exceptionally challenging to resolve problems, especially those that have dependencies on other organizations. Know that you have options with process facilitation and team structure, and one size doesn’t fit all. Finally, establish your trending practices with a focus on a holistic view of people, process, and technology. When it comes to root cause analysis, the most important factor is that you are able to identify actions that stop the causal chain, regardless of the methodology you choose to employ.
Matt Kade is a senior manager at Ellie Mae, Inc., accountable for service management in the technical support organization. He has been applying ITSM principles for more than a decade, with the majority of focus on major incident and problem management. He has built and managed several problem management teams in varying configurations with the express intent of reducing and/or eliminating problems before they become major incidents. Matt holds several certifications, including ITIL Expert, HDI Problem Management Professional, Sologic Root Cause Analysis Master Facilitator, KCS Principles, and HDI Support Center Manager.