The goal of problem management is to identify the root cause(s) of an incident, or group of incidents, and to prevent the issue from occurring again, if possible. If the issue can’t be prevented, then this process should at least lessen or eliminate the impact on customers, should the incident arise again.
Over time, a variety of methods have been put into practice to eliminate false factors and to identify root causes, contributing causes, and triggers. One of the simplest and most adaptable methods for identifying and remediating problems is the “Five Whys”. This article will focus on this method and the power it brings in asking the right questions.
The Five Whys (5W’s) was originally introduced by Sakichi Toyoda, the founder of Toyota Industries, and then taken mainstream by Taiichi Ohno. Ohno was the pioneer of the Toyota Production System, and whenever a problem presented itself on the production floor, he would direct his staff to explore problems first-hand until the root causes were found. “Observe the production floor without preconceptions. Ask why five times about every matter,” he encouraged. Toyota implemented this “go, see, and clarify” approach in part because of its simplicity.
The practice may be simple, but getting to the root cause, regardless of method, cannot be done in a vacuum. For this process to succeed, there are important steps that need to be taken before you start asking why.
Gather relevant team members
Root cause analysis discussions should include those with the right technical backgrounds to help define and identify the incident, and to assist in gathering and analyzing the necessary data to answer the Why.
Depending on the incident being reviewed, team members might include representatives from infrastructure teams (network, database, server), application developers, application owners, and problem managers or analysts, if available. If the team members weren’t directly involved with the incident, it’s important to provide them with some details of the incident: the date and time of the event, the system(s) impacted, and any identified behaviors or errors that occurred during the event. Ideally, this will allow those engaged to gather the system data or logs that may be necessary to answer “Why”.
Define the problem statement
Problem statements should be brief, but give a clear explanation of the issue. A good example is this: “The customer-facing order entry system was displaying an error to all customers attempting to log in to the system.”
Start asking, “Why?”
Using the example above, initial responses may come easily, but others may take longer to answer depending on the complexity of the issue, the amount of data involved, and whether outside resources or vendors need to be engaged to assist.
Here is a sample discussion of this process in practice:
Question: Why were customers getting an error when attempting to log in to the order entry system?
Answer: There was an error occurring between the system and the back-end database.
Question: Why was there an error between the system and the back-end database?
Answer: The database was rejecting login authentication requests.
Question: Why was the database rejecting login authentication requests?
Answer: Customer data in the database was missing.
Question: Why was customer data missing?
Answer: The data was missing after a process in the nightly batch update failed to complete.
Question: Why did the batch fail to complete?
Answer: The system password associated with the database update process expired, causing the process to fail.
In this scenario, five iterations of the question reveal that the customer-facing errors were caused by an expired password in a back-end process.
Stepping through the questions in such an elemental way allows us to trace the path from the incident back to its origin, and so to identify the root cause. In addition, the answers to some of these questions may shed light on other opportunities within the system processes to address the root cause, and help us see where additional preventative or protective measures may be possible, should a password issue arise again.
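For teams that record these discussions in a script or tool, the chain above can be captured as a simple list of question-and-answer pairs. The following is only a minimal sketch in Python: the names `WhyStep`, `FiveWhys`, `ask`, and `root_cause` are hypothetical and chosen for illustration, and treating the last answer as the candidate root cause is a simplifying assumption, not a rule of the method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WhyStep:
    """One iteration of the discussion: a 'why' question and its answer."""
    question: str
    answer: str


@dataclass
class FiveWhys:
    """Hypothetical record of a Five Whys discussion for one problem statement."""
    problem_statement: str
    steps: List[WhyStep] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> None:
        # Append the next question and the answer the team agreed on.
        self.steps.append(WhyStep(question, answer))

    def root_cause(self) -> str:
        # Assumption: the final answer in the chain is the candidate root cause.
        if not self.steps:
            raise ValueError("no questions have been asked yet")
        return self.steps[-1].answer


analysis = FiveWhys(
    "The customer-facing order entry system was displaying an error "
    "to all customers attempting to log in to the system."
)
analysis.ask(
    "Why were customers getting an error when attempting to log in?",
    "There was an error occurring between the system and the back-end database.",
)
analysis.ask(
    "Why was there an error between the system and the back-end database?",
    "The database was rejecting login authentication requests.",
)
analysis.ask(
    "Why was the database rejecting login authentication requests?",
    "Customer data in the database was missing.",
)
analysis.ask(
    "Why was customer data missing?",
    "A process in the nightly batch update failed to complete.",
)
analysis.ask(
    "Why did the batch fail to complete?",
    "The system password for the database update process expired.",
)

# Print the chain from incident to origin, then the candidate root cause.
for number, step in enumerate(analysis.steps, start=1):
    print(f"{number}. {step.question} -> {step.answer}")
print("Candidate root cause:", analysis.root_cause())
```

Keeping every intermediate answer, rather than only the final one, preserves the contributing causes along the way, which is where additional preventative measures are often found.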
Depending on the complexity of the incident and associated root cause, it’s possible that you may need to ask why more than five times. However, if done thoroughly and without including assumptions, rarely should there be fewer than five questions.
The 5W technique has thrived because of its simplicity and adaptability. It is not, however, a one-size-fits-all approach to root cause identification. It may be used in conjunction with other techniques, depending on the complexity of the system or the effects being noted. In short, facilitating a root cause discussion requires listening, being inquisitive, and recognizing whether this is the right approach for your scenario.
Angie Handley has a broad background in information technology and is an experienced IT leader in application production support, technical training, and incident and problem management. She is ITILv4 and HDI Problem Management certified and has a degree in Information Technology. Her career passions include building effective, happy teams, creating lasting partnerships in both IT and business, and leveraging business and IT process improvement techniques to make processes more efficient, cost-effective, and reliable for IT staff and their customers. Currently, Angie is the Manager and Process Owner of Problem Management at Huntington Bank. Outside of work, Angie enjoys working with local animal rescues, volunteering with Lions International, and playing pickleball.