Problem management is one of the most effective methods of reducing the frequency of service outages or degradation. Incident management reacts to incidents, so the most it can achieve is a reduction in mean time to repair (MTTR). Problem management can prevent future outages, thus significantly improving service availability and quality, as well as reducing reactive efforts (and support efforts overall).
Many organizations have implemented problem management, per ITIL, by performing root cause analysis on important incidents, deriving lessons learned, and deploying corrective steps in order to prevent similar incidents from recurring. However, this process is still reactive in nature, as a response to one or more incidents. In this article, we will examine several entirely proactive techniques that are not triggered by incidents.
Incident management is the process responsible for managing the lifecycle of all incidents. The goal of incident management is to restore services as quickly as possible and to minimize the impact of incidents. Tickets are generated automatically or triggered by alarms or user calls. The tickets are then resolved in a timely fashion, but they may not identify the root cause of the incident.
Problem management is the process responsible for managing the lifecycle of all problems. The goal of problem management is to identify underlying causes and prevent the recurrence of incidents. Tickets are always created manually when additional analysis is required to determine the underlying causes. Time is not of the essence in resolving these tickets.
Problem management activities can be further classified into three categories: reactive, proactive, and preventative.
Reactive activities address (identify and resolve) underlying or unidentified issues. This is done through root cause analysis, which is a truly reactive activity. It must be performed on all incidents deemed significant: always for high-severity incidents, and occasionally for low-severity ones.
Proactive activities follow up on the root cause analysis. They include:
Infrastructure cleanup: After the root cause analysis identifies and corrects a configuration error on a particular device, all other devices of the same type are scanned for this error and repaired.
Lessons learned: The root cause analysis may identify issues in the process, training, tools, etc., that must be resolved in order to avoid similar incidents in the future.
These activities are both reactive and proactive: they’re triggered by incidents, but they address issues that can prevent future incidents.
Preventative activities consist of the analysis of various data sources in order to identify and correct issues that have not yet triggered incidents. These truly proactive activities will be discussed in more detail in this article.
Based on the above definitions, reactive and proactive activities mainly reduce the number of recurring incidents, while preventative activities reduce the number of new incidents.
Implementation and Governance
It’s important to understand the difference between reactive, proactive, and preventative activities. It’s akin to replacing a flat tire versus changing the oil in a car. Nobody waits until the engine breaks before they change the oil, but such preventative activities are not so widely implemented in IT organizations. The reason is that these activities are more difficult to justify when trying to secure budget for headcount and tools.
The benefits of preventative problem management are hard to quantify because it addresses potential outages, not actual ones. Other than basic regular maintenance tasks, preventative activities are only adopted by mature organizations, when service quality cannot be further improved by other support processes.
CSF and KPI
The main critical success factor (CSF) for problem management is a reduction in the number of incidents. Reactive problem management achieves this by reducing the number of recurring incidents; the obvious key performance indicator (KPI) is therefore the number of recurring incidents. Preventative problem management, however, can reduce the total number of incidents by preventing unique failures. Consequently, trending the total number of incidents over time should provide a good KPI for the above CSF. Unfortunately, incident volumes can increase over time as the size and complexity of the infrastructure increase. One way to address this is to normalize the number of incidents by defining the KPI as an incident rate: the ratio of the number of incidents to the number of devices in the infrastructure. While this method is imperfect in reflecting the size of the IT infrastructure, and it does not measure its complexity, it's a good approximation. Other normalization methods may be adopted, as long as they are kept consistent over time.

As with any KPI, the incident rate must have a target to be used in continual process improvement (CPI). Absolute targets are difficult to set, so relative targets are recommended (e.g., a 10% reduction in incident rate year over year).
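The incident-rate KPI and its relative target can be sketched in a few lines. This is a minimal illustration; the incident and device counts below are hypothetical figures, not benchmarks.

```python
# Minimal sketch of the incident-rate KPI with a relative (year-over-year)
# target. All figures are hypothetical illustrations.

def incident_rate(incidents: int, devices: int) -> float:
    """Incidents normalized by infrastructure size (incidents per device)."""
    return incidents / devices

def meets_target(rate_last_year: float, rate_this_year: float,
                 reduction_target: float = 0.10) -> bool:
    """True if the rate dropped by at least the relative target (e.g., 10%)."""
    return rate_this_year <= rate_last_year * (1 - reduction_target)

last_year = incident_rate(incidents=1200, devices=4000)   # 0.30 per device
this_year = incident_rate(incidents=1150, devices=4600)   # 0.25 per device
print(meets_target(last_year, this_year))                 # prints True
```

Note how the rate can meet its target even when infrastructure growth keeps the absolute incident count roughly flat, which is exactly why the normalization matters.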
Problem Management Group
In a typical support structure, the call center (tier 0) is employed only when the organization has a large user base that generates a high number of calls for issues and service requests. The call center staff takes user calls, opens incident or service request tickets, and routes tickets to specialized groups for resolution according to prescribed instructions. They resolve service requests but have no access to monitoring systems and perform no troubleshooting. Depending on the size of the organization, some of these functions may be consolidated into one group; conversely, a service provider with multiple enterprise customers may set up several call centers to serve subsets of customers.
The service desk performs the functions of the call center when that group does not exist. For some service providers the call center is the interface for external users while the service desk performs the same functions for internal users. In addition, the service desk addresses customer notifications and may perform limited troubleshooting based on troubleshooting guides. These troubleshooting steps may lead to the resolution of the issue or aid in the collection of diagnostic data and the triage of tickets. Like the call center, the service desk typically does not have access to monitoring systems.
The network operations center (NOC) is the main point of detection and resolution of incidents. Alarms are presented on the monitoring system and prioritized according to severity. The NOC opens tickets if the network management system (NMS) doesn’t generate them automatically and closes them on completion. Trouble tickets may also arrive from the service desk or call center.
There are usually a number of operations groups, specialized by technology. They take escalations from the NOC and work on the implementation of new projects. They also perform most of the problem management activities.
Engineering usually comprises various specialized groups that undertake design and development activities related to infrastructure and applications. They also take escalations from operations. When tier 3 support groups (i.e., operations groups) don’t exist, engineering must perform problem management activities.
Problem management activities aren’t performed in real time (i.e., they’re not triggered by specific events), but are undertaken periodically, on a daily or weekly basis. Many tasks can and should be automated, such as report generation and audits, but there are always tasks that require human analysis and decision.
It’s critical that problem management activities are performed by operations (or engineering) and not by lower-tier support groups. The reason is that, as stated above, the goals of incident and problem management are not only different, they’re contradictory. While incident management is concerned with rapid service restoration, problem management needs to discover root causes without time constraints. Furthermore, support staff engaged in incident management work in a fire-fighting mode, whereas problem management requires uninterrupted quiet time for in-depth analysis. Even in a small organization, where support tiers are compressed, it’s highly recommended that the resources working on problem management are dedicated either permanently or through rotation (i.e., they cannot be involved in emergencies or escalations).
The problem management team can be separate from other groups or even operate as a virtual team. In the latter case, subject matter experts (SMEs) from various operations groups work within their domains of expertise but communicate regularly to share knowledge and coordinate efforts. Some problem management activities require in-depth technical knowledge, while other activities require general analytical skills, which is why many problem management tasks require SMEs, not junior operators.
The problem management team will also collaborate with other groups, such as engineering, security, or performance and capacity management. It’s important to note that preventative activities are not projects with a beginning and an end, but are ongoing tasks that require tools and designated resources.
The main problem management tool is the ticketing system. Problem tickets must be separate from incident tickets, even when the ticketing tool is the same, for two important reasons: first, they require different fields and options in the ticket; second, they have very different metrics. Problem tickets, with their long resolution times, would skew the average resolution time if the two types of tickets were not segregated. However, the two ticketing systems must be coordinated, so that a problem ticket can be generated from an incident ticket and ticket cross-referencing is possible.
The problem ticketing system must also allow workflow automation. This is important when action items are generated for various groups as part of the lessons learned or follow-up plan. The action items can generate tasks that are assigned to responsible groups and can easily be tracked to completion.
Incident Analysis

Incident analysis is the most important preventative problem management activity. Except for regular maintenance, it should be the first activity to be implemented.
It requires a solid incident ticketing system, including the following features:
- A searchable database capable of supporting complex queries
- Adequate fields, including platform, device, software version, root cause category, etc.
- Complete information in the records, including the troubleshooting log
These characteristics will enable analysts to extract pertinent and accurate historical data from the incident database, according to various search criteria. This activity does not require deep technological expertise: the database queries must be automated (i.e., records are not reviewed individually), but the analysis of the query results is mostly manual. When problems are detected, analysts will open problem tickets and assign them to specialist groups for prioritization (e.g., Pareto analysis), analysis, and resolution.
The following techniques are employed in incident analysis.
Recurring Incidents

This method identifies underlying causes that were not resolved by closing past incidents. Here are the steps involved:
- Look for repeated incidents in the last thirty days on the same device, platform, module, or software version.
- Review recurring incidents to see if they had the same symptom(s).
- If they do, open a problem ticket and refer it to a specialist group (by technology).
One possible automation improvement is to have the incident ticketing system flag a recurring incident while it’s open. This can lead to faster resolution, by applying a past workaround, but it can also highlight potential issues in past solutions. Thus, the ticket may be flagged for additional care and escalated. Additionally, a problem ticket may be opened right after an incident ticket is closed.
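The recurring-incident scan described above can be sketched as a grouping over ticket records. The ticket fields (device, symptom, opened) and the sample data are hypothetical illustrations; a real implementation would query the incident ticketing system instead.

```python
# Sketch of a recurring-incident scan over a 30-day window: group tickets by
# (device, symptom) and flag any group that recurs. Data is hypothetical.
from collections import defaultdict
from datetime import date, timedelta

tickets = [
    {"device": "core-sw-01", "symptom": "line card reset", "opened": date(2024, 5, 2)},
    {"device": "core-sw-01", "symptom": "line card reset", "opened": date(2024, 5, 20)},
    {"device": "edge-rtr-07", "symptom": "BGP flap",        "opened": date(2024, 5, 21)},
]

def recurring(tickets, window_days=30, threshold=2):
    """Group recent tickets by (device, symptom); return groups that recur."""
    cutoff = max(t["opened"] for t in tickets) - timedelta(days=window_days)
    groups = defaultdict(list)
    for t in tickets:
        if t["opened"] >= cutoff:
            groups[(t["device"], t["symptom"])].append(t)
    return {key: grp for key, grp in groups.items() if len(grp) >= threshold}

for (device, symptom), hits in recurring(tickets).items():
    print(f"Open problem ticket: {device} / {symptom} ({len(hits)} incidents)")
```

The same grouping key could be platform, module, or software version instead of device, per the steps above.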
Top N

The Top N method (N = a small number) identifies underlying causes that triggered multiple incidents. This technique is similar to the previous one, except that the incident data is mined in a different way. The following steps are involved:
- Build Top N lists with the highest associated incident counts by device, platform, module, or software version (e.g., the top three devices with highest number of incidents).
- Analyze each list for common causes. It’s not enough to say device X is bad because there were many issues with it. Further analysis on device X must reveal a common cause. The analyst will look for common threads, such as incident closure codes/reasons and failure symptoms, or even review the troubleshooting log (though this is time consuming and should be avoided whenever possible).
- If a common cause is suspected, such as a software error or a hardware defect on a platform, device, or module, the analyst will open a problem ticket and assign it to a specialist group for further analysis and resolution.
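The list-building step above reduces to a frequency count over ticket fields. In this sketch, the ticket records and field names are hypothetical illustrations.

```python
# Sketch of building a Top N list from incident tickets by any field
# (device, platform, module, software version). Data is hypothetical.
from collections import Counter

tickets = [
    {"device": "core-sw-01",  "platform": "switch-X"},
    {"device": "core-sw-01",  "platform": "switch-X"},
    {"device": "core-sw-01",  "platform": "switch-X"},
    {"device": "edge-rtr-07", "platform": "router-Y"},
    {"device": "edge-rtr-07", "platform": "router-Y"},
    {"device": "app-srv-02",  "platform": "server-Z"},
]

def top_n(tickets, field, n=3):
    """Return the n values of `field` with the highest incident counts."""
    return Counter(t[field] for t in tickets).most_common(n)

print(top_n(tickets, "device"))
```

Each entry in the resulting list is then analyzed for a common cause, as described above, before any problem ticket is opened.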
Consider the following example: Problem management observed a steep increase in the number of incidents at a certain point in time. Further analysis showed that many of them were related to overnight changes performed on a new platform. They discovered that the cause was a lack of training on the new technology among the night-shift operators responsible for implementing changes.
Incident Grouping

With this method, the goal is to uncover underlying causes that trigger seemingly unrelated incidents. The following steps should be employed:
- Group incidents by root cause category, closure code, or incident reason.
- Depending on the ticket information (available fields and options), build lists by access, location, environment, administration, tools and instrumentation, related changes, user or staff procedural issues, contention issues, database errors, etc.
- Identify subtle or hidden issues in training, process and procedures, power and HVAC, and even governance and controls.
- Using Pareto analysis, open problem tickets for the potential causes that generate the most incidents.
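The Pareto step above can be sketched as selecting the smallest set of cause categories that accounts for most of the incident volume. The categories and counts below are hypothetical illustrations.

```python
# Sketch of Pareto analysis over incident counts grouped by root cause
# category: pick the few causes that cover ~80% of incidents.

def pareto_causes(counts, coverage=0.8):
    """Return the smallest set of causes covering `coverage` of all incidents."""
    total = sum(counts.values())
    selected, running = [], 0
    for cause, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(cause)
        running += n
        if running / total >= coverage:
            break
    return selected

# Hypothetical incident counts by root cause category
counts = {"change error": 40, "procedural issue": 25, "power/HVAC": 20, "other": 15}
print(pareto_causes(counts))   # the causes worth problem tickets first
```

Problem tickets would then be opened only for the selected causes, keeping the specialist groups focused on the highest-yield issues.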
This technique is not as widely used as the other two, but its usefulness should not be underestimated. Consider this example: For quite some time, a high number of unidentified incidents were observed. The failures occurred on different devices of dissimilar types, and they were attributed to transient hardware errors. Further analysis showed that they all happened in a particular physical location. That led to tests on power and environment, which eventually uncovered a grounding issue.
Infrastructure Analysis

The activities in this category are typically performed by technology SMEs, usually outside of the problem management group. For this reason, the SMEs should either be part of the virtual problem management team or collaborate closely with it. These tasks require additional tools and consist of automated audits, scripts, and manual analysis. The problem tickets created by these activities may or may not be resolved by the same SMEs. Because of the required in-depth expertise in technology, platforms, and software, collaboration with vendors and application development is essential.
The following techniques must be considered for implementation.
Syslog Analysis

Alerts (SNMP traps, syslogs) can be classified according to the following categories:
- High-priority alerts (error and failure alarms) that must be addressed in real time in order to avert incidents
- Low-priority alerts, including warnings and notifications
The volume of low-priority alerts is very large. These alerts are stored for use in postmortem investigations. However, syslog data contains a wealth of information ready to be mined for indications of potential hardware and software errors. The following steps can be used:
- Build a small set of low-priority syslogs that are good indicators of potential failures (e.g., data loss, interface flapping, storage or memory depletion).
- Scan your syslog database periodically for the defined set.
- Open problem tickets for matches.
- Over time, increase the set of predefined syslogs for analysis, adding new syslogs based on equipment vendor recommendations, application development input, or past failures.
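The periodic scan described above can be sketched as matching log lines against a predefined pattern set. The message strings and patterns below are hypothetical examples, not a vendor-specific catalog; real patterns would come from vendor recommendations and past failures, as noted above.

```python
# Sketch of scanning syslog data against a predefined set of low-priority
# patterns that may signal impending failures. Patterns and log lines are
# hypothetical illustrations.
import re

PATTERNS = {
    "interface flapping": re.compile(r"Interface \S+ changed state to (up|down)"),
    "memory depletion":   re.compile(r"memory allocation .* failed", re.IGNORECASE),
}

def scan(log_lines):
    """Return (category, line) pairs for every match in the predefined set."""
    hits = []
    for line in log_lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((category, line))
    return hits

logs = [
    "May 21 03:10:11 edge-rtr-07: Interface Gi0/1 changed state to down",
    "May 21 03:10:14 edge-rtr-07: Interface Gi0/1 changed state to up",
    "May 21 04:02:00 app-srv-02: memory allocation of 4096 bytes failed",
]
for category, line in scan(logs):
    print(f"Problem candidate [{category}]: {line}")
```

Matches become candidates for problem tickets; the `PATTERNS` dictionary is the "set of predefined syslogs" that grows over time.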
Another type of syslog analysis is similar to the Top N technique employed in incident analysis, except the analysis is performed on the syslog data instead of the incident data. Using this technique will not only identify potential failures based on top “talkers” but also reduce the number of alarms flowing into the network management system. (Note: Syslog analysis requires a syslog parser that takes its input from the set of predefined syslogs.)
Configuration Compliance

Various infrastructure assets must be configured with numerous parameters in order to optimize their functionality. Network devices, such as routers, switches, and firewalls, have complex configurations that are stored on the device and are regularly backed up to external storage. The same is true for servers and operating systems. Many of these devices perform similar functions, hence their configurations are supposed to be identical. However, over time, changes to the infrastructure introduce configuration discrepancies that may be detrimental to the operation of the device. A good practice is to maintain all the devices that perform similar functions to the same configuration standard. The process of configuration compliance comprises the following steps:
- Define standards (templates) for device configurations (e.g., routers, switches, servers, Windows, Linux, Unix). These templates are usually developed and owned by engineering.
- Run periodic audits to compare deployed configurations to the standards, and generate noncompliance reports. This step requires network configuration management tools that work on the deployed platforms, and there are tools that verify compliance to industry or proprietary best practices. Industry best practices are based on manufacturers’ recommendations; proprietary best practices are developed in-house, based on architecture and design standards, as well as past experience.
- Develop action items from the noncompliance report and assign them to relevant groups. This step can be partially automated using a workflow automation tool that assists in assigning tasks and monitoring their progress. Another way to automate this step is to enable the network configuration management tool to correct some of the common noncompliance items.
For network devices, the audit may slow down certain devices, impeding their normal operation. Hence, it’s recommended that the audit be run on the backup configuration files on external storage.
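At its core, the audit step compares each deployed configuration to the standard template. In this sketch, the template and configuration lines are hypothetical examples, and, per the recommendation above, the input comes from backup configuration files rather than live devices.

```python
# Sketch of a configuration compliance audit: compare a device's backed-up
# configuration to a standard template line set. Template and config lines
# are hypothetical illustrations.

TEMPLATE = {
    "service password-encryption",
    "ntp server 10.0.0.1",
    "logging host 10.0.0.5",
}

def audit(device, config_lines):
    """Report template lines missing from the config and nonstandard extras."""
    deployed = set(config_lines)
    return {
        "device": device,
        "missing": sorted(TEMPLATE - deployed),      # required but not deployed
        "unexpected": sorted(deployed - TEMPLATE),   # deployed outside the standard
    }

backup_config = ["service password-encryption", "ntp server 10.0.0.99"]
report = audit("edge-rtr-07", backup_config)
print(report["missing"], report["unexpected"])
```

The noncompliance report produced this way feeds the action items assigned to the relevant groups, and the clearest deviations can be auto-corrected by the configuration management tool.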
Hardware and Software Lifecycle
Both hardware and software assets have a limited lifespan, usually determined by the end-of-support (EoS) milestone. While an asset can remain operational beyond this date, the organization must develop support and/or upgrade plans for all assets in its inventory. (It's important to note that assets that no longer have vendor support will not benefit from bug fixes and software upgrades. Over time, this may generate additional support costs in failures and workarounds.) Lifecycle management comprises the following steps:
- Define a strategy and draft a road map for each hardware platform and software version, including triggers for upgrading the hardware or software (e.g., bugs, security issues, new features required), upgrade paths for all hardware and software platforms (i.e., replacement candidates), and mitigation plans for end-of-life (EoL) hardware and software, such as acquiring spare parts. Depending on supported applications and features, there may be different plans for the same hardware model or software version.
- Review all vendor security notices and field notices promptly, and update plans accordingly.
- Review each plan on a regular basis to reconsider costs and benefits, as well as to evaluate new technologies available on the market. This can be achieved by including an expiration time on all the plans (typically one year).
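The plan-expiration check in the last step can be sketched as follows. The asset names, dates, and field names are hypothetical illustrations.

```python
# Sketch of the plan-expiration check: each lifecycle plan carries a review
# date, and plans older than one year are flagged for review. Data is
# hypothetical.
from datetime import date, timedelta

plans = [
    {"asset": "router-Y software 12.4", "last_reviewed": date(2023, 3, 1)},
    {"asset": "server-Z hardware",      "last_reviewed": date(2024, 1, 15)},
]

def plans_due_for_review(plans, today, max_age_days=365):
    """Return assets whose plan has passed its expiration window."""
    return [p["asset"] for p in plans
            if today - p["last_reviewed"] > timedelta(days=max_age_days)]

print(plans_due_for_review(plans, today=date(2024, 5, 1)))
```

A simple periodic job like this keeps lifecycle plans from silently going stale between vendor notices.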
Other Preventative Activities
There are many other preventative activities that may or may not be coordinated by problem management. They all provide the same general benefits—that is, preventing service outages and degradation, thus improving service availability and quality. Problem management may have an active role in some of these activities, it may coordinate others, or it may just stay informed. It’s important that all activities are monitored and reported in order to show costs, benefits, and achievements. The following is a list of suggested preventative activities.
Regular maintenance activities, as recommended by the vendor and developed in-house: This is the most essential preventative component of good infrastructure support.
Regulatory audits: Depending on the industry vertical, these audits are usually mandatory.
Performance monitoring and capacity plans: These are usually developed by engineering and carried out by operations.
Service continuity plans and disaster recovery testing: Depending on the industry vertical, these may be mandatory or voluntary.
Strong validation and testing: This is a key risk mitigation component of release management.
Strong change and release management policies: Because changes to the infrastructure are the main source of incidents, strong risk mitigation policies are necessary.
Security management: This is one domain that doesn’t need any explanation or justification, as every organization is well aware of its criticality.
Starting on the Preventative Road
If your organization has not yet implemented problem management, or it is limited to root cause analysis, here are some practical considerations.
Start small, with a limited number of activities: Consider all of the preventative activities, and then pick the ones that are easiest to implement. Over time, increase the number of activities. Establish an objective of adding one new preventative technique every quarter, for example. Many preventative methods are technology-related, so specialized operations and engineering teams and SMEs must be engaged in devising new techniques.
Create a virtual team for communication and building expertise: The problem management team relies on the expertise and resources of other groups. Tight collaboration is key to success.
Rely on your equipment vendors for leading practices: Hardware manufacturers and software developers publish their recommendations and often will respond promptly to queries or requests for assistance or best practices.
As reactive workload decreases, move more resources into preventative work: One of the main benefits of preventative problem management is that it reduces reactive work. This provides an opportunity to increase preventative efforts and further improve service quality and availability.
Ensure continuing management support by communicating progress and success: Remember that problem management doesn’t show obvious benefits. Take an active role in celebrating achievements and continuously measuring and reporting benefits. Establish metrics, KPIs, and targets, and include them in the IT scorecard reported periodically to senior management.
Remember that preventative problem management is like being a detective (lots of fun!): Any person who works in IT support knows that preventative work performed from 9 to 5 beats fire-fighting during the graveyard shift.
Problem management is a critical support process that has a major impact on reducing the number of recurring incidents, and thus reducing the number of outages, improving service quality, and increasing availability. The preventative activities described here have the potential to increase these benefits even further by reducing the number of new incidents as well. The costs of running these activities will pay off in reduced support costs and increased customer satisfaction.
Gabriel Soreanu has more than thirty years of experience in the IT and communications industries. He has worked for Nortel, Deloitte, and Cisco. As an ITSM consultant, Gabriel has worked with global Fortune 500 customers in the service provider, financial, public sector, and other industries. Gabriel is the technical lead for the Cisco ITSM Practice. In addition to being a Professional Engineer and MBA, Gabriel holds various professional certifications, including ITILv3 Expert, COBIT, TOGAF, and PMP.