How to Handle Incident Management Like a Boss


by Sidharth Suri
June 1, 2016

Incident management can be like a jigsaw puzzle. The right pieces need to fit into place to see the complete picture. ITIL® defines an incident as

“An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident – for example, failure of one disk from a mirror set.”

Resolving incidents is definitely not for the faint of heart. You must look at incidents critically and face the (potentially) significant problems behind them. Time is money, and without a plan for resolution, you could burn countless dollars searching for a fix.

Stick with the Basics

Incidents come from all directions. But no matter the source, the steps are always simple:

Identify and Log the Incident. Incidents can be reported in a variety of ways: phone call, email, text, social media, and so on. If an incident arrives via your self-service portal, logging it is already done for you. If not, it’s the service desk team's job to properly log the incident so it's in the system.

Monitoring—that is, using technology that watches systems for unexpected changes—can predict potential incidents before an outage occurs or automatically log a ticket when an incident occurs. It’s important to ensure the monitoring tool creating an incident provides classification and relevant data (provided from configuration management) so the IT team can address the issue without having to do extra legwork to get the details. In any case, the incident needs to be logged and the information noted should include:

  • Incident reporter (i.e., a customer/user ID)
  • When it happened
  • What happened
  • A unique ID for the ticket itself
  • Any available information about the configuration item and affected services
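
As a rough sketch, the fields listed above could be captured in a simple record like the one below. The class name `IncidentRecord`, its field names, and the `INC-` numbering scheme are illustrative assumptions, not something prescribed by ITIL or by any particular ticketing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import List, Optional

# Hypothetical ID scheme: an in-memory counter. A real ticketing system
# would normally assign the unique ID itself.
_ticket_numbers = count(1)


def next_ticket_id() -> str:
    """Return the next unique ticket ID in the assumed 'INC-' format."""
    return f"INC-{next(_ticket_numbers):06d}"


@dataclass
class IncidentRecord:
    reporter_id: str                           # incident reporter (customer/user ID)
    occurred_at: datetime                      # when it happened
    description: str                           # what happened
    configuration_item: Optional[str] = None   # affected configuration item, if known
    affected_services: List[str] = field(default_factory=list)
    ticket_id: str = field(default_factory=next_ticket_id)  # unique ticket ID


# Example with made-up data.
incident = IncidentRecord(
    reporter_id="user-1042",
    occurred_at=datetime(2016, 6, 1, 9, 30, tzinfo=timezone.utc),
    description="Email service is rejecting logins",
    configuration_item="mail-server-01",
    affected_services=["email"],
)
print(incident.ticket_id)
```

Whether a person or a monitoring tool creates the record, the point is the same: every field is filled in at logging time so nobody has to chase details later.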

Assign a Logical Category. Know what issues are present and keep track of small bugs just the same as the big ones. Categorizing is critical for effective problem management. It helps to produce valuable reports that problem investigation teams can leverage to identify trends and emerging issues that need to be addressed.

Prioritize Everything. Prioritize every incident according to impact and urgency. Assess the impact on the business, including any potential compliance and security risks, then determine the urgency of resolving each. Address all open incidents based on the priority levels, which ensures the best level of service to the business.
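
One way to make "impact and urgency" concrete is a small scoring rule. The sketch below assumes three levels for each and an additive rule that yields priorities from 1 (most critical) to 5 (least critical). The `Level` names, the `priority` function, and the sample tickets are all hypothetical, so adjust the scale to whatever your service desk has agreed on.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Assumed rule: 1 is the most critical priority, 5 the least."""
    return int(impact) + int(urgency) - 1


# Each entry is (ticket ID, impact, urgency). The data is made up.
open_incidents = [
    ("INC-3", Level.LOW, Level.MEDIUM),
    ("INC-1", Level.HIGH, Level.HIGH),
    ("INC-2", Level.MEDIUM, Level.HIGH),
]

# Work the queue from most to least critical.
work_queue = sorted(open_incidents, key=lambda item: priority(item[1], item[2]))
for ticket_id, impact, urgency in work_queue:
    print(ticket_id, "priority", priority(impact, urgency))
```

Writing the rule down, even this crudely, keeps two agents from assigning different priorities to the same kind of incident.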

Respond and Report

Something happens. A ticket is filed. The IT team is on the case. The service desk, individually or as a team, forms a hypothesis about the cause of the incident, or may be able to identify the solution immediately.

This is where your knowledge base is critical. If the issue is common, there should be answers within a few clicks. Documenting wins and failures gives the team a path to follow. The knowledge base acts as a problem-solving road map.

Level 1 can try their hand at the incident using the answers in the knowledge base. But maybe the problem proves to be a little trickier than expected. This is when you escalate to Level 2. Even then, Level 1 still has to document everything. It’s imperative for Levels 2 and 3 to have all of the relevant information, so when a next-tier agent looks over the ticket, they know Level 1 has checked the knowledge base and explored all of the options at their level.
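
The escalation rule described above can be sketched as a guard: a ticket only moves up a tier once the knowledge base has been checked and the work so far is written down. The `Ticket` class, its fields, and the error messages are hypothetical and only illustrate the idea. Most ticketing tools would enforce this through required fields or workflow rules instead.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TIER = 3  # the article mentions Levels 1 through 3


@dataclass
class Ticket:
    ticket_id: str
    tier: int = 1
    knowledge_base_checked: bool = False
    work_notes: List[str] = field(default_factory=list)

    def add_note(self, note: str) -> None:
        # Prefix each note with the tier that wrote it.
        self.work_notes.append(f"L{self.tier}: {note}")

    def escalate(self) -> None:
        # Refuse to escalate until the current tier has done its part.
        if not self.knowledge_base_checked:
            raise ValueError("check the knowledge base before escalating")
        if not self.work_notes:
            raise ValueError("document what was tried before escalating")
        if self.tier >= MAX_TIER:
            raise ValueError("already at the highest tier")
        self.tier += 1


# Example with made-up data.
ticket = Ticket("INC-000123")
ticket.knowledge_base_checked = True
ticket.add_note("Followed the password reset article; login still fails.")
ticket.escalate()
print(ticket.tier, ticket.work_notes)
```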

Resolve and Recover

Eventually, the IT team will diagnose and resolve the incident with the goal of restoring normal service to the customer. Fixes like bug patches could require testing and deployment after resolution.

The service desk closes the ticket. To maintain quality, only service desk employees should close incidents, but not before the incident reporter (i.e., the affected user or customer) agrees everything is resolved.

Typically, the IT team will set the incident status to resolved and wait for the incident reporter or service owner to confirm it has, in fact, been resolved. The waiting period is often seven days, which allows time for user feedback and for the IT team to reopen the incident if the issue requires further action. Once an incident is closed, however, it should never be reopened. This protects the integrity of the metrics that IT measures.
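
The difference between "resolved" and "closed" is easy to blur, so here is a minimal sketch of the lifecycle as a small state machine. The `Incident` class, the `Status` names, and the method names are hypothetical. Refusing to reopen after the seven-day window has passed is an assumption made for illustration, since the text only says that a closed incident is never reopened.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

CONFIRMATION_WINDOW = timedelta(days=7)  # "often seven days"


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    CLOSED = "closed"


@dataclass
class Incident:
    ticket_id: str
    status: Status = Status.OPEN
    resolved_at: Optional[datetime] = None

    def resolve(self, now: datetime) -> None:
        if self.status is not Status.OPEN:
            raise ValueError("only an open incident can be resolved")
        self.status = Status.RESOLVED
        self.resolved_at = now

    def reopen(self, now: datetime) -> None:
        # Closed incidents are never reopened; resolved ones can be,
        # but (by assumption) only inside the confirmation window.
        if self.status is not Status.RESOLVED:
            raise ValueError("only a resolved incident can be reopened")
        if now - self.resolved_at > CONFIRMATION_WINDOW:
            raise ValueError("confirmation window has passed")
        self.status = Status.OPEN
        self.resolved_at = None

    def close(self, reporter_confirmed: bool) -> None:
        if self.status is not Status.RESOLVED:
            raise ValueError("only a resolved incident can be closed")
        if not reporter_confirmed:
            raise ValueError("the reporter has not confirmed the fix")
        self.status = Status.CLOSED


# Example with made-up dates.
incident = Incident("INC-000123")
incident.resolve(datetime(2016, 6, 1))
incident.reopen(datetime(2016, 6, 3))   # the user reports the issue is back
incident.resolve(datetime(2016, 6, 4))
incident.close(reporter_confirmed=True)
print(incident.status)
```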

Pro-Tip 1: Don't Skip Steps

The steps we've covered exist for a reason: they work.

Countless hours have been spent streamlining the incident process. Don't reinvent the wheel. Innovation is awesome, and our industry thrives on it, but sometimes you need to stick with what works. Incident management is one of those things. Keep a checklist:

  • Log everything
  • Give the incident a unique number, even if your ticketing system doesn’t
  • Document all of the details
  • Assign a category and priority level
  • Check the knowledge base for every incident, even if you think you know the solution

In the case of a surge of incidents, the best bet is to see if they're related and link the incidents in your ticketing system. Many IT teams will work from a master incident and open related child incidents. This reduces the workload and makes the response to the service outage better coordinated.
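
As an illustration of master and child incidents, the sketch below groups a surge of tickets by the service they affect and treats the first ticket logged for each service as the master. Both the `link_by_service` function and the "same service means related" rule are simplifying assumptions. Real ticketing systems usually offer their own linking feature, and deciding whether incidents are truly related still takes judgment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def link_by_service(incidents: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each master incident ID to the IDs of its child incidents.

    `incidents` is a list of (ticket_id, affected_service) pairs in the
    order they were logged. Assumption: incidents are related when they
    affect the same service, and the first one logged becomes the master.
    """
    by_service: Dict[str, List[str]] = defaultdict(list)
    for ticket_id, service in incidents:
        by_service[service].append(ticket_id)
    return {ids[0]: ids[1:] for ids in by_service.values()}


# Example with made-up data.
surge = [
    ("INC-101", "email"),
    ("INC-102", "email"),
    ("INC-103", "vpn"),
    ("INC-104", "email"),
]
links = link_by_service(surge)
print(links)
```

Here the three email tickets collapse into one master with two children, so the team works one outage instead of three separate tickets.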

Pro-Tip 2: Define an Incident Response Plan

Be ready with an incident response for major service outages: 

  • Create incident response run-books IT teams can follow to quickly start the response process 
  • Document service owners so the team knows who to contact when an outage occurs
  • Practice for major outages affecting the critical services that matter most to the business
  • Review incident reporting for Severity 1 incidents to ensure IT has the proper reporting to measure against KPIs

Pro-Tip 3: Define Roles and Responsibilities

Establishing roles is crucial, though how you fill them depends on your IT team's size. A smaller team may have a one-person army, whereas a bigger team has multiple people swarming on an issue. These roles, using the analogy of a ship at sea, should include:

  • Incident manager: Captain of the ship who keeps the team on target
  • Subject matter expert(s): First mate who figures out the causes of failures and suggests fixes
  • Service operations engineer: Navigates bumpy waters, is in charge of the first assessment, and pushes fixes
  • Internal communications manager: Bosun’s mate who handles staff communication, staving off potential mutiny
  • Release manager: First officer who ensures emergency releases of software happen quickly and safely on their watch
  • External communications manager: Keeps customers happy, thus keeping the boat afloat

Pro-Tip 4: Keep Your Customers in the Loop

You know the drill from the customer perspective: Something breaks. The customer gets annoyed. They try every solution they know of. They may bang their heads on their desks because nothing is working and they don't want to sit on hold waiting for the service desk. Finally, they give up, and call/email/text/smoke signal a support agent, and the agent tells them it's an outage. It wasn't the customer’s fault.

Many times, that’s how IT discovers there is an interruption. Often enough, though, IT already knows about the interruption and the customers do not. Don't be that team. Tell your customers when something breaks. Use a central portal where customers can check for current service outages.

Follow the Plan

And there you have it: how to handle incident management like a boss. Be proactive when it’s possible, using monitoring and alerts. Collect all the relevant information and document it in the ticket. Categorize and prioritize the work. Resolve every incident as quickly as possible, returning customers and users to work and producing value for the business. Remember to follow a process, document incidents, use your knowledge base, define roles, and communicate with your customers. Now, let’s deliver legendary IT service.


Sid Suri is the vice president of marketing for JIRA Service Desk. He's worked in various technology marketing roles over the past 15 years at Salesforce.com, Oracle (CRM), InQuira (acquired by Oracle), and TIBCO Software. He has an MBA from the Haas School of Business and a bachelor’s degree in Economics and Italian from Middlebury College. He lives in San Francisco.


Tag(s): support center, supportworld, service desk, incident management, best practice, process management
