How to Prepare for Major Incidents

Preparing for the worst is better than confronting a major outage without a plan. Here are some steps to take to create a plan that will work for your organization.

by Delcia Marrs
Date Published January 23, 2023 - Last Updated May 3, 2023

When a major incident occurs in your organization, it can have devastating effects, impacting revenue, security, safety, employee and customer satisfaction, and reputation.

What counts as a major incident will differ for small, medium, and large companies, and across industries and demographics. Each organization defines the scope of a major incident based on its own goals and objectives, and major incidents are typically categorized as Priority 1 or Priority 2. Standardized methods and procedures for responding, from the moment a major incident is reported (or an incident or event is discovered that could turn into one) through service restoration and documented learnings, will reduce or eliminate future business-critical downtime.

Keeping on top of major incidents will always be a work in progress as new technologies are introduced. In addition, as you work with different vendors, contractors and other external parties, security will be front and center.

Here are some areas that will have you armed for success:

The “Must-Have In Place” List

Service Level Agreement (SLA), Operational Level Agreement (OLA) and Contract/Vendor Agreement documents that include a priority matrix and disaster recovery details in each.

Links to the SLA, OLA, and Contract Agreements should be referenced within each document for easy access. Keep these in a centralized repository so when it comes time for review, they can be updated as a whole.

An ITSM tool that stores a Service Catalog and provides reporting, classification, and priority selection based on the SLA.

If an SLA is being re-prioritized, your ITSM tool must be updated, as well.
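
As a rough illustration of the kind of priority matrix an SLA might contain, the sketch below maps impact and urgency to a priority number. The levels, the mapping, and the names `PRIORITY_MATRIX` and `classify` are assumptions made for this example and are not taken from any particular SLA or ITSM tool. Your own agreements define the real values.

```python
from __future__ import annotations

from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical matrix: (impact, urgency) -> priority number.
# Real values come from your own SLA.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.LOW, Level.HIGH): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}

# Following the article: Priority 1 and 2 are treated as major incidents.
MAJOR_INCIDENT_PRIORITIES = {1, 2}


def classify(impact: Level, urgency: Level) -> tuple[int, bool]:
    """Return (priority, is_major_incident) for an impact/urgency pair."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    return priority, priority in MAJOR_INCIDENT_PRIORITIES
```

Keeping the matrix in one place like this means that when an SLA is re-prioritized, only the table changes, which mirrors the need to update the ITSM tool at the same time.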

Your Service Catalog should be one that is agreed upon by the Service Provider and the Business, and communicated to the user community so there is transparency and visibility around expectations.

A Service Desk/Help Desk team (Tier 1) which serves as the single point of contact (SPOC) for your customers and users.

Provide visibility via an internal customer portal that lists business hours of operation and explains how to obtain after-hours support. Have a section dedicated to major incidents that outlines examples and how these should be reported.

With many teams now working mostly remotely or in a hybrid model, communication tools other than the phone are essential, such as chat/instant messaging, collaboration/conferencing tools, dedicated virtual desktops, VPN, and company or personal (containerized) mobile devices.

A triage knowledge base is the team’s “bible” for troubleshooting, incident assignment, support contact details, escalations, and procedures. Keep it up to date by asking service owners to review their entries regularly.

A reliable Automatic Call Distribution (ACD) telephony system that has the capability for major incident workflow routing (e.g. an emergency queue).

Incoming calls to this queue should be answered as a priority over any other calls and by a live person. If your ACD is not working, redirects should be in place.

Reporting, Actioning and Resolving a Major Incident

The following are steps you can take to minimize service disruption by making sure your Service Provider organization has a clear sense of urgency and your customers and users are aware of what constitutes a major incident.

Reporting a Major Incident

Communicate on a regular basis the importance of reporting major incidents to Tier 1 Support. Provide examples to your end-users and explain the need to report these types of incidents in a timely manner. If your main communication channel is not working, provide alternatives.

Service Desk/Help Desk

Once Tier 1 Support has triaged and determined it is in fact a major incident, ascertain the correct priority. There may be different action items for Priority 1 vs Priority 2. The highest priority will generally have additional steps because time is ticking and service restoration is paramount.

Your team should ideally have a checklist of action items for clarity and consistency:

Page out the incident to the respective team and/or vendor for investigation and actioning. Have set timelines to call back if you are unable to reach them initially. Escalate beyond a certain number of tries.
Assign an Incident Manager, who should essentially be the service applications/systems owner, and an Incident Coordinator, who is a member of your Tier 1 team (e.g. the person who took the initial call). The Incident Coordinator can delegate tasks on the checklist to other team members during the outage. If additional reports are coming into Tier 1 Support, link each one to a single parent ticket. This provides good metrics by way of location, departments, applications/systems, and users affected.
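
The page-out item on the checklist can be sketched as a simple retry-then-escalate loop. Everything here is hypothetical: the `page` callable stands in for whatever paging or phone system you use, and the attempt limit and wait time are placeholder values that your own procedures would set.

```python
import time
from typing import Callable, Optional, Sequence


def page_with_escalation(
    contacts: Sequence[str],
    page: Callable[[str], bool],
    max_attempts: int = 3,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Page each contact in order; escalate to the next after max_attempts.

    `contacts` is the escalation chain (e.g. on-call, then team lead).
    `page` is a hypothetical function that returns True when the contact
    acknowledges. Returns the contact who acknowledged, or None.
    """
    for contact in contacts:
        for attempt in range(1, max_attempts + 1):
            if page(contact):
                return contact
            # Wait the agreed interval before calling back, except after
            # the final attempt, when we escalate immediately.
            if attempt < max_attempts:
                sleep(wait_seconds)
    return None
```

Passing `sleep` as a parameter is a design choice that lets the same logic be exercised in a drill without actually waiting between attempts.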

Communication

1. Open a conference or virtual call. Engage IT teams, stakeholders, vendors, suppliers, and any other parties who can potentially assist. The outage could also extend to other businesses across the state, province, or country. Be aware of power or unforeseen outages.

2. Communicate to your end users the service disruption details. Depending on the outage, select channels that are available and working.

3. Communicate to your external customer-base via social media platforms. Place a service disruption notice on your website. Offer workarounds if available.

4. Add a broadcast message on your ACD telephony system outlining details like affected applications/systems and location, as well as estimated restoration time.

5. Provide regular updates. In times of criticality and impact, it’s better to over-communicate and keep everyone informed of progress.
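
To keep notices consistent across channels, the details named in the steps above (affected applications/systems, location, estimated restoration time, and any workaround) could be held in one structure and rendered as text. This is a minimal sketch; the `DisruptionNotice` class and its field names are invented for illustration and do not correspond to any specific ITSM or ACD product.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DisruptionNotice:
    """Hypothetical container for the details a disruption notice needs."""

    affected_systems: list[str]
    locations: list[str]
    estimated_restoration: Optional[datetime] = None
    workaround: Optional[str] = None

    def render(self) -> str:
        """Build a plain-text notice suitable for email, chat, or a web banner."""
        lines = [
            "Service disruption: " + ", ".join(self.affected_systems),
            "Locations affected: " + ", ".join(self.locations),
        ]
        if self.estimated_restoration is not None:
            eta = self.estimated_restoration.strftime("%Y-%m-%d %H:%M")
            lines.append("Estimated restoration: " + eta)
        else:
            lines.append("Estimated restoration: not yet known")
        if self.workaround:
            lines.append("Workaround: " + self.workaround)
        return "\n".join(lines)
```

Rendering every update from the same fields makes it easier to send the same facts to end users, external customers, and the ACD broadcast message.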

Testing: Have your IT teams and stakeholders test once service appears to be restored. Test internal and external systems. Test from multiple devices and operating systems.

Service Restored: Based on testing, the Incident Manager will confirm whether the incident is to be closed. Write resolution details in plain language and avoid acronyms, as your audience may not be familiar with technical terminology. A Root Cause Analysis Report should be requested. Action items and learnings from this report should be visible to the organization and implemented within agreed-upon timelines.

Send out service restoration communication. Remove any ACD broadcast messages. Remove any disruption notices on internal and external websites.

Practice Makes Preparing for a Major Incident Manageable

We’re all human, so when a major incident hits, it’s panic time. This is especially true if a major incident is rare. Performing mock trials/drills of a major incident with your Tier 1 team will alleviate the anxiety, stress, and pressure when the actual one hits. Include all of your team members – experienced and new recruits. Make sure everyone is comfortable with the process.

Systems can be extremely complex and need to be reliable. Every effort should be made to secure constant and consistent availability.

Delcia Marrs is Senior Technical & Functional Service Management Analyst with BC Ferries. An IT professional with extensive experience in customer services support, she has demonstrated successes in shaping positive customer service journeys, systems support, customer training, and documentation. Her goal is to push the positive boundaries of success for organizations.

Tag(s): supportworld, support models, technology
