There I was, standing in front of a conference room full of senior managers and IT chiefs, reviewing the prior month’s service desk metrics and discussing performance, as I do every month. “As you can see by the chart, our abandon rate and average speed of answer dropped, while our FCR remained steady,” I proudly declared from my place at the head of the table. I glanced around the room at each person, listening to the tap-tap-tap-click-click-click of their laptop keyboards and mice as they read and responded to email. The two that weren’t checking email were staring, eyes glazed, at my presentation. Maybe hitting the audience with workload and volume figures would spark a reaction.
“We continue to see an upward trend in overall ticket workload, not only through calls but also through self-submitted web tickets. Because of this, our average workload per analyst is climbing and is well above industry averages,” I said. I threw some benchmarking in there, too: “We’re busier than benchmark figures say each person should be, so that means our analysts are working harder than other analysts in similar roles.” Surely that would get a reaction! A quick glance around the room dispelled that thought. Not even the customer satisfaction scores got more than a nod or two.
The rest of the world just doesn’t get as interested in our metrics as we do, does it? I had a surprise, though. An ace up my sleeve. I had a metric involving money.
“The next slide shows us a very important metric: cost per contact. Our cost per contact has dropped, and when benchmarked against similar organizations, we’re significantly below averages.” Tap-tap-tap-click-click-yawn-blink. The response was underwhelming. Why was this room full of IT leaders not heaping praise on my team? We worked hard, got results, and saved money. These were metrics my team could be proud of!
Maybe my next metric, one I’d never reported on before, would get them involved and engaged. I clicked to the next slide and began to explain the ratio of major incidents resulting from change. Laptops closed, heads turned to look at the screen, and eyes focused.
Major Incidents and Change: The Definition
Before you can calculate or communicate this metric, you need to make sure you’re using the right terms, and you need to make sure your organization is doing two things: You have to be tracking major incidents (or whatever your organization calls them), and you have to have a change management process.
ITIL defines a major incident as the highest category of impact for an incident; a major incident results in significant disruption to the business. Like most things ITIL, the exact definition and criteria for what makes a major incident a major incident will vary from organization to organization, depending on the needs of the organization. For healthcare organizations, one of the primary criteria for declaring major incidents is when an incident has the potential to affect patient care. For a staffing/contracting firm, the criteria may involve timekeeping and billing software for contract employees. For just about everyone, email being unavailable would qualify. Even if you aren’t focused on ITIL best practices or terminology, you’re probably doing some form of major incident management as you deal with unscheduled outages or downtime.
ITIL defines a change as the addition, modification, or removal of anything that could have an effect on IT services. Changes are typically reviewed and authorized with the help of a change advisory board (CAB).
There are three types of ITIL-defined changes, and I’ve added a fourth type of change that I’ve noticed tends to cause everyone a lot of trouble:
Standard change: A preauthorized change that is low risk, relatively common, and follows a procedure or work instruction (e.g., a monthly application server reboot that all customers are aware of).
Emergency change: A change that must be implemented as soon as possible; for example, to resolve a major incident or implement a security patch (e.g., a hard drive on a server failed and needs to be replaced).
Normal change: Any service change that’s not a standard or emergency change (e.g., rolling out a new version of Java to desktop PCs through a push).
Unscheduled change: A change that was performed without going through the change management process.
Normal and standard changes are scheduled in advance, and emergency changes are almost always reactive. Unscheduled changes should be infrequent, and there should be consequences when they’re performed.
Major Incidents and Change: The Metric
Calculating and tracking this metric is relatively simple: divide the number of major incidents resulting from changes by the total number of major incidents. Change managers and major incident managers need to work together to correlate major incident occurrences with scheduled changes. Support teams may have to help gather unscheduled change figures.
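As a minimal sketch of that division, assuming the two counts have already been gathered from the change and major incident records. The function name and the handling of a month with no major incidents are illustrative assumptions, not part of any ITSM tool:

```python
def change_related_ratio(incidents_from_change: int, total_major_incidents: int) -> float:
    """Return the share of major incidents that resulted from changes (0.0 to 1.0)."""
    if incidents_from_change < 0 or total_major_incidents < 0:
        raise ValueError("counts must be non-negative")
    if incidents_from_change > total_major_incidents:
        raise ValueError("change-related incidents cannot exceed total major incidents")
    if total_major_incidents == 0:
        # Assumption: a period with no major incidents is reported as 0.
        return 0.0
    return incidents_from_change / total_major_incidents


# Hypothetical month: 3 of 10 major incidents traced back to a change.
print(f"{change_related_ratio(3, 10):.0%}")
```

Reporting the result as a percentage alongside the raw counts keeps a quiet month (say, one incident, caused by a change) from looking worse than it is.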
At the end of the life of a major incident, one of the outputs should be a report on the major incident, including its impact on the organization and root cause analysis (RCA). Many ITSM tools will allow you to attach “child” incidents to a major incident; as long as the service desk is recording child incidents, you should have a good picture of the scope of the impact. A schedule of changes from your change management tool should provide enough information to show whether a scheduled change caused the major incident.
This research is all part of what happens after a major incident has been resolved and services have been restored. Never forget that your goal for incident management is restoring services as quickly as possible; research, RCA, reporting, and linking major incidents to changes are done after the major incident has been resolved. If this is a recurring major incident, it’s the job of problem management to address the root cause and put a stop to the recurrence.
According to the 2012 HDI Support Center Practices & Salary Report, the fully burdened cost per ticket is $10–$17, depending on the channel the user uses to report the incident. If a major incident impacts 100 users and results in a ticket spike of 100 tickets, that’s an instant monetary impact of at least $1,000. And that counts only the users who report the issue, and only support costs, to say nothing of lost productivity.
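The arithmetic behind that estimate can be sketched as follows. The per-ticket range is the one quoted from the HDI report; the function and variable names are illustrative assumptions:

```python
def ticket_spike_cost(tickets: int, cost_per_ticket: float) -> float:
    """Support cost of a ticket spike, ignoring lost productivity."""
    return tickets * cost_per_ticket


# Per-ticket cost range quoted in the text (USD); the channel determines where in the range a ticket falls.
LOW_COST_PER_TICKET = 10.0
HIGH_COST_PER_TICKET = 17.0

low = ticket_spike_cost(100, LOW_COST_PER_TICKET)
high = ticket_spike_cost(100, HIGH_COST_PER_TICKET)
print(f"${low:,.0f} to ${high:,.0f}")  # $1,000 to $1,700
```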
Major Incidents and Change: The Figures
As soon as I presented this metric, I was bombarded with questions: What teams implemented these changes? Why do we have unscheduled changes? How many of these were normal and how many were standard? Are we doing change right? Are we testing scheduled changes appropriately? What can we do to reduce major incidents? Finally! At last I’d found a metric my senior management team could get interested in!
In January, fifteen major incidents occurred, an average of one new major incident every 2.1 days. Twelve of the fifteen major incidents were caused by changes—a full 80 percent! February not only showed a decrease in major incidents but also a significant drop in the number of major incidents that resulted from some form of change. One of the more important data points from February was the fact that no major incidents resulted from unscheduled changes to the environment.
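For readers who want to reproduce the January figures, here is a small sketch using only the numbers quoted above. The variable names are illustrative, and the frequency is taken here as days in the month divided by the number of major incidents:

```python
# Figures quoted in the text for January.
DAYS_IN_JANUARY = 31
major_incidents = 15
caused_by_change = 12

# Share of major incidents that resulted from changes.
change_ratio = caused_by_change / major_incidents

# Average number of days between new major incidents.
days_per_incident = DAYS_IN_JANUARY / major_incidents

print(f"Caused by change: {change_ratio:.0%}")
print(f"One major incident every {days_per_incident:.1f} days")
```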
A high number of major incidents resulting from scheduled changes is a great indicator that changes are not being tested appropriately. If, when you break the number of scheduled changes down into normal and standard changes, you discover that standard changes are resulting in major incidents, it may be time to stop allowing those changes to be standard and require them to follow normal change procedures. If they are normal changes, are controls in place to transition the changes into production safely? If not, your service transition, operation, and support procedures need to be revisited, reviewed, and matured.
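One way to sketch that breakdown, assuming each change-related major incident has been tagged with the type of change behind it. The records below are hypothetical sample data, not figures from my organization:

```python
from collections import Counter

# Hypothetical records: the change type behind each change-related major incident.
causes = ["normal", "standard", "unscheduled", "normal", "standard"]

counts = Counter(causes)

# Report every change type, including those with no incidents this period.
for change_type in ("standard", "normal", "emergency", "unscheduled"):
    print(f"{change_type}: {counts[change_type]}")

# Standard changes are preauthorized as low risk, so any incident here is a prompt for review.
if counts["standard"] > 0:
    print("Review whether these standard changes should follow normal change procedures.")
```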
The January figure for unscheduled changes is something that should raise a red flag. Why are unscheduled changes being performed? Who’s performing these system changes? Is this an educational problem about when it’s appropriate to use change management? Is a vendor performing work without proper planning or communication? Or could it be something as sinister as willful disobedience?
The number we want to see grow in the chart is the frequency number. The higher the number of days between major incidents, the more “peace” the organization enjoys with regard to information system uptime.
Why Measure This Metric?
This metric isn’t just for eliciting positive reactions or drumming up interest from senior leadership. Major incidents resulting from change is one of the most effective metrics in your collection because it shows the service level impact of the changes being executed. It isn’t a measure of system failures; it’s a measure of departmental failures. It holds teams accountable for the impact they have on the business. It gauges interruption to the business caused by IT itself. In other words, it’s the “shot ourselves in the foot” measurement.
This is why executives care. They want to know how much harm we’re causing to ourselves and what we’re doing to turn that around. This is why you should add major incidents resulting from change to your suite of metrics.
* * * * *
As I wrapped up my presentation and prepared to end the meeting, I looked around the room at the senior managers and IT chiefs, all very much engaged and interested. As they stood to leave, they said, “This was really interesting. We’re looking forward to seeing next month’s numbers.”
Jared Van Doorn is the service desk manager at Orlando Health. He is responsible for managing twenty-five service desk analysts as well as the identity and access management team for one of the largest healthcare systems in Florida. He has more than twelve years of professional experience in IS support, and he’s currently the president of the Central Florida HDI local chapter. In addition, he’s earned the HDI Support Center Manager and KCS Principles certificates, two ITIL Intermediate certifications, his Certified Process Design Engineer certificate, and his MBA.