IT is a huge part of the budget for many companies. The money spent on IT is intended to create business value and support business growth, but many IT organizations spend a large part of this budget on managing incidents. The service desk typically employs many people and uses expensive telephony and service management tools; second- and third-line support staff are often expensive to hire and retain; and vendor support contracts can also be expensive.
What do the customers of IT get in return for this investment? At best they get a reduction in the amount of business disruption that they suffer as a result of service failures. It’s never going to be easy to say to your customers, "We’ve done a fantastic job managing incidents. What contribution did this make to your business growth?" Customers expect you to resolve incidents; simply managing them doesn't add value to them or their business processes.
Let’s look at this a different way.
Suppose you managed 500 incidents this week: every one of them was managed perfectly, you beat all your SLA targets for resolution time, and your post-incident customer survey shows that every user was satisfied with the incident management you provided. Do you think it would delight your customer if you went to them and said "next week we’re planning to manage 1,000 incidents, so this should make you even happier"? It sounds absurd, doesn’t it? That’s because managing incidents doesn’t create business value; it simply reduces the amount of business value that you lose. What would make your customers really happy is if you never had any incidents. They certainly expect you to resolve incidents quickly when they do happen, but the fact is that regardless of how good your incident management is, it will never create any business value.
This leaves IT with a problem.
A huge part of the budget is being spent on an activity that doesn’t create business value, but if we stopped doing it then our customers would be even less happy. How can we manage our way out of this? The obvious solution is to invest in things that reduce the number of incidents, and reduce the business disruption caused by the incidents that we can’t prevent. That way we can really increase customer satisfaction by making IT better able to meet their needs, and if we do a really good job we may find that we can reallocate our budget away from incident management towards things that help grow the business.
I’ve spent quite a bit of time thinking about things that can be done to reduce the amount of incident management that we do, and there are lots of them, but most IT departments don’t invest nearly as much in these things as they do in incident management. Here are some areas where I think IT departments should be investing.
Problem management has two aspects, and both of them are important to helping reduce the frequency and business impact of incidents.
- Proactive problem management analyzes incident trends to identify problems that might otherwise be missed.
- Reactive problem management identifies the root causes of problems, creates workarounds to reduce the impact, and initiates changes to permanently resolve problems.
For some reason many IT organizations do very little problem management. The most common problem management activity I see is root cause analysis that is carried out after major incidents, to try to prevent them recurring. This is important, but it is only a very small part of problem management.
I regularly hear people saying that they can’t do more problem management because they are far too busy. Typically this is because they have so many incidents that they aren’t able to spare people to prevent the incidents in the first place. If only they could just stop and stand back from the incidents for a while they might be able to break this vicious cycle. A little problem management work could lead to a small reduction in incidents, freeing up resources to do more problem management work. This is the kind of situation where it might make sense to bring in a contractor for a short while, to get problem management started and free up people who can then take over running the process.
One approach that I have seen to starting problem management is to simply create a "Top 5" problem report each month and manage these five problems. This report is based on analysis of incident trends, and simply identifies the five problems that have had the biggest negative impact on the business each month. Problem management staff then work on reducing the impact of these problems (by documenting workarounds) or the frequency of them (by initiating changes). The effectiveness of the approach can be measured by reporting the total business impact of these five problems in successive months after they were first identified.
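To make the "Top 5" idea concrete, here is a minimal Python sketch of how such a report could be produced. It assumes each incident has already been linked to a problem and given an impact figure in lost business hours; the field names `problem` and `impact_hours` and the sample records are invented for illustration and don't come from any particular service management tool.

```python
from collections import defaultdict


def top_problems(incidents, n=5):
    """Return the n problems with the largest total business impact.

    Each incident is a dict with hypothetical keys:
      "problem"      - name of the underlying problem
      "impact_hours" - business hours lost to this incident
    """
    impact = defaultdict(float)
    for incident in incidents:
        impact[incident["problem"]] += incident["impact_hours"]
    # Sort problems by total impact, largest first, and keep the top n.
    ranked = sorted(impact.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Illustrative data only.
this_month = [
    {"problem": "VPN drops", "impact_hours": 12.0},
    {"problem": "Printer queue stalls", "impact_hours": 1.5},
    {"problem": "VPN drops", "impact_hours": 8.0},
    {"problem": "Email delays", "impact_hours": 6.0},
]

for name, hours in top_problems(this_month):
    print(f"{name}: {hours:.1f} business hours lost")
```

Running the same function on each later month's incidents, and comparing the totals for the problems you first identified, gives you the month-on-month measure of whether the work is paying off.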
There are several other simple things that IT organizations can do to improve their problem management, for example:
- Move the focus of problem management away from root cause analysis. What is really important is identifying workarounds and stopping the incidents from happening again. Root cause analysis is something we do to help make these happen, but it should not be the focus of all your activities. For example you should almost always devise a workaround before you understand the root cause. Tell the service desk what to do if the problem happens again. You may not get this perfectly right but if your best technical people think about it for a while they’re going to do a better job than the level 1 service desk staff! When the workaround is in place this will reduce the business impact, giving you more time to fix the problem properly, and you may even find that the workaround is so effective that you can stop working on this problem altogether and move your focus to a problem with a bigger business impact.
- Train your staff on problem solving techniques. Technical staff are usually trained to ensure they understand the technology components, and many of them receive service management training to ensure they understand the processes they have to follow and how they fit into the organization. Surprisingly few technical staff have attended any formal training in how to analyse problems. There are a number of approaches that can be effective. My personal favourites are Kepner-Tregoe problem solving and timeline analysis, but other approaches can be equally effective.
Knowledge management can make sure that the right information is available to the people who need it at the time it will be most valuable to them. This can make a big improvement to incident management by giving your service desk people access to information that helps them resolve incidents faster. It can also be used to support a self-service portal so that end users can resolve their own incidents.
Knowledge management probably won’t reduce the number of incidents, although it might have some impact by providing users with information that they need to use services effectively. What it will do is significantly reduce the business impact of incidents, and the cost to IT of managing those incidents.
One very effective approach to using knowledge management in IT organizations is Knowledge-Centered Support (KCS℠). You don’t need a huge project to start improving how you manage knowledge; this is definitely one area where an agile approach to IT service management can work well. Think about what knowledge would help to improve things, and see how you can provide that, then go and find some more knowledge you could create to add more value. Eventually you may need to invest in tools and a more formal process, but start with something simple.
In many IT organizations the biggest cause of service failure is changes that go wrong. Estimates of what percentage of incidents are due to change vary, but in one organization I have personally seen the number of incidents fall by 80% during a change freeze, so it can be very significant.
Many IT organizations have overly bureaucratic change management that slows changes down without removing the risk. What you need is an efficient and effective change management process that helps you make the changes your customers need while controlling the risk. This is difficult but not impossible. One thing you can do is use trend analysis to identify which changes are likely to result in downtime and focus your efforts on risk mitigation in those areas, while providing a fast path through the process for changes which are low risk.
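As a rough illustration of that trend analysis, the Python sketch below works out how often each type of change has led to an incident, then uses that history to decide which changes get the fast path. Everything here is an assumption for the sake of the example: the `category` and `caused_incident` fields, the 10% threshold, and the two route names are all made up, and a real process would weigh business impact as well as failure rate.

```python
from collections import Counter


def failure_rates(changes):
    """Fraction of past changes in each category that caused an incident."""
    total = Counter()
    failed = Counter()
    for change in changes:
        total[change["category"]] += 1
        if change["caused_incident"]:
            failed[change["category"]] += 1
    return {category: failed[category] / total[category] for category in total}


def route(category, rates, threshold=0.10):
    """Pick a path through change management for a new change.

    Categories with no history are treated as risky, because there is
    no evidence yet that they are safe.
    """
    rate = rates.get(category)
    if rate is None or rate > threshold:
        return "full review"
    return "fast path"


# Illustrative history: 10 clean patches, 4 firewall changes of which 2 failed.
history = (
    [{"category": "patch", "caused_incident": False}] * 10
    + [{"category": "firewall", "caused_incident": True}] * 2
    + [{"category": "firewall", "caused_incident": False}] * 2
)

rates = failure_rates(history)
for category in ("patch", "firewall", "dns"):
    print(category, "->", route(category, rates))
```

The point of the design is that the extra scrutiny goes where the history says the risk is, instead of being spread evenly across every change.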
The best thing you can do to reduce the frequency and impact of incidents is to design your service right in the first place, but for some reason many IT organizations choose not to invest in planning to avoid failure, preferring to spend their money on recovering after the event. ITIL describes processes for availability management and capacity management, but very few organizations plan service availability, and most capacity management work is purely technical, with insufficient business data to allow for real prediction of future needs.
There are many things that can be done to improve designs. The simplest is to make sure the redundancy and resilience measures you have put in place are actually working. When I carry out audits for customers I always find at least one technical control that is no longer working, but has not been noticed. For example, disks may have been configured in RAID sets, or with mirrors, but nobody has noticed that one disk has already failed, or a network may have dual links between sites but one of the links is no longer working. Even when there has not been an undetected failure, there has often been no testing of the failover capability, and either this does not work, or it relies on manual steps that haven’t been properly documented. If you haven’t carried out an audit of these measures in your environment, then why not go and check how well your technical controls are working? You may save yourself and your customer an expensive outage! You could even create an availability plan that supports regular testing of all failover measures, whether they are automated or manual.
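If you keep even a simple inventory of your redundancy measures, the two checks described above (has part of it already failed, and has failover been tested recently) are easy to automate. The Python sketch below is one hypothetical way to do it. The inventory fields, the 180-day testing interval, and the sample entries are all assumptions for illustration; in practice the data would have to come from your own monitoring and test records.

```python
from datetime import date, timedelta


def audit(controls, today, max_test_age_days=180):
    """Flag redundancy controls that are degraded or lack a recent failover test.

    Each control is a dict with hypothetical keys:
      "name", "expected_members", "healthy_members",
      "last_failover_test" (a date, or None if never tested)
    """
    findings = []
    max_age = timedelta(days=max_test_age_days)
    for control in controls:
        if control["healthy_members"] < control["expected_members"]:
            findings.append((control["name"], "degraded"))
        last_test = control["last_failover_test"]
        if last_test is None or today - last_test > max_age:
            findings.append((control["name"], "failover untested"))
    return findings


# Illustrative inventory only.
inventory = [
    {"name": "RAID set", "expected_members": 2, "healthy_members": 1,
     "last_failover_test": date(2024, 5, 1)},
    {"name": "WAN link", "expected_members": 2, "healthy_members": 2,
     "last_failover_test": None},
    {"name": "DB cluster", "expected_members": 2, "healthy_members": 2,
     "last_failover_test": date(2024, 6, 1)},
]

for name, issue in audit(inventory, today=date(2024, 6, 30)):
    print(f"{name}: {issue}")
```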
A recent innovation in availability design is the idea of creating anti-fragile solutions. This is a very different philosophy to traditional availability management. The basic concept is that creating very rugged solutions that are designed not to fail can never succeed, and ultimately leads to catastrophic failure with enormous business impact. An anti-fragile solution is designed with the knowledge that everything fails, so it is designed to recover very quickly and with minimal business impact. The focus is on mean time to restore service (MTRS) rather than on mean time between failures (MTBF). Ideally an anti-fragile solution will recover from failure automatically, before the users even notice that it has failed.
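A quick calculation shows why the focus on MTRS matters. The Python sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + MTRS), which assumes MTBF measures the uptime between failures. The figures are invented purely to compare the two design philosophies.

```python
def availability(mtbf_hours, mtrs_hours):
    """Steady-state availability as a fraction between 0 and 1.

    Assumes mtbf_hours is the mean uptime between failures and
    mtrs_hours is the mean time to restore service after each failure.
    """
    return mtbf_hours / (mtbf_hours + mtrs_hours)


# Illustrative figures only.
scenarios = {
    "baseline": (1000, 10),                 # fails every 1000 h, 10 h to restore
    "more rugged": (2000, 10),              # fails half as often
    "fast automatic recovery": (1000, 0.1), # same failure rate, 6 min to restore
}

for name, (mtbf, mtrs) in scenarios.items():
    print(f"{name}: {availability(mtbf, mtrs):.4%}")
```

With these numbers the baseline is roughly 99.0% available, doubling the time between failures takes it to roughly 99.5%, and cutting restore time to six minutes takes it to roughly 99.99%, even though the service fails just as often as before.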
Most IT organizations are spending far too much time and effort on incident management, which delivers no net value to their customers. If we all invested a bit more in measures to reduce the frequency and business impact of incidents, then our customers would get a much better service, and we might even find that it reduces the overall cost of providing IT.
There are many different things you can try, and each of them will lead to incremental improvement in your service. You don’t need to implement everything I have described here, but you should think about how you can do less incident management and more value creation for your customers.
Who is Joe the IT Guy? A native New Yorker. Loves everything IT-related (and hugs). Passionate blogger and Twitter addict. Oh...and resident IT Guy at SysAid Technologies (almost forgot the day job!).