Use Lessons Learned from the Cloud to Optimize Your Service Management

by Adam Rauh
Date Published July 12, 2018 - Last Updated December 13, 2018

Unless you’ve been stuck under a technological rock for the last decade, you’ve heard about “the cloud.” Businesses are rapidly seeing the advantages (and some trade-offs) of moving their systems to the cloud. Everyone’s mileage will of course vary, but excluding a few hiccups with best practices and security, the overall prospects seem pretty one-sided: the cloud is here to stay, and it’s probably worth it. Given those returns, it would behoove us to evaluate the benefits and lessons learned from cloud deployments and see if and where they are applicable to ITSM environments and services.

There are a number of benefits to moving to the cloud. The most interesting ones for our purposes are:

  • Elasticity and Scalability
  • Speed
  • Capital vs. Variable Expenses

Different cloud providers mix and match some of these items, so I’ll give a simple definition for each before we begin our exploration.

Elasticity and Scalability

Scalability is the ability to add resources to accommodate increased loads. Usually this means “scale up.” Elasticity is the ability to scale both up and down to match current demand as closely as possible. So, scalability would be the ability to quickly add more CPUs or RAM with a few mouse clicks (versus racking and stacking new hardware), and elasticity would be the ability to scale dynamically (adding resources during peak hours and removing them afterwards).

In ITSM, scalability is the harder of the two. You can’t magically make new trained resources appear. Sure, you could offload some Level 1 work to Level 2 or vice versa to “scale,” but in reality you’re just trading between fixed resources, not really adding more.

That said, there are ways to be more “elastic.” Suppose you know Mondays always bring peak call volume. Schedule your resources so that you have more capacity then, and disperse them afterwards. Having a process for all-hands-on-deck spikes (perhaps once your call queue grows past X, the all-hands call goes out) is another way to make your department more elastic.

Establish a distribution list (or other alerting mechanism, depending on what you use for communications) for your all-hands alert. It should contain the appropriate resources you plan on calling in. Come up with a metric for when to send these alerts. A nice-to-have is a close code or some other demarcation (e.g., a form field on the ticket) to flag tickets worked during all-hands events; this will help you track duration, impact, and so on.
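
As a rough illustration, here is a minimal sketch of what that kind of threshold alert could look like if your ITSM tool exposes a queue count through an API. Everything here is a placeholder assumption rather than any specific product’s interface: the threshold of 25, the distribution list address, the stubbed get_open_queue_count function, and the local mail relay.

import smtplib
from email.message import EmailMessage

ALL_HANDS_THRESHOLD = 25                              # hypothetical: trigger once 25+ calls are waiting
ALL_HANDS_DL = "servicedesk-allhands@example.com"     # hypothetical distribution list

def get_open_queue_count() -> int:
    """Return the current number of open calls in the queue.

    Stubbed with a sample value; replace with a call to your own ITSM tool's reporting API.
    """
    return 27

def send_all_hands_alert(queue_count: int) -> None:
    """Email the all-hands distribution list once the queue passes the threshold."""
    msg = EmailMessage()
    msg["Subject"] = f"All hands on deck: {queue_count} calls in queue"
    msg["From"] = "alerts@example.com"
    msg["To"] = ALL_HANDS_DL
    msg.set_content("The call queue has passed the all-hands threshold. Please jump in if you can.")
    with smtplib.SMTP("localhost") as smtp:           # assumes a local mail relay is available
        smtp.send_message(msg)

if __name__ == "__main__":                            # run on a schedule, e.g., every 15 minutes
    count = get_open_queue_count()
    if count >= ALL_HANDS_THRESHOLD:
        send_all_hands_alert(count)

Run something like this on a schedule and tune the threshold to whatever “past X amount” means for your queue.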

Sure, these may be short-term fixes. But to your customers, they add a layer of elasticity that they will be sure to appreciate.

Speed

Speed as a definition is pretty self-evident. But the speed of what? Speed in the cloud comes from eliminating barriers. The barriers to entry are so low that the days of POs, waiting, racking, and stacking are gone, replaced by a few mouse clicks and near-instantly deployed infrastructure. So how does this relate directly to ITSM processes? I would say in two ways:

  1. Eliminating waste
  2. Limiting capacity

Why these two things? Because the potential for speed is directly related to the effects of constraints. You can try to run as fast as you want, but if you’re running in wet, muddy shoes, over time you’re going to run slower. Notice, though, that I said the potential for speed. You can eliminate all the roadblocks in the world and still choose to drive no faster than before. Eliminating waste and limiting capacity are tools, but you still need to make the decision to go faster.

Let’s start by talking about eliminating waste. Waste creates barriers that affect speed and therefore needs to be removed. Note that speeding up time spent on waste is different from eliminating it; you’re just making the waste take less time (but it’s still waste). A good way to start is to pick one or two processes, forms, or whatever else and do a quick review. Ask yourself these questions:

  • What value is this providing?
  • If this went away tomorrow, what would the impact be?
  • Even if we have determined this is waste(ful), would changing the process now cost too much legwork?

If you can’t find any definitive answers, move on to the next one or two areas. Soon though, I’m sure you’ll come up with a list of good candidates for slimming down. This is a great area to look at applying the 80/20 rule as well.

Turning to limiting capacity as a way to increase speed, what I’m really talking about is Little’s Law. Little’s Law is written as

L = λW

L = average number of customers in the system
λ = arrival rate
W = average time a customer spends in the system

In business terms, L = WIP (work in progress), λ = throughput, and W = lead time. What this basically means is that if we alter one of these three items, it will have an effect on the others. But which to alter? Radically, I would propose cutting your queues. Put a hard limit on the number of items your techs are working on. You may think this is impossible, but consider the alternative: if you think dumping X more tickets into Y queue will suddenly make those tickets get done faster, you’re wrong. This doesn’t mean rejecting tickets per se, but perhaps keeping a separate backlog that won’t be worked until WIP drops back under its limit and capacity is freed up.
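
To make the arithmetic concrete, here is a tiny worked example of Little’s Law with invented numbers; the point is simply that with throughput held constant, cutting WIP cuts average lead time proportionally.

def average_lead_time(wip: float, throughput: float) -> float:
    """Little's Law rearranged: W = L / λ, i.e., lead time = WIP / throughput."""
    return wip / throughput

THROUGHPUT = 10.0                        # hypothetical: the team resolves 10 tickets per day

print(average_lead_time(50, THROUGHPUT))  # 50 tickets in flight -> 5.0 days average lead time
print(average_lead_time(20, THROUGHPUT))  # cap WIP at 20        -> 2.0 days average lead time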

Kanban boards are a great way to visualize this. Create a Kanban board, but set rules on WIP in each stage. It will take time for your techs to adhere to these rules. But as you develop cadence and stable cycle times, you should see your mean time to resolve (MTTR) decrease as your velocity increases.
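
If you want to enforce those WIP rules somewhere other than a whiteboard, a toy model of a board with per-stage limits might look like the sketch below. The stage names and limits are made up for illustration, and most commercial Kanban tools already build this in.

from collections import defaultdict

# Hypothetical per-stage WIP limits for a support Kanban board
WIP_LIMITS = {"Triage": 5, "In Progress": 8, "Waiting on Customer": 10}

class KanbanBoard:
    def __init__(self, limits):
        self.limits = limits
        self.stages = defaultdict(list)   # stage name -> list of ticket IDs

    def pull(self, ticket, stage):
        """Pull a ticket into a stage only if that stage is under its WIP limit."""
        if len(self.stages[stage]) >= self.limits[stage]:
            return False                  # at the limit: finish something before pulling more
        self.stages[stage].append(ticket)
        return True

board = KanbanBoard(WIP_LIMITS)
print(board.pull("INC0001", "Triage"))    # True: there is still capacity in Triage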

Capital vs. Variable Expenses

Trading capital expenses (sunk, upfront costs for static hardware and services) for variable expenses (paying for what you need, when you need it) is one of the big drivers of cloud adoption. But how can we take lessons from this and apply them to ITSM? This one may seem tricky, but there are lessons to be learned.

For one, optimize not only your inventory, but your purchasing flow. Having a ton of laptops sitting around is great in case a blackout happens, but otherwise they’re waste, no way around it. If you can develop a stable lead time for delivery and configuration (the closer to the vendor the better), then instead of keeping a pile of stock on hand, order as needed, or pass those processes off directly to the user.
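
One simple way to reason about “order as needed” is a basic reorder point: if you know your average daily demand and your vendor’s lead time, you only need enough stock on the shelf to cover that window plus a small buffer. The numbers below are invented purely for illustration.

def reorder_point(daily_demand: float, lead_time_days: float, safety_stock: int = 0) -> float:
    """Stock level at which to place the next order: expected demand over the
    vendor lead time, plus a small safety buffer."""
    return daily_demand * lead_time_days + safety_stock

# Hypothetical numbers: 2 laptops deployed per day, 5-day vendor lead time, 3 spares as a buffer
print(reorder_point(daily_demand=2, lead_time_days=5, safety_stock=3))   # -> 13 laptops on hand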

Another area to think about is charge-backs. Whether these are real or merely perceived (“here’s how much your department would have been charged by IT”), balancing your books from a variable perspective can point out opportunities for process improvements, user training, and more. Knowing your cost per ticket, cost per department, cost per subcategory, and so on will help show where external forces, like projects, problems, and training, are driving cost.

To help come up with these numbers, assign some simple dollar amounts to your tickets. For incidents, this might be an average tech’s hourly rate (HDI’s Technical Support Practices & Salary Report provides industry salary data). For service requests, you may be able to build this directly into your service catalog based on the request type or hardware requested. Then track these over time to see the cost to your customers. When you get other people thinking about where they would spend their money, if they had to, you begin to shift the conversation from sunk costs toward the kind of variability that can drive change.
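
A back-of-the-napkin way to start is to multiply handle time by a loaded hourly rate and roll the results up by department or category. The rate and the ticket rows below are invented examples; plug in your own figures (HDI’s report is one place to find industry rates).

from collections import defaultdict

HOURLY_RATE = 35.00                       # assumed loaded hourly rate for a support tech

# Hypothetical ticket extract: (department, ticket type, tech hours spent)
tickets = [
    ("Finance", "Incident", 1.5),
    ("Finance", "Service Request", 0.5),
    ("HR", "Incident", 2.0),
]

cost_by_department = defaultdict(float)
for department, ticket_type, hours in tickets:
    cost_by_department[department] += hours * HOURLY_RATE

for department, cost in sorted(cost_by_department.items()):
    print(f"{department}: ${cost:,.2f}")  # e.g., Finance: $70.00, HR: $70.00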

A Note of Warning, or Inspiration

Technological ingenuity aside, many of the big-name cloud providers are able to provide these services at a speed and cost that is beneficial due to economies of scale. The cloud really is just someone else’s computer—but there are a ton of computers. There’s not an IT department in the world that has that same kind of infrastructure clout or budget.


Considering this, you might be thinking, well, this has all been great, but we lack the resources and budget to provide the scale and elasticity that enable many of these benefits. Fear not. The goal of this article is not to teach you how to take cloud infrastructure and mock up processes out of it. Rather, it is to think about the lessons learned from the cloud and see where they can be applied (sometimes at little to no cost) to ITSM. To summarize:

  • Create processes that give you elastic capacity. An all-hands alert process and/or a known scale-up time for extra resources (e.g., Monday mornings) is a great way to start.
  • Eliminate waste, and limit queue capacity, where it makes sense. Pay attention to Little’s Law and cycle times.
  • Think of ways to put a dollar amount on your effort. Unless you’re already doing charge-backs, these numbers will just be for internal reporting. But the closer you can get to showing what the actual dollar amount is, the more effective you can be with your resource planning.

Finally, think outside the box. You can’t write an API for your people. But you can implement a few simple changes that could pay off in big dividends. Get crafty. After all, someone did that one day with distributed systems at scale, and now we have the cloud. What can you fly away with? 


Adam Rauh has been working in IT since 2005. Currently in the business intelligence and analytics space at Tableau, he spent over a decade working in IT operations focusing on ITSM, leadership, and infrastructure support. He is passionate about data analytics, security, and process frameworks and methodologies. He has spoken at, contributed to, or authored articles for a number of conferences, seminars, and user-groups across the US on a variety of subjects related to IT, data analytics, and public policy. He currently lives in Georgia. Connect with Adam on LinkedIn.


Tag(s): supportworld, service management, ITSM, cloud computing
