The phrase “change is constant” isn’t new, and in the world of technology it’s truer today than ever. Whether it’s the introduction of new technology, products, and services, enhancements to existing offerings, or the decommissioning of what is no longer viable, all of these represent change to our technical landscape.
Recent surveys report that up to 80% of major incidents are caused by change. Gathering metrics on success rates and incidents caused by change is important when trying to understand your strengths and weaknesses.
Here are some questions to consider:
- What is your institution’s change success rate?
- Are you measuring incidents caused by change?
- Are they caused by change or are they triggered by a change? What’s the difference?
- Are there teams that are more successful or applications more resilient?
- Are you using automation?
There are many elements to a successful change environment, but without first looking at the data to understand how you’re performing, you’ll never know where to start. If you can identify trends in these metrics, then you can start to develop a roadmap to a more stable environment. While it’s not possible, or realistic, to prevent 100% of incidents when implementing a change, looking deeper into our change analysis, testing, and implementation processes can help improve these numbers and reveal where the opportunities for improvement lie.
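As a starting point, the success-rate and incident metrics described above can be computed from basic change records. This is a minimal sketch; the `ChangeRecord` fields and the attribution of incidents to changes are illustrative assumptions, not a reference to any particular ITSM tool’s data model.

```python
from dataclasses import dataclass

@dataclass
class ChangeRecord:
    change_id: str
    successful: bool       # closed without failure or backout
    caused_incident: bool  # a major incident was attributed to this change

def change_metrics(changes: list[ChangeRecord]) -> dict:
    """Summarize change success rate and the share of changes causing incidents."""
    total = len(changes)
    successes = sum(1 for c in changes if c.successful)
    incident_causing = sum(1 for c in changes if c.caused_incident)
    return {
        "total_changes": total,
        "success_rate_pct": round(100 * successes / total, 1) if total else 0.0,
        "incident_causing_pct": round(100 * incident_causing / total, 1) if total else 0.0,
    }

# Hypothetical month of changes: 3 of 4 succeeded, 1 caused an incident.
records = [
    ChangeRecord("CHG-001", True, False),
    ChangeRecord("CHG-002", True, False),
    ChangeRecord("CHG-003", False, True),
    ChangeRecord("CHG-004", True, False),
]
print(change_metrics(records))
```

Tracking these numbers per team or per application, as the questions above suggest, is what turns a raw percentage into a roadmap.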
From planning to implementation, communication is critical to the success of any change. Often there’s coordination between teams when implementing, but it may be too narrowly focused. As you’re building requirements, you must ensure that dependent applications are, at minimum, aware of the general nature of your change.
It’s also important to ensure that your underlying infrastructure teams understand the impacts to their environments because of your changes. If you’re expecting an increase in volume or size of service calls, you need to understand the significance of the difference and communicate that to your support teams. When caught off guard, a large influx of unexpected traffic can wreak havoc on supporting systems and have residual effects on other applications, as well. These teams not only need to be aware of your changes, but they need to be involved in your testing.
Your call centers, service desk agents, and other application-facing colleagues also need to know the elements of your change as it pertains to any user interfaces. Are workflows changing? Ensure that resources are available to help guide them and set their expectations, and provide information about what to do when something falls outside of those expectations. The better equipped your colleagues are with knowledge of a change and what to expect from it, the less “noise” there will be post-change, allowing teams to focus on post-implementation monitoring and ensuring a smoother experience for all.
Thorough testing is one of the best ways to ensure a change goes according to plan. Unit testing, regression testing, performance testing, user acceptance testing: the more mature and robust your testing plan and environments are, the lower the chances of a major incident post-implementation.
Ideally, for performance testing, non-production environments will be sized to be “production-like”. Transaction volumes should also mirror expected production volumes whenever possible. Additionally, when performance testing, before-and-after change reviews can be critical in identifying unexpected behaviors. Are your supporting environments showing significant changes in resource consumption after a change? Better to know and understand this now than after the change reaches production.
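A before-and-after review like this can be as simple as diffing resource metrics against a threshold. The sketch below assumes hypothetical metric names and a 20% flagging threshold; real tooling and thresholds will vary by environment.

```python
def consumption_delta(before: dict[str, float],
                      after: dict[str, float],
                      threshold_pct: float = 20.0) -> list[str]:
    """Flag any resource whose consumption shifted more than threshold_pct."""
    flagged = []
    for metric, old in before.items():
        new = after.get(metric, old)
        if old and abs(new - old) / old * 100 > threshold_pct:
            flagged.append(f"{metric}: {old} -> {new}")
    return flagged

# Illustrative pre- and post-change performance-test measurements.
before = {"cpu_pct": 40.0, "mem_gb": 8.0, "db_connections": 120}
after  = {"cpu_pct": 72.0, "mem_gb": 8.5, "db_connections": 118}
print(consumption_delta(before, after))  # flags the 80% CPU increase
```

Catching that CPU jump in a test environment is exactly the “know it now” moment the paragraph above describes.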
Test scripts for both automated and manual testing should include testing for errors and error handling. If you only test the “happy path”, you’ll likely experience some bumps in the road when reality strays from those very specific scenarios. As more organizations move to faster, more agile delivery cadences, test automation can be a powerful aid in ensuring that your core functionality remains business as usual while alleviating manual testing workloads. We’ve likely all seen the memes about testing in production, but no one enjoys working the major incidents that come from inadequate testing.
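The happy-path-versus-error-path distinction can be made concrete with a small example. The `apply_discount` function and its inputs below are entirely hypothetical; the point is that the test suite exercises the failure cases as deliberately as the success case.

```python
def apply_discount(price: float, pct: float) -> float:
    """Apply a percentage discount; reject out-of-range input loudly."""
    if not 0 <= pct <= 100:
        raise ValueError(f"discount must be 0-100, got {pct}")
    if price < 0:
        raise ValueError(f"price must be non-negative, got {price}")
    return round(price * (1 - pct / 100), 2)

def test_happy_path():
    assert apply_discount(100.0, 25) == 75.0

def test_error_handling():
    # The unhappy paths: invalid input must fail cleanly, not corrupt data.
    for bad_price, bad_pct in [(-1, 10), (50, 150), (50, -5)]:
        try:
            apply_discount(bad_price, bad_pct)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for ({bad_price}, {bad_pct})")

test_happy_path()
test_error_handling()
print("all tests passed")
```

In a real suite these would live in a test framework and run on every build, which is what keeps core functionality “business as usual” as delivery speeds up.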
We’ve already touched on the benefits of automation in testing; code deployed via automation is also less likely to be afflicted with human error. Code deployment automation can be tested across environments, which raises the likelihood of success. Additionally, the process of rolling back those deployments should be tested as well. As we’ve said, it’s impossible to guarantee 100% change success, so in addition to testing outside the “happy path”, we should plan and test code rollbacks in case we encounter the unexpected.
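The deploy-verify-rollback loop can be sketched in a few lines. Everything here is a simulation under stated assumptions: the `state` dictionary stands in for a real environment, and the health check is hard-coded to fail so the rollback path is exercised, just as the paragraph above recommends testing it.

```python
state = {"version": "1.0"}

def deploy():
    # Record the prior version so a rollback target always exists.
    state["previous"] = state["version"]
    state["version"] = "2.0"

def health_check() -> bool:
    # Simulated post-deploy smoke test; here the new version is unhealthy.
    return state["version"] != "2.0"

def rollback():
    state["version"] = state.pop("previous")

def deploy_with_rollback() -> str:
    """Deploy, verify, and automatically roll back on a failed check."""
    deploy()
    if health_check():
        return "deployed"
    rollback()
    return "rolled_back"

print(deploy_with_rollback())  # the failed check triggers an automatic rollback
```

Because the rollback path runs in the same automation as the deploy, it gets tested across environments right alongside it rather than being improvised during an incident.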
Lastly, let’s look at the difference between caused-by-change and triggered-by-change. At the time of the incident, they may look synonymous, but they are not.
It’s important to understand whether an element of the change, or the implementation of that change, caused the issue, or whether it uncovered something else in the environment. Incidents triggered by a change can be frustrating: often there are elements in the environment that weren’t understood, or that couldn’t be tested because they didn’t exist in pre-production; or environmental instability after a systemic reboot to complete the implementation can cause the issue. In these cases, no element of the change is at fault; the change is only the precipitating event prior to the incident. Again, a thoroughly tested rollback procedure can be critical to restoring the system quickly if the unexpected occurs.
We’ve touched on several elements that go a long way toward securing change success, and there are others, but a solid foundation is always a great place to start. Depending on the size of your organization and the resources available, you may be limited in some of these capabilities, but reflection and analysis will guide you in identifying where you have opportunities. If you have an opportunity to define how your organization can improve change quality, then process and resource maturity will likely follow.
Angie Handley has a broad background in information technology and is an experienced IT leader in application production support, technical training, incident management and problem management. She is ITIL and HDI certified and has a degree in Information Technology. Her career passions include building effective, happy teams, creating lasting partnerships in both IT and business, and leveraging business and IT process improvement techniques to make processes more efficient, cost-effective, and reliable for both IT staff and their customers. Currently, Angie is the Manager and Process Owner of Problem Management in the Financial Services Industry. Outside of work, Angie enjoys working with local animal rescues and playing Pickleball.