This 2014 column is a great story from our "From the Vault" series. Outages happen, but how an organization reacts to them is what matters.
Have you ever experienced an outage of a critical business system? It’s reasonable to presume that nearly every organization has faced this situation at one time or another. When it happened at my company, Medtronic, we initiated a program to prevent it from happening again.
The first step was to define the problem we were trying to solve and identify improvements that, if implemented and sustained, could minimize or even eliminate impact to our business. An outside consulting services partner helped us assess the processes and activities that led to the outage.
Using fact-based metrics and measures, we concluded that we had very well-written process documents that weren’t always being well executed. We identified gaps related to process execution, made awareness across the organization a priority, and ensured that we had a culture of accountability.
Fortunately, we weren’t starting from scratch. Several years earlier, we had launched a fairly rigorous ITSM program. The organization had designed, implemented, and improved several processes, such as incident, change, configuration, and knowledge management. Each was reasonably mature, according to Capability Maturity Model Integration (CMMI) assessments. For many of these processes, formal standard operating procedures were in place. We also had ongoing program efforts for request fulfillment, as well as problem, release, service level, service catalog, and access management.
Still, the impact of system outages made it clear that we could do better.
The Journey Begins
What exactly is production assurance? Essentially, production assurance is an organizational and process alignment designed to protect the production environment, facilitating and supporting business capabilities by ensuring system stability and supportability. The practices leverage common IT management and governance frameworks, including ITIL 2011 and COBIT v5.
We established a team within the Medtronic Global IT organization (primarily by realigning existing resources) that consisted of our Global IT Service Management team and our Global IT Change and Release team.
We leveraged, and continue to leverage, the Lean/Six Sigma DMAIC (Define, Measure, Analyze, Improve, Control) methodology to guide this work.
The leaders of the production assurance team crafted and articulated a vision, that of facilitating and supporting business capabilities by ensuring critical system stability and supportability. This vision was supported by three pillars:
Globally aligned processes: We will define, leverage, and continually improve ITSM processes to support the design, transition, and operations of our systems.
Risk management: We will define, leverage, and continually improve our ability to measure and mitigate risks related to the delivery and support of our systems.
Governance and controls: We will define, leverage, and continually improve governance and controls to ensure consistent delivery and support of our systems.
However, words alone aren’t enough to impact behavior. Creating this vision was only the beginning, but it helped to engage and energize the team. Layered onto this were improved rigor and controls, all while ensuring the practicality of the day-to-day work of the Global IT staff.
Develop a Project Plan—and Execute It
We always saw this journey as one of continual improvement, because we knew we’d never be "done" protecting our production environment. We’d only continue to get better at doing it. Early steps on the journey included:
Establishing a well-defined list of improvement activities and timelines
Measuring risk reduction over time
Determining and monitoring dependencies between activities
Monitoring progress toward the stated objective and along the defined timeline
Some of the key objectives included:
Strengthening language in policies and standards (e.g., using "shall" instead of "should")
Defining parameters for "Critical System" to help manage scope and control
Building effective system models in the CMDB
Establishing comprehensive system monitoring standards and guidelines
Defining a management reporting package and process to ensure that metrics are available and visible
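As an illustration only (this is not tooling described in the column), the first objective above, strengthening policy language, could be supported by a small script that flags soft wording for review. The sketch below is written in Python; the term list, the function name `find_weak_directives`, and the sample policy text are all hypothetical.

```python
import re

# Hypothetical list of "soft" modal verbs to flag; adjust it to your own
# policy style guide. Only "should" comes from the column itself.
WEAK_TERMS = ("should", "may")

def find_weak_directives(policy_text):
    """Return (line_number, term, line) for each soft modal verb found."""
    pattern = re.compile(r"\b(" + "|".join(WEAK_TERMS) + r")\b", re.IGNORECASE)
    findings = []
    for number, line in enumerate(policy_text.splitlines(), start=1):
        for match in pattern.finditer(line):
            findings.append((number, match.group(1).lower(), line.strip()))
    return findings

# Made-up policy text, used only to exercise the function.
sample_policy = """All changes to critical systems shall be approved before release.
Rollback plans should be documented for each change.
Teams may skip testing for minor changes."""

for number, term, line in find_weak_directives(sample_policy):
    print(f"line {number}: '{term}' -> {line}")
```

A flagged line is only a prompt for a human reviewer. Some uses of "may" grant a legitimate permission and do not need to become "shall".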
One of the core improvements was increasing the rigor of our change and release processes to help avoid introducing challenges into production environments. Strict adherence to processes was closely monitored, and the staff was held accountable to these controls.
The team also communicated. Frequently. Broadly. To help drive awareness, understanding, and alignment across processes and across our global organization, the team needed to be highly visible. So we published regular communications across a variety of channels, offered training (recorded and in-person), and leveraged metrics to increase visibility of issues and opportunities.
Every team member had responsibilities in the new organizational group, and leaders kept their fingers on the pulse using a weekly program scorecard. We now measure process capability based on ISO 15504, and we also measure task progress via critical success factors (CSFs) and key performance indicators (KPIs). Ongoing communications address status, objectives, and changes that are needed to drive behavioral changes.
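For readers who want to picture a weekly scorecard, here is a minimal Python sketch, not the scorecard we used. The KPI names, targets, values, and the 90 percent warning threshold are all hypothetical, and the sketch assumes that a higher value is always better.

```python
from dataclasses import dataclass

@dataclass
class Kpi:
    name: str
    target: float  # value we want to reach (higher is better in this sketch)
    actual: float  # value measured this week

def status(kpi, warn_ratio=0.9):
    """Green if the target is met, yellow if within warn_ratio of it, else red."""
    if kpi.actual >= kpi.target:
        return "green"
    if kpi.actual >= kpi.target * warn_ratio:
        return "yellow"
    return "red"

# Made-up KPIs and numbers, used only to show the three outcomes.
scorecard = [
    Kpi("Changes following the approved process (%)", target=98.0, actual=99.0),
    Kpi("Critical systems modeled in the CMDB (%)", target=100.0, actual=92.0),
    Kpi("Staff completing process training (%)", target=95.0, actual=70.0),
]

for kpi in scorecard:
    print(f"{status(kpi):6} {kpi.name}: {kpi.actual} of {kpi.target}")
```

A simple red/yellow/green view like this keeps the weekly conversation on the few items that need attention.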
Challenges on the Journey
Every road has a few bumps and potholes, but we stuck with our stated strategy and pursued our destination, making mild course adjustments as necessary.
Without executive support for the initiative and the team, much of our effort might have been in vain. However, support from leadership was phenomenal. They helped to communicate the importance of this transitional program and to drive progress toward our goals. If our leaders hadn’t been as engaged in the program as they were, advocating for the value of consistent and rigorous process, governance, and risk management, the team likely wouldn’t have moved the needle as powerfully as we did.
This was amplified by our own personal accountability. Many of the Medtronic corporate traits—those behaviors, skills, and capabilities that create a strong organizational culture—command a high degree of accountability. The team leaned into the value, not the cost, of holding one another accountable.
We examined the quarter-over-quarter and year-over-year data for significant incidents in our production environments. In the first year after implementing production assurance, we reduced the number and impact of major incidents in the global environment by nearly 90 percent, and by nearly 90 percent for our primary ERP system as well. The results were well received.
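The arithmetic behind a year-over-year comparison like this is simple. The Python sketch below shows one way to compute a percent reduction. The incident counts are made-up placeholders, not Medtronic data, and the function name is hypothetical.

```python
def percent_reduction(before, after):
    """Percent drop from a baseline count to a later count."""
    if before <= 0:
        raise ValueError("baseline count must be positive")
    return (before - after) / before * 100.0

# Placeholder counts of major incidents, for illustration only.
baseline_year = 40
following_year = 5

print(round(percent_reduction(baseline_year, following_year), 1))  # 87.5
```

The same function works for quarter-over-quarter data. Only the two counts change.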
We’ve also developed system models (or service maps) for 100 percent of critical systems in the CMDB. We continue to build out system models for our noncritical systems. And the team continues to report progress, opportunities, and challenges to IT executives every two weeks.
As mentioned, this was always seen as a continual improvement effort. Two years in, we continue to maintain stable systems with very few outages. The work continues to bear fruit. The journey required the definition of a program made up of many projects, but it wasn’t just a program; it was a new way of looking at and performing our work.
Are you looking for ways to make your production environment more stable? The approach we applied isn’t proprietary—you can do it, too!
The recipe is relatively simple:
Clarify the problem: What’s your problem statement?
Determine your current state: How do you know where you are? What are you measuring?
Define your objectives: How can you leverage your partners to determine the best actions to take?
Define your metrics: How will you measure progress? Do your metrics support business outcomes?
Define your approach: Do you have the people you need? Do you have the backing you need?
Communicate. Communicate. Communicate.
Get back to the basics of great foundational data, practical and consistent processes, meaningful metrics and measures, and active communications across the organization. Move forward from there.
Are you ready to take the production assurance journey?
Chris Gardner is a dynamic IT leader with more than twenty-five years of experience managing systems and technology projects, programs, and teams. His background includes application development, telecommunications and network management, custom e-commerce solutions, and enterprise application implementations. Chris is an ITIL Expert. He’s currently the manager of the Global ITSM team and program at Medtronic, Inc.