The promises of faster delivery and better quality with DevOps are attractive to organizations embarked on digital transformation, but implementing DevOps can be hard, especially when we are working in a large, complex, and traditional organization. There are many tools and processes that can be implemented; the challenge is, with a limited amount of time and resources, how can we ensure that we are working on the most important things first and doing them in the right order? In addition, being in large, complex, and traditional organizations, we naturally face many constraints and obstacles that are absent in smaller startups and digital native companies. At times, these obstacles can be overwhelming, daunting, and discouraging. However, if we take a systematic approach, focus on the biggest bottlenecks first, and accept the situation as it is rather than as we wish it to be, these same obstacles can become a source of creative energy. Then what’s in our way can point the way to meeting these challenges and working towards a solution.
In the race toward digital transformation, many companies are looking at adopting DevOps to deliver software faster and with better quality. But implementing DevOps is hard, especially if you are doing it in a large, complex, and traditional organization. Unlike many of our counterparts in Silicon Valley startups and other digital native companies, we must deal with many more obstacles, such as a lack of DevOps talent, competing priorities, legacy architecture, a lack of good test automation, and a management process that was established in an era of mass production and the PMO (Project Management Office).
Fortunately, many tools and processes already exist to help us deal with these difficult situations. For example, one useful framework, the Theory of Constraints (TOC), can help us systematically identify the biggest bottleneck in our workflow, so we can rework the system around it to break the bottleneck. Taiichi Ohno, the “Father of the Toyota Production System”, developed many simple but practical tools and processes, such as the Kanban board, WIP (Work-in-Progress) limits, one-piece flow, JIT (Just-in-Time), Five Whys, etc., to enforce process discipline and efficient flow in manufacturing. Many of these tools and processes have been successfully used in agile software development in the past twenty years.
The difficulty with implementing effective DevOps initiatives is not in tools and processes; rather, it’s more about our ability to think and act systematically, and to effect change in organizational behaviors. If we have a deep understanding of the software value chain and lean thinking, applying these tools and processes effectively can help us focus on the biggest bottlenecks and remove them first, which is the essence of enabling a steady and continuous flow.
In this report, I will share my experience of a successful DevOps implementation in a large, complex, and traditional organization with a great deal of legacy baggage. In less than 18 months, my team of 30 engineers was able to automate the build, package, and deployment process using Jenkins and many open-source tools. Using a limited number of automated tests supplemented by manual testing, we were able to work around our legacy architecture and the lack of a mature test automation practice, and to deliver 80% of the changes through a weekly CI/CD (Continuous Integration and Continuous Delivery) release cadence during normal business hours. More importantly, the automated deployment process reduced the deployment time from a 3-day and 3-night marathon event to less than one day with a much smaller team, and at the same time we cut defects and deployment failure rates by 85-95%. The improved release quality essentially eliminated the need for post-launch support that used to take 6-8 weeks for each big release.
I am currently a Senior Manager of DevOps with J.B. Hunt Transport, Inc., leading several enterprise DevOps initiatives to help the organization’s massive digital transformation in the transportation services industry. The story in this report, however, was based on my experience of first-time implementing DevOps with my previous employer in its Product Development organization. I will just call it Global Automotive.
I came to application development in early 2015 after spending many years in IT operations in Manufacturing at Global Automotive. My last job before moving to application development was IT Operations manager for one of the most challenging operations in Manufacturing at Global Automotive. My plant was making high-volume and high-value pickup trucks. There are many integrated IT systems on the seven miles of assembly line in the Final Assembly plant alone, and it produces a truck every 52 seconds when the line is running. For those who have read Dr. Mik Kersten’s book, Project to Product, my plant, and the manufacturing complex of which it is a part, are truly the “pinnacle of mass production” when it comes to efficiency and advanced implementation of Lean Manufacturing processes. As its head of IT, I was on call 24×7 and my job was very stressful. Many of our critical IT systems could directly impact production, so any significant downtime with them would have a large negative impact on the company’s top and bottom lines. You can imagine I was quite excited when I finally moved from plant operations to application development. I thought my new job would be so easy that I would be living in semi-retirement.
Of course, my semi-retirement dream burst within a few short weeks on the new job. Before I joined the development organization, Global Automotive had started a multi-year program in Product Development to implement a large and complex COTS (Commercial Off-the-Shelf) application suite that would provide its design and manufacturing engineers with an end-to-end suite of tools for designing and making vehicles. It was a complex and difficult program. My organization supported about 50 applications with a global user base of 30,000 engineers. Its computing environment included 700 on-prem servers running Windows, Linux, and Solaris. Most of the 50 applications were monolithic COTS applications with heavy customization. When the program had started two years earlier, my organization used the traditional waterfall development method, with long lead times, and used only manual methods to build, package, test, and deploy. We could only release software every 6-9 months in big batches, and each deployment took about 3 days and 3 nights nonstop. It took approximately 8 weeks to set up a laboratory environment and go through manual testing. Our release quality was bad. After each release, the teams typically spent 6-8 weeks fixing the hundreds of “post-launch” issues, and we euphemistically called our daily meetings the “post-launch forum.” As a result, our customers were not happy, and everybody on the program was stressed and miserable.
What’s even more interesting is that when I spoke with my colleagues about the miserable situation and asked if there was a better way of doing it, I was told things were much better compared to previous releases. It did not make any sense to me. I recalled my experience of working in the truck plant, where we could produce a vehicle with thousands of parts every 52 seconds; and as the trucks came off the line, we could turn the key and be confident that every vehicle would start and just work. I believed something was missing from our software delivery process. Over the next six months, I did an intensive study of how the best software companies deliver their software, and I was convinced that we should implement agile and DevOps. In 2016, I started experimenting with agile development with my team. I found that my many years of implementing lean manufacturing were a big help. As word of my team’s success got out, my management asked me to champion agile adoption for the entire organization.
As the development teams adopted agile, we quickly realized our biggest bottleneck was with the manual delivery process. Obviously, our old way of manually building, packaging, testing, and deploying software could not keep up with the amount of new changes by the 25 development squads, and it produced too many quality problems. I proposed to management that we explore CI/CD for automated software build and deployment. In mid-2017, I was given the opportunity to combine three existing teams, Software Package, Environment Management, and Release Coordination, into one team that we called CI/CD Transformation. My charter was to develop automated delivery pipelines so we can improve the throughput and quality.
3. Common Obstacles
Some of my experiences and obstacles were unique to my company and situation, but many were common across industries and most DevOps implementations. In the following sections, I will share four lessons that I believe are applicable to most situations.
3.1 Lack of DevOps Talent
For any company that is starting the DevOps journey, the lack of DevOps talent and expertise is almost a certainty. DevOps is still relatively new in the software development world, and it’s unlikely that any of us will have a group of qualified DevOps engineers waiting there for work to do. Many of you probably would do what I did, which is to re-group a bunch of manual packagers, manual testers, systems admins, and a few developers and call the new team CI/CD or DevOps. But calling ourselves DevOps does not mean we know how to develop good deployment pipelines.
One obvious alternative is to hire externally or bring in consultants. The difficulty here is that we never know what we are getting. In the last three years, I probably conducted at least 50 to 60 interviews for DevOps engineers and consultants. I can speak from firsthand experience that the vetting process is time-consuming and tedious, and that its outcome is often questionable. You see, DevOps is new, it’s cool, it’s hot, and many aspiring engineers and developers can easily take a few online courses on Jenkins, Chef, Ansible, etc., and put DevOps on their resume. I have seen one large and reputable consulting company claim that they had 5,000+ DevOps consultants on staff in 2017! Although it is entirely possible that these job candidates and consultants can work with some DevOps tools, it is difficult to imagine that they can help us expedite our DevOps journey to deliver software faster with better quality.
Through some painful early experience, I concluded that my best option was to develop my own DevOps talent. We all have good developers who care greatly about software delivery, and good admins who can code and are passionate about automation. They are the best candidates for DevOps engineers. My task was to convince them that the old, manual way of building, testing, and releasing software was not going to work, so they had to make the change. I spent a lot of time talking with my team in group settings and in one-on-one meetings, to communicate my view of where the industry was going with DevOps and automation, and the advantages of getting on board early. My rallying slogan was for everybody to become an “automator” instead of being “automated.”
Once we have our team motivated, we need to set solid and time-bound objectives for them to aim for, provide them with a clear road map on how to get there, and a nurturing environment for all to learn and grow together. One difficulty I encountered here was to provide a clear roadmap for becoming a qualified DevOps engineer. Nowadays this is not a big challenge with so many books, conferences, online and in-person trainings available. But back in 2017 it was still relatively difficult to find a systematic and “authoritative” resource for training DevOps engineers. To make matters worse, I was also new to the area and unclear about how to get my engineers trained.
I took a practical and low-cost approach. Since we were using Jenkins as our CI/CD platform, I asked all the team members to become proficient with Jenkins and Gradle first, so we could get started. As we progressed, some team members naturally started to get familiar with other tools. To promote group learning and better collaboration, I made a Herculean effort and moved the 20 engineers in North America into two adjacent team rooms. This allowed me to organize the CI/CD team into three agile squads, with two in North America and one in Asia Pacific.
Of course, not everybody in the team will have the motivation or the capability to make the transition to DevOps engineer. As leaders, we must be compassionate about people and make the best effort to help those who cannot make it to find a more suitable role. With Global Automotive this was not a big challenge because the company was big and had so many other opportunities. But when the time comes, we must also be willing to make the tough call and let some people go if their staying will have a negative impact on the performance of the entire team.
3.2 Focus on Flow and Continuous Improvement
As the team starts to put their newly acquired automation skills into practice, we may quickly realize that we cannot focus on the things that matter the most, which is to develop the automated pipelines. Very likely, we will need to continue with all the manual work for the on-going activities and at the same time, figure out how to develop the new capacities.
This situation is not unfamiliar to most of us. The unfortunate truth of life is that most of us suck at prioritizing and focusing on only the vital few things that truly matter; instead, we spread ourselves too thin. Our education system and workplace culture have conditioned us to take on more tasks and deliver as much as possible, regardless of the true value. Quite often, even though we know we should focus on the few vital things, the system probably will not allow us to. That is our human nature and social habit, and there is little we can do about it.
In Lean Manufacturing, we solved the issue of prioritization by focusing on flow and by removing the biggest bottleneck first. In analyzing our workflow and quality issues, I realized that many of the repetitive manual build and packaging activities were not only a big drain on resources, but also one of the biggest contributors to the bad quality. Obviously, if we could get some of the most labor-intensive and error-prone tasks automated first, we could free up the resources needed for automation. Fortunately for us, many of these manual tasks were very easy to automate, so we worked on them first. This freed up more resources for automation, which allowed me to move some senior engineers to more complex automation tasks. Very quickly, we reached a tipping point where we had more automation than manual work, and from there, we grew almost exponentially until we completely shifted the system constraint out of the deployment process.
I found that many of the tools and processes from lean manufacturing could be readily applied to CI/CD pipelines. For example, early in our pipeline development, we found that some of our pipelines took several trials to complete. I asked the team to perform a root cause analysis on each of the pipeline failures by asking Five Whys, and to ensure that we removed the root cause of each failure mode before the next deployment. I also borrowed another concept from Lean, FTT (First-Time-Through), which I defined as the inverse of the number of trials required for a pipeline to complete successfully. Our goal was to reach 100% FTT for all our pipelines, starting with the bottleneck pipelines. The technique is simple but very effective. In May 2018, our average FTT was 30-40% when we first used the pipeline for large-scale production deployment. By repeating this process five times in the lab, we improved FTT to about 80% in one month. Figure 1 shows an example of how we tracked this continuous improvement effort.
Figure 1. Continuous Improvement for Deployment Pipeline
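The FTT definition above is simple enough to compute directly. The following is a minimal Python sketch, not the tooling my team used: the pipeline names and trial counts are hypothetical, and taking the plain mean of per-pipeline FTT values is an assumption about how an "average FTT" could be calculated.

```python
from typing import List, Mapping


def first_time_through(trials: int) -> float:
    """FTT for one pipeline: the inverse of the number of trials
    it needed before completing successfully (1 trial -> 100%)."""
    if trials < 1:
        raise ValueError("a completed pipeline needs at least one trial")
    return 1.0 / trials


def average_ftt(trials_per_pipeline: Mapping[str, int]) -> float:
    """Plain mean of per-pipeline FTT (an assumed averaging method)."""
    values = [first_time_through(t) for t in trials_per_pipeline.values()]
    return sum(values) / len(values)


def worst_pipelines(trials_per_pipeline: Mapping[str, int], limit: int = 3) -> List[str]:
    """Pipelines that needed the most trials: the first candidates
    for a Five Whys root cause analysis."""
    ranked = sorted(trials_per_pipeline, key=trials_per_pipeline.get, reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical trial counts from one lab deployment.
    lab_run = {"build": 1, "package": 2, "deploy": 4}
    print(f"average FTT: {average_ftt(lab_run):.0%}")
    print("analyze first:", worst_pipelines(lab_run, limit=1))
```

Tracking this number after every lab deployment gives the kind of trend line shown in Figure 1, and the ranking keeps the root cause analysis focused on the bottleneck pipelines.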
3.3 Deal with Legacy Architecture and Lack of Good Test Automation
One of the main objectives of the agile and DevOps movement is to make smaller incremental changes, push them through the lower environments for frequent integration and testing, and then release them to production faster and more frequently so we can deliver value more often to our customers. Obviously, we cannot release components of a system independently and more frequently if our architecture is a big, monolithic, and tightly coupled system. In addition, for changes to move through our deployment pipeline continuously, we must have quick and reliable automated testing that we can trigger at different stages of the pipeline. Unfortunately, modernizing legacy architecture and developing good test automation both require a large investment and a long time, and they often lag behind our ability to automate the build and deployment pipelines. The critical question is, should we wait until we have a clean and loosely coupled architecture and good test automation before we implement automated deployment pipelines? The answer is no. There are many things that we can do to improve the quality and speed of our delivery, and at the same time drive the evolution of system architecture and test automation.
At Global Automotive, most applications supported by my organization were monolithic COTS applications with heavy customization. Naturally, there were a lot of hurdles to breaking changes into smaller features that could be deployed independently. Luckily, very early on in our journey, I realized that almost 80% of our customer change requests were simple and often cosmetic, such as style sheets, icons, UI (User Interface) configs, LOVs (Lists of Values), etc. These changes usually did not require any system downtime, were very easy to develop, and did not require extensive testing to validate. The build and deployment pipelines for these types of changes were also very easy to develop, and therefore served as a perfect playground for my newly minted DevOps engineers to practice and learn. From the customer value perspective, these changes were urgent and therefore considered high value because they satisfied the customers’ immediate needs. Because they were non-destructive, and most times could be well isolated, I was able to convince the customers and other stakeholders to release them during normal business hours on a weekly basis.
In the end, our weekly release program provided us many more benefits. As we steadily reduced the size of the backlog, the dev teams could focus more on the important and complex features, therefore, our overall quality started to improve. I suspect there was also a bit of psychology at play here. Because our customers and stakeholders all believed that we have the capacity to release changes “anytime” with CI/CD whenever they were ready, the incentive to rush the half-baked features through the process was removed. This further reduced the chance of pushing defects into production. As a result, we saw a dramatic reduction in our deployment failure rate and leaked defects, by 85-95% compared to the previous big releases. In addition, because we were releasing so many smaller but urgent features so frequently, customers often asked us to hold off the big releases until their organization was ready to consume. I called the weekly release the “Express Train” and the monthly or quarterly release the “Freight Train,” as illustrated in Figure 2.
The smaller batches and more frequent releases allowed us to adopt a risk-based approach to testing. From several hundred automated tests, we carefully selected nine tests that could be run reliably in all environments in under five minutes following an installation. We used them as the post-deployment smoke tests in our deployment pipeline. I also asked our customers to manually test the new features in the Test environment before they approved the changes as a release candidate, and then to validate again in Production following the release.
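One way to picture this kind of risk-based selection is as a filter followed by a budgeted pick. The Python sketch below is illustrative only: the `AutomatedTest` fields, the 99% reliability threshold, the risk scores, and the greedy strategy are all assumptions, not a description of how our nine tests were actually chosen. Only the five-minute budget comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class AutomatedTest:
    name: str
    runtime_seconds: float
    pass_rate_by_env: Dict[str, float]  # historical pass rate per environment
    risk_score: int                     # hypothetical: higher = covers riskier functionality


def select_smoke_tests(
    candidates: Sequence[AutomatedTest],
    environments: Sequence[str],
    budget_seconds: float = 300.0,      # "under five minutes"
    min_pass_rate: float = 0.99,        # assumed reliability threshold
) -> List[AutomatedTest]:
    """Keep tests that are reliable in every environment, then greedily
    take the highest-risk ones that still fit in the time budget."""
    reliable = [
        t for t in candidates
        if all(t.pass_rate_by_env.get(env, 0.0) >= min_pass_rate for env in environments)
    ]
    reliable.sort(key=lambda t: t.risk_score, reverse=True)

    selected: List[AutomatedTest] = []
    remaining = budget_seconds
    for test in reliable:
        if test.runtime_seconds <= remaining:
            selected.append(test)
            remaining -= test.runtime_seconds
    return selected
```

A test that is flaky in even one environment is excluded regardless of its risk score, because an unreliable smoke test would block the pipeline without saying anything useful about the deployment.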
Figure 2. Two-Track Release Mechanism
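The routing decision behind the two tracks can be expressed as a small rule. The Python sketch below is a simplified illustration under stated assumptions: the `ChangeRequest` fields and the set of low-risk change kinds are hypothetical names, and in practice the decision also involved agreement with customers and stakeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Track(Enum):
    EXPRESS = "weekly release during normal business hours"
    FREIGHT = "monthly or quarterly big release"


# Simple, often cosmetic change types named in the text (labels are hypothetical).
LOW_RISK_KINDS = frozenset({"stylesheet", "icon", "ui_config", "lov"})


@dataclass
class ChangeRequest:
    change_id: str
    kind: str
    needs_downtime: bool
    isolated: bool  # can be deployed without touching other components


def route(change: ChangeRequest) -> Track:
    """Send small, isolated, no-downtime changes to the Express Train;
    everything else waits for the Freight Train."""
    if change.kind in LOW_RISK_KINDS and not change.needs_downtime and change.isolated:
        return Track.EXPRESS
    return Track.FREIGHT


if __name__ == "__main__":
    print(route(ChangeRequest("CR-1", "stylesheet", needs_downtime=False, isolated=True)))
    print(route(ChangeRequest("CR-2", "schema_change", needs_downtime=True, isolated=False)))
```

Keeping the eligible kinds in one explicit set makes it easy to widen the Express Train as confidence grows, which is a design choice of this sketch rather than a record of how the criteria were maintained.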
3.4 Overcome Resistance
Of all the obstacles we encounter, overcoming people’s resistance to change is probably the most difficult. DevOps goes beyond toolchain implementation; it requires many organizational and process changes, which are often harder. Our success or failure depends more on our ability to communicate and collaborate with people than on our technical brilliance with toolchain development and implementation. This is a challenging area for most of us: let us be frank, if we were good with people and communication to begin with, many of us probably would not have chosen computer science or engineering!
But we must learn to do the uncomfortable things. To be effective, we must learn how to communicate more frequently and be patient with people. I found that if I have a genuine concern for people, for their wellbeing and interest, I would be more willing to spend time on communication to achieve consensus, and compromise if necessary, just to get things going. Quite often I settled on suboptimal solutions, because the best solution was often a no-go for some of the constituents. I realized that the benefits of small and early wins are more important than the best solutions, because these small incremental improvements often provide people with something tangible so they can see, feel, and touch, in order to get a better appreciation of DevOps. For example, after our first successful weekly release with only five small stylesheet changes, our customers immediately asked for a WAR update. I warned that the users may experience 2-3 minutes of website non-response time. I was told that was a small inconvenience that they were willing to accept. After that, we quickly expanded our weekly release to more and more types of changes.
Sometimes the obstacles and resistance come from our own management. DevOps is new, sexy, and trendy, so it helps us to get the initial management support. But the reality is that many of the people in senior management grew up in the traditional waterfall and project management paradigm, so it’s not natural for them to accept the “fail early and fail fast” way of innovation, as described in the book Lean Startup. Therefore, it is imperative for us to manage expectations and communicate clearly and more frequently, so we give our management the time to ease into this new approach. One time when we had a deployment failure in a test environment, I seized the opportunity and publicized the failure, and invited my management to celebrate the event with the team. It worked well to demonstrate the “fail early and fail fast” philosophy. It also boosted the team’s morale.
Other times the resistance came from delivery partners within IT. I found that the customers’ desire to deliver faster and more frequently can be a powerful lever to pull. For example, our customers’ enthusiasm to push more changes quickly helped me to overcome the resistance from other IT organizations such as Security, Operations, and Application Maintenance.
3.5 “The Proof of the Pudding”
Gene Kim, the founder of DevOps Enterprise Summit and author and co-author of several famous books on DevOps, once said, “DevOps should be defined by the outcomes…DevOps is not about what you do, but what your outcomes are.” I often joke with my friends and colleagues, “If your life is not easier with agile and DevOps, you probably don’t do it right.” What I really mean is that agile and DevOps should help us deliver software faster with better quality and do it in the way that makes everybody’s life easier, not harder. Putting it bluntly, the only measure of success for DevOps is in our improved ability to deliver quality software faster and with ease.
Using this yardstick, our DevOps initiative was a resounding success. From the time I formed the CI/CD team to the first time we tried the weekly release in production, it only took us six months. It took us another four months to complete the automated release pipelines for large releases that require us to shut down and restart hundreds of servers. Because of the automation and the consistency between the lower environments and Production, we proved to stakeholders that CI/CD is not only faster but also effective in gating product quality, as shown in Figure 3.
Figure 3. Deployment Failure Rate and Defects to Production
The improved deployment efficiency and release quality made everybody’s life better. For the big batch releases (which had become fewer and also smaller due to the more frequent weekly releases), we reduced the deployment and validation time from 3 days and 3 nights to roughly 12 hours, and we completely eliminated the dreaded 6-8 weeks of daily “post-launch forum.” As a result, both our customer satisfaction and our vendor relationships improved significantly.
4. What We Learned
My experience of implementing DevOps at Global Automotive taught me many valuable lessons. Yes, it is hard to implement successful DevOps initiatives in large, complex, and traditional organizations, due to the many challenges and constraints. Therefore, it is even more important for us to take a systematic approach, focus on the biggest bottlenecks first, and be careful and creative in dealing with the hurdles around legacy architecture, lack of good test automation, and the organizational inertia. By working diligently and persistently, we can increase our chance of success.
It has been over a year since I moved from Global Automotive to J.B. Hunt Transport. The two companies are in different industries with vastly different people, business processes, and technology stacks, but many challenges I face are surprisingly similar: a lack of DevOps talent, competing priorities, legacy architecture, a lack of good test automation, and people’s resistance to change. Fortunately, I also found that many of the fundamental lean/agile and DevOps principles and best practices are equally applicable and effective. Just as the Chinese word for “crisis” is a combination of risk and opportunity, in DevOps, what’s in the way is often the way.
Many thanks to all the team members at Global Automotive who took on this difficult journey with me, through many sleepless nights and emotional roller coaster rides during the process. Special thanks to Global Automotive IT management for giving me the opportunity to experiment with this new way of software delivery.
A special thank you to Michael Keeling, the software architect, agile practitioner, and the author of Design It! From Programmer to Software Architect, for shepherding me through the process of writing and refining this experience report. Michael’s generous support and guidance are invaluable to me.