This experience report tells the story of a team that has been working with automated tests since the beginning of the practice, even before the Agile Manifesto. The account begins with the invention of techniques, tools and processes, and shows how continuous improvement can evolve an environment that is extremely robust and secure, yet at the same time dynamic and scalable. Through examples of concrete actions aimed at productivity, quality and cost, we hope to demystify this practice and show that test automation is essential for the sustainability of any agile team, requiring only the proper attitude and commitment of its members.
Developing software with automated tests is one of the foundations of agile methodologies, because automation is the only way to guarantee continuous and sustainable delivery. Automating tests increases productivity, lowers costs, and achieves delivery rates an order of magnitude beyond manual testing. So why are most development teams still so reluctant to apply this technique? What are the difficulties, myths and beliefs that hinder the success of this investment?
In our team, we discovered the power of this technique almost 20 years ago. We did not have many references or tools to help us, but we knew that, as we delivered more and more, manual tests were too inefficient, time-consuming and costly to guarantee the robustness of our deliveries. We started by developing frameworks to support the technique and creating processes to foster a culture that took test automation into account throughout the software production cycle. We already had a large legacy developed without tests, but we bet on a model of continuous, incremental change driven by need itself: each new feature was only delivered with automated tests, and any change to legacy code also involved developing tests for the code it touched.
When bugs were found, the correction process required a replication script to be written first; only then was the bug fixed. Sometimes we needed bigger investments in architecture, but the gains were very rewarding. In this account we will tell this story, showing how it is possible, with attitude and innovation, to move step by step from a manual culture to a robust environment that is a reference in the large-scale application of agile methodologies.
Our company was founded in 1995 from a meeting of 'nerd' students who liked to program their own computer games. Passionate about software development, the founders intended from the very beginning to excel in the quality and sustainability of their deliveries, and so they chose the Smalltalk language, given the pure Object Oriented paradigm that the technology offered.
In fact, I believe this is one of the reasons for our innovation in test automation. While the environment provided high flexibility (untyped, open IDE, pure O.O., etc.), the community, especially the national one, was not very big, which generated the need to create our own tools and frameworks.
We started working with a Telecom company in the Pay TV sector, and in less than one year we already had a large system in place supporting the client's core business, with its own persistence layer and various concepts that would become Design Patterns only many years later.
3. Our Story
Despite being at the cutting edge in applying modern development standards, we were not immune to software bugs. This is inherent in any development process once it reaches a certain scale, complexity, and volume of code integrated in parallel by the team. Thus, as in other companies, we started with a test team responsible for validating new behaviors as well as re-executing scripts to ensure that no regressions were introduced. However, as senior as the test team was, at each new development it could no longer satisfactorily handle the scripts to be run, and so our defect rate increased with each new release.
After successive risk situations in the client's business, we concluded that it was no longer humanly possible to keep pace with the expected growth of our deliveries through manual tests. We wanted to release new versions every month, and we even had a fairly large implementation capacity, but it was no use if we were held hostage by the risk, or the time, of regression testing.
By simply calculating costs, we were able to see that we were actually wasting time on repetitive tasks and that it would be wiser to automate them through the automation of the testing process. Between 1997 and 1998, we then created the first version of a framework capable of registering and executing a set of test scripts, which we started to automate from then on.
Figure 1. Test Framework Screen
This framework already had many interesting features, some of them still not found in current test automation frameworks:
- Hierarchy of scripts;
- Recording of test results;
- Comparison between execution runs (making the analysis more sensitive to modules that started failing);
- Branch executions;
- Easy test navigation to the code where the break occurred;
- Use of database scripting that quickly assures the assembly of new environments from scratch.
Consequently, we began our journey of automated testing and were faced with new technological and cultural challenges, some of which we will present below.
3.1 Database Persistence
Problem: Slowness. At the beginning of our automated testing, our biggest concern was replicating the features tested by our manual scripts. Thus, our automation volume grew significantly in so-called ‘functional tests’ which validated end-to-end system behavior. We were more concerned with the scenarios we considered essential from the point of view of a system user.
We knew that one scenario could not impact any other, so for each script, the system simulated a start from scratch: from the initial login, through the creation of all objects, to the final test of the functionality we were trying to validate. Such a practice required a large amount of processing, but we identified that the biggest bottleneck slowing down the tests was the moment of committing to the database.
I/O operations are always the most time consuming and in observations we found that more than 50% of the test time was spent waiting for the disk response (that was before the year 2000 and the hardware was still very limited).
Our improvement: Insistence Layer. Our first improvement in the framework was to intercept, at test time, the commit command from the database driver and simply discard it. This is completely valid in the testing environment, as multiple sessions or data changes from other users are not expected in this scenario. What counts in a test is what is contained within that very session.
Even in the issue of unforeseen data loss, the physical redundancy does not make sense since in the test environment the data will be generated fresh for each test. We continued to use the information structure provided by the database, but without persistence on the disk.
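The mechanism can be sketched with a small driver abstraction. This is an illustrative reconstruction, not the report's actual code; the names `TransactionDriver`, `InsistenceDriver` and `Session` are hypothetical:

```java
// Sketch of the 'Insistence Layer' idea: in test mode the commit issued by
// the persistence layer is silently discarded, so all changes stay inside
// the session and never reach the disk.
interface TransactionDriver {
    void commit();
}

class DatabaseDriver implements TransactionDriver {
    public void commit() {
        // real driver: flush pending changes to the database (slow I/O)
        System.out.println("committed to disk");
    }
}

class InsistenceDriver implements TransactionDriver {
    int discarded = 0;
    public void commit() {
        discarded++; // swallow the commit: nothing leaves the session
    }
}

class Session {
    private final TransactionDriver driver;
    Session(boolean testMode) {
        this.driver = testMode ? new InsistenceDriver() : new DatabaseDriver();
    }
    TransactionDriver driver() { return driver; }
    void save() { driver.commit(); }
}
```

The key design choice is that the rest of the system is unaware of the swap: it keeps calling `save()` as always, and only the driver decides whether the commit touches the disk.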
Results: We had a very large decrease in runtime, making the whole environment much faster. Another peripheral but very valuable result was the ability to improve our debugging of any production bugs. With the 'Insistence Layer' on, we could connect in production and execute actions without fear of damaging the state of the database or polluting it with invalid information coming from tests.
3.2 Time Travel
Problem: Time-dependent scripts. In our functional tests, we sometimes encountered scenarios that needed to change the time of day, or even the date, to validate a behavior. Imagine, for example, a process that sends out mail on the birthday of people in the database. A correct test cannot simply execute the system logic directly. Instead, it should advance the time and wait for the process to be triggered automatically by the system itself.
Our improvement: Virtual Timer. As a basic framework of our system, we had already developed a way of synchronizing clocks between the database server and client stations. Our addition here was to simulate the system clock itself so that while in testing, we had the ability to change the system date/time and the whole environment behaved as if that time period had actually passed.
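A minimal sketch of such a virtual clock, assuming (hypothetically) that all system code asks one clock object for the current time rather than calling the operating system directly:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the 'Virtual Timer': every part of the system asks this clock
// for the current time, so a test can jump to any date and the whole
// environment behaves as if that time had really passed.
class VirtualClock {
    private Instant frozen; // null = follow the real clock

    Instant now() {
        return frozen != null ? frozen : Instant.now();
    }
    void travelTo(Instant target) { frozen = target; }   // jump to a date
    void advance(Duration delta)  { frozen = now().plus(delta); }
    void backToRealTime()         { frozen = null; }     // leave test mode
}
```

A birthday-mailing test could then `travelTo` the eve of the birthday, `advance` one day, and assert that the automatic process fired.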
Results: As a consequence, we gained the ability to test time-dependent scenarios, and thus our business tests came much closer to the expected behavior of the system. A further benefit of the practice was the reduction of intermittent scenarios. For instance: if you have a process that runs on the first day of each month and your tests do not account for it, you may be surprised by behavior that differs only on that day. In practice, you will get results that are not the same, even without any code changes.
3.3 Ramp up Delay
Problem: Repeating the initial setup. As we increased the number of scenarios, we progressively found situations where the same set of initial information was needed for the validation of various behaviors.
We already had the concept of 'InitSetup', but this process, following the atomicity concept of a test, was very time-consuming and was repeated several times for each new scenario.
Our improvement: Hierarchical environments. Our solution was to also hierarchize the data setup environments. That is, we had an initial base and, from this base, we created overlapping layers. These could be subdivided, creating a tree of data layers from which each scenario drew according to the set of information needed for its tests.
This solution maps very neatly onto a feature offered by some databases: 'savepoints'. This feature allows establishing save points and performing subsequent partial rollbacks. Thus, in our implementation, each layer saved its initial 'savepoint', making navigation in the tree straightforward: to restore a previous level, one just needed to roll back to the chosen 'savepoint' and then build a new set of data.
Figure 2. Data Hierarchical Environments
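The layering can be modeled in memory to show the intent. This is an illustrative sketch only; the class `LayeredEnvironment` is hypothetical and stands in for the database savepoint mechanism:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// In-memory model of the hierarchical setup environments: each layer of
// test data marks a 'savepoint'; rolling back to a savepoint discards
// everything built on top of it, so sibling scenarios can reuse the
// shared base without rebuilding it from scratch.
class LayeredEnvironment {
    private final Deque<Map<String, String>> layers = new ArrayDeque<>();

    int savepoint() {                    // open a new data layer
        layers.push(new HashMap<>());
        return layers.size();
    }
    void put(String key, String value) { // write into the newest layer
        layers.peek().put(key, value);
    }
    String get(String key) {             // search from the newest layer down
        for (Map<String, String> layer : layers) {
            if (layer.containsKey(key)) return layer.get(key);
        }
        return null;
    }
    void rollbackTo(int savepoint) {     // drop all layers above the savepoint
        while (layers.size() > savepoint) layers.pop();
    }
}
```

With a real database, `savepoint()` would map to `Connection.setSavepoint()` and `rollbackTo` to `Connection.rollback(Savepoint)` in JDBC terms.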
Results: In addition to the gain in performance in the execution of the tests, just like the layer of insistence, with this improvement we have also enhanced our control of production environments because in these we could also create layers of data to perform any tests and thus facilitate the search for bugs.
3.4 Divide and conquer
Problem: Full test execution takes too long. After four to five years, we already had more than 5,000 functional scenarios in our main product, all of them run before any build, or run sporadically when someone wanted to validate a new development. The complete execution of the tests took more than five hours on average development hardware, which is a long time to wait for a result.
Our improvement: Multi-server. Distributing the execution was one of the most immediate solutions once the delay became critical. Given the independence of the scripts and the hierarchical data model, we were able to establish two distinct distribution models: separating the scripts, or separating the environments on which the scripts depended. In the latter model, separate machines carried different environments, reducing the time execution stood idle. It was a relatively low investment, given the characteristics we had developed previously.
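The script-separation model can be sketched as a scheduling problem. The following is a hypothetical illustration (the class name and greedy longest-first strategy are my own, not necessarily what the team used): scripts are spread over servers so that the most loaded server, which bounds total run time, stays as light as possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Distribute test scripts over N servers: take scripts longest-first and
// always hand the next one to the currently least loaded server.
class ScriptDistributor {
    static List<List<String>> distribute(List<String> scripts,
                                         List<Long> durations, int servers) {
        Integer[] order = new Integer[scripts.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // sort script indices by estimated duration, longest first
        Arrays.sort(order, Comparator.comparingLong(i -> -durations.get(i)));

        List<List<String>> buckets = new ArrayList<>();
        long[] load = new long[servers];
        for (int s = 0; s < servers; s++) buckets.add(new ArrayList<>());

        for (int i : order) {
            int lightest = 0;                 // server with least load so far
            for (int s = 1; s < servers; s++) {
                if (load[s] < load[lightest]) lightest = s;
            }
            buckets.get(lightest).add(scripts.get(i));
            load[lightest] += durations.get(i);
        }
        return buckets;
    }
}
```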
Results: The obvious gain was the time needed for one full execution: we managed to decrease it from more than five hours to less than one, and growth in the number of tests was no longer such a critical problem. Over time, however, we noticed that even this solution had its limitations, because running in a distributed way also required several servers, including database servers, resources that at the time still had a considerable cost.
3.5 Many Worlds
Problem: Different combinations for each client. Although we knew the techniques of writing good tests, validating limits and trying to cover more than just the 'happy path', we sometimes came across combinations at clients that we had not thought about during development. This is quite common in large systems, where customization happens through numerous parameters or specific sets of implementations. The combinations generate an exponential number of scenarios, often making a different test for each situation impossible.
Our improvement: Client profiles. The solution was the complete extraction of the settings into a 'client profile.' This profile labeled a real client and held all the characteristics that determined the system's behavior for it. By loading this profile before the automated tests were executed, we were able to run the system as the client would in production. In this way, we could run the entire battery of tests for client A, again for B, for C, and so on.
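A minimal sketch of the idea, with hypothetical names (`ClientProfile`, `ProfileRunner`): the same suite is replayed once per real client profile instead of once per parameter combination.

```java
import java.util.Map;

// A profile bundles one real client's production settings.
class ClientProfile {
    final String name;
    final Map<String, String> settings;
    ClientProfile(String name, Map<String, String> settings) {
        this.name = name;
        this.settings = settings;
    }
}

class ProfileRunner {
    // Run the given suite once for each profile, loading the profile's
    // settings before execution, and report how many profiles passed.
    static int runForAll(Iterable<ClientProfile> profiles,
                         java.util.function.Predicate<ClientProfile> suite) {
        int passed = 0;
        for (ClientProfile p : profiles) {
            if (suite.test(p)) passed++;
        }
        return passed;
    }
}
```

The design point is the shift it encodes: the test matrix shrinks from "all possible settings" to "all settings actually deployed".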
Results: With this improvement, we considerably reduced the possible combinations to be tested, meaning our concern shifted from validation of all possible combinations to only those valid for our existing clients.
3.6 Teleporter
Problem: Changing the programming language. I believe this was the biggest technical challenge faced by our development team. In the middle of 2004, a large company bought our system, under the condition that we no longer use Smalltalk. The Smalltalk community in Brazil was almost non-existent and the company did not want to depend on this technology. We accepted the migration to Java, but there was another major concern: we had several other customers who kept requesting new improvements, and our team was almost completely occupied with new developments in the system.
If we started a new development in Java, it would never catch up with the Smalltalk version without a long freeze on new features. The use of Smalltalk was in itself a differentiating factor for our company, with a delivery speed significantly greater than our competitors'.
Coupled with these practical issues there were still some political constraints: big companies do not like big changes, so for our other customers, a language change could not have any impact on the behavior, the visuals, or any other aspect of the current system. In initial calculations, we would have to practically triple the size of the team to account for the migration, and even then the timing would be very questionable.
Our improvement: Teleporter. Our only alternative, given this scenario, was to develop a “teleporter” capable of reading the Smalltalk code on one side and generating fully functional Java code on the other.
Figure 3. Smalltalk to Java Teleporter
We initially thought it was a very risky idea given the paradigm changes: untyped to typed, interpreted to compiled, pure O.O. to pseudo-O.O., etc. We also noticed that several similar attempts in the market had not been successful. But we had one very valuable difference from those attempts: for this system, we had almost 10,000 functional scripts ensuring that the delivered system met our clients' requirements.
Therefore, what we had to do was build a testing framework in the new language, and our teleporter would have to generate valid test scripts for this framework. Once the tests were teleported, teleporting the rest of the system would be only a matter of time. We did some spikes and found it was an achievable job with a promising solution.
Results: After a little more than one year of work with only three developers, we finally released a fully functional teleporter. With just one click, the application was able to read our entire Smalltalk code base (about 1 GB) and generate another, completely new one, compiled in Java. All our tests ran in both languages with positive results.
All ongoing development in Smalltalk continued at the same pace, because conversion by the teleporter was instantaneous. Specific developments could be made only in the already teleported version, without impact on the original code. We still made a considerable effort to make the teleported code legible and easy to maintain. All this work, however, went into automating certain conversions, so it caused practically no duplication of effort. The final turnaround to Java was just a decision like, "OK, from now on all maintenance and increments go on the teleported version."
In the end, we managed to keep all the initial premises and had absolute success in this change. If we could quantify the value of the automated tests, this case alone would be enough to justify our entire investment from the previous years. With reasonable effort and time, we were able to re-implement a giant legacy system more than a decade old, by automating the conversion and relying on the scripts we had already created.
3.7 Database or Memory
Problem: Database performance. Although we were no longer persisting to disk in the test environment, the translation of objects to the relational model of a database began to show itself as a bottleneck as the number of scripts grew. Handling objects directly in memory is much faster whenever memory is available. But what to do about database-specific optimizations, such as in-clause queries?
Our improvement: SQL to EQL. Our decision at this point was to remove any direct reliance on databases. To do so, we created a translation layer for query commands called EQL - Entity Query Language. This format was able to query in-memory objects directly or, in a production environment, translate the queries to plain SQL, thus maintaining the performance required for large data volumes. It was possible to choose whether tests ran only in memory or connected to a database.
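The dual execution path can be sketched as a query object with two back ends. This is an illustrative reduction to a single equality condition; the real EQL was certainly richer, and the class name `EqlQuery` is hypothetical:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// One query definition, two execution paths: filter in-memory objects
// directly (fast path used by development tests), or emit plain SQL for a
// real database (production path).
class EqlQuery<T> {
    private final String field;
    private final String value;
    private final Predicate<T> inMemory;

    EqlQuery(String field, String value, Predicate<T> inMemory) {
        this.field = field;
        this.value = value;
        this.inMemory = inMemory;
    }
    // in-memory execution path
    List<T> run(List<T> objects) {
        return objects.stream().filter(inMemory).collect(Collectors.toList());
    }
    // database execution path: translate the same query to SQL text
    String toSql(String table) {
        return "SELECT * FROM " + table + " WHERE " + field + " = '" + value + "'";
    }
}
```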
Results: This approach not only allowed a high gain in test execution performance but also made us revisit and restructure all database dependencies. This has brought a more structured architecture by breaking down unnecessary dependencies and improving the quality of the existing code. At development time the team started using only memory tests. In the generation of the builds, however, we still opted for the database-dependent execution to avoid any disparity due to database changes beyond our control.
3.8 Production is Complex
Problem: Exponential combinations. As discussed earlier, combinations of client configurations generate an exponential number of scripts, which are infeasible to validate completely. What we began to notice is that not only combinations of configurations generated unforeseen situations, but also combinations of data sets, which in complex systems can change the way the system behaves. Thus, we progressively found sets of factors that generated unexpected situations, but only manifested in some specific production environments. The scenario-prediction approach was no longer sufficient for the level of quality we were seeking.
Our improvement: Hot Data Test. We began to rethink the way we carried out our tests and to re-evaluate whether there were smarter ways to reduce the risk of unpredictable production scenarios. We created a new concept of automated test: the Hot Data Test. This concept is based on the assumption that the test will be carried out on an already densely populated database.
The objective, in this case, is no longer the exact correctness of the numbers and detailed information produced by the actions, but rather the assurance that the flow of screens and the system's availability remain unaltered after executing cycles of commands common in production. These scenarios, even in small numbers, brought great benefits, because they gave us the ability to handle real database copies from our clients (with data masked), which allowed us to perform a series of validations before new builds were introduced.
Results: In addition to preventing many errors that we would only find in production, with this technique we also discovered that we could validate, by automated testing, performance issues, thread handling, and other general aspects of the system. As an added benefit, we also began to validate an aspect we had previously no control over: manual database changes performed by our clients.
Our system’s database is generally owned by our customers and, even though it is not recommended, some of them perform manual changes outside the system and eventually cause data inconsistencies. ‘Bugs’ from these situations are difficult to detect because the data disagrees with what the code expects.
Using the ‘Hot Data Tests’ technique we started to identify such scenarios because we validated the main flows directly over a real copy of the production database. Such inconsistencies generated anomalies in the tests and indicated that something was not in accordance with the flow predicted by the use of the system.
3.9 Cluster Execution
Problem: Partial execution does not work. With 20,000 test scripts, even with all the enhancements to ensure fast execution, a complete run still took a very long time. We realized that, in practice, most developers no longer ran the full tests during development and selected only the branches where they imagined the new implementation could cause errors. This team behavior began to generate an unstable environment, because whenever a complete execution was attempted for a new build, hundreds of tests failed that had not been run at implementation time.
Our improvement: Clustering and Nightly build. We needed faster, more accurate answers. We then dedicated a cluster of machines to running the tests. We increased the size of our server farm and made this set available so developers could have more processing power. We also set up a complete daily build: every night, regardless of whether the developers had run the tests, a new build was generated and the complete set of tests was run.
Results: Initially, this approach showed great results, because it reduced the test feedback to at most one day. Failures were evidenced the day after the nightly build, which carried a report of all the tests that were breaking. However, it also created a negative situation, as explained below.
3.10 Who is the Culprit
Problem: Long time to discover who broke the build. We had a nightly build; we ran the entire battery of tests, and the result came after eight continuous hours on a large cluster of machines. The problem was that almost every day some scenarios broke. Who was to blame? The culprit was obviously one of those who had integrated code the day before. Still, with almost a hundred developers involved, how could we prevent everyone from spending half of the next day just revalidating everything?
Our improvement: Culprit rank algorithm. Our solution was to make the search for culprits efficient. We created an algorithm for ranking the possible originators of a failure. When a failing script was presented, the possible culprits were also listed, ordered according to the code changes they had made. If the first one on the list did not identify himself as the culprit, the next one was approached. If no one on the list took the blame, the first one then had the obligation to correct the problem, regardless of who caused it.
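One plausible way to order the candidates, offered here purely as an assumed sketch (the report does not describe the actual ranking criterion), is by how much each developer's changed code overlaps the code exercised by the failing script:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Rank the previous day's committers for a broken test: developers whose
// changed files overlap most with the files the failing script exercises
// come first in the suspect list.
class CulpritRanker {
    static List<String> rank(Set<String> filesTouchedByFailingTest,
                             Map<String, Set<String>> filesChangedByDev) {
        List<String> devs = new ArrayList<>(filesChangedByDev.keySet());
        devs.sort(Comparator.comparingInt(
            (String dev) -> -overlap(filesChangedByDev.get(dev),
                                     filesTouchedByFailingTest)));
        return devs; // the first entry is approached first; if nobody owns
                     // up, that first entry fixes the problem anyway
    }
    private static int overlap(Set<String> a, Set<String> b) {
        int n = 0;
        for (String f : a) if (b.contains(f)) n++;
        return n;
    }
}
```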
Results: This solution reduced the effort spent searching for the causes of test breaks, but it nevertheless generated a certain amount of conflict around culprit identification. Another negative factor involved nomenclature adjustments and style changes: even though they altered no behavior, the developer was included in the list of possible culprits and often had to correct problems he himself did not cause.
Notwithstanding, this automated solution is very interesting and we use it to this day, even though we no longer have nightly builds.
3.11 Multiple clusters
Problem: Tomorrow is too late. The nightly build was a breakthrough in our process, but it did not solve all our problems. Having feedback only the next day is a very costly delay, especially in work where context switching generates most of the defects. Often, problems in the build itself prevented the scripts from running at all, delaying feedback by two days or more.
We had, at that time, about 25,000 functional test scripts in our main system. For comparison, on average developer hardware, executing the scripts would take about 48 uninterrupted hours. Another difficulty arose when we needed a build on the same day, because of business decisions or an emergency patch: we had to wait for the nightly build.
Our improvement: Continuous build in virtual machines. Our response was to try to expand the cluster capacity of our fleet of machines to take advantage of the development machines themselves. We created virtual machines in the stations that, together with virtual machines in the server farm, generated a large cluster, which continuously processed build requests.
Each new request entered a queue that was consumed sequentially by the build generator. To avoid 'freezing' periods, we used an optimistic premise: each branch to be tested was automatically synchronized with the trunk (downmerge) and, after successful generation, committed to the trunk (upmerge).
Results: The process had all the characteristics for success: it was a continuous execution process and sought to be as automatic as possible. The result, however, generated more misalignment in the team than before the improvement. With the possibility of more than one build per day, we inadvertently triggered a 'priority conflict': each project or team started claiming that its development had higher priority and arbitrarily removed others from the queue so its own branch would go first. The mediation of such conflicts escalated to the executive level of the company.
3.12 Clouding Tests
Problem: We need to scale without limit. Beyond 30,000 scenarios, prioritization became chaotic. Due to priority conflicts, delayed executions and late feedback, we observed that the integration process often consumed 30 to 40% of the effort of developing the functionality itself. It was a huge waste, since it did not actually generate any perceived value; it was just an internal issue of our software production process. We needed a truly disruptive solution that would favor the most fundamental goal of automated testing: getting quick feedback on our development so we could properly direct our path.
Our improvement: Clouding tests. In the middle of 2014, we then completely changed our approach! Instead of being constrained by infrastructure, we discovered the power of cloud processing and decided to use it to run our entire testing environment. We implemented a variety of plug-ins and tools so that every commit attempt by a developer automatically triggered a cluster of tens or even hundreds of machines in the cloud. Another result was the possibility of automating the upmerge whenever all test scripts passed on a branch build.
Finally, we reached continuous deployment. We were no longer limited by physical resources, and we now had control over the feedback time we wanted.
Results: While the language migration was the biggest challenge for our development team, I believe the move to cloud testing was our biggest breakthrough in terms of scalability. Obviously, the whole set of continuous improvements gave us the capacity for this jump, but this change was the most noticeable in terms of efficiency gain. We were able to keep expanding our team, avoid wasted work, improve our delivery capability and, consequently, the quality of the product. One reason is that developers no longer lose focus on their task while waiting for test execution feedback.
It was not an easy task because, in order to make good use of resources, we had to extend the functionality offered by the cloud environments so we could control the assembly and disassembly of new instances ourselves. But it was worth it! Currently, we have a very robust, fully automated and transparent process, available to every developer. There are no more worries about schedules, priorities, or breakages caused by others. With just one click, all the 'magic' happens and a new build is generated from freshly, fully validated code. We have a build which is ALWAYS stable.
From this point, we were able to expand our horizons by integrating static quality analysis (e.g., SonarQube), validation of style patterns, testing on different operating systems, and other improvements that were simply impossible before, since the basic process already consumed all our resources. A very interesting detail: today, with twice the size of the 2013 team, running hundreds of automatic builds daily and taking about 95% less time to respond, our absolute infrastructure cost is lower than in the past.
4. What We Have Learned
Analyzing our entire history of automated testing, it is hard to point out all our lessons learned. For us, developing with test automation is no longer a matter of choice; it is our natural way of developing software. For teaching purposes, however, I believe we can list our greatest perceived benefits along this trajectory:
- Reliability in new version deployment: Test automation has changed the way we face a go-live. From ‘extremely critical and stressful moment’ to only a normal phase in our delivery process;
- Cost Reduction: Automated testing is cheaper than manual testing;
- Fewer bugs: We have reduced by more than 80% the incidence of bugs. Critical bugs have become extremely rare. Repeated bugs have simply disappeared;
- Third-party error prevention: In scenarios involving communication with other systems, we prevent faults caused by others, as well as interface errors in our own system. We build reliability in the eyes of others, facilitating the analysis of integrations;
- Reduction of analysis failures: By thinking about automated tests during the initial design of the software, we have also reduced the incidence of analysis errors, because it forces more structured and concrete thinking about the expected results;
- Transparency of problems: Automated testing makes scenarios clearer and more direct. Good tests are able to validate every aspect of development and allow for timely improvements and fixes;
- Reduction of development time: As contradictory as it may seem, developing with automated testing is much faster than without. One only needs to change the culture and guarantee daily use by the team;
- Refactoring: Technical debt is something that can be tackled without fear. We realized this when we started using automated testing on a large scale: people simply were not afraid to improve their code anymore;
- Validation of alternatives: With automated tests, go/no-go analysis is much more accurate. We can quickly demonstrate impacts on the system and validate whether an investment is technically advantageous or not;
- Training and documentation: Through the system coverage provided by automated testing, we realized that the scripts also started to be used as a way of explaining our system to new team members and even new clients. It is living documentation, always up to date with the features of our software.
I hope this account will be of use and inspiration to those who are still reluctant to unreservedly use this technique. For us, test automation is still the only way to maintain and scale a system and, in fact, in the current software engineering scenario, delivering a product without automated testing should be considered a breach of professional ethics:
“No excuses! It is our basic responsibility to ensure the sustainability of what we are delivering.”
I would like to express my gratitude to all who participated in this story. After all, my report is only a compilation of thoughts and experiences coming from a great team. The merits belong not to one member or another, but to the whole, combined with a restless and innovative spirit. We have achieved success through hard, incessant, provocative and questioning work.
Thank you all for the opportunity to have participated in this trajectory!
And last but not least, thank you Tim O'Connor, my shepherd in writing this paper. Your support and comments have been very gratifying and certainly crucial to the quality of this article.
Copyright 2017 is held by the author(s).