Experience Report

Transforming into an Agile Scrum Framework

About this Publication

Around 2008, COL changed its software development framework in an attempt to shorten its long 9-12 month release cycles and become more agile, with the ability to bring new features and products to market in less than 3 months. The company hired a new VP of Engineering, who acted as a catalyst for the agile transformation. Cross-functional scrum teams were formed, each including 3-5 software engineers, 1 UX designer, and 1-2 QA people. About 30 teams were formed, with an average team size of 7-9 members. A new Agile Program Management Office was formed to oversee the scalability of this new organizational structure and to establish sound Agile practices that keep the organization on track. A Product Management group was also formed to provide leadership to the scrum teams on which features to develop and how to maximize business value and return on investment. At first the teams ran 1-month sprints; this was later reduced to a 2-week sprint cycle, which seemed to be the sweet spot for the teams. In general, they released their code to Operations once a quarter.

All of a sudden, Operations had to deal with 30 releases per quarter. Because Operations was also releasing to the staging environment, and we needed to re-stage a couple of times, this added up to about 60-90 releases every 12 weeks. Operations provided services to the Engineering team, such as building test environments, implementing the required monitoring, and building production environments, including networks, hardware, and security. Until about 2 months into their development effort, the development teams did not have enough information about their design and about how the product would run in a production environment. At that point they would start engaging Operations with requirements in the form of service requests for monitoring, hardware, access, and so on. These ad hoc requests, made without collaboration, left Operations in a chaotic mode: we often learned that the Engineering requests were inaccurate or incomplete, so we ended up redoing a lot of work, or realizing on the night of the release that something vital for the release had not been requested at all.
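The release arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the figure of 2 to 3 deployments per release (staging, re-staging, and production) is an assumption chosen to reproduce the 60-90 range quoted in the report, and the function name is hypothetical.

```python
TEAMS = 30              # scrum teams, each releasing about once a quarter
WEEKS_PER_QUARTER = 12


def deployments_per_quarter(deployments_per_release: int) -> int:
    """Total deployments Operations performs in one quarter.

    deployments_per_release is an assumed figure that covers staging,
    re-staging, and the production release itself.
    """
    return TEAMS * deployments_per_release


low = deployments_per_quarter(2)
high = deployments_per_quarter(3)
print(f"{low}-{high} deployments per quarter")
print(f"{low / WEEKS_PER_QUARTER:.1f}-{high / WEEKS_PER_QUARTER:.1f} per week")
```

Under these assumptions Operations handles roughly 5 to 7.5 deployments every week, in addition to its planned work.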

Because Engineering was not really ready to talk about operational requirements in the first two months, and because of resource shortages in Operations, we never added an operations person to the scrum teams as a full-time or even part-time member. Operations staff were assigned on an as-needed basis. Because building out a new environment could take up to 60 days, these timing issues could delay releases by a month while waiting on hardware. Because Operations was not represented in the design phase, and because the backlog was prioritized solely on ROI, the availability and scalability of the application were a low priority. As a result, applications that were hard to deploy and, once deployed, hard to maintain were frequently developed. By that point it was so late in the SDLC that Operations, although uncomfortable releasing the code, did not have enough power to ask for any change, as it would have caused a severe delay. Ops Management therefore identified 4 main areas that needed to be addressed to improve the operational performance of the product.

The first was the operational design of the product. We formed an Operations Design Team whose sole purpose was to collaborate with the engineering scrum teams as early as possible to create the design artifacts, which in turn would identify the work that the rest of the operations teams needed to perform.

This step also addressed the second problem: Operations needs to receive complete information as early as possible so that the required environments can be completed. It can take up to 90 days to build a new environment, depending on the type of hardware required. The key to success was that we identified SOPs that the design team followed religiously. We documented those SOPs and started to train representatives of the engineering scrum teams on those design principles, enabling them to better meet operational requirements. The team also worked with Engineering to create tools such as the Load Balancer configuration tool, which enables almost automatic configuration of load balancers in any environment and can be used by developers to test their applications. Out of this exercise came the requirement to use a private cloud where developers can bring up and tear down servers on demand.

The third area was deployment. In general there are two types of deployments. The first type does not require changes to the infrastructure, meaning that network, hardware, or security-related components do not need to be modified. The second type requires new hardware, a new network, or even an entire data center to be built. Deployments of the first type sometimes came without any warning from the engineering teams, and an almost immediate turnaround was expected. This became especially vital because the Engineering team liked to use the staging environment, which was maintained solely by Operations, for their final testing.

Because staging was the environment that was identical to production, QA frequently found bugs there that had not been detected in the integration or QA environments. An Engineering representative would put in a request to stage a deployment, Operations would stage it, bugs would be found, and the code would be fixed. Two hours later they would be back with a new staging request, and another set of bugs would be found. After a couple of these runs the operations team's frustration ran high, because the work actually planned for that week was not getting done as a result of re-staging buggy code 2-3 times. These staging requests generally took 2-4 hours each to perform.

To address this issue, Operations created a 2-person deployment team whose sole mission was to stand by and, as soon as a deployment request came in, jump on it and deploy the code. That sped up the response time and made the engineering organization more efficient, as they no longer had to wait around for their deployments.

Operations also recommended building a pilot environment, just like production, where the development teams could perform their testing in an exact production-like environment. This would leave the staging environment to Operations, so the deployment engineer could test just the actual deployment steps and not the code.

Recently we have started to use automated deployments, which make it possible to test the deployment process itself: the same steps are used to deploy into all environments, and the only thing that changes is the hardware being deployed to. This has brought a lot of efficiency into the process and freed up the time that deployment staff spend on deployments.
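The idea that every environment receives the same steps and only the target hardware differs can be sketched as follows. This is a minimal illustration, not the tooling COL used: the step names, host names, artifact name, and the `Environment` and `plan_deployment` identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Environment:
    name: str
    hosts: Tuple[str, ...]   # the only thing that differs between environments


# Hypothetical step names. The point is that one shared list drives
# every environment, so staging exercises the production procedure.
DEPLOY_STEPS = ("stop_service", "install_artifact", "start_service", "health_check")


def plan_deployment(env: Environment, artifact: str) -> List[Tuple[str, str, str]]:
    """Return (host, step, artifact) tuples in execution order."""
    return [(host, step, artifact) for host in env.hosts for step in DEPLOY_STEPS]


staging = Environment("staging", ("stg-app-01",))
production = Environment("production", ("prod-app-01", "prod-app-02"))

for env in (staging, production):
    for host, step, artifact in plan_deployment(env, "app-1.4.2.tar.gz"):
        print(f"[{env.name}] {host}: {step} {artifact}")
```

Because the step list is shared, a successful staging run in this sketch has already exercised the exact sequence that production will receive.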

The fourth big area is the run-time maintenance of the application. Operations in general would like to receive a run book that outlines how to run the application: how to start, stop, and monitor it, what types of alerts to expect, and what to do when those alerts arise. To address this challenge we formed an Application Operations team responsible for running the application at run time. This team is responsible for performing disaster recovery testing, responding to alerts, and keeping the application up and running. It also provides a monthly review of application performance, along with a list of improvement requests that would improve the user experience and give the team a better opportunity to provide higher availability or scalability. Out of this collaboration came a lot of valuable improvements that made our products more reliable and resilient.
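One way to make the run book expectations concrete is to treat the run book as structured data and check it for gaps before a release. The sketch below is an assumption about how that could look; the `RunBook` fields mirror the items listed above (start, stop, monitor, alerts and their responses), but the class, the function, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunBook:
    application: str
    start: str = ""      # how to start the application
    stop: str = ""       # how to stop it
    monitor: str = ""    # how to monitor it
    # Expected alerts, mapped to what to do when each one fires.
    alerts: Dict[str, str] = field(default_factory=dict)


def missing_sections(book: RunBook) -> List[str]:
    """List the run book sections that are still empty."""
    missing = [
        name for name in ("start", "stop", "monitor")
        if not getattr(book, name).strip()
    ]
    if not book.alerts:
        missing.append("alerts")
    # An alert without a documented response is also a gap.
    missing.extend(
        f"alerts[{alert}]"
        for alert, action in book.alerts.items()
        if not action.strip()
    )
    return missing


draft = RunBook(application="example-app", start="run the start script")
print(missing_sections(draft))
```

A check like this could be run as part of release readiness, so that an incomplete run book is noticed before the night of the release.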

How did ops utilize Agile and Scrum?

We started an operations scrum for each functional team inside Operations. The teams run one-week sprints: each team goes through its prioritized backlog and commits to the items it has the capacity to perform. The team managers get together every Monday and review the committed backlog for the organization, making priority decisions if needed. There is a 1-hour meeting in which our customers can ask for escalation of any item that is essential to their success. After the sprint planning meeting, the teams start working on their backlog and report progress in daily stand-ups. This has enabled us to understand the progress and capacity of the individual scrum teams. It also ensures that we are working on the highest-priority items.
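The commitment step, walking a prioritized backlog and taking on what fits the team's capacity, can be sketched as below. This is an illustration under stated assumptions: estimates in hours, the rule of skipping an item that does not fit and continuing down the list, and all names and sample items are hypothetical, not COL's actual process.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BacklogItem:
    title: str
    estimate_hours: float


def commit_to_sprint(prioritized_backlog: List[BacklogItem],
                     capacity_hours: float) -> List[BacklogItem]:
    """Commit to items in priority order while capacity remains.

    Assumption: an item that does not fit is skipped, and smaller
    lower-priority items may still be committed after it.
    """
    committed = []
    remaining = capacity_hours
    for item in prioritized_backlog:
        if item.estimate_hours <= remaining:
            committed.append(item)
            remaining -= item.estimate_hours
    return committed


backlog = [
    BacklogItem("Build QA environment", 24),
    BacklogItem("Add monitoring for new service", 8),
    BacklogItem("Rack new hardware", 16),
    BacklogItem("Grant host access", 2),
]
sprint = commit_to_sprint(backlog, capacity_hours=36)
print([item.title for item in sprint])
```

With 36 hours of capacity, the first two items and the last one are committed, while the 16-hour item waits for the next sprint planning meeting.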

In general this setup works well. The drawback is that the team is fully committed and does not expect to work on other things before the next sprint planning meeting. This proves problematic when a series of activities that depend on each other needs to be performed, which is often the case when you are building environments. We are trying to address that through frequent communication. One proposed solution is to create a cross-functional team responsible for releases that have environment build-out components.

Is IT service management a better way to go for some requests?

There are two types of requests: those that are clear and concise and do not require planning, and those that are not well understood and require some or a significant amount of planning. To delight our customers, we have entertained the idea of creating a service catalog containing all the well-known activities that our customers request; these would be handled just like an incident, under an SLA-based model. The idea is to unblock customers quickly whenever they are held up by the need for some quick and simple action. Some examples are access to a host, a new virtual machine, and a new DNS entry. This would require building a service catalog with well-defined service items and SLAs.
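The split between catalog requests and planned work can be sketched as a simple routing function. The catalog items are the examples named above; the SLA hours, the queue labels, and the function name are made-up placeholders for illustration only.

```python
from typing import Dict

# Catalog items come from the examples in the report. The SLA hours
# are placeholder values, not figures from the report.
SERVICE_CATALOG: Dict[str, int] = {
    "access to host": 4,
    "new virtual machine": 24,
    "new dns": 8,
}


def route_request(request_type: str) -> str:
    """Send well-known requests to the SLA queue, the rest to planning."""
    sla_hours = SERVICE_CATALOG.get(request_type.strip().lower())
    if sla_hours is not None:
        return f"service desk (SLA: {sla_hours}h)"
    return "sprint backlog (needs planning)"


print(route_request("New DNS"))
print(route_request("New data center build-out"))
```

Anything not in the catalog falls through to the sprint backlog, which keeps the one-week commitment intact while simple requests are still unblocked quickly.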

Ultimately, what we have discovered over the last couple of years is that close collaboration between operations and engineering results in a better customer experience. The DONE criterion for every newly added piece of functionality should be defined as 100% of customers using the functionality, which extends the value stream to include the operational work that needs to be in place to release new products. Automation is key to maintaining the velocity of the teams, improving performance, and reducing the delay that originates from hand-offs and human interactions where they are not absolutely needed.
