Motivation is a slippery thing. My favourite example is described by the writer Dan Pink. He tells the true story of a Nursery who, like many Nurseries, had a problem with parents turning up late to collect their children. This is a big problem for such organisations. So what do you do in this situation? Well the obvious answer, which this Nursery tried, was to introduce a series of fines. If you arrived later than you should, you incurred a fine. If you arrived very late you incurred a bigger fine.
It is obvious that that is a deterrent, right? Well no. In fact the late collection problem got dramatically worse. What had happened is that previously the parents understood that there was a social compact. You collected your children on-time because it was bad manners, if nothing else, not to. Now the Nursery had put a price on it. The fine was the fee for looking after your children for longer. The Nursery had removed the social incentive for doing the right thing and replaced it with a financial one that described how much it cost to do the wrong thing. Parents decided to pay for the extra child-care!
These things are done with good intentions but the results are counter-productive. As I said, motivation is a slippery thing!
Metrics in particular, one form of motivation, are very difficult to get right. We often hear organisations talk about being “data-driven” and “the importance of KPIs” in driving good behaviours, but do these things really work? It is all too easy to define a metric that drives exactly the wrong behaviour.
How many times have you seen something that looks crazy and asked “why on Earth do we do that?” only to hear the answer “It is because that group of people are measured on xyz”.
“Why do our sales people sell stuff that isn’t ready?”, “Because they are incentivised to sell”.
“Why do our developers create poor quality code?”, “Because they are incentivised to create lots of features”.
“Why do our operations people slow things down?”, “Because they are incentivised on the stability of the system”.
However, there is one measure that, as far as I have seen so far, does not lead to any inappropriate gaming or misdirected incentives – Cycle Time!
I argue that Continuous Delivery is really about one thing. Having an idea, getting that idea into the hands of our users and figuring our what they make of it. So we should optimise our software development processes for that. Whatever it takes.
Cycle Time is a measure of that process. Imagine the simplest change to your production system that you can think of. We want it to be simple so that we can ignore the variable cost of development. Now imagine that change going through all of the normal processes to get it prioritised, scheduled, defined, implemented, tested, verified, documented and deployed into production. Every step that a change to production would normally take. The time that it takes to complete all of those steps, plus the time that the change spends waiting between steps, is your Cycle Time. This is a great proxy for measuring the time from “idea” to “valuable software in the hands of users”.
I believe that if you take an empirical, iterative approach to reducing Cycle-Time then pretty much all of Agile Development, Lean thinking, DevOps and Continuous Delivery practice will fall out as a natural consequence.
I once worked on a demanding, high-performance complex system that processes billions of dollars of other people’s money on a daily basis. This was a complex enterprise system, it included high-performance services, account management, public APIs, Web UIs, administration tools, multiple third-party integrations in a variety of different technologies, data-warehouses the lot. We had a Cycle Time of 57 minutes. In 57 minutes we could evaluate any change to our production system and, if all the tests passed, be in a position to release that change into the hands of users.
Now think about the consequences of being able to do that.
If you have a Cycle Time of 57 minutes, you can’t afford the communications overhead of large teams. You need small compact, cross-functional, efficient teams.
You can’t afford the hand-overs that are implicit in siloed teams. If you divide your development effort up into technical specialisms you will be too slow. You need cross-functional collaborative teams to ensure a continual flow of changes.
You can’t rely on manual regression testing. You need a great story on automated testing. Human beings are too slow, too inefficient, too error prone and too expensive.
You can’t rely on manual configuration and management of your test and production environments. You need to automate the configuration management, automate deployment and you will need a good story on “Infrastructure as code”.
You can’t have a Cycle Time of 57 minutes and have hand-offs between Dev and Ops.
You can’t have a Cycle Time of 57 minutes if your business can’t maintain a constant smooth flow of ideas.
You have to be very good at a lot of aspects of software development to achieve this kind of cycle-time. If you can confidently evaluate your changes to the point where you are happy to release into production in under an hour, without any further work, you are doing VERY well!
Optimising for short Cycle Time drives good behaviours. It is not just that you have to be good to achieve this, striving to improve your cycle-time will help you to improve your development process, culture and technology. It will force you to address impediments and inefficiencies that get in your way. This is a metric that doesn’t seem to have any bad side effects.
Many people are nervous that reducing Cycle Time will reduce quality. My experience, and that of the industry, is that the reverse is true. What happens is that by reducing Cycle Time you reduce batch-size. By reducing batch-size you reduce the risk of each change. Each change becomes simpler and lower risk. 66% of organisations that practice claim to practice Continuous Delivery say that quality goes up, not down1. Personally I am not too sure what the other 34% are doing wrong 😉
If you have a short Cycle-time, you can, and will, release change in small batches. Think about each change. Each change will be small and simple. Easy to reason about. If you release only once every few months, then you will be storing up lots of changes. Let’s imagine that each change has a small amount of risk associated with it. So the total risk for any release is going to be the sum of all of those risks.
Hmmm, not quite! As well as the sum of the risks associated with each change, there is going to be a combinatorial effect. What if my change interacts with your change? So there is an additional risk associated with the collection of changes. The probability of one of these risks being realised will grow exponentially as more changes are combined. The more changes are released together, the higher the risk that two, or more, changes will interact in unexpected ways. So the total risk is going to be something like the sum of all the risks associated with each change plus the risk that two or more changes will interact badly. Now imagine releasing changes one at a time, the second set of risks, the risks for which the probability of them occurring will increase exponentially with the number of changes, disappear all together. So overall many small changes is a much less risky strategy than fewer larger changes.
A few years ago I worked with a team building some complex software in C++. This development team was very good. They had adopted an automated testing approach some years before. They were well ahead of industry norms in that they operated a process based on an overnight build. Each night their automated systems would run, build the software and run their automated tests against it. The build and tests took about 9 hours to complete. Each morning the team would look at the results and there would be a significant number of test failures.
I spoke to one of the developers who had been working this way for the past three years. He told me that in that three year period there had been four occasions when all of the tests had passed.
So, the team did what teams do and adapted. Each morning they would look at the test results and only release the modules for which all of the the tests had passed. This is a reasonable strategy as long as none of the components interact with one another. Mostly they didn’t, but some components did. So now the team is releasing untested combinations of software into production which may or may not work together. As a result this team often saw problems deploying new features into production because of incompatibilities with older versions of components that they depended upon.
I argued that cycle-time was important, a driver for good behaviour and outcomes. I won the argument enough to give it a try.
We worked hard on the build. We invested a lot of time, money and effort on experimenting with different approaches. We parallelised the build, improved incrementalism, we bought some massive severs and triaged the tests into groups, dividing the build into a deployment pipeline. We moved from a 9 hour overnight build to a 12 minute commit stage (running the vast majority of the tests) followed by a slower (1 hour) Acceptance test stage. The “Acceptance Test” designation was fairly arbitrary in this case. If a test was too slow, we moved it to the “Acceptance Test Stage”.
The results were quite dramatic. In the first two week period, following the introduction of this new build, we saw three builds where all of the tests passed – compared to four in the previous three years. In the following two week period there were multiple successful (all tests passing) builds every day. The process now switched, instead of cherry-picking modules with passing tests, we could release all of the software together, or not at all. Each morning we would simply deploy the newest release candidate that had passed all the tests.
Now we could have more confidence that these components would work together. Now we could begin to improve our testing of scenarios that crossed the boundaries between components. Now we could be more thorough!
Reducing cycle-time drives good behaviours. It encourages us to establish concrete, efficient feedback loops that allow us to learn and adapt. The team in my war-story above was not different before and after the change in process. The change in approach, the focus on cycle-time, gave them insight into what was going wrong and an opportunity to learn. They could quickly and efficiently experiment with solutions to any problems that arose. This is a very powerful thing!
Cycle-time drives us in the direction of lower-risk release strategies. It encourages good practice and it moves us in the direction of higher-quality development practices. I encourage you to optimise your development process to reduce cycle-time. I believe that you will find that it improves almost everything that you do.
1 CA Technologies “DevOps Survey” 2015