At Devoxx 2011 I did a talk on Continuous Delivery, in which I describe the process and principles of CD, using our experience at LMAX as an example.
This has now been published here.
We discussed various topics centered around the design of high performance systems in Java, the evolution of the Disruptor, the need to take a more scientific approach to software design and the idea of applying mechanical sympathy to the design of the software that we create.
The interview has now been published online here.
I recently attended the Devoxx conference. One of the speakers was talking on a topic close to my heart, Continuous Delivery. His presentation was essentially a tools demonstration, but one of its significant themes was the use of feature-branching as a means of achieving CD. He said that the use of feature-branching was a debatable point within the sphere of CD and CI. Well, I’d like to join the debate.
In this speaker’s presentation he demonstrated the use of an “integration branch” on which builds were continuously built and tested. First I’d like to say that I am not an opponent of distributed version control systems (DVCS), but there are some ways in which you can use them that compromise continuous integration.
So here is a diagram of what I understood the speaker to be describing, with one proviso: I am not certain at which point the speaker was recommending branching the “integration branch” from “head”.
In this diagram there are four branches of the code: Head, the Integration branch and two feature branches. The speaker made the important point that the whole point of the integration branch is to maintain continuous integration, so although feature branches 1 and 2 are maintained as separate branches, he recommended frequent merges back to the Integration branch. Without this any notion of CI is impossible.
So the Integration branch is a common, consistent representation of all changes. This is great, as long as each of these merges happens with a frequency of more than once per day this precisely matches my mental model of what CI is all about. In addition, providing that all of the subsequent deployment pipeline stages are also run against each change in the integration branch and releases are made from that branch this matches my definition of a Continuous Delivery style deployment pipeline too. The first problem is that if all of these criteria are met, then the head branch is redundant – the integration branch is the real head, so why bother with head at all? Actually I keep the integration branch and call it head!
There is another interpretation of this that depends on when the integration branch is merged to head, and this is what I think the speaker intended. Let’s assume that the idea here is to allow the decision of which features can be merged into the production release, from head, late in the process. In this case the integration branch, still running CI on the basis of fine-grained commits, is evaluating a common shared picture of all changes on all branches. The problem is that if a selection is made at the point at which integration is merged back to head then head is not what was evaluated, so either you would need to re-run every single test against the new ‘truth’ on head or take the risk that your changes will be safe (with no guarantees at all).
If you run the tests and they fail, what now? You have broken the feedback cycle of CI and may be seeing problems that were introduced at any point in the life of the branches and so may be very complex to diagnose or fix. This is the very problem that CI was designed to eliminate.
Through the virtues of CI on the integration branch, at every successful merge into that branch, you will know that features represented by feature branches 1 and 2 work successfully together. What you can’t know for certain is that either of them will work in isolation, because you haven’t tested that case. So if you decide to merge only one of them back to head, you are about to release a previously untested scenario. Depending on your project, and the nature of your specific changes, you may get away with this, but that is just luck. This is a risk that genuine CI and CD can eliminate, so why not do that instead and reduce the need to depend on luck?
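To make that gap concrete, here is a minimal sketch in Java. The class and method names are hypothetical and are not taken from the talk. It lists every selection of feature branches that could be merged to head without ever having been evaluated on its own, on the assumption that the integration branch has only ever tested all of the features together.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: which selections of features were never tested as a unit?
final class FeatureCombinations {

    // Returns every non-empty selection of features except the full set.
    // The full set is the only combination the integration branch has evaluated.
    static List<Set<String>> untestedSelections(List<String> features) {
        List<Set<String>> untested = new ArrayList<>();
        int fullSet = (1 << features.size()) - 1;
        for (int mask = 1; mask < fullSet; mask++) {
            Set<String> selection = new LinkedHashSet<>();
            for (int i = 0; i < features.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    selection.add(features.get(i));
                }
            }
            untested.add(selection);
        }
        return untested;
    }

    public static void main(String[] args) {
        List<String> features = List.of("feature-1", "feature-2");
        for (Set<String> selection : untestedSelections(features)) {
            System.out.println("never tested on its own: " + selection);
        }
    }
}
```

With two feature branches there are two untested selections; with three there are six, and the number roughly doubles with each additional branch.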
Further, as I see it the whole and only point of branching is to isolate changes between branches, this is the polar opposite of the intent of CI, which depends upon evaluating every change, as frequently as practical, against the shared common picture of what ‘current’ means in the system as a whole. So if the feature branches are consistently merging with the integration branch, or any other shared picture of the current state of the system – like head, then it isn’t really a “feature branch” since it isn’t isolated and separate.
Let’s examine an alternative interpretation, that in this case I am certain that the speaker at the conference didn’t intend. The alternative is that the feature branches are real branches. This means that they are kept isolated, so that people working on them can concentrate on those changes and only those changes without worrying about what is going on elsewhere. This picture represents that case – just to be clear, this is a terrible idea if you mean to benefit from CI!
In this case feature branch 1 is not merged with the integration branch, or any other shared picture, until the feature is complete. The problem is that when feature branch 2 is merged it had no view of what was happening on feature branch 1 and so the merge problem it faces could be nothing at all or represent days or even weeks of effort. There is no way to tell. The people working independently on these branches cannot possibly predict the impact of the work elsewhere because they have no view of it. This is entirely unrelated to the quality of merge tools; the merge problems can be entirely functional, nothing to do with the syntactic content of the programming language constructs. No merge tool can predict that the features that I write and features that you write will work nicely together, and if we are working in isolation we won’t discover that they don’t until we come to the point of merge and discover that we have evolved fundamentally different, incompatible, interpretations. This horrible anti-pattern is what CI was invented to fix. Those of us who lived through projects that suffered the all too common periods of merge-hell before we adopted CI never want to go back to it.
So I am left with two conclusions. One, for me the definition of CI is that you must have a single shared picture of the state of the system and every change is evaluated against that single shared picture. The corollary of this is that there is no point having a separate integration branch, rather release from head. My second conclusion is that either these things aren’t feature branches and so CI (and CD) can succeed, or they are feature branches and CI is impossible.
One more thought: feature-branching is a term that is, these days, closely associated with DVCS systems and their use, but I think it is the wrong term. For the reasons that I have outlined above these are not real branches, or they are incompatible with CI (one or the other). The only use I can see for a badly mis-named idea of “feature branching” is that if you maintain a separate branch in your DVCS, but compromise the isolation of that branch to facilitate CI, then you do have an association between all of the commits that represent the set of changes that are associated with a particular feature. Not something that I can see an immense amount of value in to be honest, but I can imagine that it may be interesting occasionally. If that is the real value then I think it would benefit from a different name. This is much less like a branch and more like a change-set or, more accurately in configuration management terms, a collection of change-sets.
The level of interest that we have received has been very pleasing, but there is one point that is important to me that could be lost in the understandable, and important, focus on the detail of how the Disruptor works. That is the effect on the programming model for solving regular problems.
I have worked in the field of distributed computing in one form or another for a very long time. I have written software that used file exchange, data exchange via RDBMS, Windows DDE (anyone else remember that?), COM, COM+, CORBA, EJB, RMI, MOM, SOA, ESB, Java Servlets and many other technologies. Writing distributed systems has always added a level of complexity that simply does not exist in the sort of simple, single computer, code that we all start out writing.
Our use of the Disruptor, at LMAX, is the closest to that simplicity that I have seen. One way of looking at the way that we use this technology is that it completely isolates business logic from technology dependencies. Our business logic does not know how it is stored, how it is viewed, whether or not it is clustered or anything else about the infrastructure that it runs on. There are two concessions that result from the programming model imposed upon us by our Disruptor-based infrastructure, neither of which is onerous or even unusual in regular OO programming. Our business logic needs to expose its operations via an interface and it reports results through an interface.
Typically our business logic looks something like this:
    public interface MyServiceOperations {
        void doSomethingUseful(final String withSomeParameters);
    }

    public class SomeService implements MyServiceOperations {
        private final EventChannel eventChannel;

        public SomeService(EventChannel eventChannel) {
            this.eventChannel = eventChannel;
        }

        public void doSomethingUseful(final String withSomeParameters) {
            // some useful work here ending up with someResults
        }
    }
There are no real constraints on what parameters may be passed, on how many interfaces may be used to expose the logic of the service, on how many event channels are used to publish the results of the service or anything else – that is it, we need to register at least one interface to get requests into the service and at least one to publish results from it. There is also nothing special about these interfaces, they are plain old java interfaces specified by the business logic (nothing to inherit from) and then registered with our infrastructure.
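One practical consequence of depending only on plain interfaces is that the business logic can be exercised with no infrastructure at all. The following is a minimal sketch of that idea, not our actual code. The `onResult` method on `EventChannel`, the `RecordingEventChannel` stand-in and the upper-casing "work" are all assumptions made for illustration, since the real shape of the channel is whatever the business logic chooses to specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical shape for the results interface; its real methods are not shown above.
interface EventChannel {
    void onResult(String someResults);
}

interface MyServiceOperations {
    void doSomethingUseful(final String withSomeParameters);
}

final class SomeService implements MyServiceOperations {
    private final EventChannel eventChannel;

    SomeService(EventChannel eventChannel) {
        this.eventChannel = eventChannel;
    }

    @Override
    public void doSomethingUseful(final String withSomeParameters) {
        // Stand-in for real business logic: derive a result from the input.
        String someResults = withSomeParameters.toUpperCase(Locale.ROOT);
        eventChannel.onResult(someResults);
    }
}

// An in-memory stand-in for the infrastructure: it only records what was published.
final class RecordingEventChannel implements EventChannel {
    final List<String> published = new ArrayList<>();

    @Override
    public void onResult(String someResults) {
        published.add(someResults);
    }
}

final class SomeServiceExample {
    public static void main(String[] args) {
        RecordingEventChannel channel = new RecordingEventChannel();
        MyServiceOperations service = new SomeService(channel);
        service.doSomethingUseful("hello");
        System.out.println(channel.published);
    }
}
```

The service neither knows nor cares whether the channel is a recording stub, as here, or something that publishes to other services.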
Since our business logic runs on a single thread and only communicates with the outside world via these interfaces it can be as clean as we like. Our models are fully stateful, rich object oriented implementations of models that address the problem that we are trying to solve. Sometimes our models are not as good as they could be, but that is not because of any technology imposition it is because our modelling wasn’t good enough!
Of course it is not quite that simple, mechanical sympathy is still important in that we need to separate our services intelligently so that they are not tightly-coupled, but this too is only about good OO design and a focus on a decent separation of concerns at the level of our services as well as at the more detailed level of our fine-grained object models. This is the cleanest, most uncluttered, approach to distributed programming that I have ever seen – by far. It feels like a liberation. I am very proud of our infrastructural technology, but for me the real win is not how clever our infrastructure is, but the degree to which it has freed us to concentrate on the business problems that we are paid to solve.
As part of our work to create an ultra-high performance financial exchange we looked into a lot of different approaches to high performance computing. We came to the conclusion that a lot of the common assumptions in this area were wrong.
We have done a lot of things to make our code fast and efficient, but the single most important thing has been to develop a new approach to managing the coordinated exchange of data between threads. This has made a dramatic difference to the performance of our code. We think it sets a new benchmark for performance, beating comparable implementations that use queues to separate processing nodes, by 3 orders of magnitude in latency and by a factor of 8 in throughput. You can watch a presentation, by a couple of my colleagues, describing one of our uses of this technology here.
I am pleased to announce that we have now released this as an open-source project. There is a technical article describing the approach and providing some evidence for our claims available at the site.
We think that this is the fastest way to write code that needs to coordinate the activity of several threads and all of our experiments so far have backed this up.
Well, there are two reasons. Primarily we wanted to disrupt the common assumptions in this space because we think that they are wrong. But, to be honest, we also couldn’t resist the temptation: there was some talk about Phasers in Java at the time when we named it and, for those of you too young to care, Phasers were the Federation weapon and Disruptors the Klingon equivalent in Star Trek.
I was recently asked to do a presentation on the topic of Continuous Delivery at the London Tester Gathering.
You can see a video of the presentation here.
In this presentation I describe the techniques and some of the tools that we have applied at LMAX in our approach to CD.
Martin Fowler has recently made a post on the topic of the importance of reproducible builds. This is a vital principle for any process of continuous integration. The ability to recreate any given version of your system is essential, but there are several routes to it if you follow a process of Continuous Delivery (CD).
Depending on the nature of your application reproducibility will generally involve significantly more than only source code. So in the achievement of the ability to step-back in time to the precise change-set that constituted a particular release version of your software, the source code, while significant, is just a fragment of what you need to consider.
Martin outlines some of the important benefits of the ability to accurately, even precisely, reproduce any given release. When it comes to CD there is another. The ability to reproduce a build pushes you in the direction of deployment flexibility. By the time a given release candidate arrives in production it will have been deployed many times in other environments and for CD to make sense, these preceding deployments will be as close as possible to the deployment into production.
In order to achieve these benefits we must then be able to recover more than just the build: we must be able to reconstitute the environment in which that version of the code that your development team created ran. If I want to run a version of my application from a few months ago, I will almost certainly have changed the data-schemas that underlie the storage that I am using. The configuration of my application, application server or messaging system may well have changed too.
In that time I have probably upgraded my operating system version, the version of my web server or the version of Java that we are running too. If we genuinely need to recreate the system that we were running a few months ago all of these attributes may be relevant.
Jez and I describe approaches and mechanisms to achieve this in our book. An essential part of having a reproducible build is a single identifier for a release that identifies all of the things that represent the release: the code, the configuration, 3rd party dependency versions, even the underlying operating system.
There are many routes to this, but fundamentally they all depend on all of these pieces of the system being held in some form of versioned storage and all related together by a single key. In Continuous Delivery it makes an enormous amount of sense to use a build number to relate all of these things together.
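As a rough illustration of relating everything through a single key, here is a minimal in-memory sketch. The `ReleaseIndex` class, the component names and the version strings are all hypothetical; a real system would hold this in versioned storage rather than in a map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index: build number -> version of everything that made up that release.
final class ReleaseIndex {
    private final Map<Integer, Map<String, String>> manifests = new HashMap<>();

    // Records the version of one component (code, configuration, schema, OS...) for a build.
    void record(int buildNumber, String component, String version) {
        manifests.computeIfAbsent(buildNumber, k -> new TreeMap<>()).put(component, version);
    }

    // Returns everything recorded against the build; empty if the build is unknown.
    Map<String, String> manifestFor(int buildNumber) {
        Map<String, String> manifest = manifests.get(buildNumber);
        if (manifest == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(manifest);
    }

    public static void main(String[] args) {
        ReleaseIndex index = new ReleaseIndex();
        // Illustrative values only.
        index.record(1234, "source-revision", "r5678");
        index.record(1234, "schema", "57");
        index.record(1234, "java", "1.6.0_26");
        System.out.println(index.manifestFor(1234));
    }
}
```

Given only the build number, every other version that matters can be looked up.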
The important part of this, in the context of reproducible builds, is that talking about the binary vs the source is less the issue than the scope of the reproduction that you need. If you are building an application that runs on an end-user’s system, perhaps within a variety of versions of supporting operating environments, then just recreating the output of your commit build may be enough. However if you are building a large-scale system, composed of many moving parts, then it is likely that the versions of third-party components of your system may be important to its operation. In this instance you must be able to reproduce the whole works if you want to validate a bug and so rebuilding from source is not enough. You may need to be able to rebuild from source, but you will also need to recover the versions of the web-server, Java, database, schema, configuration and so on.
Unless your system is simple enough to be able to store everything in source code control, you will have to have some alternative versioned storage. In our book we describe this as the artifact repository. Depending on the complexity of your system this may be a simple single store or a distributed collection of stores linked together by the relationships between the keys that represent each versioned artefact. Of course the release candidate’s id sits at the root of these relationships so that for any given release candidate we can be definitive about the version of any other dependency.
Whatever the mechanism, if you want genuinely reproducible builds it is vital that the relationships between the important components of your system are stored somewhere, and this somewhere should be along with the source code. So your committed code should include some kind of map for ANY system components that your software depends upon. This map is then used by your automated deployment tools to completely reproduce the state of the operating environment for that particular build, perhaps by retrieving virtual machine images from some versioned storage, or perhaps by running some scripts to rebuild those systems to the appropriate starting state.
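A sketch of how a deployment tool might consume such a map follows. The file format (one `component=version` entry per line, read with `java.util.Properties`), the `EnvironmentMap` class and the component names are all assumptions for illustration; the point is only that the map is committed with the code and checked mechanically against the target environment.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper for a deployment tool; not taken from any real product.
final class EnvironmentMap {

    // Parses a committed dependency map with one component=version entry per line.
    static Properties parse(String committedMap) {
        Properties required = new Properties();
        try {
            required.load(new StringReader(committedMap));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return required;
    }

    // Lists components whose installed version differs from what the build requires.
    static List<String> mismatches(Properties required, Map<String, String> installed) {
        List<String> problems = new ArrayList<>();
        for (String component : new TreeSet<>(required.stringPropertyNames())) {
            String wanted = required.getProperty(component);
            String actual = installed.get(component);
            if (!wanted.equals(actual)) {
                problems.add(component + ": need " + wanted + ", found " + actual);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Properties required = parse("java=1.6.0_26\nschema=57\n");
        Map<String, String> installed = Map.of("java", "1.6.0_26", "schema", "56");
        System.out.println(mismatches(required, installed));
    }
}
```

A real tool would go on to fix each mismatch, for example by fetching the right image or running a migration, rather than just reporting it.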
Because in CD we retain these, usually 3rd party, binary dependencies, and must do so if we want to reproduce a given version of the system, then in most cases we recreate versions from binaries of our code as well as those dependencies because it is quicker and more efficient. On my current project we have never, in more than 3 years, rebuilt a release candidate from source code. However, storing complete, deployable instances of the application can take a lot of storage and while storage is cheap it isn’t free.
So how long is it sensible to retain complete deployable instances of your system? In CD each instance is referred to as a “release candidate”. Each release candidate has a status associated with it indicating that candidate’s progress through the deployment pipeline. The length of time that it makes sense to hold onto any given candidate depends on that status.
Candidates with a status of “committed” are only interesting for a relatively short period. At LMAX we purge committed release candidates that have not been acceptance tested, those that have been skipped-over because a newer candidate was available when the acceptance test stage ran or those that failed acceptance testing. Actually we dump any candidate that fails any stage in the deployment pipeline.
The decision of when to delete candidates that pass later stages is a bit more complex. We keep all release candidates that have made it into production. The combination of rules that I have described so far leaves us with candidates that were good enough to make it into production but weren’t selected (we release at the end of each two week iteration and so some good candidates may be skipped). We hold onto these good, but superseded, candidates for an arbitrary period of a month or two. This provides us with the ability to do things like binary-chop release candidates to see when a bug was introduced or demo an old version of some function for comparison with a new.
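The binary-chop mentioned above is an ordinary binary search over the retained candidates. A minimal sketch, with hypothetical names, might look like the following. It assumes the list of retained build numbers is sorted, that the first build is known to be good and the last known to be bad, and that once the bug appears it stays in every later build. The `hasBug` predicate stands in for deploying a candidate and checking for the bug.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper; hasBug stands in for "deploy this candidate and look for the bug".
final class BinaryChop {

    // Returns the earliest retained build that shows the bug.
    // Assumes: builds sorted ascending, at least two entries, first good, last bad.
    static int firstBadBuild(List<Integer> retainedBuilds, IntPredicate hasBug) {
        int good = 0;
        int bad = retainedBuilds.size() - 1;
        while (bad - good > 1) {
            int middle = good + (bad - good) / 2;
            if (hasBug.test(retainedBuilds.get(middle))) {
                bad = middle;   // bug already present: look earlier
            } else {
                good = middle;  // still fine: look later
            }
        }
        return retainedBuilds.get(bad);
    }

    public static void main(String[] args) {
        // Illustrative build numbers; gaps are candidates that were purged.
        List<Integer> retained = List.of(100, 104, 109, 115, 120, 131);
        int firstBad = firstBadBuild(retained, build -> build >= 112);
        System.out.println("bug first visible in retained build " + firstBad);
    }
}
```

Because purged candidates leave gaps, the search can only narrow the bug down to the interval between two retained candidates.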
We have implemented these policies as a part of our artefact repository so largely it looks after itself.
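The retention rules described above could be sketched roughly as follows. This is not our actual implementation: the status names, the `RetentionPolicy` class and the decision to keep candidates that are still waiting for acceptance testing are assumptions, and the retention period is a parameter because “a month or two” is deliberately arbitrary.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical statuses for a release candidate's progress through the pipeline.
enum CandidateStatus { COMMITTED, SKIPPED, FAILED, PASSED, RELEASED }

final class RetentionPolicy {
    private final Duration keepSuperseded;

    RetentionPolicy(Duration keepSuperseded) {
        this.keepSuperseded = keepSuperseded;
    }

    // Decides whether the artefact repository may delete a candidate.
    boolean shouldPurge(CandidateStatus status, Instant createdAt, Instant now) {
        switch (status) {
            case FAILED:   // failed some stage of the deployment pipeline
            case SKIPPED:  // passed over because a newer candidate was available
                return true;
            case RELEASED: // made it into production: keep
                return false;
            case PASSED:   // good but superseded: keep for a limited period
                return Duration.between(createdAt, now).compareTo(keepSuperseded) > 0;
            default:       // COMMITTED: assumed to be still awaiting acceptance testing
                return false;
        }
    }

    public static void main(String[] args) {
        RetentionPolicy policy = new RetentionPolicy(Duration.ofDays(60));
        Instant now = Instant.parse("2011-12-01T00:00:00Z");
        Instant old = now.minus(Duration.ofDays(90));
        System.out.println(policy.shouldPurge(CandidateStatus.PASSED, old, now));
    }
}
```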
When I was a software consultant I got to see a wide variety of software projects in a wide variety of different organizations. My subjective experience is that most software projects in most organizations get it quite badly wrong most of the time.
I am aware that this is a contentious statement, and I am aware, particularly in the context of this article, of the need to be able to back up my statements with some facts, but such facts are hard to acquire, because, as I hope to show over a series of posts, our industry is riddled with subjective measures, and lacks hard data.
Some of the reason for this lack of quantitative measurement is the difficulty, if not the impossibility, of measurement of software projects. How do we decide if my project is the same as your project? Without the establishment of such a baseline, how can we ever objectively measure the effect of my methodology/technology/team structure/etc compared with yours?
The cost of a genuinely scientific approach to such measurement would be extreme to say the least, and even then the design of experiments that leveled such differences as strength of team, experience and so on would be extremely difficult to achieve.
There have been efforts in the past to gather statistical evidence of the success, or not, of projects. The most long-lasting effort to collect data that I have encountered has been carried out by the “Standish Group”, who have been researching software project failures for many years. Unfortunately they don’t publish their data so there is no way of knowing if their analysis stacks up. However, I recommend you type something like “research software project failure” into your search engine of choice and take a look at the somewhat dismal statistics. A high-level summary is that a significant percentage of software projects fail to deliver what the users wanted on-time and on-budget. This aligns with my own observations and, albeit subjective, impressions.
My contention is that while we talk about “Software Engineering” and “Computer Science” much day to day practice is anything but scientific, or even based on sound engineering principles. One may go to the extent that much of what is practiced in the name of software development is often based on irrational decision-making. We fool ourselves into believing that we are being diligent and rigorous, when in fact we are simply, and somewhat blindly, applying what are effectively superstitious practices that have grown up in our industry over time about what a software project should look like. Worse than that, superstitious practices that clearly don’t work very well, if at all.
In part this is a cultural thing related to the paucity of scientific education, but I also think that we have some micro-cultural influences of our own at work within our industry.
I intend to add a series of posts on this topic, outlining a particular superstition and some more rational approaches to tackling the problem, so let’s start with some low-hanging fruit – planning.
The start of a software project is the time when we know least about it. Everything that we do at this time is going to be based on speculation and guess-work. As an experienced developer having recently worked on a project similar to the one proposed, your guesses may be better than mine, but they remain guesses.
The requirements gathering process has an inevitable speculative element to it. During the process users and analysts will be guessing about the best way for the system to behave in certain circumstances, and they will be guessing at the value that this will bring when the system is live.
The commonest superstition at this stage is that all requirements must be clearly defined before the project can start. This is the worst time to define all of the requirements because it is the furthest point away from the time of use.
By the time the project is finished the business climate may have changed but even more likely the understanding of the problem will certainly have changed as more is learnt by the analysis team.
The only way to know if a requirement is correct is to implement it and to get users interacting with the behaviour in question. This suggests that rather than get all requirements identified before a project starts, the best way to achieve a measurable outcome is to convert any given requirement as quickly as possible into a software solution that can be tried by users and accepted or rejected quickly. This feedback cycle is a fundamental of a more scientific approach.
This is another inherently speculative activity. In order for planning activities to have any bearing on reality they must be closely and interactively allied to actual outcomes.
Without such an active feedback loop to keep the process on track the divergence of the plan from reality is inevitable. This outcome is so commonly understood, and so widely accepted, that it is sometimes hard to understand how the all too common superstition in project planning has arisen. That is the superstition that the only route to success is to define a fully detailed plan of the project at inception, which must then be stuck to religiously.
At the outset of a project we know very little, we don’t really know what requirements will earn the highest business value, we don’t know how easy the technologies will be to use in the context of this project, we don’t know how well the developers understand the problem and we don’t know how much time people will be spending doing other things. Most of these things we can’t know with any degree of certainty. In fact most of these things we can’t know with anything but the woolliest of guesswork.
Therefore any plan we make must be flexible enough to cope with the extremes of these ranges of probability; if the technology choice turns out to be a big problem, the plan must be able to show the impact quickly and clearly; it can never prove that everything will be all-right, it can only show us as quickly as possible that we are in trouble!
On the other hand if our technology choice proves to have been inspired, the plan must be a useful enough tool to allow us to capitalize on the fact, and maybe bring dates forward, or perhaps increase the scope of our planned delivery; unlikely as this may sound it does happen.
Note: If our industry had a more normal distribution of project success such statements wouldn’t be surprising or unusual, because half our projects would deliver ahead of time or under budget or with more functionality.
The best plan is a complete plan, the more detail the better, the more accurate we make our forecasts (a fancy word for guesses) the more realistic our plan will be.
This is one of the biggest problems in software development as well as one of the most pervasive superstitions. I have wasted months of my life planning, months in which I could have been producing business value in the form of software and achieving a more realistic, lighter-weight plan at the same time.
When a plan is very detailed, prepared ahead of time, and diverges from reality it can cope with neither greater success, nor, the sadly more frequent, failure. In my experience one of three things happens.
1. The project manager shuts themselves away and at enormous effort and enormous pain attempts to realign the plan with reality. Despite the hard work this has always been a complete failure in my experience, mostly because for the days and weeks that they are working on the plan the project keeps moving.
2. The project becomes schizophrenic; it has two independent realities. One is the “reality” of the plan: effort is expended trying to force-fit what really happened into a shape that lets us pretend it was the same as what we thought would happen. Then there is the reality of the software development, which is essentially carrying on un-tracked and un-reported in any meaningful sense. This tends not to last very long because usually it becomes harder and harder to fool ourselves that the plan bears any relationship to the project. I have observed projects where the illusion was maintained, and it came as a very big shock to senior management when they eventually found that the project that they had thought was on-track turned out to be far from ready for production.
3. The PM realises the futility of plan 1 or 2 and gives up the tracking. Status reporting instead becomes more tactical, reporting what was done, but while this gives a sense of movement it provides no real sense of progress towards the goal of delivery.
The value in a plan is as a guide to achieving something concrete, in our case some working software. All plans will and must evolve, the more complex the plan the less able it is to evolve. Minimizing detail is not just desirable it is an essential attribute of a successful plan. A successful plan is one that is capable of maintenance, one which bears more than a passing relationship to reality.
Plans must be large-scale with minimal detail when describing things that are more than a month ahead, and detailed, concrete and most importantly, measurable, when describing the immediate future.
Establishing a high-level scope:
I will post some more thoughts on common software development superstitions in future.
I was asked a good question by a colleague a couple of days ago.
On p419 of our book Jez and I show and describe a “Configuration and Release Management Maturity Model”. My colleague asked: “What are sensible acceptance criteria associated with this model?”
I am not sure that I can say anything too definitive, but I think that looking for acceptance criteria is a good way to think about it. So here are a few thoughts about some measurable attributes of a project that may help to steer it in the right direction:
The bullet points that follow the evaluation matrix, in the book, give some steer towards more measurable things.
In general Jez and I recommend that you focus on the most painful problems first. So, depending on where the biggest hurdles are, you can use cycle-time, defect-count, velocity and down-time as potential sources of measurement. These measures should enable you to set sensible incremental targets for improvement. Over time you can ratchet up your expectations and so continue to move the organisation on.
For the highest levels in the matrix I would expect values along the lines of:
Defect-count: Not sure that I can set too many expectations here, so much is too project specific. In general though I think that trends rather than numbers are important in the early days of establishing process.
A lot of people talk about zero defect processes and I am a big believer in that for many projects. However for a large team, and/or project that covers a large surface-area of features, I think that this is often impractical. This is not because you can’t achieve the quality, you can, but it is often complex to differentiate between a bug and a new feature. This means that it can be valid to maintain a backlog of “bugs” that are less important than your backlog of features.
A significant area of interest, and so useful data to collect, is where defects are found. So keeping metrics of defect count by deployment-pipeline stage, and aiming to find the VAST majority before you get anywhere close to production, is key.
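As a small worked example of that metric, the sketch below computes the share of defects caught before production from a count per pipeline stage. The stage names, the `DefectStats` class and the sample numbers are hypothetical; what matters is watching how the share moves over time rather than any particular target value.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical pipeline stages; use whatever stages your own pipeline has.
enum Stage { COMMIT, ACCEPTANCE, PERFORMANCE, MANUAL, PRODUCTION }

final class DefectStats {

    // Fraction of all recorded defects that were found before production (0.0 to 1.0).
    static double shareFoundBeforeProduction(Map<Stage, Integer> defectsByStage) {
        int total = 0;
        int beforeProduction = 0;
        for (Map.Entry<Stage, Integer> entry : defectsByStage.entrySet()) {
            total += entry.getValue();
            if (entry.getKey() != Stage.PRODUCTION) {
                beforeProduction += entry.getValue();
            }
        }
        if (total == 0) {
            return 1.0; // no defects recorded, so none escaped to production
        }
        return (double) beforeProduction / total;
    }

    public static void main(String[] args) {
        Map<Stage, Integer> counts = new EnumMap<>(Stage.class);
        // Illustrative numbers only.
        counts.put(Stage.COMMIT, 60);
        counts.put(Stage.ACCEPTANCE, 30);
        counts.put(Stage.MANUAL, 8);
        counts.put(Stage.PRODUCTION, 2);
        System.out.println(shareFoundBeforeProduction(counts));
    }
}
```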
Regression is an important metric here too I think. If your automated testing is good enough you should expect no regression problems.
Velocity: The value of a Continuous Delivery process is to get new features delivered and in use. Tracking the actual rate of delivery of complete working code is important, and again it is about trends; there is no absolute measure that makes sense. However, velocity is the only real measure of continuous improvement, a cornerstone of any agile process.
For most projects I would expect to see a relatively short-term initial dip in velocity for existing teams adopting CD as a process, as the organisation adapts and people learn how to cope with the new processes and techniques. Then I would expect to see velocity begin to build. In part this is a measure of the team’s maturity with respect to agile process. High performing teams tend to see a steady increase in velocity over a long period of time, though eventually it will begin to plateau a bit.
Actually it is not a consistent growth rate. Even very good teams tend to achieve improvements in stretches. Subjectively (I don’t have any data), teams seem to make significant progress, then go through a phase of stability, drop off a little bit as they get a bit lax about the process, then make another significant move forward as they apply more effort to improve.
Teams newer to the agile approach seem to take a longer time to adopt the “just fix it” mentality that is essential to continuous improvement and so their curve is more like an ‘S’, with a flat-ish start and a strong growth phase, often followed by longer flat periods and more significant “drop-offs”.
Down-time: This is a fairly blunt measure of quality, but is also, to some extent, a measure of deployment efficiency. You can use it to encourage and direct effort towards shorter, more efficient deployments. So you need to include the time that the system is unavailable during release in your downtime calculations.
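To make the point about counting release time concrete, here is a small Python sketch of an availability calculation that treats release windows as downtime. The function name and all of the figures are assumptions chosen for illustration, not numbers from a real system.

```python
from datetime import timedelta


def availability(period, outages, release_windows):
    """Fraction of `period` during which the system was available.

    `outages` and `release_windows` are lists of timedelta. Release
    windows are deliberately counted as downtime.
    """
    downtime = sum(outages, timedelta()) + sum(release_windows, timedelta())
    if downtime > period:
        raise ValueError("downtime exceeds the measurement period")
    return 1 - downtime / period


# Illustrative numbers only: a 14-day period with one 12-minute
# unplanned outage and one 5-minute release.
period = timedelta(days=14)
outages = [timedelta(minutes=12)]
releases = [timedelta(minutes=5)]

print(f"ignoring releases:  {availability(period, outages, []):.5f}")
print(f"including releases: {availability(period, outages, releases):.5f}")
```

The gap between the two figures is the part of your downtime that shorter releases can win back.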
For world-class performance, where it makes sense, I would expect essentially no downtime caused by the Continuous Delivery process. It is perfectly possible to release without stopping even complex applications, but of course it is harder.
Your business model may mean that you don’t have to go that far, that some downtime to release is perfectly acceptable. Nevertheless your goal should be to minimize the time it takes. The release, start-up and deployment tests should be quick, a few minutes at most; let’s say 5 minutes as a goal. However if you need to perform significant data-migration as part of your release process you may incur some additional, unavoidable, time penalty for that.
My ideal is that a release should take a few minutes in total. Enough time so that I am happy to log on, select a release candidate, push the “Release Now” button and sit and watch it succeed. Even where you have significant data-migration to perform it is worth doing some work to make this efficient, otherwise when you eventually get to the no-downtime release ideal you will have a more complex problem to move state between the old version and the new. The smaller the window in which change can happen in the old version while you are trying to release the new the better.
Clearly some of this is subjective but using relative improvements in the measurements of your project can take you a long way forward in being able to steer towards the more global improvements that you want to make.
I think that agile development works because it is the application of the scientific method to software development.
A fundamental aspect of that is the importance of forming a hypothesis before you start so that you can understand the results that you observe. By this I don’t mean some grand-unified theory kind of a hypothesis at the beginning of a project but the small, every-day, fine grained, use of predictions of outcomes before you commence some software development activity.
I have adopted this approach over the years and I think it has probably become a fairly ingrained part of my approach to things. It is probably most obvious when we are debugging. When working with a pair, or small group of people, discussing a bug I find myself regularly pausing and restating what we know to be facts, and discussing what our current theory is. The outcome of this is invariably that we can more clearly identify the next experiment that will move our understanding on a step or two.
But it is more than that. Before running a failing test, which I always do before writing the code that will make it pass, I almost always state out loud the nature of the failure that I expect to see. That is, I state my hypothesis (the nature of the failure), then carry out the experiment (run the test), then observe the results (look at the results of the test) and see if they match my hypothesis. Often they don’t, which means I have got the test wrong, so I correct it until I see the failures that I expect. This is the main reason that we run the test before we write the code.
Ok, ok, I know that both of these examples are a bit simplistic, but this is an enormously powerful approach to problem solving. In fact it is undisputedly the most proven model of problem solving in human history.
My bet is that as I describe it in my simple examples here everyone that reads this is nodding sagely and thinking “of course, that is how we always work”. But is it really? My observation is that this is actually a fairly uncommon approach. Certainly agile development in general and TDD in particular applies a gentle pressure in this direction, but it has been my experience that, all too often, we wander around problems in loose unstructured ways. We often randomly prod at things to see what happens. We frequently jump to conclusions and hare off implementing solutions that we don’t know that we need.
We had a good example of this at work today. We have recently made some improvements in some of our high-performance messaging code. We put this into the system at the start of the iteration to give us time to see if we had introduced any errors. Our continuous delivery system, which includes some sophisticated functional and performance tests as part of our deployment pipeline, found no problems for the whole iteration, until today.
Today is our end of iteration (I’ll explain why we finish on Wednesdays another time). So we had taken our release candidate for the iteration and it was undergoing final checks before release. This morning one of our colleagues, Darren, told us at stand-up that he had seen a weird messaging failure on his development workstation when running our suite of API acceptance tests. He had apparently seen a thread that was blocked in our underlying 3rd-party pub-sub messaging code. He tried to reproduce it, and could, but only on that particular pairing-station. Hmmmm.
Later this afternoon, we had started work on the new iteration. Almost immediately our build grid showed a dramatic failure with lots of acceptance tests failing. We started exploring what was happening and noticed that one of our services was showing a very high CPU load – unusual, our software is generally pretty efficient. On further investigation we noticed that our new messaging code was apparently stuck. Damn! This must be what Darren saw – clearly we have a problem with our new messaging code!
We reacted immediately. We went and told the business that the release candidate that we had taken may not be ready for release. We asked the QA folks who were doing their final sanity checks before the release to wait until we had finished our investigation before making any decisions. We started to think that we may have to take a branch, something that we generally try to avoid, and back out our messaging changes.
We did all of this before we stopped and thought about it properly. “Hang on, this doesn’t make any sense, we have been running this code for more than a week and we have now seen this failure three times in a couple of hours.”
So we stopped and talked through what we knew, collected our facts; we had upgraded the messaging at the start of the iteration; we had a thread-dump that showed the messaging stalled; so had Darren, but his dump looked stalled in a different place; we had been running all of these tests in our deployment pipeline repeatedly and successfully for more than a week, with the messaging changes. At this point we were stuck. Our hypothesis, failing messaging, didn’t fit the facts. We needed more facts so that we could build a new hypothesis. We started where we usually start, but had omitted to earlier because the conclusion looked so obvious. We looked at the log files. Of course, you have guessed, we found an exception that clearly pointed the finger at some brand new code.
To cut a long story short, the messaging problem was a symptom, not a cause. We were actually looking at a thread dump of a thread that was in a waiting state and working as it should. What had really happened was that we had found a threading bug in some new code. It was obvious, simple to fix, and we would have found it in 5 minutes with no fuss if we hadn’t jumped to the conclusion that it was a messaging problem. In fact we did fix it in 5 minutes once we stopped to think and built our hypothesis based on the facts that we had. It was then that we realized that the conclusions that we had jumped to really didn’t fit the facts. It was this and this alone that prompted us to go and gather more facts, enough to solve the problem that we had rather than the problem that we imagined that we had.
We have a sophisticated automated test system and yet we ignored the obvious. It was obvious that we must have committed something that broke the build. Instead we joined together various facts and jumped to the wrong conclusion because there was a sequence of events that led us down the path. We built a theory on sand, not validating as we went, but building new guesses on top of old. It created an initially plausible, seemingly “obvious” cause – except it was completely wrong.
Science works! Make a hypothesis. Figure out how to prove, or disprove it. Carry out the experiment. Observe the results and see if they match your hypothesis. Repeat!