The first casualty of a software emergency…

A colleague and I were talking today about what happens when things go badly. I said that, in an ideal world, we should be working to make our deployment pipelines so efficient that even in an emergency we can make changes and have them fully validated before we release them. After all, the last time that you want to be increasing risk is when things are already bad! Jason made me laugh with “So rather like ‘The first casualty of war is truth’, ‘The first casualty of Software-Emergency is validation’”.
This is spot on. What commonly happens is that when things go badly teams throw their protections out of the window and start changing things in production without testing them first – scary!
The ideal answer to this is to concentrate, before the emergency, on shortening your cycle-time. Your cycle-time is the time it takes you to release the smallest possible change to your code following your normal release validations and procedures. Your cycle time should be short enough so that when an emergency hits, you can still release your fixes having validated them fully.
Short cycle-times are a general good. They allow you to work in small batches, ensuring that each change is small enough that when it does go wrong it is simple to see where the problem lies and how to fix it. They provide rapid feedback for your ideas, allowing you to quickly assess their value. They stop you from creating snowflake-servers and from deploying untested changes into production. They allow you to get valuable software into the hands of your users quickly and efficiently, and they allow you to have a single, validated route to production for all your changes.
Cycle-time is a wonderful tool to drive good behaviour.
People often ask me how to start with something as complex as the introduction of Continuous Delivery to an organisation. The answer is pretty simple: look to your cycle-time.
I think that Continuous Delivery is built on several important foundations: Version Control, Continuous Integration, Automation, Feedback, Collaboration and Cycle-time.
If you don’t use version control, shame on you, download a free VCS now and start immediately. If you are not yet doing Continuous Integration, come on, catch up, this has been a well-known, well-publicised, effective practice for more than a decade now. Cycle-time is different to these others though. The others are mechanisms; cycle-time is a metric by which you can measure your performance. As such it is a great tool to help you identify where to start.
A great tool for understanding your cycle-time is value-stream analysis (aka value-stream mapping).
Draw a diagram enumerating the steps, laid out on a timeline, that changes destined for production progress through. From idea to working, validated software in the hands of a user. You can do this at various levels of resolution, depending on how much effort you want to spend identifying and measuring the steps. You should try and capture any pauses between steps as well as the steps themselves.
Here are some examples of value-stream maps for some real projects that I have seen.
However good, or bad, you are at software delivery this should highlight places that you could improve.
This is an excellent tool and I recommend that you try it for your development process, it may surprise you.
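As a rough illustration of what a value-stream map captures, here is a minimal sketch in Java. The `ValueStream` and `Step` names, and all of the durations, are invented for this example; the point is only that cycle-time is the sum of the work and the pauses between steps, and that the pauses are often where the time goes.

```java
import java.time.Duration;
import java.util.List;

class ValueStream {
    /** One step on the timeline: time spent working, plus the pause before the next step. */
    record Step(String name, Duration work, Duration waitAfter) {}

    /** Cycle-time is everything on the timeline: the work and the pauses between steps. */
    static Duration cycleTime(List<Step> steps) {
        Duration total = Duration.ZERO;
        for (Step step : steps) {
            total = total.plus(step.work()).plus(step.waitAfter());
        }
        return total;
    }

    /** The part of the cycle-time spent waiting rather than working. */
    static Duration waitTime(List<Step> steps) {
        Duration total = Duration.ZERO;
        for (Step step : steps) {
            total = total.plus(step.waitAfter());
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not measurements from a real project.
        List<Step> steps = List.of(
            new Step("commit build", Duration.ofMinutes(5), Duration.ofMinutes(10)),
            new Step("acceptance tests", Duration.ofMinutes(40), Duration.ofHours(4)),
            new Step("manual sign-off", Duration.ofHours(1), Duration.ofHours(24)),
            new Step("release", Duration.ofMinutes(15), Duration.ZERO));

        System.out.println("Cycle time:       " + cycleTime(steps));
        System.out.println("Of which waiting: " + waitTime(steps));
    }
}
```

Even at this coarse resolution, comparing the two totals tends to show where to look first.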
I like tools and processes that encourage ‘good behaviour’. Processes that make doing the ‘right-thing’ easy. Cycle-time is one of those tools. If you work only to optimise cycle-time, it will have a beneficial effect on your whole process, even if that is all that you do. It will make you eliminate waste. It will make you collaborate more, because you will need to minimise hand-overs. It will make you reduce the amount of inventory/work in process and so make the process more efficient. It will make you automate more, because human beings are too slow to sustain fast cycle-times.
It is not a silver bullet. Software development is complex, but cycle-time is a really great metric, and working to optimise it is a great strategy towards better process.
Posted in Continuous Delivery

The basics of TDD

The objectives of Test Driven Development and unit testing are generally misunderstood. The problem is the word ‘test’, it is much less about testing and much more about specification of requirements, showing your working – as in maths, and the impact it has on design. TDD is much more important than only testing. Robert C Martin has a good analogy, he likens TDD to double entry bookkeeping:

Software is a remarkably sensitive discipline. If you reach into a base of code and you change one bit you can crash the software. Go into the memory and twiddle one bit at random and very likely you will elicit some form of crash. Very, very few systems are that sensitive. You could go out to one of these bridges over here, start taking bolts out and they probably wouldn’t fall. I could pull out a gun and start shooting randomly and I probably wouldn’t kill too many people. I might wound a few but — you know — you get a bullet in the leg or a lung and you’d probably survive. People are resilient — they can survive the loss of a leg and so forth. Bridges are resilient — they survive the loss of components. But software isn’t resilient at all: one bit changes and — BANG! — it crashes. Very few disciplines are that sensitive.
But there is one other [discipline] that is, and that’s Accounting. The right mistake at exactly the right time on the right spreadsheet — that one-digit error can crash the company and send the offenders off to jail. How do accountants deal with that sensitivity? Well, they have disciplines. And one of the primary disciplines is dual-entry bookkeeping. Everything is said twice. Every transaction is entered two times — once on the credit side and once on the debit side. Those two transactions follow separate mathematical pathways until they end up at this wonderful subtraction on the balance sheet that has to yield to zero.
This is what test-driven development is: dual-entry bookkeeping. Everything is said twice — once on the test side and once on the production code side and everything runs in an execution that yields either a green bar or a red bar just like the zero on the balance sheet. It seems like that’s a good practice for us: to [acknowledge and] manage these sensitivities of our discipline…
The sensitivity of software is a good point to reflect upon, there is little in human experience that is so complex and yet so fragile. Without a strong focus on showing your working, no matter how good you are as a developer, if you omit the tests, your software will be worse than it could have been.
The double-entry bookkeeping analogy only holds up though if you do test first development. If you write your test after the code it is generally not sufficiently independent to provide a valid “separate path” check.
Test first is the idea that you write the test before you write the code that is being tested. This seems like a bizarre idea to many people at first, but actually makes perfect sense.
If you write the test first and run it, you get to see it fail, so you are testing the test.
If you write the test first then you are expressing what you want of your software from the outside in. It leads you to design for behaviour and so you have less of a tendency to get lost in irrelevant technicalities.
This is a much more effective design approach than testing after you have written the code, and as a by-product it leads inevitably to software that is easy to test – you have to be pretty dumb to write a test before you have written the code for an idea that can’t be tested!
Finally there is a virtuous circle here. Software is easy to test when it is modular. It is easy to test when dependencies are externalised and it is easy to test when there is a clear separation of concerns.
Now the software industry is famous for change, but if there is any idea that has remained constant for, literally, decades it is that quality software is modular, has well defined dependencies and clear separation of concerns – sound familiar? This has been how computer science has defined quality since before I started, and that was a very long time ago!
Using TDD as a practice makes you produce higher quality software, not because it is well tested (though that is a nice by-product) but because it improves the quality of your designs.
Posted in Agile Development, TDD

Testing Times

I came across a test breakage recently. I had committed some code and a test had failed. The test was unrelated to my changes, but I took a look anyway because maybe I was wrong. As it turned out the test was unrelated to my changes other than the fact that my changes had triggered a build. So the test was intermittently failing.

In my experience there are three common causes for test intermittency.

  • Interaction between test-cases – static methods, shared memory, stuff left on the file-system and so on. Basically all down to state shared between tests.
  • Concurrency – It is extremely hard to write good tests that deal, within the scope of the test, with concurrency. It’s too easy for small changes in the timing of the execution of different threads to give different results.
  • Time – time based tests are a pain, they are often flaky and slow because of the need to wait for results.

The last one of these was, I think, the cause of the problem with this test. Fortunately, of the three, time based testing is the simplest to fix.

There is one simple rule for dealing with time in your application: Never, Never, Never retrieve time directly from the system, always get it via a level of indirection, and then in your tests, fake the source of time so that you have complete control.


    interface Clock {
        long getNanos();
    }

    class MyClassThatNeedsTime {
        private final Clock clock;

        public MyClassThatNeedsTime(Clock clock) {
            this.clock = clock;
        }

        public void doSomethingNeedingTime() {
            // Instead of this...
            // long time = System.nanoTime();
            // Do this...
            long time = clock.getNanos();
            // ...and use time from here on
        }
    }

Now your test has complete control of time. Simply supply a fake implementation of Clock and change what it returns in the scope of your testing to do whatever you like.
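
To make that concrete, here is one possible shape for such a fake, as a sketch rather than a prescription. `FakeClock`, `Session` and the thirty-minute timeout are all hypothetical names and values invented for the example; only the `Clock` interface comes from the code above, and it is repeated here so that the sketch compiles on its own.

```java
import java.util.concurrent.TimeUnit;

class SessionTest {
    public static void main(String[] args) {
        FakeClock clock = new FakeClock(0L);
        long thirtyMinutes = TimeUnit.MINUTES.toNanos(30);
        Session session = new Session(clock, thirtyMinutes);

        if (session.isExpired()) {
            throw new AssertionError("a fresh session should not be expired");
        }

        // "Wait" half an hour without sleeping a single thread.
        clock.advanceNanos(thirtyMinutes);
        if (!session.isExpired()) {
            throw new AssertionError("the session should expire once the timeout has passed");
        }
        System.out.println("ok");
    }
}

/** Same shape as the Clock interface above. */
interface Clock {
    long getNanos();
}

/** Hypothetical test double: time only moves when the test says so. */
class FakeClock implements Clock {
    private long nanos;

    FakeClock(long startNanos) {
        this.nanos = startNanos;
    }

    @Override
    public long getNanos() {
        return nanos;
    }

    void advanceNanos(long delta) {
        nanos += delta;
    }
}

/** Hypothetical class under test: a session that expires after a fixed timeout. */
class Session {
    private final Clock clock;
    private final long timeoutNanos;
    private final long startedAt;

    Session(Clock clock, long timeoutNanos) {
        this.clock = clock;
        this.timeoutNanos = timeoutNanos;
        this.startedAt = clock.getNanos();
    }

    boolean isExpired() {
        return clock.getNanos() - startedAt >= timeoutNanos;
    }
}
```

The test never sleeps, so it is both fast and deterministic.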

This works for all kinds of testing, not just simple unit testing. I generally use this approach for all uses of time. At LMAX we had a suite of whole-system integration tests called “time-travel tests” where we manipulated time, skipping over minutes, hours even months to test daylight saving time changes.

If for some reason you don’t want to inject the Clock into your code, you can use a singleton pattern and replace the singleton instance within the scope of tests (always remembering to replace the TestClock with a RealClock at the end of the test). Not as nice as dependency injection perhaps but it can be easier to fit in when you are adding tests to some legacy code.


    class SystemClock implements Clock {
        @Override
        public long getNanos() {
            return System.nanoTime();
        }
    }

    class ClockFactory {
        private static Clock clock = new SystemClock();

        public static Clock getClock() {
            return clock;
        }

        public static void setClock(Clock newClock) {
            clock = newClock;
        }
    }

    // In the class that needs time:
    public MyClassThatNeedsTime() {
        this.clock = ClockFactory.getClock();
    }

This approach not only gives you better control in your tests, but it also speeds them up – no more sleeping threads, which can add up in large test suites. As well as all that it enables classes of testing that were simply impossible before (e.g. long duration waits).
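
For the singleton variant, the important detail is putting the real clock back afterwards. A minimal sketch of how a test might do that follows; the types are repeated, with static accessors on `ClockFactory`, so that it compiles on its own, and `SingletonClockTest` and the fixed value of 42 are made up for the example.

```java
class SingletonClockTest {
    public static void main(String[] args) {
        Clock original = ClockFactory.getClock();
        // Clock has a single method, so a lambda is enough for a fixed-time fake.
        ClockFactory.setClock(() -> 42L);
        try {
            long seen = ClockFactory.getClock().getNanos();
            if (seen != 42L) {
                throw new AssertionError("expected the fake time, got " + seen);
            }
        } finally {
            // Always restore the original clock, even if the test fails.
            ClockFactory.setClock(original);
        }
        System.out.println("ok");
    }
}

interface Clock {
    long getNanos();
}

class SystemClock implements Clock {
    @Override
    public long getNanos() {
        return System.nanoTime();
    }
}

class ClockFactory {
    private static Clock clock = new SystemClock();

    static Clock getClock() {
        return clock;
    }

    static void setClock(Clock newClock) {
        clock = newClock;
    }
}
```

The try/finally is what stops one test's fake time leaking into the next test, which would simply swap one source of intermittency for another.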

Try it, you’ll like it!

Posted in Effective Practices, Software Design, TDD

Devoxx presentation on Continuous Delivery

At Devoxx 2011 I did a talk on Continuous Delivery, in which I describe the process and principles of CD, using our experience at LMAX as an example.

This has now been published here.

Posted in Uncategorized

Interview on High performance Java

I recently spoke, with my ex-colleague Martin Thompson, at the GOTO conference in Aarhus. While we were there we were interviewed by Michael Hunger.

We discussed various topics centered around the design of high performance systems in Java, the evolution of the Disruptor, the need to take a more scientific approach to software design and the idea of applying mechanical sympathy to the design of the software that we create.

The interview has now been published online here.

Posted in Uncategorized

Don’t Feature Branch

I recently attended the Devoxx conference. One of the speakers was talking on a topic close to my heart, Continuous Delivery. His presentation was essentially a tools demonstration, but one of the significant themes of his presentation was the use of feature-branching as a means of achieving CD. He said that the use of feature-branching was a debatable point within the sphere of CD and CI. Well, I’d like to join the debate.

In this speaker’s presentation he demonstrated the use of an “integration branch” on which builds were continuously built and tested. First I’d like to say that I am not an opponent of distributed version control systems (DVCS), but there are some ways in which you can use them that compromise continuous integration.

So here is a diagram of what I understood the speaker to be describing, with one proviso, I am not certain at which point the speaker was recommending branching the “integration branch” from “head”.

In this diagram there are four branches of the code: Head, the Integration branch and two feature branches. The speaker made the important point that the whole point of the integration branch is to maintain continuous integration, so although feature branches 1 and 2 are maintained as separate branches, he recommended frequent merges back to the Integration branch. Without this any notion of CI is impossible.

So the Integration branch is a common, consistent representation of all changes. This is great, as long as each of these merges happens with a frequency of more than once per day this precisely matches my mental model of what CI is all about. In addition, providing that all of the subsequent deployment pipeline stages are also run against each change in the integration branch and releases are made from that branch this matches my definition of a Continuous Delivery style deployment pipeline too. The first problem is that if all of these criteria are met, then the head branch is redundant – the integration branch is the real head, so why bother with head at all? Actually I keep the integration branch and call it head!

There is another interpretation of this that depends on when the integration branch is merged to head, and this is what I think the speaker intended. Let’s assume that the idea here is to allow the decision of which features can be merged into the production release, from head, late in the process. In this case the integration branch, still running CI on the basis of fine-grained commits, is evaluating a common shared picture of all changes on all branches. The problem is that if a selection is made at the point at which integration is merged back to head then head is not what was evaluated, so either you would need to re-run every single test against the new ‘truth’ on head or take the risk that your changes will be safe (with no guarantees at all).

If you run the tests and they fail, what now? You have broken the feedback cycle of CI and may be seeing problems that were introduced at any point in the life of the branches and so may be very complex to diagnose or fix. This is the very problem that CI was designed to eliminate.

Through the virtues of CI on the integration branch, at every successful merge into that branch, you will know that features represented by feature branches 1 and 2 work successfully together. What you can’t know for certain is that either of them will work in isolation – you haven’t tested that case. So if you decide to merge only one of them back to head, you are about to release a previously untested scenario. Depending on your project, and the nature of your specific changes, you may get away with this, but that is just luck. This is a risk that genuine CI and CD can eliminate, so why not do that instead and reduce the need to depend on luck?
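
The argument can be reduced to a very small sketch. Assume, purely for illustration, that each CI run records the exact set of features it evaluated; the class and feature names below are invented. A release is only covered by that evidence if the set being released is one of the sets that was actually run.

```java
import java.util.Set;

class EvaluatedCombinations {
    /** True only if this exact combination of features has been through CI. */
    static boolean wasEvaluated(Set<Set<String>> evaluated, Set<String> release) {
        return evaluated.contains(release);
    }

    public static void main(String[] args) {
        // The integration branch only ever saw both features together.
        Set<Set<String>> evaluated = Set.of(Set.of("feature-1", "feature-2"));

        System.out.println("Both features: "
            + wasEvaluated(evaluated, Set.of("feature-1", "feature-2")));
        System.out.println("Feature 1 only: "
            + wasEvaluated(evaluated, Set.of("feature-1")));
    }
}
```

Releasing a subset is not covered by the tests of the superset; it is a new combination that nothing has evaluated.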

Further, as I see it the whole and only point of branching is to isolate changes between branches, this is the polar opposite of the intent of CI, which depends upon evaluating every change, as frequently as practical, against the shared common picture of what ‘current’ means in the system as a whole. So if the feature branches are consistently merging with the integration branch, or any other shared picture of the current state of the system – like head, then it isn’t really a “feature branch” since it isn’t isolated and separate.

Let’s examine an alternative interpretation, that in this case I am certain that the speaker at the conference didn’t intend. The alternative is that the feature branches are real branches. This means that they are kept isolated, so that people working on them can concentrate on those changes and only those changes without worrying about what is going on elsewhere. This picture represents that case – just to be clear, this is a terrible idea if you mean to benefit from CI!

In this case feature branch 1 is not merged with the integration branch, or any other shared picture, until the feature is complete. The problem is that when feature branch 2 is merged it had no view of what was happening on feature branch 1 and so the merge problem it faces could be nothing at all or represent days or even weeks of effort. There is no way to tell. The people working independently on these branches cannot possibly predict the impact of the work elsewhere because they have no view of it. This is entirely unrelated to the quality of merge tools; the merge problems can be entirely functional, nothing to do with the syntactic content of the programming language constructs. No merge tool can predict that the features that I write and features that you write will work nicely together, and if we are working in isolation we won’t discover that they don’t until we come to the point of merge and discover that we have evolved fundamentally different, incompatible, interpretations. This horrible anti-pattern is what CI was invented to fix. Those of us who lived through projects that suffered all-too-common periods of merge-hell before we adopted CI never want to go back to it.

So I am left with two conclusions. One, for me the definition of CI is that you must have a single shared picture of the state of the system and every change is evaluated against that single shared picture. The corollary of this is that there is no point having a separate integration branch, rather release from head. My second conclusion is that either these things aren’t feature branches and so CI (and CD) can succeed, or they are feature branches and CI is impossible.

One more thought: feature-branching is a term that is, these days, closely associated with DVCSs and their use, but I think it is the wrong term. For the reasons that I have outlined above these are not real branches, or they are incompatible with CI (one or the other). The only use I can see for a badly mis-named idea of “feature branching” is that if you maintain a separate branch in your DVCS, but compromise the isolation of that branch to facilitate CI, then you do have an association between all of the commits that represent the set of changes that are associated with a particular feature. Not something that I can see an immense amount of value in to be honest, but I can imagine that it may be interesting occasionally. If that is the real value then I think it would benefit from a different name. This is much less like a branch and more like a change-set or, more accurately in configuration management terms, a collection of change-sets.

Posted in Agile Development, Continuous Delivery

Disruptor – The implications on design

My company, LMAX, has recently released our Disruptor technology for high performance computing. You can read Martin Fowler’s overview here.

The level of interest that we have received has been very pleasing, but there is one point that is important to me that could be lost in the understandable, and important, focus on the detail of how the Disruptor works. That is the effect on the programming model for solving regular problems.

I have worked in the field of distributed computing in one form or another for a very long time. I have written software that used file exchange, data exchange via RDBMS, Windows DDE (anyone else remember that?), COM, COM+, CORBA, EJB, RMI, MOM, SOA, ESB, Java Servlets and many other technologies. Writing distributed systems has always added a level of complexity that simply does not exist in the sort of simple, single computer, code that we all start out writing.

Our use of the Disruptor, at LMAX, is the closest to that simplicity that I have seen. One way of looking at the way that we use this technology is that it completely isolates business logic from technology dependencies. Our business logic does not know how it is stored, how it is viewed, whether or not it is clustered or anything else about the infrastructure that it runs on. There are two concessions that result from the programming model imposed upon us by our Disruptor-based infrastructure, neither of which is onerous or even unusual in regular OO programming. Our business logic needs to expose its operations via an interface and it reports results through an interface.

Typically our business logic looks something like this:

    public interface MyServiceOperations {
        void doSomethingUseful(final String withSomeParameters);
    }

    public class SomeService implements MyServiceOperations {
        private final EventChannel eventChannel;

        public SomeService(EventChannel eventChannel) {
            this.eventChannel = eventChannel;
        }

        @Override
        public void doSomethingUseful(final String withSomeParameters) {
            // some useful work here ending up with someResults,
            // which are then reported through eventChannel
        }
    }


There are no real constraints on what parameters may be passed, on how many interfaces may be used to expose the logic of the service, on how many event channels are used to publish the results of the service or anything else – that is it, we need to register at least one interface to get requests into the service and at least one to publish results from it. There is also nothing special about these interfaces, they are plain old Java interfaces specified by the business logic (nothing to inherit from) and then registered with our infrastructure.

Since our business logic runs on a single thread and only communicates with the outside world via these interfaces it can be as clean as we like. Our models are fully stateful, rich object oriented implementations of models that address the problem that we are trying to solve. Sometimes our models are not as good as they could be, but that is not because of any technology imposition it is because our modelling wasn’t good enough!
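
One consequence is that this kind of business logic can be exercised without any infrastructure at all. The sketch below assumes a shape for `EventChannel` (a single `publish` method taking a `String`), which is a guess made for illustration and not the real LMAX interface; the upper-casing stands in for real work.

```java
import java.util.ArrayList;
import java.util.List;

class SomeServiceExample {
    /** Hypothetical shape for the results interface; the real EventChannel is not shown here. */
    interface EventChannel {
        void publish(String result);
    }

    interface MyServiceOperations {
        void doSomethingUseful(String withSomeParameters);
    }

    static class SomeService implements MyServiceOperations {
        private final EventChannel eventChannel;

        SomeService(EventChannel eventChannel) {
            this.eventChannel = eventChannel;
        }

        @Override
        public void doSomethingUseful(String withSomeParameters) {
            // Stand-in for some useful work ending up with someResults.
            String someResults = withSomeParameters.toUpperCase();
            eventChannel.publish(someResults);
        }
    }

    public static void main(String[] args) {
        // In a test, the "infrastructure" is just a list that records what was published.
        List<String> published = new ArrayList<>();
        MyServiceOperations service = new SomeService(published::add);

        service.doSomethingUseful("hello");

        System.out.println(published);
    }
}
```

The service neither knows nor cares whether the channel is backed by a list in a test or by real messaging in production.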

Of course it is not quite that simple, mechanical sympathy is still important in that we need to separate our services intelligently so that they are not tightly-coupled, but this too is only about good OO design and a focus on a decent separation of concerns at the level of our services as well as at the more detailed level of our fine-grained object models. This is the cleanest, most uncluttered, approach to distributed programming that I have ever seen – by far. It feels like a liberation. I am very proud of our infrastructural technology, but for me the real win is not how clever our infrastructure is, but the degree to which it has freed us to concentrate on the business problems that we are paid to solve.

Posted in Agile Development, LMAX

LMAX Disruptor now open source

As part of our work to create an ultra-high performance financial exchange we looked into a lot of different approaches to high performance computing. We came to the conclusion that a lot of the common assumptions in this area were wrong.

We have done a lot of things to make our code fast and efficient, but the single most important thing has been to develop a new approach to managing the coordinated exchange of data between threads. This has made a dramatic difference to the performance of our code. We think it sets a new benchmark for performance, beating comparable implementations that use queues to separate processing nodes, by 3 orders of magnitude in latency and by a factor of 8 in throughput. You can watch a presentation, by a couple of my colleagues, describing one of our uses of this technology here.

I am pleased to announce that we have now released this as an open-source project. There is a technical article describing the approach and providing some evidence for our claims available at the site.

We think that this is the fastest way to write code that needs to coordinate the activity of several threads and all of our experiments so far have backed this up.

Why ‘Disruptor’?

Well, there are two reasons. Primarily we wanted to disrupt the common assumptions in this space because we think that they are wrong. But, to be honest, we also couldn’t resist the temptation. There was some talk about Phasers in Java at the time when we named it and, for those of you too young to care, Phasers were the Federation weapon and Disruptors the Klingon equivalent in Star Trek 😉

Posted in Uncategorized

Presentation on Continuous Delivery at LMAX

I was recently asked to do a presentation on the topic of Continuous Delivery at the London Tester Gathering.

You can see a video of the presentation here.

In this presentation I describe the techniques and some of the tools that we have applied at LMAX in our approach to CD.

Posted in Continuous Delivery, LMAX

How long to retain build output?

Martin Fowler has recently made a post on the topic of the importance of reproducible builds. This is a vital principle for any process of continuous integration. The ability to recreate any given version of your system is essential, but there are several routes to it if you follow a process of Continuous Delivery (CD).

Depending on the nature of your application reproducibility will generally involve significantly more than only source code. So in the achievement of the ability to step-back in time to the precise change-set that constituted a particular release version of your software, the source code, while significant, is just a fragment of what you need to consider.

Martin outlines some of the important benefits of the ability to accurately, even precisely, reproduce any given release. When it comes to CD there is another. The ability to reproduce a build pushes you in the direction of deployment flexibility. By the time a given release candidate arrives in production it will have been deployed many times in other environments and for CD to make sense, these preceding deployments will be as close as possible to the deployment into production.

In order to achieve these benefits we must then be able to recover more than just the build; we must be able to reconstitute the environment in which that version of the code that your development team created ran. If I want to run a version of my application from a few months ago, I will almost certainly have changed the data-schemas that underlie the storage that I am using. The configuration of my application, application server or messaging system may well have changed too.

In that time I have probably upgraded my operating system version, the version of my web server or the version of Java that we are running too. If we genuinely need to recreate the system that we were running a few months ago all of these attributes may be relevant.

Jez and I describe approaches and mechanisms to achieve this in our book. An essential attribute of a reproducible build is a single identifier for a release that identifies all of the things that represent the release: the code, the configuration, 3rd party dependency versions, even the underlying operating system.

There are many routes to this, but fundamentally they all depend on all of these pieces of the system being held in some form of versioned storage and all related together by a single key. In Continuous Delivery it makes an enormous amount of sense to use a build number to relate all of these things together.

The important part of this, in the context of reproducible builds, is that talking about the binary vs the source is less the issue than the scope of the reproduction that you need. If you are building an application that runs on an end-user’s system, perhaps within a variety of versions of supporting operating environments, then just recreating the output of your commit build may be enough. However if you are building a large-scale system, composed of many moving parts, then it is likely that the versions of third-party components of your system may be important to its operation. In this instance you must be able to reproduce the whole works if you want to validate a bug and so rebuilding from source is not enough. You may need to be able to rebuild from source, but you will also need to recover the versions of the web-server, Java, database, schema, configuration and so on.

Unless your system is simple enough to be able to store everything in source code control, you will have to have some alternative versioned storage. In our book we describe this as the artifact repository. Depending on the complexity of your system this may be a simple single store or a distributed collection of stores linked together by the relationships between the keys that represent each versioned artefact. Of course the release candidate’s id sits at the root of these relationships so that for any given release candidate we can be definitive about the version of any other dependency.

Whatever the mechanism, if you want genuinely reproducible builds it is vital that the relationships between the important components of your system are stored somewhere, and this somewhere should be along with the source code. So your committed code should include some kind of map for ANY system components that your software depends upon. This map is then used by your automated deployment tools to completely reproduce the state of the operating environment for that particular build, perhaps by retrieving virtual machine images from some versioned storage, or perhaps by running some scripts to rebuild those systems to the appropriate starting state.
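
As a sketch of what such a map might look like in its simplest form, consider the following. The `ReleaseManifest` name, the component names and every version string are invented for illustration; a real system would more likely hold this as a file committed with the code and read by the deployment tools.

```java
import java.util.Map;

class ReleaseManifestExample {
    /** Hypothetical map held with the source: one build number keys every versioned component. */
    record ReleaseManifest(String buildNumber, Map<String, String> componentVersions) {
        String versionOf(String component) {
            String version = componentVersions.get(component);
            if (version == null) {
                // A missing entry means this build cannot be reproduced, so fail loudly.
                throw new IllegalArgumentException(
                    "Build " + buildNumber + " records no version for " + component);
            }
            return version;
        }
    }

    public static void main(String[] args) {
        // All names and versions here are made up for the example.
        ReleaseManifest manifest = new ReleaseManifest("1234", Map.of(
            "application", "1234",
            "configuration", "1234",
            "database-schema", "87",
            "java", "1.6.0_26",
            "web-server", "7.0.19"));

        System.out.println("Build " + manifest.buildNumber()
            + " needs schema " + manifest.versionOf("database-schema")
            + " and Java " + manifest.versionOf("java"));
    }
}
```

Given a build number, a deployment tool can then ask for the exact version of each piece instead of assuming whatever happens to be current.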

Because in CD we retain these, usually 3rd party, binary dependencies, and must do so if we want to reproduce a given version of the system, then in most cases we recreate versions from binaries of our code as well as those dependencies because it is quicker and more efficient. On my current project we have never, in more than 3 years, rebuilt a release candidate from source code. However, storing complete, deployable instances of the application can take a lot of storage and while storage is cheap it isn’t free.

So how long is it sensible to retain complete deployable instances of your system? In CD each instance is referred to as a “release candidate”. Each release candidate has a status associated with it, indicating that candidate’s progress through the deployment pipeline. The length of time that it makes sense to hold onto any given candidate depends on that status.

Candidates with a status of “committed” are only interesting for a relatively short period. At LMAX we purge committed release candidates that have not been acceptance tested, those that have been skipped-over because a newer candidate was available when the acceptance test stage ran or those that failed acceptance testing. Actually we dump any candidate that fails any stage in the deployment pipeline.

The decision of when to delete candidates that pass later stages is a bit more complex. We keep all release candidates that have made it into production. The combination of rules that I have described so far leaves us with candidates that were good enough to make it into production but weren’t selected (we release at the end of each two-week iteration and so some good candidates may be skipped). We hold onto these good, but superseded, candidates for an arbitrary period of a month or two. This provides us with the ability to do things like binary-chop release candidates to see when a bug was introduced or demo an old version of some function for comparison with a new one.
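
The rules above can be written down as a small policy function. This is a sketch of the rules as described, not our actual implementation: the status names, the `Candidate` record and the 60-day window (standing in for "a month or two") are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;

class RetentionPolicy {
    /** Hypothetical statuses; a real pipeline may have more stages than this. */
    enum Status { COMMITTED, FAILED, PASSED_NOT_RELEASED, RELEASED }

    record Candidate(String buildNumber, Status status, Instant created, boolean skippedOver) {}

    /** Assumed value for "a month or two". */
    static final Duration SUPERSEDED_RETENTION = Duration.ofDays(60);

    static boolean shouldRetain(Candidate candidate, Instant now) {
        return switch (candidate.status()) {
            // Anything that reached production is kept.
            case RELEASED -> true;
            // Anything that failed any stage is dumped.
            case FAILED -> false;
            // Committed candidates are dropped once a newer one has been picked up instead.
            case COMMITTED -> !candidate.skippedOver();
            // Good but superseded candidates are kept for a limited period.
            case PASSED_NOT_RELEASED ->
                Duration.between(candidate.created(), now).compareTo(SUPERSEDED_RETENTION) <= 0;
        };
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2012-06-01T00:00:00Z");
        Candidate old = new Candidate("100", Status.PASSED_NOT_RELEASED,
            now.minus(Duration.ofDays(90)), false);
        Candidate recent = new Candidate("180", Status.PASSED_NOT_RELEASED,
            now.minus(Duration.ofDays(10)), false);

        System.out.println("Keep 100? " + shouldRetain(old, now));
        System.out.println("Keep 180? " + shouldRetain(recent, now));
    }
}
```

Passing `now` in, rather than reading the system time inside the policy, keeps the function trivially testable.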

We have implemented these policies as a part of our artefact repository so largely it looks after itself.

Posted in Agile Development, Continuous Delivery