Q&A from GOTO Copenhagen Session on Reactive Systems

I recently spoke at GOTO Copenhagen on the topic of Reactive Systems.

I will post a link to the video of my talk here when it is published.

I didn’t have time to answer all of the questions, so here are my answers to the questions asked…

Q: A typical question that arises when thinking about eventual consistency is user responsiveness. Imagine a user updating some property of something in the UI; that user often wants to see the result of that update immediately. It’s kind of bad user experience to show him out-of-date information and a message “please refresh in some time to see the result of your action”. What is your view on that? How can we hide eventual consistency from the user, in the cases where we don’t want him to notice that it’s eventually consistent in the backend?

Q: How do you deal with asynchrony in the UI where users want immediate feedback of their action?

A: Eventual consistency may, maybe should, make you think about some different ways of showing the user what is going on, but there is also an assumption built into your question that this is going to be slow.

Imagine what is going on in a sync call across a process boundary. A request-response call to a remote server perhaps…

Our client code needs to encode the call somehow.
The request needs to be sent across the wire.
Our thread needs to be blocked while we wait for a response.

The server end will be triggered when the message arrives.
We will need to translate the message into something useful.
We call the server code with the message.
Server code formulates a response.
Server code will need to translate the response into something to send, and send it.

Our client code will need to receive the message.
Translate it into something useful.
Wake up the blocked client thread.
Call the client code with the response.
Process the response.

Now think about how that would be different for an async communication. We would remove the steps to block the thread on the client and reactivate it. Instead we could imagine that thread being continually busy, in the simplest case, looping around looking for new messages.

So in the line of any communication there is less work to do, not more.
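As a rough illustration of the async shape described above, here is a minimal sketch of a service that loops over an inbox and a client that sends without blocking. The names (`run_service`, `inbox`, `outbox`) and the message layout are assumptions made up for this example, not part of any real system.

```python
import queue


def run_service(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Drain the inbox on a single thread, emitting one reply per request."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            # A real service would keep polling; we stop so the sketch terminates.
            break
        # Hypothetical "work": double the value and send the result as a new message.
        outbox.put({"reply_to": message["id"], "result": message["value"] * 2})


inbox, outbox = queue.Queue(), queue.Queue()

# The client sends its requests and carries on; it never blocks waiting for a reply.
for i in range(3):
    inbox.put({"id": i, "value": i + 1})

run_service(inbox, outbox)

# Replies arrive later as ordinary messages that the client can match up by id.
replies = [outbox.get_nowait() for _ in range(3)]
```

Note that there is no step here that parks a thread and wakes it again; the reply is just another message.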

All of the highest-performance systems in the world that I am aware of are built on top of async messaging for this reason: telecoms, trading, real-time control systems.

So for the vast majority of interactions a user of an async system will get better, rather than worse, response. In the tiny number of interactions when something is going wrong and responses are slowed, the user is seeing the truth, that their invocation hasn’t finished yet, but they are not blocked from making progress elsewhere.

This does lead to a slightly different take on UI design, but it is, at least, only different rather than worse, and maybe a more accurate and more robust representation of the truth.

Q: Event-Driven or Message-Driven in 2020?
Q: What is the difference between events and messages?
Q: Should we be storing messages or events? Or both?

A: When we wrote the Reactive Manifesto https://www.reactivemanifesto.org/ we debated this a lot. “Event or Message”. We came down on the side of Message because an “Event” gives the impression that there is not necessarily any consumer of the event, whereas “Message” has a more obvious implication that something somewhere cares and is listening.

I think that this may be thought of as a bit like counting Angels, and personally I am fairly relaxed about the differences, but when you are trying to communicate ideas broadly it is sometimes useful to be a bit pedantic about the language that you choose.

Q: How do reactive frameworks relate to reactive systems?

A: I think that there is a relationship, but they are not the same. Reactive frameworks are largely focussed on stream processing at a programmatic level; Reactive Systems is more of an architectural stance.

I did say in my presentation that these ideas are kind of fractal though, so the async, event/message-based nature is common to both of these levels of granularity.

There are some details of the Reactive frameworks that I have seen that I dislike, as a matter of personal taste, as a programmer (Futures for example). I see little advantage in trying to make async look like sync.

Taking a more architectural viewpoint and simply, at the level of a service, processing async messages as input and sending async messages as output results in simpler code. It may result in a little more typing, but the code will be simpler and so easier to follow.

The real advantage that I perceive in Reactive Systems is the separation of essential and accidental complexity. The code that I spend my day-to-day work on is inside the services. It is focused solely on the domain logic of my problem. Everything else is outside in the infrastructure. Reactive Programming probably offers the same effect if you think about it, but most of the code that I have seen doesn’t achieve that.

Q: Any good places to store events?

A: Ideally that is a problem for your infrastructure. Aeron, for example, has “Clustering” Support which allows you to preserve, and distribute, the event-log. When configured this way, it will record, and play-back, the stream of events for you.

But once you have the stream, you can do almost anything you like with it.

Q: When would a message-driven system be inappropriate, or just overkill?

A: I think that my answer to this splits into two.

On the one hand, this style of development is still reasonably niche. It has a long and extremely well-established history, but is still most widely used in fairly unusual problems: Trading, Telecoms, Real-Time systems and so on. I believe that it is MUCH more widely applicable than that, but because of that the tooling is fairly niche too. Akka is probably the most fully-fledged offering. It is certainly Reactive. Personally, I think that there are some aspects of the Actor model in Akka that seem more complex than is really required, but it is a great place to start, with lots of examples and successful industrial and commercial applications.

On the other hand, as I said, there is something fairly fundamental at the level of Computer Science here. Async message passing between isolated nodes is a bit like the quantum physics of computing; it is the fundamental reality on which everything else is built. This is how processors work internally. It is how Erlang works, it is how Transputers worked in the 1980s and it is how most of the seriously high-performance trading systems in the world work at some level.

Performance isn’t the only criterion though. I value this approach primarily for the separation of accidental and essential complexity. Distributed and Concurrent systems are extremely complex things – Always! This approach allows me to build complex systems more simply than any other approach that I know.

So I think that it should be MUCH more broadly applicable, but the current level of tooling and support means that I probably would choose to use it only when I know that the system will need to scale up to run on more than one computer or needs to be VERY robust. For systems simpler than that, I may compromise on a more traditional approach 🙂

Q: Instead of back pressure couldn’t you automatically startup an extra component b?

A: Yes you can but you need to signal that need somehow, and that is what “Back-pressure” is for. It allows us to build systems that are better able to “sense” their need to scale elastically on demand.

Q: Why is an unbounded queue a bad pattern? How about Apache Kafka?

A: An unbounded queue is ALWAYS unstable. If you overload it what happens next? To build resilient systems you must cope with the case of what you will do when demand exceeds supply (the queue is full).

There are only three options:

  1. Discard excess messages.
  2. Increase resources to cope with demand.
  3. Signal that you can’t cope and slow-down the input at the sender.

Options 2 & 3 require the idea of “Back-Pressure” to get the message out to something else to either launch some more servers (elastically scale) or to slow input.

At the limit, given that resources are always finite (even in the cloud) you probably want to consider both 2 & 3 for resilient systems.
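To make the three options concrete, here is a minimal sketch of a bounded inbox that refuses messages when it is full and reports that fact to the caller. The class name `BoundedInbox` and its methods are hypothetical, invented for this example; what the sender does with the `False` result (drop the message, slow down, or ask for more capacity) is exactly the policy choice described above.

```python
import queue


class BoundedInbox:
    """A fixed-capacity inbox that signals back-pressure instead of growing."""

    def __init__(self, capacity: int) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0  # count of refusals, e.g. as an input to scaling decisions

    def offer(self, message) -> bool:
        """Try to enqueue; return False (the back-pressure signal) when full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def poll(self):
        """Remove and return the oldest message (raises queue.Empty if none)."""
        return self._queue.get_nowait()


inbox = BoundedInbox(capacity=2)

# The sender learns immediately which messages were not accepted.
accepted = [inbox.offer(n) for n in range(4)]
```

Once the consumer polls a message, capacity is freed and the next `offer` succeeds again, so the signal naturally tracks the consumer's pace.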

Kafka allows you to configure what to do when the back-pressure increases.

Q: If you need to join two datasets, coming from two different streams, first stream – fast real-time, second – slowly changing, without persisting data on your storage, how would you recommend to do it? Any recommended patterns?

A: In the kind of stateful, single-threaded reactive system that I was describing this is a fairly simple problem. Imagine a stateful piece of code that represents your domain logic. Let’s imagine a book-store. I could have a service to process orders for books. I have lots of users and so the stream of orders is fast and, effectively, constant.

I may not choose to design it like this in the real-world, but for the sake of simplicity, let’s imagine that we check the price of the book as part of processing an order.

I am going to process orders and changes to the price of books on the same thread. This means that I can process different kinds of messages via my async input queue. When an event occurs to change the price of a book, interspersed with processing orders, as I begin to process that message, nothing else is, or can be, going on. Remember, this is all on a single thread, so the “ChangeBookPrice” message is in complete, un-contended control of the state of the Service.

So I have no technical problems, my only problems are related to the problem-domain. These are the sorts of problems that we want to be concentrating on!

So what should we do when we change the price of a book?

We could change the price and reject orders not at that price. We could change the price, but allow orders placed before we changed the price to be processed at the old price… and so-on.

I think that the simplicity of the safe, single-threaded, stateful, programming model combined with the separation of technical and domain concerns that it entails gives us greater focus on the problem at hand.
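The book-store example above might be sketched like this. Everything except the “ChangeBookPrice” message name is an assumption for illustration: the `BookStore` class, the `PlaceOrder` message, the field names, and the choice of the "reject orders not at the current price" rule.

```python
from collections import deque


class BookStore:
    """Stateful, single-threaded service: one message at a time, no locks."""

    def __init__(self, prices):
        self.prices = dict(prices)
        self.accepted = []
        self.rejected = []

    def handle(self, message):
        kind = message["type"]
        if kind == "ChangeBookPrice":
            # While this runs, nothing else can touch the service's state.
            self.prices[message["title"]] = message["price"]
        elif kind == "PlaceOrder":
            # Domain rule chosen for this sketch: reject orders not at the current price.
            if message["price"] == self.prices[message["title"]]:
                self.accepted.append(message["order_id"])
            else:
                self.rejected.append(message["order_id"])


# Fast order stream and slow price-change stream arrive on the same input queue.
inbox = deque([
    {"type": "PlaceOrder", "order_id": 1, "title": "Example Book", "price": 30},
    {"type": "ChangeBookPrice", "title": "Example Book", "price": 35},
    {"type": "PlaceOrder", "order_id": 2, "title": "Example Book", "price": 30},
    {"type": "PlaceOrder", "order_id": 3, "title": "Example Book", "price": 35},
])

store = BookStore({"Example Book": 30})
while inbox:
    store.handle(inbox.popleft())
```

Switching to the other rule (honour orders placed before the price change) would only alter the body of the `PlaceOrder` branch, which is the point: the remaining decisions are domain decisions.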

Q: Let’s say you have scalable components and a large history of events. How to deal with the history to recreate the state of that new component which just scaled up. Use snapshots to store an intermediate state of a component?

A: Yes, this is one of the complexities of this architectural approach. You get some wonderful properties, but it is complex at the point when messages change.

The first thing to say is that in these kinds of architectures, the message store is the truth!

The first scenario, that you talk about, is what happens if you want to re-implement your service. Well, as long as the message protocol is consistent – go ahead, everything will still work. Since a message is the only way that you can transform the state of your service, as long as you can consistently replay the messages in order, your state, however it is represented internally, will be deterministic.

The problem comes when you want to change the messages. You have then got an asymmetry between what you have recorded and what you would like to play-back. When we built our exchange we coped with this in two different ways.

When we shut the system down we would take a “snapshot” of the state of a service. When the service was re-started it would be restarted by initializing it with the newest snapshot, and then by replaying any outstanding, post-snapshot, messages.
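A minimal sketch of that restart sequence, under made-up assumptions: the state is a dictionary of account balances, each message carries a sequence number `seq`, and the snapshot records the last sequence number it includes. None of these names come from the real exchange.

```python
def apply(state, message):
    """Pure state transition: return a new state with the message applied."""
    new_state = dict(state)
    account = message["account"]
    new_state[account] = new_state.get(account, 0) + message["amount"]
    return new_state


def replay(state, messages):
    """Apply messages in order; the same input always gives the same state."""
    for message in messages:
        state = apply(state, message)
    return state


log = [
    {"seq": 1, "account": "a", "amount": 10},
    {"seq": 2, "account": "b", "amount": 5},
    {"seq": 3, "account": "a", "amount": -3},
    {"seq": 4, "account": "b", "amount": 7},
]

# Snapshot taken at "shutdown", after message 2 had been processed.
snapshot = {"last_seq": 2, "state": replay({}, log[:2])}

# Restart: initialise from the snapshot, then replay only post-snapshot messages.
outstanding = [m for m in log if m["seq"] > snapshot["last_seq"]]
restored = replay(snapshot["state"], outstanding)

# Replaying the whole log from scratch reaches the same state.
full = replay({}, log)
```

The snapshot is only an optimisation; the log remains the source of truth, which is why both routes must agree.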

We then built some tools that allowed us to apply (and test) transformations on the snapshot. This was a bit complicated, but worked for us.

The other solution was to support multiple message versions at runtime and dynamically apply translations into the new form required by the service.
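One way to picture runtime translation is a chain of small "upgrade" functions, one per version step, applied before the service sees the message. This is only a sketch: the version field, the `v1_to_v2` function and the schema change itself (a `price` in whole units becoming `price_cents`) are all hypothetical.

```python
CURRENT_VERSION = 2


def v1_to_v2(message):
    # Hypothetical schema change: v1 carried "price" in whole units,
    # v2 carries "price_cents" as an integer.
    upgraded = {k: v for k, v in message.items() if k != "price"}
    upgraded["price_cents"] = message["price"] * 100
    upgraded["version"] = 2
    return upgraded


# One translator per old version; each upgrades a message by exactly one step.
TRANSLATORS = {1: v1_to_v2}


def to_current(message):
    """Upgrade a recorded message, step by step, to the form the service expects."""
    while message["version"] < CURRENT_VERSION:
        message = TRANSLATORS[message["version"]](message)
    return message


old = {"version": 1, "type": "ChangeBookPrice", "title": "Example Book", "price": 12}
new = {"version": 2, "type": "ChangeBookPrice", "title": "Example Book", "price_cents": 1500}

# Old and new recordings can be replayed through the same service code.
translated = [to_current(m) for m in (old, new)]
```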

One more, common, pattern that we didn’t use much in our exchange was to support multiple versions of the same message, through different adaptors.

Q: How can random outcomes be reproducible? E.g. implementing a game with dice. Roll die will have a result, but if only the command is saved?

A: Fairly simply, you externalize the dice! Have a service outside of the game that generates the random event. Send that as a message. The game is now deterministic in terms of the sequence of messages.
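A small sketch of the idea, with invented names (`dice_service`, `Game`, a `DiceRolled` message): the randomness lives outside the game, its results are recorded as messages, and replaying the recording reproduces the game state exactly.

```python
import random


def dice_service(rng, rolls):
    """External source of randomness: each roll is published as a message."""
    return [{"type": "DiceRolled", "value": rng.randint(1, 6)} for _ in range(rolls)]


class Game:
    """The game contains no randomness of its own; it only reacts to messages."""

    def __init__(self):
        self.position = 0

    def handle(self, message):
        if message["type"] == "DiceRolled":
            self.position += message["value"]


# The recorded message stream is what gets persisted, not the "roll" command.
recorded = dice_service(random.Random(), 5)

live = Game()
for message in recorded:
    live.handle(message)

# Replaying the same recorded stream later gives exactly the same outcome.
replayed = Game()
for message in recorded:
    replayed.handle(message)
```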

Q: What about eventual consistency of data? how do you resolve conflicts?

A: I think that broadly there are two strategies. First, you align your services with Bounded Contexts in the problem domain. You choose these, where you can, so that you don’t care about consistency between different services.

For example, if I am buying books off Amazon, the stuff that is in my shopping cart right now is unrelated to the history of my orders. Even once I have ordered the stuff in my cart, I don’t really care if it takes a second or two for the order-history to catch up. So “eventual consistency” between my “Shopping Cart service” and “Order History Service” doesn’t matter at all.

Where I need two distinct, distributed, services to be in-step I can take the performance overhead of achieving consensus. There are well-established distributed consensus protocols that will achieve this. RAFT is probably the best known at the moment. So you can apply RAFT to ensure that your services are in-step where they need to be.

If this sounds slower, it is, but it is no slower than any other approach; this is ALWAYS what you must do to achieve consistency. These are the same kind of techniques that happen below the covers of more conventional, distributed synchronous approaches (e.g. Distributed Transactions in an RDBMS).

Q: How do you ensure ordering across multiple instances of the same component? So scaling up, without risking two instances reserving the same, but last, book in the inventory?

A: This is back to the idea of eventual consistency. There are two strategies. Live with the eventual consistency:

Allow separate instances of your component to place an order for a book at the same time, but have one “Inventory” to actually fulfill the order.

or:

Use some distributed consistency protocol to coordinate the state of each place where books can be ordered.
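The first strategy can be sketched as a single inventory service that acts as the arbiter. The names here (`Inventory`, `ReserveBook`, the order ids) are assumptions for illustration. Two front-end instances may both believe the last copy is available, but the inventory sees their requests one at a time, in the order they arrive on its queue.

```python
from collections import deque


class Inventory:
    """Single-threaded owner of stock levels; it alone decides who gets a book."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.outcomes = []

    def handle(self, message):
        title = message["title"]
        if self.stock.get(title, 0) > 0:
            self.stock[title] -= 1
            self.outcomes.append((message["order_id"], "reserved"))
        else:
            self.outcomes.append((message["order_id"], "out_of_stock"))


# Two instances of the ordering component each send a request for the last copy.
inbox = deque([
    {"type": "ReserveBook", "order_id": "A-1", "title": "Example Book"},
    {"type": "ReserveBook", "order_id": "B-1", "title": "Example Book"},
])

inventory = Inventory({"Example Book": 1})
while inbox:
    inventory.handle(inbox.popleft())
```

The losing instance finds out via a reply message a moment later, which is the eventual consistency that this strategy accepts.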

Q: Isn’t reactive Actors in a different pyjama?

A: The stuff I was describing could be considered to be a simple actor style. It misses some of the things that are usually involved in other actor-based systems (e.g. Akka).

The fundamental characteristics though are the same. We have stateful bubbles of logic (actors) communicating exclusively via async messages.

Q: In the system you built – did you use a message bus?

A: Yes, we built our own infrastructure layered on top of a high-performance messaging system from 29 West.

Q: Should messages be sent to kafka or similar?

A: You can certainly implement Reactive Systems on top of Kafka.

Q: Why not just accept message 4 if 3 is missing? is order important?

A: Dropping messages is not a very sensible thing to do at the level of your infrastructure. Though it may make sense within a particular problem domain.

The problem is, if my infrastructure just ignores the loss of message 3, then the state of my service processing the messages is now indeterminate. Imagine two services listening to the same stream of messages. One gets 3 the other doesn’t. If we don’t make the message order dependable our system is not deterministic.

If your problem domain allows you to ignore messages, perhaps they arrive too late and are no longer of interest – true in some trading systems for example, then you should deal with that problem in the code that deals with the problem domain, the implementation of the service, rather than in the infrastructure.

So the safest approach is to build the reliable messaging into the infra and deal with other special cases as part of the business problem.
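As a sketch of what "reliable, ordered delivery in the infrastructure" can mean, here is a receiver that refuses to hand message 4 to the service until message 3 has arrived, and records a request to have the missing message re-sent. The class `OrderedReceiver` and its fields are hypothetical, and real messaging layers handle this in more sophisticated ways.

```python
class OrderedReceiver:
    """Deliver messages to the service strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}          # out-of-order messages held back
        self.delivered = []        # what the service has actually seen
        self.resend_requests = []  # sequence numbers we asked the sender to repeat

    def on_message(self, seq, payload):
        if seq < self.next_seq:
            return  # duplicate of something already delivered
        self.pending[seq] = payload
        # Any gap before this message means something was lost: ask for it again.
        for missing in range(self.next_seq, seq):
            if missing not in self.pending and missing not in self.resend_requests:
                self.resend_requests.append(missing)
        # Deliver as far as we can without skipping anything.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


receiver = OrderedReceiver()
for seq, payload in [(1, "a"), (2, "b"), (4, "d")]:
    receiver.on_message(seq, payload)

# Message 4 has arrived but is held back until 3 turns up.
held_back = list(receiver.delivered)

receiver.on_message(3, "c")  # the re-sent message arrives
```

Every service listening to the stream then sees the same messages in the same order, which is what keeps their state deterministic.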

Q: Persisting events sounds like a Big overhead on traditional synchronous call

A: Yes, and doing it efficiently is important. However, if you have a system, of any kind, that requires state to be retrievable following a power outage, you have to store it somewhere. The mechanisms that I described for how the message stream is persisted as a stream of events are almost precisely the same as those you would implement in the guts of a Database system. All modern RDBMSs are based on the idea of processing a “transaction log”. This is the same thing, except where, and when, we process the log is changed.

When building our exchange we did a lot of research into this aspect of our system’s performance. The trouble with something like a DB is that it is optimized for a general case. If you look carefully at the performance of the storage devices that we use to persist things, they are all, even SSDs, not optimized for predictable performance in Random Access. They work most efficiently if you can organize your access of them sequentially. We took advantage of that in the implementation of our message-log persistence so that we could stream to pre-allocated files and so get predictable, consistent latency. Modern disks and SSDs are very good at high-bandwidth streaming. So we could outperform an RDBMS by several orders of magnitude.
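To show the access pattern only (not the performance), here is a toy append-only journal that writes length-prefixed records sequentially into a file sized up front. The `Journal` class, the record format and the file name are assumptions for this sketch, not a description of the real implementation, and a durable version would also need to force writes to the device (for example with `os.fsync`).

```python
import os
import struct
import tempfile


class Journal:
    """Append-only message log written sequentially into a pre-sized file."""

    def __init__(self, path, size):
        self._file = open(path, "w+b")
        # Size the file up front (may be sparse on some filesystems).
        self._file.truncate(size)
        self._size = size
        self._offset = 0  # next write position; only ever moves forward

    def append(self, payload: bytes) -> None:
        record = struct.pack(">I", len(payload)) + payload  # 4-byte length prefix
        if self._offset + len(record) > self._size:
            raise IOError("journal full")
        self._file.seek(self._offset)
        self._file.write(record)
        self._offset += len(record)

    def read_all(self):
        """Read the records back in the order they were written."""
        self._file.flush()
        messages, position = [], 0
        while position < self._offset:
            self._file.seek(position)
            (length,) = struct.unpack(">I", self._file.read(4))
            messages.append(self._file.read(length))
            position += 4 + length
        return messages

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as directory:
    journal = Journal(os.path.join(directory, "messages.log"), size=1024)
    journal.append(b"PlaceOrder:1")
    journal.append(b"ChangeBookPrice:35")
    recovered = journal.read_all()
    journal.close()
```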

There is tech on the horizon that, I think, may be disruptive and so strengthen the case even more for the kind of Reactive Systems that I described. That is massive-scale, non-volatile RAM.

Q: Was it LMAX you were working for?

A: Yes, LMAX was the company where we built the exchange.

You can read a bit more about our exchange and its architecture here: https://martinfowler.com/articles/lmax.html

Q: Service as a State machine implicates that the services should be stateful? That is added complexity? Thinking about changing the flows etc.

A: You have to have state somewhere, otherwise your code can’t do anything much.

Not all of your code needs to be stateful though. For the parts of your system that form the “system of record”, in this approach, those parts are implemented as “Stateful Services”.

If you want high-performance you can do this using the in-memory state as the system of record, using the techniques that I described – That was how our exchange worked. For other, slower, systems your service could be stateful and backed by some more conventional data store if that makes sense.

Q: How would a single thread bookstore service handle an order coming in while it is still processing the previous order? Or alternatively, two simultaneous orders?

A: It would queue the incoming second order and process it once the BookStore had finished processing the first. However, because of the MASSIVE INEFFICIENCY of data sharing in concurrent systems, avoiding the locking is something like three orders of magnitude faster than tackling this as a problem in concurrency.

Q: How to effectively handle transactions (and rollback in case of fail) in event based system? And how to understand that transaction not finished?

A: In these kinds of systems the simplest solution is that a message defines the scope of a transaction. If you need broader consistency, use a distributed consensus protocol like RAFT.

Q: How do you deal with the communication with mobile and web frontends (and the UX of it)? Websockets and other solutions always feel more complicated for many use cases.

A: My preferred approach for all UI is to deal with it as a full bi-directional async comms problem. So then you have to use something like Websockets to get full “push” to the remote UI.

Q: Can your code use different cores in the CPU? Or will the next instance of execution use the same core? Do you utilise all the cores?

A: Yes, the system is multi-core, but is “shared-nothing” between cores. We can achieve this through good separation of concerns. For example, one core may be dedicated to taking bytes off the network and putting them in the ring-buffer, another may be focussed on journaling the message to disk, another on processing the business-logic of the service and so on.

You can read more about the LMAX Disruptor, which coordinated those activities, here:

https://lmax-exchange.github.io/disruptor/

…and see an interview with me and my friend Martin Thompson on the topic here:

https://www.infoq.com/interviews/thompson_farley_disruptor_mechanical_sympathy/

Q: How would you make mission critical software asynchronous?

A: Most really mission-critical software is already asynchronous! Look at Telecoms!

Q: Do we have to use Reactive Frameworks (like RxJava) in a Reactive System?

A: No, see my earlier answer.

Q: Seriously, why not JS on server-side?

A: It was a cheap-shot, but Javascript is an enormously inefficient use of a computer. One argument against it is from the perspective of climate-change.

Data Centres, globally, put more CO2 into the atmosphere than commercial aviation. Something between 7 and 10% of global CO2 production. The kind of systems that I am describing are something like four or more orders of magnitude faster than most conventional Javascript systems.

If I am over-exaggerating, and we could improve the performance by a single order of magnitude we could reduce global CO2 emissions by 9%!!!

We tend not to think about software in these terms, but perhaps we should!

I cannot think of any sphere of human activity that tolerates similar levels of inefficiency as software.

Q: How to measure the impact of Eventual Consistency on asynchronous Event-Driven Systems?

A: The term “eventual” is confusing, we are talking about computer-speeds here. Eventual usually, under most circumstances, means faster than any human can detect. So in most cases, the eventuality of the system doesn’t matter at the human scale. Where the system slows for some reason, then you need to be able to cope with the fact that the data is not in-step, but that is simply a reflection of the truth of ANY distributed system. So the trade-off is ALWAYS between slower communications with consistency or faster communications with eventual consistency. The overhead for consistency is considerable, but it is ALWAYS considerable, even in sync-systems.
