Calvin and Hobbes said that there’s treasure everywhere. For a sad geek like me, there’s also computer science everywhere. Doing the washing up the old school way involves a metaphor for the computer science concept of queue, if you think of people collaborating around the draining board. Quite a lot of this is a statement of the bleeding obvious, but I hope it’s still helpful. Queues crop up all over the place in computing, including the asynchronous version of micro-services.
First, three terms I’ll be using:
- Producer: the person or thing who adds things to the queue – in this case the person who washes things and puts them on the draining board;
- Consumer: the person or thing who takes things from the queue – in this case the person who takes wet things from the draining board, dries them and puts them away;
- Queue: the thing that lets the producer and consumer collaborate – in this case the draining board.
I will assume one producer and one or more consumer, because that’s my experience of washing up, but in computer science terms you can have one or more producers (as well as one or more consumers).
In order to collaborate, the producer and consumer must both know about the queue – that it exists and how to access it. They don’t need to know about each other (although see Competition below). Without a queue, the producer hands each thing directly to the consumer, and so the producer and consumer need to know about each other (but don’t need to know about a shared queue). In computing, such direct communication is e.g. making a synchronous call to an API.
The consequence of no queue is that the producer and consumer must often wait for each other. When the producer is preparing its next thing, it doesn’t need to wait for the consumer. However, as soon as it’s ready it will need to wait until the consumer is free. Similarly, while the consumer is processing something it doesn’t need to wait for the producer, but as soon as it has finished its current task it will be idle until the next thing comes from the producer.
If each task takes the same amount of time, or it doesn’t matter that the producer and/or consumer are idle, then direct communication between producer and consumer might be the correct approach. However, if you’re trying to get through the washing up as quickly as possible (meaning both producer and consumers should be busy as much of the time as possible) and there’s a range of things from pans with burned-on food to forks that need only a quick wipe, then a queue could be worth the extra complexity.
Consequences of having a queue
With a queue, a producer can work until the queue is full and a consumer can work until the queue is empty. Variations in how long each task takes will tend to be smoothed out – if the average rate of production equals the average rate of consumption, things will keep ticking along.
If production rate > consumption rate, the queue will tend to fill up, and if production rate < consumption rate, the queue will tend to empty. The variation in the production and consumption rates determine how big the queue needs to be for neither party to be blocked.
Another benefit of queues is the ability to have two co-operating sets of parties, rather than just two co-operating parties. By that I mean you can have one or more producers for a given queue, and one or more consumers for the queue. This means that the throughput of the system as a whole can be maximised more easily than with a single producer and a single consumer communicating directly and synchronously. If a consumer can’t keep up with the producer, one or more extra consumers can be added to help (and likewise for adding extra producers). This means that the queue is taking on some of the role of a load balancer.
A final aspect of the queue (which has both positive and negative consequences) is the decoupling in time of production and consumption, i.e. temporal decoupling. By this I mean that that producer can create a particular task and add it to the queue, and the consumer will pick up that task at some point in the future. If the producer doesn’t need to wait for the processing of the task by the consumer, then this is quite straightforward.
However, if the producer needs to know some answer that results from what the consumer does, things get more complicated. In this, the things the producer writes are requests it’s sending to the consumer, and it needs to know the response for each. Therefore, the mechanism by which the producer receives responses from the consumer needs addressing. One approach would be a second queue going back from the consumer to the producer, that holds the responses.
However, in the time that the consumer took to process a request, the producer may have added many more requests to the queue. Therefore, the next response that the producer reads from the response queue probably relates to a request that the producer wrote quite a while ago. Correlating or linking entries on the two queues also needs addressing, which might be as simple as the responses copying a unique id that was on the original request.
There are two or more parties collaborating on a task (getting the washing up done) using shared resources (the draining board). In order to keep things on track, their competition for shared resources needs to be managed.
It might be easiest to understand this concept if you imagine two or more people all taking from the same draining board to dry up, and they are paid by the number of things they dry. While they are actually drying something, they can be away from the draining board, but when it comes time to get the next thing to work on there might be a fight for the draining board. Or, one person could stand next to the draining board all the time, blocking the others from getting anything.
To maximise the throughput of the system as a whole:
- The different parties each need exclusive access to the shared resource some of the time;
- The period of exclusive access for a given party must be as short as possible.
So, the consumers need to take something from the draining board, and then step away from the draining board as they work on the thing in their hand. When they have finished working on it, they need to take their turn at getting the next thing from the draining board.
I still remember a time over twenty years ago when a group of people co-operating on some drying up spontaneously organised themselves into this system. Each person walked the same small circuit around the kitchen – take something from the draining board, walk to the cupboard as they dried and then put away the thing, and then walk back to the draining board. It was surprisingly pleasing.
If you’re not careful, the collaboration could devolve into a turtles-all-the-way-down type of problem. Imagine that there are many consumers, and the way they’ve agreed on to share access to the draining board is to queue up for it. Further imagine that two consumers have each finished their current task at the same time and so want to join the queue. Which one of them can join the queue first?
The queue for the draining board (a shared resource) is itself a shared resource, but this time the people doing the drying up add to this queue (by looking for their next bit of work) as well as take from it (by getting access to the draining board). This is one of the areas where getting help from your underlying platform, e.g. operating system, stops you having to code turtles all the way down.
Another time when the help from the underlying platform comes in handy is when a queue is empty for a while and then new work appears. All the consumers have become idle, and so all are waiting for work. There is now one bit of work but N consumers – which should get it? In order to free up system resources, it’s probably worth having the consumers do an idle wait for work. By that I mean they go to sleep until there’s work to do, rather than constantly pestering with “Is there work yet?” Therefore, as well as choosing which consumer gets the work, the underlying platform should wake up that consumer.
When you are drying up, taking the oldest thing from the draining board makes most sense, all other things being equal. It will have had most time to dry by itself, leaving you with the easiest job. Always taking the oldest thing means that the first thing added to the queue will be the first thing to leave it. This is the classic FIFO (First In, First Out) approach for queues.
However, it might be that you don’t take the oldest thing, but instead have a different way of choosing things from the queue. For instance, there could be one consumer that takes only cutlery, another that takes only glasses and so on. In this case the one physical queue (the draining board) is acting as many logical queues at once – one for cutlery, one for glasses etc. One consumer might take more than one kind of thing, e.g. glasses and cutlery, or just one kind. These kinds of things are often called topics.
Another important aspect to consider is the purpose of the queue. If its purpose is to pass tasks around the system in a simple way, then things can be as I’ve already described. By “simple” I mean that each thing on the queue represents one item of work that will be done by one thing that can do work (person, program etc.). This thing could be a member of a team or pool of similar things that together work on the things, but a given thing is processed by only one member of the team or pool.
If, however, the purpose of the queue is to pass information around the system, then things can get more complicated. If a new order has been added to the system and this is announced by the shopping cart writing a new entry on a queue, it might be that many other parts of the system want to know about and respond to that information:
- The debt module will want to keep track of the money that is due for that order;
- The shipping module will want to start the process of the goods getting to the customer;
- The customer service module will want to add the order to the customer’s order history in case there’s a query;
- The fraud detection module will want to see if the order matches the criteria for fraud;
One approach would be for each kind of consumer to have its own queue, but that would mean that producer would need to know about all kinds of consumer, which is probably adding unhelpful coupling between producers and consumers.
Another approach is to change to a publish / subscribe (or pub/sub) approach. In this the producers publish information on the queue, and the consumers subscribe to it. When a one subscriber, e.g. the debt module, processes a bit of information, it stays on the queue in case other subscribers want it. The next time a subscriber goes back to the queue it won’t see the entries it has already read, even though they might be there for other subscribers who haven’t read it yet.
Diagnosing an unhealthy queue might need a little extra help
Imagine you have a system that includes a queue, which you expect to be neither completely empty nor completely full. But when you look at it, it’s full. You might even have an alert set up, that told you that the queue was full or getting full, so you didn’t have to hunt this information out for yourself. What do you next?
There are at least two possible causes of a full queue (assuming that the system that looks after the queue is healthy):
- The producer has gone faster than normal;
- The consumer has gone more slowly than usual (which includes the case where the consumer has crashed).
It’s not obvious which cause you have, until you add in an extra bit of information such as the age of the things in the queue. By age I mean the time that has elapsed since each was added to the queue. An average or maximum value across the queue is good enough.
If the queue is full and all the items are young, then the cause is more likely to be that the producer has gone faster. The consumer has been able to handle old things successfully (which is why they’re not on the queue any more), but has just got overwhelmed by recent things. Why the producer has sped up is another matter – it might have gone wrong in some way and is writing more output than it should (maybe even writing duplicates in some way). Or it might in turn have its own source of inputs and the rate of its inputs has increased, i.e. the fundamental source of the speed up is somewhere further upstream.
If the queue is full and some of the items are old, then the cause is more likely to be that the consumer has gone more slowly, or even crashed. Why it has slowed down or crashed is the next question to answer.
This issue is part of an excellent video from Goto Conference that is a post-mortem of a problem in production that the speaker had to tackle.
I have only just scratched the surface of this subject, but I hope this has been helpful. I also encourage you to see the world around you on as many levels as make sense. The rainbow is beautiful, and is also the result of the refraction of light waves through water. The washing up gets things sorted for the next time you want to eat, and involves a queue between a consumer and producer.