Computer science while doing the laundry 1: Merge sort

This is part of a short series of computer science involving laundry:

Merge sort
Bin sort

In this article I will explain merge sort, which is a way of sorting things when there are so many of them it’s awkward or impossible to use other approaches. I’ll use doing the laundry as a way of explaining it. In other articles I’ve used the washing up to explain stacks and queues, and used roundabouts to explain starvation (in the Computer Science sense) and deadlock.

Setting the scene

Imagine that you have a lot of laundry to do. There is somewhere that dirty laundry collects – a tub or hamper, or bedroom floors – and you take a load’s worth at a time and put it in a washing machine. You then hang out the clean but wet washing on a clothes line or on radiators ((I will assume a single clothes lines for simplicity). Once the clothes are dry you need to distribute them back to the relevant person so that they can be worn again.

However, you’ve got so much laundry to do that it will take several loads in the washing machine etc. You could do each load end-to-end, but as has happened to me recently, you’ve managed to accumulate several piles of clean and dry washing but haven’t taken them any further. So, the clothes in these piles all still need to get back to their respective owners.

Colourful washing on a line outside a house — Image by Lynn Friedman, licence CC BY-NC-ND 2.0

Digging a bit deeper

Now we’ve got the basic flow, we can dig a little bit deeper about what sorting’s going on and where. It will turn out that we’re almost (but not quite) doing merge sort. (I’ll show later how we can do the full version of merge sort.)

I think that the dirty laundry is going to be in random order, or in time order by when it was laid down (like strata in geology or layers in archaeology). I’m going to assume that you just take the dirty laundry from the top of the pile and stop when you have enough. (I’m skipping over things like washing different colours or fabrics separately.)

In the washing machine, the laundry gets well and truly jumbled up, and it stays this way when you take it out of the machine. I’m going to assume that you take wet but clean washing from the top of its pile and hang it out in that random order.

When the washing is dry, I will assume that you take it off the line in an order that groups together all the clothes for a given person. I.e., first you collect the clothes for person A, then the clothes for person B etc. So, each pile of clean and dry laundry will have clothes for a random subset of the total set of people, and the pile will be sorted by owner.

Now imagine you take all the piles of clean and dry laundry and put them near each other on something like a bed or table. You then take the chunks of laundry pile for each person and put them together, so there is a pile per person that contains clothes from potentially many different loads of laundry.

This is almost, but not quite, merge sort. In the next section I will say what you would need to change to make it completely merge sort. Regardless of whether it’s changed or not, at the core there are two stages – a sorting phase followed by a merging phase, hence the name.

The merge sort version of doing the laundry

There are two bits missing from the previous section that would turn it into full merge sort. The first is that there needs to be an order for people, and the second is that there needs to be an order for clothes. For instance, you could order people by age or alphabetically by their name, and you could order clothes by their age.

If you add both of these orders, then for all possible pairs of clothes you will be able to order them. If one of bit of clothing is for person A and the other for person B you can order by owner. If they’re for the same person then you can break the tie by ordering by e.g. age of clothes.

I’ll be glossing over everything up to the point you collect clean and dry laundry from the line. Everything up to this point means that you have a subset of all dirty laundry, that is now clean, dry and in random order on the line. Now that we can order people and clothes, when you collect clothes off the line you collect them fully sorted. They’re grouped by person as before, but now the groups are in order (e.g. by person’s name), and the clothes within a group are ordered. The result is that the pile as a whole is ordered.

Things continue as before (in this more sorted way) so you end up with many piles of clean and dry laundry on the bed together. This is where things change again, based on things being fully sorted. Instead of directly producing a pile per person, we’re going to produce one enormous fully sorted pile. Because it’s fully sorted, it will then be easy to chop up into slices to create a pile per person. (I’m ignoring the fact that the pile is likely to be so tall and wobbly that it could easily topple to the ground, producing a frustrating jumble rather than anything more useful.)

This is how we will produce the one enormous pile. We look at the top bit of clothing in each pile and consider these as a set. We then pick the member of this set that’s first according to our order. For instance, if there are four piles of clothes and the top bit of clothing in those piles is for person A, B, C, D respectively, we pick the bit of clothing for person A. This bit of clothing becomes the first thing in our enormous output pile.

When we’ve taken off this top bit of clothing, its place in the set is taken by the second bit of clothing down in that pile. I.e. we always consider the top thing in each pile. We then pick the thing that’s now the first according to the order. In practice you would put this on top of the thing you took the first time around, so that the output pile will end up in reverse order (first thing at the bottom, rather than first thing at the top). If you wanted to be mathematically pure about this (and we’re already being a bit weird in the name of mathematical purity – fully sorting clothes, for instance) then when you add a bit of clothing to the enormous pile you would have to lift up the current enormous pile and add the new thing to its bottom.

We carry on this process, continually taking the best candidate across the top of all input piles and adding it to the output pile. It might be that we use up one of the input piles early, i.e. while there’s still stuff to take from the other input piles. That’s OK, we look at the top item in however many input piles are left.

Benefits of merge sort

Merge sort in all its glory is probably an odd way to do laundry, but can be very useful for a computer to sort lots of things. It might be that you have to use merge sort because there are too many things to fit into memory at once (so you can’t sort all of them at once). Or it might be that you want to use merge sort because it’s a way of getting things done in less elapsed time by having more than one computer or process work on the problem at once.

Too many to sort at once

I’ll describe the too-big case using some artificially small numbers for how much work the computer can do at once, because that will make things simpler. Imagine that the computer can sort only 100 things at once.

In the sort stage (getting washing off the line) a human would sort from one place (the line) to another (the pile). A computer is more likely to sort in place e.g. just re-arrange what’s in an array. I’ll assume that the array is 100 items or fewer, so the computer can sort it. (If there are more than 100 things then it will sort them 100 at a time.) It will then dump that sorted array to file so that it can bank its work sorting those things but also move onto the next chunk of stuff to sort. At the end of all the sorting there will be several chunks of 100 things, with each chunk sorted and living in a separate file.

The computer will then open the files for each of the files and read only the first line of each, to populate an array. The array will have as many elements as there are files, as each file will have contributed one entry. I’ll also assume for now that there are 100 or fewer chunks, so all the chunks can be processed at once. It will then sort this array and take the smallest element to write it to an output file. Its place in the array is taken by the next element from its file.

This process continues – there will be no more than 100 elements that need sorting at any time, so the computer can cope. Eventually an input file will be exhausted as all its members have been written to the output, but there are still things from other files waiting to be written. The array for sorting will then shrink, so that there are 99 elements, then 98 etc.

Even though the computer can sort no more than 100 elements at once, it can cope with 100 input files. Each input file can have 100 elements, meaning that 100 x 100 = 10,000 elements can be sorted by a computer that can sort only 100 at a time.

In fact, we can extend this to effectively limitless sorting by adding extra merge stages. The sorting continues as before – you chop off the next 100 elements to sort, sort them and then write them to a file. The merging as described above will have a problem if there are more than 100 files, because the array won’t be able to take a candidate line from all files.

If there are 200 files, then you take the first 100 files and merge them to produce one sorted file with 100 x 100 = 10,000 lines. You do the same for the second 100 files, ending up with two files each with 10,000 lines. It doesn’t matter how long each file is (because only one line at a time is read from each) only how many of them there are. So you can apply the merge phase on these two files of 10,000 lines to produce a single file of 20,000 lines. I.e. after one sort phase there are two, rather than one, merge phases.

You can repeat this process as many times as you need. If there were 20,000 files, you take 100 files at a time, merge them and produce one output file of 10,000 lines. At the end of this round there would be 200 files, each with 10,000 lines. You then repeat the process from the previous paragraph – take two groups of 100 files, merge them to produce two files, and then merge these two files. I.e. now there is one sort phase and three merge phases. A computer that can sort only 100 items at a time has sorted 20,000 x 100 = 2,000,000 items.

This can be extended as much as you like – it will take more and more time, but you can sort as many elements as you like even though there’s an upper limit to how many elements can be sorted at once.

Reducing elapsed time through parallelism

Merge sort is an approach to sorting that’s suitable for many computers or processes co-operating in parallel, particularly in the case where there are many merge phases. As with most parallel processing, the co-ordination of who does what requires some effort and care, but I won’t go into that here.

Each sort can be done in parallel and each merging of 100 files into one output file can be done in parallel. I.e. in the case of 20,000 files there can be:

Up to 20,000 things doing the sorting in parallel
Up to 200 things doing the first round of merging in parallel
Up to 2 things doing the second round of merging in parallel
Exactly one thing doing the third round of merging

The rounds can overlap – as soon as there are 100 sorted files, they can be merged into a single 10,000 line file, even if other sorted 100 line files are still being produced. We just need to keep track of which round a given file is from, so that it’s picked up as input by the next round.

Summary

Full merge sort is probably impractical and overkill for doing laundry, but a simpler approximation to it might be useful. The full version can allow a computer to sort more things that it has capacity for with conventional sorting, and can be worked on by many things in parallel.