This article came out of one of those realisations that two things I already knew were linked or even the same thing, like Clark Kent and Superman or regular expressions and finite state machines. It’s not a new realisation – a minimal amount of Googling showed that the link is in the first paragraph of the relevant bit of Wikipedia.
The two things on their own are cool, and the fact that they’re linked is even nicer. The two things are the k-means approach to finding clusters in a set of data points, and Voronoi maps.
Imagine that you’ve got a set of data, that could be plotted as points on a graph. For simplicity I’m going to assume that there are two properties per point, but it could in theory be more than that. So, the points could represent people, and the properties you have are their age and average time spent using a particular service per week. You plot them on a graph, with age on the X axis and time on the Y axis, giving some pattern of dots.
What k-means lets you do is see if the dots form into clusters, and if so where the centres of those clusters are. If, instead of forming into separate clusters, you think that they are in a single cloud that heads generally in one direction and you want to see how tight the cloud is, then correlation is probably a better tool to use.
The reason why you want to find if there are clusters might be that you’re building a recommendation engine for a website. If you have identified how your users divide up into clusters, then you can see which cluster a given user is in, and e.g. see if many of the other people in their cluster how bought / watched / etc. something that the current user hasn’t. Based on the behaviour of many of the other people who seem similar to them, this might be something that they would like.
How to calculate k-means
The k of k-means is unknown – you don’t know ahead of time how many clusters there are. One approach I’ve used is to start with k=2, see how tightly the points fall into 2 clusters, then repeat with k=3, 4 and so on, until there is no great improvement in trying with the next higher value for k.
The approach for a given value of k is as follows.
- You start off by randomly throwing k darts at the graph. You do this bit only once, at the start.
- For each dot, you draw a line between the dot and whichever dart is closest to it.
- After you’ve drawn all the lines, you move each dart so that it’s in the middle of all the dots that have lines drawn to it.
- You rub out all the lines.
- You go back to 2 and start again. The next time around, a dot might end up connected to the same dart as before, or to a different dart – whichever is now the closest to it.
The effect is that the darts wander across the graph, hopefully converging on the centres of clusters of dots. When the darts have needed to move less than a certain amount, or you’ve hit some limit to the number of goes around the loop above, then you stop and declare the current positions of the darts to be the answer.
There’s a nice visualisation of k-means online. Doing it in something like R is very simple, i.e. a single line of code.
Voronoi maps are a way of dividing up some space, e.g. a 2D plane, based on points in that space. It’s as if you blow a bubble around each point, and all the bubbles grow at the same rate. Eventually two or more bubbles will grow into each other and come to a halt. The outlines of the bubbles will depend on how the points are arranged. The region inside a bubble is the region of space that is closer to the point that gave rise to the bubble than to any other point.
Vi hart has an excellent video that explains Voronoi maps using baking:
There’s also a gorgeous visualisation of a lovely piece of music, based on Voronoi maps. The music is Spem in Alium by Thomas Tallis, and the visualisation has time on the X axis and pitch on the Y axis like on a musical score, with lines showing how a given part (singer) moves between their subset of the notes that are being sung by the whole choir.
Link between k-means and Voronoi maps
Imagine you do the k-means clustering for a given set of points, and then give each dart a different colour, and colour in the dots linked to it with the colour of the dart. Then imagine you create a Voronoi map based on the darts’ positions. You will find that all the dots of a given colour will fit inside the region of the Voronoi map that extends out from the dart of their colour.