This article covers three ways to summarise a data set. It should be gentle stuff: two you’re likely to know already, and the less well-known one isn’t tricky to understand. I’m covering them together in one article to show that stats often calls for a toolbox of several tools, and for knowing which one to use for a particular job.
All three tools let you summarise a data set via a single number. The reasons why there is more than one, and why you might want to use all of them from time to time, are:
- The data sets might have a different structure;
- You want the answer to different questions about a data set.
The three tools are:
- Correlation,
- Standard deviation,
- Gini coefficient.
Correlation applies when each point in your data set has two properties. For instance, your data set describes people, and for each person you have their age and their salary, or age and number of traffic accidents per year. Correlation lets you know how strong the relationship is between the two properties.
There are two extremes for this relationship, and then a big area in the middle where things are fuzzier. At one extreme, as one property increases the other property increases (in a perfect straight line), which gives a correlation of +1. At the other extreme, as one property increases the other property decreases (in a perfect straight line), which gives a correlation of -1. Much more likely is something in the middle, where the points form a cloud that heads generally in one direction but is more or less spread out around one of the two lines of a perfect relationship.
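Here is a minimal sketch of the two extremes and the fuzzy middle. It assumes the usual Pearson correlation coefficient (the article doesn't name a specific one), and the `pearson` helper and all the numbers are made up purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two properties vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

ages = [20, 30, 40, 50, 60]
rising = [2 * a + 5 for a in ages]   # perfect straight line going up
falling = [100 - a for a in ages]    # perfect straight line going down
noisy = [48, 61, 95, 99, 130]        # invented values: a cloud heading upwards

print(pearson(ages, rising))   # one extreme
print(pearson(ages, falling))  # the other extreme
print(pearson(ages, noisy))    # somewhere in between
```

Note that the rising line doesn't have to be y=x itself: any perfect straight line going up lands on the same extreme.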
It’s worth knowing the limitations of any tool, and correlation certainly has some. It doesn’t like curves: it measures how close the points are to a straight line, so a strong but curved relationship can still give a correlation near 0. It also doesn’t prove that the change in one property causes the change in the other, just that the two changes follow a particular pattern (correlation != causation). Beware of spurious correlations.
You can find out more about correlation, including how to calculate it, on Wikipedia.
With standard deviation (and the Gini coefficient) you are concerned with only one property for each member of your data set.
For standard deviation you count how common each value of that property is, so that you end up with two values again (value and frequency). For instance, you look at the height of a lot of people, and you count how many people have each possible height.
What standard deviation lets you do is see how spread out the values are. If every member of the data set has the same value, the standard deviation is 0. As the data set spreads out over a larger and larger range of values, the standard deviation grows without limit. Standard deviation is never negative: it measures only the size of the spread and has no direction, unlike correlation above, whose sign tells you which way the relationship goes.
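As a rough sketch of the value-and-frequency view, the code below counts how often each height occurs and works out the standard deviation from those counts. I'm assuming the population standard deviation (dividing by n, not n-1), and `stdev_from_counts` is a hypothetical helper written for this example; `statistics.pstdev` from the standard library is printed alongside as a cross-check. The heights are made up.

```python
import math
import statistics
from collections import Counter

def stdev_from_counts(counts):
    """Population standard deviation from a {value: frequency} mapping."""
    n = sum(counts.values())
    mean = sum(value * freq for value, freq in counts.items()) / n
    variance = sum(freq * (value - mean) ** 2 for value, freq in counts.items()) / n
    return math.sqrt(variance)

# Invented heights in cm, purely for illustration.
same = [170, 170, 170, 170, 170]
narrow = [168, 169, 170, 170, 171, 172]
wide = [150, 160, 170, 170, 180, 190]

for heights in (same, narrow, wide):
    counts = Counter(heights)  # value -> how many people have that value
    print(dict(counts), stdev_from_counts(counts), statistics.pstdev(heights))
```

The first data set gives 0, and the number grows as the heights spread out over a wider range.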
More about standard deviation on Wikipedia (including how to calculate it).
The Gini coefficient tries to show how similar the values in a data set are to each other. For instance, if you have the income of lots of people, do they have a similar income or do some people have a large income while most people have a small one? (For simplicity, I’m assuming that all the values are non-negative.)
The way it works is:
- You sort the data set from smallest to biggest.
- You work through the data set starting with the member with the smallest value, maintaining a running total of the values of all the members you’ve seen so far. Therefore, after the first member the running total will be just the value of the first member, and after the last member it will be the sum across the whole data set.
- You plot how the running total increases as you move through the data set.
In order to make this easy to compare with other data sets (e.g. income from different countries), things are normalised. Instead of the x axis being the number of people, it’s the percentage of the total people. Instead of the y axis being the running total’s value, it’s the running total as a percentage of its final value.
The normalisation means that the graph will go from (0, 0) in the bottom left corner to (100, 100) in the top right corner.
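The sort, running total and normalisation steps above can be sketched as follows. The function name `lorenz_points` is my own (this graph is commonly called a Lorenz curve), the incomes are invented, and the sketch assumes non-negative values with a total greater than 0.

```python
def lorenz_points(values):
    """Return (x, y) points, both as percentages, for the normalised graph."""
    ordered = sorted(values)          # smallest to biggest
    total = sum(ordered)              # final value of the running total
    n = len(ordered)
    points = [(0.0, 0.0)]             # the graph starts in the bottom left corner
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        # x: percentage of members seen so far; y: percentage of the final total
        points.append((100 * i / n, 100 * running / total))
    return points

incomes = [10, 40, 20, 30]  # made-up incomes
for point in lorenz_points(incomes):
    print(point)
```

Whatever the input, the first point is (0, 0) and the last is (100, 100).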
If all members of the data set had exactly the same value, then the graph would be a straight line from bottom left to top right (i.e. the line y=x). Adding on each extra member will increase the running total by the same amount, so the graph is a straight line (i.e. with constant gradient). This is one extreme.
The other extreme is that all values are 0 apart from the last one. In this case the graph will be a horizontal line following the positive x axis, and then a vertical line going upwards. Adding on each extra member makes 0 difference to the running total, until you get to the last member at which point it jumps to its final value in one go.
In between these two extremes, the graph will be fixed to y=x at its two ends, but will sag down from y=x for the region in between.
The Gini coefficient measures how much the graph sags down from y=x. I.e. it’s the area below y=x and above the graph, expressed as a percentage of the total area below y=x. In the graph below it’s the area between the blue and green graphs, expressed as a percentage of all the area below the blue graph.
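Here is a minimal sketch of that area calculation. It assumes the points of the graph are joined by straight lines, so the area under the graph is a sum of trapezoids, and it scales both axes to run from 0 to 1 so the area below y=x is 0.5. The name `gini_percent` and the data are made up, and the values are assumed to be non-negative with a total greater than 0.

```python
def gini_percent(values):
    """Gini coefficient as a percentage, from the area between y=x and the graph."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    area_under_graph = 0.0
    running = 0
    previous_y = 0.0
    for value in ordered:
        running += value
        y = running / total
        # Trapezoid between two neighbouring points; each step is 1/n wide.
        area_under_graph += (previous_y + y) / 2 / n
        previous_y = y
    area_under_diagonal = 0.5  # triangle below y=x on a 1-by-1 square
    return 100 * (area_under_diagonal - area_under_graph) / area_under_diagonal

print(gini_percent([5, 5, 5, 5]))      # everyone has the same value
print(gini_percent([10, 20, 30, 40]))  # somewhere in between
print(gini_percent([0, 0, 0, 100]))    # one member has everything
```

One caveat with this straight-line approach: with only a few members, the "one member has everything" case doesn't quite reach 100, because the last segment of the graph is steep rather than truly vertical. With four members it comes out at 75, and it gets closer to 100 as the data set grows.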
I hope that this has introduced you to or reminded you of some simple ways of summarising a data set with a single number. I hope that it also highlights the general principle of needing more than one stats tool in a given area, so that you have the right tool for the job.
To summarise the 3 tools:
|Tool|Format of data|Purpose|Possible values|
|---|---|---|---|
|Correlation|Property1, Property2|How close is the properties graph to a rising or falling straight line?|-1 to +1|
|Standard deviation|Property, Frequency|How close is the frequency graph to x=A?|0 to +infinity|
|Gini coefficient|Property|How close is the cumulative total graph to y=x?|0 to 100|