The cool chain of data

Perishable goods like milk and some medicine need to be kept cool, otherwise they go bad. With something like milk, if it has gone bad enough you can tell from the smell. However, this will probably be after you’ve paid for it and brought it home from the shop, so you have wasted time and money. Medicine is worse: you often can’t tell if it is ineffective or dangerous without specialist equipment.

Things are made more complicated for both because there are many steps involved in getting them to you, and these steps can involve many different companies, technologies, risks etc. All these steps need to link together in an unbroken chain, hence the name cool chain. It doesn’t matter if the milk is kept refrigerated at the dairy farm, in the tanker, in the shop and in your fridge if it’s kept in an unrefrigerated warehouse along the way.

Photo by Bob Embleton, under Creative Commons Attribution-Share Alike Generic Licence 2.0
**20 gallons of unpasteurised sales data**

A cool chain is also a useful metaphor for data pipelines (I’ve written about other metaphors elsewhere). Data usually goes through a series of steps that transform it into some desired state, similar to the many steps involved in getting milk from farm to fridge.

As soon as a step is defective – letting the milk or data go bad in some way – that’s often it. You can’t unspoil the milk or easily fix all data problems. If there are problems with the data pipeline, often the best you can hope for is spotting problems as soon as they happen, to minimise the amount of data that has gone off.

That leads to another issue: how confident are you that you can detect problems? Are you doing the equivalent of merely looking at the milk, or are you smelling it or even checking its chemical composition? You might be processing data that has come from elsewhere. Before you go to the trouble of doing lots of processing, have you checked its quality? How much can you trust the earlier stages of the cool chain?
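As a rough illustration of checking data before processing it, here is a minimal Python sketch. The record layout (`order_id`, `amount`) and the `check_record` helper are assumptions made up for this example, not part of any real pipeline; the point is only that suspect records get set aside before any expensive work is done on them.

```python
# Hypothetical sales records arriving from an earlier stage of the pipeline.
incoming = [
    {"order_id": "A1", "amount": 12.5},
    {"order_id": "", "amount": 3.0},
    {"order_id": "A3", "amount": -4},
]


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty means it looks fine)."""
    problems = []
    if record.get("order_id") in (None, ""):
        problems.append("missing order_id")
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not a number")
    elif amount < 0:
        problems.append("amount is negative")
    return problems


# Split the batch before doing any further processing.
good = [r for r in incoming if not check_record(r)]
bad = [(r, check_record(r)) for r in incoming if check_record(r)]

print(len(good), "records passed;", len(bad), "set aside")
for record, problems in bad:
    print(record, problems)
```

These checks are closer to smelling the milk than to a chemical analysis. How deep the checks should go depends on how much you trust the earlier stages.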

So far I’ve been talking about the data engineering kind of data processing, but cool chain thinking also applies to the more statistical kind. There are ways to summarise data where the output is only as strong as the weakest link, similar to worrying about the least cool step of the cool chain. One example is how many decimal places you can justify in an answer that comes out of several steps of multiplication, division etc.

This has implications for when you round numbers (e.g. to 1 decimal place) and when you throw away values because they represent groups of people small enough to raise data privacy issues. If possible, you should round values, apply data privacy filters etc. as late as possible. Stretching the analogy a bit, these steps correspond to a relatively warm step in the cool chain, in that they cap the quality of the output regardless of what steps follow and how cool they are.
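To make the point about privacy filters concrete, here is a small Python sketch. The towns, regions, counts and the threshold of 5 are all invented for illustration, and real disclosure-control rules are usually more involved than a single cut-off, so treat this as a sketch of the ordering issue only.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # hypothetical privacy threshold

# Hypothetical (region, town, number of people) rows.
counts = [
    ("North", "A", 3),
    ("North", "B", 4),
    ("North", "C", 20),
    ("South", "D", 2),
]


def total_by_region(rows):
    """Add up the people in each region."""
    totals = defaultdict(int)
    for region, _town, n in rows:
        totals[region] += n
    return dict(totals)


# Filter early: drop small towns first, then aggregate what is left.
filter_early = total_by_region(
    [row for row in counts if row[2] >= MIN_GROUP_SIZE]
)

# Filter late: aggregate everything, then drop regions that are still too small.
filter_late = {
    region: n
    for region, n in total_by_region(counts).items()
    if n >= MIN_GROUP_SIZE
}

print(filter_early)  # {'North': 20}
print(filter_late)   # {'North': 27}
```

Both versions hide the South region, whose total is still below the threshold. Filtering early also drops the 7 people in towns A and B from the North total, even though the regional figure is large enough to publish.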

For example, if I have numbers that I need to sum to 1 decimal place, I could round each number to 1 decimal place and then add them. This would guarantee that the answer is to 1 decimal place, but it would be less accurate than if I add them unrounded and then round the sum.

If we take 9 lots of the value 0.11:

Add, then round: 0.11 + 0.11 + 0.11 + 0.11 + 0.11 + 0.11 + 0.11 + 0.11 + 0.11 = 0.99 -> round to 1.0

Round, then add: 0.11 -> round to 0.1 -> 0.1 + … + 0.1 = 0.9

There is a 10% difference between the two answers (0.1 out of 1.0), and I would argue that this is because the second answer is the incorrect one: rounding each 0.11 down to 0.1 throws away 0.01 nine times over.
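The same comparison can be reproduced in a few lines of Python. This is a minimal sketch: it uses the standard library's `decimal` module so that binary floating point doesn't muddy the arithmetic, and the `round_1dp` helper and its half-up rounding mode are choices made for this example.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_1dp(x: Decimal) -> Decimal:
    """Round to 1 decimal place, with halves rounded up."""
    return x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


values = [Decimal("0.11")] * 9

# Round late: add the unrounded values, then round the sum once.
round_late = round_1dp(sum(values, Decimal("0")))

# Round early: round each value first, then add the rounded values.
round_early = sum((round_1dp(v) for v in values), Decimal("0"))

print(round_late)   # 1.0
print(round_early)  # 0.9
```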

Thinking about your data processing as a cool chain, whether that’s data engineering or statistics, could help you design it better and assess the quality of the data it produces in a cost-effective way.
