I was having a conversation today with someone about ETL pipelines, and I realised that the word pipeline brought along only some of the associations that would be helpful in data processing. In this article I will go through three different terms, and the associations they each bring. I think that they’re all useful in building a mental model of data processing.
Pipeline is the normal term, and it does bring with it some useful ideas (just not all of them).
Data can leak out of a pipeline. In my experience, the important thing isn’t that you have a puddle of data on the floor somewhere. Instead it’s that you have less data in the pipeline than you expect. This is often because of things like joining the main flow of data with another set of data: where some bits of the main flow have no match in the other set, those bits don’t appear in the output of the join. It’s as if those bits of data have leaked out of the pipe.
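As a minimal sketch of this kind of leak, using made-up orders and customers: an inner join keeps only records whose key has a match, so unmatched records silently vanish from the output.

```python
# Hypothetical data: one order's customer_id has no matching customer.
orders = [
    {"order_id": 1, "customer_id": "A", "amount": 10},
    {"order_id": 2, "customer_id": "B", "amount": 20},
    {"order_id": 3, "customer_id": "Z", "amount": 30},  # no matching customer
]
customers = {"A": "Alice", "B": "Bob"}

# An inner join: only orders with a matching customer survive.
joined = [
    {**order, "name": customers[order["customer_id"]]}
    for order in orders
    if order["customer_id"] in customers  # this filter is where the leak happens
]

print(len(orders), len(joined))  # 3 in, only 2 out: one record leaked
```

Comparing the counts before and after the join (as in the final line) is the simplest way to spot this; an outer join with a marker for unmatched records gives you the leaked records themselves.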
You are probably most interested in the rate at which data comes out of the end of the pipe into its ultimate destination. (You might instead – or also – be interested in how quickly the pipeline can take in data at its start.) There might be a complicated tangle of data plumbing upstream (or, in the case of taking in data, downstream) of where the key rate happens. The important rate can be a non-trivial function of the rate of flow in different bits of the overall plumbing.
The most extreme version of this is streaming vs. batching operations. If all the operations in the pipeline are streaming operations (for instance, converting the format of fields in each record of data), it means that the data can flow through the pipeline as quickly as possible. However, if there are any batching operations (for instance, sorting the records into an order), it’s as if the pipeline has a reservoir in it for each batching operation. The reservoir has to fill with all relevant data before its dam can be opened to let any data out to the rest of the pipeline.
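The reservoir behaviour can be sketched with two toy steps, one streaming and one batching. The streaming step (a generator) yields each record as soon as it arrives; the batching step (a sort) must hold everything back before it can emit anything.

```python
def stream_convert(records):
    # Streaming: each record flows straight through as it arrives.
    for r in records:
        yield {"value": int(r)}

def batch_sort(records):
    # Batching: sorted() must consume every record before returning
    # anything -- the reservoir has to fill before the dam opens.
    return sorted(records, key=lambda r: r["value"])

incoming = ["3", "1", "2"]
result = batch_sort(stream_convert(incoming))
print(result)  # [{'value': 1}, {'value': 2}, {'value': 3}]
```

If the pipeline were all streaming steps, the first output record could appear before the last input record had even been read; the single `sorted()` call removes that possibility for everything downstream of it.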
Splitting and joining pipes
Taking pipeline in its simplest form conjures up a single physical pipe that carries something like water from A to B. However, ETL pipelines can often be more like oil refineries or chemical plants – a complicated tangle of pipes joining and splitting. Thinking about all the inputs and outputs necessary for some processing is a useful task to do sooner rather than later.
Baggage handling system
You can also think of data processing as being like the baggage handling system in an airport, which takes luggage from check-in to the plane, and then from the plane to a conveyor belt for collection.
Separate tracked items
The main drawback of the term pipeline is that it treats data as a mass noun rather than a countable noun. A mass noun is something you can have too little of, like water or time. A countable noun is something you can have too few of, like biscuits or dogs. Normally I prefer to treat data as a mass noun, but for data processing I like to treat it as a countable noun at least occasionally. When it’s a countable noun you are more likely to err… count it (and also worry about where individual items are).
You shouldn’t suddenly gain or lose bits of luggage / data, and it shouldn’t suddenly gain or lose weight (or some equivalent property for the data, such as the value of a sales order line item). That’s not to say that everything that goes in should end up somewhere nice. Just as some luggage is pulled aside because it’s bad in some way (e.g. it contains illegal drugs or a bomb), it’s often valid to divert some data to a reject bin because it’s bad. However, you must be confident that the total of what goes in matches the total of what comes out, and you need confidence that only the ‘correct’ things are filtered out before the final destination.
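The “totals in must match totals out” idea can be written as a pair of checks. This is a sketch with invented items, each carrying an amount (e.g. a sales order line value), and an invented rule for what counts as bad:

```python
# Hypothetical input items; negative amounts stand in for 'bad' data.
items_in = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},  # bad: diverted to the reject bin
    {"id": 3, "amount": 7.5},
]

accepted = [i for i in items_in if i["amount"] >= 0]
rejected = [i for i in items_in if i["amount"] < 0]

# Conservation of count: no item appears or disappears.
assert len(accepted) + len(rejected) == len(items_in)

# Conservation of value: no item gains or loses 'weight'.
total_in = sum(i["amount"] for i in items_in)
total_out = sum(i["amount"] for i in accepted) + sum(i["amount"] for i in rejected)
assert total_in == total_out
```

The point is that rejecting items is fine, but rejects still have to be accounted for: accepted plus rejected must reconcile back to the input, in both count and value.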
In a baggage handling system, luggage comes from many places and goes to many places, and the baggage handling system controls what goes where. It’s not necessarily wrong if a suitcase is in a particular destination (such as a plane to New York), however it could be wrong for this suitcase to be there when it should be somewhere else instead (such as a plane to Jakarta).
For data processing, this builds on the reject / accept routing decision from the previous section. There could be more than one kind of reject – ones that can be fixed automatically and retried versus those that need manual help, for instance, or rejects from different stages or for different reasons. There might also be more than one kind of good destination.
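This routing can be sketched as a small classifier plus a set of destination bins. The classification rules here are invented purely for illustration:

```python
from collections import defaultdict

def route(item):
    # Hypothetical rules: which destination should this item go to?
    if item.get("malformed"):
        return "reject_manual"  # needs a human to look at it
    if item.get("upstream_down"):
        return "reject_retry"   # transient failure, can be retried automatically
    return "accepted"

destinations = defaultdict(list)
for item in [
    {"id": 1},
    {"id": 2, "malformed": True},
    {"id": 3, "upstream_down": True},
]:
    destinations[route(item)].append(item)

print({name: len(items) for name, items in destinations.items()})
# {'accepted': 1, 'reject_manual': 1, 'reject_retry': 1}
```

Separating the reject kinds matters operationally: the retry bin can feed back into the pipeline on a timer, while the manual bin needs to surface in someone’s queue.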
Unlike with luggage handling systems, data processing systems can effectively send the same bit of data to more than one destination (by copying it), which complicates things, but in many ways it is just more of the same. Do all the right things end up in the right places, and do the numbers add up?
Baggage handling in airports has security around it – scanning machines, vetting of staff, physical restrictions to certain areas of the airport etc. Data processing can also have security concerns.
In a previous job, we processed data taken from surveys of students and workers. It didn’t contain their name or address, but did include personal information such as salary, age, gender, occupation, university course, academic grades etc. A condition attached to receiving the data was that we could release information from it only when it was about a big enough group of people that it was hard to infer anything about individuals and/or it was rounded to a relatively coarse level of precision. So, we could give the average salary of a group of people only if the group were big enough. If the group were too small, e.g. the graduates from a small university course for a single year, then we couldn’t release any information about that group.
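The release rule can be sketched as a small gate function. The threshold and rounding granularity below are invented values for illustration, not the ones we actually used:

```python
MIN_GROUP_SIZE = 10  # hypothetical: smallest group we may report on
ROUND_TO = 500       # hypothetical: round released salaries to the nearest 500

def release_average_salary(salaries):
    """Return the group's average salary, coarsely rounded,
    or None if the group is too small to release anything."""
    if len(salaries) < MIN_GROUP_SIZE:
        return None  # group too small: suppress entirely
    average = sum(salaries) / len(salaries)
    return round(average / ROUND_TO) * ROUND_TO

# A group of 3 is too small: nothing is released.
print(release_average_salary([21000, 24000, 30000]))  # None

# A group of 12 is big enough: the average is released, coarsely rounded.
print(release_average_salary([21000 + 100 * i for i in range(12)]))  # 21500
```

The two measures work together: the group-size check stops inference about individuals in tiny groups, and the rounding stops precise values leaking even for groups just over the threshold.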
We had to think about physical security around where the data was stored and how it was stored (e.g. encryption and removable disks). We had to put measures in place to make sure that we couldn’t release the data until e.g. the size of groups had been checked.
The data going through the processing steps isn’t just some arbitrary meaningless code fodder. It could easily be commercially sensitive, or personal information such as medical or HR data.
Increasing value while managing costs
Why is the data processing happening at all? It needs to justify its existence because it’s costly. It consumes computing resources when it runs, it likely increases the amount of data that needs to be stored, and increases load on the databases and other stores as it reads from them and updates them. The code needs to be written and maintained, effort that could be spent on other code.
Hopefully the data processing happens because it adds value to the organisation. For instance, the data processing summarises the organisation’s activity over a day or week, so that managers can have a better understanding of the organisation’s performance, which increases the chances that any necessary corrective action happens at the right time.
But just as with a production line, the issues of value, cost, efficiency, reliability, recoverability etc. need to be thought about. Is the requirement to churn through a lot of work efficiently over a long time, where it doesn’t matter if there’s a long delay between a bit of work becoming available and it being finished? Or is the requirement for a quick response to each bit of work, even though this reduces the amount of work that can be done e.g. over a day? I.e. are we optimising for throughput or latency?
If a data processing pipeline is optimising for throughput, it might organise the work into batches. Each batch could have prep work done on it, e.g. sorting it. At each stage in the processing, the relevant work for the many items in the batch is done together. The batch might need to be split out into individual bits again at the end of the main processing. This overhead will make it take longer for one bit of work, but could be the quickest way to process e.g. 1,000 bits of work.
On the other hand, if the requirement is for latency, then batching is probably best avoided. Each bit of work will be dealt with separately, from one end of the processing to the other. There’s minimum overhead, but also no economies of scale that can come from batches.
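The trade-off can be sketched with a toy cost model. All the numbers below are invented, purely to show the shape of the trade-off: batching pays a fixed overhead once per batch, while streaming pays a smaller overhead on every single item.

```python
BATCH_OVERHEAD = 100  # hypothetical ms to assemble, sort, and split a batch
PER_ITEM_COST = 1     # hypothetical ms of real work per item
STREAM_OVERHEAD = 5   # hypothetical ms of per-item overhead when handled alone

def batched_total(n_items):
    # One fixed overhead for the whole batch, then the real work.
    return BATCH_OVERHEAD + PER_ITEM_COST * n_items

def streamed_total(n_items):
    # Each item pays its own overhead, but nothing waits for a batch.
    return (STREAM_OVERHEAD + PER_ITEM_COST) * n_items

# Latency for a single item: streaming wins easily.
print(streamed_total(1), batched_total(1))      # 6 vs 101

# Total time for 1,000 items: batching wins easily.
print(streamed_total(1000), batched_total(1000))  # 6000 vs 1100
```

The crossover point depends entirely on the relative overheads, which is why the throughput-or-latency question has to be answered before choosing the design.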
This builds on the previous section – safeguarding the value that the data processing hopefully delivers. How do you know that the data processing is working? How do you know that it’s working correctly? Is there a more efficient or effective way to check this than what you’re currently using? Is it better to do one big check at the end, or is it better to do more, smaller checks after each chunk of the processing?
If you detect that something’s wrong, what does that tell you? For instance, if there’s a production line building cars and one car comes off the production line with faulty headlights, will other cars also be affected? This might be because they shared a batch of headlights, or were all produced since a crucial tool was last checked to be working correctly. Might some of these cars have already been shipped to customers? How do you address these kinds of risk to quality – is it worth spending more to get better tools, is it worth putting effort into checking the quality of raw materials and sub-assemblies?
In your data processing, how quickly can you detect problems, and how many undetected existing problems might there be when you do so? What has already been built on top of the data with the problem? When you detect that there’s a problem, how much information do you need to fix the problem, and how much information do you get from your current detector?
Data processing is often thought of as a pipeline, and this is helpful because data and its processing have a lot in common with stuff that flows through and is processed by pipes. However, they also have things in common with other settings, and these settings can help you to see the problem in a different light. Doing this can help you have a fuller picture of what’s important, and the best way of getting that.