Testing a data pipeline

There are several approaches to testing a data pipeline – e.g. one built using an ETL tool such as SSIS or Azure Data Factory. In this article I will go through three of them, and also refer to a fourth (unit testing components of the pipeline).

For simplicity’s sake I will refer only to database tables, but other forms of data store such as APIs or files could be used instead. I will also use ‘transformation’ to refer to any lump of code under test. It might go by different names (pipeline, dataflow, flowlet etc.) and might be at different levels of abstraction. It might also be all of a business-level job or just some of it. An example of a transformation, which I will use to explain the types of test, is:

Diagram showing a data pipeline going from an input database table to an output database table via 3 intermediate database tables.

The approaches to testing I will cover are:

  1. Checking the contents of a database table in isolation
  2. Checking that the output tables of a transformation are consistent with its input tables
  3. Checking that a transformation produces expected output for a given known input

The second and third items might look like the same thing, but actually they’re different. The second one tests that certain properties of the data in the inputs are still true of the outputs. It doesn’t necessarily control what the inputs are, but checks that what’s true of the inputs (whatever they may be) is still true of the outputs. I will explain more below. The third form of testing controls the inputs to the transformation, and checks that the outputs are exactly as expected.

The different approaches to testing are complementary – they can test for different things, and can be used in different circumstances. If your data pipeline runs code, e.g. a tSQL stored procedure, then this can be tested in isolation i.e. not as part of the pipeline. I have written about unit testing tSQL stored procedures using tSQLt, so won’t cover it here.

Checking the contents of a database table

Assertions let you test like this:

Diagram showing the same tables and transformation stages as in the previous diagram, but with separate assertions checking each table after the input table

Each table is checked independently, possibly using different assertions. You don’t have to use assertions on all your tables (as shown above) if you don’t want to. I’ve deliberately left out the input table, because I’m assuming that its data is not under our control.

The underlying database can enforce certain things on the data, via the tables’ schema. For instance, you can’t insert a string into an int column, or a null into a non-nullable column. However, there are often a lot of other things that you might want to enforce that the database doesn’t help with. This could be things like:

  • What looks like an int field is actually a field that should be restricted to the values 1-17.
  • What looks like a pair of floating point numbers should be restricted to a longitude and latitude pair that refers to somewhere in the UK.

A tool that helps with this kind of thing is Great Expectations. It lets you do things like:

  • expect_column_values_to_match_regex
  • expect_column_sum_to_be_between
  • expect_column_values_to_be_json_parseable
  • etc.

These expectations can be called from a Python script, a Jupyter notebook etc. This approach can be used in production, as it just reads data and checks that, however it got there, it conforms to expectations. It’s good at spotting the presence of bad things, but not so good at spotting the absence of expected good things. If the transformation silently dropped some or all inputs on the floor, it might not produce any data in the output table(s) that falls foul of the assertions. It might not produce any good output either, but the assertions might not notice this.
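To make this concrete, here is a minimal sketch using Great Expectations’ older Pandas-backed API (newer releases organise the same checks around validators and checkpoints, so the exact calls will differ). The column names and the rough UK bounding box are purely illustrative, and in practice the DataFrame would be loaded from the table under test rather than built inline:

    import great_expectations as ge
    import pandas as pd

    # A stand-in for data read from the table under test; in practice this
    # would come from a query against the real table.
    df = pd.DataFrame({
        "priority": [1, 5, 17],
        "latitude": [51.5, 55.9, 52.4],
        "longitude": [-0.1, -3.2, -1.9],
    })

    # Wrap the DataFrame so the expect_* assertion methods become available.
    gdf = ge.from_pandas(df)

    # An int column that should only ever hold the values 1-17.
    gdf.expect_column_values_to_be_between("priority", min_value=1, max_value=17)

    # A latitude/longitude pair that should fall roughly within the UK.
    gdf.expect_column_values_to_be_between("latitude", min_value=49.8, max_value=60.9)
    gdf.expect_column_values_to_be_between("longitude", min_value=-8.7, max_value=1.8)

    # Run all the expectations registered above and report overall success.
    print(gdf.validate()["success"])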

Checking the consistency of the tables before and after a transformation

I go into detail about this in another article on testing data pipelines. This approach relies on there being properties of the data that you want to stay constant, even as the data is being transformed. If the data represents people buying something, you wouldn’t want the total number or value of orders to change, even as the data is grouped, split etc. If the data represents people watching video online, you wouldn’t want the total number of things watched or total duration watched to change.

To check this, you query each table for the properties you’re interested in. The details of each query will probably change as the data gets transformed, but you want the same information from each. When you have the answers from these queries, you simply check that they’re the same.
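As a rough sketch of what these reconciliation queries might look like (using sqlite3 as a stand-in for the real database, and with made-up table and column names):

    import sqlite3

    # A hypothetical connection; in a real pipeline this would point at the
    # database holding the pipeline's tables.
    conn = sqlite3.connect("pipeline.db")

    def order_totals(table: str) -> tuple:
        """Return (row count, total order value) for the given table.

        In a real pipeline the column names usually change as the data is
        transformed, so each table would get its own version of this query.
        """
        cur = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM(order_value), 0) FROM {table}"
        )
        return cur.fetchone()

    # Compare a pair of adjacent tables in the pipeline.
    before = order_totals("orders_cleaned")
    after = order_totals("orders_enriched")

    assert before == after, f"Reconciliation failed: {before} != {after}"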

Diagram showing the same data pipeline as in the first diagram, but with tests that check each pair of adjacent tables

This can be used in production, as you are only reading data from database tables, however it got there. It can only be used for testing properties that stay the same after the transformation, which might exclude some of the properties you’re interested in.

It is sometimes called reconciliation, because you’re reconciling two tables against each other. There are two interesting interactions of this approach with the design of the transformation. Both of them are about designing for testability and observability.

If you have only one large transformation (that goes straight from the blue input table to the green output table), then you can compare only one pair of numbers. If the numbers are different, then you don’t have much of a clue as to where the problem is. You would get a better idea if there were intermediate steps as in the diagram, but this comes at the cost of more reading and writing of database tables, so there’s a trade-off to be made.

The other interaction is error handling. If your transformation checks that sales orders have a total cost, i.e. keeps only records that have a non-null total cost, this will protect downstream code from bad data. However, unless you do something about it, you will have no visibility of the rejected rows – they will silently disappear. A particularly sneaky version of this is joins – looking up data in another table that matches the row in this table. People seem to default to using an inner join, so that if no match is found then no output is produced, i.e. the input row is silently thrown away.

If you have several stages where rows can be rejected, you might start with 100 rows and end up with 87 but not know why. Also, the reconciliation checks will fail and you won’t know where the error occurred.

An answer to both of these problems is to have an explicit place to put rejects. This could be an extra table per stage that could reject things, or a global reject table that somehow identifies where the reject came from. The reconciliation checks can compare the total number and value of inputs to the total number and value of good output and rejects combined.
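Re-using the hypothetical order_totals helper from the earlier sketch, the reconciliation for a stage that has its own reject table might look something like this:

    # Nothing should be able to disappear silently: the stage's input must
    # equal its good output plus its rejects, in both row count and value.
    rows_in, value_in = order_totals("orders_cleaned")
    rows_out, value_out = order_totals("orders_enriched")
    rows_rej, value_rej = order_totals("orders_enriched_rejects")

    assert rows_in == rows_out + rows_rej, "Rows have gone missing in this stage"
    assert value_in == value_out + value_rej, "Order value has gone missing in this stage"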

Checking for expected outputs given known inputs

This is probably what many people think of when they consider testing a transformation. You put known data into the beginning of the transformation, and check that the expected output is produced.

Diagram showing the same data pipeline as at the start, but with the input database table loaded with known input, and the output database table compared with expected output data

It can’t be used in production, as you are changing the contents of the input table (and all downstream tables). If this is run automatically then you can only check the things you have specified expected output for. If you run it manually then you would also be able to spot things you weren’t expecting, e.g. errors being produced where there shouldn’t be any.
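As a sketch of how this might be automated (here as a pytest-style test), where run_transformation is a hypothetical helper standing in for however the pipeline is actually triggered, and the table and column names are again made up:

    import sqlite3

    def test_transformation_produces_expected_output():
        conn = sqlite3.connect("pipeline_test.db")

        # Arrange: start from a clean slate and load known input rows.
        conn.execute("DELETE FROM orders_input")
        conn.executemany(
            "INSERT INTO orders_input (order_id, customer, order_value) VALUES (?, ?, ?)",
            [(1, "Alice", 10.0), (2, "Bob", 20.0)],
        )
        conn.commit()

        # Act: run the transformation under test (hypothetical helper that
        # e.g. kicks off the pipeline and waits for it to finish).
        run_transformation(conn)

        # Assert: the output table contains exactly the rows we expect.
        actual = conn.execute(
            "SELECT order_id, customer, order_value FROM orders_output ORDER BY order_id"
        ).fetchall()
        expected = [(1, "Alice", 10.0), (2, "Bob", 20.0)]
        assert actual == expected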

Tests that don’t vary individual fields

You can do a lot of useful testing without worrying about the details of the fields in the input data. This could be things like the following (a sketch of how these might be driven automatically follows the list):

  • an empty file
  • a very large file (of valid inputs)
  • a file that duplicates data already in the database
  • a file that duplicates data in an earlier file (that possibly hasn’t been processed yet, but is still in a queue)
  • a file where one line is the duplicate of another line
  • files sent very frequently
  • a file that’s completely gibberish e.g. a JPEG file rather than a CSV file
  • a file that doesn’t exist (if there’s some other way that input data is announced e.g. a message on a queue)
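One way to drive these cases automatically is to keep a folder of prepared test files and feed each one through the pipeline in a parameterised test. The file names and the run_pipeline_on_file helper below are hypothetical stand-ins for however files actually reach your pipeline:

    import pytest

    # Each case pairs a prepared test file with the outcome we expect.
    FILE_CASES = [
        ("empty.csv", "no_rows_loaded"),
        ("very_large.csv", "all_rows_loaded"),
        ("duplicate_of_existing_data.csv", "duplicates_rejected"),
        ("duplicate_lines_within_file.csv", "duplicates_rejected"),
        ("not_really_a_csv.jpg", "file_rejected"),
    ]

    @pytest.mark.parametrize("filename,expected_outcome", FILE_CASES)
    def test_pipeline_handles_awkward_files(filename, expected_outcome):
        # run_pipeline_on_file is a hypothetical helper that submits the
        # file to the pipeline and reports back how it was handled.
        outcome = run_pipeline_on_file(filename)
        assert outcome == expected_outcome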

Tests that vary individual fields

The tests that look at individual fields share a lot with tests of e.g. forms on websites, so the techniques used for testing forms can help with preparing test input data: particularly long fields, empty fields, the word ‘Tuesday’ where a number is expected, and so on.

There are further potential problems that are specific to files, such as the wrong number of fields on a line, or characters in the middle of a text field that might also act as field separators. What happens if you have a company name such as ‘John, Paul, George and Ringo’ – will the commas inside the name mess things up? Similarly slashes, full stops, dashes and spaces might cause problems if they occur where they’re not expected.
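One way to prepare this kind of awkward input is to generate it with Python’s csv module, which takes care of the quoting; the column names below are just for illustration:

    import csv
    import io

    # Rows with awkward field values: a separator inside a quoted field, an
    # empty field, a word where a number is expected, and a very long field.
    rows = [
        ["company_name", "order_count", "notes"],
        ["John, Paul, George and Ringo", "3", "commas inside a quoted field"],
        ["Acme Ltd", "", "empty numeric field"],
        ["Widgets plc", "Tuesday", "word where a number is expected"],
        ["A" * 10_000, "1", "very long field"],
    ]

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)

    # buffer.getvalue() can now be written to a file and fed into the
    # pipeline, to see whether the quoting and bad values are handled.
    print(buffer.getvalue())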

Summary

There are a variety of approaches to testing a data pipeline, each with its own strengths and weaknesses. This is very similar to testing software written in e.g. C# – such software can be subjected to a barrage of tests (functional, performance, security etc.) at the unit, integration and system levels. The different testing tools do different jobs, and it’s a good idea to know how to use all of them.
