In this article I’ll go into two related operations or kinds of queries you can do to data, that are both to do with grouping things – aggregation and window functions. I’ll describe how they both work, how they’re similar but different, and give examples of when you might use them including how you might … Continue reading Aggregation and window functions for data
Category: ETL
Mental models for data engineering and data science
For programmers like me, it can be a bit of wrench when you get more into data work, particularly data engineering and data science. You’re used to data being around (in the background) and so think everything will be OK. This wasn’t the case for me, and so here are some mental models (glorified metaphors) that … Continue reading Mental models for data engineering and data science
Introduction to Azure Data Factory
Azure Data Factory (ADF) is a tool from Microsoft that lets you move data from one place to another, optionally changing it too. This activity is sometimes described as data engineering or ETL (Extract Transform Load) or ELT. There’s an older tool from Microsoft that also does ETL, called SQL Server Integration Services (SSIS). They … Continue reading Introduction to Azure Data Factory
Connecting Azure Data Factory code to an external database table
In this article I will talk about how to connect Azure Data Factory (ADF) to a database table. This can be surprisingly complex, so I will start with the simplest version and work towards more complex versions. I won't go into connecting ADF to other types of data store such as APIs, blob storage etc, … Continue reading Connecting Azure Data Factory code to an external database table
Staging input data to improve testability in data pipelines
In a data pipeline (an ETL or ELT pipeline, to feed a data warehouse, data science model etc.) it is often a good idea to copy input data to storage that you control as soon as possible after you receive it. This can be known as copying the data to a staging table (or other … Continue reading Staging input data to improve testability in data pipelines
Improving testability and observability of look-ups in data pipelines
Often in data pipelines (ETL or ELT pipelines for feeding a data warehouse, data science model etc.) we need to look up reference data that relates to the main flow of data through the pipeline. If this isn't done carefully, there can be problems for checking how the system is running. Before the system is … Continue reading Improving testability and observability of look-ups in data pipelines
Testing a data pipeline
There are several approaches to testing a data pipeline - e.g. one built using an ETL tool such as SSIS or Azure Data Factory. In this article I will go through three, plus refer to another (unit testing components of the pipeline). For simplicity sake I will refer to only database tables, but other forms … Continue reading Testing a data pipeline
Mental models for data processing
I was having a conversation today with someone about ETL pipelines, and I realised that the word pipeline brought along only some of the associations that would be helpful in data processing. In this article I will go through three different terms, and the associations they each bring. I think that they’re all useful in … Continue reading Mental models for data processing
Automating the deployment of SSIS packages
Introduction This post describes how I have automated the deployment of SSIS packages to different environments (test, staging, production etc.) There are definitely other ways of doing it, and some of these are possibly better - I think that this is the oldest-school way possible. I will describe our context which I hope will explain … Continue reading Automating the deployment of SSIS packages
From Excel to SSIS, Part 5: Joining streams of data
This article is part of a series: From Excel to SSIS Getting started File Save As (a kind of Hello, World! program for SSIS) Filtering rows Grouping and calculating Joining streams of data 0. Introduction This article will cover another very useful chunk of SSIS - joining streams of data together. Often the value of two bits … Continue reading From Excel to SSIS, Part 5: Joining streams of data