Azure Data Factory (ADF) is a tool from Microsoft that lets you move data from one place to another, optionally changing it along the way. This activity is sometimes described as data engineering, or as ETL (Extract, Transform, Load) or ELT. There’s an older tool from Microsoft that also does ETL, called SQL Server Integration Services (SSIS). The two are quite different kinds of tool, with similar but not identical features. For one thing, SSIS is edited via on-premises tools such as Visual Studio, whereas ADF is edited via a page in the Azure web portal.
In this article I will try to introduce ADF well enough that you can tell whether it’s a tool you might find useful, and also give a framework for the details that you can find in many articles and videos online.
Why use an ETL tool?
The kinds of thing you can do with an ETL tool include:
- Calling an API to read some data, verifying or enriching it against data you already have in a database table, and then writing the results to another database table.
- Periodically (e.g. every night) pulling data from a company’s different systems, transforming it from those systems’ various formats into one common format, and using the transformed data to update a big enterprise-wide database, e.g. a data warehouse. This creates a single place to view a summary of the activity of the whole enterprise.
- Preparing data to be visualised via e.g. Power BI, or to be fed into a machine learning model to train it.
Everything you can do with an ETL tool you could also do in a general-purpose programming language such as C# or Java. The reasons why you might use a dedicated tool centre on the fact that it takes care of some of the low-level detail for you. Among other benefits, you can often construct the series of processing steps that the ETL tool should perform by dragging and dropping boxes onto a canvas, and connecting them with arrows to show how control and/or data should flow between them.
There are a few sets of important concepts in ADF, including:
- Data concepts
- Processing concepts
- Parameters and variables
There are two main data concepts – linked service and dataset. They describe how you get data into and out of ADF, and what the data looks like as it flows through ADF.
Linked service – a connection to something in the outside world that you can get data from or send data to; you often need credentials to connect to it. Examples of linked services include a SQL database, a cloud blob storage account, or an API.
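Under the hood, ADF stores each linked service as a small JSON definition. As a rough sketch (the names SalesDbLinkedService and MyKeyVault, and the secret name, are invented for illustration), a linked service for an Azure SQL database might look something like this:

```json
{
  "name": "SalesDbLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net;Database=Sales;User ID=adf_user;",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "MyKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sales-db-password"
      }
    }
  }
}
```

Pulling the password from Azure Key Vault, as here, keeps credentials out of the factory definition itself.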
Dataset – a linked service can be associated with one or more datasets. In the case of a database, the linked service corresponds to the database as a whole, and each table in the database is a different dataset. A dataset has a name and contains a list of columns. Each column has its own name and type information, such as “int, not null” or “string of at most 30 characters, nullable”.
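Continuing the sketch above, a dataset for one table in that database might be defined along these lines (the table and column names are invented for illustration):

```json
{
  "name": "CustomerTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "SalesDbLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "schema": "dbo",
      "table": "Customer"
    },
    "schema": [
      { "name": "CustomerId", "type": "int" },
      { "name": "CustomerName", "type": "nvarchar" },
      { "name": "MarketSegment", "type": "nvarchar" }
    ]
  }
}
```

Note how the dataset points back to its linked service by name, and carries the column-level type information described above.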
For more on this, please see my article on connecting ADF to an external database table.
There are four processing concepts, although you don’t need to use the last one if you don’t want to. They describe the work that ADF does, zoomed in and out to different levels of detail.
Factory – as well as being part of the name of the tool, a factory is the top-level thing that you edit when you open ADF. A factory has no executable code of its own – that all lives in the lower-level things described below.
Pipeline – a pipeline is the thing that you can start (by clicking Play in ADF, or by setting up a trigger such as a schedule, or by calling a REST API). A factory contains zero or more pipelines, like the sheets in a spreadsheet. Just as with the different sheets in a spreadsheet, the different pipelines in a factory can work together to achieve a bigger goal, or can be independent.
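A trigger such as a schedule is itself a small JSON definition. As a hedged sketch (the pipeline name LoadWarehouse is hypothetical), a trigger that starts a pipeline every night at 2am UTC might look roughly like:

```json
{
  "name": "NightlyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "LoadWarehouse",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

One trigger can start several pipelines, and one pipeline can be started by several triggers.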
A pipeline is built up of one or more steps called activities. Activities don’t go into the detail of data processing – they can do things like just copy data from place A to place B with no alteration, or call an API to change the world in some way e.g. kick something off. The stuff that flows from one activity to the next is status information, e.g. “I have succeeded” or “I encountered an error”. A pipeline can include other pipelines, and a pipeline can also include data flows. There is an impressive multi-level pipeline example from Paul Andrew where, among other things, pipelines scale up database resources before the main data processing happens so that the main set of pipelines runs more quickly, and then scales the resources down again at the end of processing.
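To make the “status information flows between activities” idea concrete, here is a rough sketch of a two-activity pipeline in ADF’s JSON form (all dataset, activity and pipeline names are invented): a Copy activity moves data from A to B with no alteration, and a second activity runs only if the copy reports Succeeded.

```json
{
  "name": "CopySalesData",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToSql",
        "type": "Copy",
        "inputs": [ { "referenceName": "SalesCsvBlob", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SalesTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      },
      {
        "name": "RunWarehouseUpdate",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "CopyBlobToSql", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "UpdateWarehouse", "type": "PipelineReference" }
        }
      }
    ]
  }
}
```

The Execute Pipeline activity used here is also how one pipeline includes another.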
Data flow – a data flow, which used to be called a mapping data flow, is the thing that can do some work on the data. By that I mean it must have one or more sources of data, one or more sinks for that data (where the result[s] of the data flow will go), and zero or more processing steps in between. These steps are things like filter, sort, group, join two streams of data together etc. All the things in a data flow are called transformations, even the things that read inputs (sources) and write outputs (sinks). The stuff that flows from one transformation to the next is data, organised in rows or records.
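Behind the canvas, ADF stores a data flow as JSON whose typeProperties list its sources, its sinks, and a script describing the chain of transformations. A loose sketch (all names invented, and the script simplified) of a data flow that filters and sorts customer rows:

```json
{
  "name": "FilterRetailCustomers",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "ReadCustomers", "dataset": { "referenceName": "CustomerTable", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "WriteCustomers", "dataset": { "referenceName": "RetailCustomerTable", "type": "DatasetReference" } }
      ],
      "scriptLines": [
        "source(allowSchemaDrift: false) ~> ReadCustomers",
        "ReadCustomers filter(MarketSegment == 'Retail') ~> RetailOnly",
        "RetailOnly sort(asc(CustomerId, true)) ~> SortedRows",
        "SortedRows sink(allowSchemaDrift: false) ~> WriteCustomers"
      ]
    }
  }
}
```

Each line of the script is one transformation – including the source and sink, which matches the point above that sources and sinks are themselves transformations.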
Flowlet – a flowlet is something that you can optionally use in data flows, but don’t have to. You can chop a section out of a data flow to create a new flowlet, and then use that flowlet as a single transformation in one or more data flows. (You can also create flowlets from scratch.) The big difference between flowlets and data flows is that a flowlet reads data that’s already in memory and can’t read it from the outside world. Similarly, a flowlet writes its output[s] only as far as memory, rather than all the way to an external sink such as a database table. Therefore, the simplest way to use a flowlet is to create a data flow that has:
- A source transformation to read data from e.g. an API
- A call to the flowlet to work on the data read from e.g. the API that’s now in memory
- A sink transformation to take the data put into memory as the output of the flowlet and write it to e.g. a database table.
Parameters and variables
Parameters and variables are both extra kinds of housekeeping information that can exist in ADF, on top of the main flow of data e.g. between sources and sinks.
Parameters can be defined for a factory (these are called global parameters), or for things within it such as pipelines, data flows and linked services. Each parameter has a name and a type such as integer. When a pipeline is started, if it has parameters then they must be given appropriate values. If you start a pipeline manually within ADF you need to type these values in. If one pipeline calls another, the calling pipeline must pass values to the called pipeline. Similarly, if a pipeline calls a data flow that has parameters, the pipeline must supply values for the data flow’s parameters.
Inside a pipeline or data flow that uses parameters, the parameters’ values are read-only. You read a parameter’s value by using the parameter’s name (plus some special syntax) in a text field that you would otherwise use to type constant strings or numbers into when configuring a transformation.
Parameters can be used, for example, in a filter transformation – say you want only rows whose MarketSegment field matches the parameter Market. They can also be used when configuring access to data, e.g. in a linked service.
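Sketching this in JSON (the pipeline, dataset names and query are all hypothetical): a parameter Market is declared on the pipeline, and an activity reads it via the expression syntax @pipeline().parameters.Market. (Inside a data flow, by contrast, the same parameter would be referenced as $Market.)

```json
{
  "name": "LoadMarketData",
  "properties": {
    "parameters": {
      "Market": { "type": "string", "defaultValue": "Retail" }
    },
    "activities": [
      {
        "name": "CopyMarketRows",
        "type": "Copy",
        "inputs": [ { "referenceName": "SalesTable", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "MarketTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
              "value": "@concat('SELECT * FROM dbo.Sales WHERE MarketSegment = ''', pipeline().parameters.Market, '''')",
              "type": "Expression"
            }
          },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}
```

Anyone starting this pipeline can accept the default value Retail or supply a different one.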
A variable is a pipeline-only thing, and its value isn’t passed into the pipeline from outside. Another difference from parameters is that a variable’s value can be changed inside a pipeline by a Set Variable activity. A variable is read in an activity in a similar way to a parameter’s value.
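As a sketch (activity and variable names invented; it assumes a Copy activity named CopyBlobToSql exists earlier in the same pipeline): a variable is declared on the pipeline, and a Set Variable activity updates it – here from the row count that the Copy activity reports in its output.

```json
{
  "name": "CopyAndCount",
  "properties": {
    "variables": {
      "RowCount": { "type": "String" }
    },
    "activities": [
      {
        "name": "RecordRowCount",
        "type": "SetVariable",
        "dependsOn": [
          { "activity": "CopyBlobToSql", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "variableName": "RowCount",
          "value": {
            "value": "@string(activity('CopyBlobToSql').output.rowsCopied)",
            "type": "Expression"
          }
        }
      }
    ]
  }
}
```

Later activities in the same pipeline could then read @variables('RowCount') in their own expressions.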
There’s currently no way to return data from a data flow or pipeline via anything like a parameter or variable. There are a variety of workarounds, such as writing the data to a file or database table, and then reading that outside the data flow or pipeline.
Azure Data Factory is an ETL tool that lets you do things with data. There are alternative ETL tools, and you could also achieve the same effect using general purpose programming languages. Processing happens at up to four levels – factory, pipeline, data flow and flowlet. Access to data happens via two entities – linked service and dataset. Execution can be influenced by parameters and variables, which store housekeeping information rather than being part of the main flow of data through ADF.