This article is part of a series: From Excel to SSIS
- Getting started
- File Save As (a kind of Hello, World! program for SSIS)
- Filtering rows
- Gouping and calculating
- Joining streams of data
0. Introduction
This is the first article in the series where we will do something with the data as it flows through SSIS, rather than simply copying it verbatim to the output. This is where we will begin to see the benefits of using SSIS over e.g. Excel. It will still be a small and simple example, but it will become easier to see how you could build on this example to produce something useful and interesting.
1. Recap of filtering in Excel
If you turn on auto-filtering in Excel, a little drop-down control appears at the top of each column. Clicking on this lets you pick from a list of all the distinct values in that column, and only rows whose column matches the selected value[s] will be displayed.
In the spreadsheet below, rows are filtered by column A such that only rows with the value Monday are shown. This is why rows 2-7 and 9-21 are missing.
2. What we will do in SSIS
We will do something similar but not identical. We will take the file from the Excel example above, where one of its columns contains a day of the week. We will turn that into seven output files, where each output file will hold the rows from the input file for a particular day of the week.
To help you learn good SSIS habits, we will have an eighth output file for rows that don’t match any day of the week. We don’t expect any rows to be like this, but data isn’t always clean and tidy before we process it. It is usually much better to notice that there are bad rows, than to silently have fewer good rows than there should be.
3. The SSIS tools we will need
The new bit of SSIS for this article is the conditional split transformation, which will live in a data flow. It will look at each row in turn, and try to match a column in that row against filters that we define. As soon as a filter matches, that row will be sent onwards through an output linked to the matching filter.
The bits of SSIS that we will re-use from the previous article are: creating a new project, running a task, connection managers for reading and writing files, and data flow elements to read and write files.
4. Re-using things from the previous article
As you did in the previous article, create a new SSIS project and add a new data flow task to the control flow. Double click on this so that you can edit.
Unlike in the previous article, the input data matters a bit here so that you can see all the things I want to show you with minimum pain. Copy the text below into a new .CSV file:
Day,Date,Number of visitors,Total value of tickets sold Tuesday,01/05/2018,47,705 Wednesday,02/05/2018,82,1422 Thursday,03/05/2018,68,998 Friday,04/05/2018,102,1530 Saturday,05/05/2018,140,2100 Sunday,06/05/2018,138,2070 Monday,07/05/2018,25,375 Orange,08/05/2018,43,667 Wednesday,09/05/2018,76,1140 Thursday,10/05/2018,71,1066 Friday,11/05/2018,98,1438 Saturday,12/05/2018,155,2325 Sunday,13/05/2018,132,1980 Gorgonzola,14/05/2018,29,425 Tuesday,15/05/2018,39,585 Wednesday,16/05/2018,84,1211 Thursday,17/05/2018,65,975 Friday,18/05/2018,111,1588 Saturday,19/05/2018,188,2814 Sunday,20/05/2018,135,2029 Monday,21/05/2018,22,333
It might look like there are some errors; this is deliberate and so they should be copied along with everything else. Create a connection manager and element in the data flow to read this file, just as in the previous article.
5. The new SSIS stuff
Hopefully your data flow looks something like this:
Drag a conditional split element from the toolbox onto the data flow, and link it to the blue arrow coming out of the element that reads the input file. Double click on the conditional split element so that you can configure it.
You will use this screen to define a list of rules that will be used in order. We will happen to have all the rules check the same column of each row, but the rules can have whatever condition you like.
Open the Columns part of the top left pane and drag Day down into the main pane below. This will create the first rule. Change its Output Name to Monday, and edit the Condition so that it reads [Day] == “Monday”. It should look like this:
We could build a condition using more sophisticated operations like matching on only the first 5 characters of the column, or the logarithm of the column (which are some of the options available via the top-right pane), but we are simply checking that all of the column is the same as a string.
Note that the output name doesn’t have to have anything to do with the condition – we could have called the output name Fred Astaire and it would still have worked. However, we’re using Monday because it’s a helpful name for us and anyone else who looks at the code later.
Repeat this process to create rules for the other days of the week, and rename the default output name to something more useful like Unknown day.
The configuration should now look like this:
For our particular case, it doesn’t matter what order the rules are in. A given row will match one condition (or will drop through to the default if it doesn’t have a valid day of the week).
Other cases could have more than one rule matching a given row. For instance, if you have a rule that checks that the first 8 characters of a column are CUSTOMER, and another rule that checks that the first 4 characters of a column are CUST. “CUST123” will match only one of the rules, but “CUSTOMER123” will match both.
If more than one rule could match, you need to think which rule should take priority and move it above any other rules that might match. As soon as one rule is matched, SSIS will not look at any other rules for the given row.
Notice that I’m being a bit careless with the match – I’m not worrying about case, or any leading or trailing white space. In a real example these are likely to cause problems, so it’s worth thinking about them.
We now have a conditional split, but it’s not really doing anything useful yet. Drag a Flat File Destination element onto the canvas and rename it to Monday. Click on the Conditional Split and drag the blue arrow out to the Monday destination. A window will pop up so that you can pick which output of the conditional split we want to go to Monday. Choose the Monday output.
To finish off the configuration of the Monday output, as in the previous article you will need to double click on it and create a new connection manager for an output CSV file, then click on the Mappings tab to prompt SSIS to fill them in with the default values.
Repeat this for the other days of the week, so you will have 7 outputs coming from the conditional split. As you move through the days of the week, you will notice that the list of available outputs shrinks – each output can be used at most once per conditional split. (You can ignore an output if you want, but you can’t use it two or more times.)
In theory we could stop here, but this is where the good habit comes in. Make sure that you create an eighth output to catch rows that match none of the rules – in our case, rows that don’t have a valid day of the week in them.
The data flow should now look a bit more impressive, something like this
6. Checking the output
When your SSIS program runs it should look something like this
Note that the total number of rows going into conditional split equals the total number of rows going out. If we hadn’t put the default path in, then we would have dropped two rows. This is because there are two rows with random text instead of a valid day of the week. We probably can’t do anything with them as they are, but having an output that collects just the errors makes them more obvious and easier to do deal with.
7. Conclusion
I hope that this has helped you witness the firepower of a fully armed and operational SSIS task. This is the first article in which we’ve done a proper bit of data engineering with SSIS. It’s only simple, but it should give you a better idea of what possibilities there are and how you might implement them. All the remaining articles will be like this – we will be doing proper stuff with the input data, along the lines of Excel operations.