From Excel to SSIS, Part 3: Filtering rows

This article is part of a series: From Excel to SSIS

  1. Getting started
  2. File Save As (a kind of Hello, World! program for SSIS)
  3. Filtering rows
  4. Gouping and calculating
  5. Joining streams of data

0. Introduction

This is the first article in the series where we will do something with the data as it flows through SSIS, rather than simply copying it verbatim to the output.  This is where we will begin to see the benefits of using SSIS over e.g. Excel.  It will still be a small and simple example, but it will become easier to see how you could build on this example to produce something useful and interesting.

1. Recap of filtering in Excel

If you turn on auto-filtering in Excel, a little drop-down control appears at the top of each column.  Clicking on this lets you pick from a list of all the distinct values in that column, and only rows whose column matches the selected value[s] will be displayed.

In the spreadsheet below, rows are filtered by column A such that only rows with the value Monday are shown.  This is why rows 2-7 and 9-21 are missing.

A spreadsheet that has been filtered by its first column

2. What we will do in SSIS

We will do something similar but not identical.  We will take the file from the Excel example above, where one of its columns contains a day of the week.  We will turn that into seven output files, where each output file will hold the rows from the input file for a particular day of the week.

To help you learn good SSIS habits, we will have an eighth output file for rows that don’t match any day of the week.  We don’t expect any rows to be like this, but data isn’t always clean and tidy before we process it.  It is usually much better to notice that there are bad rows, than to silently have fewer good rows than there should be.

3. The SSIS tools we will need

The new bit of SSIS for this article is the conditional split transformation, which will live in a data flow.  It will look at each row in turn, and try to match a column in that row against filters that we define.  As soon as a filter matches, that row will be sent onwards through an output linked to the matching filter.

The bits of SSIS that we will re-use from the previous article are: creating a new project, running a task, connection managers for reading and writing files, and data flow elements to read and write files.

4. Re-using things from the previous article

As you did in the previous article, create a new SSIS project and add a new data flow task to the control flow.  Double click on this so that you can edit.

Unlike in the previous article, the input data matters a bit here so that you can see all the things I want to show you with minimum pain.  Copy the text below into a new .CSV file:

Day,Date,Number of visitors,Total value of tickets sold
Tuesday,01/05/2018,47,705
Wednesday,02/05/2018,82,1422
Thursday,03/05/2018,68,998
Friday,04/05/2018,102,1530
Saturday,05/05/2018,140,2100
Sunday,06/05/2018,138,2070
Monday,07/05/2018,25,375
Orange,08/05/2018,43,667
Wednesday,09/05/2018,76,1140
Thursday,10/05/2018,71,1066
Friday,11/05/2018,98,1438
Saturday,12/05/2018,155,2325
Sunday,13/05/2018,132,1980
Gorgonzola,14/05/2018,29,425
Tuesday,15/05/2018,39,585
Wednesday,16/05/2018,84,1211
Thursday,17/05/2018,65,975
Friday,18/05/2018,111,1588
Saturday,19/05/2018,188,2814
Sunday,20/05/2018,135,2029
Monday,21/05/2018,22,333

It might look like there are some errors; this is deliberate and so they should be copied along with everything else.  Create a connection manager and element in the data flow to read this file, just as in the previous article.

5. The new SSIS stuff

Hopefully your data flow looks something like this:

A data flow that reads the input file via a connection manager

Drag a conditional split element from the toolbox onto the data flow, and link it to the blue arrow coming out of the element that reads the input file.  Double click on the conditional split element so that you can configure it.

You will use this screen to define a list of rules that will be used in order.  We will happen to have all the rules check the same column of each row, but the rules can have whatever condition you like.

Open the Columns part of the top left pane and drag Day down into the main pane below.  This will create the first rule.  Change its Output Name to Monday, and edit the Condition so that it reads [Day] == “Monday”.  It should look like this:

conditional split with one rule defined

We could build a condition using more sophisticated operations like matching on only the first 5 characters of the column, or the logarithm of the column (which are some of the options available via the top-right pane), but we are simply checking that all of the column is the same as a string.

Note that the output name doesn’t have to have anything to do with the condition – we could have called the output name Fred Astaire and it would still have worked.  However, we’re using Monday because it’s a helpful name for us and anyone else who looks at the code later.

Repeat this process to create rules for the other days of the week, and rename the default output name to something more useful like Unknown day.

The configuration should now look like this:

conditional split configured with all its rules

For our particular case, it doesn’t matter what order the rules are in.  A given row will match one condition (or will drop through to the default if it doesn’t have a valid day of the week).

Other cases could have more than one rule matching a given row.  For instance, if you have a rule that checks that the first 8 characters of a column are CUSTOMER, and another rule that checks that the first 4 characters of a column are CUST.  “CUST123” will match only one of the rules, but “CUSTOMER123” will match both.

If more than one rule could match, you need to think which rule should take priority and move it above any other rules that might match.  As soon as one rule is matched, SSIS will not look at any other rules for the given row.

Notice that I’m being a bit careless with the match – I’m not worrying about case, or any leading or trailing white space.  In a real example these are likely to cause problems, so it’s worth thinking about them.

We now have a conditional split, but it’s not really doing anything useful yet.  Drag a Flat File Destination element onto the canvas and rename it to Monday.  Click on the Conditional Split and drag the blue arrow out to the Monday destination.  A window will pop up so that you can pick which output of the conditional split we want to go to Monday.  Choose the Monday output.

Setting the Monday output of the conditional split to go to the Monday destination

To finish off the configuration of the Monday output, as in the previous article you will need to double click on it and create a new connection manager for an output CSV file, then click on the Mappings tab to prompt SSIS to fill them in with the default values.

Repeat this for the other days of the week, so you will have 7 outputs coming from the conditional split.  As you move through the days of the week, you will notice that the list of available outputs shrinks – each output can be used at most once per conditional split.  (You can ignore an output if you want, but you can’t use it two or more times.)

In theory we could stop here, but this is where the good habit comes in.  Make sure that you create an eighth output to catch rows that match none of the rules – in our case, rows that don’t have a valid day of the week in them.

The data flow should now look a bit more impressive, something like this

All outputs configured for the conditional split

6. Checking the output

When your SSIS program runs it should look something like this

SSIS showing the results of running the task

Note that the total number of rows going into conditional split equals the total number of rows going out.  If we hadn’t put the default path in, then we would have dropped two rows.  This is because there are two rows with random text instead of a valid day of the week.  We probably can’t do anything with them as they are, but having an output that collects just the errors makes them more obvious and easier to do deal with.

7. Conclusion

I hope that this has helped you witness the firepower of a fully armed and operational SSIS task.  This is the first article in which we’ve done a proper bit of data engineering with SSIS.  It’s only simple, but it should give you a better idea of what possibilities there are and how you might implement them.  All the remaining articles will be like this – we will be doing proper stuff with the input data, along the lines of Excel operations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s