In this article I’ll go into two related operations or kinds of queries you can do to data, that are both to do with grouping things – aggregation and window functions. I’ll describe how they both work, how they’re similar but different, and give examples of when you might use them including how you might use them together.
The operations crop up in several things that work with data – things like SQL, but also data engineering / ETL tools such as Azure Data Factory.
Aggregation is the more obvious of the two operations. You have a pile of data, and you want to summarise it as a series of groups. The individual rows of input data aren’t present* in the output – all you get are the groups. For each group you can calculate zero or more values over the group as a whole, such as the sum or minimum value of a property across the group members.
It will probably be easier to understand via some examples, which I’ll present via SQL as I think it will be clearest. I’ll be using TSQL, which is the Microsoft SQL Server dialect of SQL. There might be small differences between this and the version of SQL you’re using, but I hope things are still clear enough.
If you had a table with phone calls that people had made, and one of the columns (DayOfWeek) recorded which day of the week the call was made, you could count how many were made per day of the week like this:
select DayOfWeek, count(*) from PhoneCalls group by DayOfWeek
The thing that turns it into an aggregation is the group by. The output would be something like:
There are two important things to note from the output:
- You aren’t guaranteed that the groups will be output in any order (you will need to add an explicit order by to force the output into the order you want)
- The results summarise what was there, and won’t e.g. fill in the other days of the week with 0. If you want to fill gaps with 0 you would need to do something like do an outer join of the results with a table that lists the days of the week.
There are two ways to do filtering in aggregations – filtering the input to the aggregation and filtering its output. They use different key words, which go in different places in the query.
Imagine you want to group phone calls by day of the week as before, but you’re interested in only international calls. Fortunately, this information is available in the CallType column of the input table, so you can do something like this:
select DayOfWeek, count(*) from PhoneCalls where CallType = ‘international’ group by DayOfWeek
This could produce results such as
There are 18 phone calls made on Monday but none of them is international, so the group by no longer produces a Monday group. Of the 23 calls made on Tuesday, only 5 are international and the only call made on Friday is international. The groups have shrunk (or even disappeared) because the input set to the aggregation has been reduced by filtering (the where clause). Note that this is just the normal where clause that you might use, regardless of whether you also use an aggregation. The point of this is to show that if you have both where and group by in one query, the where happens first and acts to filter the inputs to the aggregation.
If instead you wanted to see the days on which more than 10 calls were made of any type, you would filter the output of the aggregation rather than its input. To do that you could do something like:
select DayOfWeek, count(*) from PhoneCalls group by DayOfWeek having count(*) > 10
This would produce these results:
The groups are the same size as at the start, but whole groups are missing if they don’t match the filter (the having clause)
You can combine the two kinds of filtering, such as:
select DayOfWeek, count(*) from PhoneCalls where CallType = ‘international’ group by DayOfWeek having count(*) > 2
which would produce:
The middle set of groups (Tuesday -> 5, Friday -> 1) are generated via filtering the input, and then any groups that don’t pass the output filter (Friday -> 1) are thrown away.
More than one property in the aggregation
So far, we’ve been aggregating by just one property and calculating only one value per group (other than the label for the group). We can go further in both areas if we want to.
We could produce a more fine-grained summary of calls, i.e. break things down by day of the week and then by call type within that:
select DayOfWeek, CallType, count(*) from PhoneCalls group by DayOfWeek, CallType
This produces output like this:
You can also calculate more than one property of the groups, using a bigger set of operations: count, min, max, sum, average and standard deviation. So you could have a query like this:
select DayOfWeek, count(*), min(CallDurationSecs), max(CallDurationSecs), avg(CallCost) from PhoneCalls group by DayOfWeek
which could produce results such as
It’s important to remember that the only things available to you in the projection (the list of columns and other properties between select and from) are those of the group – these are the things you’ve grouped by or calculated for the group such as max(CallDurationSecs).
Important aside on groups and information security
The last set of results brings me to an important point about aggregation (this is what the * from earlier was pointing to).
For the calls made on Monday and Tuesday, we know the range of duration and an average cost. For the one call made on Friday we know its exact duration and cost. When I said earlier that the identity of individual rows that are inputs to aggregations is lost, this is generally true. However, if you have a group with only one member, the group and its member are identical.
If you’re relying on aggregation to hide information about individuals in some way, you need to be careful that all your groups are big enough to do this. Even if you think that your groups are big enough, they could be combined to reveal groups that are too small. For more information about this see my article on the compounding value of information.
I hope that aggregations are now OK, so I’ll turn to window functions. Window functions also divide individual rows into groups, but the rows keep their separate identity and can gain extra information from being in a group. You could think of aggregation as the rows being in the service of groups, but in window functions groups are in the service of the rows.
This probably doesn’t make much sense yet, so here are some examples.
If you have the phone calls as in the previous examples, and want to group the calls by their call type but then order them within each call type by when the calls were made, and label the first call for a given call type with 1, the next call within the call type with 2 etc. (So that each call type has a call 1, a call 2 etc.)
select CallType, row_number() over (partition by CallType order by CallDateTime), CallDateTime, DestinationNumber from PhoneCalls
This would produce results like this:
local,1,2021-07-15 13:04:11,01223 335511
local,2,2022-08-12 08:12:44,01223 846777
national,1,2021-04-11 15:31:41,0207 334 112
international,1,2022-04-07 21:34:11,001 441 2123
international,2,2022-11-23 19:13:31,001 441 2123
mobile,1,2021-12-30 22:41:22,07999 334112
Note that the row number in column 2 starts again at the value 1 for each group. The query is in a slightly odd order – as soon as the row_number() appears in the projection it needs to be immediately followed by the rest of the window function apparatus. The partition by defines how rows are split up into groups and the order by defines how rows are sorted within a group.
The main difference from the aggregation examples is that there is a row in the output for each row in the input, rather than a row in the output for each group formed by the inputs. This means that all the properties of the original rows are available in the projection, rather than just things from the aggregation/window function apparatus. The DestinationNumber field isn’t used to partition or order by, but is available in the projection. Window functions add information to input rows, rather than smooshing input rows together into groups.
Not only are all the properties of individual rows still available in the projection, you can also work with the previous or next row in the group. For instance, this is useful if you want to show both the date time that this call was made and the date time that the previous call was made (so that you can later calculate the duration of the gap between each call and the previous call of the same type).
This can be via a query like this:
select CallType, lag(CallDateTime) over (partition by CallType order by CallDateTime), CallDateTime, DestinationNumber from PhoneCalls
This would produce results similar to this:
local,null,2021-07-15 13:04:11,01223 335511
local,2021-07-15 13:04:11,2022-08-12 08:12:44,01223 846777
national,null,2021-04-11 15:31:41,0207 334 112
international,null,2022-04-07 21:34:11,001 441 2123
international, 2022-04-07 21:34:11,2022-11-23 19:13:31,001 441 2123
mobile,null,2021-12-30 22:41:22,07999 334112
lag(x) is a function that returns the value of property x in the previous record in the same group. It has a sibling – lead(x) – which returns the value from the next record in the same group. If there’s no previous/next record in the group then the value null is returned. So in the example above, the first call in each group has null in its second column, and all other calls have the CallDateTime of the previous call in their group.
Combining aggregation and window functions
I’ll give two examples that show when you might want to use a combination of aggregation and window functions. First is an example of a window function applied to the output of aggregation, then aggregating the output of a window function.
In the first example imagine we have data for an online bookshop. The data includes a table showing the details of each copy of each book sold. We want to know the bestsellers per month, for the months of a given year.
First, we will compute the total sales per book per month using aggregation:
select BookTitle, MonthSold, count(*) as TotalSold from BookSales group by BookTitle, MonthSold
Assume that this has been inserted into a table called MonthlySalesTotals. We will then use a window function to group the totals for each month, and then order within the month by TotalSold descending (i.e. most sold first, least sold last).
select BookTitle, TotalSold, row_number() over (partition by MonthSold order by TotalSold desc) from MonthlySalesTotals
The example of using a window function and then aggregation is finding the average age of the oldest child, second oldest child etc. in a series of families. We have data on individual people, including their family and age. First we will use a window function to group people into their families, and ordering within their family by age ascending.
select FamilyId, Age, row_number() as PositionAmongSiblings over (partition by FamilyId order by Age) from PeopleInFamilies
Assume that this has been inserted into a table called AgeOfSibling. We will then use an aggregation to group people by their position in their family and compute the average across each group.
select PositionAmongSiblings, avg(Age) from AgeOfSibling group by PositionAmongSiblings