If you hang around programmers for long enough, you might hear people use terms like complexity or Big O notation, or say that performance is of the order … such as of the order N squared. I hope that this article makes those terms a bit less confusing.
The basic idea is seeing how performance varies with how big the task is, where performance often means time taken, but sometimes means other things like how much memory or disk space is used. It’s not so much about the time taken etc. for one size of task, but how time taken varies as the task gets bigger or smaller. If we attempted a task twice as big as the current task, will the bigger task take twice as long, four times as long or worse?
This article is brought to you by the HTML elements <sub> and <sup> and lots of escaped less than and greater than characters, which WordPress mangles each time I edit this.
Getting ready: defining size
Before we get into the details, it’s useful to do a bit of preparation. In the introduction I used terms like big and size without explaining what I meant. This article depends on there being some way to measure how big a problem is. For instance, how many items are in the array I’m trying to sort? Sometimes the size has two or even more dimensions. For instance, if you want to produce all pairs (a, b) where a comes from set A and b comes from set B, this will depend on both the size of A and the size of B. For this article I’ll assume it’s a one-dimensional size, for simplicity.
As I mentioned earlier, we’re often concerned with speed via the time taken to complete the task, but it could instead be how much memory or disk is used. Note that speed / space is often a trade-off – sometimes you can make a task take less time by using an algorithm that uses more space. For simplicity, I will assume we’re just thinking about time taken i.e. speed.
We use n to define the size of the problem in general terms, where n is an integer and n >= 0, such as how many items are in a list that needs sorting. We then define some function f(n) that is in terms of n. So, f(n) could be something like n2, n3, cos(n), n*log(n) etc.
Having defined n and f(n), we can then say
time = O(f(n)) if time <= f(n), for all n >= k
That is, f(n) is defining an upper bound to the performance. We are interested in the tightest upper bound to the performance – if time <= n2, then time <= n3 is also true, and time <= n4 etc. (remember that n >= 0). We use the function that it is the closest fit to the actual time taken, which is n2 in this case.
If time = O(f(n)) that can be read as “time is of order f(n)”, or “the time complexity is f(n)”.
I’ve deliberately not explained k yet, but will now. You are allowed an arbitrary number of exceptions, where time is above f(n), for small values of n, i.e. n < k. You can make k as big as you like.
You might have noticed that the example functions above are suspiciously simple. We are interested in only general trends rather than details, as might have been suggested by the previous paragraph. There are at least three ways where this comes into play:
- We can ignore any constant multiplier. If time <= 2n, we say time = O(n) rather than time = O(2n). This means if we make code take twice or half as much time, this has no effect on the function we use inside the brackets.
- We can ignore any constant offset. If time <= n + 1,000,000, we say time = O(n) and not O(n + 1,000,000). This means if we add or remove an operation of arbitrary duration – provided it’s an operation that is executed only once regardless of the size of the problem – then this has no effect on the function we use inside the brackets.
- We can ignore any terms that are lower-powered, i.e. grow less quickly, than the biggest. If time <= n3 + n2 + n, we say time = O(n3). Note that you can’t split up one term – there’s a meaningful difference between n*log(n) and log(n), so if time <= n*log(n) then time = O(n*log(n)) is true, and time = O(n) and time = O(log(n)) are both false.
I’ve deliberately been a bit sloppy so far when giving examples involving logarithms. I haven’t been explicit about which base the logs are to – 10, 2, something else? When it comes to complexity it doesn’t matter, because of the rule that lets you convert between logs of different bases:
logb(a) = logx(a) / logx(b)
For instance, if b = 10 and x = 2, i.e. complexity is expressed as a log to base 10 and we wanted to convert it to log to base 2, then:
log10(n) = log2(n) / log2(10)
log2(n) = log2(10) * log10(n)
log2(10) is just a constant between 3 and 4, and it’s acting as a multiplier in the equation so the first point above – we can ignore constant multipliers in complexity – means all bases of logs are equivalent in complexity.
Changing some code’s complexity might need nothing more than being careful with what’s inside loops and what’s outside them, i.e. making your loops as lean as possible. (Note that decent compilers can sometimes do this for you.) Or it could need a radical change in approach, as happens in sorting.
Watch out for best vs. worst case
You might expect an algorithm to behave in a certain way, i.e. have a certain complexity, only to be unpleasantly surprised in practice. This is usually because the expected behaviour is based on certain assumptions being true, but they might be true only in the best case. In a worse case, enough of them are false that you get a different behaviour which is worse.
A simple example of best vs. worst case is looking for a number in a set of numbers. We’ll start by representing this set as a list. Sometimes we’ll be lucky, and the number we want is the first item in the list. Other times we’ll be unlucky, and it will be the last item. Assuming all items are equally likely, then the average number of items we test is n/2, and so the time is O(n). (Note I’m ignoring whether the list is sorted or not, as that’s just an optimisation that lets us give up early if the item isn’t in the list all and doesn’t change the overall trend.)
If we wanted to do a bit better, then we could store the set as a binary tree. Note that this is an example of a space / speed trade-off. In a list, each item need only point to one other item (the next one), and so the total space needed is the space to hold the value itself plus the space to point to one other item. In a binary tree, each item holds its value plus points to two other items – we are improving speed at the expense of space.
The time taken to search a binary tree is O(log(n)). At each step, we discard half the remaining items by going down the left or right branch but not both. We halve a pile of things repeatedly by going down only one of two possible paths each time, until we end up with only one thing, and the number of times we need to do this halving is log(n). (Compare this to the list, where at each step we discard only one item – the one we’ve just looked at.)
The previous statement contains a lie, or at least some wishful thinking. I said, “At each step, we discard half the remaining items”. This is true if the tree looks nice, as it might do in a computer science textbook. The thing that makes this nice is that the lengths of all the paths from the root node down to all leaf nodes are similar (ideally, the same). The technical term for this is that the tree is balanced.
If you build up the tree by reading numbers from a file and just adding each number to the tree as you read it, then the shape of the resulting tree will depend on the order in which the numbers are read.
If the numbers arrive in this order: 4, 2, 6, 1, 3, 5, 7
then the tree looks like this
Searching this will be O(log(n)) as I said.
However, imagine the data arrives in the order 1, 2, 3, 4 …. That would produce a tree that is effectively a list:
It’s still stored in tree data structures, but it behaves like a list. Searching it will be O(n).
Just because the data structure and algorithm might perform well, this doesn’t mean it always will. Bear in mind how likely the best / worst cases are, and how much performance varies between them. It might be that even the worst case is fine, so it’s not worth the effort of guarding against it, but this needs to be your decision rather than just happening by default.
Watch out for hidden costs
As I alluded to above, getting data into or out of variables isn’t always O(1) or even O(log(n)), even if they use the standard data types that your language provides. If you do a look-up in the middle of a loop, then there could be an important difference in complexity if the look-up table is implemented as a list, a dictionary or a hash.
Another hidden cost could come from your own code. If you look at just the time taken by method A, if it has a single level of loop you might think that it’s O(n). However, if method A calls method B each time around the loop, and B is anything worse than O(1), then overall A will be something more like O(n*log(n)) or O(n2) when you look at the total work done between the start and end of A’s execution.
More things to watch out for
You could spend a lot of time making a method faster, but when you run your program there’s little or no improvement. It could be that the code you’ve improved is called only once, or only for a small value of n, or there’s code elsewhere that’s an even bigger problem. All these mean that you’ve improved the performance of a very small slice of the overall problem, rather than tackling the part that’s contributing most to the overall duration. To avoid aiming at the wrong target like this, a tool such as a profiler will give you a better idea of which bit of the code is contributing most to the overall duration.
Another problem you might do is robbing Peter to pay Paul. By that I mean: you fix bit of code A, but end up making other bit of code B worse, such that the combination of changes is poor. For instance, you might be moving work from one bit of code to somewhere else where it’s more expensive, or you have changed a small speed problem into a big space problem.
The topic of this article might sound like it’s something abstract and not very relevant to you. Certainly, it usually doesn’t matter how fast some code is if it’s incorrect, so getting correctness sorted is usually my priority.
If you use a database, then this sort of thing definitely affects you. If you haven’t already, I suggest that you look at the query execution plan that is used to run one of your code’s queries on the database. (The query execution plan is a concrete list of operations that the database will execute in order to obey the SQL command you issued to it. There is a part of the database called the query optimiser that turns SQL into a query execution plan.)
The worst case for getting one row of data out of a database table is a full table scan. This is the database equivalent of the list example above – each row is checked in turn until the required row is found or the database gives up. As explained above, this is O(n).
If the query optimiser realises that there’s an index that could be used, then instead of a full table scan it will use an index seek. Indexes are usually implemented as trees, and the database takes pains on your behalf to make sure that the trees are balanced. This means that a look-up that uses an index will be O(log(n)).
In real-life systems, n can get very big, i.e. it could be hundreds of thousands or even millions. So, turning O(n) into O(log(n)) can make a noticeable difference to your users. The interaction of queries, indexes and other things like database statistics (e.g. how much data is in each table) to produce query execution plans is complex, and there is much to read about this kind of thing.
Complexity can be a useful way of seeing how a code’s performance varies as the size of the task it’s doing varies. It is a close enough specification of an upper bound to the code’s performance to give you the big picture without the confusing details. Performance problems can hide, or can sneak up on you as the best case degrades poorly to the worst case. Complexity can also help explain why certain database performance problems happen.