Mutation testing

Mutation testing is a way of judging the quality of your tests, just as tests are a way of judging the quality of your code.  Usually, the tests that mutation testing works with are automated unit tests.  In theory it could apply to manual and/or higher-level tests like integration or system tests, but I hope that you’ll soon realise that both are unlikely to be worth it.

I have used Fettle, which is a mutation testing framework for C#.  There are other C# mutation testing frameworks such as Stryker.NET, and also frameworks for other languages such as Mutmut for Python.

[Image: Rorschach from Watchmen.] For two reasons: “What tests the tests?” is a similar question to “Who watches the watchmen?” Also, one route to being a superhero is to develop a mutation.

What does mutation testing do?

Assuming you have source code for some system, and tests that exercise it, the mutation testing framework does the following repeatedly:

  1. Change some of the source code for your system, using a set of rules in the framework – this is known as creating a mutant.
  2. Re-compile the source code.
  3. Run the tests.
  4. If the tests pass, then the mutant has survived, and this is reported to the user.  If at least one test fails, then the mutant has not survived, and the framework moves on to the next mutant.

The rules do things like always changing a == b to a != b, and vice versa.
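One pass of the loop above can be sketched as follows.  This is a minimal illustration in Python rather than C#, purely to keep it short, and it is not how Fettle or any other framework is implemented.  The names SwapEquality, load, weak_tests, strong_tests and mutant_survives are all made up for this example.

```python
import ast

# The "system under test", kept as source text so that it can be mutated.
SOURCE = """
def is_match(a, b):
    return a == b
"""


class SwapEquality(ast.NodeTransformer):
    """Hypothetical mutation rule: swap == and != in every comparison."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        swapped = {ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}
        node.ops = [swapped.get(type(op), type(op))() for op in node.ops]
        return node


def load(tree):
    """The 're-compile' step: turn a syntax tree into a callable is_match."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_match"]


def weak_tests(is_match):
    # Only checks the result type, so it cannot tell == from !=.
    return isinstance(is_match(1, 1), bool)


def strong_tests(is_match):
    return is_match(1, 1) is True and is_match(1, 2) is False


def mutant_survives(tests):
    """Create one mutant, re-compile it, run the tests against it."""
    tree = SwapEquality().visit(ast.parse(SOURCE))
    ast.fix_missing_locations(tree)
    return tests(load(tree))  # True means every test passed


if __name__ == "__main__":
    print("survives weak tests:  ", mutant_survives(weak_tests))
    print("survives strong tests:", mutant_survives(strong_tests))
```

The mutant survives the weak tests and is caught by the strong ones, which is exactly the information you want about the tests.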

The point is that you assume that your tests are a close fit around required behaviour.  Mutation testing checks that assumption, slightly indirectly.  It can’t directly change the system’s behaviour, but it can change the system’s source code, which you assume will change the system’s behaviour.  (In my experience, this is a much safer assumption to make than the assumption about the quality of tests.)

Relationship with code coverage

Before you can start mutation testing, the following must all be in place:

  1. You need access to your system’s source code;
  2. The source code needs to compile;
  3. There need to be tests that exercise the source code;
  4. The code that makes up the tests needs to compile.

The sneaky one is the third one above.  There could be parts of your source code, e.g. methods or classes, that have no tests exercising them.  Mutation testing is an inefficient way of measuring the code coverage of your tests (see practical issues below).  It’s much better to ignore mutation testing until something else, such as NCover, tells you the tests have decent code coverage.

Then you might say: if I’m doing code coverage, why do I also need mutation testing?  This is a reasonable question to ask.  Given how precious time is, wasting it on shiny tech toys that don’t help can be self-indulgent or even frustrating.

Code coverage tools vary.  Some measure at the level of the line, and some measure at the level of the branch.  These measurements will differ when you have lines such as if ((a == 1) || (b == 2)).  With line coverage, a test where a == 1 will mean the whole line is marked as covered.  Branch coverage will say that the b == 2 part has not yet been exercised.

Even if your code coverage tool can measure branch-level coverage, there are details to do with the mutation rules that will make a difference here.  If there is a condition a >= 3, then this will be considered fully covered if there are two tests:

  • one where a is any number 3 to +infinity
  • one where a is any number -infinity to 2

so, you could have a test where a is 0 and another where a is 10.

However, it might be that there’s a bug in your code and it should be a > 3 (without the equals part). The two tests (where a is 0 and 10) won’t notice this – they don’t get close enough to the boundary between the different regions for a.

In some ways, the simplest thing that the mutation rules could do is to turn a >= 3 into a < 3, as that is exactly the opposite region of the number line to before the mutation.  A more subtle thing would be to turn a >= 3 into a > 3.  This would check that the tests police the boundary between the two regions properly.  Code coverage, even at branch level, wouldn’t help here.
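A small sketch of the boundary problem, again in Python as a stand-in.  The function names and the extra test values 2 and 3 are my own choices for illustration.

```python
def original(a):
    return a >= 3


def negated_mutant(a):
    return a < 3  # the simplest mutation: the opposite region


def boundary_mutant(a):
    return a > 3  # the subtler mutation: only the boundary moves


def passes(fn, cases):
    """True if fn gives the expected result for every (input, expected) pair."""
    return all(fn(a) == expected for a, expected in cases)


# Two tests that give full branch coverage of a >= 3.
far_from_boundary = [(0, False), (10, True)]

# The same tests plus two that sit either side of the boundary.
on_the_boundary = far_from_boundary + [(2, False), (3, True)]

if __name__ == "__main__":
    for name, cases in [("far", far_from_boundary), ("boundary", on_the_boundary)]:
        print(name,
              "negated survives:", passes(negated_mutant, cases),
              "boundary survives:", passes(boundary_mutant, cases))
```

With only 0 and 10 as inputs the negated mutant is caught but the boundary mutant survives; adding a test at 3 is what catches it.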

Practical issues

Performance

The simple way to do mutation testing is to work through the source code a line at a time, and for each line mutate it and then run all the tests.  If you think about how many lines of code there are in your system, and how long it takes to run all the tests, this will probably be a bad idea.

At least a simple version of linking code to tests is well worth it, so that after a mutant is generated it is checked with only the tests that stand a chance of failing.  Running tests that exercise only unmutated code is merely doing wasted work, many times over.

Coping with magic

The combination of language, compiler and run-time system might include some magic that is done behind the scenes to help you.  A simple example in C# is the null-coalescing operator ??, as in a = b ?? c.  This is the same as if (b != null) {a = b;} else {a = c;}.

One approach to mutating the ?? operator would be to expand it out to the longer version and then mutate that (by turning b != null into b == null).  This works, but has two problems:

  1. It’s possible to have as many things as you like coalesced, but they are always compared in pairs rather than as a single list. That is, you could write a = b ?? c ?? d ?? e, but that would expand out into a series of nested if/elses.  Therefore, the re-writing would have to work recursively.
  2. The whole point of mutation testing is to give the user information that they can use to make their tests better. If a mutant containing ?? (or worse, many of them) survives, how will you report that to the user, so that they know where it is and what to do about it?  If you report the re-written version of the line (with the nested if/elses), that won’t match the code that they wrote.  If you report the original version of the line (with the ??), it won’t be clear what the mutant is.

Another approach is to mutate without re-writing, i.e. to change a = b ?? c into a = c ?? b.  This doesn’t have either of the problems described above, but does have its own problem.

You might expect mutation and re-writing to be commutative, i.e. you can do them in either order but still get the same result.  With this approach to mutation, they are not commutative:

  1. a = b ?? c mutates to a = c ?? b, which re-writes to if (c != null) {a = c;} else {a = b;}
  2. a = b ?? c re-writes to if (b != null) {a = b;} else {a = c;}, which mutates to if (b == null) {a = b;} else {a = c;}

These do not give the same behaviour for all values of b and c.  For example, if b is null and c is not, the first sets a to c but the second sets a to null.
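To check which values of b and c make the two orders disagree, here is a small Python sketch.  Python has no ?? operator, so the helper coalesce is my stand-in for it, and None plays the part of null.

```python
def coalesce(x, y):
    """Python stand-in for C#'s x ?? y."""
    return x if x is not None else y


def mutate_then_rewrite(b, c):
    # a = b ?? c  mutated to  a = c ?? b
    return coalesce(c, b)


def rewrite_then_mutate(b, c):
    # if (b == null) {a = b;} else {a = c;}
    return b if b is None else c


# Every combination of null / non-null for b and c.
cases = [(None, None), (None, "c"), ("b", None), ("b", "c")]

differing = [
    (b, c)
    for b, c in cases
    if mutate_then_rewrite(b, c) != rewrite_then_mutate(b, c)
]

if __name__ == "__main__":
    print(differing)
```

The two orders agree when b and c are both null or both non-null, and disagree when exactly one of them is null.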

Combinatorial explosions

If there’s a compound boolean expression, what’s realistic with mutation testing and what would be nice are likely to be different.  Given the expression ((a == 10) && (b > 6)) || ((c < 12) && (d == 7)), what should mutation testing do?  (I would hope that such magic numbers as 6, 7, 10 and 12 would normally be replaced by constants or enumerations; I’m including them here to keep things simple.)

There are seven operators that could be mutated (the comparisons ==, >, < and == again, plus the logical operators &&, || and && again).  Each of them should be mutated.  If a given operator is only ever mutated together with one or more other operators, it won’t be clear what led to a mutant surviving.  So, each operator should at least be mutated on its own (and then the relevant tests run for each mutation).  This means there will be seven candidate mutants.

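However, should combinations of operators also be mutated?  Assuming one possible mutation per operator, this would give 120 extra candidate mutants (2^7 = 128 subsets of the operators, minus the 1 subset where nothing is mutated and the 7 single-operator mutants we’ve already done).  It could give extra information, but at a big cost.  I doubt that the extra cost is worth it most of the time.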
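The counting can be checked with a few lines of Python.  This sketch assumes each operator has exactly one possible mutation; if a rule offers several replacements for an operator, the numbers grow faster still.  Note that the subset where nothing is mutated is just the original code, so it is not a mutant.

```python
from itertools import combinations

# The seven operators, in the order they appear in the expression.
operators = ["==", "&&", ">", "||", "<", "&&", "=="]
positions = range(len(operators))

# Mutating exactly one operator at a time.
singles = list(combinations(positions, 1))

# Mutating two or more operators together.
multiples = [
    combo
    for size in range(2, len(operators) + 1)
    for combo in combinations(positions, size)
]

if __name__ == "__main__":
    print(len(singles), len(multiples))
```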

Get out of jail card

Sometimes the combination of rules and test data will align so that a mutant survives unexpectedly.  For instance, a == b - 2 gets mutated to a == b / 2, and in the test, a is 2 and b is 4.
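The coincidence is easy to reproduce.  In this Python sketch the second pair of values (4 and 6) is my own addition, to show that different test data would have caught the mutant.

```python
def original(a, b):
    return a == b - 2


def mutant(a, b):
    return a == b / 2


if __name__ == "__main__":
    # With a = 2 and b = 4 both versions agree, so the mutant survives.
    print(original(2, 4), mutant(2, 4))
    # With a = 4 and b = 6 they disagree, so the mutant would be caught.
    print(original(4, 6), mutant(4, 6))
```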

When a mutant survives by coincidence like this, or for other reasons specific to your code, tests and larger environment, mutation testing will report surviving mutants that are false positives or that you otherwise want to ignore.  It’s helpful to have some way of telling your mutation testing framework to ignore any mutants generated by a given region of code.  This could be by including comments with magic text that start and stop such a region.
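A sketch of how such regions might be honoured when reporting.  The marker text below is invented for this example and is not the syntax of Fettle, Stryker.NET or any other framework; check your framework’s documentation for what it actually supports.

```python
BEGIN = "mutation-testing: off"  # hypothetical marker text
END = "mutation-testing: on"     # hypothetical marker text

SAMPLE = """x = 1
// mutation-testing: off
ok = a == b - 2
// mutation-testing: on
y = 2
"""


def ignored_lines(source):
    """Return the 1-based numbers of lines lying inside an ignore region."""
    ignored, inside = set(), False
    for number, line in enumerate(source.splitlines(), start=1):
        if BEGIN in line:
            inside = True
        elif END in line:
            inside = False
        elif inside:
            ignored.add(number)
    return ignored


def reportable(survivors, source):
    """Drop surviving mutants whose line falls inside an ignore region."""
    skip = ignored_lines(source)
    return [mutant for mutant in survivors if mutant["line"] not in skip]


if __name__ == "__main__":
    survivors = [{"line": 3, "rule": "- to /"}, {"line": 5, "rule": "2 to 3"}]
    print(reportable(survivors, SAMPLE))
```

A framework could equally skip generating mutants for those lines in the first place, which saves the cost of running the tests against them.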

Summary

Mutation testing can be a good way of testing your tests.  It is a good tool to use in combination with code coverage.  Neither tool is perfect: neither gives you 100% signal and 0% noise.  They should be used as sources of information, to help you make better decisions about your code and tests.
