When a failing test might be OK

Usually, a failing test is a problem.  In this article I will cover three cases where this might not always be true: performance tests, testing a data science model such as a classifier, and testing in quantum computing.  In all these cases, a definitive answer about passing or failing is given by a set of tests, rather than by a single test.

Dividing jobs between tests

In automated tests, each test does a different job – why bother having more than one test to deliver some information?  It’s common for there to be more test code than production code and maintaining test code is a cost that can’t be ignored.  Redundancy in the test code means that this maintenance cost is bigger than it needs to be.

This means that any test failing is a sign that at least one part of the system isn’t behaving as expected.

However, there are times where there is important variability – either intrinsic to the problem being solved, or unavoidable variability in the way the problem is solved.  When it’s too hard to predict this variability accurately or how it will affect the test outcome, one approach is to create a set of related tests, and this set is in many ways treated as a single test.  The idea is that, while it’s too hard to predict the behaviour of the system via a single (component) test, it is still possible to predict its behaviour in general, i.e. across the set of tests.

In this case there are usually two levels of specification:

  1. What makes a component test pass or fail?
  2. What makes the set of tests pass or fail?

I’ll go into some examples below, where I’ll describe what gives rise to the variability and how the success criteria are defined for the tests/set of tests.

Note that flaky tests are a similar but different problem.  By flaky tests I mean tests that sometimes pass and sometimes fail, and so give unreliable results.  This is often due to variability in the order in which different parts of the production or test code are executed, and this variability trips the tests up.  Flaky tests are something that should be fixable so that individual tests reliably pass or fail, but this might require changes to the production code as well as to the test code.  The rest of this article concerns times when variability can’t be dealt with such that individual tests are reliable.

Performance tests

In this context, I’m using performance as a synonym for latency – how long will the system take to respond to a request?  One way to specify performance requirements is in terms of percentages.  For instance:

  • 95% of requests must be processed in at most 0.1 seconds,
  • 100% of requests must be processed in at most 0.5 seconds.

The performance requirements might be motivated by several things. One might be to ensure the user gets a good user experience via a GUI.  How quickly the GUI and systems such as API, database etc. behind respond to a user request will influence the user’s perception of and enjoyment of the system.  Alternatively, there might not be any GUI or user involved directly, but the production system is an API that’s called by other code.  The performance requirements might be to ensure that many separate bits of code can collaborate to create a bigger system, such as a phone network.

To test the system against its performance requirements, a suitable set of requests is created and sent to the system under test.  The meaning of suitable depends on context, but it is likely to follow the pattern of requests that has been observed in production already.  For instance, for an online banking system one part of the pattern could be:

  1. Log into the system
  2. Look at the current balance on an account
  3. Transfer money from that account to another account e.g. to pay a bill

In the light of the requirements above, if a request takes 0.7 seconds during a test – is that OK?  Definitely not.  If a request takes 0.09 seconds, that’s fine.  If a request takes 0.3 seconds, then things might be OK.  If too many other requests are also in the 0.1 – 0.5 second range, then the set of tests as a whole fails.

Why can’t performance be specified more tightly?  There are two sources of variation – the mixture of different kinds of request, and real-world limitations of the implementation (which I will explain shortly).  The different kinds of request (e.g. the banking related ones above) all need to be completed before the user thinks things are too slow, even if they need different amounts of work to complete.

The other source of variation is the limitations of the implementation.  There are many helpful lies that one part of the system tells to other parts.  The lies are essentially saying that things are simpler or better than they actually are.  Enough of the time the pretence holds, but occasionally the lie shows through and this affects performance.

The kinds of thing that I mean by helpful lie are:

  • The database / CPU / etc. is faster than it actually is – a lie involving some kind of cache;
  • The CPU, memory, network, and other important resources that some process needs are used only by that process and nothing else – a lie involving virtualisation and other ways to share things.

In a cache – whether this is an instruction cache on a CPU or a cache of queries in a database – there is an operation that’s relatively slow and expensive.  Operations seen recently have their results stored in a way that makes it quicker to look up the results than to do the operation from scratch.  However, the cache isn’t infinitely big, so old or infrequently used entries in the cache are evicted to make way for others once the cache gets full.  That means that the next time an evicted operation is attempted, it will have its full cost and not the reduced cost from using the cache.

Very early computers exposed programmers to all the details of their hardware, so having two or more programs running on a computer at once became tricky.  How will they share the CPU, memory, disk space etc?  More modern computers take that burden off the programmer and handle it in things like the operating system.

The operating system creates illusions such as virtual memory – a contiguous chunk of memory that is solely for one program, even though behind the scenes this is made of several separate chunks of physical memory, and many different programs are using the physical memory at once.  Similarly, each program thinks it’s running on a CPU dedicated to running just that program.  In reality, each program gets a series of slices of CPU time, to allow the CPU to be shared across many programs.

Enough of the time, the difference between appearance and reality is fine.  However, it can sometimes cause delays in the execution of code.  For example, there is a program that is blocked waiting for a long database query, and so the operating system has decided to divert some of its physical memory to be used by another program that isn’t currently blocked on anything.  Before this happens, the contents of the memory are written to disk.  A little while later, when the database query finishes, the program is ready to run again and so needs that bit of memory back.  There will therefore be a delay while the data is read from disk back to memory.

These costs will happen at hard to predict times, as they are based on the interactions of many moving parts at many levels of abstraction.  Therefore, it’s easier to describe the system’s latency in general terms such as percentages and ranges.

Data science

In a previous article I described fuzzy matching.  This can be used as part of classification – seeing what class or category a given thing fits into.  For instance, a system could be trained to say if a photo contains: a cat, a dog, a volcano, or something else.

In this situation, the variability is intrinsic to the problem.  Not all cats look the same, and a given cat will look different from different angles, in different lights, or in different poses.  It is usually unrealistic to expect a classifier to get 100% accurate results all the time.  Sometimes it will come up with the wrong answer (poor accuracy) or won’t be able to give any answer (poor coverage).

One way to represent the behaviour of a classifier is with a confusion matrix, such as the one below:

  CatDogVolcanoOtherDon’t know
ExpectedCat24 2 2
Dog 38  1
Volcano  15  
Other6  111

The numbers show a percentage of all images (tests) in the set of tests.  For instance, 24% of images are cats that are correctly classified as a cat.  2% of images are cats that are (incorrectly) classified as a volcano etc. Cells containing zero are left blank for clarity.  The diagonal line of italic numbers is where the system is behaving as intended – the actual result matches the expected result.  Everything else is some kind of error.

The right-hand column shows how much of a coverage problem the classifier has, i.e. how many times it has failed to come up with any answer.  The cells that are neither italic nor in the right-hand column show how much of an accuracy problem the classifier has, i.e. how many times it has come up with the wrong answer.

If there are only two categories (and no don’t know column) then a confusion matrix can be thought of as another way of representing the information in a table showing false positives, false negatives, true positives and true negatives.

How to define criteria for when the test set passes or fails will depend on the context.  Given that it’s likely to be impossible to have both 100% coverage and 100% accuracy, is it better to have high coverage or high accuracy? Within accuracy, are some categories, or some kinds of mis-categorisation, more important than others?  For instance, given that cats are more similar in how they fit into human society than they are to volcanoes, if a cat is mis-categorised as a dog is that a bigger or smaller problem than if it’s mis-categorised as a volcano?


If you’re unfamiliar with quantum computing, please refer to my article introducing it.  A summary is as follows.  There are special variables called qubits.  A qubit exists in two states: before measurement and after measurement.  Before measurement is equivalent to the box that contains Schrödinger’s cat before the box is opened.  Instead of being a box containing a cat that is both dead and alive, a qubit before measurement contains both 0 and 1.  After measurement, a qubit contains only 0 or 1, like a bit in conventional programming.  This is equivalent to Schrödinger’s cat’s box once it has been opened – the cat is definitely dead or alive / the qubit is definitely 0 or 1.

The important part is that before measurement, a qubit has a probability that it will deliver the value 0 when it’s measured (and 100% minus that probability that it will deliver 1 when it’s measured).  The goal of quantum code is to massage the probability on its qubits such that probability is moved towards the correct answer[s] and away from the incorrect answer[s].  (These operations that move probability aren’t the same as measurement, so a qubit is still in a superposition of states, it’s just that one state becomes more likely than the other one.)

There’s a class of problems, called BQP, where the best approach has a less than 100% probability of delivering the correct answer.  (The probability is at least 2/3.)  This means that bug-free code running on valid inputs will deliver the wrong answer some of the time (and the correct answer the rest of the time, on the same inputs).

If you run some code and get the wrong answer, is that evidence that your code has a bug?  It depends on how often it delivers the wrong answer, and how this compares to the expected frequency of a wrong answer.  At this point it’s possible that you think that this is a physical (quantum) system that happens to include code, so you start reaching for tools usually used for analysing data in science, such as hypothesis testing and p-values. For instance, you create a hypothesis that the code delivers correct results 87% of the time, and you want to be at least 95% sure of this (you use p=0.05).  This then directs how you will test your code.

The source of variability in this case is quantum physics!  It is probabilistic, and so code that uses it (quantum code) will also be probabilistic.


Much of the time, a single failing test can reliably tell you useful information about some code.  By this I mean that the code isn’t behaving as expected.  However, there are cases where there is too much variation in either the problem or the implementation of its solution for this to be possible.  These range from conventional cases, such as performance tests, to less conventional cases, such as quantum.

If a single failing test is unreliable, it’s worth looking at grouping together a set of related tests, in case the system’s behaviour is predictable over enough cases, i.e. in aggregate.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s