Fault tolerance

Introduction

I am slowly working my way through the 300+ back issues of the podcast Software Engineering Radio.  I’ve got as far as a couple of excellent episodes on fault tolerance with Bob Hanmer.  I recommend that you listen to them, even if (like me) you don’t have to worry about this kind of thing as much as the people who design nuclear power station control software.  It also reminded me of an excellent video by Eoin Woods on getting your system into production and keeping it there (which I also recommend you watch).

Bad things

First, some terms for bad things – fault, error and failure:

  • Fault = underlying cause (this could be a typo in the code, bad input data or receiving more work than the system is designed for)
  • Error = the wrong state that the system gets into as a result of the fault
  • Failure = perceived deviation from expected behaviour

The idea is that faults will happen, and you want to stop them from developing into failures, i.e. you want the system to tolerate the faults.
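To make the distinction concrete, here is a minimal sketch (hypothetical names and values, mine rather than the podcast's): the fault is bad data in a config value, the error is the invalid state that would result from using it, and tolerating the fault means falling back to a safe default so that no failure is visible to users.

    DEFAULT_RATE_LIMIT = 100  # safe default, assumed for illustration

    def load_rate_limit(raw_value: str) -> int:
        """Read a requests-per-second limit from config, tolerating bad data."""
        try:
            limit = int(raw_value)
            if limit <= 0:
                raise ValueError("rate limit must be positive")
            return limit
        except ValueError:
            # The fault (bad config) has caused an error (an unusable value).
            # Instead of letting the service crash or run unthrottled (a
            # failure the user would notice), degrade to the default.
            return DEFAULT_RATE_LIMIT

    print(load_rate_limit("250"))   # 250 - normal behaviour
    print(load_rate_limit("lots"))  # 100 - fault tolerated, no visible failure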

In the case of more work arriving than the system is designed for, designing the system correctly means that some kind of service is still provided (even if only for some of the requests, and/or with less rich behaviour, and/or more slowly).  Designing the system poorly means that there could be a cascading failure, where overloaded components fail and shed their load onto other components, which then become overloaded and fail in turn, and so on.
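One common way of getting the "some kind of service is still provided" behaviour is to shed load explicitly at the edge of a component, rather than accepting every request and collapsing (and taking the components behind it down too).  A minimal sketch, using a bounded concurrency limit and made-up request/response shapes rather than any particular framework:

    import threading

    class LoadShedder:
        """Work on at most max_in_flight requests at once; refuse the rest
        quickly instead of queueing them without bound."""

        def __init__(self, max_in_flight: int):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def handle(self, request, do_work):
            # Try to claim a slot without blocking; if none is free, reject
            # fast so the caller can back off, retry later or use a fallback.
            if not self._slots.acquire(blocking=False):
                return {"status": 503, "body": "overloaded, please retry later"}
            try:
                return {"status": 200, "body": do_work(request)}
            finally:
                self._slots.release()

    shedder = LoadShedder(max_in_flight=10)
    print(shedder.handle("some request", do_work=lambda r: f"processed {r}"))

The important design choice is that the rejection is cheap and explicit, so an overloaded component degrades (serving only some requests) instead of failing and pushing its load onto its neighbours.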

 

Good things

Some desirable non-functional requirements in this area:

  • Dependability = security + reliability + safety + maintainability
  • Security = confidentiality + integrity + availability (+ non-repudiation according to many people)
  • Confidentiality = absence of unauthorised disclosure of information
  • Integrity = absence of improper system alterations
  • Availability = percentage of all time that the system is available to do work (there’s a small worked example after this list)
  • Reliability = probability that the system will perform correctly for a specified period of time
  • Safety = absence of catastrophic consequences on the user(s) and environment
  • Maintainability = ability to undergo modifications and repairs
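
To make the availability figure concrete: it’s often quoted as a number of "nines", and each extra nine translates directly into a smaller allowance of downtime.  A quick sketch of the arithmetic (my numbers, not from the podcast):

    # Convert an availability percentage into allowed downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (99.0, 99.9, 99.99, 99.999):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
        print(f"{availability}% available -> about {downtime_minutes:.0f} "
              f"minutes of downtime per year")

    # 99.0%   -> about 5256 minutes (~3.7 days)
    # 99.9%   -> about  526 minutes (~8.8 hours)
    # 99.99%  -> about   53 minutes
    # 99.999% -> about    5 minutes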

Availability and reliability can be confusing, so Bob Hanmer gave some examples.

A space rocket has a mission of finite length.  It needs high reliability (it must not blow up, go off course etc.), even though that reliability only has to hold for that finite period of time.  A phone network needs high availability – it’s on all the time – but it’s annoying rather than catastrophic if the occasional call doesn’t work correctly.  The cash machine network needs both high reliability and high availability – it’s on all the time, and if it goes wrong then people will probably end up with the wrong amount of money.
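A made-up numeric contrast (my own illustration, not one of Bob Hanmer’s) shows how the two can pull apart: a system with a brief blip every hour can have very high availability and still be useless for a job that needs a few uninterrupted hours, while a system with one long planned outage a month has lower availability but gets through most short missions untouched.

    # Hypothetical comparison of two systems over a 30-day month.
    HOURS_PER_MONTH = 30 * 24

    # System A: a 10-second blip every hour (e.g. a flaky network element).
    a_downtime_hours = HOURS_PER_MONTH * 10 / 3600
    a_availability = 1 - a_downtime_hours / HOURS_PER_MONTH

    # System B: one planned 24-hour outage per month, perfect otherwise.
    b_downtime_hours = 24
    b_availability = 1 - b_downtime_hours / HOURS_PER_MONTH

    print(f"A: {a_availability:.2%} available")  # ~99.72% - very high
    print(f"B: {b_availability:.2%} available")  # ~96.67% - noticeably lower

    # But for a 3-hour "mission" that must run uninterrupted, A is certain
    # to be hit by a blip (it has one every hour), while B only fails the
    # missions that overlap its single outage day.
    b_mission_success = (HOURS_PER_MONTH - 24 - 3) / HOURS_PER_MONTH
    print("A: almost every 3-hour mission is interrupted")
    print(f"B: roughly {b_mission_success:.0%} of 3-hour missions succeed")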
