I am slowly working my way through the 300+ back issues of the podcast Software Engineering Radio. I’ve got as far as a couple of excellent episodes on fault tolerance with Bob Hanmer. I recommend that you listen to them, even if (like me) you don’t have to worry about this kind of thing as much as the people who design nuclear power station control software. It also reminded me of an excellent video by Eoin Woods on getting your system into production and keeping it there (which I also recommend you watch).
First, some terms for bad things – fault, error and failure:
- Fault = underlying cause (this could be a typo in the code, bad input data, or more work arriving than the system is designed for)
- Error = the wrong state that the system gets into as a result of the fault
- Failure = perceived deviation from expected behaviour
The idea is that faults will happen, and you want to stop them from developing into failures, i.e. you want the system to tolerate the faults.
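As a tiny illustration of that chain (the function and names are my own, not from the episodes), here is a sketch where a fault is detected before it can develop into a failure:

```python
# Hypothetical sketch: a fault (bad input data) that the system
# tolerates before it can develop into a failure the caller perceives.

def average(values):
    """Return the mean of a list of numbers.

    Fault:   the caller passes an empty list (bad input data).
    Error:   without a guard, the system would enter a wrong state
             (a division by zero).
    Failure: the caller would perceive a crash instead of an answer.
    """
    if not values:       # detect the fault before it becomes an error
        return 0.0       # degraded but defined behaviour, not a failure
    return sum(values) / len(values)

print(average([1, 2, 3]))   # normal operation: 2.0
print(average([]))          # fault tolerated: 0.0, no crash
```

The point is not that returning 0.0 is the right design for every system, but that the fault was stopped partway along the fault → error → failure chain.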
In the case of more work arriving than the system is designed for, a correctly designed system still provides some kind of service (perhaps only for some of the requests, and/or with less rich behaviour, and/or more slowly). A poorly designed system risks a cascading failure: overloaded components fail, shedding their load onto other components, which then become overloaded and fail in turn, and so on.
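One common way to provide that partial service is load shedding: reject the excess work immediately rather than letting it pile up and take the component down. A minimal sketch, with a capacity I've made up for the example:

```python
# Hypothetical sketch of load shedding with a bounded queue: when more
# work arrives than the component is designed for, the excess is
# rejected straight away, so the component degrades (serves fewer
# requests) instead of failing and dumping its load onto others.

import queue

CAPACITY = 3                        # assumed design limit of this component
work = queue.Queue(maxsize=CAPACITY)

def submit(request):
    """Accept a request if there is capacity, otherwise shed it."""
    try:
        work.put_nowait(request)
        return "accepted"
    except queue.Full:
        return "rejected"           # degraded service, not a failure

results = [submit(f"req-{i}") for i in range(5)]
print(results)    # first three accepted, the overflow shed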
Some desirable non-functional requirements in this area:
- Dependability = security + reliability + safety + maintainability
- Security = confidentiality + integrity + availability (+ non-repudiation according to many people)
- Confidentiality = absence of unauthorised disclosure of information
- Integrity = absence of improper system alterations
- Availability = percentage of all time that the system is available to do work
- Reliability = probability that the system will perform correctly for a specified period of time
- Safety = absence of catastrophic consequences on the user(s) and environment
- Maintainability = ability to undergo modifications and repairs
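To make the availability and reliability definitions concrete, a small worked example. The numbers and the constant-failure-rate model R(t) = e^(-λt) are my own assumptions for illustration, not from the list above:

```python
import math

# Availability: percentage of all time the system is available to do work.
total_hours = 24 * 365                 # one year, ignoring leap years
downtime_hours = 8.76                  # assumed downtime for the example
availability = (total_hours - downtime_hours) / total_hours
print(f"availability = {availability:.3%}")   # 99.900%, i.e. "three nines"

# Reliability: probability of performing correctly for a specified period.
# Assumed model: constant failure rate lambda, so R(t) = exp(-lambda * t).
failure_rate = 1 / 10_000              # assumed: one failure per 10,000 hours
mission_time = 100                     # hours
reliability = math.exp(-failure_rate * mission_time)
print(f"R({mission_time}h) = {reliability:.4f}")
```

Note how the two measures answer different questions: availability is about the fraction of time you can get service at all, while reliability is about the chance of getting through one particular period without anything going wrong.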
Availability and reliability can be confusing, so Bob Hanmer gave some examples.
A space rocket has a mission of finite length. It needs high reliability (it must not blow up, go off course, and so on) even though that reliability relates only to that finite period of time. A phone network needs high availability – it's on all the time – but it's annoying rather than catastrophic if it doesn't work correctly. A cash machine network needs both high reliability and high availability: it's on all the time, and if it goes wrong then people will probably end up with the wrong amount of money.