Seven consecutive errors = A Catastrophe

T Ashok (@ash_thiru)

Summary

“A typical accident takes seven consecutive errors”, states Malcolm Gladwell. The same notion is reflected in Mark Buchanan’s book “Ubiquity”. This article asks: how do you ensure that potential critical failures lurking in systems that have matured can still be uncovered?

—

“A typical accident takes seven consecutive errors”, writes Malcolm Gladwell in his book “Outliers”. As always, Gladwell’s books are a fascinating read. In the chapter “The Ethnic Theory of Plane Crashes”, he analyses airplane disasters and states that it is a series of small errors that results in a catastrophe. “Plane crashes are much more likely to be a result of an accumulation of minor difficulties and seemingly trivial malfunctions”, says Gladwell. The other example he cites is the famous accident at Three Mile Island (the nuclear power station disaster of 1979).

The plant came near meltdown as the result of seven consecutive errors: (1) a blockage in a giant water filter (2) caused moisture to leak into the plant’s air system, which (3) inadvertently tripped two valves and (4) shut down the flow of cold water into the generator; (5) the backup system’s cooling valves had been closed, a human mistake; (6) the indicator in the control room showing that they were closed was blocked by a repair tag; and (7) another backup system, a relief valve, was not working.

This notion is reflected in the book “Ubiquity” by Mark Buchanan too. He states that systems have a natural tendency to organise themselves into what is called the “critical state”, which Buchanan describes as the “knife-edge of stability”. When a system reaches the “critical state”, all it takes is a small nudge to create a catastrophe.

Now as a person interested in breaking software and uncovering defects, I am curious to understand how I can test better. How do you ensure that potential critical failures lurking in systems that have matured can still be uncovered?

Let us look at what we do. We stimulate the system with inputs (correct and erroneous) to irritate latent faults so that they propagate and result in failure. When the system is “young”, the tests and test cases we come up with focus on uncovering specific (singular) faults, i.e. a set of inputs that can irritate a singular fault and possibly yield a critical failure. This is possible because the “young system” is not yet resilient, and therefore even a singular fault trips it up! We then think that our test cases (i.e. combinations of inputs) are powerful and effective. But these test cases do not yield defects later, as the system becomes resilient to singular faults.

As the system matures, we need to sharpen the test cases to irritate a set of potential faults that can create a domino effect and yield critical failures. Creating test cases to uncover singular faults in a mature system may not be useful. It is necessary that test cases operate at a higher level of system validation (i.e. have long flows) and have the power to irritate a set of faults.
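To make the contrast concrete, here is a minimal Python sketch of a toy “mature” system. It is only an illustration under stated assumptions: the fault names, the survival rule in system_survives, and the helper failing_combinations are all hypothetical, loosely inspired by the Three Mile Island sequence and not taken from any real system. The toy system tolerates any single fault, and fails only when three particular faults line up.

```python
from itertools import combinations

# Hypothetical fault names, invented for this illustration.
FAULTS = [
    "filter_blocked",
    "valve_tripped",
    "backup_valve_closed",
    "indicator_hidden",
    "relief_valve_stuck",
]


def system_survives(active_faults):
    """Toy model (an assumption, not a real system): the system is
    resilient to singular faults and fails only when primary cooling
    is lost, the backup is unavailable, and the operator cannot see it."""
    active = set(active_faults)
    cooling_lost = "valve_tripped" in active
    backup_lost = "backup_valve_closed" in active
    operator_blind = "indicator_hidden" in active
    return not (cooling_lost and backup_lost and operator_blind)


def failing_combinations(faults, max_size):
    """Inject every combination of up to max_size faults and collect
    the combinations that bring the system down."""
    failures = []
    for size in range(1, max_size + 1):
        for combo in combinations(faults, size):
            if not system_survives(combo):
                failures.append(combo)
    return failures


# Tests that irritate one fault at a time find nothing...
print(failing_combinations(FAULTS, 1))
# ...while tests that irritate a chain of three faults expose the failure.
print(failing_combinations(FAULTS, 3))
```

Exhaustive enumeration like this grows quickly with the number of faults and the chain length, so in practice one would likely choose the chains to exercise from an understanding of the system’s long flows rather than trying them all.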

Should we resort to uncovering critical failures only via testing, by creating test cases at higher levels that have the power to uncover multiple types of faults? Not necessarily. We can apply this thought process at the earlier stages of design and code too, by thinking through sequences of potential errors and understanding what can happen.

If you drive in India, you know what I mean: the potential accident due to a dog chasing a cow, which is charging into the guy on the motorbike, who is talking on his cell phone while driving on the wrong side of the road, encounters a “speed bump”, and screech *@^%… You avoid him if you are a defensive driver. Alas, we do not apply the same defensive logic often enough in other disciplines like software engineering…