Table 7.2. The distribution of omissions across tasks (from Rasmussen, 1980).
Omissions per task
Monitoring and inspection: 0
Supervisory control: 2
Manual operation and control: 5
Inventory control: 8
Test and calibration: 28
Repair and modification: 35
Administrative task: 1
Management, staff planning: 1
Other (not mentioned): 5
TOTAL: 85
Table 7.3. The distribution of omissions across mental task phases (from Rasmussen, 1980).
Omissions per mental task
Detection of demand: 2
Observation/communication: 2
Target: tactical system state: 1
Procedure: plan, recall: 77
Execution: 3
TOTAL: 85
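These concentrations can be made explicit with a little arithmetic. The sketch below (in Python, and not part of Rasmussen's analysis) simply tabulates the two distributions above; treating 'Test and calibration' together with 'Repair and modification' as the maintenance-related tasks is this sketch's own assumption.

```python
# Illustrative tabulation of Tables 7.2 and 7.3 (Rasmussen, 1980).
# Counts are copied from the tables above; the "maintenance-related"
# grouping is an assumption of this sketch, not Rasmussen's.

omissions_by_task = {
    "Monitoring and inspection": 0,
    "Supervisory control": 2,
    "Manual operation and control": 5,
    "Inventory control": 8,
    "Test and calibration": 28,
    "Repair and modification": 35,
    "Administrative task": 1,
    "Management, staff planning": 1,
    "Other (not mentioned)": 5,
}

omissions_by_phase = {
    "Detection of demand": 2,
    "Observation/communication": 2,
    "Target: tactical system state": 1,
    "Procedure: plan, recall": 77,
    "Execution": 3,
}

total = sum(omissions_by_task.values())  # 85 in both tables
maintenance = (omissions_by_task["Test and calibration"]
               + omissions_by_task["Repair and modification"])  # 63
procedural = omissions_by_phase["Procedure: plan, recall"]      # 77

print(f"Maintenance-related tasks: {maintenance}/{total} = {maintenance / total:.0%}")
print(f"Procedure (plan, recall) phase: {procedural}/{total} = {procedural / total:.0%}")
```

On these figures, roughly three quarters of the reported omissions arose in test, calibration, repair and modification work, and about nine in ten occurred in the procedural (plan, recall) phase.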
3.2. The INPO root cause analysis
The root causes of 87 significant events reported in 1983 to the Institute of Nuclear Power Operations (INPO, the U.S. nuclear industry's own organization, located in Atlanta, Georgia) were analysed using the Root Cause Record Form. Of the 182 root causes identified, 80 (44 per cent) were classified as human performance problems (see Figure 7.2).
The event descriptions provided were sufficient to allow omissions to be distinguished from other behavioural error forms. Forty-eight of the 80 (60 per cent) human performance root causes were classified as involving either single or multiple omissions.
The following points are of interest:
(a) Ninety-six per cent of the deficient procedures involved omissions (31.3 per cent of all human performance root causes).
(b) Omissions were most frequently associated with maintenance-related activities: 64.5 per cent of the errors in this task category involved omitted acts. These made up a quarter of all human performance root causes.
(c) Seventy-six per cent of the human errors in the operation task category were omissions, representing 20 per cent of all human performance root causes.
3.3. General conclusions
Clearly, there are some differences between the Rasmussen and INPO analyses regarding the distribution of omissions over tasks. But these are more likely to reflect discrepancies in categorization and emphasis than real changes in the pattern of NPP errors over time (the Rasmussen study sampled the period up to 1978; the INPO data related to 1983). Of greater importance, however, is that both studies highlighted maintenance-related activities as being the most productive source of event root causes and both identified omissions as the most prevalent error form. The former conclusion is further supported by the more extensive NUMARC study (INPO, 1985a), while the latter is in close accord with the relative incidence of error types in everyday life (Reason & Mycielska, 1982; Reason, 1984a), where forgetting intentions was the most common form of lapse.
Figure 7.2. INPO analysis of the 182 root causes identified in 87 significant events occurring within nuclear power plants in 1983.
4. Operator errors
In a subsequent INPO report (INPO, 1985b), the classificatory scheme was modified in two ways: to eliminate ‘component failure’ (preferring to seek more assiduously for the cause of these failures), and to include ‘construction and installation deficiencies’ in the human performance category. This revised scheme was then applied to 180 significant event reports issued in both 1983 and 1984, in which a total of 387 root causes were identified. This analysis is summarised in Figure 7.3.
The human performance problems were further broken down into subcategories, as shown in Table 7.4. There are two important conclusions to be drawn from these data. First, at least 92 per cent of all root causes were man-made (see Figure 7.3). Second, only a relatively small proportion of the root causes were actually initiated by front-line personnel (i.e., failure to follow procedures). Most originated either in maintenance-related activities or in bad decisions taken within the organizational and managerial domains.
Figure 7.3. INPO analysis of the 387 root causes identified in 180 significant event reports in both 1983 and 1984.
Table 7.4. Breakdown of human performance problems (from INPO, 1985b).
Human performance problems
Deficient procedures or documentation: 43%
Lack of knowledge or training: 18%
Failure to follow procedures: 16%
Deficient planning or scheduling: 10%
Miscommunication: 6%
Deficient supervision: 3%
Policy problems: 2%
Other: 2%
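The second of the two conclusions drawn above can be illustrated by grouping the Table 7.4 percentages. The sketch below is not part of the INPO report; counting 'Failure to follow procedures' as the only front-line category, and everything except 'Other' as upstream, is an assumption made here purely for illustration.

```python
# Table 7.4 percentages (INPO, 1985b). The front-line/upstream split
# is this sketch's assumption, not a categorization used by INPO.

human_performance_problems = {
    "Deficient procedures or documentation": 43,
    "Lack of knowledge or training": 18,
    "Failure to follow procedures": 16,
    "Deficient planning or scheduling": 10,
    "Miscommunication": 6,
    "Deficient supervision": 3,
    "Policy problems": 2,
    "Other": 2,
}

front_line = human_performance_problems["Failure to follow procedures"]
upstream = sum(
    pct for cause, pct in human_performance_problems.items()
    if cause not in ("Failure to follow procedures", "Other")
)

print(f"Front-line (failure to follow procedures): {front_line}%")      # 16%
print(f"Upstream (procedures, training, planning, etc.): {upstream}%")  # 82%
```

On this grouping, barely one sixth of the human performance problems were initiated at the sharp end; the remainder lay in procedures, training, planning, communication, supervision and policy.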
5. Case study analyses of latent errors
This section attempts to show something of the nature and variety of latent errors through case study analyses of six major accidents: Three Mile Island, Bhopal, Challenger, Chernobyl, Zeebrugge and the King’s Cross underground fire. These events were not chosen because latent failures played an unusually critical part in their causation; other disasters such as Flixborough, Seveso, Aberfan, Summerland, Tenerife, Heysel Stadium and the Bradford and Piper Alpha fires would have demonstrated the significance of these dormant factors equally well. Three criteria influenced this particular selection: (a) all the events are comparatively recent, so that their general nature will be familiar to the nontechnical reader; (b) they are all well documented, and indeed many have been the subject of high-level governmental investigations; and (c) they cover a range of complex, high-risk systems.
In view of the diversity of the systems considered here, it is unlikely that any one reader will be conversant with all of their technical details. Accordingly, the major part of each case study will be presented in the form of a summary table indicating some of the major contributory latent failures—not all because, by their nature, many remain undiscovered. A latent failure in this context is defined as an error or violation that was committed at least one to two days before the start of the actual emergency and played a necessary (though not sufficient) role in causing the disaster. Accompanying each table will be a short description of the accident sequence and, where appropriate, some additional commentary on the general ‘health’ of the system.
As always in such analyses, there is the problem of defining the explanatory time frame. Any catastrophic event arises from the adverse conjunction of several distinct causal chains. If these are traced backwards in time, we encounter a combinatorial explosion of possible root causes, the elimination of any one of which could have thwarted the accident sequence. There are no clear-cut rules for restricting such retrospective searches. Some historians, for example, trace the origins of the Charge of the Light Brigade back to Cromwell’s Major-Generals (see Woodham-Smith, 1953); others are content to begin at the outset of the Crimean campaign; still others start their stories on the morning of 25 October 1854.
In the present context, there are two obvious boundary conditions. The combined constraints of space, information and the reader’s patience place severe limits on how far back in time we can go. Yet the immediate need to demonstrate the significance of antecedent events makes it essential to focus upon those human failures that were committed prior to the day of the actual catastrophe. As it turns out, these antecedent time frames vary in length from around 2 years for Three Mile Island to 9 years for the Challenger disaster, their precise extents being determined by the particular histories of each disaster and the available sources.
The point is that these starting points are fairly arbitrary ones. However, no particular significance is being placed on the quantities of latent and active errors; given their relative timescales, the former will always be more numerous than the latter. Rather, our purpose is to illustrate the insidious and often unforeseeable ways in which they combine to breach the system’s defences at some critical moment.
5.1. Three Mile Island
At 0400 on 28 March 1979, one of the turbines stopped automatically (tripped) in Unit No. 2 of Metropolitan Edison’s two pressurized water reactors (PWRs) on Three Mile Island (TMI) in the Susquehanna River, 10 miles south of Harrisburg (principal source: Kemeny, 1979). This was due to a maintenance crew
attempting to renew resin for the special treatment of the plant’s water. A cupful of water had leaked through a faulty seal in the condensate polisher system and had entered the plant’s instrument air system. The moisture interrupted the air pressure applied to two valves on the two feedwater pumps, and ‘told’ them something was wrong (which was not actually the case in this particular subsystem). The feedwater pumps stopped automatically. This cut the water flow to the steam generator and tripped the turbine. But this automatic safety device was not sufficient to render the plant safe. Without the pumps, the heat of the primary cooling system (circulating around the core) could not be transferred to the cool water in the secondary (nonradioactive) system.
At this point, the emergency feedwater pumps came on automatically. They are designed to pull water from an emergency storage tank and run it through the secondary cooling system to compensate for the water that boils off once it is not circulating. However, the pipes from these emergency feedwater tanks were blocked by closed valves, erroneously left shut during maintenance two days earlier.
With no heat removal from the primary coolant, there was a rapid rise in core temperature and pressure. This triggered another automatic safety device: the reactor ‘scrammed’ (the control rods, 80 per cent silver, dropped into the core and absorbed neutrons, stopping the chain reaction). But decaying radioactive materials still produce heat. This further increased temperature and pressure in the core. Such pressure is designed to be relieved automatically through a pilot-operated relief valve (PORV). When open, the PORV releases water from the core through a large pressurizer vessel and then into the sump below the containment. The PORV was supposed to flip open, relieve the pressure and then close automatically. But on this occasion, still only about 13 seconds into the emergency, it stuck open. This meant that the primary cooling system had a hole in it through which radioactive water, under high pressure, was pouring into the containment area, and thence down into the basement.
The emergency lasted in excess of 16 hours and resulted in the release of small quantities of radioactive material into the atmosphere. No loss of life has been traced directly to this accident, but the cost to the operating companies and the insurers was in the region of one billion dollars. It also marked a watershed in the history of nuclear power in the United States, and its consequences with regard to public concern for the safety of nuclear power plants are still felt today. The principal events, operator errors and contributing latent failures are summarised in Case Study No. 1 (see Appendix).
The subsequent investigations revealed a wide range of sloppy management practices and poor operating procedures. Subsequent inspection of TMI-1 (the other unit on the site) revealed a long-term lack of maintenance. For example, “boron stalactites more than a foot long hung from the valves and stalagmites had built up from the floor” (Kemeny, 1979) in the TMI-1 containment building. Other discoveries included:
(a) The iodine filters were left in continuous use rather than being preserved to filter air in the event of radioactive contamination. Consequently, on the day of the accident, they possessed considerably less than their full filtering capacity.
(b) Sensitive areas of the plant were open to the public. On the day before the accident, as many as 750 people had access to the auxiliary building.
(c) When shifts changed, no mechanism existed for making a systematic check on the status of the plant. Similarly, maintenance personnel were assigned jobs at the beginning of their shift, but no subsequent check was made on their progress.
(d) A retrospective review of TMI-2’s licensee event reports revealed repeated omissions, inadequate failure analyses and lack of corrective actions.
(e) Pipes and valves lacked suitable means of identification. Thus, 8 hours after the start of the accident, operators spent 10 minutes trying unsuccessfully to locate three decay heat valves in a high radiation field.
Was the state of TMI-2 unusual? Was this simply the “bad apple in the nuclear barrel” (Perrow, 1984)? The evidence suggests not. Some years earlier, Morris and Engelken (1973) had examined eight loss-of-coolant accidents (LOCAs) that had occurred in six different boiling water reactors over a 2-year period when there were only 29 plants operating. They looked particularly at the co-occurrence of multiple failures. Each accident involved between two and four different types of failure. In half of them there were also violations of operating procedures, but these occurred in conjunction with two to five other failures. Nor were failures limited to plant personnel: deficient valves were found in 20 plants supplied by 10 different manufacturers. As Perrow (1984) pointed out, it is from the concatenation of such relatively trivial events in nontrivial systems that accidents like TMI-2 are born. Generating electric power from nuclear energy is a highly technical business; but it would be naive to suppose that NPPs are managed or operated by a special breed of supermen. Those who run them are no worse than their counterparts in other industries, but neither are they significantly better.
5.2. Bhopal
On the night of 2-3 December 1984, a gas leak from a small pesticide plant, owned by a subsidiary of Union Carbide Corporation, devastated the central Indian city of Bhopal. It was the worst industrial disaster ever. At least 2,500 people were killed, and more than 200,000 were injured. Perhaps more than any other event, it revealed the hitherto largely unrealised dangers associated with the manufacture of highly toxic chemicals, in this case, methyl isocyanate (MIC).
The immediate cause of the discharge was an influx of water into an MIC storage tank. How it got there is a tangled story of botched maintenance, operator errors, improvised bypass pipes, failed safety systems, incompetent management, drought, agricultural economics and bad governmental decisions. It is too long to tell in detail here, though an inventory of the major latent failures is shown in Case Study No. 2 (see Appendix).
With such a terrible catastrophe, it is difficult to find unbiased sources. Union Carbide’s own report (March 1985) clearly has an axe to grind, as does the account by Morehouse and Subramanian, published by the Council on International and Public Affairs (1986). Other accounts, less comprehensive though more balanced, have been written by Lihou and Lihou (1985) and Bellamy (1985). Still other accounts can be found in the general scientific press (e.g., New Scientist) and in the chemical journals throughout 1985.
5.3. Challenger
Described in purely physical terms, the cause of the Space Shuttle Challenger disaster on the morning of 28 January 1986 was brutally simple. A rubbery seal, called an O-ring, on one of the solid rocket boosters split shortly after lift-off, releasing a jet of ignited fuel that caused the entire rocket complex to explode, killing all seven astronauts. But how that flawed seal came to be there after a 9-year history of repeated erosion and faults is a complicated tale of incompetence, selective blindness, conflicting goals and reversed logic. The main protagonists were NASA’s principal solid-rocket contractor, Morton Thiokol, and all levels of NASA management. It is summarised in Case Study No. 3 (see Appendix).
More detailed accounts can be found in the Report of the Presidential Commission on the Space Shuttle Challenger Accident (June, 1986), and in an excellent article by Cooper (1987). A discussion of how these facts were obtained from reluctant and often devious sources has been given by Kerhli (1987), one of the presidential commission’s principal investigators (his previous job had been prosecuting mafiosi).
5.4. Chernobyl
At 0124 on Saturday, 26 April 1986, two explosions blew off the 1000-tonne concrete cap sealing the Chernobyl-4 reactor, releasing molten core fragments into the immediate vicinity and fission products into the atmosphere. This was the worst accident in the history of commercial nuclear power generation. It has so far cost over 30 lives, contaminated some 400 square miles around the Ukrainian plant and significantly increased the risk of cancer deaths over a wide area of Scandinavia and Western Europe. It was an entirely man-made disaster.
The chain of events leading up to the accident, together with the associated latent failures, is shown in Case Study No. 4 (see Appendix). Other more detailed accounts of the accident can be found in the report of the USSR State Committee on the Utilization of Atomic Energy (1986), in Nature (vol. 323, 1986), and in a report prepared for the Central Electricity Generating Board (CEGB) by Collier and Davies (1986).
In the immediate aftermath of the accident, the Western nuclear industry vigorously asserted that ‘it couldn’t happen here’ (see Reason, 1987; Baker & Marshall, 1988; Reason, 1988). Whereas the Russian analysts highlighted human errors and violations as the principal cause, their Western counterparts, and especially Lord Marshall, head of the CEGB, preferred to blame the poor design of the Russian reactor and the inadequacy of the ‘Soviet safety culture’, although this latter argument came to sound increasingly hollow after the Zeebrugge and King’s Cross disasters.
Notwithstanding the obvious design defects of the RBMK reactor, it is clear from these latent failure analyses that the main ingredients for the Chernobyl disaster were not unique to the Soviet Union. There was a society committed to the generation of energy through large-scale nuclear power plants. There was a system that was hazardous, complex, tightly coupled, opaque and operating outside normal conditions. There was a fallible management structure that was monolithic, remote and slow to respond, and for which safety ranked low in the league of goals to be satisfied. There were operators who possessed only a limited understanding of the system they were controlling and who, in any case, were set a task that made dangerous violations inevitable.