Human Error


by James Reason


  1.3. Systems have more defences against failure

  Because of the increasing unacceptability of a catastrophic disaster, and because of the widespread availability of intelligent hardware, designers have sought to provide automatic safety devices (ASDs) sufficient to protect the system against all the known scenarios of breakdown. According to Perrow (1984, p. 43): “The more complicated or tightly coupled the plant, the more attention is paid to reducing the occasion for failures.”

  The design of a modern nuclear power station is based upon the philosophy of ‘defence in depth’. In addition to a large number of ‘back-up’ subsystems, one line of defence is provided by ASDs: devices that, having sensed an out-of-tolerance condition, automatically ‘trip’ the reactor, and/or switch off the turbines and/or release excess pressure. Not only are they programmed to shut down various aspects of the process, they also call in automatic safety systems, such as the emergency core cooling system (ECCS) or safety injection (SI), should there be the danger of a core melt. A further line of defence is provided by the containment, a massive concrete structure that prevents the accidental release of radioactive material to the outside world in the event of a failure of the ASDs. If all of these defences fail, and dangerous materials are released to the exterior, then it is hoped that their harmful consequences would be minimised by the general (though not universal) practice of siting nuclear power stations in sparsely populated areas.
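
  The layering described above can be caricatured in a minimal Python sketch. It is illustrative only: the sensor names, threshold values and actions are invented for the example and are not drawn from any real plant protection system.

    # Illustrative sketch only: invented sensors, thresholds and actions,
    # caricaturing the 'defence in depth' layering described in the text.

    def asd_actions(pressure_mpa, coolant_temp_c):
        """First line of defence: automatic safety devices (ASDs) trip the
        reactor and call in safety systems on an out-of-tolerance condition."""
        actions = []
        if pressure_mpa > 16.0:        # hypothetical over-pressure limit
            actions += ["trip_reactor", "release_excess_pressure"]
        if coolant_temp_c > 350.0:     # hypothetical over-temperature limit
            actions += ["trip_reactor", "start_emergency_core_cooling"]
        return actions

    def containment_holds(release_detected):
        """Further line of defence: the containment retains any release that
        the ASDs fail to prevent."""
        return not release_detected

    # A catastrophe requires every layer to fail in combination: the ASDs must
    # take no effective action AND the containment must be breached.
    asds_failed = not asd_actions(pressure_mpa=17.2, coolant_temp_c=360.0)
    catastrophe = asds_failed and not containment_holds(release_detected=True)
    print(catastrophe)   # False here: the ASDs acted, so the sequence is broken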

  For a catastrophe to happen, therefore, a number of apparently unlikely events need to occur in combination during the accident sequence (see Rasmussen & Pedersen, 1984). First, the ASDs must fail to restore the disturbed system to a safe state. Second, the containment must fail to prevent the release of toxic material to the exterior. But such disasters still happen. One of the most obvious reasons is that the safety systems themselves are prey to human error, particularly of the latent kind. We are thus faced with a paradox: those specialised systems designed solely to make the plant safe are also its points of greatest weakness.

  1.4. Systems have become more opaque

  One of the consequences of the developments outlined above is that complex, tightly-coupled and highly defended systems have become increasingly opaque to the people who manage, maintain and operate them. This opacity has two aspects: not knowing what is happening and not understanding what the system can do.

  As we have seen, automation has wrought a fundamental change in the roles people play within certain high-risk technologies. Instead of having ‘hands on’ contact with the process, people have been promoted “to higher-level supervisory tasks and to long-term maintenance and planning tasks” (Rasmussen, 1988). In all cases, these are far removed from the immediate processing. What direct information they have is filtered through the computer-based interface. And, as many accidents have demonstrated, they often cannot find what they need to know while, at the same time, being deluged with information they neither want nor know how to interpret. In simpler, more linear systems, it was always possible for an operator or manager to go out and inspect the process at first hand, to examine directly the quality of the product, to look at the leaky valve or to talk to the experienced man or woman on the job. But these alternatives are not available in chemical and nuclear plants where an unapproachable and only partially understood process is largely hidden within a maze of pipes, reinforced vessels and concrete bunkers.

  There is also another important factor contributing to system opacity: the system’s own defences. Rasmussen (1988, pp. 3-4) has called this ‘the fallacy of defence in depth’.

  Another important implication of the very nature of the ‘defence in depth’ philosophy is that the system very often does not respond actively to single faults. Consequently, many errors and faults made by the staff and maintenance personnel do not directly reveal themselves by functional response from the system. Humans can operate with an extremely high level of reliability in a dynamic environment when slips and mistakes have immediately visible effects and can be corrected. Survival when driving through Paris during rush hours depends on this fact.

  Compare this to working in a system designed according to the ‘defence in depth’ principle, where several independent events have to coincide before the system responds by visible changes in behaviour. Violation of safety preconditions during work on the system will probably not result in an immediate functional response, and latent effects of erroneous acts can therefore be left in the system. When such errors are allowed to be present in a system over a longer period of time, the probability of coincidence of the multiple faults necessary for release of an accident is drastically increased. Analyses of major accidents typically show that the basic safety of the system has eroded due to latent errors. A more significant contribution to safety can be expected from efforts to decrease the duration of latent errors than from measures to decrease their basic frequency.
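
  Rasmussen’s point about duration can be put into rough numbers. The back-of-envelope sketch below is not part of his analysis: it simply assumes that the second, triggering fault arrives independently as a Poisson process with a made-up daily rate, and shows how the probability of the fatal coincidence grows with the time a latent error is left in the system.

    # Back-of-envelope sketch (not from Rasmussen, 1988): hypothetical numbers.
    # A latent error sits in the system for `latent_days`; an independent
    # triggering fault arrives as a Poisson process with rate TRIGGER_RATE.
    import math

    TRIGGER_RATE = 0.01   # assumed triggering faults per day (illustrative)

    def coincidence_probability(latent_days, rate=TRIGGER_RATE):
        """P(at least one triggering fault while the latent error persists)."""
        return 1.0 - math.exp(-rate * latent_days)

    for days in (1, 15, 30, 90):
        print(f"latent error present {days:3d} days -> "
              f"P(coincidence) = {coincidence_probability(days):.3f}")
    # 1 day: 0.010   15 days: 0.139   30 days: 0.259   90 days: 0.593

  On these assumed figures the risk grows roughly in proportion to how long the latent error survives, which is the intuition behind Rasmussen’s emphasis on reducing the duration of latent errors rather than their basic frequency.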

  1.5. The ironies of automation

  Lisanne Bainbridge (1987) of University College London has expressed in an elegant and concise form many of the difficulties that lie at the heart of the relationship between humans and machines in advanced technological installations. She calls them ‘the ironies of automation’.

  Many systems designers view human operators as unreliable and inefficient and strive to supplant them with automated devices. There are two ironies here. The first is that designers’ errors, as discussed later in this chapter, make a significant contribution to accidents and events. The second is that the same designer who seeks to eliminate human beings still leaves the operator “to do the tasks which the designer cannot think how to automate” (Bainbridge, 1987, p. 272).

  In an automated plant, operators are required to monitor that the automatic system is functioning properly. But it is well known that even highly motivated operators cannot maintain effective vigilance for anything more than quite short periods; thus, they are demonstrably ill-suited to carry out this residual task of monitoring for rare, abnormal events. In order to aid them, designers need to provide automatic alarm signals. But who decides when these automatic alarms have failed or been switched off?

  Another operator task is to take over manual control when the automatic control system fails. Manual control is a highly skilled activity, and skills need to be practised continuously in order to maintain them. Yet an automatic control system that fails only rarely denies operators the opportunity for practising these basic control skills. One of the consequences of automation, therefore, is that operators become de-skilled in precisely those activities that justify their marginalised existence. But when manual takeover is necessary something has usually gone wrong; this means that operators need to be more rather than less skilled in order to cope with these atypical conditions. Duncan (1987, p. 266) makes the same point: “The more reliable the plant, the less opportunity there will be for the operator to practise direct intervention, and the more difficult will be the demands of the remaining tasks requiring operator intervention.”

  These ironies also spill over into the area of training. Conscious of the difficulties facing operators in the high-workload, high-stress conditions of a plant emergency, designers, regulators and managers have sought to proceduralise operator actions. These frequently involve highly elaborate branching structures or algorithms designed to differentiate between a set of foreseeable faults. Some idea of what this means in practice can be gained from the following extract from the U.S. Nuclear Regulatory Commission’s report on the serious loss of main and auxiliary feedwater accident at Toledo Edison’s Davis-Besse plant in Ohio (NUREG, 1985). The extract describes the actions of the crew immediately following the reactor and turbine trips that occurred at 1.35 a.m. on 9 June 1985.

  The primary-side operator acted in accordance with the immediate post-trip action, specified in the emergency procedure that he had memorized.... The secondary-side operator heard the turbine stop valves slamming shut and knew the reactor had tripped. This ‘thud’ was heard by most of the equipment operators who also recognized its meaning and two of them headed for the control room.... The shift supervisor joined the operator at the secondary-side control console and watched the rapid decrease of the steam generator levels. The assistant shift supervisor in the meantime opened the plant’s looseleaf emergency procedure book (It is about two inches thick with tabs for quick reference...). As he read aloud the immediate actions specified, the reactor operators were responding in the affirmative. After phoning the shift technical advisor to come to the control room, the administrative assistant began writing down what the operators were saying, although they were speaking faster than she could write.

  [Later] The assistant shift supervisor, meanwhile, continued reading aloud from the emergency procedure. He had reached the point in the supplementary actions that require verification that feedwater flow was available. However, there was no feedwater, not even from the Auxiliary Feedwater System (AFWS), a safety system designed to provide feedwater in the situation that existed. Given this condition, the procedure directs the operator to the section entitled, ‘Lack of Heat Transfer’. He opened the procedure at the tab corresponding to this condition, but left the desk and the procedure at this point, to diagnose why the AFWS had failed. He performed a valve alignment verification and found that the isolation valve in each AFW train had closed. [Both valves had failed to reopen automatically.] He tried unsuccessfully to open the valves by the push buttons on the back panel. ... The AFW system had now suffered its third common-mode failure, thus increasing the number of malfunctions to seven within seven minutes after the reactor trip. At this point, things in the control room were hectic. The plant had lost all feedwater; reactor pressure and temperature were increasing; and a number of unexpected equipment problems had occurred. The seriousness of the situation was appreciated. [It should be added that despite the commission of a number of slips and mistakes, the plant was restored to a safe state within 15 minutes. This was a very good crew!]

  This passage is worth quoting at length because it reveals what the reality of a serious nuclear power plant emergency is like. It also captures the moment when the pre-programmed procedures, like the plant, ran out of steam, forcing the operators to improvise in the face of what the industry calls a ‘beyond design basis accident’. For our present purposes, however, it highlights a further irony of automation: that of drilling operators to follow written instructions and then putting them in a system to provide knowledge-based intelligence and remedial improvisation. Bainbridge (1987, p. 278) commented: “Perhaps the final irony is that it is the most successful automated systems, with rare need for manual intervention, which may need the greatest investment in operator training.”

  1.6. The operator as temporal coordinator

  French (Montmollin, 1984) and Belgian (De Keyser, Decortis, Housiaux & Van Daele, 1987) investigators have emphasised the importance of the temporal aspects of human supervisory control. One of the side effects of automation has been the proliferation of specialised working teams acting as satellites to the overall process. These include engineers, maintenance staff, control specialists and computer scientists. In many industrial settings, the task of orchestrating their various activities falls to the control room operator. De Keyser and her colleagues are currently documenting the errors of temporal judgement that can arise in these circumstances (De Keyser, 1988).

  2. The ‘Catch 22’ of human supervisory control

  As indicated earlier, the main reason why humans are retained in systems that are primarily controlled by intelligent computers is to handle ‘non-design’ emergencies. In short, operators are there because system designers cannot foresee all possible scenarios of failure and hence are not able to provide automatic safety devices for every contingency.

  In addition to their cosmetic value, human beings owe their inclusion in hazardous systems to their unique, knowledge-based ability to carry out ‘on-line’ problem solving in novel situations. Ironically, and notwithstanding the Apollo 13 astronauts and others demonstrating inspired improvisation, they are not especially good at it; at least not in the conditions that usually prevail during systems emergencies. One reason for this is that stressed human beings are strongly disposed to employ the effortless, parallel, preprogrammed operations of highly specialised, low-level processors and their associated heuristics. These stored routines are shaped by personal history and reflect the recurring patterns of past experience.

  The first part of the catch is thus revealed: Why do we have operators in complex systems? To cope with emergencies. What will they actually use to deal with these problems? Stored routines based on previous interactions with a specific environment. What, for the most part, is their experience within the control room? Monitoring and occasionally tweaking the plant while it performs within safe operating limits. So how can they perform adequately when they are called upon to reenter the control loop? The evidence is that this task has become so alien and the system so complex that, on a significant number of occasions, they perform badly.

  One apparent solution would be to spend a large part of an operator’s shift time drilling him or her in the diagnostic and recovery lessons of previous system emergencies. And this brings us to the second part of the catch. It is in the nature of complex, tightly-coupled, highly interactive, opaque and partially understood systems to spring nasty surprises. Even if it were possible to build up—through simulations or gameplaying—an extensive repertoire of recovery routines within operating crews, there is no guarantee that they would be relevant, other than in a very general sense, to some future event. As case studies repeatedly show, accidents may begin in a conventional way, but they rarely proceed along predictable lines. Each incident is a truly novel event in which past experience counts for little and where the plant has to be recovered by a mixture of good luck and laborious, resource-limited, knowledge-based processing. Active errors are inevitable. Whereas in the more forgiving circumstances of everyday life, learning from one’s mistakes is usually a beneficial process, in the control room of chemical or nuclear power plants, such educative experiences can have unacceptable consequences.

  The point is this: Human supervisory control was not conceived with humans in mind. It was a by-product of the microchip revolution. Indeed, if a group of human factors specialists sat down with the malign intent of conceiving an activity that was wholly ill-matched to the strengths and weaknesses of human cognition, they might well have come up with something not altogether different from what is currently demanded of nuclear and chemical plant operators. To put it simply: the active errors of stressed controllers are, in large part, the delayed effects of system design failures.

  Perrow (1984, p. 9), having noted that between 60 and 80 per cent of systems accidents are attributed to ‘operator error’, went on to make the following telling comment: “But if, as we shall see time and time again, the operator is confronted by unexpected and usually mysterious interactions among failures, saying that he or she should have zigged instead of zagged is only possible after the fact. Before the accident no one could know what was going on and what should have been done.”

  3. Maintenance-related omissions

  By their nature, latent errors make a contribution to systems failures that is generally difficult to quantify. An interesting exception, however, is provided by those committed during the maintenance of nuclear power plants. Two independent surveys (Rasmussen, 1980; INPO, 1984) indicate that simple omissions—the failure to carry out some of the actions necessary to achieve a desired goal—constitute the single largest category of human performance problems identified in the significant event reports logged by nuclear plants. Moreover, these omission errors appear to be most closely associated with maintenance-related tasks. Here, the term maintenance-related includes preventive and corrective maintenance, surveillance testing, removal and restoration of equipment, checking, supervision, postmaintenance testing and modifications.

  3.1. The Rasmussen survey

  Drawing upon the Nuclear Power Experience compilation of significant event reports in NPPs, Rasmussen (1980) analysed 200 cases classified under the heading of ‘Operational problems’. Omissions of functionally isolated acts accounted for 34 per cent of all the incidents, and a further 8.5 per cent involved other kinds of omission. The complete error distribution is shown in Table 7.1.

  Table 7.1. The distribution of error types in 200 NPP incidents (from Rasmussen, 1980).

  Breakdown of error types

  Absent-mindedness: 3
  Familiar association: 6
  Alertness low: 10
  Omission of functionally isolated acts: 68
  Other omissions: 17
  Mistakes among alternatives: 11
  Strong expectation: 10
  Side effect(s) not considered: 15
  Latent conditions not considered: 20
  Manual variability, lack of precision: 10
  Spatial orientation weak: 10
  Other, unclassifiable: 20
  TOTAL: 200
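
  As a quick consistency check, the percentages quoted above follow directly from these counts. The snippet below is a minimal sketch; the figures are simply those of Table 7.1.

    # Recomputing the figures quoted in the text from the Table 7.1 counts.
    counts = {
        "Absent-mindedness": 3,
        "Familiar association": 6,
        "Alertness low": 10,
        "Omission of functionally isolated acts": 68,
        "Other omissions": 17,
        "Mistakes among alternatives": 11,
        "Strong expectation": 10,
        "Side effect(s) not considered": 15,
        "Latent conditions not considered": 20,
        "Manual variability, lack of precision": 10,
        "Spatial orientation weak": 10,
        "Other, unclassifiable": 20,
    }

    total = sum(counts.values())                                  # 200 incidents
    isolated = counts["Omission of functionally isolated acts"]   # 68
    other_omissions = counts["Other omissions"]                   # 17

    print(100 * isolated / total)          # 34.0 per cent
    print(100 * other_omissions / total)   # 8.5 per cent
    print(isolated + other_omissions)      # 85 omission errors in all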

  The 85 omission errors were further broken down according to (a) the kind of task involved, and (b) the type of mental activity implicated in the phase of the task at which the error occurred. These two analyses are shown in Tables 7.2 and 7.3.

  Two aspects of these data are of particular importance. First, they reveal the significance of omission errors in test, calibration and maintenance activities. Second, the mental task analysis shows a close association between omissions and the planning and recalling of procedures. This point is further highlighted by the INPO root cause analysis discussed below.

 
