When Science Goes Wrong
In the early 1990s, NASA administrator Dan Goldin spearheaded a new approach to the design and implementation of space missions, an approach that was encapsulated in the slogan ‘Faster, Better, Cheaper’, or FBC. The FBC philosophy was, in part, a response to fiscal belt-tightening imposed by the US government. It also represented a reaction to some of the earlier missions – huge, long-delayed projects that incorporated every imaginable bell and whistle and that went tens or hundreds of millions of dollars over budget. FBC was a leaner approach that aimed to achieve more with less, employing economical strategies such as the re-use of design elements that had proven successful in earlier missions. Although this ‘heritage’ approach promised great savings in time and money, it also injected risk. How could one be sure that a large piece of software, for example, would function successfully in the different environment of a new spacecraft? And the FBC approach also demanded economies in manpower. This was something that might be acceptable so long as things went according to plan, but it might cause problems when unexpected difficulties needed to be surmounted, as happened with the Mars Climate Orbiter.
Goldin’s strategy had some early successes. The 1996 Mars Pathfinder mission, for example, safely delivered the rover Sojourner to the Martian surface using a novel airbag landing system. The rover was able to navigate semi-autonomously around the landing site, and it did some simple geological investigations of nearby rocks. It also caught the imagination of the public, including children, back on Earth: Mattel’s rover action model was America’s best-selling toy of the summer of 1997.
Soon after Pathfinder’s landing, another spacecraft, the Mars Global Surveyor, reached the planet and went into a polar orbit. Over the following 10 years, it took nearly a quarter of a million photographs of the Martian surface, and it also operated as a communications satellite for other missions until it ceased functioning in 2006.
In spite of these successes, there were also hints of problems with the FBC approach. In 1997, for example, an Earth-orbiting satellite called Lewis was launched, but once in orbit it went into a spin that prevented its solar panels from facing the sun; this caused the batteries to lose charge. The problem occurred at night while the controllers were off duty; economic considerations had prevented the appointment of sufficient controllers for round-the-clock staffing. By the time the controllers returned to work the next morning, the spacecraft was completely out of electrical power and thus could not be resuscitated: it burned up in the atmosphere a few weeks later.
The Mars Climate Orbiter (MCO) was one element in a two-spacecraft mission named the Mars Surveyor ’98 Programme. The other element was the Mars Polar Lander. The role of the Climate Orbiter was to study the Martian atmosphere with a variety of instruments and also to serve as a communication link for the Lander and for other, future, missions. The Lander was to set down near the planet’s south pole – using retrorockets rather than an airbag – and dig into the soil with the particular aim of finding water.
Lockheed Martin won the $121-million contract to build both spacecraft (not including the scientific instruments), and the company was expected to complete the job with minimal oversight from NASA, consistent with the Faster, Better, Cheaper philosophy.
The error that doomed the MCO spacecraft occurred during the development of its navigational software. To understand the error, it is necessary to appreciate that once a Mars-bound spacecraft has escaped Earth’s gravitational field, it is essentially coasting in an orbit around the sun – albeit a highly elliptical orbit that is carefully planned to intersect the orbit of Mars at a time when Mars itself has reached that location. Thus, if the spacecraft’s initial direction and speed are well enough known, its future trajectory can be readily calculated from gravitational equations provided by Isaac Newton.
Two nongravitational factors can affect the trajectory, however. One consists of deliberate changes in trajectory induced by firing of the spacecraft’s small rocket engines, or ‘thrusters’. Usually, four or five of these trajectory correction manoeuvres (or TCMs) are performed in the course of the flight. Each time a TCM is performed, its effect on the trajectory has to be determined.
The other main source of non-gravitational effects comes from the pressure of solar radiation on the spacecraft, and from the craft’s efforts to compensate for that pressure. Radiation pressure is exerted mainly on the craft’s solar panels, on account of their large area. Unlike the Mars Global Surveyor, which had solar panels on either side of the spacecraft, the Mars Climate Orbiter had all of its panels on one side of the craft. Because of this asymmetrical design, the main effect of solar radiation was a tendency to spin the spacecraft around on its axis. Such a spin was undesirable because it reduced the amount of sunlight received by the panels. To counteract this effect, a set of small reaction wheels resembling the metal discs in gyroscope toys were automatically spun up by electric motors. These spinning reaction wheels generated an equal and opposite rotational force, so no actual rotation of the spacecraft occurred and it maintained a constant orientation to the sun.
Of course, the reaction wheels could only be spun up to a certain limiting speed, which was about 3,000 rpm. Because of the spacecraft’s asymmetrical design, it took less than 24 hours of flight for the reaction wheels to reach this limit. Then their accumulated angular momentum had to be ‘dumped’. This was done by slowing the wheels, while at the same time compensating for the effect of the slowing by firing some of the thrusters. In this way the reaction wheels could be brought back to a stop while keeping the spacecraft’s attitude constant. Then the cycle began again. These dumps were formally named ‘Angular Momentum Desaturation’ events, or AMDs, and they occurred about 10 times a week during the trip to Mars.
If these thruster firings had only affected the spacecraft’s rotation, they would not have disturbed its trajectory through space and would, therefore, have been irrelevant from a navigational standpoint. Unfortunately, design considerations led to the positioning of the thrusters in such a way that the AMD would also kick the spacecraft sideways by a small amount, and the frequent repetition of these events over the course of the flight would cause the spacecraft to deviate by a significant degree from its planned trajectory – enough to prevent the craft from entering the Mars orbit correctly.
The Lockheed engineers knew about this issue. They also knew that it would be difficult to measure the effect of the AMDs on the spacecraft’s trajectory at the time the AMDs occurred. This was because only limited information would be available about the spacecraft’s position, speed and direction of travel during the flight. In general, it is possible accurately to measure a spacecraft’s distance from Earth (based on the time for a radio signal to travel from Earth to the spacecraft and back) and its speed along the line of sight from Earth (based on Doppler changes in a fixed-frequency signal emitted by the spacecraft). However, it is not possible directly to measure its position or speed in the two dimensions that are perpendicular to the line of sight. These other variables can eventually be determined by making repeated measurements of the spacecraft’s range and line-of-sight velocity as it follows its elliptical trajectory, but these determinations are slow and subject to various forms of error. Unfortunately, the effects of the AMD events were exerted largely in the difficult-to-observe dimensions perpendicular to the line of sight from Earth.
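The two directly measurable quantities described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names and example figures are invented here, the Doppler formula is the first-order, non-relativistic approximation for a one-way signal, and real tracking corrects for many effects that are ignored in this sketch.

```python
# Illustrative sketch only; names and example values are hypothetical.
SPEED_OF_LIGHT_M_S = 299_792_458.0  # metres per second (exact by definition)


def range_from_round_trip(round_trip_seconds: float) -> float:
    """Distance in metres: the signal covers the distance twice."""
    return SPEED_OF_LIGHT_M_S * round_trip_seconds / 2.0


def line_of_sight_speed(emitted_hz: float, received_hz: float) -> float:
    """First-order Doppler estimate of speed along the line of sight, in m/s.

    Positive means the spacecraft is receding (received frequency is lower).
    """
    return SPEED_OF_LIGHT_M_S * (emitted_hz - received_hz) / emitted_hz


# A round trip of 1,000 seconds corresponds to roughly 150 million kilometres.
example_range_m = range_from_round_trip(1000.0)
```

Neither function says anything about motion perpendicular to the line of sight, which is exactly the gap that made the AMD effects so hard to observe.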
To solve this problem, the engineers followed a strategy that had been employed in previous missions such as Mars Global Surveyor: they would use a form of ‘dead reckoning’. This depended on knowing exactly how much thrust would be exerted on the spacecraft – and in which direction – when each thruster was fired for a known period of time. With this information, it would be possible to calculate how much the speed and direction of the spacecraft would change during each AMD event, even without measuring those changes directly.
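The dead-reckoning idea reduces to simple arithmetic: thrust multiplied by firing time gives an impulse, and impulse divided by the spacecraft’s mass gives a change in velocity. The Python sketch below is a hedged illustration only; the function names, the thrust and firing figures and the 600-kilogram mass are assumptions chosen for the example, not mission values.

```python
# Illustrative sketch only; all names and numbers are hypothetical.

def delta_v(thrust_newtons: float, burn_seconds: float, mass_kg: float) -> float:
    """Change in speed (m/s) from one firing: impulse divided by mass."""
    return thrust_newtons * burn_seconds / mass_kg


def accumulated_delta_v(firings, mass_kg: float) -> float:
    """Sum the speed changes from a list of (thrust_newtons, burn_seconds) firings.

    Treats every firing as pushing in the same direction, which is a
    simplification; a real calculation would add the changes as vectors.
    """
    return sum(delta_v(thrust, seconds, mass_kg) for thrust, seconds in firings)


# Ten hypothetical firings of 1 newton for 6 seconds on a 600 kg craft.
total = accumulated_delta_v([(1.0, 6.0)] * 10, 600.0)
```

Each firing in this made-up example changes the speed by only about a centimetre per second, but the changes add up over many events.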
To make this possible, the subcontractor who manufactured the thrusters sent paperwork to Lockheed Martin that documented how much thrust was generated when each thruster was fired. This manufacturer was accustomed to working in English units (pounds, feet and so on). NASA generally requires the use of metric units throughout its operations and those of its contractors, but it makes exceptions in cases where ordering a change of units may be unduly burdensome, or where it increases the risk that the contractor will make some kind of mistake. NASA made an exception of this kind for this subcontractor, so the paperwork received by Lockheed Martin listed the thrusters’ performance in units of pounds of force.
The root cause of the Mars Climate Orbiter mishap was the failure to convert these English units to metric units of force – newtons – in the preparation of a navigational software file called ‘Small Forces’. The purpose of this file was to determine how strongly each AMD event would push the spacecraft out of its intended path. Because the remaining navigational software assumed that the output of the Small Forces file was in newtons, it underestimated the deflection of the spacecraft’s trajectory caused by each AMD event by a factor equal to the ratio between pounds and newtons – which is to say, by a factor of 4.45.
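As a rough illustration of the arithmetic, the following Python sketch shows how an impulse computed in pound-force seconds, if read as though it were already in newton-seconds, understates the true value by the conversion factor. It is not the mission software: the function names and the example thrust and firing time are invented for illustration.

```python
# Illustrative sketch only; names and example values are hypothetical.
LBF_TO_NEWTON = 4.4482216152605  # newtons per pound-force


def impulse_lbf_seconds(thrust_lbf: float, burn_seconds: float) -> float:
    """Impulse in pound-force seconds, as the thruster paperwork would give it."""
    return thrust_lbf * burn_seconds


def to_newton_seconds(impulse_lbf_s: float) -> float:
    """The conversion step that was missing from the faulty file."""
    return impulse_lbf_s * LBF_TO_NEWTON


# Hypothetical firing: 0.2 pounds of force for 10 seconds.
unconverted = impulse_lbf_seconds(0.2, 10.0)   # number the faulty file would output
true_value = to_newton_seconds(unconverted)    # what that impulse really is in N*s

# Downstream code that assumes newton-seconds sees a value too small by this ratio.
understatement = true_value / unconverted
```

Here `understatement` is about 4.45, the factor quoted in the text. Giving the constant a descriptive name, rather than burying a bare 4.45 in an equation, makes the conversion harder to overlook.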
To understand how this error occurred, I spoke with John Casani, the onetime chief engineer at JPL who led the lab’s internal investigation into the mishap. I also spoke with Steve Jolly of Lockheed Martin, who was the lead systems engineer for the Mars Climate Orbiter. Casani’s and Jolly’s accounts agreed in one respect: they both told me that the failure to convert the units was primarily the fault of a young engineer who had only completed college a couple of months earlier and who was a new hire at Lockheed Martin. (This person has never been identified by name.)
Casani and Jolly gave me somewhat differing accounts of how the young engineer actually came to make the mistake. According to Casani, the engineer was given the documentation from the thruster manufacturer that contained performance data in pounds, as well as a set of instructions from JPL that specified that the output of the Small Forces file should be in newtons. The engineer simply failed to read the JPL document with sufficient care and thus overlooked the requirement for the conversion. According to this account, then, the engineer thought he was doing the right thing by providing the output in English units.
According to Jolly (who presumably would have been more knowledgeable about the matter), the engineer did know that pounds needed to be converted to newtons. The reason he failed to make the conversion, Jolly told me, had to do with the ‘heritage’ issue. The Mars Global Surveyor had used similar navigational software, and the plan was to save money by having the Climate Orbiter ‘inherit’ or reuse it. The new spacecraft used different thrusters, however, so the portion of the software relating to thruster performance was excised, and the engineer’s task was to replace that portion with code incorporating performance data for the new thrusters. Unfortunately, he assumed that the code that made the conversion of units was left in the unexcised Global Surveyor software, whereas in fact it was in the excised portion. The conversion was represented simply by the number 4.45 in an equation, without any comment as to its purpose, so it was easy to miss. Thus, in writing the new code, the engineer left the units in pounds, thinking that the required conversion would be made by the pre-existing software.
Although the engineer’s mistake was the root cause of the mishap, such mistakes are inevitable as long as science is done by humans. The more serious error was the failure of anyone to spot the mistake. Part of the problem was that a factor of 4.45 is not a terribly large error in engineering terms: the faulty code produced output that looked quite reasonable. In fact, if the Mars Global Surveyor (with its symmetrical solar panels) had incorporated the same error, that mission would probably not have been affected. It was only the asymmetrical design of the Climate Orbiter, with the resulting need for numerous AMD events, that allowed the small individual errors to accumulate to a mission-endangering level.
Following standard procedures, the faulty software was reviewed, but the error wasn’t spotted. Then it went through formal testing: using fictional AMD events, the output of the software was compared with the output of manual calculations. Unfortunately, the manual calculations somehow incorporated the same error as was present in the faulty software, so the two outputs were in agreement and the software was judged to be good.
The Small Forces software was not actually loaded into the spacecraft’s computer; rather, it was placed in computers that remained on the ground. The idea was that every time an AMD event occurred, the spacecraft would radio back data about the length of firing of the thrusters, the attitude of the spacecraft and so on, and the navigational team would then feed the data into the ground computer to extract measures of the magnitude, duration, and direction of thrust. These measures would then be used to adjust the model of the spacecraft’s trajectory.
By a bitter irony, the spacecraft’s own computer did in fact possess software to make this calculation independently, and these files correctly specified the resulting thrust in newtons. ‘You can imagine how many times I wake up at night thinking about that,’ said Jolly. In fact, the spacecraft was even programmed to radio the output of these calculations to the ground, but the navigators did not know this so no one looked at the incoming data packets or compared them to the output of the erroneous calculations being performed on the ground. If they had done so, the error would have been quickly detected. Even when I spoke with him in 2006, after NASA’s official inquiry had established and published the fact, Jolly said that he didn’t know that the spacecraft had been transmitting the correct data to Earth.
On December 11, 1998, the Mars Climate Orbiter was launched from Cape Canaveral Air Station in Florida, atop a Delta II rocket. The Delta II is a relatively inexpensive, medium-powered launch vehicle. So as not to exceed the Delta II’s lifting capacity, the mission planners had to economise on the weight of fuel carried by the spacecraft – fuel which was required for slowing the spacecraft when it reached Mars. The planners took two steps to save on fuel. First, they sent the spacecraft by a long route that took it more than halfway around the sun: this ensured that it was travelling relatively slowly as it approached Mars, but it lengthened the trip to nine months rather than the six months needed for a more direct route. Second, they planned to accomplish some of the slowing by aerobraking – repeatedly dipping the spacecraft into Mars’s outer atmosphere on successive orbits after the first encounter – rather than relying entirely on the spacecraft’s engines. Even with these measures, the orbital insertion burn would have to slow the spacecraft by nearly 5,000 kilometres per hour, a task that would consume nearly 300 kilograms of fuel – almost half the total weight of the spacecraft.
The launch went flawlessly. The Delta II’s first two stages lifted the spacecraft into low Earth orbit, then the third stage booster rocket fired for 88 seconds, kicking the spacecraft out of Earth’s gravitational clutches. After the booster separated, the spacecraft deployed its solar panels and began its long, unpowered cruise toward Mars.
Teams at JPL and Lockheed Martin, led by JPL flight operations manager Sam Thurman, monitored and controlled the spacecraft during its journey. Part of the team was a group of four JPL navigators, led by Pat Esposito, whose task was to determine the spacecraft’s trajectory and calculate the required corrections during the flight. The team was also responsible for two other spacecraft, however – Mars Global Surveyor (which was orbiting Mars) and Mars Polar Lander (which was launched on January 3). Only one team member, Eric Graat, could give his undivided attention to the Mars Climate Orbiter. This was a low level of staffing compared with previous and subsequent missions: the successful Mars Odyssey mission of 2001, for example, boasted a 15-member navigation team. Although neither Esposito nor Graat agreed to speak with me, Sam Thurman told me that they were very much overworked.
The first trajectory correction manoeuvre (TCM-1) took place ten days after launch. It corrected a deliberate mis-aim in the launch trajectory – a mis-aim whose purpose was to ensure that the third-stage booster did not strike Mars and contaminate the planet with terrestrial germs. The manoeuvre involved an elaborate sequence of operations. First, the solar array was folded and locked against the spacecraft body to protect it from damage, then the entire spacecraft was rotated so that the firing of its aft-pointing thrusters would deflect the craft’s trajectory in the right direction, and then the thrusters were fired for a few minutes to achieve the correct trajectory. Finally, the spacecraft was rotated back into its flight orientation and the solar panels were deployed once more. A second, much smaller trajectory manoeuvre (TCM-2) was performed on January 26, 1999, and it, too, went according to plan.
About every 17 hours during the flight, the spacecraft automatically performed angular momentum desaturation (AMD) procedures, firing its thrusters for a few seconds to allow the reaction wheels to be decelerated. The navigators had not been expecting the AMD events to occur so frequently, because when they came onto the job they were not familiar with the Orbiter and they did not realise that its asymmetrical design would cause an increased tendency to spin under the influence of solar radiation.
During the first four months of the flight, the navigators did not use the Small Forces software to calculate the effects of the AMDs on the spacecraft’s trajectory. This was because the software not only contained the units error (which no one was aware of), but also some other bugs that had come to light. Because the tiny effects of the AMD events would only really be important for the final approach to Mars and orbital insertion, the navigators simply did without the output of the Small Forces software, planning to incorporate the data at a later time.
Finally, in mid-April, the ground software was delivered and put into operation. Now the effects of the AMD events (including those that had already taken place) were incorporated into the navigational calculations. But for each AMD event the software told the navigators that the spacecraft had been deflected by an amount that was nearly five times smaller than what had actually occurred, thanks to the poison pill that was the units error. Still, each individual navigational solution looked good, because the error was in an unobservable dimension perpendicular to the line of sight.