Chasing New Horizons

by Alan Stern

This was Brian Bauer’s theory. Brian was then the mission’s autonomy system engineer, who had coded the recovery procedure that the spacecraft would automatically go through in just this situation. Brian told Alice, “If that is what happened, then the spacecraft will restart using the backup computer, and sixty to ninety minutes from now we’ll get a radio signal with New Horizons operating on the backup computer.”

The engineers, the Aces, along with Alice, Glen, and Alan waited out those long minutes, making contingency plans in case Brian’s hypothesis was incorrect. But sure enough, after ninety minutes, a signal arrived from New Horizons indicating it had switched to the backup computer.

Communication had been restored, and with that, the fear of a catastrophic loss of the spacecraft evaporated. But the crisis wasn’t over; it had just entered a new phase.

ONCE AGAIN: “WHATEVER IT TAKES”

The MOC and its surrounding offices were quickly filling up with engineers, more flight-control team members, and others on the project who had cut short their holiday weekend to come in and assist. People were arriving in shorts and flip-flops, in their picnic clothes, having dropped everything to get to the MOC.

As more telemetry came back from the bird, they learned that all of the command files for the flyby that had been uploaded to the main computer had been erased when the spacecraft rebooted to the backup computer. This meant that the Core flyby sequence sent that morning would have to be reloaded. But worse, numerous supporting files needed to run the Core sequence, some of which had been loaded as far back as December, would also need to be sent again. Alice recalls, “We had never recovered from this kind of anomaly before. The question was, could we do it in time to start the flyby sequence, scheduled to begin on July 7?”

That meant the team had just three days to put Humpty Dumpty back together again, from 3 billion miles away. If they couldn’t, then with every passing day they would lose dozens of unique, close-up Pluto system observations that were part of the exquisitely constructed Core load flyby plans. The mission team suddenly found itself in a three-day race to salvage everything they had spent years planning and months uploading.

The New Horizons process to get back on track after any spacecraft anomaly is shaped around a series of formal meetings called ARBs, or Anomaly Review Boards. Soon after 4:00 P.M., only forty-five minutes after spacecraft recontact, the July 4 anomaly’s first ARB was convened in the meeting room adjacent to the MOC.

At that kickoff ARB meeting the team members had to assess what had happened, how to restore the flyby plan, and how to make sure they wouldn’t accidentally do anything during the recovery that would cause another problem on the spacecraft. The scope of how far they had been set back by the reboot onto the backup computer was stupefying. It was quickly estimated that they would have to perform the equivalent of several weeks of work in just three days to start the flyby Core sequence on time on July 7. And it would all have to be done flawlessly.

What made this even worse was that every move had to be done by remote control with a nine-hour round-trip radio communication time between mission control and the spacecraft. Science classes teach how the speed of light is incredibly fast, how a signal moving at that speed can travel around the world in an eighth of a second or to the Moon and back, a half-million-mile trip, in just two and a half seconds. But for the New Horizons team trying to get their spacecraft back on track as it closed in on Pluto, the great distance between Earth and New Horizons made the speed of light seem excruciatingly slow.
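As a rough check on that nine-hour figure, the round-trip light time can be worked out from the distance of about 3 billion miles quoted earlier. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the distance is the rounded value from the text, not a precise ephemeris.

```python
# Speed of light in miles per second (a standard physical constant).
SPEED_OF_LIGHT_MILES_PER_S = 186_282.397


def round_trip_hours(distance_miles):
    """Return the round-trip signal time, in hours, over a one-way distance."""
    one_way_seconds = distance_miles / SPEED_OF_LIGHT_MILES_PER_S
    return 2 * one_way_seconds / 3600


# Rounded Earth-to-spacecraft distance taken from the text (an approximation).
DISTANCE_TO_NEW_HORIZONS_MILES = 3e9

print(f"{round_trip_hours(DISTANCE_TO_NEW_HORIZONS_MILES):.1f} hours")
```

With the rounded distance, the result comes out a little under nine hours, which matches the round-trip time the team had to plan around.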

Those assembling for the ARB knew that with all the press attention, the world would soon be aware that New Horizons had tripped over itself on the verge of its flyby. In just ten days, the spacecraft would hurtle through the Pluto system; nothing could stop that celestial mechanics. But whether it would be gathering the data it had journeyed almost a decade to collect was something else.

Alan and Glen opened the meeting, telling the ARB that there was no finer spacecraft team they’d ever known than on New Horizons, and that if any team could pull off this recovery, it was the group in that room. Then Alice took the floor and began architecting how they would effect a recovery.

Alice immediately asked Alan about the science observations being lost that day and in the next three days before the close flyby sequence was to kick off on July 7. She wanted to know, from the PI, if her team should also attempt to recover those observations, in addition to reconfiguring the spacecraft and getting all the files and command load up to the bird for the close flyby. Alan:

I didn’t call for any discussion of it from the other science team members in the room. I didn’t even let my flyby-planning czar, Leslie, weigh in. I knew for a fact that Alice’s team needed crisp direction, with no fuzz on it, and that they needed to focus on saving the main event, rather than the preliminary observations we were losing with the spacecraft idled due to the reboot. I told Alice that anything beyond getting us back on track to initiate the close flyby itself, on time, would be a distraction.

Alice wanted further clarification, and asked me, very precisely, “How much of the current command load’s science can I trash?” I knew what was at stake. I knew what was icing and what was cake. I estimated that the Core load probably contained 95 percent of everything we wanted to accomplish at Pluto. All the other command loads combined, including this one which was now suspended because of the anomaly, were just details by comparison. I looked Alice right in the eye and said, “The Core load is all that matters to me. So do whatever it takes to kick it off successfully on the seventh. Trash as much as you need to in between.”

With that, Alice had her marching orders. Her sole job now was to save the Core flyby sequence; everything else was expendable. But could it be done in time?

Alice and her team quickly but methodically devised a recovery plan. In the next three days, they had to design and build all the command procedures to get the spacecraft back onto its primary computer, then to resend all the lost command and support files for the Core load, and they would have to test all of this on the NHOPS spacecraft simulator before any actions were taken, to ensure that each step would work on the first try—there was no leeway for repeats. They knew when the flyby Core sequence needed to engage, which would be noontime on July 7. So Alice’s team took the total time available until then and divided it up into nine-hour round-trip light-times—the amount of time it would take to send each set of procedures to run on the spacecraft and receive confirmation that it had performed successfully. Counting everything else that had to be done on the ground, they found there was time for only three of these communications cycles before the Core load would need to engage mid-day on July 7.

Thus, the recovery would be split into three steps. First, the team would command the spacecraft to restore normal, rather than emergency, communications. That would up the communications bit rates by a factor of one hundred, making the rest of the recovery possible to do in time. That first step alone, they estimated, would take about half a day to code, test, send to New Horizons, and get confirmation back that it had succeeded. Tick, tock.

Next, the team would command the spacecraft to reboot onto its primary computer. This was needed in order to use the flyby command load as coded. A reboot from the backup to the prime computer had never been done in flight. So a procedure had to be designed and coded for that, and tested on NHOPS, and the test results then had to be checked before that procedure could be sent to New Horizons. Finally, the team would have to methodically restore all the Core flyby files and engage the flyby timeline. It was nearly midnight by the time this plan was architected, and there was no time to spare: the clock had already bled down over ten hours since the loss of contact that afternoon. Tick, tock.

Alice’s mission operations team, working closely with Chris Hersman’s spacecraft systems team, wrote, tested, and then sent up the first set of commands about twelve hours after they had reestablished spacecraft comm, at about 3:15 A.M. on July 5.

Nine hours later, midday on the fifth, the MOC received confirmation that normal communications had been restored! But a day had passed, and New Horizons had swept nearly another million miles toward its destiny at Pluto. Recovery step 1 was complete, but now only two days remained until the Core flyby sequence needed to engage. Tick, tock.
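The three-cycle budget can be roughly reproduced from the timings given so far. The sketch below assumes that every cycle costs what step 1 did: about twelve hours of ground work followed by a nine-hour round trip. That uniform cost is an assumption made for illustration, not a figure from the team's plan. The recontact time of about 3:15 P.M. on July 4 is also approximate, and the function name is hypothetical.

```python
from datetime import datetime

ROUND_TRIP_HOURS = 9    # round-trip light time, from the text
GROUND_PREP_HOURS = 12  # ground work for step 1; assumed to hold for later steps


def cycles_available(start, deadline,
                     prep_hours=GROUND_PREP_HOURS,
                     round_trip_hours=ROUND_TRIP_HOURS):
    """Count the whole prepare-send-confirm cycles that fit before the deadline."""
    hours_left = (deadline - start).total_seconds() / 3600
    return int(hours_left // (prep_hours + round_trip_hours))


recontact = datetime(2015, 7, 4, 15, 15)  # approximate time of recontact
core_start = datetime(2015, 7, 7, 12, 0)  # Core sequence due to engage at noon

print(cycles_available(recontact, core_start))
```

Under these assumptions there are just under sixty-nine hours to work with and each cycle takes twenty-one, so only three complete cycles fit.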

THE INCREDIBLES

The New Horizons team organized their work, and their lives, for the next few days around the nine-hour communications cycles to the spacecraft and back. They ran on very little sleep and lots of adrenaline. They had worked together for over a decade and had encountered problems on the spacecraft before, but no problem of this scope or with such high stakes had ever occurred. It demanded a round-the-clock existence in mission control, and the team delivered.

Glen recalls, “The team just did what they needed to do. I started searching for places for people to sleep, trying to find something more comfortable than their office floors.” And Alice remembers, “We found cots, blankets, and pillows, and someone brought in an air mattress. There weren’t enough, so we were sharing.” Alan:

You should have seen it. Without a single complaint, people worked day and night—without so much as changes of clothes or places to properly sleep or shower, in some cases for four days straight. Some people were sleeping on desks. Some were living on just two or three hours of catnaps per day. There was no time for restaurant meals. We brought in people just to find takeout and keep the team fed.

RECOVERY

In order to ensure that this and every step of the recovery was going to work as intended, it was essential that each of the recovery procedures be tested on NHOPS. Because NHOPS so faithfully simulated the spacecraft, command-load testing on it could be used to work out bugs and certify that the instructions that would be sent to New Horizons itself would be error free.

As it turned out, a decision made years earlier proved to be a lifesaver during the recovery. Recall that Alan had become so concerned about the team lacking a complete backup to NHOPS that a second one was built. Well, during the weekend of July 4, there simply was not enough time to test all the new command loads needed to recover using only a single NHOPS. So they doubled up, using that second NHOPS to fit in more test runs. Had there been no NHOPS-2, the recovery would have taken days longer, and whole swaths of unique Pluto science would have been lost forever.

Using procedures tested on NHOPS-1 and NHOPS-2, the middle step of taking New Horizons out of safe mode and getting it back on the primary flight computer succeeded and was confirmed by telemetry sent by the spacecraft on July 6.

Next, the spacecraft had to be configured just as it had been prior to the attempt to upload the flyby script on July 4, and then, as a final step, the Core load had to be sent back up again, and with it all the dozens of associated support files that had been lost when the anomaly rebooted the primary computer. Those steps and all the NHOPS testing for them, including many Anomaly Review Board meetings to plan and certify each step, took round-the-clock work on the sixth.

But somehow, by late morning on July 7 all the recovery work was complete. Exhausted, the team had managed to get the spacecraft back on track and ready to go for the flyby. They had completed it with just four hours to spare before the Core load needed to engage.

DOING THE FORENSICS

What science was lost because of the July 4th anomaly and recovery fireworks? In saving the day for New Horizons, Alice and her team did follow Alan’s directive to do “whatever it takes” to save the Core flyby. So in the end they did trash all the observations that would have taken place during the three days of the anomaly recovery, because there was simply no way to replan them and also get the spacecraft out of safe mode and ready to start the close flyby on time.

But Alice’s team did manage to save the sixty-three images that were in the process of being compressed when the anomaly occurred. Those images had to be compressed to fit in storage because the larger, raw images had to be deleted to open up more memory space for flyby data. During the recovery operations, Alice’s team spotted an open window in the spacecraft operations timeline and managed to get that compression rescheduled, saving every single one of those precious sixty-three images.

What about all the approach observations that were trashed during the July 4 weekend recovery of the spacecraft? Alan assigned flyby planning czar Leslie Young the task of forming a tiger team to analyze just that. Leslie and her troops worked during the three days of the spacecraft recovery to look at every lost observation and its impact on the overall science return at Pluto. They found that each one had a later observation that was at higher resolution or closer range, meaning no objectives had been lost—except in one case. That was the final satellite search around Pluto that had been planned to take place on July 5 and 6, when New Horizons was still far enough out to blanket the space around Pluto with images. That sequence would have searched with several times the sensitivity of the previous search made just days before the anomaly occurred. When all the satellite search images were later scoured carefully by the New Horizons science team, no new satellites were found. This surprised many on the science team, since every time the Hubble Space Telescope had looked harder, it had found more moons. Would New Horizons have discovered satellites in that trashed, final, better search? No one knows, or will know, perhaps, until some future Pluto orbiter mission arrives, to search again.

And why did the July 4 anomaly happen in the first place? Shouldn’t the team have anticipated the combination of activities that led the main computer to be overwhelmed, and tested for it?

The sequence running on New Horizons on July 4 had been thoroughly tested. But as it turns out, the activity overlap that produced the computer overload only happened because of a fluke in timing in the way the DSN transmission to load the Core sequence was scheduled and executed. Had the Core load been sent just hours earlier or later, the computer would not have needed to store it while also doing the intensive work of compressing those precious Pluto images. So, should the team have realized that these activities could have overlapped and specifically tested for that possibility? In hindsight, yes. But when the intensive load testing for the flyby sequences took place back in 2013, the DSN schedule for 2015 wasn’t set, and the probability of that bad confluence between Core load storage and image compression occurring was very small. Alan:

Looking back in hindsight, there is no doubt that we should have spotted the possibility of the bad confluence and tested for it as a contingency, if not in 2013 before the DSN schedule for Core load transmission was set, then in 2015 after it had been. That oversight is on us, and it caused our Fourth of July fireworks. But it’s a wonder to me that that was the only detail we missed—among literally tens of thousands in the encounter operations—that marred any portion of the flyby. All those years of planning, testing, simulating, asking so many what-if questions, and more, really paid off, creating a bulletproof flyby plan in every other respect.

15

SHOWTIME

WHEN BETTER ISN’T BETTER

Amid the many dozens of activities going on behind the scenes at APL as the close flyby was beginning, a key decision had to be made: as we previously explained, to accomplish the objectives of the flyby, the spacecraft had to arrive at the closest-approach point within plus or minus nine minutes—just 540 seconds—of the planned time. Only then could all of the spacecraft pointing maneuvers properly center Pluto and its moons in the camera and spectrometer boresights.

Much of how New Horizons achieved this goal was done with careful optical navigation and rocket-engine firings that the spacecraft performed to home in on the closest-approach time. But mathematical analysis had shown that this alone might not be good enough to guarantee arrival in the critical plus-or-minus-540-second window. So the spacecraft’s designers at APL also built in some clever software to correct for any remaining timing errors once it was too late to fire the engines. That software is called a “timing knowledge update”; what it does is adjust the onboard clock on New Horizons, basically faking out the spacecraft to think that it is a little bit ahead of or behind where it really is in executing the Core load. The end result slides all the planned flyby activities backward or forward by up to 540 seconds to synch them up with the final, best predicted time of arrival. The process to do this had been tested many times in ground simulations using NHOPS. But it hadn’t been needed at the Jupiter flyby in 2007, so it remained something that had not been proven aboard New Horizons itself.
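The idea of sliding a whole timeline by a single clock offset can be illustrated with a toy sketch. This is not the flight software, whose internals are not described here. The function name, the list-of-seconds representation, and the choice to reject an out-of-range offset are all assumptions made for illustration; only the 540-second limit comes from the text.

```python
MAX_SHIFT_SECONDS = 540  # largest correction described in the text


def apply_timing_update(planned_times_s, arrival_error_s):
    """Shift every planned activity time by the predicted arrival error.

    planned_times_s: activity start times, in seconds relative to the
        planned closest approach (hypothetical representation).
    arrival_error_s: predicted arrival error in seconds; positive means
        the spacecraft arrives later than planned.
    """
    if abs(arrival_error_s) > MAX_SHIFT_SECONDS:
        # Assumption: an offset beyond the limit is refused, not truncated.
        raise ValueError("correction exceeds the 540-second limit")
    return [t + arrival_error_s for t in planned_times_s]


# Example: three made-up activities, with arrival predicted 90 seconds late.
print(apply_timing_update([-600, 0, 450], 90))
```

Every activity moves by the same amount, so the spacing between observations is preserved and only their alignment with the actual arrival time changes.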

As the spacecraft approached Pluto, every day the optical navigation team used new images to determine just how far off the closest approach timing was going to be, and then calculated the timing knowledge update needed to correct for that. Concurrently, Leslie Young and her encounter planning team used sophisticated software tools to generate a “science consequences report,” in which each close-approach observation was simulated for the newly predicted timing error to determine, assuming no correction was made, which would succeed and which would fail.

Remarkably, once the final engine burn had been made and New Horizons was on final approach, the predicted timing errors were looking surprisingly small—less than two minutes—way inside the nine-minute-long maximum acceptable error. Leslie’s science consequences reports showed that there was no observation predicted to fail if no correction was made, though a few observations would be improved—by being better centered—if the team did make the timing knowledge update correction.
