The Hacker Crackdown
Over the first half of the twentieth century, "electromechanical" switching systems of growing complexity were cautiously introduced into the phone system. In certain backwaters, some of these hybrid systems are still in use. But after 1965, the phone system began to go completely electronic, and this is by far the dominant mode today. Electromechanical systems have "crossbars," and "brushes," and other large moving mechanical parts, which, while faster and cheaper than Leticia, are still slow, and tend to wear out fairly quickly.
But fully electronic systems are inscribed on silicon chips, and are lightning-fast, very cheap, and quite durable. They are much cheaper to maintain than even the best electromechanical systems, and they fit into half the space. And with every year, the silicon chip grows smaller, faster, and cheaper yet. Best of all, automated electronics work around the clock and don't have salaries or health insurance.
There are, however, quite serious drawbacks to the use of computer-chips. When they do break down, it is a daunting challenge to figure out what the heck has gone wrong with them. A broken cordboard generally had a problem in it big enough to see. A broken chip has invisible, microscopic faults. And the faults in bad software can be so subtle as to be practically theological.
If you want a mechanical system to do something new, then you must travel to where it is, and pull pieces out of it, and wire in new pieces. This costs money. However, if you want a chip to do something new, all you have to do is change its software, which is easy, fast and dirt-cheap. You don't even have to see the chip to change its program. Even if you did see the chip, it wouldn't look like much. A chip with program X doesn't look one whit different from a chip with program Y.
With the proper codes and sequences, and access to specialized phone-lines, you can change electronic switching systems all over America from anywhere you please.
And so can other people. If they know how, and if they want to, they can sneak into a microchip via the special phone-lines and diddle with it, leaving no physical trace at all. If they broke into the operator's station and held Leticia at gunpoint, that would be very obvious. If they broke into a telco building and went after an electromechanical switch with a toolbelt, that would at least leave many traces. But people can do all manner of amazing things to computer switches just by typing on a keyboard, and keyboards are everywhere today. The extent of this vulnerability is deep, dark, broad, almost mind-boggling, and yet this is a basic, primal fact of life about any computer on a network. Security experts over the past twenty years have insisted, with growing urgency, that this basic vulnerability of computers represents an entirely new level of risk, of unknown but obviously dire potential to society. And they are right. An electronic switching station does pretty much everything Leticia did, except in nanoseconds and on a much larger scale. Compared to Miss Luthor's ten thousand jacks, even a primitive 1ESS switching computer, 1960s vintage, has 128,000 lines. And the current AT&T system of choice is the monstrous fifth-generation 5ESS.
An Electronic Switching Station can scan every line on its "board" in a tenth of a second, and it does this over and over, tirelessly, around the clock. Instead of eyes, it uses "ferrod scanners" to check the condition of local lines and trunks. Instead of hands, it has "signal distributors," "central pulse distributors," "magnetic latching relays," and "reed switches," which complete and break the calls. Instead of a brain, it has a "central processor." Instead of an instruction manual, it has a program. Instead of a handwritten logbook for recording and billing calls, it has magnetic tapes. And it never has to talk to anybody. Everything a customer might say to it is done by punching the direct-dial tone buttons on their subset.
Although an Electronic Switching Station can't talk, it does need an interface, some way to relate to its, er, employers. This interface is known as the "master control center." (This interface might be better known simply as "the interface," since it doesn't actually "control" phone calls directly. However, a term like "Master Control Center" is just the kind of rhetoric that telco maintenance engineers -- and hackers -- find particularly satisfying.)
Using the master control center, a phone engineer can test local and trunk lines for malfunctions. He (rarely she) can check various alarm displays, measure traffic on the lines, examine the records of telephone usage and the charges for those calls, and change the programming.
And, of course, anybody else who gets into the master control center by remote control can also do these things, if he (rarely she) has managed to figure them out, or, more likely, has somehow swiped the knowledge from people who already know.
In 1989 and 1990, one particular RBOC, BellSouth, which felt particularly troubled, spent a purported $1.2 million on computer security. Some think it spent as much as two million, if you count all the associated costs. Two million dollars is still very little compared to the great cost-saving utility of telephonic computer systems. Unfortunately, computers are also stupid. Unlike human beings, computers possess the truly profound stupidity of the inanimate.
In the 1960s, in the first shocks of spreading computerization, there was much easy talk about the stupidity of computers -- how they could "only follow the program" and were rigidly required to do "only what they were told." There has been rather less talk about the stupidity of computers since they began to achieve grandmaster status in chess tournaments, and to manifest many other impressive forms of apparent cleverness.
Nevertheless, computers *still* are profoundly brittle and stupid; they are simply vastly more subtle in their stupidity and brittleness. The computers of the 1990s are much more reliable in their components than earlier computer systems, but they are also called upon to do far more complex things, under far more challenging conditions.
On a basic mathematical level, every single line of a software program offers a chance for some possible screwup. Software does not sit still when it works; it "runs," it interacts with itself and with its own inputs and outputs. By analogy, it stretches like putty into millions of possible shapes and conditions, so many shapes that they can never all be successfully tested, not even in the lifespan of the universe. Sometimes the putty snaps.
The stuff we call "software" is not like anything that human society is used to thinking about. Software is something like a machine, and something like mathematics, and something like language, and something like thought, and art, and information.... but software is not in fact any of those other things. The protean quality of software is one of the great sources of its fascination. It also makes software very powerful, very subtle, very unpredictable, and very risky.
Some software is bad and buggy. Some is "robust," even "bulletproof." The best software is that which has been tested by thousands of users under thousands of different conditions, over years. It is then known as "stable." This does *not* mean that the software is now flawless, free of bugs. It generally means that there are plenty of bugs in it, but the bugs are well-identified and fairly well understood.
There is simply no way to assure that software is free of flaws. Though software is mathematical in nature, it cannot be "proven" like a mathematical theorem; software is more like language, with inherent ambiguities, with different definitions, different assumptions, different levels of meaning that can conflict.
Human beings can manage, more or less, with human language because we can catch the gist of it. Computers, despite years of effort in "artificial intelligence," have proven spectacularly bad in "catching the gist" of anything at all. The tiniest bit of semantic grit may still bring the mightiest computer tumbling down. One of the most hazardous things you can do to a computer program is try to improve it -- to try to make it safer. Software "patches" represent new, untried un-"stable" software, which is by definition riskier.
The modern telephone system has come to depend, utterly and irretrievably, upon software. And the System Crash of January 15, 1990, was caused by an *improvement* in software. Or rather, an *attempted* improvement.
As it happened, the problem itself -- the problem per se -- took this form. A piece of telco software had been written in C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was *supposed* to "break" the "if clause." Instead, the "break" broke the "switch" statement.
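In C, that misstep is built into the language: "break" has no meaning for an "if" at all, so a "break" written inside an "if" that sits inside a "switch" silently exits the entire "switch." A minimal hypothetical sketch of that kind of construct -- invented names, no relation to AT&T's actual code -- looks like this:

    /* A hypothetical sketch of the construct described above, with
       invented names -- not AT&T's actual code.  In C, "break" does
       not apply to an "if"; written inside a "switch," it exits the
       entire "switch." */
    #include <stdio.h>

    void handle_message(int msg_type, int needs_retry)
    {
        int pending = 3;
        do {
            switch (msg_type) {
            case 1:
                if (needs_retry) {
                    printf("requeue message\n");
                    break;  /* meant to leave only the "if" clause... */
                }
                /* ...and execution was supposed to continue here.
                   Instead the "break" above exits the whole "switch,"
                   and this clean-up is silently skipped. */
                printf("case 1 clean-up\n");
                break;
            case 2:
                printf("case 2 handled\n");
                break;
            }
            pending--;
        } while (pending > 0);
    }

    int main(void)
    {
        handle_message(1, 1);   /* the clean-up never runs */
        return 0;
    }

The "do... while" loop keeps right on running; only the remainder of the "switch" is skipped, so the clean-up code quietly never executes -- exactly the species of subtle, compilable misstep described above.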
That was the problem, the actual reason why people picking up phones on January 15, 1990, could not talk to one another.
Or at least, that was the subtle, abstract, cyberspatial seed of the problem. This is how the problem manifested itself from the realm of programming into the realm of real life. The System 7 software for AT&T's 4ESS switching station, the "Generic 44E14 Central Office Switch Software," had been extensively tested, and was considered very stable. By the end of 1989, eighty of AT&T's switching systems nationwide had been programmed with the new software. Cautiously, thirty-four stations were left to run the slower, less-capable System 6, because AT&T suspected there might be shakedown problems with the new and unprecedentedly sophisticated System 7 network.
The stations with System 7 were programmed to switch over to a backup net in case of any problems. In mid-December 1989, however, a new high-velocity, high-security software patch was distributed to each of the 4ESS switches that would enable them to switch over even more quickly, making the System 7 network that much more secure. Unfortunately, every one of these 4ESS switches was now in possession of a small but deadly flaw.
In order to maintain the network, switches must monitor the condition of other switches -- whether they are up and running, whether they have temporarily shut down, whether they are overloaded and in need of assistance, and so forth. The new software helped control this bookkeeping function by monitoring the status calls from other switches. It only takes four to six seconds for a troubled 4ESS switch to rid itself of all its calls, drop everything temporarily, and re-boot its software from scratch. Starting over from scratch will generally rid the switch of any software problems that may have developed in the course of running the system. Bugs that arise will be simply wiped out by this process. It is a clever idea. This process of automatically re-booting from scratch is known as the "normal fault recovery routine." Since AT&T's software is in fact exceptionally stable, systems rarely have to go into "fault recovery" in the first place; but AT&T has always boasted of its "real world" reliability, and this tactic is a belt-and-suspenders routine.
The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their "OK" signals to the switch. The switch would make a little note to that effect in its "status map," recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work. Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play. The flaw caused the 4ESS switch to interact, subtly but drastically, with incoming telephone calls from human users. If -- and only if -- two incoming phone-calls happened to hit the switch within a hundredth of a second, then a small patch of data would be garbled by the flaw. But the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault-recovery mode for four to six seconds. And then the switch would be fine again, and would send out its "OK, ready for work" signal.
However, the "OK, ready for work" signal was the *very thing that had caused the switch to go down in the first place.* And *all* the System 7 switches had the same flaw in their status-map software. As soon as they stopped to make the bookkeeping note that their fellow switch was "OK," then they too would become vulnerable to the slight chance that two phone-calls would hit them within a hundredth of a second. At approximately 2:25 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, announced "I'm going down," then announced, "I'm back, I'm OK." And this cheery message then blasted throughout the network to many of its fellow 4ESS switches.
Many of the switches, at first, completely escaped trouble. These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second. Their software did not fail -- at first. But three switches -- in Atlanta, St. Louis, and Detroit -- were unlucky, and were caught with their hands full. And they went down. And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches. As more and more switches did have that bit of bad luck and collapsed, the call-traffic became more and more densely packed in the remaining switches, which were groaning to keep up with the load. And of course, as the calls became more densely packed, the switches were *much more likely* to be hit twice within a hundredth of a second.
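The shape of that chain reaction is easy to mimic in a few lines. The toy model below is purely illustrative -- invented probabilities, no relation to real 4ESS code -- covering the eighty stations that ran the new software: every recovering switch broadcasts an "OK" as it comes back, and each "OK" being bookkept raises the odds that another switch garbles its data and reboots in turn.

    /* A toy model of the cascade -- invented numbers, purely
       illustrative.  down[i] counts the ticks a switch still needs
       to finish its fault-recovery reboot; every switch that
       finishes broadcasts an "OK," and each "OK" in flight gives
       every healthy switch a small chance of garbling its data
       (two calls within a hundredth of a second) and going down
       too.  One tick stands in for roughly two seconds. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NSWITCHES 80   /* the stations running the new software */
    #define TICKS     60

    int main(void)
    {
        int down[NSWITCHES] = {0};
        int ok_msgs = 1;   /* one legitimate, minor fault recovery */

        srand(42);
        for (int t = 0; t < TICKS; t++) {
            int new_msgs = 0, in_recovery = 0;
            for (int i = 0; i < NSWITCHES; i++) {
                if (down[i] > 0) {
                    if (--down[i] == 0)
                        new_msgs++;        /* "I'm back, I'm OK" */
                    else
                        in_recovery++;
                } else if (ok_msgs > 0 && rand() % 100 < 5 * ok_msgs) {
                    down[i] = 3;           /* garbled; reboot ~6 s */
                    in_recovery++;
                }
            }
            ok_msgs = new_msgs;
            if (t % 5 == 0)
                printf("tick %2d: %2d switches in recovery\n",
                       t, in_recovery);
        }
        return 0;
    }

Run it and the count of crippled switches surges in waves a few ticks apart instead of settling down -- the same manic leaping up and down described below.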
It only took four seconds for a switch to get well. There was no *physical* damage of any kind to the switches, after all. Physically, they were working perfectly. This situation was "only" a software problem. But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept *knocking* one another down with their contagious "OK" messages. It took about ten minutes for the chain reaction to cripple the network. Even then, switches would periodically luck out and manage to resume their normal work. Many calls -- millions of them -- were managing to get through. But millions weren't. The switching stations that used System 6 were not directly affected. Thanks to these old-fashioned switches, AT&T's national system avoided complete collapse. This fact also made it clear to engineers that System 7 was at fault.
Bell Labs engineers, working feverishly in New Jersey, Illinois, and Ohio, first tried their entire repertoire of standard network remedies on the malfunctioning System 7. None of the remedies worked, of course, because nothing like this had ever happened to any phone system before.
By cutting out the backup safety network entirely, they were able to reduce the frenzy of "OK" messages by about half. The system then began to recover, as the chain reaction slowed. By 11:30 p.m. on Monday, January 15, sweating engineers on the midnight shift breathed a sigh of relief as the last switch cleared up.
By Tuesday they were pulling all the brand-new 4ESS software and replacing it with an earlier version of System 7. If these had been human operators, rather than computers at work, someone would simply have eventually stopped screaming. It would have been *obvious* that the situation was not "OK," and common sense would have kicked in. Humans possess common sense -- at least to some extent. Computers simply don't.
On the other hand, computers can handle hundreds of calls per second. Humans simply can't. If every single human being in America worked for the phone company, we couldn't match the performance of digital switches: direct-dialling, three-way calling, speed-calling, call-waiting, Caller ID, all the rest of the cornucopia of digital bounty. Replacing computers with operators is simply not an option any more.
And yet we still, anachronistically, expect humans to be running our phone system. It is hard for us to understand that we have sacrificed huge amounts of initiative and control to senseless yet powerful machines. When the phones fail, we want somebody to be responsible. We want somebody to blame. When the Crash of January 15 happened, the American populace was simply not prepared to understand that enormous landslides in cyberspace, like the Crash itself, can happen, and can be nobody's fault in particular. It was easier to believe, maybe even in some odd way more reassuring to believe, that some evil person, or evil group, had done this to us. "Hackers" had done it. With a virus. A trojan horse. A software bomb. A dirty plot of some kind. People believed this, responsible people. In 1990, they were looking hard for evidence to confirm their heartfelt suspicions.
And they would look in a lot of places.
Come 1991, however, the outlines of an apparent new reality would begin to emerge from the fog. On July 1 and 2, 1991, computer-software collapses in telephone switching stations disrupted service in Washington DC, Pittsburgh, Los Angeles and San Francisco. Once again, seemingly minor maintenance problems had crippled the digital System 7. About twelve million people were affected in the Crash of July 1, 1991. Said the New York Times Service: "Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers, but most seemed to think the problems stemmed from some unknown defect in the software running the networks."
And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to "glitches" in the "signal transfer point" software that DSC had designed for Bell Atlantic and Pacific Bell. The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation's capital of phone service. It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires *ten million* lines of code.
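The record doesn't say which character was mistyped. But purely as a hypothetical illustration of how a single wrong character in C can slip past a compiler, consider the classic slip of "=" (assignment) where "==" (comparison) was meant:

    /* A hypothetical illustration -- not DSC's actual flaw.  One
       single mistyped character: "=" (assignment) where "=="
       (comparison) was intended.  The program still compiles,
       but the test always succeeds and overwrites the variable. */
    #include <stdio.h>

    enum { HEALTHY = 0, FAULT = 1 };

    int main(void)
    {
        int status = HEALTHY;   /* the line is actually fine */

        if (status = FAULT)     /* meant: if (status == FAULT) */
            printf("entering fault recovery\n");  /* runs anyway */

        return 0;
    }

One keystroke, one line, and the program's judgment about its own health is quietly inverted.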
On Tuesday, September 17, 1991, came the most spectacular outage yet. This case had nothing to do with software failures -- at least, not directly. Instead, a group of AT&T's switching stations in New York City had simply run out of electrical power and shut down cold. Their back-up batteries had failed. Automatic warning systems were supposed to warn of the loss of battery power, but those automatic systems had failed as well.
This time, Kennedy, La Guardia, and Newark airports all had their voice and data communications cut. This horrifying event was particularly ironic, as attacks on airport computers by hackers had long been a standard nightmare scenario, much trumpeted by computer-security experts who feared the computer underground. There had even been a Hollywood thriller about sinister hackers ruining airport computers -- *Die Hard II.* Now AT&T itself had crippled airports with computer malfunctions -- not just one airport, but three at once, some of the busiest in the world. Air traffic came to a standstill throughout the Greater New York area, causing more than 500 flights to be cancelled, in a spreading wave all over America and even into Europe. Another 500 or so flights were delayed, affecting, all in all, about 85,000 passengers. (One of these passengers was the chairman of the Federal Communications Commission.)