The Theory That Would Not Die
On election night, as partial returns from counties and complete returns from selected precincts flowed in, Tukey and his colleagues watched for swings and deviations from past voting behavior and from political scientists’ opinions. Then they modified their initial odds with the new information.
As Wallace relived the moment, he said, “Say we’re working at a county level with data coming in. Suppose you had no returns from one county. A strict nonBayesian would say, ‘I can’t tell you anything there,’ but a slightly Bayesian person would say, ‘I don’t know what’s happening in county A, but county B is very similar and it’s showing a swing 5% toward Republicans.’ You might say that county A might be going the same way, but not give it great weight, because you do have to come up with a number. . . . ‘Okay,’ says Tukey, ‘go down low, take a group of counties that are similar, weight the data you get in these counties, give zero weight to non-data counties, and upgrade, update it all the time.’” Like Schlaifer at the Harvard Business School, Tukey had concluded that since he had to make a decision with inadequate information, whatever knowledge existed was better than nothing.
Wallace continued: “You take information where you have it and compute it with a lot of error bounds on it into places where you don’t have data. . . . You first of all work at rural area counties, then urban areas, north, south areas, or whatever, do it separately and play this game of upward regions and across states. It’s ‘borrowing strength,’ but I’d say it’s Bayesian. . . . You’re using historical data, from previous elections, to show variability between counties and that’s the source of your priors, so it’s very Bayesian, with a hierarchical model and historically based prior variances.”
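Tukey's actual NBC system was never published, so the following is only a minimal sketch of the kind of "borrowing strength" Wallace describes, written as a textbook normal hierarchical model with known variances. All names and numbers here (County, group_estimate, county_estimate, the variances) are hypothetical illustrations, not details from the source. Counties with returns are weighted by their precision, counties without returns get zero weight, and a silent county inherits its group's estimate with a wider error bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class County:
    name: str
    swing: Optional[float]     # observed swing in points; None means no returns yet
    sampling_var: float = 0.0  # noise variance of the observation (> 0 when swing is set)


def group_estimate(counties, prior_mean, prior_var, between_var):
    """Posterior mean and variance of the swing shared by a group of similar counties.

    prior_mean and prior_var stand in for the historically based prior;
    between_var is the assumed county-to-county variability within the group.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for c in counties:
        if c.swing is None:
            continue  # a county with no data gets zero weight
        weight = 1.0 / (c.sampling_var + between_var)
        precision += weight
        weighted_sum += weight * c.swing
    return weighted_sum / precision, 1.0 / precision


def county_estimate(county, group_mean, group_var, between_var):
    """Posterior mean and variance of one county's swing."""
    if county.swing is None:
        # No returns: borrow the group estimate, with extra uncertainty.
        return group_mean, group_var + between_var
    # Shrink the county's own figure toward the group; noisier data shrink more.
    b = county.sampling_var / (county.sampling_var + between_var)
    mean = (1.0 - b) * county.swing + b * group_mean
    var = (1.0 - b) * county.sampling_var + b * b * group_var
    return mean, var


# Wallace's example: county A is silent, similar county B shows a 5-point swing.
counties = [County("A", None), County("B", 5.0, 1.0)]
group_mean, group_var = group_estimate(
    counties, prior_mean=0.0, prior_var=100.0, between_var=4.0
)
estimates = {
    c.name: county_estimate(c, group_mean, group_var, between_var=4.0)
    for c in counties
}
for name, (mean, var) in estimates.items():
    print(f"county {name}: swing {mean:.2f} points, variance {var:.2f}")
```

With these made-up inputs, county A is assigned a swing a little under B's 5 points but with a much larger variance, which matches the instruction to follow a similar county "but not give it great weight." Rerunning the two functions as each new return arrives is the "update it all the time" step.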
Despite weeks of planning and rehearsals, election nights did not always proceed as hoped. Studio 8H in Rockefeller Center, where Huntley sat, was sacrosanct and off-limits to anyone without a special ID badge. But when Brillinger saw that the ID tag resembled the sugar packets in NBC’s canteen, he paperclipped a packet to his shirt and wandered happily around. For the election of 1964 between Lyndon Johnson and Barry Goldwater, NBC showcased seven of its mainframe computers on Studio 8H’s stage floor: several of RCA’s early 301 models and two spanking new 3301’s. All evening viewers could see their imposing large black boxes on screen behind Huntley. Unfortunately, the computers did not work that night, either because their operating system was unfinished or because the heat of the recording studio’s lights fried them. So there they sat all night long, like so many limp rags, Link thought, impressive but useless. With voting results pouring in from around the country, Tukey’s team punched furiously away at old-fashioned hand calculators and adding machines. Fortunately, the work was simple that night because LBJ’s victory was a foregone conclusion, and Johnson won with a record 61% of the popular vote.
In another election, the team called the winners early for California and New York. Then, late-arriving figures came in contradicting their announcement. Two tense hours passed before voting patterns moved back in line with their predictions. Another extremely tight election kept them at work from Tuesday afternoon straight through to Thursday afternoon. Tukey and Wallace realized they needed to improve their technique.
“It turned out that the problem of projecting turnout was more difficult than that of projecting candidate percentage,” Wallace discovered. “The quality of data you get is dubious, and you have clear biases of reporting coming in from one part of the county, and if you’re lucky, they might do it randomly but machine votes come in faster than non-machine votes. Sometimes there’s chicanery as well. The turnout is very hard to predict and that has a startling effect, so Bayes was not a total serving. I had a conversation with a student who was consulting with one of the other networks and he said to me, ‘This is just so wonderful because it’s the first place in statistics where all your assumptions are totally valid.’ I was appalled. . . . It’s a highly biased sample. And you’re going to get yourself into serious trouble if you don’t realize that.”24
Tukey continued to work for NBC through the election in 1980. After that, NBC switched to exit polls based on interviews of voters emerging from their precincts. Exit polls were cheaper, more photogenic, personal, and chatty. They were the polar opposite of Tukey’s secret, highly complex, mathematized approach.
Then came the biggest surprise of all. Like Churchill’s muzzle on postwar Bletchley Park, Tukey refused to let any of his colleagues write or even give talks about their polling methods. He never wrote about them either. He said they were proprietary to RCA.
Why the secrecy? Why did Tukey scorn Bayes’ rule in public but use it privately for two decades? Toward the end of his life he conceded that “probably the best excuse for Bayesian analysis is the need . . . to combine information from other bodies of data, the general views of experts, etc., with the information provided by the data before us.”25 He even defended Savage’s subjectivist gospel, that people could look at the same information yet reach different conclusions. Reject “the fetish of objectivity,” Tukey declared.26 He also used a Bayesian argument when testifying before Congress that the U.S. Census should adjust for its undercounts of minorities in some areas by incorporating information from other, similar regions. As of 2010, the Census Bureau had not done so.
So why didn’t—or couldn’t—Tukey use the B-word? As Brillinger noted, “Bayes is an inflammatory word.”27 Certainly, Tukey’s term “borrowing strength” allowed him to avoid it. Perhaps sidestepping it carved out a neutral workspace. Perhaps he felt the need to put his own stamp on another person’s work. Or perhaps there was another reason. Given Tukey’s personality, it’s difficult to know. Halfway through his stint with NBC, RCA gave up trying to compete with IBM and sold off its computer division to Sperry Rand. After that, why would RCA care whether Tukey’s system went public? Could RCA’s military sponsors have classified Tukey’s methods, and was he using Bayes’ rule for his classified cryptographic research?
Many details of Tukey’s national security career remain “murky, deliberately so on his part,” his nephew Anscombe concluded.28 But as Wallace says, “If you go to the secret coding agencies, you’d find that Bayes had a larger history. I’m not in a position to speak of that but I. J. Good is the principal contributor to the Bayesian group and he was taking that position.”29 Good was Alan Turing’s cryptographic assistant during the Second World War. So did Tukey use Bayes’ rule for decoding for the National Security Agency? And could he have been distancing himself from Bayesian methods in order to protect work there?
The ties between Tukey, Bayes, and top-secret decoding are many and close. Bayes’ rule is a natural for decoders who have some initial guesses and must minimize the time or cost to reach a solution; it has been widely used for decoding ever since Bletchley Park. Tukey’s ties to American cryptography were particularly tight. According to William O. Baker, then head of Bell Labs, Tukey was part of the force that helped decrypt Germany’s Enigma system during the Second World War and Soviet codes during the Cold War. Tukey served on NSA’s Science Advisory Board, which was devoted to cryptography. It was a ten-member panel of scientists from universities, corporate research laboratories, and think tanks; they met twice yearly at Fort Meade, Maryland, to discuss the application of science and technology to code breaking, cryptography, and eavesdropping. Baker, Tukey’s close friend, was probably the committee’s most important member. Baker chaired a long study of America’s coding and decoding resources for the NSA and called for a Manhattan Project–like effort to focus, not on publishable and freely available research, but on top-secret studies. Whether Tukey actually did hands-on cryptography is not known, but as a professional visiting-committee consultant, he was certainly aware of all the statistical methods being used.
Tukey’s relationship with Good, one of the leading Bayesians and cryptographers of the 1950s and 1960s, is also suggestive. Tukey visited Good in Britain and invited him to lecture at Bell Labs in October 1955. The day after Good’s talk he was surprised to find that Tukey, lying on the floor to relax, had obviously understood it all. Tukey was also sympathetic enough with Good’s Bayesian methods to introduce him to Cornfield at NIH and suggest that Good might help with statistical methods there; Cornfield became a prominent Bayesian.
Claude Shannon was also in the audience during Good’s talk. Shannon had used Bayes’ rule at Bell Labs for his pathbreaking cryptographic and communications studies during the Second World War. Tukey was close to Shannon; in 1946 Tukey coined the word “bit” for Shannon’s “binary digit.” Tukey, Shannon, and John R. Pierce applied together for a patent for a cathode ray device in 1948.
The evidence is substantial enough to convince some of Tukey’s colleagues, including Judith Tanur and Richard Link, that he probably did use Bayes’ rule for decoding at Bell Labs. Brillinger, Tukey’s biographer and NBC polling colleague, concluded, “I have no problem thinking that he might have.”30
Whatever the motivation, Tukey’s secrecy edict played a major role in the history of Bayes’ rule. As Wallace observed, “It’s important to the development of Bayesian statistics that a lot was under wraps.”31 Tukey’s censorship of his polling methods for NBC News, like the highly classified status of Bayesian cryptography during and after the Second World War, is one reason so few realized how much Bayes’ rule was being used.
Tukey’s Bayesian polling—conducted in the glare of international publicity for two of the most popular TV anchors of the day—could have spread the news of Bayes’ power and effectiveness and reinforced it at regular intervals. But his ban on speaking or writing about it meant that Bayes’ rule played a starring role on TV for almost two decades—without most statisticians knowing about it.
As a result, the only large computerized Bayesian study of a practical problem in the public domain during the Bayesian revival of the 1960s was the Mosteller–Wallace study of The Federalist in 1964. It would be 11 years before the next major Bayesian application appeared in public. And after Tukey stopped consulting for NBC in 1980, it would be 28 years before a presidential election poll utilized Bayesian techniques again.
When Nate Silver at FiveThirtyEight.com used hierarchical Bayes during the presidential race in November 2008, he combined information from outside areas to strengthen small samples from low-population areas and from exit polls with low response rates. He weighted the results of other pollsters according to their track records and sample size and how up to date their data were. He also combined them with historical polling data. That month Silver correctly predicted the winner in 49 states, a record unmatched by any other pollster. Had Tukey publicized the Bayesian methods used for NBC, the history of political polling and even American politics might have been different.
14. Three Mile Island
After years of working together, the two old friends Fred Mosteller and John Tukey reminisced in 1967 about how “the battle of Bayes has raged for more than two centuries, sometimes violently, sometimes almost placidly, . . . a combination of doubt and vigor.” Thomas Bayes had turned his back on his own creation; a quarter century later, Laplace glorified it. During the 1800s it was both employed and undermined. Derided during the early 1900s, it was used in desperate secrecy during the Second World War and afterward employed with both astonishing vigor and condescension.1 But by the 1970s Bayes’ rule was sliding into the doldrums.
A loss of leadership, a series of career changes, and geographical moves contributed to the gloom. Jimmie Savage, chief U.S. spokesman for Bayes as a logical and comprehensive system, died of a heart attack in 1971. After Fermi’s death, Harold Jeffreys and American physicist Edwin T. Jaynes campaigned in vain for Bayes in the physical sciences; Jaynes, who said he always checked to see what Laplace had done before tackling an applied problem, turned off many colleagues with his Bayesian fervor. Dennis Lindley was slowly building Bayesian statistics departments in the United Kingdom but quit administration in 1977 to do solo research. Jack Good moved from the super-secret coding and decoding agencies of Britain to academia at Virginia Tech. Albert Madansky, who liked any technique that worked, switched from RAND to private business and later to the University of Chicago Business School, where he claimed to find more applications than in statistics departments. George Box became interested in quality control in manufacturing and, with W. Edwards Deming and others, advised Japan’s automotive industry. Howard Raiffa also shifted gears to negotiate public policy, while Robert Schlaifer, the nonmathematical Bayesian, tried to program computers.
When James O. Berger became a Bayesian in the 1970s, the community was still so small he could track virtually all of its activity. The first international conference on Bayes’ rule was held in 1979, in Valencia, Spain, and almost every well-known Bayesian showed up—perhaps 100 in all.
Gone was the messianic dream that Bayes’ rule could replace frequentism. Ecumenical pragmatists spoke of synthesizing Bayesian and nonBayesian methods. The least controversial ideal, Mosteller and Tukey agreed, was either a frequency-based prior or a “gentle” prior based on beliefs but ready to be overwhelmed by new information.
When Box, J. Stuart Hunter, and William G. Hunter wrote Statistics for Experimenters in 1978, they intentionally omitted any reference to Bayes’ rule: too controversial to sell. Shorn of the big bad word, the book was a bestseller. Ironically, an Oxford philosopher, Richard Swinburne, felt no such compunctions a year later: he inserted personal opinions into both the prior hunch and the supposedly objective data of Bayes’ theorem to conclude that God was more than 50% likely to exist; later Swinburne would figure the probability of Jesus’ resurrection at “something like 97 percent.” These were calculations that neither the Reverend Thomas Bayes nor the Reverend Richard Price had cared to make, and even many nonstatisticians regarded Swinburne’s lack of careful measurement as a black mark against Bayes itself.
Throughout this period Jerzy Neyman’s bastion of frequentism at Berkeley remained the premier statistical center of the United States. Stanford’s large statistics department, bolstered by Charles Stein and other University of California professors who had refused to sign a McCarthy-era loyalty oath, was also enthusiastically frequentist, and anti-Bayesian signs adorned professors’ office doors.
Bayesians were treading water. Almost without knowing it they were waiting until computers could catch up. In the absence of powerful and accessible computers and software, many Bayesians and anti-Bayesians alike had given up attempts at realistic applications and retreated into theoretical mathematics. Herman Chernoff, whose statistical work often grew out of Office of Naval Research problems, got so impatient with theoreticians spinning their wheels on increasingly elaborate generalizations that he moved from Stanford to MIT in 1974 and then on to Harvard. “We had reached a period,” he wrote, “where we had to confront the computer much more intensively and we also had to do much more applied work . . . I thought, for the future, the field needed a lot more contact with real applications in order to provide insights into which way we should go, rather than concentrating on further elaborations on theory.” Chernoff was no Bayesian, but he told statistician Susan Holmes, then beginning her career, how to face difficult problems: “Start out as a Bayesian thinking about it, and you’ll get the right answer. Then you can justify it whichever way you like.”2
Within Bayesian circles, opinions were still defended passionately. Attending his first Bayesian conference in 1976, Jim Berger was shocked to see half the room yelling at the other half. Everyone seemed to be good friends, but their priors were split between the personally subjective, like Savage’s, and the objective, like Jeffreys’s—with no definitive experiment to decide the issue. Good moved eclectically between the two camps.
In a frustrated circle of blame, Persi Diaconis was shocked and angry when John Pratt used frequentist methods to analyze his wife’s movie theater attendance data, because there was too much for the era’s computers to handle. But one of the low moments in Diaconis’s life occurred in a Berkeley coffee shop, where he was correcting galley proofs of an article and Lindley blamed him for using frequency methods in the article. “And you’re our leading Bayesian,” Lindley complained.3 Lindley, in turn, upset Mosteller by passing up a chance to do a big project using Bayes instead of frequency. Every opportunity lost for Bayes was a blow to the cause and a reason for recrimination. By 1978 the Neyman–Pearson frequentists held “an uneasy upper hand” over the Bayesians, while a third, smaller party of Fisherians “snipe[d] away at both sides.”4
Few theorems could boast such a history. Bayesians had developed a broad system of theory and methods, but the outlook for proving their effectiveness seemed bleak. De Finetti predicted a paradigm shift to Bayesian methods—in 50 years, post-2020. The frequentist Bradley Efron of Stanford estimated the probability of a Bayesian twenty-first century at a mere .15.
Politicking for Bayes in Britain, Lindley said, “The change is happening much more slowly than I expected. . . . It is a slow job. . . . I assumed in a naïve way that if I spent an hour talking to a mature statistician about the Bayesian argument, he would accept my reasoning and would change. That does not happen; people don’t work that way. . . . I think that the shift will take place through applied statisticians rather than through the theoreticians.” Asked how to encourage Bayesian theory, he answered tartly, “Attend funerals.”5
With Bayesian theory in limbo, its public appearances were few and far between. Consequently, when the U.S. Congress commissioned the first comprehensive study of nuclear power plant safety, the question arose: would anyone dare mention Bayes by name, much less actually use Bayes’ rule?