The Undoing Project
Psychology had long been an intellectual garbage bin for problems and questions that for whatever reason were not welcome in other academic disciplines. The Oregon Research Institute became a practical extension of that bin. One early assignment came from a contracting company based in Eugene that had been hired to help build a pair of audacious skyscrapers in lower Manhattan, to be called the World Trade Center. The twin towers were to be 110 stories tall and built from light steel frames. The architect, Minoru Yamasaki, who had a fear of heights, had never designed any building higher than twenty-eight stories. The owner, the New York Port Authority, planned to charge higher rents for the upper floors and wanted the engineer, Les Robertson, to ensure that the high-paying tenants on the upper floors never sensed that the buildings moved with the wind. Realizing that this was not so much an engineering problem as a psychological one—how much could a building move before a person sitting at a desk on the ninety-ninth floor felt it?—Robertson turned to Paul Hoffman and the Oregon Research Institute.
Hoffman rented another building in another leafy Eugene neighborhood and built a room inside of it on top of the hydraulic wheels used to roll logs through Oregon’s lumber mills. At the press of a button the entire room could be made to rock back and forth, silently, like the top of a Manhattan skyscraper in a breeze. All of this was done in secrecy. The Port Authority didn’t want to alert its future tenants that they’d be swinging in the wind, and Hoffman worried that if his subjects knew they were in a building that moved, they would become more sensitive to movement and queer the experiment’s results. “After they’d designed the room,” recalled Paul Slovic, “the question was, how do we get people into the room without them knowing why?” And so after the “sway room” was built, Hoffman stuck a sign outside that read Oregon Research Institute Vision Research Center, and offered free eye exams to all comers. (He’d found a graduate student in psychology at the University of Oregon who happened also to be a certified optometrist.)
As the graduate student performed eye exams, Hoffman turned up the hydraulic rollers and made the room roll back and forth. The psychologists soon discovered that people in a building that was moving were far quicker to sense that something was off about the place than anyone, including the designers of the World Trade Center, had ever imagined. “This is a strange room,” said one. “I suppose it’s because I don’t have my glasses on. Is it rigged or something? It really feels funny.” The psychologist who ran the eye exams went home every night seasick.*
When they learned of Hoffman’s findings, the World Trade Center’s engineer, its architect, and assorted officials from the New York Port Authority flew to Eugene to experience the sway room themselves. They were incredulous. Robertson later recalled his reaction for the New York Times: “A billion dollars right down the tube.” He returned to Manhattan and built his very own sway room, where he replicated Hoffman’s findings. In the end, to stiffen the buildings, he devised, and installed in each of them, eleven thousand two-and-a-half-foot-long metal shock absorbers. The extra steel likely enabled the buildings to stand for as long as they did after they were struck by commercial airliners, and it bought time for the fourteen thousand people who escaped before the buildings collapsed.
For the Oregon Research Institute, the sway room was a bit of a diversion. Many of the psychologists who joined the place shared Paul Hoffman’s interest in human judgment. They also shared an uncommon interest in Paul Meehl’s book, Clinical versus Statistical Prediction, about the inability of psychologists to outperform algorithms when trying to diagnose, or predict the behavior of, their patients. It was the same book Danny Kahneman had read in the mid-1950s before he replaced the human judges of new Israeli soldiers with a crude algorithm. Meehl was himself a clinical psychologist, and kept insisting that of course psychologists like him and those he admired had many subtle insights that could never be captured by an algorithm. And yet by the early 1960s there was a swelling pile of studies that supported Meehl’s initial skepticism of human judgment.†
If human judgment was somehow inferior to simple models, humanity had a big problem: Most fields in which experts rendered judgments were not as data-rich, or as data-loving, as psychology. Most spheres of human activity lacked the data to build the algorithms that might replace the human judge. For most of the thorny problems in life, people would need to rely on the expert judgment of some human being: doctors, judges, investment advisors, government officials, admissions officers, movie studio executives, baseball scouts, personnel managers, and all the rest of the world’s deciders of things. Hoffman, and the psychologists who joined his research institute, hoped to figure out exactly what experts were doing when they rendered judgments. “We didn’t have a special vision,” said Paul Slovic. “We just had a feeling this was important: how people took pieces of information and somehow processed that and came up with a decision or a judgment.”
Interestingly, they didn’t set out to explore just how poorly human experts performed when forced to compete with an algorithm. Rather, they set out to create a model of what experts were doing when they formed their judgments. Or, as Lew Goldberg, who had arrived in 1960 at the Oregon Research Institute by way of Stanford University, put it, “To be able to spot when and where human judgment is more likely to go wrong: that was the idea.” If they could figure out where the expert judgments were going wrong, they might close the gap between the expert and the algorithms. “I thought that if you understood how people made judgments and decisions, you could improve judgment and decision making,” said Slovic. “You could make people better predictors and better deciders. We had that sense—though it was kind of fuzzy at the time.”
To that end, in 1960, Hoffman had published a paper in which he set out to analyze how experts drew their conclusions. Of course you might simply ask the experts how they did it—but that was a highly subjective approach. People often said they were doing one thing when they were actually doing another. A better way to get at expert thinking, Hoffman argued, was to take the various inputs the experts used to make their decisions (“cues,” he called these inputs) and infer from those decisions the weights they had placed on the various inputs. So, for example, if you wanted to know how the Yale admissions committee decided who got into Yale, you asked for a list of the information about Yale applicants that was taken into account—grade point average, board scores, athletic ability, alumni connections, type of high school attended, and so on. Then you watched the committee decide, over and over, whom to admit. From the committee’s many decisions you could distill the process its members had used to weigh the traits deemed relevant to the assessment of any applicant. You might even build a model of the interplay of those traits in the minds of the members of the committee, if your math skills were up to it. (The committee might place greater weight on the board scores of athletes from public schools, say, than on those of the legacy children from private schools.)
Hoffman’s math skills were up to it. “The Paramorphic Representation of Clinical Judgment,” he had titled his paper for the Psychological Bulletin. If the title was incomprehensible, it was at least in part because Hoffman expected anyone who read it to know what he was talking about. He didn’t have any great hope that his paper would be read outside of his small world: What happened in this new little corner of psychology tended to stay there. “People who were making judgments in the real world wouldn’t have come across it,” said Lew Goldberg. “The people who are not psychologists do not read psychology journals.”
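The inference Hoffman was describing amounts, in modern terms, to regressing an expert’s judgments on the cues the expert claims to use. The sketch below is only an illustration of that idea, not anything Hoffman or an admissions committee actually ran: the cue names, the synthetic data, and the use of ordinary least squares are assumptions made here for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row is one hypothetical applicant: [grade point average, board score,
# athletic rating], all rescaled to the 0-1 range for simplicity.
cues = rng.uniform(0, 1, size=(200, 3))

# A hypothetical committee member's ratings: mostly a weighted sum of the
# cues, plus noise standing in for the judge's inconsistency.
hidden_weights = np.array([0.5, 0.3, 0.2])
ratings = cues @ hidden_weights + rng.normal(0, 0.05, size=200)

# Ordinary least squares recovers the weights implicit in the judgments.
X = np.column_stack([cues, np.ones(len(cues))])   # cue columns plus an intercept
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)

print("inferred cue weights:", coef[:3].round(2))  # close to 0.5, 0.3, 0.2
print("intercept:", round(coef[3], 2))
```

The recovered coefficients play the role of Hoffman’s “paramorphic representation”: a simple formula that reproduces the judge’s decisions without claiming to describe what actually goes on in the judge’s head.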
The real-world experts whose thinking the Oregon researchers sought to understand were, in the beginning, clinical psychologists, but they clearly believed that whatever they learned would apply more generally to any professional decision maker—doctors, judges, meteorologists, baseball scouts, and so on. “Maybe fifteen people in the world are noodling around on this,” said Paul Slovic. “But we recognize we’re doing something that could be important: capturing what seemed to be complex, mysterious intuitive judgments with numbers.”
By the late 1960s Hoffman and his acolytes had reached some unsettling conclusions—nicely captured in a pair of papers written by Lew Goldberg. Goldberg published his first paper in 1968, in an academic journal called American Psychologist. He began by pointing out the small mountain of research that suggested that expert judgment was less reliable than algorithms. “I can summarize this ever-growing body of literature,” wrote Goldberg, “by pointing out that over a rather large array of clinical judgment tasks (including by now some which were specifically selected to show the clinician at his best and the actuary at his worst), rather simple actuarial formulae typically can be constructed to perform at a level of validity no lower than that of the clinical expert.”
So . . . what was the clinical expert doing? Like others who had approached the problem, Goldberg assumed that when, for instance, a doctor diagnosed a patient, his thinking must be complex. He further assumed that any model seeking to capture that thinking must also be complex. For example, a psychologist at the University of Colorado studying how his fellow psychologists predicted which young people would have trouble adjusting to college had actually taped psychologists talking to themselves as they studied data about their patients—and then tried to write a complicated computer program to mimic the thinking. Goldberg said he preferred to start simple and build from there. As his first case study, he used the way doctors diagnosed cancer.
He explained that the Oregon Research Institute had completed a study of doctors. They had found a gaggle of radiologists at the University of Oregon and asked them: How do you decide from a stomach X-ray if a person has cancer? The doctors said that there were seven major signs that they looked for: the size of the ulcer, the shape of its borders, the width of the crater it made, and so on. The “cues,” Goldberg called them, as Hoffman had before him. There were obviously many different plausible combinations of these seven cues, and the doctors had to grapple with how to make sense of them in each of their many combinations. The size of an ulcer might mean one thing if its contours were smooth, for instance, and another if its contours were rough. Goldberg pointed out that, indeed, experts tended to describe their thought processes as subtle and complicated and difficult to model.
The Oregon researchers began by creating, as a starting point, a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven factors the doctors had mentioned, equally weighted. The researchers then asked the doctors to judge the probability of cancer in ninety-six different individual stomach ulcers, on a seven-point scale from “definitely malignant” to “definitely benign.” Without telling the doctors what they were up to, they showed them each ulcer twice, mixing up the duplicates randomly in the pile so the doctors wouldn’t notice they were being asked to diagnose the exact same ulcer they had already diagnosed. The researchers didn’t have a computer. They transferred all of their data onto punch cards, which they mailed to UCLA, where the data was analyzed by the university’s big computer. The researchers’ goal was to see if they could create an algorithm that would mimic the decision making of doctors.
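The design can be made concrete with a small sketch. Everything in it is hypothetical: the cue codings, the equal weighting, and the duplicated, shuffled pile of cases are stand-ins for what the Oregon group actually sent to UCLA on punch cards.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ninety-six hypothetical ulcers, each described by seven cue ratings
# (size, border shape, crater width, and so on), coded here on a 1-7 scale.
n_cases, n_cues = 96, 7
cases = rng.integers(1, 8, size=(n_cases, n_cues))

# The researchers' starting point: an equal-weight model. Here that is
# simply the average of the seven cue ratings.
equal_weight_score = cases.mean(axis=1)

# Duplicate every case and shuffle the pile, so each ulcer is judged twice
# without the judges noticing.
order = rng.permutation(np.repeat(np.arange(n_cases), 2))
presented = cases[order]

print(presented.shape)                 # (192, 7): 96 ulcers, each shown twice
print(equal_weight_score[:5].round(2)) # equal-weight scores for the first five
```

The duplicates matter: they are what later made it possible to ask whether a doctor agreed with himself.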
This simple first attempt, Goldberg assumed, was just a starting point. The algorithm would need to become more complex; it would require more advanced mathematics. It would need to account for the subtleties of the doctors’ thinking about the cues. For instance, if an ulcer was particularly big, it might lead them to reconsider the meaning of the other six cues.
But then UCLA sent back the analyzed data, and the story became unsettling. (Goldberg described the results as “generally terrifying.”) In the first place, the simple model that the researchers had created as their starting point for understanding how doctors rendered their diagnoses proved to be extremely good at predicting the doctors’ diagnoses. The doctors might want to believe that their thought processes were subtle and complicated, but a simple model captured these perfectly well. That did not mean that their thinking was necessarily simple, only that it could be captured by a simple model. More surprisingly, the doctors’ diagnoses were all over the map: The experts didn’t agree with each other. Even more surprisingly, when presented with duplicates of the same ulcer, every doctor had contradicted himself and rendered more than one diagnosis: These doctors apparently could not even agree with themselves. “These findings suggest that diagnostic agreement in clinical medicine may not be much greater than that found in clinical psychology—some food for thought during your next visit to the family doctor,” wrote Goldberg. If the doctors disagreed among themselves, they of course couldn’t all be right—and they weren’t.
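A rough simulation shows why such a result is possible. The simulated doctor below is an assumption of this sketch, not a reconstruction of the Oregon data: his judgments are a noisy weighted sum of the seven cues, which is enough to produce both findings Goldberg reported, a simple linear model that tracks his diagnoses closely, and imperfect agreement with himself on the duplicated ulcers.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_cues = 96, 7
cases = rng.integers(1, 8, size=(n_cases, n_cues)).astype(float)

def simulated_doctor(cues, weights, noise, rng):
    """Return 7-point judgments: a weighted average of the cues plus noise."""
    raw = cues @ weights / weights.sum() + rng.normal(0, noise, size=len(cues))
    return np.clip(np.round(raw), 1, 7)

weights = rng.uniform(0.2, 1.0, size=n_cues)   # one hypothetical doctor's policy
first_pass = simulated_doctor(cases, weights, noise=0.5, rng=rng)
second_pass = simulated_doctor(cases, weights, noise=0.5, rng=rng)  # same ulcers again

# Fit the simple linear model of this doctor's judgments.
X = np.column_stack([cases, np.ones(n_cases)])
beta, *_ = np.linalg.lstsq(X, first_pass, rcond=None)
model_fit = np.corrcoef(X @ beta, first_pass)[0, 1]

# How well does the doctor agree with himself on the duplicated ulcers?
self_agreement = np.corrcoef(first_pass, second_pass)[0, 1]

print(f"simple model vs. doctor:         {model_fit:.2f}")
print(f"doctor vs. himself (duplicates): {self_agreement:.2f}")
```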
The researchers then repeated the experiment with clinical psychologists and psychiatrists, who gave them the list of factors they considered when deciding whether it was safe to release a patient from a psychiatric hospital. Once again, the experts were all over the map. Even more bizarrely, those with the least training (graduate students) were just as accurate as the fully trained ones (paid pros) in their predictions about what any given psychiatric patient would get up to if you let him out the door. Experience appeared to be of little value in judging, say, whether a person was at risk of committing suicide. Or, as Goldberg put it, “Accuracy on this task was not associated with the amount of professional experience of the judge.”
Still, Goldberg was slow to blame the doctors. Toward the end of his paper, he suggested that the problem might be that doctors and psychiatrists seldom had a fair chance to judge the accuracy of their thinking and, if necessary, change it. What was lacking was “immediate feedback.” And so, with an Oregon Research Institute colleague named Leonard Rorer, he tried to provide it. Goldberg and Rorer gave two groups of psychologists thousands of hypothetical cases to diagnose. One group received immediate feedback on its diagnoses; the other did not—the purpose was to see if the ones who got feedback improved.
The results were not encouraging. “It now appears that our initial formulation of the problem of learning clinical inference was far too simple—that a good deal more than outcome feedback is necessary for judges to learn a task as difficult as this one,” wrote Goldberg. At which point one of Goldberg’s fellow Oregon researchers—Goldberg doesn’t recall which one—made a radical suggestion. “Someone said, ‘One of these models you built [to predict what the doctors were doing] might actually be better than the doctor,’” recalled Goldberg. “I thought, Oh, Christ, you idiot, how could that possibly be true?” How could their simple model be better at, say, diagnosing cancer than a doctor? The model had been created, in effect, by the doctors. The doctors had given the researchers all the information in it.
The Oregon researchers went and tested the hypothesis anyway. It turned out to be true. If you wanted to know whether you had cancer or not, you were better off using the algorithm that the researchers had created than you were asking the radiologist to study the X-ray. The simple algorithm had outperformed not merely the group of doctors; it had outperformed even the single best doctor. You could beat the doctor by replacing him with an equation created by people who knew nothing about medicine and had simply asked a few questions of doctors.
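The mechanism behind this result can be sketched with made-up numbers. In the toy model below, which is an assumption of this illustration rather than anything the Oregon researchers published, the doctor applies roughly the right weights but applies them inconsistently; regressing his judgments on the cues strips out the inconsistency and leaves only his policy, which then tracks the true outcomes better than he does.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cases, n_cues = 500, 7
cues = rng.normal(size=(n_cases, n_cues))

# Hypothetical "truth": the outcome really is a weighted sum of the cues.
true_weights = rng.uniform(0.2, 1.0, size=n_cues)
outcome = cues @ true_weights + rng.normal(0, 1.0, size=n_cases)

# The doctor uses roughly the right weights, but inconsistently.
doctor = cues @ true_weights + rng.normal(0, 2.0, size=n_cases)

# The model of the doctor: regress his judgments on the cues, then use the
# fitted policy in his place.
X = np.column_stack([cues, np.ones(n_cases)])
beta, *_ = np.linalg.lstsq(X, doctor, rcond=None)
model_of_doctor = X @ beta

print("doctor vs. outcome:          %.2f" % np.corrcoef(doctor, outcome)[0, 1])
print("model of doctor vs. outcome: %.2f" % np.corrcoef(model_of_doctor, outcome)[0, 1])
```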
When Goldberg sat down to write a follow-up paper, which he called “Man versus Model of Man,” he was clearly less optimistic than he had formerly been, both about experts and the approach taken by the Oregon Research Institute to understanding their minds. “My article . . . was an account of our experimental failures—failures to demonstrate the complexities of human judgments,” he wrote of his earlier piece: the one he’d published in American Psychologist. “Since the previous anecdotal literature was filled with speculations about the complex interactions to be expected when professionals process clinical information, we had naively expected to find that the simple linear combination of cues would not be highly predictive of individual’s judgments, and consequently that we would soon be in the business of devising highly complex mathematical expressions to represent individual judgment strategy. Alas, it was not to be.” It was as if the doctors had a theory of how much weight to assign to any given trait of any given ulcer. The model captured their theory of how to best diagnose an ulcer. But in practice they did not abide by their own ideas of how to best diagnose an ulcer. As a result, they were beaten by their own model.
The implications were vast. “If these findings can be generalized to other sorts of judgmental problems,” Goldberg wrote, “it would appear that only rarely—if at all—will the utilities favor the continued employment of man over a model of man.” But how could that be? Why would the judgment of an expert—a medical doctor, no less—be inferior to a model crafted from that very expert’s own knowledge? At that point, Goldberg more or less threw up his hands and said, Well, even experts are human. “The clinician is not a machine,” he wrote. “While he possesses his full share of human learning and hypothesis-generating skills, he lacks the machine’s reliability. He ‘has his days’: Boredom, fatigue, illness, situational and interpersonal distractions all plague him, with the result that his repeated judgments of the exact same stimulus configuration are not identical. . . . If we could remove some of this human unreliability by eliminating this random error in his judgments, we should thereby increase the validity of the resulting predictions . . .”
Right after Goldberg published those words, late in the summer of 1970, Amos Tversky showed up in Eugene, Oregon. He was on his way to spend a year at Stanford and wanted to visit his old friend Paul Slovic, with whom he’d studied at Michigan. Slovic, a former college basketball player, recalls shooting baskets with Amos in his driveway. Amos, who had not played college basketball, didn’t really shoot so much as heave the ball at the rim—his jump shot looked more like calisthenics than hoops. “A three-quarters speed, spinless shot put which started at mid-chest and wafted toward the basket,” in the words of his son Oren. And yet Amos had somehow become a basketball enthusiast. “Some people like to walk while they talk. Amos liked to shoot baskets,” said Slovic, adding delicately that “he didn’t look like someone who had spent a lot of time shooting baskets.” Heaving the ball at the rim, Amos told Slovic that he and Danny had been kicking around some ideas about the inner workings of the human mind and hoped to further explore how people made intuitive judgments. “He said they wanted a place where they could just sit and talk to each other all day long without the distraction of a university,” said Slovic. They had some thoughts about why even experts might make big, systematic errors. And it wasn’t just because they were having a bad day. “And I was just kind of stunned by how exciting the ideas were,” said Slovic.