The Numerati


by Stephen Baker


  Companies and governments alike are poring over our written words. Most of the snooping is focused on crime prevention. But as the tools improve, the market will change. Instead of just looking for what we’re doing wrong, companies and governments and pollsters will be eager to learn what we’re buying, where we’re going, who we might vote for. They’re curious. As Umbria and others gorge on data in the blogosphere, they are sharpening the interpretive tools.

  I READ Tears of Lust’s blog, and it’s easy to figure out quite a bit about the writer. She’s a young woman. She lives in a city on the East Coast of the United States. If I had to guess, I’d say New York. But I can’t bet on it. I could come to lots of other conclusions about her passions, her love interests, and even what she likes to eat.

  This is all clear to me. But she’s writing in my language. Practically every word makes sense. The bad news, from a data-mining perspective, is that it takes me a scandalous five minutes to read through her text. In that time, Umbria’s computers work through 35,300 blog posts. This magic takes place within two domains of artificial intelligence: natural language processing and machine learning. The idea is simple enough. The machines churn through the words, using their statistical genius and formidable memory to make sense of them. To say that they “understand” the words is a stretch. It’s like saying that a blind bat, which navigates by processing the geometry of sound waves, “sees” the open window it flies through. But no matter. If computers can draw correct conclusions from the words they plow through, they pass the language test. And if they can refine those conclusions, adding context and allowing for caveats, then they’re getting smart—or, as some would have it, “smart.”

  For decades, computer scientists have battled over how to teach computers language and thought. Some have pushed for a logical approach. They follow a tradition pioneered by Aristotle, which divides the entire world of knowledge into vast domains, each with its own facts, rules, and relationships. One of the most ambitious of these projects, Cycorp of Austin, Texas, is trying to build an artificial intelligence that not only knows much of the world’s information but also can make sense of it. When you ask Cycorp’s computer for democratically elected leaders in the Northern Hemisphere, the geography arm of the system starts whirring away, country by country: Britain is in Europe. Europe is in the Northern Hemisphere. The Northern Hemisphere is north of the equator. It knows each of these facts and skips from one to the next. Then, according to Cycorp’s Web page, it uses logic: if region A is part of region B, and region B is north of region C, then region A is north of region C. Therefore, it concludes, Britain is north of the equator. At that point, the geography arm has to ask its political cousin if Britain is a democracy. And the analysis continues. The trouble with this logical approach is speed and flexibility. Facts change, challenging the immense system to rejigger its bits of information and the relationships among them. In 1984, when Cycorp began assembling its knowledge universe, the Soviet Union dominated the Asian continent, and the “mouse” was barely pushing its nose from the domain of rodentia, a subgroup of mammals, into the burgeoning arena of computer accessories.

  The rival approach rejects this plodding logic and prefers to see the computer purely as a counting whiz. Statistics are king. Probability defines truth. Speed and counting trump knowledge, and language exists largely as a matrix of numerical relationships. This is Umbria’s tack—and it is this statistical approach that most Numerati hew to as they study us in nearly every field. What the computers in Boulder learn is a crazy quilt of statistics and geometry. They may come up with startling insights. But they reach them through a labyrinth of calculations. These learning machines swim in numbers.

  The learning process starts with humans, a team of 6 readers at Umbria headquarters and 25 colleagues in Bangalore, India. These are the annotators. They go through thousands of blogs manually, looking for evidence of each blogger’s age and gender. Is the person male or female? How old? Sometimes they can’t tell. But when they can, they mark the blogs and put them into a digital folder. Their work is to build a “gold standard,” a selection of accurately labeled blogs that can be used to teach the machine. In this process, Kaushansky says, Umbria researchers put perhaps 100,000 blog posts in the golden folder. They take 90,000 of them and introduce them to the computer. They leave the other 10,000 to one side.
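  To make that split concrete, here is a minimal sketch in Python. It assumes only that the annotated posts are a list of text-and-label pairs; the names and the 90/10 logic are illustrative, not Umbria’s actual pipeline.

    import random

    def split_gold_standard(labeled_posts, holdout_fraction=0.1, seed=42):
        # Shuffle the annotated posts and hold a slice back for testing.
        # labeled_posts: list of (blog_text, label) pairs, e.g. ("...", "F").
        posts = list(labeled_posts)
        random.Random(seed).shuffle(posts)
        cut = int(len(posts) * (1 - holdout_fraction))
        return posts[:cut], posts[cut:]

    # A toy stand-in for the gold standard (the real folder holds ~100,000 posts):
    gold_standard = [("So as brother, sister, and fiancee we went up...", "F"),
                     ("Guys like me never check the weather.", "M")]
    training_set, holdout_set = split_gold_standard(gold_standard)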

  How do those annotators come to their conclusions about the blogs? In many cases, they rely on knowledge and context that would be hard to teach a machine. I pretend I’m an annotator and reread Tears of Lust’s post. From the very first paragraph, I’m convinced that it’s a woman writing. But what tells me this? It’s a tone I pick up, a voice. These are hard things to enumerate, never mind impart to a machine. Telling details? There are a few, but they’re not definitive. A man could have a boyfriend named Kenny. A man could get dressed up with Lizzy and search for thrift stores. I suppose a man could even write “I did a good job getting around Columbus and know the city pretty well. Go me!” It isn’t until later in the post that I read, “So as brother, sister, and fiancée we went up to surgery where we met with plenty of doctors.” It’s a little unclear who’s who. But I review the section until I’m just about sure that Kenny’s the brother, Lizzy’s the sister, and Tears of Lust is the fiancée. If I were an annotator, at this point I would confidently mark her blog post F, for female.

  A computer, Ted Kremer tells me, would have to look for other signs. It wouldn’t recognize the author as the fiancée, and it might not know that a fiancée is a woman. Kremer is the chief technical officer at Umbria. He works in a huge sunlit office lined with whiteboards. He’s blond, with a square face, a pointed goatee, and near limitless patience—at least when it comes to teaching basic data mining. Once the annotators have created their golden files, he tells me as he scrawls on a whiteboard, the scientists comb through the documents, looking for hundreds of variables the computer can home in on. They look for telling words and combinations of verbs and objects. (“Go me!” might be one.) They look at punctuation, at certain word groups, at the placement of adjectives and adverbs. They adjust for certain groups of words that, taken individually, could be misunderstood. The computer has to know, for example, that Denver Broncos are a football team and not, Kremer says, laughing, “A city of horses.” His team also looks at strange spellings. Some bloggers spell great as gr8. Others sprinkle their posts with online gestures known as emoticons, such as the smiley face, : ). The scientists instruct the computer to keep an eye on the fonts used and the color of the blogs’ lettering and background. (Tears of Lust, I note, uses a blue background covered with pictures of faces, making it look like a Bollywood movie poster.) Add it all together, and the computer might have more than 1,000 different features to find and count.
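  A toy illustration of that feature counting, in Python. The handful of features below (exclamation points, smileys, spellings like gr8) are stand-ins invented here to show the mechanics; Umbria’s real list runs to a thousand or more and is not public.

    import re

    def extract_features(post_text):
        # Count a few surface signals of the sort described above.
        text = post_text.lower()
        return {
            "exclamation_marks": post_text.count("!"),
            "smiley_emoticons": len(re.findall(r"[:;]-?\)", post_text)),
            "chat_spellings": len(re.findall(r"\b(gr8|l8r|thx)\b", text)),
            "go_me_phrases": len(re.findall(r"\bgo me\b", text)),
            "first_person_pronouns": len(re.findall(r"\b(i|me|my)\b", text)),
        }

    print(extract_features("Gr8 day thrifting with Lizzy. Go me! :)"))
    # {'exclamation_marks': 1, 'smiley_emoticons': 1, 'chat_spellings': 1, ...}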

  Can the computer distinguish between the sexes? The test is ready. The computer plows through the 90,000 blogs at lightning speed. It counts every one of the variables, and it arranges them by gender. Kremer’s team studies the results. Are there certain quirks or features that are far more common in one gender’s posts than the other’s? That’s what Kremer’s team is hungry for. Correlations. If they find them, they assemble them into a model. This is the set of instructions that tells the computer how to tell a male blogger from a female one. It starts off with the easy stuff. Some bloggers, for example, identify themselves as M or F at the top part of the blog known as the header. That’s close to a sure bet. Certain phrases are strong indicators: “my dress,” for example, or “my beard.” But most of the model is a statistical soup of more subtle signals, of varying verb combinations, punctuation, and fonts. Each one is tied to a probability. The art in this science involves how much weight to give each component.
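  The book does not say which algorithm Umbria uses, so the sketch below assumes a simple Naive-Bayes-style weighting: for every feature, compare how often it appears in female-labeled versus male-labeled training posts, and store the log of that ratio as its weight. Scoring a new post is then just a weighted count.

    import math
    from collections import Counter

    def train_gender_model(training_set, extract_features):
        # Tally each feature separately for F-labeled and M-labeled posts.
        counts = {"F": Counter(), "M": Counter()}
        totals = {"F": 1, "M": 1}  # start at 1 to smooth away zero counts
        for text, label in training_set:
            feats = extract_features(text)
            counts[label].update(feats)
            totals[label] += sum(feats.values())
        # Turn each per-feature ratio into a log-odds weight.
        weights = {}
        for feature in set(counts["F"]) | set(counts["M"]):
            p_f = (counts["F"][feature] + 1) / totals["F"]
            p_m = (counts["M"][feature] + 1) / totals["M"]
            weights[feature] = math.log(p_f / p_m)
        return weights

    def score_post(text, weights, extract_features):
        # Positive score leans female, negative leans male.
        feats = extract_features(text)
        return sum(weights.get(f, 0.0) * n for f, n in feats.items())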

  When the scientists have their model ready, they try it out on the 10 percent of the gold standard documents they had put aside. They quickly see how many of the documents the model sorts correctly and, more important, which ones it misses. They pore over these, looking for signs of faulty analysis. Why did it screw up on one man’s post? Did it give too much weight to the girlish exclamation points? Did it overlook the phrase “Guys like me”? How silly. But then again, that phrase may not have popped up in the first test batch, so it wouldn’t be in the model.
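  That check against the held-out 10 percent might look like this, continuing the sketch above. The point is less the accuracy number than the list of misses, which is what the scientists pore over.

    def evaluate(holdout_set, weights, extract_features, score_post):
        # Compare the model's guess to the annotator's label on unseen posts.
        misses = []
        correct = 0
        for text, label in holdout_set:
            guess = "F" if score_post(text, weights, extract_features) > 0 else "M"
            if guess == label:
                correct += 1
            else:
                misses.append((text, label, guess))
        accuracy = correct / len(holdout_set) if holdout_set else 0.0
        return accuracy, misses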

  Like a chef whose soufflé comes out too salty or flat, the scientists adjust the variables. They tweak their model. They take weight away from some components and add it to others. A few new hints may have popped up in the latest sampling. Those can be factored in. This process goes on and on, sometimes through ten iterations or more. In this sense, the computer is a maddeningly slow learner. It takes an entire laboratory of scientists, often skipping the slopes on weekends and ordering pizza late at night, to teach the machine what we humans know at a glance. And when the computer finally passes its gender test (Umbria won’t disclose the accuracy rate), the machine is still a long way from graduation. It moves on to the next stage, learning to peg each writer to a generation. Here, some of the markers are surprisingly simple. Older writers, for example, use a greater variety of words than younger ones do. Counting such words is hardly foolproof, but it gives the Umbria system a running start in generational groupings. It isn’t until the computer has worked through gender and age that it faces its most difficult assignment: figuring out whether bloggers give a thumbs-up or thumbs-down to whichever food, soda pop, music, or political candidate they’re analyzing.
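  The vocabulary-variety marker for age lends itself to a one-line measure. The book says only that older writers use a wider range of words; scoring it as distinct words divided by total words (a type-token ratio) is an assumption made here for illustration.

    import re

    def vocabulary_variety(post_text):
        # Distinct words divided by total words; an assumed proxy for the
        # "greater variety of words" marker, not Umbria's actual measure.
        words = re.findall(r"[a-z']+", post_text.lower())
        if not words:
            return 0.0
        return len(set(words)) / len(words)

    # A higher ratio would nudge the age estimate upward. On its own it is
    # "hardly foolproof," so it would be one weighted signal among many.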

  This is a tortured way of learning about us. Instead of knocking on our door, data miners break our documents into thousands of components and then sift through them obsessively, attempting to put together a mosaic of our thoughts and appetites. It’s sneaky. It reminds me of parents who, instead of asking their teenager point-blank where he drives at night, tiptoe into the garage, take readings from the odometer, and make their own projections on a map. This approach is often less precise and a whole lot more work. But data mining plays to the counting and calculating genius of the computer, and above all to its speed. And it takes a wide detour around the computer’s weak points—specifically, its limited ability to think and to understand.

  I ask Kremer about this. It seems to me, I say, that a computer working from a set of instructions, no matter how exhaustive, must make loads of mistakes. After all, we humans—each of us carrying a prodigious brain wired for communication—misread each other’s words and gestures daily. “What?” we say. “Huh? Are you kidding? Oh, I’m sorry, I thought . . . No, what I meant . . .” If you listen, we’re constantly amending what we say. Getting our meaning clear in someone else’s head and understanding what they’re trying to say is a struggle to which we devote an enormous amount of our intelligence. Entire industries, from psychology and law to literature, focus on sorting us out. They never run out of work. So computers, I ask Kremer, poor dumb computers that can count a million trees without realizing that they’re dealing with a forest, they must sometimes get utterly confused. Right?

  Sometimes, he says. He leads me over to his computer, and we look at a blog post that’s been analyzed by Umbria’s system. It’s an article about Apple’s iPod Shuffle. Phrases in red are deemed negative. Green phrases are positive, and blue unknown. I look for a section in red. One of them reads: “Jobs not only moved the exalted Shuffle to the top of the menu . . .”

  I study the phrase and see nothing negative about it. Could exalted be sarcastic? It doesn’t look like it. I call Kremer’s attention to the section. He reads it and shrugs. “False positive,” he says. “It happens.” He points to the “not” in the sentence. That simple negative may have led the computer to see the entire phrase as a condemnation.
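  Kremer’s explanation suggests a rule along these lines. The word list and the flip-on-“not” rule below are deliberately blunt inventions, meant only to show how a sentence praising the Shuffle can end up marked in red.

    POSITIVE_WORDS = {"exalted", "great", "love", "top"}

    def naive_phrase_sentiment(phrase):
        # Count positive words, then flip the whole phrase if "not" appears.
        words = phrase.lower().replace(",", "").split()
        score = sum(1 for w in words if w in POSITIVE_WORDS)
        if "not" in words:
            score = -score  # the lone "not" condemns the entire phrase
        if score < 0:
            return "negative"
        return "positive" if score > 0 else "unknown"

    print(naive_phrase_sentiment(
        "Jobs not only moved the exalted Shuffle to the top of the menu"))
    # -> "negative", even though the sentence is praising the Shuffle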

  Sarcasm, Kremer says, stumps the machine on a regular basis. It’s probably okay to take a San Diego blogger at her word when she exclaims, “I LOVE THIS WEATHER!” Yet how can the system know that when a blogger 1,000 miles north, in soggy Portland, writes the very same sentence, love may very well mean “hate”? These are the challenges of machine learning, and they fuel graduate research in top programs around the world. Tackling sarcasm may involve teaching the machine to keep an eye out for messages in capital letters, exclamation points, and the tendency of teens to indulge in it more often than their grandparents do. Perhaps in the distant future, contextually savvy machines will be armed with a long list of meteorological gloom zones. Maybe they’ll “understand” that salt in such latitudes often refers to highway issues, and not cuisine, and that raves about the weather, at least during certain seasons, are bound to be facetious. But for a company selling services today, such exercises are academic.

  “I’ve got bigger fish to fry,” says Nicolas Nicolov, Umbria’s chief scientist. A Romanian-born computer scientist, Nicolov got his doctorate in Edinburgh before moving to America, first to IBM’s Watson lab and then to Umbria. He has an angular face and dark deep-set eyes, and he sports thick black bangs—a bit like Jim Carrey in his early movies. He works in a small, dark office down the hall from Kremer’s sunny, expansive digs. It feels like I’ve stepped into a cave.

  Nicolov gives me an example of the kind of confusion he has to sort out. Umbria does lots of work for consumer electronics companies, he says. They want to know what kind of buzz the latest gizmos are generating. But in this area, even words like big and little vary with the context. “If a laptop is big, it’s negative,” he says. “But if a hard drive is big, it’s good.”

  Nicolov and his team can teach these lessons to the computer. It helps enormously to train it for a specific industry, or what computer scientists call a “domain.” Within that area, it learns not only words but also groups of words. Bigrams are pairs, trigrams are triplets. Anything bigger than that is an Ngram. So a sophisticated machine trained for laptops might draw a green line under a trigram like “big hard drive.” That’s positive. But it might not be quite so confident when confronted with the Ngram “big honking hard drive.” That one it might miss.
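  In code, pulling out those word groups and checking them against a domain table might look like the sketch below. The two table entries follow Nicolov’s laptop example; everything else, including the table itself, is an invented stand-in.

    def ngrams(words, n):
        # All runs of n consecutive words.
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    # A tiny domain table for laptops: "big" is bad for the machine itself,
    # good for its hard drive.
    LAPTOP_DOMAIN = {
        "big laptop": "negative",
        "big hard drive": "positive",
    }

    def domain_sentiment(sentence):
        words = sentence.lower().split()
        found = {}
        for n in (2, 3):  # bigrams and trigrams
            for gram in ngrams(words, n):
                if gram in LAPTOP_DOMAIN:
                    found[gram] = LAPTOP_DOMAIN[gram]
        return found

    print(domain_sentiment("nice screen but such a big laptop"))  # {'big laptop': 'negative'}
    print(domain_sentiment("it has a big hard drive"))            # {'big hard drive': 'positive'}
    print(domain_sentiment("a big honking hard drive"))           # {} -- the longer Ngram slips past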

  Do these errors skew Umbria’s results? Kaushansky maintains that he’s chosen precisely the right market for inexact results. “We’re providing qualitative research, not quantitative,” he says. “It’s directional. It gives early indications of where things are going, what new issues are popping up for a company.” To make his case, Kaushansky shows me Umbria’s tracking of President Bush during his 2004 reelection campaign. He has the favorable and unfavorable blog references to the president on a chart alongside a series of Gallup polls. The Umbria blog numbers appear to anticipate by two to four weeks the ups and downs in the polls. Kaushansky is saying, in effect, that even if his computer misinterprets the words on our individual blogs, it reads our trends. It tracks our tribes.

  But who exactly populates those tribes? It’s a burning question among blog analysts. The tribes, after all, are defined not by neighborhood, race, tax bracket, or the answers checked off on a survey. Instead, machines analyze our words and drop us into tribes with people we might be surprised to encounter. These tribes are a little like the buckets we land in at the supermarket—but with an extra layer of complexity. At the grocery store, consumption patterns are all that counts. But Kaushansky’s tribes, like Josh Gotbaum’s political groupings, have to embody an entire set of related values.

  Kaushansky gives me an example. Four years ago, a 43-year-old friend of his rediscovered his adolescent passion for skateboarding. He’s a true fanatic, Kaushansky says, and he adores not only the skateboard but the whole culture that surrounds it. He talks like a teenager, in Kaushansky’s view. He listens to a different generation of music. And here’s the important part: Kaushansky insists that he writes like a skateboard-obsessed teen on his blog. Perhaps in a decade or two, systems like Umbria’s will be able to distinguish between true teens and middle-aged poseurs. But not today. In their statistics, this 43-year-old is likely to show up as a teen. He aches to be a member of that tribe, Kaushansky says. And for Umbria’s purposes, what difference does it make?

  A few weeks later I’m in the San Francisco offices of Technorati, and I relate the story of the skateboarder to David Sifry, the search engine’s founder. Sifry, a transplanted New Yorker with not an ounce of West Coast cool in his expansive body, explodes: “WRONG! WRONG!” A man can write like a woman, he says, but does he buy like a woman? Sifry goes on at length about the dangers of predicting people’s behavior based on statistical correlations. “Let’s say that according to my analytics, you said that Mission Impossible III was no good and that you can’t wait to see Prairie Home Companion,” he says. “I can’t assume from that that you’re an NPR listener. That’s where you get into trouble.” That’s mistaking correlation for causation, he says. It’s common among data miners—and most other humans. How many times have you heard people say, “They always do that . . .”?

  For Kaushansky, putting his skateboarding friend and a few others in the wrong tribes may not turn out to be too serious. That’s why advertising and marketing are such wonderful testing grounds for the Numerati. If they screw up, the only harm is that we see the wrong ad or receive irrelevant coupons. However, as the Numerati file into other industries, such as medicine and policing, they won’t have the luxury of tossing loads of us, willy-nilly, into the same piles. Instead of concentrating on what we have in common, they’ll have to search out the data that sets us apart. It’s a much tougher job.

  EARLY IN 2005, blogging was developing into a craze. Political bloggers had swung their weight in the 2004 presidential campaign, and now some 40,000 new bloggers were popping up every day. Nicolas Nicolov and his technical team at Umbria couldn’t have picked a better time to be deploying blog-crunching analysis. It was around then that a vice president at Yahoo, Jeff Weiner, awed by the phenomenon of blogs, marveled to me, “Never in the history of market research has there been a tool like this.”

 
