To the smartphone-suspicious, these services seem to be more trouble than they’re worth. What’s the value of knowing the Twitter handle of the person at the next table in a restaurant, when, at best, such an app just detracts from the authentic experience of real life? At worst, it’s giving away personal info to strangers.
However, to a growing number of smartphone owners, check-ins and geo-social Web apps like Foursquare are an integral aspect of smartphone ownership. More than 18 percent of smartphone owners use some sort of geo-social service (as of February 2012), a number up 33 percent in one year, with heaviest use concentrated among the young. Importantly, more than 70 percent of smartphone owners use some sort of location-based service on the phone, even if it is just the GPS.13
These apps change the way users perceive and interact with their environment as well as the way actors in that environment interact with them. Geo-social apps work to raise the net awareness level in any neighborhood or room. Today, most of this added social intelligence is of limited value at best. But the situation is evolving rapidly.
The rising popularity of these apps, which is closely connected to smartphone adoption in general, promises a big change in our expectations of privacy. There’s an inevitability to this. As more people buy smartphones, more people use them the way the devices were designed to be used, with geo-social and location-aware apps. Wearable computing, if it eventually replaces what we know today as cell phones, will further enable this trend. We want to know more about the environment we’re in, what people on Yelp, contributors on Wikipedia, and friends on Facebook have to say about the place where we’ve arrived. This is what futurist Jamais Cascio calls augmented reality, and what the U.S. Department of Defense (DOD) calls situational awareness. It’s also human nature. As our friends, neighbors, nieces, nephews, sons, and daughters submit to the impulse to download an app that uses location information, the opt-out strategy becomes less effective for the rest of us, even those of us who consider ourselves extremely privacy aware.
We leak data through our friends.
One of the better-known examples of the accidental surrender of personal information via smartphone—what hacker, author, and astrophysicist Alasdair Allan has dubbed data leakage—involves an app called Path, which was billed as a smarter, leaner, more mobile-friendly answer to Facebook. Started by Facebook alum Dave Morin, the service launched as a way for users to digitally document their comings and goings in the world. This was your path. The service worked a lot like Facebook except that users were limited to 150 friends, based on the theory that 150 is the maximum number of meaningful acquaintances a person is capable of maintaining. These people would receive the premium subscription to your ongoing life story. Path received angel-investor funding from the likes of Ashton Kutcher, and after tweaking the service a bit it went from 10,000 users to 300,000 in less than a month.14, 15 The service today has more than 10 million users.
Path was a hit because it seemed to provide the sort of intimate, authentic, and secure sharing experience that Facebook couldn’t offer once users had to have different privacy settings for bosses, English teachers, mothers-in-law, et cetera. The sharing and posting on Path felt intuitive. Turns out it was a bit too intuitive.
Before long, a Singapore-based developer named Arun Thampi discovered that the ease of interfacing came at a high cost. Thampi was playing around with the code when he discovered something unusual. “It all started innocently enough,” he wrote on his blog. “I was thinking of implementing a Path Mac OS X app as part of our regularly scheduled hackathon . . . I started to observe the various API calls made to Path’s servers from the iPhone app . . . I observed a POST request to https://api.path.com/3/contacts/add. Upon inspecting closer, I noticed that my entire address book (including full names, emails and phone numbers) was being sent as a plist to Path. Now I don’t remember having given permission to Path to access my address book and send its contents to its servers, so I created a completely new ‘Path’ and repeated the experiment and I got the same result—my address book was in Path’s hands.”16
The company was holding detailed information on the friends, families, coworkers, and contacts of all three hundred thousand or so of its users, a list that potentially included tens of millions of people. Path quickly issued an apology and a software update. But, in many ways, the damage was already done.
Allan has called this the inevitable result of the increasing market for mobile software among people who don’t understand—and have no desire to learn—how their most cherished devices work. We want our apps to know us, to present customized answers to our problems and questions, but we don’t care how they arrive at those solutions until there’s a problem.
Most people who download Instagram, Twitter, or Facebook to their phone already understand, at least in part, that they’re risking their personal private information in doing so. But they probably wouldn’t elect to give their grandparents’ contact information and other personal details to some strange company. Given that more than 9 percent of the entire U.S. population is part of a geo-social network (as calculated from the fact that 18 percent of smartphone owners are part of a geo-social network, and well more than 50 percent of the population owns a smartphone), further incidents of data leakage will affect the U.S. population well beyond the smartphone-owning community.
We presume that our personal data is compromised only when we choose to take a certain risky action. Maybe some people find amusement in these silly networks and don’t mind giving away their information to strangers, but that shouldn’t have any bearing on me, goes this line of thinking. But our friends and loved ones create data about life and that data includes us, whether we wish to be tagged or not. This is why we are using the wrong set of words to explain this phenomenon; we think of data leakage as an act of theft but we need to understand it as a contagion event. If you know someone who geo-tags their tweets, Facebook posts, or Instagram photos, you’ve already been infected.
Telemetry, Simulation, and Bayes
Once these signals are sensed, they must be processed if they are to form the basis of a useful prediction. But predictions—like the future itself—spring from the brain. The challenge is getting computers, programs, and systems to make predictions on the basis of continuously sensed information, on the basis of what’s happening now in (sort of) the same way that the brain does. This is an entirely recent problem related to the rise of continuous data streams and all the artifacts of modern information overload. But the mathematical formula to tackle it has actually been around for centuries and can be utilized as easily by a college undergrad as by a roomful of scientists.
Researchers use plenty of statistical methods and mathematical tricks, in isolation or in combination, to turn data into a prediction. But the one method that allows you to make new predictions and update old predictions on the basis of new information is named after its founder, Thomas Bayes. The theorem in its simplest form is:

P(A | X) = P(X | A) × P(A) / P(X)

In the above, P is probability, A is the outcome we are trying to predict, and X is some condition that could affect that outcome. The theorem solves for P(A | X), the probability of A given X. The value you assign to P(A) when you begin is sometimes called the “prior”; the value you arrive at after you’ve run the formula is called the “posterior.”
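Expressed as code, the theorem is a one-line computation. The sketch below is illustrative rather than anything from the text; the argument names simply mirror the P(A), P(X), and P(X | A) terms of the formula, and the numbers are made up.

```python
def bayes_posterior(p_x_given_a, p_a, p_x):
    """Bayes' theorem: P(A|X) = P(X|A) * P(A) / P(X).

    p_x_given_a -- likelihood of seeing evidence X if outcome A is true
    p_a         -- prior probability of the outcome A
    p_x         -- overall probability of the evidence X
    """
    return p_x_given_a * p_a / p_x

# Made-up numbers: if the evidence is twice as likely under A (0.8)
# as it is overall (0.4), the posterior doubles the prior, 0.3 to 0.6.
posterior = bayes_posterior(p_x_given_a=0.8, p_a=0.3, p_x=0.4)
```

Running the formula again with the posterior as the new prior is all that “updating on new information” means.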
Undeniably, compared with other statistical methods, Bayes won’t always give you the most accurate answer based on the data you’re looking at. But it does give you a fairly honest answer: a large gap between the prior and the posterior suggests that your initial assumption deserved little confidence.
Celebrated artificial intelligence (AI) luminary and statistician Judea Pearl describes the process as follows: the Bayesian interpretation of probability is one in which we “encode degrees of belief about events in the world, and data are used to strengthen, update or weaken those beliefs.”17
Compared with many other statistical methods such as traditional linear regression, Bayes is one of the most like the brain. Predictions of probability combine past experience with sensed input to create a (somewhat) moving picture of the future.
What’s important to understand is that although Thomas Bayes’s formula wasn’t published until 1764, about three years after his death, it’s only in the last couple of decades that Bayes has come to be seen as the essential lens through which to understand probability in a wide number of contexts. The Bayesian formula plays a critical role not only in statistical research methods dealing with computer and AI problems but also in the simpler business of quantifying what may happen.
When I asked the researchers in this book why they found Bayes more useful than other statistical methods for their work, the most common response I received was that Bayesian inference allows you to update a probability assumption—the degree of faith you have in a particular outcome—on the basis of new information, new details, and new facts, and to do so very quickly. If the interconnected age has taught us nothing else, it is that there will always be new facts. Bayes lets you speedily move closer to a better answer on the basis of new information.
Here’s an example. Let’s say it’s Tuesday and you are scheduled to meet your therapist. Your therapist has never missed a Tuesday appointment, so you hypothesize that the probability of her showing up is 100 percent; that is, P(A) = 1. This is your prior belief. Obviously, it’s terrible. There is never a 100 percent chance that someone will show up to work. Now, let’s say you get some new information: your therapist has just left a previous appointment and she is three miles away, on foot. How would you go about adjusting your belief to more accurately reflect the probability that your therapist will make it to your appointment on time?
Let’s say you find some new data: the average walking speed is 3.1 miles per hour. Given time and distance, you can compute that your therapist will surely be late. But you must compute this in light of the prior value: your therapist is never late. You now know the chances of your therapist being late for this appointment are lower than they would be for a regular person, but the possibility of her being late for your appointment, in spite of what you understand to be the lessons of all history, has grown significantly. Now you discover even more information: according to reviews of your therapist’s practice on Yelp, she’s actually late to her appointments about half the time. You can recompute the probability of your therapist’s getting to the appointment on time over and over, every time you get some new tidbit that reveals reality more clearly in all its inconvenience. What is making the future more transparent is the exponentially growing number of tidbits we have to work with. Bayes lets us manage that growth.
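That updating loop can be sketched in a few lines. Every number here is invented for illustration: a prior belief that she arrives on time, revised once for the walking-distance evidence and once for the Yelp reviews.

```python
def update(prior, p_evidence_if_on_time, p_evidence_if_late):
    """One Bayesian update of P(on time) given a new piece of evidence."""
    numerator = p_evidence_if_on_time * prior
    evidence = numerator + p_evidence_if_late * (1 - prior)
    return numerator / evidence

# Start from a slightly less absurd prior than 100 percent: 0.95.
belief = 0.95

# Evidence 1: she is three miles away on foot. Assume (invented) that
# being this far out is rare when she arrives on time (0.05) and
# common when she ends up late (0.9).
belief = update(belief, 0.05, 0.9)

# Evidence 2: Yelp says she's late about half the time, folded in
# here as evidence weakly favoring lateness (0.4 vs. 0.6, invented).
belief = update(belief, 0.4, 0.6)

print(round(belief, 3))  # 0.413
```

Each tidbit drags the naive prior toward something closer to reality, which is exactly the behavior described above.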
Imagine next that you have an enormous amount of telemetrically gathered information to update your prior assumption. You can actually track your therapist moving toward you in real time through her Nike+ profile. You can read the wind currents meeting her via Cosm’s feed off a nearby wind sensor. You can measure her heart rate and hundreds of other signals that might further refine your understanding of where she is going to be, relative to you, in the next few minutes. Let’s say you also have access to an enormous supercomputer capable of running thousands of simulations a minute, enabling you to weigh and average each new variable and piece of information more accurately. The influence of your first hilarious off-the-mark prior assumption about your therapist’s perfect punctuality is, through this process, dissolved down to nothing.
This is the promise of sensed data, of telemetrics combined with easy-to-update statistical tools such as Bayes.
Finding You
In March 2010, Adam Sadilek, a young Czech-born researcher from the University of Rochester, set out with some colleagues to see how accurately they could predict the location of someone who had turned off his or her GPS, who wasn’t geo-tagging tweets or posts, who was in effect going incognito. Sadilek and his team sampled the tweets of more than 1.2 million individuals across New York City and Los Angeles (America’s chirpiest cities). After a month, the team had more than 26 million individual messages with which to work; 7.6 million of those tweets were geo-tagged.
They trained an algorithm using Bayesian machine learning to explore the potential patterns among the tweeters. The idea was to uncover the conversations between users, contextualize the conversations taking place across the New York and Los Angeles landscapes, and see whether that information could reveal the whereabouts of people who were friends with the geo-taggers but who weren’t themselves geo-tagging.
Turns out that your friends’ geo-tagged tweets provide a great indication of where you’ve been, even if you weren’t in that place with that friend. Because you, like most people, are probably a creature of habit, where you’ve been is an excellent indicator of where you’re going.
Let’s say Sadilek’s system has no “historical information” on you. You don’t geo-tag tweets; you keep your phone’s GPS setting off; you are invisible, a covert operative. But in order to maintain your cover, you established a Twitter account using a dummy e-mail address. Let’s also say you’ve got two friends on Twitter. They’re real friends, people you talk to about events in real life and with whom you relate in the real world. You see them in class, at clubs, in line at the post office. Like a lot of other people, these two friends do geo-tag their tweets. Sadilek’s system can predict your location at any moment (down to 328 feet and within a twenty-minute window) with 47 percent accuracy. That means he’s got nearly a coin-flip chance of catching you at any given moment.18
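Sadilek’s actual model is far more sophisticated, but the core intuition (your friends’ geo-tags effectively vote on where you probably are) can be caricatured in a few lines. Everything below, the friends and their check-ins alike, is invented.

```python
from collections import Counter

# Invented history: where each geo-tagging friend tweeted from, by hour.
friend_checkins = {
    "friend_a": [(9, "campus"), (9, "campus"), (13, "cafe"), (21, "club")],
    "friend_b": [(9, "campus"), (13, "cafe"), (13, "cafe"), (21, "club")],
}

def predict_location(hour):
    """Guess the hidden user's location as the place their friends most
    often occupy at this hour -- a crude stand-in for the Bayesian model."""
    votes = Counter(
        place
        for checkins in friend_checkins.values()
        for h, place in checkins
        if h == hour
    )
    return votes.most_common(1)[0][0] if votes else None

print(predict_location(13))  # prints: cafe
```

The real system weighs who talks to whom and how often, but even this toy version shows why opting out of geo-tagging yourself isn’t enough.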
I know, I know, you did everything right. You were a careful steward of your privacy. It’s not fair that a twenty-five-year-old PhD grad from Czechoslovakia should be able to find out so much about you so effortlessly. It was your friends who gave you away without even realizing it. Now your not-so-secret-agent career is over.
I went to meet Sadilek at an AI conference. Sitting in the executive lounge on the top floor of the Toronto Sheraton, we overlooked downtown and saw people parking their bicycles, waiting for buses, talking on phones, walking with heads pointed toward shoes, white iPod cords dangling from their ears, people coming and going from little secret rendezvous that every one of them presumed were unknowable to the outside world. We talked a bit about human predictability.
“Somehow, growing up as a teenager, I always was sort of put off by how predictable people are. I never liked that. I liked people that were random.”
Since entering the field of machine learning, Sadilek has come face-to-face with a hard truth: human behavior is far more predictable than anyone ever predicted; shockingly so, you might even say. One experiment in particular proved this in a way that astounded even Sadilek.
The year was 2011 and he was about to start an internship at Microsoft with researcher John Krumm. In his years of working at Microsoft, at a time when the company was at its most ambitious and adventurous, Krumm was able to amass a rather unique data set. He set out to make a sort of living map of human mobility, the way zoologists and biologists track the movement of bears or birds or lions; and because Microsoft was so flush with cash at the time, Krumm paid several hundred test subjects to carry GPS trackers with them wherever they went, devices that broadcast the wearers’ physical location every couple of seconds. Some people carried the trackers in their pockets and some had the tracker installed on the dashboard of their cars. Microsoft was considering a lot of potential uses for this data, from helping cities better understand traffic patterns to developing a new line of smart thermostats that could predict when customers were on their way home and turn on the heat accordingly. Another potential use was an intelligent calendar to be used in conjunction with Outlook (the default e-mail client that comes with Windows), which could forecast your availability for appointments into the future. Krumm watched the trackers and the people to whom they were connected sail through life for more than six years. Altogether, his seven hundred–plus subjects provided more than ninety years’ worth of data on human mobility.
He presented the data set to Sadilek and they applied an algebraic technique called eigendecomposition to it. Decomposition in this sense simply means reducing a lot of numbers to a single value that’s in some way characteristic of the whole. Eigen is derived from the German word for “self.” Through eigendecomposition Sadilek and Krumm were able to create a model that could predict a subject’s location with higher than 80 percent accuracy up to eighty weeks in advance.19
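As a rough illustration of the idea (not Krumm and Sadilek’s actual pipeline), a subject’s history can be arranged as a days-by-hours matrix of numeric location codes; the dominant eigenvector of its covariance then acts as a compressed “typical day” from which other days are rebuilt. The data below is invented.

```python
import numpy as np

# Invented data: 14 days x 24 hours of location codes
# (0 = home, 1 = work, 2 = gym). Weekdays repeat one routine;
# weekends are spent entirely at home.
weekday = np.array([0]*8 + [1]*9 + [2] + [0]*6, dtype=float)
days = np.array([weekday if d % 7 < 5 else np.zeros(24)
                 for d in range(14)])

# Center the data and eigendecompose its covariance ("eigendays").
mean_day = days.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(days - mean_day, rowvar=False))
top = eigvecs[:, -1]  # eigenvector with the largest eigenvalue

# Rebuild a day as the mean day plus a weighted dominant eigenday; the
# weight comes from projecting that day onto the eigenvector. In this
# toy case the routine is so regular that the rebuild is exact.
weight = (days[0] - mean_day) @ top
reconstruction = mean_day + weight * top
```

The real model keeps many such eigendays, but the principle is the same: a handful of characteristic patterns stand in for years of raw GPS points.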
Put another way, based on information stored in your phone, Sadilek and Krumm’s model can predict where you will be—down to the hour and within a square block—one year and a half from right now.
Granted, Krumm and Sadilek’s data set isn’t a typical one. Most of us don’t share geo-location information as frequently as did the folks Krumm put on the payroll. At least not yet. And most of us bounce between home, work, or school and back pretty regularly. In fact, if you know where someone usually is on a Monday at 10 A.M. you can infer their location on any given Monday at 10 A.M. fairly well, but it’s still just a guess based on two data points. The magic of Sadilek and Krumm’s Far Out model, as they named it, is that it factors in the occasional random detour—the flat tire, the unexpected work junket, or the sick day—without making those outlier events more significant than they are, without overfitting.
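That habit-based guess, the most common past location for a given weekday and hour, is the naive baseline a model like Far Out has to beat, and it takes only a few lines. The log below is invented.

```python
from collections import Counter, defaultdict

# Invented log of (weekday, hour, place) sightings.
history = [
    ("Mon", 10, "office"), ("Mon", 10, "office"), ("Mon", 10, "cafe"),
    ("Tue", 10, "office"), ("Sat", 10, "gym"),
]

by_slot = defaultdict(Counter)
for day, hour, place in history:
    by_slot[(day, hour)][place] += 1

def habitual_location(day, hour):
    """Most frequent past location for this (weekday, hour) slot."""
    slot = by_slot.get((day, hour))
    return slot.most_common(1)[0][0] if slot else None

print(habitual_location("Mon", 10))  # prints: office
```

Far Out’s contribution is precisely what this baseline lacks: a principled way to absorb the occasional detour without letting it dominate.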
A flat tire on a Monday at 10 A.M. isn’t actually random, according to the strict definition of the word. We just don’t yet know how to model it. A certain type of person, someone who wears his tires thin without replacing them, someone who drives through an area with lots of hazards, et cetera, is more likely to suffer a flat every few months than is someone who doesn’t take her car out as often, or to the same places, or who replaces her tires religiously. Sadilek’s system doesn’t explain why some people have more flats, but it does find some people are more prone to these anomalies than others. When you have a data set with enough points, even outliers can reveal a pattern.