Stigma is a form of bias: it’s socially justified bias. Computers make our stigmas more explicit and more public by providing “evidence” for a stigma assignment. They thoughtlessly apply labels without regard to the biases that may be attached to them. When these labels are derogatory—think “obese,” “criminal,” “poor,” “uneducated,” and the like—stigmas are empowered and circulated by the spread of these labels through social systems.
It is to our credit that society has gradually fought against the wounds of stigmas. The stigmas of being poor or mentally ill are no longer as dehumanizing as they once were, even though they still carry negative associations. Yet our ability to retract stigmas relies on our being able to alter labels, to forget them or forget their associations. Computers, in contrast, cannot forget. They fix and perpetuate the labels they apply. Fifty years ago, society treated homosexuality as a mental disorder. In another fifty years, there will surely be biases we endorse today that will be condemned as backward, narrow-minded, and reactionary.
Algorithms inevitably make many mistakes as they track our biases and try to reflect them back to us in sufficiently sanitized form. Keeping computers from making such mistakes is a Sisyphean task, and companies like Google and Facebook will continue to apologize whenever an offensive error occurs. This is a problem that can only be managed, not solved. At worst, computers will encourage us to keep applying certain biases even if we wish to stop.
In my time as an engineer, I was fortunate not to have to traffic in having computers label people. I was not involved in distinguishing between the offensive and inoffensive. Shortly after September 11, 2001, there was a curious incident in which Microsoft was accused of hiding anti-Semitic code in the Wingdings symbol font. In Wingdings, the letters “NYC” rendered as three symbols: a skull and crossbones, a Star of David, and a thumbs-up sign.
It was, in fact, a complete coincidence, though conspiracy theorists and the New York Post accused Microsoft of anti-Semitism. From this incident I realized that as data and networks expand, offense will sooner or later be guaranteed, no matter how smart the algorithm. We might work hard to curtail biased application of labels to data, but when the offense arises from the sheer multiplicity of labels and data, the problem is endemic to labeling itself. To label data is to inject bias into it.
As a child, I had been drawn to computers because they were free of society’s tortuous value systems. Ironically, I now live in a world where computers are the thoughtless arbiters of those very same value systems. They have come to speak our languages like idiots, replicating their stigmas and biases. This is the fault of our dependence on inexact ontologies and their labels. Whether humans or computers are applying the bias, one does not eliminate bias from an ontology—one only copes with it. Our social networks today encourage us to prejudge one another, because on them we encounter one another by our labels and our statistics.
The Social Graph
I could escape the reduction imposed by unjust laws and customs, but not that imposed by ideas which defined me as no more than the sum of those laws and customs.
—RALPH ELLISON, “The World and the Jug”
Friendster, Myspace, and then Facebook and LinkedIn invented the social web as we know it. The social web is a network of friends. Relationships tend to be mutual and symmetrical (though not always, particularly on Twitter where one can “Follow” celebrities who don’t follow back). Each person or user is a node in the network, and the links between the nodes reflect friendships.
This initial model of a friends network has lost significance as the social web has grown. Facebook and the social web have moved far beyond treating humans as nodes in a network of personal relationships.
Unlike web pages, humans possess vast quantities of metadata. Google was able to draw a tremendous amount of information from what was primarily an unstructured (or barely structured) mass of text, which frequently gave little to no clue about its underlying meaning. Google scanned web pages for individual words. The less common words often functioned as coded labels indicating what sort of web page it was. But a human is not a page of text. Humans are not made up of words. From the perspective of a cloud service, humans generate status updates in the form of text and photos, and exchange text and image messages with one another, but these are far less useful for determining people’s essential nature, or even their age, gender, and race. Facebook was the first software company to confront the question of how to determine what a person is about. In other words, how to label a person.
This is a profound and fundamental question. What the cloud knows about us is what it can algorithmically assess, and what it can algorithmically assess determines how the cloud responds to us and recommends to us. The cloud’s judgments are not only descriptions of the sort of people it thinks we are and assumes that we will continue to be, but also prescriptions for how we are to be treated. Computers classify and describe us, and we respond to their classifications, which in turn cause computers to refine their descriptions. Our behavior shapes computers, and computers’ behavior shapes us.
Yet how does this feedback process take place? Social networks treat people not like web pages, but like products. A product on Amazon has a description, reviews, barcodes, related products, and the like associated with it, but it doesn’t reside on a single website or a single authoritative page. So it is with humans. Information scattered across different documents describes a person who exists outside the web. Across the web, there are hundreds if not thousands of data sources about each person. If I were to collate all this information, gathering the documents concerning a particular “human” in one place and analyzing them, I could learn a great deal about that person. I’m stingy with my personal details on Facebook, but if you correlated my Facebook profile with my Amazon purchases, my websites, my public writings, my address and demographics, my credit ratings, and other such data, you could put together a detailed picture of my life, my habits, and my tastes.
Imagine, instead of a social network where each person is a page, that each person is a website containing thousands and even millions of pages. Each page holds some bit of online information from a person: status updates, tweets, photos, emails, documents, demographic data, consumer purchases, work history, education, medical history, and more. No single service will have access to all of a person’s information, but each service will lump together what it can find for each individual and analyze away. Some bits of information are not especially meaningful. What can we learn about a person from a twenty-word status update like “I had a lousy day today and I’m going for a drink”? “Drink” is the most useful word, and even it is quite vague without greater context. We can’t even safely conclude that it means alcohol. But if the cloud knew what the person liked to drink and how often, then it could target more persuasive advertisements to this person. If a person posted about “the Second Amendment,” it could show him or her ads for the NRA, or for NRA-friendly candidates. To perform such feats, the cloud needs to accumulate metadata about a person—or better yet, have the person provide it.
The ubiquitous “Like” button, which Facebook has seeded across the web, was the simplest and most powerful generator of human metadata yet seen. By “Like”-ing a status, a page, or a product, we tell Facebook that we are interested in something. Deriving interests from status updates is a dodgy business; just figuring out whether something’s being mentioned positively or negatively (such as the Second Amendment) is hard enough.*8 Photos are tricky too; just what is that person drinking in that photo—and wait, who even is that person? Image analysis was quite primitive when Facebook started out, and text can be ambiguous. But a “Like” is simple and clear-cut. It is a low-cost and high-information tool for assembling an ad hoc but very powerful profile on a user, in order to determine what a person is about. By federating the “Like” button across other websites, Facebook followed its users beyond the confines of Facebook proper and refined its user profiles. Facebook’s idea of a “person” became a powerful, if messy, profile, full of rich metadata on a person’s likes and dislikes, habits, and relationships, and a great deal of their biography. This was useful metadata.
Every time a person clicks the “Like” button for X, whether X is a product or an artist or a store or an idea, Facebook labels that person as someone who likes X. This was already a step beyond what Facebook could get from most status updates or photos. From there, it is simple to group together people who like X, even if they have no explicit connection to one another, and to surmise things about these groups: to infer other preferences their members share. This was an application of Amazon’s recommendation engine on a far greater scale. Amazon recommends products based on what people with similar purchase histories have bought, just as Netflix does with film recommendations, but Facebook knows not only what we buy, but also what television shows we watch, food and drink we consume, sports teams we root for, stores we visit, cars we want but can’t afford to buy, politicians we support or hate, and much more. Amazon used its technology to sell stuff and generate enormous revenue, but Facebook was better positioned than Amazon to put together comprehensive user profiles. Facebook discovered, after Google and Amazon, a third way of organizing the web. Google organized it by pages, Amazon by products, and Facebook by people.
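To make the mechanism concrete, here is a minimal sketch of this kind of co-occurrence inference, with invented users and Likes. The data and names are hypothetical, and Facebook’s production systems are vastly more elaborate; the point is only how grouping by shared Likes yields new inferences:

```python
from collections import Counter
from itertools import combinations

# Hypothetical Like data: user -> set of pages Liked.
likes = {
    "alice": {"Modest Mouse", "Interpol", "The Strokes"},
    "bob":   {"Modest Mouse", "Interpol"},
    "carol": {"Interpol", "The Strokes"},
}

# Count how often each pair of pages is Liked together.
co_likes = Counter()
for pages in likes.values():
    for a, b in combinations(sorted(pages), 2):
        co_likes[(a, b)] += 1

def suggest(user, top=3):
    """Suggest pages that co-occur with the user's existing Likes."""
    mine = likes[user]
    scores = Counter()
    for (a, b), n in co_likes.items():
        if a in mine and b not in mine:
            scores[b] += n
        elif b in mine and a not in mine:
            scores[a] += n
    return [page for page, _ in scores.most_common(top)]

print(suggest("bob"))  # ['The Strokes'], inferred purely from co-occurrence
```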
A “Like” does not have to be explicit. As long as we are logged in to a Facebook account, Facebook can generate meaningful metadata about us by tracking which links we click on and which pages we visit. Facebook assembles a second set of implicit interests by tracking users’ activity on and off Facebook.*9 This in turn assigns further labels to users: Times readers, car lovers, gamblers, alcoholics, etc. Quizzes are also useful: introverts or extroverts, Snapes or Dumbledores, masterminds or crackpots. Basic demographic data is increasingly available to any large corporate entity through either explicit declaration or implicit inference: gender, race, location, income class, social class, familial relationships, education history, and work history. Taken together, this data enables Facebook to profile a person in far greater detail than any number of status updates alone could allow.
Labels beget more labels. People who Liked Amstel and Corona were almost certain to Like beer, even if they never bothered to click Like on the generic beer page. If you Liked the retro indie stylings of Modest Mouse and Interpol, you were a likely bet for the Strokes, as well as indie rock communities. If you Liked Boston-area restaurants, perhaps you would like other Boston-area restaurants, especially those that paid for advertising on Facebook. If you liked InfoWars, you were a good bet for other conspiracy-theorist websites. The ever-present Suggestion box was there to suggest more pages to Like.
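The “labels beget labels” step amounts to propagating implications. A minimal sketch, with hypothetical rules standing in for whatever Facebook actually learns:

```python
# Hypothetical implication rules: a specific label implies a broader one.
implies = {
    "Amstel": "beer",
    "Corona": "beer",
    "Modest Mouse": "indie rock",
    "Interpol": "indie rock",
    "InfoWars": "conspiracy media",
}

def expand_labels(labels):
    """Propagate labels until no new ones are implied."""
    labels = set(labels)
    changed = True
    while changed:
        changed = False
        for label in list(labels):
            parent = implies.get(label)
            if parent and parent not in labels:
                labels.add(parent)
                changed = True
    return labels

# A user who never clicked Like on the generic beer page gets it anyway.
print(expand_labels({"Amstel", "Corona"}))  # {'Amstel', 'Corona', 'beer'}
```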
Social networks and their third-party partners have proven to be brilliant collection mechanisms for understanding people in a conveniently computational way. Yet the side effects of this process are causing us to see each other in a more computational way.
The Presentation of Self in Internet Life
Since you are tongue-tied and so loath to speak,
In dumb significants proclaim your thoughts.
—SHAKESPEARE, Henry VI, Part 1
I feel so bad for the millennials. God, they just had their universe handed to them in hashtags.
—OTTESSA MOSHFEGH
The primitive level of user feedback encouraged by online services is a feature, not a bug. It is vastly easier for a computer to make sense of a “Like” or a “★★★★★” than to parse meaning out of raw text. Yelp’s user reviews are a necessary part of its appeal to restaurant-goers, but Yelp could not exist without the star ratings, which allow for convenient sorting, filtering, and analysis over time (for instance, to track whether a restaurant is getting worse). This leads to what I’ll term:
THE FIRST LAW OF INTERNET DATA
In any computational context, explicitly structured data floats to the top.*10
* * *
“Explicitly structured” data is any data that brings with it categories, quantification, and/or rankings. This data is self-contained, not requiring any greater context in order to be put to use. Data which exists in a structured and quantifiable context—be it the DSM, credit records, Dungeons & Dragons, financial transactions, Amazon product categories, or Facebook profiles—will become more useful and more important to algorithms—and to the people and companies using those algorithms—than unstructured data like text in human language, images, and video.
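A toy illustration of the law, with made-up Yelp-style reviews: the structured star field supports sorting, filtering, and aggregation in a line each, while the free text supports none of this without a layer of interpretation.

```python
# Hypothetical reviews: the star rating is structured; the text is not.
reviews = [
    {"restaurant": "Nonna's", "stars": 5, "text": "Transcendent gnocchi."},
    {"restaurant": "Burger Shed", "stars": 2, "text": "Fine, I guess?"},
    {"restaurant": "Thai Palace", "stars": 4, "text": "Solid pad see ew."},
]

# The structured field floats to the top: it sorts, filters, and averages.
best_first = sorted(reviews, key=lambda r: r["stars"], reverse=True)
four_plus = [r for r in reviews if r["stars"] >= 4]
average = sum(r["stars"] for r in reviews) / len(reviews)

print([r["restaurant"] for r in best_first])
# ['Nonna's', 'Thai Palace', 'Burger Shed']

# The text field answers nothing directly: is "Fine, I guess?" praise or
# a complaint? The rating resolves it instantly; the prose needs a reader.
```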
This law was obscured in the early days of the internet because so little of the data there was explicitly quantified. The great exception, the link graph, was precisely the metadata that Google exploited so lucratively, underscoring how algorithms gravitate toward explicitly quantified data. In other words, the early days of the internet were an aberration. In retrospect, the early internet was the unrepresentative beginning of a process of explicit quantification that has since taken off with the advent of social media platforms like Facebook, Snapchat, Instagram, and Twitter, which are all part of the new norm.*11 That norm also includes Amazon, eBay, and other companies dealing in explicitly quantified data.
Web 2.0 was not about the social per se. Rather, it was about the classification of the social, and more generally the classification of life. Google had vacuumed up all that could be vacuumed out of unstructured data. The maturation of the web demanded more explicitly organized content that could more easily be analyzed by computers. And the best way to create that content at scale was to enlist users themselves to produce it.
Explicitly quantified data requires that data be labeled and classified before it can be sorted and ordered. The project of archives like the Library of Congress isn’t sorting the books per se; it’s developing the overarching classification that determines what order the books should be in. No classification, no sorting. Even machine learning fares worse when “unsupervised”—that is, when it is not provided with a preexisting classificatory framework.
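A minimal illustration of “no classification, no sorting,” using hypothetical Library of Congress call numbers: the sort itself is trivial, but only because the classification scheme already exists to supply the key.

```python
# Hypothetical catalog entries: the call number is the classification.
books = [
    {"title": "Bitwise", "call_number": "QA76.9"},
    {"title": "Invisible Man", "call_number": "PS3555.L625"},
]

# Shelving order falls out of the classification scheme, not the books:
# strip the call numbers and there is nothing principled to sort on.
shelved = sorted(books, key=lambda b: b["call_number"])
print([b["title"] for b in shelved])  # ['Invisible Man', 'Bitwise']
```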
THE SECOND LAW OF INTERNET DATA
For any dataset, the classification is more important than what’s being classified.*12
* * *
The conclusions and impact of data analyses more often flow from the classifications under which the data has been gathered than from the data itself. When Facebook groups people together in some category like “beer drinkers” or “fashion enthusiasts,” there is no essential trait unifying the people in that group. Like Google’s secret recipe, Facebook’s classification has no actual secret to it. It is just an amalgam of all the individual factors that, when summed, happened to trip the category detector. Whatever it was that caused Facebook to decide I had an African-American “ethnic affinity” (was it my Sun Ra records?), it’s not anything that would clearly cause a human to decide that I have such an affinity. What’s important, instead, is that such a category exists, because it dictates how I will be treated in the future. The name of the category—whether “African American,” “ethnic minority,” “African descent,” or “black”—is more important than the criteria for the category. Facebook’s learned criteria for these categories would significantly overlap, yet the ultimate classification possesses a distinctly different meaning in each case. But the distinction between criteria is obscured. We never see the criteria, and very frequently they are arbitrary or flat-out wrong. The choice of classification is more important than how the classification is performed.
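What “tripping the category detector” might look like, as a sketch with entirely hypothetical signals and weights: no individual factor means the category; the weighted sum does the deciding, which is why the criteria look arbitrary from the outside.

```python
# Hypothetical learned weights over individual signals. None of these
# numbers is real; the shape of the mechanism is the point.
weights = {
    "likes_sun_ra": 0.9,
    "likes_craft_beer": 0.1,
    "reads_the_times": 0.2,
}
THRESHOLD = 1.0

def in_category(signals):
    """Sum the weights of the present signals against a fixed threshold."""
    return sum(weights.get(s, 0.0) for s in signals) >= THRESHOLD

print(in_category({"likes_sun_ra", "reads_the_times"}))      # True (1.1)
print(in_category({"likes_craft_beer", "reads_the_times"}))  # False (0.3)
```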
Here, Facebook and other computational classifiers exacerbate the existing problems of provisional taxonomies. The categories of the DSM dictated more about how a patient population was seen than the underlying characteristics of each individual classified with these categories, because it was the category tallies that made it into the data syntheses. One’s picture of the economy depends more on how unemployment is defined (whether it includes people who’ve stopped looking for a job, part-time workers, temporary workers, etc.) than it does on the raw experiences and opinions of citizens. And your opinion of your own health depends more on whether your weight, diet, and lifestyle are classified into “healthy” or “unhealthy” buckets than it does on the raw statistics themselves. Even the name of a category—“fat” vs. “overweight” vs. “obese”—carries with it associations that condition how the classification is interpreted.
Some classifications are far more successful and popular than others. The dominant rule of thumb is:
THE THIRD LAW OF INTERNET DATA
Simpler classifications will tend to defeat more elaborate classifications.
* * *
The simplicity of feedback mechanisms (Likes, star ratings, etc.) is intentional. Internet services can deal with complicated ontologies when they need to, but business and technical inertia privilege simpler ones. Facebook waited ten years to add reactions beyond “Like” and long resisted calls for a “Dislike” button, forcing its users to Like death announcements and political scandals. Facebook preferred a simple bimodal interested/uninterested metric. When Facebook finally decided to appease its users, it added five sentiments to the original Like: Love, Haha, Wow, Sad, Angry. It is no coincidence that the two negative sentiments come last: “Sad” and “Angry” are more ambiguous than the others. If I express a positive reaction to something, I’m definitely interested in it. If I’m made sad or angry by something, I may still be interested in it, or I may want to avoid it. Those reactions are less useful to Facebook.
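A sketch of that bimodal metric, with hypothetical values rather than anything Facebook has disclosed: the positive reactions map cleanly onto “interested,” while “Sad” and “Angry” muddy the signal.

```python
# Hypothetical interest weights per reaction; the real model is not public.
reaction_interest = {
    "Like": 1.0, "Love": 1.0, "Haha": 1.0, "Wow": 1.0,  # unambiguous interest
    "Sad": 0.5, "Angry": 0.5,  # ambiguous: drawn in, or repelled?
}

def interest_score(reactions):
    """Average the interest signal over a post's reactions."""
    if not reactions:
        return 0.0
    return sum(reaction_interest[r] for r in reactions) / len(reactions)

print(interest_score(["Love", "Wow"]))   # 1.0, a clear targeting signal
print(interest_score(["Sad", "Angry"]))  # 0.5, far less useful
```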