unintended—are more difficult to regulate, as the groups of people who see
the same advertisements are far smaller and more diffuse than TV markets or
subscribers to a particular magazine. Finally, what happens when the (private)
data collected to make this hyper-targeted advertising effective is sold, leaked,
or hacked?
We’ll unpack these questions in Chapter 4 and beyond. But first, let’s get our
heads around how the attention economy, our cognitive limitations, and
personalized, targeted content work together in the increasingly central
feature of modern human society: the algorithmic news feed.
Chapter 3

Swimming Upstream

How Content Recommendation Engines Impact Information and Manipulate Our Attention
As we have shifted from an information economy to an attention economy in
the past two decades, we have almost simultaneously shifted from mass media
to social media. The drastic increase in available media necessitates a way for
individuals to sift through the media that is literally at their fingertips. Content
recommendation systems have emerged as the technological solution to this
social/informational problem. Understanding how recommendation system
algorithms work, and how they reinforce (and even exaggerate) unconscious
human bias, is essential to understanding the way data influences opinion.
What’s New?
Online advertising can be uncanny.
My wife was on Facebook a while back and noticed an odd advertisement.
Facebook was suggesting that she might be interested in a set of patio seat
cushions. That’s funny, she thought, I was just on the phone with my mom, and
she told me she was out looking for just such a cushion! A couple weeks later, the
same thing happened with a table lamp from a store she hadn’t shopped at in
years, but her parents had just been to. I asked her if she checked Facebook
from her parents’ computer when we visited the previous month, and, of
course, the answer was yes. She was sure to log out when she was done, but
the cookie Facebook installed into her parents’ web browser was still there,
phoning home, while her parents shopped online for seat cushions and lamps.
A few web searches (and credit card purchases) later, her parents’ shopping
had been synched to her profile on Facebook’s servers. So as her parents
perused household items, she saw ads for similar items in her feed.
In that case, Facebook made a mistake, conflating her online identity with that
of her parents. But sometimes these platforms get it a little too right.
You may have heard the story in the news a few years ago. A father was livid
to receive advertisements from Target in the mail, addressed to his teenage
daughter, encouraging her to buy diapers, cribs, baby clothes, and other
pregnancy-related items. “Are you trying to encourage her to get pregnant?”
he reportedly asked the store manager. A few days later, the manager phoned
the father to reiterate his apology. But the father told him, “It turns out
there’s been some activities in my house I haven’t been completely aware of.
[My daughter]’s due in August. I owe you an apology.”1
It turns out that Target, like many retailers today, was using customers’
purchasing history to predict future purchases and send them advertisements
and individualized coupons based on those predictions, all in an effort to
convince existing customers to turn Target into their one-stop shop.
But it’s not just advertisers who feed user data into predictive models that
generate hyper-targeted results. In her book Algorithms of Oppression: How
Search Engines Reinforce Racism, Safiya Umoja Noble tells the story of a harrowing
Google search. Looking for resources on growing up as a young woman of color
in rural America to share with her daughter and her friends, Noble picked up
her laptop and searched for “black girls” on Google. She writes:
I had almost inadvertently exposed them to one of the most graphic
and overt illustrations of what the advertisers already thought of
them: Black girls were still the fodder of porn sites, dehumanizing
them as commodities, as products and as objects of sexual gratification. I closed the laptop and redirected our attention to fun things
we might do, such as see a movie down the street.2
1. Charles Duhigg, “How Companies Learn Your Secrets,” The New York Times Magazine, published February 16, 2012, www.nytimes.com/2012/02/19/magazine/shopping-habits.html.
2. Safiya Umoja Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (New York University Press, 2018), p. 18.
Noble’s book is full of other example searches that, like the ones we examined
in Chapter 1, return predictive results based on the biases and prejudices of
society—image searches for “gorillas” that return pictures of African
Americans, searches for “girls” that return older and more sexualized results
than searches for “boys,” etc. In these cases, the results are not “personalized”—
Noble was clearly not looking for the pornography that Google’s search
engine provided her in the first page of search results. Rather, the results absorb the biases of the programmers who created the algorithm and the biases of the users who search for “black girls” and, apparently, click on links to pornography more often than on links to web sites containing resources meant to empower young women of color. Those biases then resurface as results that predict what most users “want” to see.
Google has since addressed problems like this. Now, a search for something
neutral like “black girls” on my computer (with SafeSearch off) returns sites
like “Black Girls Rock!” and “Black Girls Code,” as well as images of black
women fully dressed. (Note that only a minority of the images included actual
girls, as opposed to grown women. Progress is progress, but we still have work
to do…)
But what happens when the search itself isn’t neutral? What if the search is
explicitly racist?
Before terrorist Dylann Roof committed mass murder in Charleston, South
Carolina, in 2015, he left behind a “manifesto.” In that manifesto he claims
(and let me be clear, I take the claims made in any terrorist’s manifesto with a
humungous grain of salt) that he was motivated to pursue his “race war” by a
Google search. According to Roof, he searched for “black on white crime,”
and the resulting pages provided him with all the information he needed to be
self-radicalized.3 Google has since made updates that correct searches like
this, as well, to the point that three of the top ten results for that search on
my computer today lead me to official U.S. government crime statistics. But
in 2015, people looking for government crime statistics broken down by
demographic rarely searched using terms like “black on white crime” or “black on black crime.” Those were phrases most often used by racists, both in their
searches and on their web sites. That very search reflects the likely racism of
the search engine user, and even a “neutral” algorithm would likely predict
racist web sites to be the most desirable results.
Now, let me be clear. When someone who clearly has the capacity to kill nine
perfect strangers in a church at a Bible study searches for “black on white
crime,” that person is already well on the route to radicalization. Dylann Roof
was not radicalized by a Google search. But given all search engines’ ability to
return biased results for a “neutral” search like “black girls,” it’s important to
3. “Google and the Miseducation of Dylann Roof,” Southern Poverty Law Center, published January 18, 2017, www.splcenter.org/20170118/google-and-miseducation-dylann-roof.
note how biased the predictive results can be when the input itself is biased,
even hateful.
The bias-multiplication effect of search engine algorithms is perhaps seen
most starkly in what happened to The Guardian reporter Carole Cadwalladr a few years ago. She writes:
One week ago, I typed “did the hol” into a Google search box and
clicked on its autocomplete suggestion, “Did the Holocaust happen?”
And there, at the top of the list, was a link to Stormfront, a neo-Nazi
white supremacist web site, and an article entitled “Top 10 reasons
why the Holocaust didn’t happen.”4
Perhaps even more scandalous than the anti-Semitic search results was
Google’s initial response. Google issued a statement, saying, “We are saddened
to see that hate organizations still exist. The fact that hate sites appear in
search results does not mean that Google endorses these views.” But they
declined at the time to remove the neo-Nazi sites from their search results.
Eventually, Google did step in and correct the results for that search,5 so that
today a Google search for “Did the holocaust happen?” returns an entire page
of results confirming that it did, in fact, happen and explaining the phenomenon
of Holocaust denial. In my search just now, there was even an article in the
first page of results from The Guardian about the history of this search term
scandal.
Google’s initial response—that the algorithm was neutral, and that it would
be inappropriate to alter the search results manually, regardless of how
reprehensible they may be—hits at the core of what we will unpack in this
chapter. This idea of algorithmic neutrality is a data science fallacy. Developing
a machine learning model always means refining the algorithm when the
outputs do not match the expectations or demands of the model. That’s
because mistakes and biases always creep into the system, and sometimes
they can only be detected once the algorithm is put to work at a large scale.
In the case of search results, though, the model is constantly evolving, as
millions of searches are performed each day, each one altering the dataset the
model is “trained” on in real time. Not only will a search engine algorithm
propagate the conscious and unconscious bias of its programmers, it is also
open to being gamed consciously or unconsciously by its users.
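To make that feedback loop concrete, here is a deliberately stripped-down sketch in Python. Nothing in it reflects Google’s actual system; the query, the site names, and the click-through-rate rule are invented for illustration. But it captures the dynamic described above: every search and click becomes new “training data” on the fly, so whatever past users clicked, sincerely or as part of a coordinated campaign, rises for everyone who searches next.

# Hypothetical sketch of a ranking that learns from click feedback in real time.
# This is not any platform's actual algorithm; names and numbers are invented.

from collections import defaultdict

impressions = defaultdict(int)   # times a (query, result) pair has been shown
clicks = defaultdict(int)        # times that pair has been clicked

def record(query, result, clicked):
    """Log one impression; each search updates the data the ranking relies on."""
    impressions[(query, result)] += 1
    if clicked:
        clicks[(query, result)] += 1

def rank(query, candidates):
    """Order results by observed click-through rate: what past users clicked rises."""
    def ctr(result):
        shown = impressions[(query, result)]
        return clicks[(query, result)] / shown if shown else 0.0
    return sorted(candidates, key=ctr, reverse=True)

# A burst of users clicking one result is enough to push it to the top
# for everyone who performs the same search afterward.
for _ in range(100):
    record("example query", "fringe-site.example", clicked=True)
    record("example query", "reference-site.example", clicked=False)

print(rank("example query", ["reference-site.example", "fringe-site.example"]))

In a real system the “model” is far more than a lookup table of click counts, but the vulnerability is the same: the outputs are only as unbiased as the behavior that feeds them.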
4. Carole Cadwalladr, “How to bump Holocaust deniers off Google’s top spot? Pay Google,” The Observer, published December 17, 2016, www.theguardian.com/technology/2016/dec/17/holocaust-deniers-google-search-top-spot.
5. Jeff John Roberts, “Google Demotes Holocaust Denial and Hate Sites in Update to Algorithm,” Fortune, published December 20, 2016, http://fortune.com/2016/12/20/google-algorithm-update/.
Left unchecked, a content recommendation algorithm is a bias amplifier. That
is as true for social network feeds as for search engines. Understanding how
they work on a general level can help us better understand online self-
radicalization, partisan polarization, and how actors can manipulate platforms
to advance their agendas on a massive scale. It can also help us think more
critically about the information we encounter in our own social feeds and
search results, and give us the foothold needed to effect positive change in the
online information space.
So let’s dive in…
How the Stream Works
Social media and search algorithms are important trade secrets. Outside of
the developers who work on those algorithms directly, no one can know with
100% certainty how they work in all the gory details. However, equipped with
a general knowledge of how machine learning algorithms work, alongside experience observing these models’ inputs and outputs, we can reverse engineer a fair amount of the details. Add that to research papers published
by the data scientists behind these models, and we can paint a fairly good
picture of how they work and the effects they have on society.
Let’s use a search engine as an example. When someone performs an image
search, the search algorithm, or model, considers a number of inputs alongside
the search term in order to determine the output—the result images, ordered so that the ones most likely to be clicked appear at the top of the list.
In addition to the search query itself, the model considers information from
my user profile and past activity.6 For example, if my profile tells it that I
currently live in the United States, I’ll likely see different results than if my
profile tells the model I live in India or Venezuela. My profile for most platforms
also contains information about my gender, age, education, marital status,
race, etc., either because I provided that information or because the platform
inferred it from my activity.7 Inferred data is, of course, imperfect—one of my
Google accounts thinks I’m male, another thinks I’m female, and both have
wildly misidentified my taste in music and sports—but it is used to personalize
content alongside user-provided data.
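To make that concrete, here is a toy Python sketch of how a single profile field, the user’s country, might reorder results for one and the same query. Everything in it is invented for illustration (the result list, the field names, the size of the boost); a real search stack encodes location and demographic signals as features in a learned ranking model, not a hand-written rule.

# Hypothetical illustration: the same query, reordered by one profile signal.
# All data, field names, and the boost value are invented for this sketch.

results = [
    {"title": "Cricket highlights", "region": "IN", "base_score": 0.80},
    {"title": "NFL highlights", "region": "US", "base_score": 0.78},
    {"title": "Football (soccer) highlights", "region": "GLOBAL", "base_score": 0.75},
]

def personalize(results, profile):
    """Boost results whose region matches the user's stated or inferred country."""
    def boosted(result):
        bonus = 0.10 if result["region"] == profile.get("country") else 0.0
        return result["base_score"] + bonus
    return sorted(results, key=boosted, reverse=True)

print([r["title"] for r in personalize(results, {"country": "US"})])
print([r["title"] for r in personalize(results, {"country": "IN"})])

Run with a U.S. profile, the NFL result jumps ahead; run with an Indian profile, the cricket result does. Same query, different “reality,” driven entirely by who the platform thinks you are.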
Depending on the platform, various kinds of past activity are weighed to
determine the content recommended by the algorithm. For a search engine,
that may be past searches and the results we clicked. For advertising platforms,
6. “How search algorithms work,” Google, www.google.com/search/howsearchworks/algorithms/.
7. You can find some of the information Google has inferred about you from your online activity at https://adssettings.google.com/.
it may include purchases made at partner stores, products viewed on partner
e-commerce web sites, or retailers in our area frequented by many of our
social media contacts. For a social network content feed, it likely includes the
kinds of content we post and the kinds of content we “engage” with most, measured
by likes/favorites, comments, clicks, and in some cases, the amount of time we
spend looking at it before we scroll by.8
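Here, as a hedged sketch, is how a feed might weigh those engagement signals. The signal names and the hand-picked weights are hypothetical; real feed-ranking models learn their weights from enormous logs of clicks, likes, comments, and viewing time. The point is only that content resembling what we engaged with before gets a head start.

# Toy example of engagement-based feed ranking. Signal names and weights are
# hypothetical; real systems learn these weights rather than hard-coding them.

def engagement_score(post, history):
    """Score a post by how closely it matches this user's past engagement."""
    return (2.0 * history["click_rate_on_topic"].get(post["topic"], 0.0)
            + 1.5 * history["like_rate_for_author"].get(post["author"], 0.0)
            + 0.5 * post["avg_seconds_viewed"] / 60.0)  # how long users linger on it

history = {
    "click_rate_on_topic": {"sports": 0.8, "politics": 0.2},
    "like_rate_for_author": {"@coach": 0.9},
}

posts = [
    {"id": 1, "topic": "sports", "author": "@coach", "avg_seconds_viewed": 30},
    {"id": 2, "topic": "politics", "author": "@pundit", "avg_seconds_viewed": 90},
]

feed = sorted(posts, key=lambda p: engagement_score(p, history), reverse=True)
print([p["id"] for p in feed])  # the sports post from a liked author comes first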
But there’s a third, and probably the most important, category of inputs that
help the model determine the content we see: data that has nothing to do with us.
This, of course, includes information about the content under consideration—
the data and metadata that makes that content unique, general information
about its popularity, etc. But it also contains information about the activity of
other users.9
Data from other users is key for any content recommendation engine.
Without that data, the recommendation engine would have only that user’s past activity as the basis for determining the best content to surface next. Because every
person and situation is unique, that small amount of data would provide very
little context in which to make a decision. In many cases, it would be flying
blind, with no relevant data to draw on to make a prediction. As a result, the
algorithm’s recommendations would be rather poor.
Think about it this way. In many respects, a content recommendation engine
is like a dating app, except instead of matching a person with another person,
it matches a person with content. The difference is that a dating app can require a minimum amount of data from a user before proceeding, ensuring that user matches can be filtered and ranked according to complete profiles.10
The same is true for the algorithms that match medical school graduates with residencies.11 But in the case of search engines, social networks, and even
“personalized” educational apps, the algorithm needs to filter and rank content
that’s unlike any the user has engaged with before. The solution is to join the
user’s data with that of other users in a process called collaborative filtering.
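Before the musical example that follows, here is a bare-bones Python sketch of user-based collaborative filtering. The listeners, songs, and ratings are invented, and production systems work with vastly larger, sparser data and more sophisticated math, but the core move is the same: find users whose history looks like yours, and recommend what they engaged with that you haven’t yet.

# Minimal user-based collaborative filtering sketch with invented data.
# Real recommendation engines use far more data and more robust methods.

import math

ratings = {
    "alice": {"song_a": 5, "song_b": 4, "song_c": 1},
    "bob":   {"song_a": 4, "song_b": 5, "song_d": 4},
    "carol": {"song_c": 5, "song_d": 2, "song_e": 4},
}

def similarity(u, v):
    """Cosine similarity over the songs two users have both rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][s] * ratings[v][s] for s in shared)
    norm_u = math.sqrt(sum(ratings[u][s] ** 2 for s in shared))
    norm_v = math.sqrt(sum(ratings[v][s] ** 2 for s in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Rank songs the user hasn't rated by how similar users rated them."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for song, rating in ratings[other].items():
            if song not in ratings[user]:
                scores[song] = scores.get(song, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # songs Alice hasn't heard, ranked by her "taste neighbors"

Because Alice’s own data says nothing about songs she has never heard, the engine leans on listeners who resemble her, which is exactly the move the next paragraphs describe.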
To understand collaborative filtering, let’s consider a simplistic profile for
musical taste. Suppose there were a number of musical features that were
8. I don’t want to call out any single developer or company here, but a web search for “track user scrolls on a page” returns a number of solutions for web developers who want to track what web site visitors scroll by and what they don’t.
9. This is even true for “personalized” education apps. See “Knewton Adaptive Learning: Building the world’s most powerful education recommendation engine,” Knewton, accessed July 24, 2017, https://cdn.tc-library.org/Edlab/Knewton-adaptive-learning-white-paper-1.pdf.