that person when they turn forty.4
17.3 Data Repurposing, AI, and Privacy
The lengthy time frame that the digital persistence of data implies increases uncertainty about how the data will be used. This is because, once created, a piece of data can be reused an infinite number of times. As prediction costs fall, the number of circumstances and occasions in which data may be used generally expands. If an individual is unable to reasonably anticipate how their data may be repurposed, or what the data may predict in a repurposed setting, modeling their choices over the creation of their data becomes more difficult and problematic than in our current, very deterministic models, which assume certainty over how data will be used.
17.3.1 Unanticipated Correlations
There may be correlations in behavior across users that may not be anticipated when data is created, and it is in these kinds of spillovers that the largest potential consequences of AI for privacy may be found.
One famous example of this is that someone liking (or disliking) curly fries
on Facebook would have been unable to reasonably anticipate it would be
3. https://trends.google.com/trends/.
4. As discussed in articles such as http://www.nature.com/news/2008/080624/full/news.2008.913.html, DNA does change somewhat over time, but that change is itself somewhat predictable.
predictive of intelligence (Kosinski, Stillwell, and Graepel 2013) and therefore potentially used as a screening device by algorithms aiming to identify desirable employees or students.5
In these cases, an algorithm could potentially make a projection based on a correlation in the data, using data that was created for a different purpose. The consequence for economic models of privacy is that such models assume a single use of data, rather than allowing for the potential of reuse in unpredictable contexts.
17.3.2 Unanticipated Distortions in Correlations
However, even supposing that individuals were able to reasonably anticipate the repurposing of their data, there are further challenges in thinking about their ability to project the distortions that might come about as a result of that repurposing.
The potential for distortions based on correlations in data is something
we investigate in new research.6
In Miller and Tucker (2018) we document the distribution of advertising by an advertising algorithm that attempts to predict a person's ethnic affinity from their data online. We ran multiple parallel ad campaigns targeted at African American, Asian American, and Hispanic ethnic affinities. We also ran an additional campaign targeted at those judged to not have any of these three ethnic affinities. These campaigns highlighted a federal program designed to enhance pathways to a federal job via internships and career guidance.7 We ran this ad for a week and collected data on how many people
the ad was shown to in each county. We found that, relative to what would be predicted by each county's actual demographic makeup in the census data, the ad algorithm tended to overpredict the presence of African Americans in states with a historical record of discrimination against African Americans. This pattern holds both for states that allowed slavery at the time of the American Civil War and for states that restricted the ability of African Americans to vote in the twentieth century. In such states, it was only the presence of African Americans that was overpredicted, not that of people with Hispanic or Asian American backgrounds.
We show that this cannot be explained by the algorithm responding to behavioral data in these states, as there was no difference in click-through patterns across the different campaigns between states with and without this history of discrimination.
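To make the nature of this county-level comparison concrete, here is a minimal sketch in Python of the kind of calculation described above. It is not the code from Miller and Tucker (2018); the input file, the column names, and the use of pandas and statsmodels are assumptions made purely for illustration.

```python
# Illustrative sketch (not the authors' code) of the county-level comparison
# described above: is the share of impressions that the African American-
# affinity campaign received in a county higher than the county's census share
# of African American residents, and is that gap larger in states with a
# historic record of discrimination? File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

counties = pd.read_csv("county_ad_delivery.csv")  # hypothetical input

# Share of all impressions in the county delivered under the African American-
# affinity campaign, versus the census share of African American residents.
counties["ad_share_aa"] = counties["impressions_aa"] / counties["impressions_total"]
counties["overprediction_aa"] = counties["ad_share_aa"] - counties["census_share_aa"]

# Mean over-prediction split by a 0/1 indicator for a historic record of
# discrimination (e.g., slave state in 1861, or twentieth-century voting
# restrictions).
print(counties.groupby("historic_discrimination")["overprediction_aa"].mean())

# The same comparison expressed as a simple regression (controls and fixed
# effects omitted for brevity).
model = smf.ols("overprediction_aa ~ historic_discrimination", data=counties).fit()
print(model.summary())
```

The grouped means mirror the descriptive comparison in the text, while the regression line is simply one way such a gap would typically be tested.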
5. This study found that the best predictors of high intelligence include Thunderstorms, The Colbert Report, Science, and Curly Fries, whereas low intelligence was indicated by Sephora, I Love Being A Mom, Harley Davidson, and Lady Antebellum.
6. This new research will be the focus of my presentation at the NBER meetings.
7. For details of the program, see https://www.usajobs.gov/Help/working-in-government/unique-hiring-paths/students/.
We discuss how this can be explained by four facts about how the algorithm operates:
1. The algorithm identifies a user as having a particular ethnic affinity based on their liking of cultural phenomena such as celebrities, movies, TV shows, and music.
2. People who have lower incomes are more likely to use social media to
express interest in celebrities, movies, TV shows, and music.
3. People who have higher incomes are more likely to use social media to express their thoughts about politics and the news.8
4. Research in economics has suggested that African Americans are more
likely to have lower incomes in states that have exhibited historic patterns of
discrimination (Sokoloff and Engerman 2000; Bertocchi and Dimico 2014).
The empirical regularity that an algorithm predicting race is more likely to predict that someone is black in geographies with historic patterns of discrimination matters because it highlights the potential for historical persistence in algorithmic behavior. It suggests that the dynamic consequences of earlier history may affect how artificial intelligence makes predictions. When that earlier history is repugnant, this is even more concerning. In this particular case, the issue is the use of a particular piece of data to predict a trait when the generation of that data is itself endogenous.
This emphasizes that privacy policy in a world of predictive algorithms is more complex than in a straightforward world where individuals make binary decisions about their data. In our example, it would seem problematic to bar low-income individuals from expressing their identities via their affinity with the musical or visual arts. However, their doing so is likely to lead to a prediction that they belong to a particular ethnic group. They may not be aware ex ante of the risk that disclosing a musical preference may cause Facebook to infer an ethnic affinity and advertise to them on that basis.
17.3.3 Unanticipated Consequences of Unanticipated Repurposing
In most economic models, a consumer's prospective desire for privacy in their data depends on the consumer being able to accurately forecast
the uses to which the data is put. One problem with data privacy is that AI/
algorithmic use of existing data sets may be reaching a point where data
can be used and recombined in ways that people creating that data in, say,
2000 or 2005, could not reasonably have foreseen or incorporated into their
decision-making at the time.
Again, this brings up legal concerns where an aggregation, or mosaic,
of data on an individual is held to be sharply more intrusive than each datum considered in isolation. In United States v. Jones (2012), Justice Sotomayor wrote in a well-known concurring opinion, “It may be necessary to
8. One of the best predictors of high income on social media is a liking of Dan Rather.
reconsider the premise that an individual has no reasonable expectation of privacy in information voluntarily disclosed to third parties [. . .]. This approach is ill suited to the digital age, in which people reveal a great deal of information about themselves to third parties in the course of carrying out mundane tasks.” Artificial intelligence systems have shown themselves able to develop very detailed pictures of individuals' tastes, activities, and opinions based on the analysis of aggregated information about our now digitally intermediated mundane tasks. Part of the risk in a mosaic approach for firms is that data previously considered neither personally identifiable nor personally sensitive (such as ZIP code, gender, or age to within ten years) may, when aggregated and analyzed by today's algorithms, suffice to identify you as an individual.
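As a concrete illustration of this mosaic risk, the sketch below counts how many people in a tabular data set are uniquely pinned down by ZIP code, gender, and a ten-year age band alone. The file and column names are hypothetical; the point is only that the calculation is trivial once the data are aggregated.

```python
# Illustrative sketch of the "mosaic" point: fields that seem harmless in
# isolation (ZIP code, gender, a ten-year age band) can jointly single people
# out. The file and column names here are hypothetical.
import pandas as pd

people = pd.read_csv("people.csv")  # hypothetical file, one row per person
people["age_band"] = (people["age"] // 10) * 10  # age known only to within ten years

# Size of the (ZIP, gender, age band) cell each person falls into.
cell_sizes = people.groupby(["zip_code", "gender", "age_band"])["age"].transform("size")

# Share of individuals who are the only person in their cell, i.e., potentially
# re-identifiable from these three "non-sensitive" attributes alone.
unique_share = (cell_sizes == 1).mean()
print(f"Share unique on ZIP x gender x age band: {unique_share:.1%}")
```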
This general level of uncertainty surrounding the future use of data, coupled with the certainty that it will potentially be useful to firms, affects the ability of a consumer to make a clear choice about whether to create or share data. Such risk and uncertainty about how private data may be used has implications for how an individual may form and act on their preferences regarding privacy.
17.4 Data Spillovers, AI, and Privacy
In the United States, privacy has been defined as an individual right, specifically an individual's right to be left alone (Warren and Brandeis 1890) (in this specific case, from journalists with cameras).
Economists' attempts to devise a utility function that captures privacy have reflected this individualistic view. A person has a preference for keeping information secret (or not) because of the potential consequences for their interaction with a firm. So far, these privacy models have not reflected the possibility that another person's preferences or behavior could create spillovers for this process.
17.5 Some Types of Data Used by Algorithms May Naturally Generate Spillovers
For example, in the case of genetics, the decision to create genetic data has immediate consequences for family members, since one individual's genetic data is significantly similar to the genetic data of their family members. This creates privacy spillovers for relatives of those who upload their genetic profile to 23andme. Data that predicts I may suffer from bad eyesight or macular degeneration later in life could be used to reasonably predict that those who are related to me by blood may also be more likely to share a similar risk profile.
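A toy calculation along these lines is sketched below. The baseline and predicted risk figures and the simple linear updating rule are illustrative assumptions rather than anything taken from the chapter; only the coefficients of relatedness, roughly the share of DNA shared with each type of relative, are standard.

```python
# Toy illustration of the genetic spillover: once an elevated risk is predicted
# for one person, a naive model would shift relatives' predicted risk toward it
# in proportion to shared ancestry. Risk numbers and the linear updating rule
# are illustrative assumptions; only the relatedness coefficients are standard.
BASELINE_RISK = 0.05            # assumed population risk of the condition
PREDICTED_RISK_UPLOADER = 0.40  # assumed risk predicted for the person who shared data

RELATEDNESS = {
    "identical twin": 1.0,
    "parent, child, or full sibling": 0.5,
    "grandparent, aunt, or uncle": 0.25,
    "first cousin": 0.125,
}

for relative, share in RELATEDNESS.items():
    # Linear interpolation between the population risk and the uploader's
    # predicted risk, weighted by shared ancestry. Real genetic risk models are
    # far more involved; the point is only that the spillover shrinks with
    # relatedness but does not disappear.
    implied = BASELINE_RISK + share * (PREDICTED_RISK_UPLOADER - BASELINE_RISK)
    print(f"{relative}: implied risk ≈ {implied:.0%}")
```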
Of course, one hopes that an individual would be capable of internalizing the potential externalities that revealing genetic data imposes on family members, but
it does not seem far-fetched to imagine situations of estrangement where such internalizing would not happen and there would be a clear externality.
Outside the realm of genetic data, there are other kinds of data that by their nature may create spillovers. These include photo, video, and audio data taken in public places. Such data may be created for one purpose, such as a recreational desire to use video to capture a memory or to enhance security, but may incidentally create data about other individuals whose voices or images are captured without their being aware that data about them is being recorded.
distinguished between the idea of a private realm where an individual has
an expectation of privacy and a public realm where an individual can have
no reasonable expectation of privacy. For example, in the Supreme Court
case California v. Greenwood (1988), the court refused to accept that an
individual had a reasonable expectation of privacy in garbage he had left
on the curb.
However, in a world where people use mobile devices and photo capture extensively, facial recognition allows accurate identification of any individual while out in public, and individuals have difficulty avoiding such identification. Encoded in the notion that we do not have a reasonable expectation of privacy in the public realm are two potential errors: that one's presence in a public space is usually transitory enough not to be recorded, and that any record of one's activities in a public space will not usually be retained, parsed, and exploited for future use. Consequently, the advance
of technology muddies the allocation of property rights over the creation of data. In particular, it is not clear that video footage of my behavior in public spaces, which can potentially be used to accurately predict economically meaningful outcomes such as health outcomes, can simply be dismissed as arising in a context where I had no expectation of privacy, or at least no right to control the creation of the data. In any case, because of the somewhat incidental nature of their creation, these new forms of data seem to undermine the assumption of clear-cut, easily definable property rights over data that is integral to most economic models of privacy.
17.5.1 Algorithms Themselves Will Naturally Create Spillovers across Data
One of the major consequences of AI and its ability to automate prediction is that there may be spillovers between individuals and other economic agents. There may also be spillovers from a person's decision to keep some information secret, if such secrecy itself predicts other aspects of that individual's behavior that AI might be able to project from.
Research has documented algorithmic outcomes that appear to be discriminatory, and has argued that such outcomes may occur because the algorithm itself will learn to be biased on the basis of the behavioral data that
feeds it (O'Neil 2017). Documented allegations of algorithmic bias range from Asians being charged more for test-taking prep software,9 to black names being more likely to produce ads for criminal record checks (Sweeney 2013), to women being less likely to see ads for an executive coaching service (Datta, Tschantz, and Datta 2015).
Such data-based discrimination is often held to be a privacy issue (Custers et al. 2012). The argument is that it is abhorrent for a person's data to be used to discriminate against them, especially if they did not explicitly consent to its collection in the first place. However, though not often discussed in the legally oriented literature on data-based discrimination, there are many links between the fears expressed about the potential for data-based discrimination and the earlier economics literature on statistical discrimination. In much the same way that some find it distasteful when an employer extrapolates from general data on fertility decisions and their consequences among women to project similar expectations of fertility and behavior onto an individual female employee, an algorithm making similar extrapolations may be found equally distasteful. Such instances of statistical discrimination by algorithms may reflect spillovers of predictive power across individuals, which in turn may not necessarily be internalized by each individual.
However, as yet there have been few attempts to understand why ad algorithms can produce apparently discriminatory outcomes, or whether the digital economy itself may play a role in the apparent discrimination. I argue that, above and beyond the obvious similarity to the statistical discrimination literature in economics, apparent discrimination can sometimes best be understood as reflecting spillovers in algorithmic decision-making. This makes the issue of privacy broader than the potential for an individual's own data to be used to discriminate against them.
In Lambrecht and Tucker (forthcoming), we discuss a field study of apparent algorithmic bias. We use data from a field test of the display of an ad for jobs in the science, technology, engineering, and math (STEM) fields. This ad was less likely to be shown to women. This appeared to be an algorithmic outcome, as the advertiser had intended the ad to be gender neutral. We explore various explanations for why the algorithm acted in an apparently discriminatory way, and rule out an obvious set of them. For example, it is not because the predictive algorithm has fewer women to show the ad to, and it is not the case that the predictive algorithm learns that women are less likely to click the ad: conditional on being shown the ad, women are more likely to click on it than men. In other words, this is not simply statistical discrimination. We also show it is not that
9. https://www.propublica.org/article/asians-nearly-twice-as-likely-to-get-higher-price-from-princeton-review. In this case, the alleged discrimination apparently stemmed from the fact that Asians are more likely to live in cities that have higher test prep prices.
the algorithm learned from local behavior that may historically have been
biased against women. We use data from 190 countries and show that the