by Finn Brunton
If the use of “quote stuffing” were to spread, it might threaten the very
integrity of the stock market as a working system by overwhelming the physical infrastructure on which the stock exchanges rely with hundreds of thousands of useless quotes consuming bandwidth. “This is an extremely disturbing development,” the observer quoted above adds, “because as more HFT
systems start doing this, it is only a matter of time before quote-stuffing shuts down the entire market from congestion.”5
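The mechanics can be made concrete with a toy sketch (ours, purely illustrative; real high-frequency systems operate at microsecond scales through exchange protocols, not Python). The point is the ratio: a burst of placed-and-immediately-canceled quotes swamps the handful of genuine quotes that every other participant's systems must nonetheless receive and parse.

```python
import random

def genuine_quotes(n):
    """A trickle of real, actionable quotes: (action, order_id, price)."""
    return [("quote", f"real-{i}", round(random.uniform(99.0, 101.0), 2))
            for i in range(n)]

def stuffing_burst(n):
    """A burst of quotes placed and canceled too quickly to act on."""
    msgs = []
    for i in range(n):
        oid = f"stuff-{i}"
        price = round(random.uniform(99.0, 101.0), 2)
        msgs.append(("quote", oid, price))   # placed ...
        msgs.append(("cancel", oid, price))  # ... and immediately withdrawn
    return msgs

# Ten real quotes drowned in five thousand place/cancel pairs.
feed = genuine_quotes(10) + stuffing_burst(5000)
random.shuffle(feed)

noise = sum(1 for _, oid, _ in feed if oid.startswith("stuff"))
print(f"{len(feed)} messages; {noise / len(feed):.1%} are stuffing")
```

Nothing in the burst is actionable, yet all of it consumes the bandwidth on which genuine price discovery depends.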
2.6 Swapping loyalty cards to interfere with analysis of
shopping patterns
Grocery stores have long been in the technological vanguard when it comes to working with data. Relatively innocuous early loyalty-card programs were
used to draw repeat customers, extracting extra profit margins from people who didn’t use the card and aiding primitive data projects such as organizing direct mailings by ZIP code. The vast majority of grocers and chains outsourced the business of analyzing data to ACNielsen, Catalina Marketing, and a few other companies.6 Although these practices were initially perceived as isolated and inoffensive, a few incidents altered the perception of purpose from innocuous and helpful to somewhat sinister.
In 1999, a slip-and-fall accident in a Los Angeles supermarket led to a
lawsuit, and attorneys for the supermarket chain threatened to disclose the victim’s history of alcohol purchases to the court.7 A string of similar cases over the years fed a growing suspicion in the popular imagination that so-called loyalty cards were serving ends beyond the allotment of discounts. Soon after their widespread introduction, card-swapping networks developed. People
shared cards in order to obfuscate data about their purchasing patterns—
initially in ad hoc physical meetings, then, with the help of mailing lists and online social networks, increasingly in large populations and over wide
28
CHAPTER 2
geographical regions. Rob’s Giant Bonus Card Swap Meet, for instance, started from the idea that a system for sharing bar codes could enable customers of the DC-area supermarket chain Giant to print out the bar codes of other customers and then paste them onto their cards.8 Similarly, the Ultimate Shopper project fabricated and distributed stickers imprinted with the bar code from a Safeway loyalty card, thereby creating “an army of clones” whose shopping
data would be accrued.9 Cardexchange.org, devoted to exchanging loyalty
cards by mail, presents itself as a direct analogue to physical meet-ups held for the same purpose. The swapping of loyalty cards constitutes obfuscation as a group activity: the greater the number of people who are willing to share their cards, and the farther the cards travel, the less reliable the data become.
Card-swapping websites also host discussions and post news articles
and essays about differing approaches to loyalty-card obfuscation and some of the ethical issues they raise. Negative effects on grocery stores are of concern, as card swapping degrades the data available to them and perhaps to other recipients. It is worth noting that such effects are contingent both on the card programs and on the approaches to card swapping. For example, sharing of a loyalty card within a household or among friends, though it may deprive a store of individual-level data, may still provide some useful information about shopping episodes or about product preferences within geographic areas. The value of data at the scale of a postal code, a neighborhood, or a district is far from insignificant. And there may be larger patterns to be inferred from the genuine information present in mixed and mingled data.
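A toy sketch (ours; the shopper names and baskets are invented) shows why swapping degrades individual-level data: once several households shop on one card, the per-card profile describes a composite person who does not exist.

```python
from collections import Counter

# Invented purchase histories for three shoppers who share one loyalty card.
shoppers = {
    "alice": ["wine", "cheese", "wine", "bread"],
    "bob":   ["diapers", "formula", "bread"],
    "carol": ["tofu", "kale", "bread", "kale"],
}

# What the store's analysts see: a single card with one mingled history.
card_profile = Counter(item for basket in shoppers.values() for item in basket)
print(card_profile.most_common(3))
```

No single shopper drinks wine, buys formula, and favors kale; the blended profile points at a household that does not exist. The further the card travels beyond one household, the less the record says about anyone, though aggregate signals (bread sells well in this neighborhood) survive, as the text notes.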
2.7 BitTorrent Hydra: using fake requests to deter collection
of addresses
BitTorrent Hydra, a now-defunct but interesting and illustrative project, fought the surveillance efforts of anti-file-sharing interests by mixing genuine
requests for bits of a file with dummy requests.10 The BitTorrent protocol broke a file into many small pieces and allowed users to share files with one another by simultaneously sending and receiving the pieces.11 Rather than download an entire file from another user, one assembled it from pieces obtained from anyone else who had them, and anyone who needed a piece that you had could get it from you. This many-pieces-from-many-people approach expedited the
sharing of files of all kinds and quickly became the method of choice for
moving large files, such as those containing movies and music.12 To help users
of BitTorrent assemble the files they needed, “torrent trackers” logged IP
addresses that were sending and receiving files. For example, if you were
looking for certain pieces of a file, torrent trackers would point you to the addresses of users who had the pieces you needed. Representatives of the
content industry, looking for violations of their intellectual property, began to run their own trackers to gather the addresses of major unauthorized uploaders and downloaders in order to stop them or even prosecute them. Hydra
counteracted this tracking by adding random IP addresses drawn from those
previously used for BitTorrent to the collection of addresses found by the torrent tracker. If you had requested pieces of a file, you would be periodically directed to a user who didn’t have what you were looking for. Although a small inefficiency for the BitTorrent system as a whole, it significantly undercut the utility of the addresses that copyright enforcers gathered, which may have belonged to actual participants but which may have been dummy addresses
inserted by Hydra. Doubt and uncertainty had been reintroduced to the system, lessening the likelihood that one could sue with assurance. Rather than
attempt to destroy the adversary’s logs or to conceal BitTorrent traffic, Hydra provided an “I am Spartacus” defense. Hydra didn’t avert data collection;
however, by degrading the reliability of data collection, it called any specific findings into question.
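Hydra is defunct, so the following Python sketch is our reconstruction of the idea rather than its implementation: answer requests with a peer list in which genuine addresses are mixed with addresses drawn from a pool previously seen on the network.

```python
import random

# A pool of addresses previously observed on BitTorrent (invented here).
previously_seen = [f"203.0.113.{i}" for i in range(1, 200)]

def hydra_response(real_peers, decoy_ratio=0.5):
    """Answer a tracker query with genuine peers and decoys mixed together.

    An enforcement tracker logging this list cannot tell which addresses
    are actually serving the file and which are noise.
    """
    n_decoys = int(len(real_peers) * decoy_ratio / (1.0 - decoy_ratio))
    decoys = random.sample(previously_seen, n_decoys)
    mixed = list(real_peers) + decoys
    random.shuffle(mixed)
    return mixed

real = ["198.51.100.7", "198.51.100.8", "198.51.100.9", "198.51.100.10"]
peers = hydra_response(real)
print(len(peers))  # four genuine peers plus four decoys
```

The cost falls asymmetrically: a downloader occasionally wastes a request on a dead address, but a copyright enforcer can no longer treat any logged address as evidence.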
2.8 Deliberately vague language: obfuscating agency
According to Jacquelyn Burkell and Alexandre Fortier, the privacy policies of health information sites use particularly obtuse linguistic constructions when describing their use of tracking, monitoring, and data collection.13 Conditional verbs (e.g., “may”), passive voice, nominalization, temporal adverbs (e.g.,
“periodically” and “occasionally”), and the use of qualitative adjectives (as in
“small piece of data”) are among the linguistic constructions that Burkell and Fortier identify. As subtle as this form of obfuscation may seem, it is recognizably similar in operation to other forms we have already described: in place of a specific, specious denial (e.g., “we do not collect user information”) or an exact admission, vague language produces many confusing gestures of possible activity and attribution. For example, the sentence “Certain information may be passively collected to connect use of this site with information about the use of other sites provided by third parties” puts the particulars of what a site does with certain information inside a cloud of possible interpretations.
These written practices veer away from obfuscation per se into the more general domain of abstruse language and “weasel words.”14 However, for purposes of illustrating the range of obfuscating approaches, the style of obfuscated language is useful: a document must be there, a straightforward denial isn’t possible, and so the strategy becomes one of rendering who is doing
what puzzling and unclear.
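A crude way to see how such language could be audited is to flag the marker categories mechanically. The patterns below are our own illustrative approximations, not Burkell and Fortier's actual coding scheme.

```python
import re

# Illustrative marker patterns only, loosely keyed to the categories above:
# conditional verbs, temporal adverbs, qualitative adjectives, passive voice.
HEDGES = {
    "conditional": r"\b(?:may|might|could)\b",
    "temporal":    r"\b(?:periodically|occasionally|from time to time)\b",
    "qualitative": r"\b(?:small|certain|limited)\b",
    "passive":     r"\b(?:is|are|be)\b(?:\s+\w+ly)?\s+\w+ed\b",
}

def flag_hedges(sentence):
    """Return each marker category found in the sentence, with its matches."""
    found = {}
    for label, pattern in HEDGES.items():
        hits = re.findall(pattern, sentence, re.IGNORECASE)
        if hits:
            found[label] = hits
    return found

policy = ("Certain information may be passively collected to connect use of "
          "this site with information about the use of other sites provided "
          "by third parties.")
print(flag_hedges(policy))
```

Run against the example sentence from the text, even this crude detector lights up three categories at once, which is precisely the stylistic signature Burkell and Fortier describe.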
2.9 Obfuscation of anonymous text: stopping stylometric analysis
How much in text identifies it as the creation of one author rather than another?
Stylometry uses only elements of linguistic style to attribute authorship to anonymous texts. It doesn’t have to account for the possibility that only a certain person would have knowledge of some matter, for posts to an online forum, for other external clues (such as IP addresses), or for timing. It considers length of sentences, choice of words, and
syntax, idiosyncrasies in formatting and usage, regionalisms, and recurrent typographical errors. It was a stylometric analysis that helped to settle the debate over the pseudonymous authors of the Federalist Papers (for example, the use of “while” versus
“whilst” served to differentiate the styles of Alexander Hamilton and James Madison), and stylometry’s usefulness in legal contexts is now well
established.15
Given a small amount of text, stylometry can identify an author. And we
mean small—according to Josyula Rao and Pankaj Ratangi, a sample consisting of about 6,500 words is sufficient (when used with a corpus of identified text, such as email messages, posts to a social network, or blog posts) to make possible an 80 percent rate of successful identification.16 In the course of their everyday use of computers, many people produce 6,500 words in a
few days.
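The general approach can be sketched in a few lines: reduce each text to the frequencies of common function words (style rather than topic) and compare the resulting vectors. This is a minimal caricature of stylometry, not the method Rao and Ratangi evaluated; the author names and sentences are invented, echoing the “while”/“whilst” tell from the Federalist Papers example.

```python
import math
from collections import Counter

# A small set of function words; real systems use hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that",
                  "while", "whilst", "upon"]

def profile(text):
    """Function-word frequencies: a crude fingerprint of style, not topic."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented authors distinguished by a Hamilton/Madison-style tell.
known = {
    "author_a": "while the measure was debated the house held firm "
                "while others wavered",
    "author_b": "whilst the measure was debated the house held firm "
                "whilst others wavered",
}
anonymous = "while the vote was taken the members stood firm while some doubted"

scores = {name: cosine(profile(anonymous), profile(text))
          for name, text in known.items()}
print(max(scores, key=scores.get))  # attributes the anonymous text to author_a
```

The anonymous text shares no subject matter with either sample; the attribution rests entirely on habits of usage, which is what makes stylometry so difficult to evade by changing topic alone.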
Even if the goal is not to identify a specific author from a pool of known individuals, stylometry can produce information that is useful for purposes of surveillance. The technology activist Daniel Domscheit-Berg recalls the
moment when he realized that if WikiLeaks’ press releases, summaries of
leaks, and other public texts were to undergo stylometric analysis it would show that only two people (Domscheit-Berg and Julian Assange) had been
responsible for all those texts rather than a large and diverse group of volunteers, as Assange and Domscheit-Berg were trying to suggest.17 Stylometric analysis offers an adversary a more accurate picture of an “anonymous” or secretive movement, and of its vulnerabilities, than can be gained by other means. Having narrowed authorship down to a small handful, the adversary is in a better position to target a known set of likely suspects.
Obfuscation makes it practicable to muddle the signal of a public body of
text and to interfere with the process of connecting that body of text with a named author. Stylometric obfuscation is distinctive, too, in that its success is more readily tested than with many other forms of obfuscation, whose precise effects may be highly uncertain and/or may be known only to an uncooperative adversary.
Three approaches to beating stylometry offer useful insights into obfus-
cation. The first two, which are intuitive and straightforward, involve assuming a writing style that differs from one’s usual style; their weaknesses highlight the value of using obfuscation.
Translation attacks take advantage of the weaknesses of machine translation by translating a text into multiple languages and then translating it back into its original language—a game of Telephone that might corrupt an author’s style enough to prevent attribution.18 Of course, this also renders the text less coherent and meaningful, and as translation tools improve it may not do a
good enough job of depersonalization.
In imitation attacks, the original author deliberately writes a document in the style of another author. One vulnerability of that approach has been elegantly exposed by research.19 Using the systems you would use to identify
texts as belonging to the same author, you can determine the most powerful identifier of authorship between two texts, then eliminate that identifier from the analysis and look for the next-most-powerful identifier, then keep repeating the same process of elimination. If the texts really are by different people, accuracy in distinguishing between them will decline slowly, because beneath the big, obvious differences between one author and another there are many smaller and less reliable differences. If, however, both texts are by the same person, and one of them was written in imitation of another author, accuracy in distinguishing will decline rapidly, because beneath notable idiosyncrasies fundamental similarities are hard to shake.
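The elimination procedure just described can be sketched as follows, with total word-frequency difference standing in for classifier accuracy (a deliberate simplification; real stylometric systems use far richer features). The texts are invented; what matters is the shape of the two curves.

```python
from collections import Counter

def freqs(text):
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def separation(fa, fb, dropped):
    """Total frequency difference over features not yet eliminated: a crude
    stand-in for how well a classifier still tells the two texts apart."""
    vocab = (set(fa) | set(fb)) - dropped
    return sum(abs(fa.get(w, 0.0) - fb.get(w, 0.0)) for w in vocab)

def elimination_curve(text_a, text_b, rounds=5):
    """Repeatedly drop the most discriminating feature and re-measure."""
    fa, fb = freqs(text_a), freqs(text_b)
    dropped = set()
    curve = [separation(fa, fb, dropped)]
    for _ in range(rounds):
        vocab = (set(fa) | set(fb)) - dropped
        if not vocab:
            break
        strongest = max(vocab,
                        key=lambda w: abs(fa.get(w, 0.0) - fb.get(w, 0.0)))
        dropped.add(strongest)
        curve.append(separation(fa, fb, dropped))
    return curve

base = ("the ship sailed while the crew worked while the captain "
        "watched while waves rolled")
imitation = base.replace("while", "whilst")  # same author, one big tell changed
different = ("a garden grows and flowers bloom and bees hum and "
             "children play in summer")

curve_imitation = elimination_curve(base, imitation)
curve_different = elimination_curve(base, different)
print(curve_imitation)  # collapses to zero once the two tells are dropped
print(curve_different)  # declines slowly; many small differences remain
```

The imitation pair collapses as soon as its few big tells are removed, while the genuinely different pair keeps separating on the long tail of small habits, which is exactly the asymmetry the research exploits.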
Obfuscation attacks on stylometric analysis involve writing in such a way that there is no distinctive style. Researchers distinguish between “shallow”
and “deep” obfuscation of texts. “Shallow” obfuscation changes only a small number of the most obvious features—for example, preference for “while” or for “whilst.” “Deep” obfuscation runs the same system of classifiers used to defeat imitation, but does so for the author’s benefit. Such a method might provide real-time feedback to an author editing a document, identifying the highest-ranked features and suggesting changes that would diminish the
accuracy of stylometric analysis—for example, sophisticated paraphrasing. It might turn the banalities of “general usage” into a resource, enabling an author to blend into a vast crowd of similar authors.
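Such feedback might work roughly like this sketch (ours, far simpler than any real tool): rank the words whose frequency most separates the author from a reference “crowd” and surface them as candidates for rewriting.

```python
from collections import Counter

def freqs(text):
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def most_identifying(author_text, crowd_text, k=2):
    """Words whose usage most separates the author from the crowd:
    the features an editing tool would flag for rephrasing."""
    fa, fc = freqs(author_text), freqs(crowd_text)
    vocab = set(fa) | set(fc)
    ranked = sorted(vocab,
                    key=lambda w: abs(fa.get(w, 0.0) - fc.get(w, 0.0)),
                    reverse=True)
    return ranked[:k]

author = "whilst the plan proceeds we remain hopeful whilst risks persist"
crowd = "while the plan proceeds we remain hopeful while risks persist"
print(most_identifying(author, crowd))  # the author's 'whilst' habit stands out
```

Rewriting the flagged habits toward the crowd's usage is the textual equivalent of repainting a getaway car in the most common color on the road.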
Anonymouth—a tool that is under development as of this writing—is a
step toward implementing this approach by producing statistically bland prose that can be obfuscated within the corpus of similar writing.20 Think of the car provided to the getaway driver in the 2011 movie Drive: a silver late-model Chevrolet Impala, the most popular car in California, about which the mechanic promises “No one will be looking at you.”21 Ingenious as this may be, we
wonder about a future in which political manifestos and critical documents strive for great rhetorical and stylistic banality and we lose the next Thomas Paine’s equivalent to “These are the times that try men’s souls.”
2.10 Code obfuscation: baffling humans but not machines
In the field of computer programming, the term “obfuscated code” has two
related but distinct meanings. The first is “obfuscation as a means of protection”—that is, making the code harder for human readers (or the various
forms of “disassembly algorithms,” which help explicate code that has been compiled for use) to interpret for purposes of copying, modification, or compromise. (A classic example of such reverse engineering goes as follows: Microsoft sends out a patch to update Windows computers for security purposes; bad actors get the patch and look at the code to figure out what vulnerability the patch is meant to address; they then devise an attack exploiting that vulnerability, hitting machines that have not yet applied the patch.) The second meaning of “obfuscated code”
refers to a form of art: writing code that is fiendishly complex for a human to untangle but which ultimately performs a mundane computational task that is easily processed by a computer.
Simply put, a program that has been obfuscated will have the same functionality it had before, but will be more difficult for a human to analyze. Such a program exhibits two characteristics of obfuscation as a category and a
concept. First, it operates under constraint—you obfuscate because people
will be able to see your code, and the goals of obfuscation-as-protection are to decrease the efficiency of the analysis (“at least doubling the time needed,”
as experimental research has found), to reduce the gap between novices and skilled analysts, and to give systems that (for whatever reason) are easier to attack threat profiles closer to those of systems that are more difficult to attack.22 Second, an obfuscated program’s code uses strategies that are familiar from other forms of obfuscation: adding significant-seeming gibberish; having extra variables that must be accounted for; using arbitrary or deliberately confusing names for things within the code; including within the code deliberately confusing directions (essentially, “go to line x and do y”) that lead to dead ends or wild goose chases; and various forms of scrambling. In its protective mode, code obfuscation is a time-buying approach to thwarting
analysis—a speed bump. (Recently there have been advances that significantly increase the difficulty of de-obfuscation and the amount of time it requires; we will discuss them below.)
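The flavor of protective obfuscation can be shown in miniature. The pair below (our own example) computes the same result two ways; the second version uses misleading names, a decoy accumulator of significant-seeming gibberish, and a dead branch that leads the analyst nowhere.

```python
def total_price(prices, tax_rate):
    """The readable version: sum the prices and apply tax."""
    return sum(prices) * (1 + tax_rate)

def O0o(l1, Il):
    """The same computation, obfuscated by hand: misleading names,
    a decoy variable, and a dead branch to chase."""
    O = 0
    o = len(l1) * 0            # decoy: always 0
    for I in l1:
        O += I
        o ^= id(I) & 0         # significant-seeming gibberish: still 0
    if o:                      # dead end: this branch can never run
        return -O
    return O * (1 + Il)

# Behavior is identical; only the human analyst suffers.
print(total_price([10, 20], 0.1) == O0o([10, 20], 0.1))  # True
```

An automated obfuscator applies transformations like these systematically, and at a scale where the reader cannot simply see through them.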
In its artistic, aesthetic form, code obfuscation is in the vanguard of counterintuitive, puzzling methods of accomplishing goals. Nick Montfort has described these practices in considerable detail.23 For example, because of how the programming language C interprets the names of variables, a programmer can muddle human analysis but not machine execution by writing code that includes the letters o and O in contexts that trick the eye by resembling zeroes. Some of these forms of obfuscation lie a little outside our working definition of “obfuscation,” but they are useful for illustrating an approach to the fundamental problem of obfuscation: how to transform something that is open to scrutiny into something ambiguous, full of false leads, mistaken identities, and unmet expectations.
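Montfort’s examples concern C, but the visual trick survives in any language whose identifiers may contain o and O; here is a Python rendering of the same idea (ours, not one of Montfort’s).

```python
# Names built from O and 0 read like arithmetic on zeros but are ordinary
# variables; the machine is never confused, only the reader.
O0 = 8
OO = 0
Oo = O0 + OO       # looks like 0 + 0; actually 8 + 0
O = Oo * 10 - O0   # 80 - 8
print(O)           # 72
```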
Code obfuscation, like stylometry, can be analyzed, tested, and optimized
with precision. Its functionality is expanding from the limited scope of buying time and making the task of unraveling code more difficult to something closer to achieving complete opacity. A recent publication by Sanjam Garg and colleagues has moved code obfuscation from a “speed bump” to an “iron wall.”
A Multilinear Jigsaw Puzzle can break code apart so that it “fits together” like pieces of a puzzle. Although many arrangements are possible, only one
arrangement is correct and represents the actual operation of the code.24 A programmer can create a clean, clear, human-readable program and then run
it through an obfuscator to produce something incomprehensible that can
withstand scrutiny for a much longer time than before.
Code obfuscation—a lively, rich area for the exploration of obfuscation in general—seems to be progressing toward systems that are relatively easy to use and enormously difficult to defeat. This is even applicable to hardware: Jeyavijayan Rajendran and colleagues are utilizing components within circuits to create “logic obfuscation” in order to prevent reverse engineering of the functionality of a chip.25
2.11 Personal disinformation: strategies for individual
disappearance
Disappearance specialists have much to teach would-be obfuscators. Many of these specialists are private detectives or “skip tracers”—professionals in the business of finding fugitives and debtors—who reverse engineer their own