The Idealists
In 2007, PACER announced that, for a limited time, it would offer completely free access to its database.51 (At the time, PACER charged only eight cents a page, not ten.) The free access was available only at sixteen federal depository libraries across the United States—“that’s one library every twenty-two thousand square miles, I believe,” Malamud quipped52—and researchers would have to physically visit these libraries to take advantage of the offer. But once you were logged in to one of those computers, you could download as much data as you wanted.
Malamud posted a message on his website announcing what he dubbed the Pacer Recycling Project and soliciting volunteers for an informal “thumb drive corps.” Malamud enjoined members of the corps to visit the depository libraries, download PACER records to portable flash drives, and then “recycle” that material by uploading it to resource.org, where it would live in perpetuity as a free alternative to PACER. “Is this legal?” Malamud asked, before answering his own rhetorical question. “You betcha! These are public documents.”53 Theoretically, if enough people answered Malamud’s call, they could siphon the entire PACER database.
Malamud didn’t expect that to happen, though.54 The “thumb drive corps” was more a symbolic initiative than anything else; an example of all that could be done if people only exerted the will. On September 4, 2008, he got an e-mail from someone who did. “Do you have any guidelines for the thumb drive core [sic]?” inquired Aaron Swartz. “Any particular things we should be sure to capture with the pacer docs?”55
* * *
AARON Swartz had long admired Carl Malamud. In 2002, when he was fifteen, Swartz spotlighted Malamud on his blog as “today’s featured superhero,” lauding him as “an unstoppable technical and social hacker.”56 That same year, he asked Malamud to grant him the web.resource.org subdomain, which Swartz planned to use to host “information that’s useful to the Web community.”57 “It’s a mutual admiration society,” responded Malamud. “Of course, we’d be more than happy to delegate web to you!”58
During the year prior to their PACER collaboration, Swartz had reestablished contact with Malamud, soliciting his advice and assistance with watchdog.net and a slew of other bulk-downloading projects. Swartz had developed several methods of acquiring large data sets. Sometimes he’d purchase them. Sometimes he’d request them directly from government agencies under the Freedom of Information Act. Sometimes, his tactics were less direct.
The free trial of PACER was made available to on-site patrons of the various chosen libraries—the Alaska State Court Law Library, the Sacramento County Public Law Library, the Portland (Maine) Public Library, and thirteen others59—but Swartz figured that, rather than sit at a library terminal all day, it would be simpler to deploy a computer program that would download the PACER data remotely and automatically. This method of acquiring databases was quick and easy—especially for Swartz, who so disliked having to ask other people for help—but it also had the potential to greatly annoy the database providers.
In a blog post published in January 2013, the librarian Eric Hellman recalled how, upon meeting Swartz, he took him to task about “how some of his mass-downloading was getting people really upset and could have negative consequences for the things he was trying to accomplish. If he would just ask, I told him, he could have an account for [a database tool] that DIDN’T crash to smithereens when asked for millions of records. And people were working really hard to make the information he wanted free, it just needed some years to make sure the machinery wouldn’t collapse. Aaron sounded embarrassed.”60
Embarrassed though he may have been, Swartz had no intention of changing his ways. This attitude complicated his collaboration with Carl Malamud. Throughout his career as a data liberation activist, Malamud had always taken care to work strictly within the bounds of the law, both as a means of self-preservation and as a way of underscoring a broader point: public data, by law, belonged to the public, and there was nothing illegal about making it public.
This attitude was sensible. Taking care to comply with every federal database’s terms of use often proved time-consuming and inefficient. But by doing so, the downloader retained the moral high ground. Shortcuts cast a shadow on conduct and increased the likelihood of governmental scrutiny and suspicion.
The Princeton professor Stephen Schultze had written a simple computer script that was designed to crawl and download PACER; Swartz had helped code the program, and was now itching to deploy it. But the terms of the PACER access initiative did not explicitly authorize siphoning the database remotely, and this made Malamud nervous. “do you have your library’s permission/tacit agreement to drain pacer?” he asked.61 “no,” Swartz replied.62 “sigh. this is not how we do things. :),” Malamud e-mailed Swartz on September 4, 2008. “we don’t cut corners. we belly up to the bar and get permission.”63 If Swartz wanted to collaborate with Malamud, he would have to play by the rules.
Swartz gave his assent and then, without telling Malamud, ran the program remotely anyway. He persuaded a friend in California to visit the library in Sacramento and surreptitiously download an “authentication cookie”—a digital keycard, basically—that Swartz could use from home to fool PACER into thinking he was at the Sacramento library. In Massachusetts, Swartz ran the program, and then sat back and watched the files roll in. “We’re going to have fun with this,” Malamud told Swartz in late September, after Swartz had estimated that he would be able to capture approximately four terabytes worth of PACER records.64 “awesome. :-),” Swartz replied.65
Around the same time, on September 20, 2008, Swartz revisited the “Guerilla Open Access Manifesto” in a blog post promoting the launch of a website called guerillaopenaccess.com. “I realized that the Open Access movement simply wasn’t enough—even if we got all journals going forward to be open, the whole history of scientific knowledge would be locked up,” he explained. “Talking with others at the [EIFL] meeting, I realized what must be done. If we couldn’t get free access to this knowledge, folks would have to take it.”66
He reiterated those points that morning at a Free Software Foundation–sponsored event called Software Freedom Day, where he delivered an “eye-opening” keynote address on public records and Guerilla Open Access. “[Swartz has] been using free software to make government records and other public domain material easily available and searchable by the public,” the Free Software Foundation’s blog reported. “He implored us to all call him, if we want to help.”67 As Software Freedom Day neared its end, Richard Stallman delighted attendees with a surprise appearance. Wearing a plaid shirt and khakis, he spoke briefly on the history of the GNU Project and the future of free software and society. “He exhorted us to think of the GNU project’s 25 year history as a foundation for the work to come,” the FSF blog noted, “and encouraged people to keep pushing for a completely free system.”
A week after Software Freedom Day, the government noticed the unusually high number of downloads purportedly originating from the Sacramento County Public Law Library and severed Swartz’s access to the PACER database. Swartz’s script had taxed the database, eventually crashing it. But Swartz didn’t realize that at the time; all he knew was that his log-in attempts now elicited an “Access Denied” message.68 When Malamud learned that Swartz had been running his crawler remotely despite instructions to the contrary, he told Swartz, “You definitely went over the line, even after I specifically told you I didn’t want that to happen on my resources.”69 Then, worse came to worst: fearing a security breach, PACER suspended the trial-access program entirely.70
Swartz had downloaded almost 20 million pages from PACER, which constituted about 20 percent of the entire database. Malamud feared that Swartz’s actions might spark some sort of investigation, maybe even a legal action.71 Automatically downloading PACER wasn’t illegal, as far as they believed, but it was certainly unusual, and as Malamud knew well, federal agencies tended to be suspicious of the unusual. “If they want to go after you, I’ll shield you as long as I can, but at the end of the day, we’ll simply agree that you did what you did,” Malamud wrote to Swartz on September 30.72 “There was not an explicit rule against what you did. It was pretty stupid, I think, but the motive was good.”
A few months passed, and no consequences seemed to be forthcoming. In a New York Times article about Swartz and Malamud and the PACER incident, John Schwartz reported that a US Courts spokeswoman was unable to comment on “whether there had been a criminal investigation into the mass download.”73 But the FBI had indeed been busy investigating the affair. In a report dated February 6, 2009, the Washington field office of the FBI noted that, thanks to Swartz’s actions, “the PACER system was being inundated with requests. One request was being made every three seconds.”74 Wondering exactly what Swartz and Malamud had been up to, the agency initiated an “information gathering phase.”
The file that the FBI opened on Swartz contained a précis of his recent activities. It noted his involvement with watchdog.net, and his ambitions of “pulling all information about politics, votes, lobbying records, and campaign finance reports under one unified interface.” Swartz’s personal website, the FBI observed, “includes a section titled ‘Aaron Swartz: a lifetime of dubious accomplishments.’ ” The FBI agents reported that Carl Malamud had “published an online manifesto about freeing PACER documents,” and that the exploits of Malamud, Swartz, and the thumb drive corps had been covered in the Times under the headline “Steal These Federal Records—Okay, Not Literally.”75
Each of these items was innocuous when taken on its own. Taken together, from the government’s perspective, they seemed to indicate some sort of overarching nefarious scheme to do . . . something. In February, the FBI sent a car to surveil Swartz’s parents’ house in Highland Park. On April 14, 2009, an agent called Highland Park hoping to talk with Swartz in person. Swartz wasn’t at home, but the FBI agent spoke with his mother, who was spooked enough to send Carl Malamud a frantic e-mail and Twitter message informing him what had happened. (“tell your mother that twitter is *not* the right way to reach me on this stuff :),” Malamud told Swartz.)76
Swartz eventually returned the call on his cell phone. “I’m sure you can guess what this is about. PACER,” said Special Agent Kristina Honeycutt, in Swartz’s telling. “We’re interested in sitting down and talking to you about it, more so to just find out exactly what happened, so we can help the US Courts get their system back up.” Honeycutt asked if Swartz would be willing to meet at some point soon for a face-to-face conversation. “If it was something bigger than that,” she said pointedly, “we wouldn’t have called you to ask.”77
“you shouldn’t worry about me. I’m happy to take the fallout if it comes,” Swartz wrote to Malamud, aware that Malamud was at the time vying for a job at the Government Printing Office.78 “understood, but I’m not going to let them hang you out to dry,” Malamud responded.79 The situation never came to that. Swartz’s lawyer eventually called the FBI and said that Swartz would agree to meet only if the agency could guarantee that doing so would not work to his detriment. The FBI couldn’t make that promise, so Swartz never met with them. The investigation was eventually closed on April 20, 2009. Later, Swartz requested his FBI file and posted the contents online.80 Though Swartz feigned bravado in his blog post about his FBI file, calling the document “truly delightful,” he had been scared witless at the time it was being compiled.
Swartz had spent two years downloading and uploading various data sets in a flurry of shotgun activism, spreading his shot wide, not caring particularly about which target he hit. When he had first proposed joining the thumb drive corps, Malamud had advised him not to let PACER distract him from more potentially significant projects such as watchdog.net. Swartz hadn’t listened, and now PACER had gone sour and watchdog.net had gone stale. “There seems to be an impression or at least a worry that you’ve simply dropped watchdog on the floor and the whole thing could be finished very soon,” Malamud wrote to Swartz in January 2009. “What’s up?”81 Swartz acknowledged that there would soon be some “personnel reshuffling” at watchdog.net, and Malamud gently chided him for failing to keep his supporters apprised of the changes. “communication is, by far, the hardest part of leadership,” Malamud wrote.82 “no, you’re right. i have not been doing a particularly good job keeping up with people lately,” Swartz responded.83 “not a problem, just self-correct,” said Malamud. “screwing up is fine, not realizing it is not. :))”84
But far from convincing Swartz to curb his ambitions and proceed with more caution, the PACER experience, if anything, drove him deeper into the caves. Swartz’s guerillaopenaccess.com website linked to the website of a group called the Content Liberation Front, self-described “guerillas of the open access movement.” The Content Liberation Front’s website was a simple list of projects, the first of which was the acquisition of expired journals.
“Many online journal sites, like JSTOR, even charge for articles which have entered the public domain,” the site said. “If you have copies of such articles, please upload them to archive.org and let us know.” But uploading public-domain articles was only the first step: “If you have a bit more skills or time, we suggest liberating entire journal archives from these sites and uploading them to file sharing networks. If anyone does so, let us know, we’ll post about it here.”85
The site urged visitors to send hard copies of databases to its mailing address:
The Content Liberation Front
c/o Aaron Swartz
950 Massachusetts Ave., #320
Cambridge, MA 02139
USA
That was Swartz’s apartment, between Harvard Square and Central Square, just down the road from the Massachusetts Institute of Technology.
8
HACKS AND HACKERS
JSTOR, which stands for “journal storage,” is an online database of academic journal articles that was conceived in 1993 and launched in January 1997. With complete archival runs of scholarly journals in many academic disciplines available to institutional subscribers in an instant, JSTOR, in many ways, could be considered the incarnate dream of the infinite library. Yet it also exemplifies the failure of that dream to properly materialize.
In his comprehensive history JSTOR, Roger Schonfeld recounted the genesis of the service. Academic librarians, facing the serials pricing crisis, as well as perennial budget and storage constraints, had long discussed the notion of a “central lending library for periodicals” that would obviate the need for individual libraries to archive and subscribe to these journals themselves.1 In 1993, an enterprising fellow named William G. Bowen, the president of the Andrew W. Mellon Foundation and an economist who had studied nonprofit structures, first floated the idea for a digital version of this central lending library. Bowen may have been a dreamer, but he was an extremely well-connected one. Within a year, plans were under way to turn his chimerical concept into JSTOR, which would serve both as a potential solution to the serials crisis and as a “real demonstration that large-scale digital libraries were feasible.”2
But making JSTOR a reality proved challenging to Bowen and his colleagues. JSTOR officials initially struggled to convince some academic publishers to participate in the project. Certain publishers worried that signing on to JSTOR in the present might preclude them from monetizing their digital backfiles in the future. Other entities, mostly learned societies that published journals in their respective fields, feared that scholars would cancel their learned-society memberships once they could access these journals online from a central database.3
To assuage these concerns and others, JSTOR took care to reassure publishers that their participation would not diminish the value of their content. JSTOR would assume all digitization and archival maintenance costs. The service would not be premised on exclusivity, so publishers would be free to make their backfiles available elsewhere, too. Moreover, as Schonfeld noted, JSTOR guaranteed publishers “that participation in the project would bring them no harm—in terms of lost revenues or other concerns.”4
The reassurances worked. JSTOR launched in 1997 and has expanded ever since, to the delight of the many students and scholars who have come to rely on its vast digital archives. But soon enough, observed Schonfeld, “JSTOR began to behave like a business, with proprietary rights that required protection.”5 And when those rights were threatened, JSTOR did not hesitate to act in self-defense.
* * *
IN the early evening of September 25, 2010, a JSTOR employee noticed something strange. The JSTOR website was sluggish: tasks were accumulating and going uncompleted, Web forms weren’t loading. At 6:48 p.m., the staffer reported the problem in an e-mail to colleagues with the subject line “website sad.”6 Nobody likes a sad website, especially not those people tasked with keeping it happy. The JSTOR tech team examined the problem and three minutes later identified its cause: someone was bombarding the JSTOR servers with download requests. Hundreds per minute. And those requests were unraveling the system.
The user in question was clearly using a computer program to initiate download sessions in rapid succession and acquire articles from JSTOR’s database in the process called scraping.7 These actions violated JSTOR’s terms of service—and, of more immediate concern to the employees on duty that Saturday night, they threatened the stability of JSTOR servers in Ann Arbor, Michigan.8 Soon, other JSTOR staffers chimed in. “Any chance the offending scraper has an IP from the Portland area?” one asked. “We had a tool from Portland State University apologize and admit he was using 3+ PCs to mass download after they went to his house and punched him in the face (if only).”9 (Note: These e-mails are taken from a store of internal e-mails voluntarily released by JSTOR, from which all names have been redacted. When I use real names, it means that I have been able to independently verify the writer’s identity.)