Algorithms to Live By

by Brian Christian


  Here’s why: if temporal locality holds, then the rough-sorting shelves contain the most important books in the whole building. These are the books that were most recently used, so they are the ones that patrons are most likely to be looking for. It seems a crime that arguably the juiciest and most browseworthy shelf of the library’s miles of stacks is both hidden away and constantly eroded by earnest library staff just doing their jobs.

  Meanwhile, the lobby of the Moffitt Undergraduate Library—the location of the most prominent and accessible shelves—showcases the library’s most recently acquired books. This instantiates a kind of FIFO cache, privileging the items that were last added to the library, not last read.

  The dominant performance of the LRU algorithm in most tests that computer scientists have thrown at it leads to a simple suggestion: turn the library inside out. Put acquisitions in the back, for those who want to find them. And put the most recently returned items in the lobby, where they are ripe for the browsing.
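  The contrast between the two eviction policies is easy to make concrete. Here is a toy Python simulation (not from the book; the request sequence is invented) that counts cache misses under FIFO and LRU eviction for the same stream of requests:

```python
from collections import OrderedDict

def simulate(policy, capacity, requests):
    """Count cache misses under 'lru' or 'fifo' eviction."""
    cache = OrderedDict()
    misses = 0
    for key in requests:
        if key in cache:
            if policy == "lru":
                cache.move_to_end(key)      # a hit refreshes recency
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict from the front of the queue
            cache[key] = True
    return misses

requests = ["A", "B", "A", "C", "A", "D", "A"]  # book "A" is in heavy rotation
print(simulate("lru", 2, requests))   # 4 misses: "A" always survives
print(simulate("fifo", 2, requests))  # 5 misses: "A" is evicted despite heavy use
```

  The only difference between the two policies is that single `move_to_end` call: under LRU, every use of "A" pushes it back to the safe end of the queue, while under FIFO its age since arrival is all that matters.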

  Humans are social creatures, and presumably the undergraduate body would find it interesting to peruse its own reading habits. It would nudge the campus toward a more organic and free-form version of what colleges strive for when they assign “common books”: the facilitation of intellectual common points of reference. Here, the books being read on campus, whatever they happened to be, would become the books most likely to be serendipitously encountered by other students. A kind of grassroots, bottom-up analogue of the common book program.

  But a system like this wouldn’t only be more socially positive. Since the items most recently returned are the ones most likely to be next checked out, it would also be more efficient. It’s true that students might be puzzled by the fact that popular books will sometimes be found in the stacks and sometimes in the lobby. However, recently returned books that await reshelving are missing from the stacks either way. It’s just that currently they are off-limits during this brief limbo. Allowing the returned books to adorn the lobby instead would give students a chance to short-circuit the shelving process entirely. No employees would have to venture into the stacks to deposit the volumes, and no students would have to venture into the stacks to get them back out. That’s exactly how caching is meant to work.

  The Cloud at the End of the Street

  “We actually made a map of the country, on the scale of a mile to the mile!”

  “Have you used it much?” I enquired.

  “It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

  —LEWIS CARROLL

  We often think of the Internet as a flat, independent, and loosely connected network. In fact, it’s none of those things. A quarter of all Internet traffic at present is handled by a single corporation, one that manages to stay almost entirely out of the headlines. This Massachusetts-based company is called Akamai, and they’re in the caching business.

  We also think of the Internet as abstract, dematerial, post-geographic. We’re told our data is “in the cloud,” which is meant to suggest a diffuse, distant place. Again, none of these are true. The reality is that the Internet is all about bundles of physical wires and racks of metal. And it’s much more closely tied to geography than you might expect.

  Engineers think about geography on a tiny scale when they design computer hardware: faster memory is usually placed closer to the processor, minimizing the length of the wires that information has to travel along. Today’s processor cycles are measured in gigahertz, which is to say they are performing operations in fractions of nanoseconds. For reference, that’s the time it takes light to travel a few inches—so the physical layout of a computer’s internals is very much a concern. And applying the same principle at a dramatically larger scale, actual geography turns out to be critical for the functioning of the web, where the wires span not inches but potentially thousands of miles.

  If you can create a cache of webpage content that is physically, geographically closer to the people who want it, you can serve those pages faster. Much of the traffic on the Internet is now handled by “content distribution networks” (CDNs), which have computers around the world that maintain copies of popular websites. This allows users requesting those pages to get their data from a computer that’s nearby, without having to make the long haul across continents to the original server.

  The largest of these CDNs is managed by Akamai: content providers pay for their websites to be “Akamaized” for better performance. An Australian who streams video from the BBC, for instance, is probably hitting local Akamai servers in Sydney; the request never makes it to London at all. It doesn’t have to. Says Akamai’s chief architect, Stephen Ludin, “It’s our belief—and we build the company around the fact—that distance matters.”

  In our earlier discussion, we noted that certain types of computer memory have faster performance but cost more per unit of storage, leading to a “memory hierarchy” that tries to get the best of both. But it’s not actually necessary to have memory made of different materials for caching to make sense. Caching is just as useful when it’s proximity, rather than performance, that’s the scarce resource.

  This fundamental insight—that in-demand files should be stored near the location where they are used—also translates into purely physical environments. For example, Amazon’s enormous fulfillment centers generally eschew any type of human-comprehensible organization, of the kind you’d find in a library or a department store. Instead, employees are told to place incoming items wherever they can find space in the warehouse—batteries cheek by jowl with pencil sharpeners, diapers, barbecue grills, and learn-the-dobro DVDs—and tag the location of each item in a central database using bar codes. But this deliberately disorganized-looking storage system still has one visible exception: high-demand items are placed in a different area, more quickly accessible than the rest. That area is Amazon’s cache.

  Recently, Amazon was granted a patent for an innovation that pushes this principle one step further. The patent talks about “anticipatory package shipping,” which the press seized upon as though Amazon could somehow mail you something before you bought it. Amazon, like any technology company, would love to have that kind of Bélády-like clairvoyance—but for the next best thing, it turns to caching. Their patent is actually for shipping items that have been recently popular in a given region to a staging warehouse in that region—like having their own CDN for physical goods. Then, when somebody places an order, the item is just down the street. Anticipating the purchases of individuals is challenging, but when predicting the purchases of a few thousand people, the law of large numbers kicks in. Somebody in Berkeley is going to order, say, recycled toilet paper in a given day, and when they do it’s already most of the way there.
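  The law-of-large-numbers point can be illustrated with a toy simulation (every number here is invented for illustration): any one shopper is nearly unpredictable, yet the aggregate daily demand is remarkably stable.

```python
import random
random.seed(0)  # deterministic for illustration

# Invented parameters: 50,000 shoppers, each with an independent
# 0.1% chance of ordering a given item on a given day.
shoppers, p = 50_000, 0.001
daily_orders = [sum(random.random() < p for _ in range(shoppers))
                for _ in range(7)]

# Any single shopper's behavior is a coin flip, but the daily total
# clusters tightly around the expected 50 orders.
print(daily_orders)
```

  Predicting which shopper will order is hopeless; predicting roughly how many will is easy, and that is all a regional staging warehouse needs.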

  When the things popular in an area are also from that area, an even more interesting geography of the cloud emerges. In 2011, film critic Micah Mertes created a map of the United States using each state’s “Local Favorites” from Netflix—highlighting the movies uncommonly popular in each of those states. Overwhelmingly, it turned out, people love watching movies set where they live. Washingtonians favor Singles, set in Seattle; Louisianans watch The Big Easy, set in New Orleans; Angelenos unsurprisingly enjoy L.A. Story; Alaskans love Braving Alaska; and Montanans, Montana Sky.* And because nothing benefits quite so much from local caching as the enormous files that comprise full-length HD video, it’s certain that Netflix has arranged it so the files for, say, L.A. Story live right in Los Angeles, just like its characters—and, more importantly, its fans.

  Caching on the Home Front

  While caching began as a scheme for organizing digital information inside computers, it’s clear that it is just as applicable to organizing physical objects in human environments. When we spoke to John Hennessy—president of Stanford University, and a pioneering computer architect who helped develop modern caching systems—he immediately saw the link:

  Caching is such an obvious thing because we do it all the time. I mean, the amount of information I get … certain things I have to keep track of right now, a bunch of things I have on my desk, and then other things are filed away, and then eventually filed away into the university archives system where it takes a whole day to get stuff out of it if I wanted. But we use that technique all the time to try to organize our lives.

  The direct parallel between these problems means that there’s the potential to consciously apply the solutions from computer science to the home.

  First, when you are deciding what to keep and what to throw away, LRU is potentially a good principle to use—much better than FIFO. You shouldn’t necessarily toss that T-shirt from college if you still wear it every now and then. But the plaid pants you haven’t worn in ages? Those can be somebody else’s thrift-store bonanza.

  Second, exploit geography. Make sure things are in whatever cache is closest to the place where they’re typically used. This isn’t a concrete recommendation in most home-organization books, but it consistently turns up in the schemes that actual people describe as working well for them. “I keep running and exercise gear in a crate on the floor of my front coat closet,” says one person quoted in Julie Morgenstern’s Organizing from the Inside Out, for instance. “I like having it close to the front door.”

  A slightly more extreme example appears in the book Keeping Found Things Found, by William Jones:

  A doctor told me about her approach to keeping things. “My kids think I’m whacky, but I put things where I think I’ll need them again later, even if it doesn’t make much sense.” As an example of her system, she told me that she keeps extra vacuum cleaner bags behind the couch in the living room. Behind the couch in the living room? Does that make any sense?… It turns out that when the vacuum cleaner is used, it is usually used for the carpet in the living room.… When a vacuum cleaner bag gets full and a new one is needed, it’s usually in the living room. And that’s just where the vacuum cleaner bags are.

  A final insight, which hasn’t yet made it into guides on closet organization, is that of the multi-level memory hierarchy. Having a cache is efficient, but having multiple levels of caches—from smallest and fastest to largest and slowest—can be even better. Where your belongings are concerned, your closet is one cache level, your basement another, and a self-storage locker a third. (These are in decreasing order of access speed, of course, so you should use the LRU principle as the basis for deciding what gets evicted from each level to the next.) But you might also be able to speed things up by adding yet another level of caching: an even smaller, faster, closer one than your closet.
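  As a sketch of how such cascading evictions might work, here is a toy multi-level cache in Python (the capacities and household items are invented; a real object can only be in one place, so promoting an item pulls it out of the lower levels):

```python
from collections import OrderedDict

class CacheLevel:
    """One level of a storage hierarchy; LRU evictions cascade downward."""
    def __init__(self, capacity, next_level=None):
        self.capacity, self.next_level = capacity, next_level
        self.items = OrderedDict()

    def put(self, item):
        # A physical object lives in one place: remove it from lower levels.
        level = self.next_level
        while level is not None:
            level.items.pop(item, None)
            level = level.next_level
        self.items[item] = True
        self.items.move_to_end(item)           # mark as most recently used
        if len(self.items) > self.capacity:
            evicted, _ = self.items.popitem(last=False)  # least recently used
            if self.next_level is not None:
                self.next_level.put(evicted)   # demote rather than discard

locker = CacheLevel(capacity=100)                   # slowest: self-storage
basement = CacheLevel(capacity=4, next_level=locker)
closet = CacheLevel(capacity=2, next_level=basement)

for thing in ["coat", "boots", "tent", "coat", "skis"]:
    closet.put(thing)

print(list(closet.items))    # ['coat', 'skis'] -- most recently used
print(list(basement.items))  # ['boots', 'tent'] -- demoted one level down
```

  Note that the coat, having been used again, ends up back in the closet, while the boots and tent drift one level down—exactly the LRU-driven demotion the hierarchy calls for.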

  Tom’s otherwise extremely tolerant wife objects to a pile of clothes next to the bed, despite his insistence that it’s in fact a highly efficient caching scheme. Fortunately, our conversations with computer scientists revealed a solution to this problem too. Rik Belew of UC San Diego, who studies search engines from a cognitive perspective, recommended the use of a valet stand. Though you don’t see too many of them these days, a valet stand is essentially a one-outfit closet, a compound hanger for jacket, tie, and slacks—the perfect piece of hardware for your domestic caching needs. Which just goes to show that computer scientists won’t only save you time; they might also save your marriage.

  Filing and Piling

  After deciding what to keep and where it should go, the final challenge is knowing how to organize it. We’ve talked about what goes in the closet and where the closet should be, but how should things be arranged inside?

  One of the constants across all pieces of home-organization advice we’ve seen so far is the idea of grouping “like with like”—and perhaps no one so directly flies in the face of that advice as Yukio Noguchi. “I have to emphasize,” says Noguchi, “that a very fundamental principle in my method is not to group files according to content.” Noguchi is an economist at the University of Tokyo, and the author of a series of books that offer “super” tricks for sorting out your office and your life. Their titles translate roughly to Super Persuasion Method, Super Work Method, Super Study Method—and, most relevantly for us, Super Organized Method.

  Early in his career as an economist, Noguchi found himself constantly inundated with information—correspondence, data, manuscripts—and losing a significant portion of each day just trying to organize it all. So he looked for an alternative. He began by simply putting each document into a file labeled with the document’s title and date, and putting all the files into one big box. That saved time—he didn’t have to think about the right place to put each document—but it didn’t result in any form of organization. Then, sometime in the early 1990s, he had a breakthrough: he started to insert the files exclusively at the left-hand side of the box. And thus the “super” filing system was born.

  The left-side insertion rule, Noguchi specifies, has to be followed for old files as well as new ones: every time you pull out a file to use its contents, you must put it back as the leftmost file when you return it to the box. And when you search for a file, you always start from the left-hand side as well. The most recently accessed files are thus the fastest to find.

  This practice began, Noguchi explains, because returning every file to the left side was just easier than trying to reinsert it at the same spot it came from. Only gradually did he realize that this procedure was not only simple but also startlingly efficient.

  The Noguchi Filing System clearly saves time when you’re replacing something after you’re done using it. There’s still the question, however, of whether it’s a good way to find the files you need in the first place. After all, it certainly goes against the recommendations of other efficiency gurus, who tell us that we should put similar things together. Indeed, even the etymology of the word “organized” evokes a body composed of organs—which are nothing if not cells grouped “like with like,” marshalled together by similar form and function.

  But computer science gives us something that most efficiency gurus don’t: guarantees.

  Though Noguchi didn’t know it at the time, his filing system represents an extension of the LRU principle. LRU tells us that when we add something to our cache we should discard the oldest item—but it doesn’t tell us where we should put the new item. The answer to that question comes from a line of research carried out by computer scientists in the 1970s and ’80s. Their version of the problem is called “self-organizing lists,” and its setup almost exactly mimics Noguchi’s filing dilemma. Imagine that you have a set of items in a sequence, and you must periodically search through them to find specific items. The search itself is constrained to be linear—you must look through the items one by one, starting at the beginning—but once you find the item you’re looking for, you can put it back anywhere in the sequence. Where should you replace the items to make searching as efficient as possible?

  The definitive paper on self-organizing lists, published by Daniel Sleator and Robert Tarjan in 1985, examined (in classic computer science fashion) the worst-case performance of various ways to organize the list given all possible sequences of requests. Intuitively, since the search starts at the front, you want to arrange the sequence so that the items most likely to be searched for appear there. But which items will those be? We’re back to wishing for clairvoyance again. “If you know the sequence ahead of time,” says Tarjan, who splits his time between Princeton and Silicon Valley, “you can customize the data structure to minimize the total time for the entire sequence. That’s the optimum offline algorithm: God’s algorithm if you will, or the algorithm in the sky. Of course, nobody knows the future, so the question is, if you don’t know the future, how close can you come to this optimum algorithm in the sky?” Sleator and Tarjan’s results showed that some “very simple self-adjusting schemes, amazingly, come within a constant factor” of clairvoyance. Namely, if you follow the LRU principle—where you simply always put an item back at the very front of the list—then the total amount of time you spend searching will never be more than twice as long as if you’d known the future. That’s not a guarantee any other algorithm can make.
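  The move-to-front scheme behind this guarantee is simple to simulate. Here is a toy Python sketch (file names invented) comparing the total cost of linear searches with and without Noguchi’s left-side rule:

```python
def search_cost(files, requests, move_to_front):
    """Total comparisons spent on linear searches over the sequence."""
    seq = list(files)
    total = 0
    for item in requests:
        i = seq.index(item)                # scan from the left, one by one
        total += i + 1                     # comparisons needed to find it
        if move_to_front:
            seq.insert(0, seq.pop(i))      # return the file to the far left
    return total

files = ["taxes", "letters", "drafts", "data"]
requests = ["data", "data", "drafts", "data", "data"]
print(search_cost(files, requests, move_to_front=True))   # 12 comparisons
print(search_cost(files, requests, move_to_front=False))  # 19 comparisons
```

  The first search is expensive either way, but once a popular file has migrated to the front, every subsequent search for it is nearly free—which is exactly why the scheme tracks the request pattern without ever needing to predict it.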

  Recognizing the Noguchi Filing System as an instance of the LRU principle in action tells us that it is not merely efficient. It’s actually optimal.

  Sleator and Tarjan’s results also provide us with one further twist, and we get it by turning the Noguchi Filing System on its side. Quite simply, a box of files on its side becomes a pile. And it’s the very nature of piles that you search them from top to bottom, and that each time you pull out a document it goes back not where you found it, but on top.*

  In short, the mathematics of self-organizing lists suggests something radical: the big pile of papers on your desk, far from being a guilt-inducing fester of chaos, is actually one of the most well-designed and efficient structures available. What might appear to others to be an unorganized mess is, in fact, a self-organizing mess. Tossing things back on the top of the pile is the very best you can do, shy of knowing the future. In the previous chapter we examined cases where leaving something unsorted was more efficient than taking the time to sort everything; here, however, there’s a very different reason why you don’t need to organize it.

  You already have.

  The Forgetting Curve

  Of course, no discussion of memory could be complete without mention of the “memory organ” closest to home: the human brain. Over the past few decades, the influence of computer science has brought about something of a revolution in how psychologists think about memory.

  The science of human memory is said to have begun in 1879, with a young psychologist at the University of Berlin named Hermann Ebbinghaus. Ebbinghaus wanted to get to the bottom of how human memory worked, and to show that it was possible to study the mind with all the mathematical rigor of the physical sciences. So he began to experiment on himself.

 
