
Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers


by John MacCormick


  Slowly but surely, however, AI has been chipping away at the collection of thought processes that might be defined as uniquely human. For years, many believed that the intuition and insight of human chess champions would beat any computer program, which must necessarily rely on a deterministic set of rules rather than intuition. Yet this apparent stumbling block for AI was convincingly eradicated in 1997, when IBM's Deep Blue computer beat world champion Garry Kasparov.

  Meanwhile, the success stories of AI were gradually creeping into the lives of ordinary people too. Automated telephone systems, servicing customers through speech recognition, became the norm. Computer-controlled opponents in video games began to exhibit human-like strategies, even including personality traits and foibles. Online services such as Amazon and Netflix began to recommend items based on automatically inferred individual preferences, often with surprisingly pleasing results.

  Indeed, our very perceptions of these tasks have been fundamentally altered by the progress of artificial intelligence. Consider a task that, in 1990, indisputably required the intelligent input of humans, who would actually be paid for their expertise: planning the itinerary of a multistop plane trip. In 1990, a good human travel agent could make a huge difference in finding a convenient and low-cost itinerary. By 2010, however, this task was performed better by computers than humans. Exactly how computers achieve this would be an interesting story in itself, as they do use several fascinating algorithms for planning itineraries. But even more important is the effect of the systems on our perception of the task. I would argue that by 2010, the task of planning an itinerary was perceived as purely mechanistic by a significant majority of humans—in stark contrast to the perception 20 years earlier.

  This gradual transformation of tasks, from apparently intuitive to obviously mechanistic, is continuing. Both AI in general and pattern recognition in particular are slowly extending their reach and improving their performance. The algorithms described in this chapter—nearest-neighbor classifiers, decision trees, and neural networks—can be applied to an immense range of practical problems. These include correcting fat-fingered text entry on cell phone virtual keyboards, helping to diagnose a patient's illness from a complex battery of test results, recognizing car license plates at automated toll booths, and determining which advertisement to display to a particular computer user—to name just a few. Thus, these algorithms are some of the fundamental building blocks of pattern recognition systems. Whether or not you consider them to be truly “intelligent,” you can expect to see a lot more of them in the years ahead.

  7

  Data Compression: Something for Nothing

  Emma was gratified, and would soon have shewn no want of words, if the sound of Mrs Elton's voice from the sitting-room had not checked her, and made it expedient to compress all her friendly and all her congratulatory sensations into a very, very earnest shake of the hand.

  —JANE AUSTEN, Emma

  We're all familiar with the idea of compressing physical objects: when you try to fit a lot of clothes into a small suitcase, you can squash the clothes so that they are small enough to fit even though they would overflow the suitcase at their normal size. You have compressed the clothes. Later, when the clothes come out of the suitcase, you can decompress them and (hopefully) wear them again in their original size and shape.

  Remarkably, it's possible to do exactly the same thing with information: computer files and other kinds of data can often be compressed to a smaller size for easy storage or transportation. Later, they are decompressed and used in their original form.

  Most people have plenty of disk space on their own computers and don't need to bother about compressing their own files. So it's tempting to think that compression doesn't affect most of us. But this impression is wrong: in fact, compression is used behind the scenes in computer systems quite often. For example, many of the messages sent over the internet are compressed without the user even knowing it, and almost all software is downloaded in compressed form—this means your downloads and file transfers are often several times quicker than they otherwise would be. Even your voice gets compressed when you speak on the phone: telephone companies can achieve a vastly superior utilization of their resources if they compress voice data before transporting it.

  Compression is used in more obvious ways, too. The popular ZIP file format employs an ingenious compression algorithm that will be described in this chapter. And you're probably very familiar with the trade-offs involved in compressing digital videos: a high-quality video has a much larger file size than a low-quality version of the same video.

  LOSSLESS COMPRESSION: THE ULTIMATE FREE LUNCH

  It's important to realize that computers use two very different types of compression: lossless and lossy. Lossless compression is the ultimate free lunch that really does give you something for nothing. A lossless compression algorithm can take a data file, compress it to a fraction of its original size, then later decompress it to exactly the same thing. In contrast, with lossy compression, the file you get back after decompression is slightly different from the original. We'll discuss lossy compression later, but let's focus on lossless compression for now. For an example of lossless compression, suppose the original file contained the text of this book. Then the version you get after compressing and decompressing contains exactly the same text—not a single word, space, or punctuation character is different. Before we get too excited about this free lunch, I need to add an important caveat: lossless compression algorithms can't produce dramatic space savings on every file. But a good compression algorithm will produce substantial savings on certain common types of files.

  So how can we get our hands on this free lunch? How on earth can you make a piece of data, or information, smaller than its actual “true” size without destroying it, so that everything can be reconstructed perfectly later on? In fact, humans do this all the time without even thinking about it. Consider the example of your weekly calendar. To keep things simple, let's assume you work eight-hour days, five days a week, and that you divide your calendar into one-hour slots. So each of the five days has eight possible slots, for a total of 40 slots per week. Roughly speaking, then, to communicate a week of your calendar to someone else, you have to communicate 40 pieces of information. But if someone calls you up to schedule a meeting for next week, do you describe your availability by listing 40 separate pieces of information? Of course not! Most likely you will say something like “Monday and Tuesday are full, and I'm booked from 1 p.m. to 3 p.m. on Thursday and Friday, but otherwise available.” This is an example of lossless data compression! The person you are talking to can exactly reconstruct your availability in all 40 slots for next week, but you didn't have to list them explicitly.

  At this point, you might be thinking that this kind of “compression” is cheating, since it depends on the fact that huge chunks of your schedule were the same. Specifically, all of Monday and Tuesday were booked, so you could describe them very quickly, and the rest of the week was available except for two slots that were also easy to describe. It's true that this was a particularly simple example. Nevertheless, data compression in computers works this way too: the basic idea is to find parts of the data that are identical to each other and use some kind of trick to describe those parts more efficiently.

  This is particularly easy when the data contains repetitions. For example, you can probably think of a good way to compress the following data:

  AAAAAAAAAAAAAAAAAAAAABCBCBCBCBCBCBCBCBCBCAAAAAADEFDEFDEF

  If it's not obvious immediately, think about how you would dictate this data to someone over the phone. Instead of saying “A, A, A, A,…, D, E, F,” I'm sure you would come up with something more along the lines of “21 A's, then 10 BC's, then another 6 A's, then 3 DEF's.” Or to quickly make a note of this data on a piece of paper, you might write something like “21A,10BC,6A,3DEF.” In this case, you have compressed the original data, which happens to contain 56 characters, down to a string of only 16 characters. That's less than one-third of the size. Not bad! Computer scientists call this particular trick run-length encoding, because it encodes a “run” of repetitions with the “length” of that run.
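  The single-character version of this trick is easy to express in code. Here is a minimal Python sketch (the book itself presents no code, so the function name and output format are just for illustration); spotting multi-character repeating units like “BC” and “DEF,” as in the dictation above, would require extra logic that this sketch omits:

```python
def rle_encode(data: str) -> str:
    """Encode each run of a repeated character as <count><character>."""
    pieces = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                          # extend the current run
        pieces.append(f"{j - i}{data[i]}")  # e.g., 21 A's become "21A"
        i = j
    return ",".join(pieces)

print(rle_encode("AAAAAAAAAAAAAAAAAAAAA"))  # -> 21A
```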

  Unfortunately, run-length encoding is only useful for compressing very specific types of data. It is used in practice, but mostly only in combination with other compression algorithms. For example, fax machines use run-length encoding in combination with another technique, called Huffman coding, that we will encounter later. The main problem with run-length encoding is that the repetitions in the data have to be adjacent—in other words, there must be no other data between the repeated parts. It's easy to compress ABABAB using run-length encoding (it's just 3AB), but impossible to compress ABXABYAB using the same trick.

  You can probably see why fax machines can take advantage of run-length encoding. Faxes are by definition black-and-white documents, which get converted into a large number of dots, with each dot being either black or white. When you read the dots in order (left to right, top to bottom), you encounter long runs of white dots (the background) and short runs of black dots (the foreground text or handwriting). This leads to an efficient use of run-length encoding. But as mentioned above, only certain limited types of data have this feature.

  So computer scientists have invented a range of more sophisticated tricks that use the same basic idea (find repetitions and describe them efficiently), but work well even if the repetitions aren't adjacent. Here, we'll look at only two of these tricks: the same-as-earlier trick and the shorter-symbol trick. These two tricks are the only things you need to produce ZIP files, and the ZIP file format is the most popular format for compressed files on personal computers. So once you understand the basic ideas behind these two tricks, you will understand how your own computer uses compression, most of the time.

  The Same-as-Earlier Trick

  Imagine you have been given the dreary task of dictating the following data over the telephone to someone else:

  VJGDNQMYLH-KW-VJGDNQMYLH-ADXSGF-O-

  VJGDNQMYLH-ADXSGF-VJGDNQMYLH-EW-ADXSGF

  There are 63 characters to be communicated here (we are ignoring the dashes, by the way—they were only inserted to make the data easier to read). Can we do any better than dictating all 63 characters, one at a time? The first step might be to recognize that there is quite a lot of repetition in this data. In fact, most of the “chunks” that are separated by dashes get repeated at least once. So when dictating this data, you can save a lot of effort by saying something like “this part is the same as something I told you earlier.” To be a bit more precise, you will have to say how much earlier and how long the repeated part is—perhaps something like “go back 27 characters, and copy 8 characters from that point.”

  Let's see how this strategy works out in practice. The first 12 characters have no repetition, so you have no choice but to dictate them one by one: “V, J, G, D, N, Q, M, Y, L, H, K, W.” But the next 10 characters are the same as some earlier ones, so you could just say “back 12, copy 10.” The next seven are new, and get dictated one by one: “A, D, X, S, G, F, O.” But the 16 characters after that are one big repeat, so you can say “back 17, copy 16.” The next 10 are repeats from earlier too, and “back 16, copy 10” takes care of them. Following that are two characters that aren't repeats, so they are dictated as “E, W.” Finally, the last 6 characters are repeats from earlier and are communicated using “back 18, copy 6.”

  Let's try to summarize our compression algorithm. We'll use the abbreviation b for “back” and c for “copy.” So a back-and-copy instruction like “back 18, copy 6” gets abbreviated as b18c6. Then the dictation instructions above can be summarized as

  VJGDNQMYLH-KW-b12c10-ADXSGF-O-b17c16-b16c10-EW-b18c6

  This string consists of only 44 characters. The original was 63 characters, so we have saved 19, or nearly a third of the length of the original.
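  To make the decoding direction concrete, here is a minimal Python sketch of a decoder for this back-and-copy notation (the book presents no code, so this tokenizer is my own illustration; it assumes literal characters are capital letters, so a lowercase “b” can only begin an instruction like b12c10):

```python
import re

def decompress(compressed: str) -> str:
    """Decode literals and back-and-copy instructions such as 'b12c10'."""
    out = []
    for token in re.finditer(r"b(\d+)c(\d+)|.", compressed):
        if token.group(1) is None:
            out.append(token.group(0))      # an ordinary literal character
        else:
            back, count = int(token.group(1)), int(token.group(2))
            start = len(out) - back
            for k in range(count):          # copy one character at a time
                out.append(out[start + k])  # from the output built so far
    return "".join(out)

# The compressed string above (dashes removed) regenerates all 63 characters:
print(decompress("VJGDNQMYLHKWb12c10ADXSGFOb17c16b16c10EWb18c6"))
```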

  There is one more interesting twist on this same-as-earlier trick. How would you use the same trick to compress the message FG-FG-FG-FG-FG-FG-FG-FG? (Again, the dashes are not part of the message but are only added for readability.) Well, there are 8 repetitions of FG in the message, so we could dictate the first four individually, and then use a back-and-copy instruction as follows: FG-FG-FG-FG-b8c8. That saves quite a few characters, but we can do even better. It requires a back-and-copy instruction that might at first seem nonsensical: “back 2, copy 14,” or b2c14 in our abbreviated notation. The compressed message is, in fact, FG-b2c14. How can it possibly make sense to copy 14 characters when only 2 are available to be copied? In fact, this causes no problem at all as long as you copy from the message being regenerated, and not the compressed message. Let's do this step by step. After the first two characters have been dictated, we have FG. Then the b2c14 instruction arrives, so we go back 2 characters and start copying. There are only two characters available (FG), so let's copy those: when they are added to what we had already, the result is FG-FG. But now there are two more characters available! So copy those as well, and after adding them to the existing regenerated message you have FG-FG-FG. Again, two more characters available so you can copy two more. And this can continue until you have copied the required number of characters (in this case, 14). To check that you understood this, see if you can work out the uncompressed version of this compressed message: Ab1c250.
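  Notice that the sketch above copies one character at a time from the output being regenerated, which is exactly what makes this overlapping case work: each copied character immediately becomes available to be copied again.

```python
print(decompress("FGb2c14"))  # -> FGFGFGFGFGFGFGFG, i.e., FG repeated 8 times
```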

  The Shorter-Symbol Trick

  To understand the compression trick that we'll be calling the “shorter-symbol trick,” we need to delve a little deeper into how computers store messages. As you may have heard before, computers do not actually store letters like a, b, and c. Everything gets stored as a number, and then gets interpreted as a letter according to some fixed table. (This technique for converting between letters and numbers was also mentioned in our discussion of checksums, on page 68.) For example, we might agree that “a” is represented by the number 27, “b” is 28, and “c” is 29. Then the string “abc” would be stored as “272829” in the computer, but could easily be translated back into “abc” before it is displayed on the screen or printed on a piece of paper.

  The table on the next page gives a complete list of 100 symbols that a computer might want to store, together with a 2-digit code for each one. By the way, this particular set of 2-digit codes is not used in any real computer system, but the ones used in real life are quite similar. The main difference is that computers don't use the 10-digit decimal system that humans use. Instead, as you may already know, they use a different numeric system called the binary system. But those details are not important for us. The shorter-symbol compression trick works for both decimal and binary number systems, so we will pretend that computers use decimal, just to make the explanation easier to follow.

  Take a closer look at the table of symbols. Notice that the first entry in the table provides a numeric code for a space between words, “00.” After that come the capital letters from A (“01”) to Z (“26”) and the lowercase letters from a (“27”) to z (“52”). Various punctuation characters follow, and finally some characters for writing non-English words are included in the last column, starting with á (“80”) and ending with Ù (“99”).

  So how would these 2-digit codes be used by a computer to store the phrase “Meet your fiancé there.”? Simple: just translate each character into its numeric code and string them all together:

  M  e  e  t     y  o  u  r     f  i  a  n  c  é     t  h  e  r  e  .
  13 31 31 46 00 51 41 47 44 00 32 35 27 40 29 82 00 46 34 31 44 31 66

  It's very important to realize that inside the computer, there is no separation between the pairs of digits. So this message is actually stored as a continuous string of 46 digits: “1331314600514147440032352740298200463431443166.” Of course, this makes it a little harder for a human to interpret, but presents no problem whatsoever for a computer, which can easily separate the digits into pairs before translating them into characters to be displayed on the screen. The key point is that there is no ambiguity in how to separate out the numeric codes, since each code uses exactly two digits. In fact, this is exactly the reason that A is represented as “01” rather than just “1”—and B is “02” not “2,” and so on up to the letter I (“09” not “9”). If we had chosen to take A = “1,” B = “2,” and so on, then it would be impossible to interpret messages unambiguously. For example, the message “1123” could be broken up as “1 1 23” (which translates to AAW), or as “11 2 3” (KBC) or even “1 1 2 3” (AABC). So try to remember this important idea: the translation between numeric codes and characters must be unambiguous, even when the codes are stored next to each other with no separation. This issue will come back to haunt us surprisingly soon!
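  Here is a minimal Python sketch of this fixed-length decoding (the full table is not reproduced here, so the dictionary below fills in only the codes described in the text; the entries for “é” and the period are taken from the worked example above):

```python
# 00 = space, 01-26 = A-Z, 27-52 = a-z, plus two codes from the example.
CODES = {0: " "}
CODES.update({i: chr(ord("A") + i - 1) for i in range(1, 27)})    # 01-26
CODES.update({i: chr(ord("a") + i - 27) for i in range(27, 53)})  # 27-52
CODES[66] = "."   # punctuation code taken from the worked example
CODES[82] = "é"   # accented letter taken from the worked example

def decode(digits: str) -> str:
    # Every code is exactly two digits, so splitting is unambiguous.
    return "".join(CODES[int(digits[i:i + 2])] for i in range(0, len(digits), 2))

print(decode("1331314600514147440032352740298200463431443166"))
# -> Meet your fiancé there.
```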

  [Table: Numeric codes that a computer could use for storing symbols.]

  Meanwhile, let's get back to the shorter-symbol trick. As with many of the supposedly technical ideas described in this book, the shorter-symbol trick is something that humans do all the time without even thinking about it. The basic idea is that if you use something often enough, it's worth having a shorthand abbreviation for it. Everyone knows that “USA” is short for “United States of America”—we all save a lot of effort each time we type or say the 3-letter code “USA” instead of the full 24-letter phrase it stands for. But we don't bother with 3-letter codes for every 24-letter phrase. Do you know an abbreviation for “The sky is blue in color,” which also happens to be a 24-letter phrase? Of course not! But why? What is the difference between “United States of America” and “The sky is blue in color”? The key difference is that one of these phrases is used much more often than the other, and we can save a lot more effort by abbreviating a frequently used phrase instead of one that is rarely used.

 
