What Just Happened: A Chronicle From the Information Frontier
Page 37
Yet one still tries to take their measure.
How much information …?
When an object (a number or a bitstream or a dynamical system) can be expressed a different way in fewer bits, it is compressible. A frugal telegraph operator prefers to send the compressed version. Because the spirit of frugal telegraph operators kept the lights on at Bell Labs, it was natural for Claude Shannon to explore data compression, both theory and practice. Compression was fundamental to his vision: his war work on cryptography analyzed the disguising of information at one end and the recovery of the information at the other; data compression likewise encodes the information, with a different motivation—the efficient use of bandwidth. Satellite television channels, pocket music players, efficient cameras and telephones and countless other modern appurtenances depend on coding algorithms to compress numbers—sequences of bits—and those algorithms trace their lineage to Shannon’s original 1948 paper.
The first of these, now called Shannon-Fano coding, came from his colleague Robert M. Fano. It began with the simple idea of assigning short codes to frequent symbols, as in Morse code. They knew their method was not optimal, however: it could not be relied on to produce the shortest possible messages. Within three years it was surpassed by work of a graduate student of Fano’s at MIT, David Huffman. In the decades since, versions of the Huffman coding algorithm have squeezed many, many bytes.
Ray Solomonoff, a child of Russian immigrants who studied at the University of Chicago, encountered Shannon’s work in the early 1950s and began thinking about what he called the Information Packing Problem: how much information could one “pack” into a given number of bits, or conversely, given some information, how could one pack it into the fewest possible bits.♦ He had majored in physics, studied mathematical biology and probability and logic on the side, and gotten to know Marvin Minsky and John McCarthy, pioneers in what would soon be called artificial intelligence. Then he read Noam Chomsky’s offbeat and original paper “Three Models for the Description of Language,”♦ applying the new information-theoretic ideas to the formalization of structure in language. All this was bouncing around in Solomonoff’s mind; he was not sure where it led, but he found himself focusing on the problem of induction. How do people create theories to account for their experience of the world? They have to make generalizations, find patterns in data that are always influenced by randomness and noise. Could one enable a machine to do that? In other words, could a computer be made to learn from experience?
He worked out an elaborate answer and published it in 1964. It was idiosyncratic, and hardly anyone noticed until the 1970s, when both Chaitin and Kolmogorov discovered that Solomonoff had anticipated the essential features of what by then was called algorithmic information theory. In effect, Solomonoff, too, had been figuring out how a computer might look at sequences of data—number sequences or bit strings—and measure their randomness and their hidden patterns. When humans or computers learn from experience, they are using induction: recognizing regularities amid irregular streams of information. From this point of view, the laws of science represent data compression in action. A theoretical physicist acts like a very clever coding algorithm. “The laws of science that have been discovered can be viewed as summaries of large amounts of empirical data about the universe,”♦ wrote Solomonoff. “In the present context, each such law can be transformed into a method of compactly coding the empirical data that gave rise to that law.” A good scientific theory is economical. This was yet another way of saying so.
Solomonoff, Kolmogorov, and Chaitin tackled three different problems and came up with the same answer. Solomonoff was interested in inductive inference: given a sequence of observations, how can one make the best predictions about what will come next? Kolmogorov was looking for a mathematical definition of randomness: what does it mean to say that one sequence is more random than another, when they have the same probability of emerging from a series of coin flips? And Chaitin was trying to find a deep path into Gödel incompleteness by way of Turing and Shannon—as he said later, “putting Shannon’s information theory and Turing’s computability theory into a cocktail shaker and shaking vigorously.”♦ They all arrived at minimal program size. And they all ended up talking about complexity.
The following bitstream (or number) is not very complex, because it is rational:
D: 14285714285714285714285714285714285714285714285714…
It may be rephrased concisely as “PRINT 142857 AND REPEAT,” or even more concisely as “1/7.” If it is a message, the compression saves keystrokes. If it is an incoming stream of data, the observer may recognize a pattern, grow more and more confident, and settle on one-seventh as a theory for the data.
In contrast, this sequence contains a late surprise:
E: 10101010101010101010101010101010101010101010101013
The telegraph operator (or theorist, or compression algorithm) must pay attention to the whole message. Nonetheless, the extra information is minimal; the message can still be compressed, wherever pattern exists. We may say it contains a redundant part and an arbitrary part.
It was Shannon who first showed that anything nonrandom in a message allows compression:
F: 101101011110110110101110101110111101001110110100111101110
Heavy on ones, light on zeroes, this might be emitted by the flip of a biased coin. Huffman coding and other such algorithms exploit statistical regularities to compress the data. Photographs are compressible because of their subjects’ natural structure: light pixels and dark pixels come in clusters; statistically, nearby pixels are likely to be similar; distant pixels are not. Video is even more compressible, because the differences between one frame and the next are relatively slight, except when the subject is in fast and turbulent motion. Natural language is compressible because of redundancies and regularities of the kind Shannon analyzed. Only a wholly random sequence remains incompressible: nothing but one surprise after another.
Random sequences are “normal”—a term of art meaning that on average, in the long run, each digit appears exactly as often as the others, one time in ten; and each pair of digits, from 00 to 99, appears one time in a hundred; and each triplet likewise, and so on. No string of any particular length is more likely to appear than any other string of that length. Normality is one of those simple-seeming ideas that, when mathematicians look closely, turn out to be covered with thorns. Even though a truly random sequence must be normal, the reverse is not necessarily the case. A number can be statistically normal yet not random at all. David Champernowne, a young Cambridge friend of Turing’s, invented (or discovered) such a creature in 1933—a construction made of all the integers, chained together in order:
G: 12345678910111213141516171819202122232425262728293…
It is easy to see that each digit, and each combination of digits, occurs equally often in the long run. Yet the sequence could not be less random. It is rigidly structured and completely predictable. If you know where you are, you know what comes next.
Even apart from freaks like Champernowne’s, it turns out that normal numbers are difficult to recognize. In the universe of numbers, normality is the rule; mathematicians know for sure that almost all numbers are normal. The rational numbers are not normal, and there are infinitely many rational numbers, but they are infinitely outnumbered by normal numbers. Yet, having settled the great and general question, mathematicians can almost never prove that any particular number is normal. This in itself is one of the more remarkable oddities of mathematics.
Even Π retains some mysteries:
C: 3.1415926535897932384626433832795028841971693993751…
The world’s computers have spent many cycles analyzing the first trillion or so known decimal digits of this cosmic message, and as far as anyone can tell, they appear normal. No statistical features have been discovered—no biases or correlations, local or remote. It is a quintessentially nonrandom number that seems to behave randomly. Given the nth digit, there is no short
cut for guessing the nth plus one. Once again, the next bit is always a surprise.
How much information, then, is represented by this string of digits? Is it information rich, like a random number? Or information poor, like an ordered sequence?
The telegraph operator could, of course, save many keystrokes—infinitely many, in the long run—by simply sending the message “Π.” But this is a cheat. It presumes knowledge previously shared by the sender and the receiver. The sender has to recognize this special sequence to begin with, and then the receiver has to know what Π is, and how to look up its decimal expansion, or else how to compute it. In effect, they need to share a code book.
This does not mean, however, that Π contains a lot of information. The essential message can be sent in fewer keystrokes. The telegraph operator has several strategies available. For example, he could say, “Take 4, subtract 4/3, add 4/5, subtract 4/7, and so on.” The telegraph operator sends an algorithm, that is. This infinite series of fractions converges slowly upon Π, so the recipient has a lot of work to do, but the message itself is economical: the total information content is the same no matter how many decimal digits are required.
The issue of shared knowledge at the far ends of the line brings complications. Sometimes people like to frame this sort of problem—the problem of information content in messages—in terms of communicating with an alien life-form in a faraway galaxy. What could we tell them? What would we want to say? The laws of mathematics being universal, we tend to think that Π would be one message any intelligent race would recognize. Only, they could hardly be expected to know the Greek letter. Nor would they be likely to recognize the decimal digits “3.1415926535 …” unless they happened to have ten fingers.
The sender of a message can never fully know his recipient’s mental code book. Two lights in a window might mean nothing or might mean “The British come by sea.” Every poem is a message, different for every reader. There is a way to make the fuzziness of this line of thinking go away. Chaitin expressed it this way:
It is preferable to consider communication not with a distant friend but with a digital computer. The friend might have the wit to make inferences about numbers or to construct a series from partial information or from vague instructions. The computer does not have that capacity, and for our purposes that deficiency is an advantage. Instructions given the computer must be complete and explicit, and they must enable it to proceed step by step.♦
In other words: the message is an algorithm. The recipient is a machine; it has no creativity, no uncertainty, and no knowledge, except whatever “knowledge” is inherent in the machine’s structure. By the 1960s, digital computers were already getting their instructions in a form measured in bits, so it was natural to think about how much information was contained in any algorithm.
A different sort of message would be this:
Even to the eye this sequence of notes seems nonrandom. It happens that the message they represent is already making its way through interstellar space, 10 billion miles from its origin, at a tiny fraction of light speed. The message is not encoded in this print-based notation, nor in any digital form, but as microscopic waves in a single long groove winding in a spiral engraved on a disc twelve inches in diameter and one-fiftieth of an inch in thickness. The disc might have been vinyl, but in this case it was copper, plated with gold. This analog means of capturing, preserving, and reproducing sound was invented in 1877 by Thomas Edison, who called it phonography. It remained the most popular audio technology a hundred years later—though not for much longer—and in 1977 a committee led by the astronomer Carl Sagan created a particular phonograph record and stowed copies in a pair of spacecraft named Voyager 1 and Voyager 2, each the size of a small automobile, launched that summer from Cape Canaveral, Florida.
So it is a message in an interstellar bottle. The message has no meaning, apart from its patterns, which is to say that it is abstract art: the first prelude of Johann Sebastian Bach’s Well-Tempered Clavier, as played on the piano by Glenn Gould. More generally, perhaps the meaning is “There is intelligent life here.” Besides the Bach prelude, the record includes music samples from several different cultures and a selection of earthly sounds: wind, surf, and thunder; spoken greetings in fifty-five languages; the voices of crickets, frogs, and whales; a ship’s horn, the clatter of a horse-drawn cart, and a tapping in Morse code. Along with the phonograph record are a cartridge and needle and a brief pictographic instruction manual. The committee did not bother with a phonograph player or a source of electrical power. Maybe the aliens will find a way to convert those analog metallic grooves into waves in whatever fluid serves as their atmosphere—or into some other suitable input for their alien senses.
THE “GOLDEN RECORD” STOWED ABOARD THE VOYAGER SPACECRAFT (Illustration credit 12.1)
Would they recognize the intricate patterned structure of the Bach prelude (say), as distinct from the less interesting, more random chatter of crickets? Would the sheet music convey a clearer message—the written notes containing, after all, the essence of Bach’s creation? And, more generally, what kind of knowledge would be needed at the far end of the line—what kind of code book—to decipher the message? An appreciation of counterpoint and voice leading? A sense of the tonal context and performance practices of the European Baroque? The sounds—the notes—come in groups; they form shapes, called melodies; they obey the rules of an implicit grammar. Does the music carry its own logic with it, independent of geography and history? On earth, meanwhile, within a few years, even before the Voyagers had sailed past the solar system’s edge, music was seldom recorded in analog form anymore. Better to store the sounds of the Well-Tempered Clavier as bits: the waveforms discretized without loss as per the Shannon sampling theorem, and the information preserved in dozens of plausible media.
In terms of bits, a Bach prelude might not seem to have much information at all. As penned by Bach on two manuscript pages, this one amounts to six hundred notes, characters in a small alphabet. As Glenn Gould played it on a piano in 1964—adding the performer’s layers of nuance and variation to the bare instructions—it lasts a minute and thirty-six seconds. The sound of that performance, recorded onto a CD, microscopic pits burned by a laser onto a slim disc of polycarbonate plastic, comprises 135 million bits. But this bitstream can be compressed considerably with no loss of information. Alternatively, the prelude fits on a small player-piano roll (descendant of Jacquard’s loom, predecessor of punched-card computing); encoded electronically with the MIDI protocol, it uses a few thousands bits. Even the basic six-hundred-character message has tremendous redundancy: unvarying tempo, uniform timbre, just a brief melodic pattern, a word, repeated over and over with slight variations till the final bars. It is famously, deceptively simple. The very repetition creates expectations and breaks them. Hardly anything happens, and everything is a surprise. “Immortal broken chords of radiantly white harmonies,” said Wanda Landowska. It is simple the way a Rembrandt drawing is simple. It does a lot with a little. Is it then rich in information? Certain music could be considered information poor. At one extreme John Cage’s composition titled 4′33″ contains no “notes” at all: just four minutes and thirty-three seconds of near silence, as the piece absorbs the ambient sounds around the still pianist—the listeners’ shifting in their seats, rustling clothes, breathing, sighing.
How much information in the Bach C-major Prelude? As a set of patterns, in time and frequency, it can be analyzed, traced, and understood, but only up to a point. In music, as in poetry, as in any art, perfect understanding is meant to remain elusive. If one could find the bottom it would be a bore.
In a way, then, the use of minimal program size to define complexity seems perfect—a fitting apogee for Shannon information theory. In another way it remains deeply unsatisfying. This is particularly so when turning to the big questions—one might say, the human questions—of art, of biology, of intelligence.
According to this measure, a million zer
oes and a million coin tosses lie at opposite ends of the spectrum. The empty string is as simple as can be; the random string is maximally complex. The zeroes convey no information; coin tosses produce the most information possible. Yet these extremes have something in common. They are dull. They have no value. If either one were a message from another galaxy, we would attribute no intelligence to the sender. If they were music, they would be equally worthless.
Everything we care about lies somewhere in the middle, where pattern and randomness interlace.
Chaitin and a colleague, Charles H. Bennett, sometimes discussed these matters at IBM’s research center in Yorktown Heights, New York. Over a period of years, Bennett developed a new measure of value, which he called “logical depth.” Bennett’s idea of depth is connected to complexity but orthogonal to it. It is meant to capture the usefulness of a message, whatever usefulness might mean in any particular domain. “From the earliest days of information theory it has been appreciated that information per se is not a good measure of message value,”♦ he wrote, finally publishing his scheme in 1988.
A typical sequence of coin tosses has high information content but little value; an ephemeris, giving the positions of the moon and planets every day for a hundred years, has no more information than the equations of motion and initial conditions from which it was calculated, but saves its owner the effort of recalculating these positions.
The amount of work it takes to compute something had been mostly disregarded—set aside—in all the theorizing based on Turing machines, which work, after all, so ploddingly. Bennett brought it back. There is no logical depth in the parts of a message that are sheer randomness and unpredictability, nor is there logical depth in obvious redundancy—plain repetition and copying. Rather, he proposed, the value of a message lies in “what might be called its buried redundancy—parts predictable only with difficulty, things the receiver could in principle have figured out without being told, but only at considerable cost in money, time, or computation.” When we value an object’s complexity, or its information content, we are sensing a lengthy hidden computation. This might be true of music or a poem or a scientific theory or a crossword puzzle, which gives its solver pleasure when it is neither too cryptic nor too shallow, but somewhere in between.