In all these projects Dally was warring against the conventional computer architecture of step-by-step serial processing—associated with a memory problem called the “von Neumann bottleneck.” You live in the real world, right? The real world offers intrinsically parallel problems such as images that flood the eye all at once, whether you’re driving a car in the snow or summoning a metaverse with computer-generated graphics or pattern-matching in “machine learning” argosies across the seas of big data.
The von Neumann bottleneck was recognized by von Neumann himself. In response, he proposed a massively parallel architecture called cellular automata, and in his last book, The Computer and the Brain, written shortly before his death at age fifty-three, he contemplated a parallel solution called neural networks, based on a primitive idea of how billions of neurons might work together in the human neural system.
Von Neumann concluded the brain is a non-von machine nine orders of magnitude slower than the gigahertz he prophesied for computers back in 1957. Amazingly, von Neumann anticipated the many-million-fold “Moore’s Law” speedup that we have experienced. But he estimated that the brain is nine orders of magnitude (a billion times) more energy-efficient than a computer. This is a delta larger even than that claimed by the guys from Google Brain for their Tensor chip. In the age of Big Blue and Watson at IBM, the comparison remains relevant. When a supercomputer defeats a man in a game of chess or Go, the man is using maybe fourteen watts of power, while the computer and its networks are tapping into the gigawatt clouds on the Columbia.
In the age of Big Data, the von Neumann bottleneck has philosophical implications. The more knowledge that is put into a von Neumann machine, the bigger and more crowded its memory, the further away its average data address, and the slower its functioning. Danny Hillis, of the erstwhile Thinking Machines, writes, “This inefficiency remains no matter how fast we make the processor, because the length of the computation becomes dominated by the time required to move data between processor and memory.” That span, traveled in every step in the computation, is governed by the speed of light, which on a chip is around nine inches a nanosecond—a significant delay on chips that now bear as much as sixty miles of tiny wires.
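A back-of-the-envelope sketch in Python of the wire-delay arithmetic above; the nine-inches-per-nanosecond figure comes from the text, while the clock rate and processor-to-memory distance are illustrative assumptions:

```python
# Rough wire-delay arithmetic for the von Neumann bottleneck, using the
# text's ~9 inches/nanosecond on-chip signal speed; clock rate and
# memory distance are illustrative assumptions.

SIGNAL_INCHES_PER_NS = 9.0      # on-chip signal propagation (from the text)
CLOCK_GHZ = 3.0                 # assumed processor clock
cycle_ns = 1.0 / CLOCK_GHZ      # ~0.33 ns per clock cycle

reach = SIGNAL_INCHES_PER_NS * cycle_ns
print(f"Signal reach per cycle: {reach:.1f} inches")

# A modest round trip to memory already costs several cycles, and no faster
# processor can shrink it: the delay is set by distance, not by logic speed.
round_trip_inches = 6.0         # assumed processor-to-memory round trip
delay_ns = round_trip_inches / SIGNAL_INCHES_PER_NS
print(f"Round trip: {delay_ns:.2f} ns, or {delay_ns / cycle_ns:.1f} cycles")
```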
What Dally saw is that the serial computer has reached the end of the line. Most computers (smartphones and tablets and laptops and even self-driving cars) are not plugged into the wall any more. Even supercomputers and data centers suffer from power constraints, manifested in the problems of cooling the machines, whether by giant fans and air conditioners or by sites near rivers or glaciers. As Hölzle comments, “By classic definitions, there is little ‘work’ produced by a datacenter since most of the energy is converted to heat.”
Hitting the energy wall and the light-speed barrier, the chip’s architecture will necessarily fragment into separate modules and asynchronous, more parallel structures. We might term these processors time-space “mollusks”—Einstein’s word for entities in a relativistic world. Light speed will set the size of the integrated-circuit cell, a measure as fundamental in the microcosm as light-years are in the cosmos. It will enforce a distribution of computing capabilities analogous to the distribution of human intelligence.
As a result, says Dally, harking back to Tredennick, leading-edge-wedge computer performance must now be measured not by the conventional metrics of operations per second or silicon area but by operations per watt. Because they exploit the natural parallelism of images flooding the eye all at once, graphics processors are not only as ubiquitous as vision but supremely parallel. Thus many of the “cool chips” of today are made by Nvidia.
Still, in operations per watt, the prevailing champion is made not of silicon but of carbon. It is the original neural network, the human brain and its fourteen watts, which is not enough to illuminate the lightbulb over a character’s head in a cartoon strip. In the future, computers will pursue the energy ergonomics of brains rather than the megawattage of Big Blue or even the giant air-conditioned expanses of data centers. All computers will have to use the power-saving techniques that have been developed in the battery-powered smartphone industry and then move on to explore the energy economics of real carbon brains.
There is a critical difference between programmable machines and programmers. The machines are deterministic and the programmers are creative.
That means that the AI movement, far from replacing human brains, is going to find itself imitating them. The brain demonstrates the superiority of the edge over the core: It’s not agglomerated in a few air-conditioned nodes but dispersed far and wide, interconnected by myriad sensory and media channels. The test of the new global ganglia of computers and cables, worldwide Webs of glass and light and air, is how readily they take advantage of unexpected contributions from free human minds in all their creativity and diversity, which cannot even be measured by the metrics of computer science.
As the Silicon Valley legend Carver Mead of Caltech has shown in his decades of experiments in neuromorphic computation, any real artificial intelligence will likely have to use not silicon substrates but carbon-based materials. With its millions of known compounds, carbon is more adaptable and chemically complex than silicon by orders of magnitude. Recent years have seen an efflorescence of new carbon materials, such as the organic light-emitting diodes and photodetectors now slowly taking over the display market. Most promising is graphene, a one-atom-thick sheet of transparent carbon that can be rolled into carbon nanotubes, layered into graphite blocks, or wrapped into C60 “buckyballs.”
Graphene has many advantages. Its tensile strength is sixty times that of steel, its conductivity two hundred times that of copper. There is no band gap to slow it down, and it provides a relatively huge sixty-micron mean-free path for electrons. As the nanotech virtuoso James Tour of Rice University has demonstrated in his laboratory, graphene, carbon nanotube swirls, and their compounds make an array of nano-machines, vehicles, and engines possible. They offer the still-remote promise of new computer architectures such as quantum computers that can actually model physical reality and thus may finally yield some real intelligence.
The current generation in Silicon Valley has yet to come to terms with the findings of von Neumann and Gödel early in the last century or with the breakthroughs in information theory of Claude Shannon, Gregory Chaitin, Andrey Kolmogorov, and John R. Pierce. In a series of powerful arguments, Chaitin, the inventor of algorithmic information theory, has translated Gödel into modern terms. When Silicon Valley’s AI theorists push the logic of their case to explosive extremes, they defy the most crucial findings of twentieth-century mathematics and computer science. All logical schemes are incomplete and depend on propositions that they cannot prove. Pushing any logical or mathematical argument to extremes—whether “renormalized” infinities or parallel universe multiplicities—scientists impel it off the cliffs of Gödelian incompleteness.
Chaitin’s “mathematics of creativity” suggests that in order to push the technology forward it will be necessary to transcend the deterministic mathematical logic that pervades existing computers. Anything deterministic prohibits the very surprises that define information and reflect real creation. Gödel dictates a mathematics of creativity.
This mathematics will first encounter a major obstacle in the stunning successes of the prevailing system of the world not only in Silicon Valley but also in finance.
CHAPTER 8
Markov and Midas
One of the supremely seminal ideas of the twentieth century is the Markov chain. Introduced by the Russian mathematician and information theorist Andrey Markov in 1913, it became a set of statistical tools for predicting the future from the present. Powerfully extended in what are called “hidden Markov models,” the technique can uncover unobserved realities behind a series of observations, such as images of cats and dogs at Google, patterns of weather over time, or even human minds.1
A black-bearded atheist, chess master, and political activist, Markov was dubbed “Andrey the Furious.” A cantankerous genius, he was aligned with the left in the closing years of the Tsarist regime, not anticipating the totalitarian turn it would take after the triumph of the Bolsheviks. Though he achieved in his lifetime a certain eminence as a mathematician, his real influence would not be felt for nearly a century, when his work proved essential to the foundation of the Google-era system of the world.
From physics to economics, science has long had trouble coming to terms with time. Until Markov, the theory of probability, like the theory of physics, mostly avoided temporal considerations. As Amy Langville and Philipp von Hilgers write in a canonical essay, the dominant probability concepts failed to differentiate between serial and parallel processes, between “a thousand throws of a single die and a thousand dice each thrown once.”2 Addressing the temporal dependencies between events, how one thing leads to another, Markov chains trace the probabilistic transitions from one state or condition to another, step by step through time.
Markov followed the lead of the nineteenth-century intellectual giants James Clerk Maxwell and Ludwig Boltzmann, who had pioneered this statistical mode of thought in physics. They invented probabilistic tools to describe physical phenomena, such as the hidden behavior of atoms and molecules, waves and particles, which could not be seen or measured by the scientific instruments of their day. Their statistical laws of thermodynamics provided theoretical physics a much-needed arrow of time derived from the concept of entropy.
Remarkably, the first man to expound and use these statistical tools, several years before they were publicly formulated by Markov, was Albert Einstein. In 1905, calculating the hidden behavior of molecules in Brownian motion, he showed that they occupied a chain of states that jiggled at a rate of around two gigahertz following a “random walk,” as in Markov’s concept. Showing the movements of atoms without seeing or measuring them, Einstein translated from what is now termed a Markov sequence of observable states of a gas to his proof of the then-still-hidden Brownian motion of the molecules.
Markov kept his head down during the Russian Revolution while working on his theory. By the time of his death in 1922, he had turned his precursors’ improvisations into a full-fledged system. Markovian techniques, which pervade the science of information theory, are behind the dominant advances of the Google era, from big data and cloud computing to speech recognition and machine learning.
In an early triumph, a statistical study of Pushkin’s poem Eugene Onegin, Markov showed that linguistic properties could be grasped mathematically and predicted without knowing the particular language. In focusing on patterns of vowels and consonants, Markov came close to anticipating Claude Shannon’s information metric. Shannon’s theory treated all transmitters across a communications channel as Markov processes.3
Refining and extending Markov’s discoveries through the twentieth century and into our own era were a series of transformative thinkers. Some, like Shannon, are widely celebrated. Andrew Viterbi is best known as a co-founder of Qualcomm, but perhaps his greatest feat was a recursive algorithm for efficiently finding the most likely path through a chain of hidden states, overcoming computing costs that would otherwise grow exponentially with the length of the chain.
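A minimal sketch of Viterbi’s recursive idea in Python, using an invented two-state weather model with made-up probabilities (none of the numbers or state names come from the text); at each step it keeps only the best path into each state instead of enumerating every possible chain:

```python
# Toy Viterbi decoder: finds the most probable sequence of hidden states
# behind a string of observations in time linear in the chain length,
# rather than enumerating the exponentially many possible paths.

states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(observations):
    # best[s] holds (probability, path) of the likeliest path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((prob * trans_p[prev][s] * emit_p[s][obs], path + [s])
                 for prev, (prob, path) in best.items()),
                key=lambda candidate: candidate[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda candidate: candidate[0])

probability, path = viterbi(["walk", "shop", "clean"])
print(path, probability)   # likeliest hidden weather behind the observations
```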
The precocious MIT star Norbert Wiener, author of Cybernetics (1948), extended Markov sequences from discrete to continuous phenomena and contributed the idea of pruning improbable results.4 This advance helped calculations of rocket or airplane trajectories during World War II, using Markov math to predict the future location of moving objects by observing their current positions.
Bringing Markov chains to big data, the mathematician Leonard E. Baum of the Institute for Defense Analyses (IDA) demonstrated how a sufficiently long chain of observations can be iterated until the probability of an underlying explanation is maximized. These maximized probabilities define the original structure of the source and allow subsequent predictions, whether of words or financial prices. Facilitating Baum’s work was the prestigious but little-known Markov contributor Lee Neuwirth, the longtime head of IDA, who named the predictive use of the chains “hidden Markov models” at a 1980 conference in Princeton.
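Baum’s full procedure, now known as the Baum-Welch algorithm, is more involved, but its core is a likelihood calculation over the observation chain. A minimal sketch in Python, with an invented two-coin model (all states and probabilities are assumptions for illustration):

```python
# Forward algorithm: computes how likely a string of observations is under a
# hypothesized hidden model -- the quantity Baum's iterative procedure
# (Baum-Welch) pushes upward by re-estimating the model. Toy example: two
# hidden coins, one fair and one biased; all numbers are invented.

states = ["fair", "biased"]
start = {"fair": 0.5, "biased": 0.5}
trans = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
emit = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.8, "T": 0.2}}

def sequence_probability(flips):
    """Probability of the observed flips under the hidden two-coin model."""
    alpha = {s: start[s] * emit[s][flips[0]] for s in states}
    for flip in flips[1:]:
        alpha = {s: emit[s][flip] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

print(sequence_probability("HHTHHHHT"))
# Baum-Welch would now adjust trans and emit to raise this probability,
# recovering the hidden structure behind the observed flips.
```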
By every measure, the most widespread, immense, and influential of Markov chains today is Google’s foundational algorithm, PageRank, which encompasses the petabyte reaches of the entire World Wide Web. Treating the Web as a Markov chain enables Google’s search engine to gauge the probability that a particular Web page satisfies your search.5
To construct his uncanny search engine, Larry Page paradoxically began with the Markovian assumption that no one is actually searching for anything. His “random surfer” concept makes Markov central to the Google era.
PageRank treats the Internet user as if he were taking a random walk across the Web, which we users know is not what we are doing. Since a random surfer would tend to visit the best-connected sites most frequently, his hypothetical itinerary defines the importance and authority of sites. Because PageRank is a manageably simple model that requires no knowledge about surfers or websites, it enables Markov math quickly and constantly to calculate their rankings across the galactic topography of the Internet.
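A minimal sketch of the random-surfer calculation in Python, on a made-up four-page link graph; the 0.85 damping factor is the conventional choice in published descriptions of PageRank, and nothing here is Google’s production code:

```python
# Toy PageRank: rank pages by the long-run probability that a "random surfer"
# lands on them. Tiny invented link graph; damping factor 0.85 is conventional.

links = {                     # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):           # power iteration toward the stationary distribution
    new_rank = {}
    for p in pages:
        inbound = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        # with probability (1 - damping) the surfer jumps to a random page
        new_rank[p] = (1 - damping) / len(pages) + damping * inbound
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```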
Beyond Web pages, Markov models treat the world as a sequence of “states”—phonemes, words, weather conditions, consumer choices, transactions, security prices, sensor data, DNA bases, sports results, health indices, CO2 levels, bomb trajectories, Turing machine steps, chess positions, gambling prospects, computer performance, commodity markets, traffic reports—you name it—linked to other states by “transition probabilities.” I drew three kings; what is the likelihood of a fourth? It snowed today; what is the probability that it will rain tomorrow? The opening price of an Amazon share is $1,421 at nine A.M.; what will be the price at 9:01? The transition probabilities may be calculated from previous data and updated with new observations. The Markovian world of random wanderings among the states is governed by the probability weights.
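A worked example of the snow-to-rain question above, as a single Markov transition step in Python; the weather states and probabilities are invented for illustration:

```python
# One Markov step: today's state plus a table of transition probabilities
# yields tomorrow's outlook. The weather states and numbers are invented.

transition = {                    # P(tomorrow | today)
    "snow": {"snow": 0.4, "rain": 0.3, "clear": 0.3},
    "rain": {"snow": 0.1, "rain": 0.5, "clear": 0.4},
    "clear": {"snow": 0.1, "rain": 0.2, "clear": 0.7},
}

today = "snow"
print(transition[today]["rain"])          # P(rain tomorrow | snow today) = 0.3

# Two steps out: sum over every possible intermediate state.
p_rain_day_after = sum(transition[today][mid] * transition[mid]["rain"]
                       for mid in transition)
print(round(p_rain_day_after, 3))
```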
This approach freed analysts of the burden of figuring out people’s intentions or plans or of working out the logical connections between events. All you need is a record of states and the transition probabilities between them. All else can be assumed to be random. In his contributions to the Central Limit Theorem in probability, Markov showed that random events or data need not be independent to converge toward normal distributions. Chains with dependencies over time are a tractable part of the mathematical universe. This is consistent with what we know about statistics: they predict group behavior without accounting for individual decisions or free will.
A defining property of a Markov chain is that it is memoryless. The future is assumed to depend only on the current state, not on the past history of the chain. This feature greatly simplifies the computational process. Following a Markov model, a browser pursues a “random walk” of transitions from one position to another, bouncing off “reflecting states” (unwanted sites), moving through “transient states” (Utah, Nevada), stopping at “absorbing states” (Google Mountain View headquarters!), all without needing to factor in intentionality or plan.
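A small simulation sketch in Python of such a memoryless walk, with invented states and probabilities; the walk wanders among transient states and stops when it reaches an absorbing one, and at every step only the current state matters:

```python
import random

# Memoryless random walk over a toy chain of states: the next step depends
# only on the current state, never on the path taken so far. States and
# probabilities are invented; "goal" is an absorbing state that ends the walk.

moves = {
    "start":  {"start": 0.2, "middle": 0.8},
    "middle": {"start": 0.3, "middle": 0.3, "goal": 0.4},
    "goal":   {"goal": 1.0},          # absorbing: the walk stops here
}

def walk():
    state, steps = "start", 0
    while state != "goal":
        choices, weights = zip(*moves[state].items())
        state = random.choices(choices, weights=weights)[0]
        steps += 1
    return steps

trials = [walk() for _ in range(10_000)]
print(sum(trials) / len(trials))      # average steps until absorption
```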
Hierarchical hidden Markov models enable multiple levels of abstraction, from phonemes up a neural network tree to words and phrases and meanings and models of reality. Ray Kurzweil, a Google vice president and Markov enthusiast, maintains that in recognizing speech or other patterns, hierarchical hidden Markov models are a guide to the mind: “essentially computing what is going on in the neocortex of a speaker—even though we have no direct access to that person’s brain. . . . We might wonder, if we were to look inside the speaker’s neocortex, would we see connections and weights corresponding to the hierarchical hidden Markov Model computed by the software?” In his book How to Create a Mind, he concludes that “there must be an essential mathematical equivalence to a high degree of precision between the actual biology [of the brain] and our attempt to emulate it; otherwise these systems would not work as well as they do.”6
Like Einstein calculating the Brownian motions of unseen molecules, Kurzweil was using an intuitive hidden Markov thinking process to show that the brain is largely a Markovian thinking process. Perhaps, by now, Ray’s brain has been trained and weighted to be one.
Like so many achievements of modern computers, the reach of Markov algorithms depends on the velocity of their computation. Accelerate the data processing and expand the data and you can use Markov to predict and exploit an ever-wider range of future events before anyone else can respond. Siren Servers in huge arrays in the cloud have vastly enlarged the amount of data that can be processed and thus the number of sequences that can be predicted.
All the titans of the cloud from Amazon to Facebook have made heuristic use of Markov models to decide what customers are saying and to predict what they will do next. But the most impressive Markov warriors and Siren Servers are not at Google or Amazon or Facebook. They reside at a little-known but astonishingly successful company transforming the world of finance. The real Markovian masters of the universe run a venture in Setauket, Long Island, called Renaissance Technologies. It is the Google-era titan of finance and investment.
Remember Leonard Baum of the Institute for Defense Analyses? The eminent IDA mathematician James Simons is the founder of Renaissance, which exploits Big Data in accord with Baum’s Markovian vision. Co-author of the Chern-Simons theory used in string theory, a performer of secret cerebrations for IDA, and the genius behind this greatest of hedge funds, Simons has performed a world-beating demonstration of practical mathematics, massive computational power, and entrepreneurship.
Spinning out of the IDA, Renaissance began in 1978 as “Monemetrics” and was mostly devoted to trading currencies with Baum’s hidden Markov modeling techniques still in formation at IDA. This first version was modestly successful. The major breakthroughs came when Simons hired Robert Mercer and Peter Brown from the IBM speech-recognition group in 1993 and unleashed them to create a vast Siren Server designed to make money out of Markov and derivative algorithms.