
A Beautiful Math


by Tom Siegfried


  From a practical point of view, though, the math of objective and subjective probability theory does not really differ in any fundamental respect other than its interpretation. It's just that in some cases it seems more convenient, or more appropriate, to use one rather than another, as Jaynes pointed out half a century ago.

  INFORMATION AND IGNORANCE

  In his 1957 paper,7 Jaynes championed the subjectivity side of the probability debate. He noted that both views, subjectivist and objectivist, were needed in physics, but that for some types of problems only the subjective approach would do.

  He argued that the subjective approach can be useful even when you know nothing about the system you are interested in to begin with. If you are given a box full of particles but know nothing about them—not their mass, not their composition, not their internal structure—there's not much you can say about their behavior. You know the laws of physics, but you don't have any knowledge about the system to apply the laws to. In other words, your ignorance about the behavior of the particles is at a maximum.

  Early pioneers of probability theory, such as Jacob Bernoulli and Laplace, said that in such circumstances you must simply assume that all the possibilities are equally likely—until you have some reason to assume otherwise. Well, that helps in doing the calculations, perhaps, but is there any real basis for assuming the probabilities are equal? Except for certain cases where an obvious symmetry is at play (say, a perfectly balanced two-sided coin), Jaynes said, many other assumptions might be equally well justified (or the way he phrased it, any other assumption would be equally arbitrary).8

  Jaynes saw a way of coping with this situation, though, with the help of the then fairly new theory of information devised by Claude Shannon of Bell Labs. Shannon was interested in quantifying communication, the sending of messages, in a way that would help engineers find ways to communicate more efficiently (he worked for the telephone company, after all). He found math that could quantify information quite nicely if you viewed communication as the reduction of uncertainty. Before communication begins, all messages are possible, so uncertainty is high; as messages are actually received, that uncertainty is reduced.

  Shannon's math applied generally to any system of signaling, from Morse code to smoke signals. But suppose, for example, that all you wanted to do was send someone a single English word (from among all the words in a standard unabridged dictionary, about half a million). If you tell the recipient of the message that the word is in the first half of the dictionary, you've reduced the number of possibilities from 500,000 to 250,000. In other words, you have cut the uncertainty in half (which happens to correspond to one bit of information).
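A few lines of Python make the arithmetic concrete (the word count is the book's round figure): halving the candidate pool once conveys one bit, and singling out one word takes about nineteen such halvings.

```python
import math

# Halving the candidate pool with one yes/no question conveys exactly one bit.
words = 500_000              # rough size of an unabridged dictionary
print(words // 2)            # 250000 candidates remain after "first half?"

# Bits needed to single out one word: log2 of the number of possibilities.
bits = math.log2(words)
print(round(bits, 1))        # about 18.9 bits, i.e. roughly 19 halvings
```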

  Shannon elaborated on this idea to show how all communication could be quantified based on the idea that messages reduce uncertainty. He found a formula for a quantity that measures that uncertainty precisely—the greater the uncertainty, the greater the quantity. Shannon called the quantity entropy, a conscious analogy to the entropy term used by physicists in statistical mechanics and thermodynamics.

  Physicists' entropy is a measure of the disorder in a physical system. Suppose you have a chamber containing separate compartments, and you place a zillion molecules of oxygen in the left-side compartment and 4 zillion molecules of nitrogen in the right-side compartment. Then you remove the partition between the compartments. The molecules soon get all mixed up—more disordered—and so the entropy of the system has increased. But something else has happened—you no longer know where the molecules are. Your ignorance of their location has increased just as the entropy has increased. Shannon showed that his formula for entropy in communication—as a measure of ignorance or uncertainty—is precisely the same equation that is used in statistical mechanics to describe the increasing entropy of a collection of particles.

  Entropy, in other words, is the same thing as ignorance. Entropy is synonymous with uncertainty. Information theory therefore provides a precise new way of measuring uncertainty in a probability distribution.
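Shannon's entropy formula can be sketched in a few lines of Python. The three example distributions below are illustrative, not from the text; the flat one scores highest, matching the idea that uniform probabilities encode maximum ignorance.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum of p_i * log2(p_i), measured in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

uniform = [0.2] * 5                  # know nothing: five equally likely outcomes
skewed  = [0.6, 0.1, 0.1, 0.1, 0.1]  # some knowledge tilts the distribution
certain = [1.0, 0.0, 0.0, 0.0, 0.0]  # complete knowledge

print(entropy(uniform))  # log2(5), about 2.32 bits — maximum ignorance
print(entropy(skewed))   # about 1.77 bits — less uncertain
print(entropy(certain))  # 0.0 bits — no uncertainty at all
```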

  So here's a clue about what to do when you know nothing about the probabilities in the system you want to study. Choose a probability distribution that maximizes the entropy! Maximum entropy means maximum ignorance, and if you know nothing, ignorance is by definition at a maximum. Assuming maximum entropy/ignorance, then, is not just an assumption; it's a factual statement about your situation.

  Jaynes proposed that this notion of maximum ignorance should be elevated to the status of a basic principle for describing anything scientifically. In his view, statistical mechanics itself simply becomes a method of statistical inference about a physical system. By taking the maxent approach, you still get all the computational rules that statistical mechanics provides, without the need to assume anything at all about the underlying physics.

  In particular, you now can justify the notion that all the possibilities are equally possible. The whole idea is that no possibility (allowed by the laws of physics) gets left out. Everything not explicitly excluded by the information you've got has to be viewed as having some probability of occurring. (In standard statistical mechanics, that feature was simply assumed without evidence—probability distributions were based on the idea that molecules would explore all their possible behaviors.) And if you know nothing, you cannot say that any one possibility is more likely than any other—that would be knowledge.

  Of course, if you know something about the probabilities, you can factor that into the probability distribution you use to make your predictions about what's going to happen. But if you know nothing at all, there's only one probability distribution that you can identify for making your bets: the one that has the maximum entropy, the maximum uncertainty, the maximum ignorance. It makes sense, after all, because knowing nothing is, in fact, being maximally ignorant.

  This is the magic that makes it possible to make a prediction even when knowing nothing about the particles or people you're making the prediction about. Of course, your prediction might not turn out to be right. But it's still the best possible prediction you can make, the likeliest answer you can identify, when you know nothing to begin with.

  "The fact that a probability distribution maximizes the entropy subject to certain constraints becomes the essential fact which justifies use of that distribution for inference," Jaynes wrote. "Whether or not the results agree with experiment, they still represent the best estimates that could have been made on the basis of the information available."9

  But what, exactly, does it mean to "maximize the entropy"? It simply means choosing the probability distribution that would result from adding up all the possibilities permitted by the laws of nature (since you know nothing, you cannot leave out anything that's possible). Here's a simple example. Suppose that you want to predict the average grade for a class of 100 students. All you know are the rules (that is, the laws of nature)—everybody gets a grade, and the grade has to be A, B, C, D, or F (no incompletes allowed). You don't know anything about the caliber of the students or how hard the class is. What is your best prediction of the average grade for the kids in the class? In other words, how do you find a probability distribution for the grades that tells you which grade average is the most probable?

  Applying the maxent or maximum ignorance principle, you simply assume that the grades can be distributed in all possible ways—all possible combinations equally likely. For instance, one possible distribution is 100 A's and nothing else. Another would be all F's. There could be 20 students with each grade. You could have 50 C's, 20 B's and 20 D's, 5 A's and 5 F's. All the combinations sum to an ensemble of possibilities that constitutes the probability distribution corresponding to no knowledge—maximum ignorance—about the class and the kids and their grades.

  In statistical physics, this sort of thing is called the "canonical ensemble"—the set of possible states for the molecules in a system. Each possible combination is a microstate. Many different possible microstates (distributions of grades) can correspond to the same average (the macrostate).

  Don't try to list all the possible combinations; it would take you a very long time. (You're talking something close to 10 to the 70th power.) But you can calculate, or even see intuitively, that the most likely average grade will be C. Of all the possible microstate combinations, many more work out to a C average than to any other grade. There is only one way, for instance, to have a perfect A average—all 100 students getting A's. But you can get a C average in many different ways—100 C's, 50 A's and 50 F's, 20 students getting each of the five grades, and so on.10
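The counting argument can be checked directly on a class small enough to enumerate — four students instead of 100 (a hypothetical scale-down; the logic is the same). More microstates land on a C average than on any other.

```python
from itertools import product
from collections import Counter

points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
students = 4   # small enough to enumerate all 5**4 = 625 microstates

# Tally how many grade assignments (microstates) produce each class average.
averages = Counter(
    sum(points[g] for g in combo) / students
    for combo in product(points, repeat=students)
)

# The C average (2.0 grade points) is realized by more microstates than any other.
print(averages.most_common(3))
```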

  It's just like flipping pennies, four flips at a time, with the grade corresponding to the number of heads that turn up (0 = F, 4 = A). In 100 trials, many combinations give an average of 2, but only a few will give an average of 0 or 4. So your prediction, based on knowing nothing, will be an average grade of C.
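The penny-flip version is just the binomial coefficients for four flips:

```python
from math import comb

# Grade = number of heads in four flips (0 = F, ..., 4 = A).
ways = {heads: comb(4, heads) for heads in range(5)}
print(ways)  # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1} — two heads (a C) is likeliest
```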

  BACK TO THE GAME

  In game theory, a player's mixed strategy is also a probability distribution, much like grades or penny flips. Game theory is all about how to figure out what each player's best mixed strategy would be (for maximizing utility, or the payoff, of the game). In a multiplayer game, there is at least one mix of all players' mixed strategies for which no one player could do any better by changing strategies—the Nash equilibrium, game theory's most important foundational principle.

  But Nash's foundation of modern game theory has its cracks. While it's true that, as Nash showed, all games (with certain qualifications) have at least one Nash equilibrium, many games can have more than one. In those cases, game theory cannot predict which equilibrium point will be reached—you can't say what sets of mixed strategies the players will actually adopt in a real-world situation. And even if there is only one Nash equilibrium in a complicated game, it is typically beyond the capability of a committee of supercomputers to calculate what all the players' mixed strategies would have to be.

  In turn, that crack is exacerbated by a weakness in the cardinal assumption underlying traditional game theory—that the players are rational payoff maximizers with access to all the necessary information to calculate their payoffs. In a world where most people can't calculate the sales tax on a cheeseburger, that's a tall order. In real life, people are not "perfectly rational," capable of figuring out the best money-maximizing strategy for any strategy combination used by all the other competitors. So game theory appears to assume that each player can do what supercomputers can't. And in fact, almost everybody recognizes that such total rationality is unachievable. Modern approaches to game theory often assume, therefore, that rationality is limited or "bounded."

  Game theorists have devised various ways to deal with these limitations on Nash's original math. An enormous amount of research, of the highest caliber, has modified and elaborated game theory's original formulations into a system that corrects many of these initial "flaws." Much work has been done on understanding the limits of rationality, for instance. Nevertheless, many game theorists often cling to the idea that "solving a game" means finding an equilibrium—an outcome where all players achieve their maximum utility. Instead of thinking about what will happen when the players actually play a game, game theorists have been asking what the individual players should do to maximize their payoff.

  When I visited Wolpert at NASA Ames, a year after our conversation in Boston, he pointed out that the search for equilibrium amounts to viewing a game from the inside, from the viewpoint of one of the participants, instead of from the vantage point of an external scientist assessing the whole system. From the inside, there may be an optimal solution, but a scientist on the outside looking in should merely be predicting what will happen (not trying to win the game). If you look at it that way, you know you can never be sure how a game will end up. A science of game theory should therefore not be seeking a single answer, but a probability distribution of answers from which to make the best possible prediction of how the game will turn out, Wolpert insists. "It's going to be the case that whenever you are given partial information about a system, what must pop out at the other end is a distribution over possibilities, not a single answer."11

  In other words, scientists in the past were not really thinking about the game players as particles, at least not in the right way. If you think about it, you realize that no physicist computing the thermodynamic properties of a gas worries about what an individual molecule is doing. The idea is to figure out the bulk features of the whole collection of molecules. You can't know what each molecule is up to, but you can calculate, statistically, the macroscopic behavior of all the molecules combined. The parallel between games and gases should be clear. Statistical physicists studying gases don't know what individual molecules are doing, and game theorists don't know what individual players are thinking. But physicists do know how collections of molecules are likely to behave—statistically—and can make good predictions about the bulk properties of a gas. Similarly, game theorists ought to be able to make statistical predictions about what will happen in a game.

  This is, as Wolpert repeatedly emphasizes, the way science usually works. Scientists have limited information about the systems they are studying and try to make the best prediction possible given the information they have. And just as a player in a game has incomplete information about all the game's possible strategy combinations, the scientist studying the game has incomplete information about what the player knows and how the player thinks (remember that different individuals play games in different ways).

  All sciences face this sort of problem—knowing something about a system and then, based on that limited knowledge, trying to predict what's going to happen, Wolpert pointed out. "So how does science go about answering these questions? In every single scientific field of endeavor, what will come out of such an exercise is a probability distribution."12

  From this point of view, another sort of mixed strategy enters game theory. It's not just that the player has a mixed strategy, a probability distribution of possible moves from which to choose. The scientist describing the game also has a kind of "mixed strategy" of possible predictions about how the game will turn out.

  "When you think about it, it's obvious," Wolpert said. "If I give you a game of real human beings, no, you're not going to always have the same outcome. You're going to have more than one possible outcome…. It's not going to be the case they are always going to come up with the exact same set of mixed strategies. There's going to be a distribution over their mixed strategies, just like in any other scientific scenario."

  This is clearly taking game theory to another level. While each player has a mixed strategy, a probability distribution of pure strategies, the scientist describing the game should compute a probability distribution of all the players' mixed strategies. And how do you find those probability distributions of mixed strategies? By maximizing your ignorance, of course. If you want to treat game theory as though the people were particles, the best approach is to assume a probability distribution for their strategies that maximizes the uncertainty (or the entropy, in information theory terms). Using this approach, you don't need to assume that the players in a game have limits on their rationality; such limits naturally appear in the formulas that information theory provides. Given a probability distribution of possible outcomes for the game, then, you can choose which outcome to bet on using the principles of decision theory.

  "When you need a prediction, a probability distribution won't do," said Wolpert. "You have to decide to fire the missile or don't fire; turn left or right." The underlying axioms for the mathematical basis for making such a decision were worked out in the 1950s by Leonard Savage13 in some detail, but they boil down to something like Pascal's Wager. If you have a probability distribution of possible outcomes, but don't know enough to distill the possibilities down to a single prediction, you need to consider how much you have to lose (or to gain) if your decision is wrong (or right).

  "If you predict X, but the truth is Y, how much are you hurt? Or conversely, how much do you benefit?" Wolpert explained. "Certain kinds of mispredictions aren't going to hurt you very much, depending on what the truth is. But in other instances … your prediction of truth might cause all sorts of problems—you've now launched World War III."

  Decision theory dictates that you should make the prediction that minimizes your expected loss ("expected" signifying that the relative probabilities of the choices are taken into account—you average the magnitudes of loss over all the possibilities). Consequently, Wolpert observes, different individual observers would make different predictions about the outcome of a game, even if the probability distribution of possible outcomes is the same, because some people would have more to lose than others for certain incorrect predictions.
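A toy expected-loss calculation (all probabilities and loss values invented for illustration) shows how an asymmetric loss function can make the best prediction differ from the likeliest outcome:

```python
# Hypothetical loss table: loss[prediction][truth], echoing Wolpert's
# missile example. Failing to predict a launch is far costlier than a
# false alarm, so the numbers are deliberately lopsided.
outcomes = ["launch", "no_launch"]
prob = {"launch": 0.3, "no_launch": 0.7}             # distribution over outcomes
loss = {
    "launch":    {"launch": 0,   "no_launch": 10},   # false alarm: cheap
    "no_launch": {"launch": 100, "no_launch": 0},    # missed launch: disastrous
}

def expected_loss(prediction):
    """Average the loss of this prediction over all possible truths."""
    return sum(prob[truth] * loss[prediction][truth] for truth in outcomes)

best = min(outcomes, key=expected_loss)
print(best, {p: expected_loss(p) for p in outcomes})
# Predicting "launch" minimizes expected loss (7 vs 30), even though
# "no_launch" is the likelier outcome.
```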

  "In other words, for the exact same game, your decision as the external person making the prediction is going to vary depending on your loss function," he says. That means the best prediction about the outcome isn't some equilibrium point established within the game, but rather depends on "the person external to the game who's making the prediction about what's going to come out of it." And so sometimes the likeliest outcome of a game will not be a Nash equilibrium.

  But why not, if a Nash equilibrium represents the stable outcome where nobody has an incentive to change? It seems like people would keep changing their strategy until they had no incentive not to. But when game theory is cast in the information-theoretic equations of maximum entropy, the answer becomes clear. A term in the equations signifies the cost of computing the best strategy, and in a complicated game that cost is likely to be too high. In other words, a player attempting to achieve a maximum payoff must factor in the cost of computing what it takes to get that payoff. The player's utility is not just the expected payoff, but the expected payoff minus the cost of computing it.
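A hypothetical two-line comparison (the numbers are invented) makes the point: once the cost of computation is subtracted, a cheap heuristic can yield more net utility than the exact equilibrium strategy.

```python
# Net utility = expected payoff minus the cost of computing the strategy.
# Payoffs and costs below are illustrative, not from any real game.
strategies = {
    "simple heuristic":   {"payoff": 8.0,  "compute_cost": 0.5},
    "full Nash solution": {"payoff": 10.0, "compute_cost": 5.0},
}
net = {name: s["payoff"] - s["compute_cost"] for name, s in strategies.items()}
best = max(net, key=net.get)
print(best, net)  # the heuristic wins once computation cost is counted
```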

 
