
Algorithms to Live By


by Brian Christian and Tom Griffiths


  the definitive textbook on optimal stopping: That’s Ferguson, Optimal Stopping and Applications.

  2. EXPLORE/EXPLOIT

  “Make new friends”: Joseph Parry, “New Friends and Old Friends,” in The Best Loved Poems of the American People, ed. Hazel Felleman (Garden City, NY: Doubleday, 1936), 58.

  “life so rich and rare”: Helen Steiner Rice, “The Garden of Friendship,” in The Poems and Prayers of Helen Steiner Rice, ed. Virginia J. Ruehlmann (Grand Rapids, MI: Fleming H. Revell), 47.

  “You try to find spaces”: Scott Plagenhoef, personal interview, September 5, 2013.

  The odd name comes from: In a letter to Merrill Flood dated April 14, 1955 (available in the Merrill Flood archive at the University of Michigan), Frederick Mosteller tells the story of the origin of the name. Mosteller and his collaborator Robert Bush were working on mathematical models of learning—one of the earliest instances of what came to be known as mathematical psychology, informing the research that Tom does today. They were particularly interested in a series of experiments that had been done with a T-shaped maze, where animals are put into the maze at the bottom of the T and then have to decide whether to go left or right. Food—the payoff—may or may not appear on either side of the maze. To explore this behavior with humans they commissioned a machine with two levers that people could pull, which Mosteller dubbed the two-armed bandit. He then introduced the mathematical form of the problem to his colleagues, and it ultimately became generalized to the multi-armed bandit.

  A comprehensive introduction to multi-armed bandits appears in Berry and Fristedt, Bandit Problems. Our focus in this chapter is on bandits where each arm either produces a payoff or doesn’t, with different probabilities but the same payoff amount on all arms. This is known as a Bernoulli bandit in the literature, as the probability distribution that describes a coin flip is called the Bernoulli distribution (after the seventeenth-century Swiss mathematician Jacob Bernoulli). Other kinds of multi-armed bandits are also possible, with unknown distributions of different kinds characterizing the payoffs from each arm.
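
  To make the Bernoulli bandit concrete, here is a minimal Python sketch, not drawn from any of the sources above: each arm pays off, or doesn’t, with its own fixed but hidden probability. The class name and the example probabilities are illustrative assumptions.

```python
import random

class BernoulliBandit:
    """A Bernoulli bandit: each arm pays off (1) with its own fixed, hidden probability."""

    def __init__(self, probabilities):
        self.probabilities = probabilities  # true payoff probability for each arm

    def pull(self, arm):
        """Pull one arm; return 1 with that arm's payoff probability, otherwise 0."""
        return 1 if random.random() < self.probabilities[arm] else 0

# Example: a two-armed Bernoulli bandit with payoff probabilities 0.4 and 0.7.
bandit = BernoulliBandit([0.4, 0.7])
print(sum(bandit.pull(1) for _ in range(100)), "payoffs in 100 pulls of the second arm")
```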

  how good the second machine might actually be: The “myopic” strategy of pulling the arm with higher expected value is actually optimal in some cases. Bradt, Johnson, and Karlin, “On Sequential Designs for Maximizing the Sum of N Observations,” showed that if the probabilities of a payoff for a two-armed bandit (with p1 for one arm, p2 for the other) satisfy p1 + p2 = 1, then this strategy is optimal. They conjectured that this also holds for pairs of probabilities where (p1, p2) either take on the values (a, b) or (b, a) (i.e., if p1 is a, then p2 is b, and vice versa). This was proved to be true by Feldman, “Contributions to the ‘Two-Armed Bandit’ Problem.” Berry and Fristedt, Bandit Problems, has further details on myopic strategies, including a result showing that choosing the highest expected value is optimal when p1 and p2 are restricted to take on just two possible values (e.g., either or both of p1 or p2 could be 0.4 or 0.7, but we don’t know which of these possibilities is true).
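
  As an illustration of the myopic strategy itself (a sketch rather than anything taken from these papers): always pull whichever arm currently has the higher estimated payoff probability. The Beta(1,1)-style estimate, the example probabilities, and the number of pulls are all assumptions made for the example.

```python
import random

def myopic_play(true_probs, pulls=1000):
    """Always pull the arm with the higher current estimate of its payoff probability.

    Estimates here are (successes + 1) / (plays + 2), i.e. the mean of a Beta(1,1)
    posterior (an assumed choice for this sketch, not something from the papers cited)."""
    k = len(true_probs)
    successes, plays = [0] * k, [0] * k
    total = 0
    for _ in range(pulls):
        estimates = [(successes[j] + 1) / (plays[j] + 2) for j in range(k)]
        arm = estimates.index(max(estimates))        # the "myopic" choice
        reward = 1 if random.random() < true_probs[arm] else 0
        successes[arm] += reward
        plays[arm] += 1
        total += reward
    return total

print(myopic_play([0.4, 0.6]), "payoffs from 1000 myopic pulls")
```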

  “embodies in essential form”: Whittle, Optimization over Time.

  “Eat, drink, and be merry”: “Eat, drink, and be merry, for tomorrow we die,” an idiom in common parlance and in pop culture (e.g., forming the chorus of “Tripping Billies” by the Dave Matthews Band, among many other references), appears to be a conflation of two biblical verses: Ecclesiastes 8:15 (“A man hath no better thing under the sun, than to eat, and to drink, and to be merry”) and Isaiah 22:13 (“Let us eat and drink, for tomorrow we die”).

  “why take the risk?”: Chris Stucchio, personal interview, August 15, 2013.

  “a sixth helping of X-Men”: Nick Allen, “Hollywood makes 2013 the year of the sequel,” Telegraph, http://www.telegraph.co.uk/culture/film/film-news/9770154/Hollywood-makes-2013-the-year-of-the-sequel.html. See also http://www.shortoftheweek.com/2012/01/05/has-hollywood-lost-its-way/ and http://boxofficemojo.com/news/?id=3063.

  Profits of the largest film studios declined: “Between 2007 and 2011, pre-tax profits of the five studios controlled by large media conglomerates (Disney, Universal, Paramount, Twentieth Century Fox and Warner Bros) fell by around 40%, says Benjamin Swinburne of Morgan Stanley.” In “Hollywood: Split Screens,” Economist, February 23, 2013, http://www.economist.com/news/business/21572218-tale-two-tinseltowns-split-screens.

  ticket sales have declined: Statistics from http://pro.boxoffice.com/statistics/yearly and http://www.the-numbers.com/market/. See also Max Willens, “Box Office Ticket Sales 2014: Revenues Plunge to Lowest in Three Years,” International Business Times, January 5, 2015.

  “Squeezed between rising costs”: “Hollywood: Split Screens,” Economist, February 23, 2013, http://www.economist.com/news/business/21572218-tale-two-tinseltowns-split-screens.

  “the ultimate instrument of intellectual sabotage”: Whittle’s comment on the difficulty of bandit problems appears in his discussion of Gittins, “Bandit Processes and Dynamic Allocation Indices.”

  Robbins proved in 1952: Robbins, “Some Aspects of the Sequential Design of Experiments,” introduces the Win-Stay, Lose-Shift algorithm.
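
  A minimal sketch of Win-Stay, Lose-Shift, which behaves just as the name suggests: keep pulling an arm as long as it pays off, and switch as soon as it doesn’t. The rule for which arm to shift to (here, simply the next one in order) and the example probabilities are illustrative assumptions.

```python
import random

def win_stay_lose_shift(true_probs, pulls=1000):
    """Win-Stay, Lose-Shift: repeat the last arm after a payoff, switch after a failure."""
    n_arms = len(true_probs)
    arm = random.randrange(n_arms)                   # start on an arbitrary arm
    total = 0
    for _ in range(pulls):
        reward = 1 if random.random() < true_probs[arm] else 0
        total += reward
        if reward == 0:
            arm = (arm + 1) % n_arms                 # lose-shift: move to the next arm
        # win-stay: after a payoff, keep pulling the same arm
    return total

print(win_stay_lose_shift([0.4, 0.6]), "payoffs from 1000 pulls")
```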

  Following Robbins, a series of papers: Bradt, Johnson, and Karlin, “On Sequential Designs for Maximizing the Sum of N Observations,” showed that “stay on a winner” is always true where the probability of a payoff is unknown for one arm but known for the other. Berry, “A Bernoulli Two-Armed Bandit,” proved that the principle is always true for a two-armed bandit. Generalizations of this result (and a characterization of the cases where it doesn’t apply) appear in Berry and Fristedt, Bandit Problems.

  exactly how many options and opportunities: This solution to the “finite horizon” version of the multi-armed bandit problem is presented in Bellman’s magnum opus Dynamic Programming, a book that is impressive as the starting point (and sometimes endpoint) of a number of topics in optimization and machine learning. Among other uses, dynamic programming can efficiently solve problems that require backward induction—which we also encountered briefly in chapter 1 in the context of the full-information game.
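
  To make the backward-induction idea concrete, here is a rough sketch that solves a small finite-horizon two-armed Bernoulli bandit by dynamic programming. The uniform Beta(1,1) priors, the state representation, and the horizon are assumptions of the sketch, not details from Bellman’s treatment.

```python
from functools import lru_cache

def optimal_expected_payoff(horizon):
    """Expected payoff of optimal play on a two-armed Bernoulli bandit with `horizon`
    pulls left, assuming uniform Beta(1,1) priors on both arms. Each state records the
    successes and failures seen on each arm; values are filled in by backward induction."""

    @lru_cache(maxsize=None)
    def value(s1, f1, s2, f2, remaining):
        if remaining == 0:
            return 0.0
        # Expected value of pulling arm 1, given its posterior mean payoff probability.
        p1 = (s1 + 1) / (s1 + f1 + 2)
        arm1 = (p1 * (1 + value(s1 + 1, f1, s2, f2, remaining - 1))
                + (1 - p1) * value(s1, f1 + 1, s2, f2, remaining - 1))
        # Expected value of pulling arm 2.
        p2 = (s2 + 1) / (s2 + f2 + 2)
        arm2 = (p2 * (1 + value(s1, f1, s2 + 1, f2, remaining - 1))
                + (1 - p2) * value(s1, f1, s2, f2 + 1, remaining - 1))
        return max(arm1, arm2)

    return value(0, 0, 0, 0, horizon)

print(optimal_expected_payoff(10))  # expected payoff from 10 optimally chosen pulls
```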

  “a byword for intransigence”: Introduction to Gittins, “Bandit Processes and Dynamic Allocation Indices.”

  “would be a pretty good approximation”: John Gittins, personal interview, August 27, 2013.

  Deal or No Deal: The many worldwide incarnations of this game show began with the Dutch show Miljoenenjacht, which first aired in 2000.

  the multi-armed bandit problem is no different: Previous researchers had also found solutions for this “one-armed bandit” problem over a fixed interval (Bellman, “A Problem in the Sequential Design of Experiments”; Bradt, Johnson, and Karlin, “On Sequential Designs for Maximizing the Sum of N Observations”).

  maximizing a single quantity that accounts for both: The ideas behind the Gittins index were first presented at a conference in 1972 and appeared in the proceedings as Gittins and Jones, “A Dynamic Allocation Index for the Sequential Design of Experiments,” but the canonical presentation is Gittins, “Bandit Processes and Dynamic Allocation Indices.”

  we provide the Gittins index values: The table of Gittins index scores for the Bernoulli bandit was taken from Gittins, Glazebrook, and Weber, Multi-Armed Bandit Allocation Indices, which is a comprehensive guide to the topic. It assumes complete ignorance about the probability of a payoff.
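
  For the curious, here is a rough sketch of how Gittins index values like those in the table can be approximated numerically, using the standard indifference idea: the index of an arm is the payoff rate of a steady, known alternative at which you would be exactly indifferent between pulling the uncertain arm once more and retiring to the alternative forever. The uniform prior, the 90% discount rate, the recursion cutoff, and the bisection tolerance are all assumptions of the sketch, and its output is an approximation rather than an exact table entry.

```python
from functools import lru_cache

def gittins_index(successes, failures, discount=0.9, depth=100, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm whose record so far is `successes`
    wins and `failures` losses, starting from complete ignorance (a uniform prior),
    under geometric discounting. The index is found by bisection on the payoff rate
    `lam` of a known "standard" arm, using a depth-limited dynamic program."""

    def advantage_of_pulling(lam):
        retire_forever = lam / (1 - discount)        # value of switching to the standard arm for good

        @lru_cache(maxsize=None)
        def value(s, f, d):
            p = (s + 1) / (s + f + 2)                # posterior mean payoff probability
            if d == 0:
                return max(retire_forever, p / (1 - discount))  # crude cutoff approximation
            keep = (p * (1 + discount * value(s + 1, f, d - 1))
                    + (1 - p) * discount * value(s, f + 1, d - 1))
            return max(retire_forever, keep)

        p = (successes + 1) / (successes + failures + 2)
        pull_once = (p * (1 + discount * value(successes + 1, failures, depth))
                     + (1 - p) * discount * value(successes, failures + 1, depth))
        return pull_once - retire_forever

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if advantage_of_pulling(mid) > 0:
            lo = mid                                 # still worth a pull, so the index is higher
        else:
            hi = mid
    return (lo + hi) / 2

print(round(gittins_index(0, 0), 4))  # index of a never-tried arm at a 90% discount rate
```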

  drives us toward novelty: Taking this to an extreme results in one simple strategy called the Least Failures Rule: always choose the option that’s failed the fewest number of times. So, landing in a new city, pick a restaurant at random. If it is good, stick with it. As soon as it fails to satisfy, choose at random from the other restaurants. Continue this process until all restaurants have failed to satisfy once, then go back to the restaurant with the most nights of successful dining and repeat. This strategy builds on the win-stay principle, and it’s precisely what the Gittins index yields if you’re the patient sort who values tomorrow’s payoff as being essentially as good as today’s. (The rule appears in Kelly, “Multi-Armed Bandits with Discount Factor Near One”; formally, it is optimal under geometric discounting in the limit as the discount rate approaches 1.) In a big city with plenty of new restaurants opening all the time, a Least Failures policy says quite simply that if you’re ever let down, there’s too much else out there; don’t go back.
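
  A minimal sketch of the Least Failures Rule as just described: each night, choose an option that has disappointed you the fewest times so far, breaking ties in favor of the most successful option and then last night’s choice. The example probabilities and the tie-breaking details are illustrative assumptions.

```python
import random

def least_failures(true_probs, nights=1000):
    """Least Failures Rule: each night, choose an option that has disappointed the fewest
    times so far; ties go to the most successful option, then to last night's choice."""
    n = len(true_probs)
    failures, successes = [0] * n, [0] * n
    total, last = 0, None
    for _ in range(nights):
        fewest = min(failures)
        candidates = [j for j in range(n) if failures[j] == fewest]
        best = max(successes[j] for j in candidates)
        candidates = [j for j in candidates if successes[j] == best]
        choice = last if last in candidates else random.choice(candidates)
        if random.random() < true_probs[choice]:
            successes[choice] += 1
            total += 1
        else:
            failures[choice] += 1
        last = choice
    return total

print(least_failures([0.3, 0.5, 0.8]), "satisfying nights out of 1000")
```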

  a variety of experiments in behavioral economics: See, for example, Kirby, “Bidding on the Future.”

  if there’s a cost to switching: This case is analyzed in Banks and Sundaram, “Switching Costs and the Gittins Index.”

  “Regrets, I’ve had a few”: Frank Sinatra, “My Way,” from My Way (1969), lyrics by Paul Anka.

  “For myself I am an optimist”: Prime Minister Winston Churchill, speech, Lord Mayor’s Banquet, London, November 9, 1954. Printed in Churchill, Winston S. Churchill: His Complete Speeches.

  “To try and fail is at least to learn”: Barnard, The Functions of the Executive.

  “wanted to project myself forward to age 80”: Jeff Bezos, interview with the Academy of Achievement, May 4, 2001, http://www.achievement.org/autodoc/page/bez0int-3.

  several key points about regret: Lai and Robbins, “Asymptotically Efficient Adaptive Allocation Rules.”

  the guarantee of minimal regret: Ibid. offered the first such algorithms, which were refined by Katehakis and Robbins, “Sequential Choice from Several Populations”; Agrawal, “Sample Mean Based Index Policies”; and Auer, Cesa-Bianchi, and Fischer, “Finite-Time Analysis of the Multiarmed Bandit Problem,” among others. The latter present perhaps the simplest strategy of this kind, which is to assign arm j a score of sj/nj + √(2 ln n/nj), where sj is the number of successes out of nj plays on that arm, and n = Σjnj is the total number of plays of all arms. This score is an upper confidence bound on the arm’s probability of a successful payoff (whose estimate from the data is just sj/nj). Choosing the arm with the highest score guarantees logarithmic regret (although there are tweaks to this score that result in better performance in practice).
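
  A minimal sketch of this scoring rule, without the practical tweaks mentioned above: try each arm once so that every score is defined, then always pull the arm with the highest score. The example probabilities and the number of pulls are assumptions.

```python
import math
import random

def ucb1(true_probs, pulls=1000):
    """Pull the arm with the highest score s_j/n_j + sqrt(2 * ln(n) / n_j),
    after first trying every arm once so that each score is defined."""
    k = len(true_probs)
    s = [0] * k        # s_j: successes on arm j
    n_j = [0] * k      # n_j: plays of arm j
    total = 0

    def pull(j):
        nonlocal total
        reward = 1 if random.random() < true_probs[j] else 0
        s[j] += reward
        n_j[j] += 1
        total += reward

    for j in range(k):                     # one initial pull of each arm
        pull(j)
    for _ in range(pulls - k):
        n = sum(n_j)                       # total plays so far
        scores = [s[j] / n_j[j] + math.sqrt(2 * math.log(n) / n_j[j]) for j in range(k)]
        pull(scores.index(max(scores)))
    return total

print(ucb1([0.4, 0.6]), "payoffs from 1000 pulls")
```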

  known as the “confidence interval”: Confidence intervals originate with Neyman, “Outline of a Theory of Statistical Estimation.”

  “optimism in the face of uncertainty”: Kaelbling, Littman, and Moore, “Reinforcement Learning.”

  “optimistic robots”: Leslie Kaelbling, personal interview, November 22, 2013. See Kaelbling, Learning in Embedded Systems.

  $57 million of additional donations: Siroker and Koomen, A/B Testing.

  A/B testing works as follows: Christian, “The A/B Test.” Also informed by Steve Hanov, personal interview, August 30, 2013, and Noel Welsh, personal interview, August 27, 2013.

  In the case of Obama’s donation page: Dan Siroker, “How We Used Data to Win the Presidential Election” (lecture), Stanford University, May 8, 2009, available at https://www.youtube.com/watch?v=71bH8z6iqSc. See also Siroker, “How Obama Raised $60 Million,” https://blog.optimizely.com/2010/11/29/how-obama-raised-60-million-by-running-a-simple-experiment/.

  live A/B tests on their users: Google’s first A/B test was run on February 27, 2000. See, e.g., Christian, “The A/B Test.”

  Companies A/B test their site navigation: See, e.g., Siroker and Koomen, A/B Testing.

  tested forty-one shades of blue: Laura M. Holson, “Putting a Bolder Face on Google,” New York Times, February 28, 2009.

  “how to make people click ads”: Ashlee Vance, “This Tech Bubble Is Different,” Bloomberg Businessweek, April 14, 2011. http://www.bloomberg.com/bw/magazine/content/11_17/b4225060960537.htm.

  “destroyed by madness”: Ginsberg, Howl and Other Poems.

  $50 billion in annual revenue: Google’s finances are detailed in their quarterly shareholder reports. Reported 2013 advertising revenue was $50.6 billion, roughly 91% of total revenue of $55.6 billion. See https://investor.google.com/financial/2013/tables.html.

  online commerce comprises hundreds of billions: Online sales estimated by Forrester Research. See, for instance, “US Online Retail Sales to Reach $370B By 2017; €191B in Europe,” Forbes, March 14, 2013, http://www.forbes.com/sites/forrester/2013/03/14/us-online-retail-sales-to-reach-370b-by-2017-e191b-in-europe/.

  best algorithms to use remain hotly contested: Chris Stucchio, for instance, penned a cutting article titled “Why Multi-armed Bandit Algorithms Are Superior to A/B Testing,” which was then countered by an equally cutting article called “Don’t Use Bandit Algorithms—They Probably Won’t Work for You”—also written by Chris Stucchio. See https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html and https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html. Stucchio’s 2012 post was written partly in reference to an article by Paras Chopra titled “Why Multi-armed Bandit Algorithm Is Not ‘Better’ than A/B Testing” (https://vwo.com/blog/multi-armed-bandit-algorithm/), which was itself written partly in reference to an article by Steve Hanov titled “20 lines of code that will beat A/B testing every time” (http://stevehanov.ca/blog/index.php?id=132).

  it appeared in the Washington Star: Jean Heller, “Syphilis Patients Died Untreated,” Washington Star, July 25, 1972.

  document known as the Belmont Report: The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research, April 18, 1979. Available at http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.html.

  proposed conducting “adaptive” trials: See Zelen, “Play the Winner Rule and the Controlled Clinical Trial.” While this was a radical idea, Zelen wasn’t the first to propose it. That honor goes to William R. Thompson, an instructor in the School of Pathology at Yale, who formulated the problem of identifying whether one treatment is more effective than another, and proposed his own solution, in 1933 (Thompson, “On the Likelihood That One Unknown Probability Exceeds Another”).

  The solution that Thompson proposed—randomly sampling options, where the probability of choosing an option corresponds to the probability that it is the best based on the evidence observed so far—is the basis for much recent work on this problem in machine learning (we return to the algorithmic uses of randomness and sampling in chapter 9).
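
  A minimal sketch of Thompson’s randomized rule for Bernoulli arms, in the form now usually called Thompson sampling: draw a plausible payoff rate for each option from its posterior and choose the option whose draw is highest, so that each option gets chosen with roughly the probability that it is the best one. The uniform Beta(1,1) priors and the example probabilities are assumptions.

```python
import random

def thompson_sampling(true_probs, pulls=1000):
    """Thompson sampling for Bernoulli arms: sample a payoff rate for each arm from its
    Beta posterior and pull the arm with the highest draw."""
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    total = 0
    for _ in range(pulls):
        draws = [random.betavariate(successes[j] + 1, failures[j] + 1) for j in range(k)]
        arm = draws.index(max(draws))
        if random.random() < true_probs[arm]:
            successes[arm] += 1
            total += 1
        else:
            failures[arm] += 1
    return total

print(thompson_sampling([0.4, 0.6]), "payoffs from 1000 pulls")
```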

  Neither Frederick Mosteller nor Herbert Robbins seemed to be aware of Thompson’s work when they started to work on the two-armed bandit problem. Richard Bellman found the “little-known papers” a few years later, noting that “We confess that we found these papers in the standard fashion, namely while thumbing through a journal containing another paper of interest” (Bellman, “A Problem in the Sequential Design of Experiments”).

  ECMO saved the life of a newborn girl: University of Michigan Department of Surgery, “‘Hope’ for ECMO Babies,” http://surgery.med.umich.edu/giving/stories/ecmo.shtml.

  has now celebrated her fortieth birthday: University of Michigan Health System, “U-M Health System ECMO team treats its 2,000th patient,” March 1, 2011, http://www.uofmhealth.org/news/ECMO%202000th%20patient.

  early studies in adults: Zapol et al., “Extracorporeal Membrane Oxygenation in Severe Acute Respiratory Failure.”

  a study on newborns: Bartlett et al., “Extracorporeal Circulation in Neonatal Respiratory Failure.”

  “did not justify routine use of ECMO”: Quotation from Ware, “Investigating Therapies of Potentially Great Benefit: ECMO,” referring to conclusions in Ware and Epstein, “Comments on ‘Extracorporeal Circulation in Neonatal Respiratory Failure,’” which is in turn a comment on Bartlett et al., “Extracorporeal Circulation in Neonatal Respiratory Failure.”

  “difficult to defend further randomization ethically”: Ware, “Investigating Therapies of Potentially Great Benefit: ECMO.”

  one of the world’s leading experts: It was Berry, in his 1971 PhD dissertation, who proved that staying on a winner is optimal. The result was published as Berry, “A Bernoulli Two-Armed Bandit.”

  “Ware study should not have been conducted”: Berry, “Comment: Ethics and ECMO.”

  nearly two hundred infants in the United Kingdom: UK Collaborative ECMO Group, “The Collaborative UK ECMO Trial.”

  clinical trials for a variety of cancer treatments: Don Berry, personal interview, August 22, 2013.

  the FDA released a “guidance” document: The FDA’s “Adaptive Design Clinical Trials for Drugs and Biologics” from February 2010 can be found at http://www.fda.gov/downloads/Drugs/Guidances/ucm201790.pdf.

  shown a box with two lights on it: The study appears in Tversky and Edwards, “Information Versus Reward in Binary Choices.”

  two airlines: Meyer and Shi, “Sequential Choice Under Ambiguity.”

  an experiment with a four-armed bandit: Steyvers, Lee, and Wagenmakers, “A Bayesian Analysis of Human Decision-Making on Bandit Problems.”

  what has been termed a “restless bandit”: Restless bandits were introduced by Whittle, “Restless Bandits,” which discusses a strategy similar to the Gittins index that can be used in some cases. The computational challenges posed by restless bandits—and the consequent pessimism about efficient optimal solutions—are discussed in Papadimitriou and Tsitsiklis, “The Complexity of Optimal Queuing Network Control.”

  when the world can change: Navarro and Newell, “Information Versus Reward in a Changing World,” provides recent results supporting the idea that human over-exploration is a result of assuming the world is restless.

  “There is in fact a sort of harmony”: Thoreau, “Walking.”

  “A Coke is a Coke”: Warhol, The Philosophy of Andy Warhol.

  “a developmental way of solving the exploration/exploitation tradeoff”: Alison Gopnik, personal interview, August 22, 2013. See also Gopnik, The Scientist in the Crib.

  “a juncture in my reading life”: Lydia Davis, “Someone Reading a Book,” Can’t and Won’t: Stories.

  challenging our preconceptions about getting older: Carstensen, “Social and Emotional Patterns in Adulthood” presents the basic “socioemotional selectivity theory” we discuss in this section, as well as some of the evidence for it.

  “lifelong selection processes”: Ibid.

  about to move across the country: Fredrickson and Carstensen, “Choosing Social Partners.”

 
