
Algorithms to Live By


by Brian Christian


  An immediate payoff was the consummate ease with which we could communicate problem types. Visitors to our offices were sometimes baffled to hear exchanges such as: “Since one-arejay-sum-ceejay is NP-hard, does that imply that one-preemption-arejay-sum-ceejay is NP-hard, too?” “No, that’s easy, remember?” “Well, one-deejay-sum-ceejay is easy and that implies one-preemption-deejay-sum-ceejay is easy, so what do we know about one-preemption-arejay-deejay-sum-ceejay?” “Nothing.”

  (In formal notation: “Since 1|rj|∑ Cj is NP-hard, does that imply that 1|pmtn, rj|∑ Cj is NP-hard, too?” “No, that’s easy, remember?” “Well, 1|dj|∑ Cj is easy and that implies 1|pmtn, dj|∑ Cj is easy, so what do we know about 1|pmtn, rj, dj|∑ Cj?” “Nothing” [Lawler et al., “A Gift for Alexander!”; see also Lawler, “Old Stories”].)

  the problem becomes intractable: In fact, it’s equivalent to the “knapsack problem,” computer science’s most famously intractable problem about how to fill space. The connection between this scheduling problem and the knapsack problem appears in Lawler, “Scheduling a Single Machine to Minimize the Number of Late Jobs.”

  a certain time to start some of your tasks: What we are calling “start times” are referred to in the literature (we think somewhat ambiguously) as “release times.” Lenstra, Rinnooy Kan, and Brucker, “Complexity of Machine Scheduling Problems,” showed that both minimizing sum of completion times and minimizing maximal lateness with arbitrary release times are NP-hard. The case of minimizing the number of late jobs with arbitrary release times is discussed in Lawler, “Scheduling a Single Machine to Minimize the Number of Late Jobs.”

  A recent survey: Lawler et al., “Sequencing and Scheduling.” The most recent version of this list is available at http://www.informatik.uni-osnabrueck.de/knust/class/.

  with a fairly straightforward modification: The effect of preemption on minimizing maximal lateness with release times is analyzed in Baker et al., “Preemptive Scheduling of a Single Machine.” The problem of minimizing the sum of completion times with release times and preemption is analyzed in Schrage, “A Proof of the Optimality of the Shortest Remaining Processing Time Discipline” and Baker, Introduction to Sequencing and Scheduling.

  still the preemptive version of Earliest Due Date: The result for minimizing expected maximum lateness by choosing the job with earliest due date is discussed in Pinedo, Scheduling.

  the preemptive version of Shortest Processing Time: The effectiveness of choosing the job with the weighted shortest expected processing time for minimizing the sum of weighted completion times in a dynamic setting (provided the estimate of the time to complete a job is nonincreasing in the duration worked on that job) was shown by Sevcik, “Scheduling for Minimum Total Loss Using Service Time Distributions,” as part of a more general strategy for dynamic scheduling.

  sum of the weighted lateness of those jobs: Pinedo, “Stochastic Scheduling with Release Dates and Due Dates,” showed that this algorithm is optimal for these problems under the (fairly strong) assumption that the times of jobs follow a memoryless distribution, which means that your estimate of how long they will take remains constant no matter how long you have been doing them. In stochastic scheduling, optimal algorithms won’t necessarily be ideal for every possible workload, but rather minimize the expected values of their relevant metrics.

  “Replace ‘plan’ with ‘guess’”: Jason Fried, “Let’s just call plans what they are: guesses,” July 14, 2009, https://signalvnoise.com/posts/1805-lets-just-call-plans-what-they-are-guesses.

  “You must not get off the train”: Ullman, “Out of Time.”

  can include both delays and errors: Monsell, “Task Switching.”

  “I’ll just do errands instead”: Kirk Pruhs, personal interview, September 4, 2014.

  “You have part of my attention”: The Social Network, screenplay by Aaron Sorkin; Columbia Pictures, 2010.

  “Nobody knew anything about that”: Peter Denning, personal interview, April 22, 2014.

  “caused a complete collapse of service”: Denning, “Thrashing: Its Causes and Prevention.”

  “The caches are warm for the current workload”: Peter Zijlstra, personal interview, April 17, 2014.

  any situation where the system grinds to a halt: Thrashing can also take place in database systems, where the competition between different processes to acquire “locks” to access the database can swamp the system’s ability to let the processes currently holding the locks get anything done. Similarly, thrashing can appear in networking contexts, where a cacophony of different signals competing for the network channel can prevent anything at all from getting through. We’ll take a closer look at the latter scenario in chapter 10.

  replaced their scheduler: The “O(n) Scheduler” used by Linux starting with version 2.4 in 2001 sorted all processes by priority, which took longer the more processes there were. This was scrapped in favor of the “O(1) Scheduler” starting with Linux 2.6 in 2003, which bucket-sorted all processes into a predetermined number of buckets, regardless of how many processes there were. However, doing this bucket sort required computing complex heuristics, and beginning with Linux 2.6.23 in 2007, the “O(1) Scheduler” was replaced with the even more straightforward “Completely Fair Scheduler.”
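
  As a rough illustration of why bucketing matters (a toy sketch in Python, not the kernel’s actual code; the 140 priority levels echo the real scheduler, but the process names and helper functions are invented):

from collections import deque

# Toy sketch: one queue ("bucket") per priority level, so picking the next
# process means scanning a fixed number of buckets rather than sorting the
# processes themselves, regardless of how many processes are runnable.
NUM_PRIORITIES = 140
buckets = [deque() for _ in range(NUM_PRIORITIES)]

def enqueue(process, priority):
    """File a process into its priority bucket: constant time."""
    buckets[priority].append(process)

def pick_next():
    """Scan the buckets (not the processes) for the most urgent nonempty one."""
    for queue in buckets:
        if queue:
            return queue.popleft()
    return None

enqueue("editor", 10)
enqueue("backup", 120)
enqueue("compiler", 10)
print(pick_next())  # -> 'editor': first into the most urgent nonempty bucket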

  In Linux this minimum useful slice: This value is defined in the Linux kernel’s “Completely Fair Scheduler” in the variable sysctl_sched_min_granularity.

  Methods such as “timeboxing” or “pomodoros”: Timeboxing has been written about widely in the context of the management of software development teams; the term “timeboxing” appears to originate with Zahniser, “Timeboxing for Top Team Performance.” The “Pomodoro Technique,” whose name comes from a tomato-shaped kitchen timer (the Italian word for tomato being pomodoro), was devised by Francesco Cirillo in the late 1980s and has been taught by Cirillo starting in 1998. See, e.g., Cirillo, The Pomodoro Technique.

  programmers have turned to psychology: E.g., Peter Zijlstra, personal interview, April 17, 2014.

  Computers themselves do something like this: Linux added support for timer coalescing in 2007; Microsoft included it in Windows starting with Windows 7 in 2009; and Apple followed suit in OS X Mavericks in 2013.

  “just a one-line bug in your algorithm”: Peter Norvig, personal interview, September 17, 2014.

  “I don’t swap in and out”: Shasha and Lazere, Out of Their Minds, 101.

  “my role is to be on the bottom of things”: Donald Knuth, “Knuth versus Email,” http://www-cs-faculty.stanford.edu/~uno/email.html.

  6. BAYES’S RULE

  “All human knowledge is uncertain”: Bertrand Russell, Human Knowledge: Its Scope and Limits, 1948, p. 527.

  There he saw the Berlin Wall: Gott, “Implications of the Copernican Principle for Our Future Prospects.”

  “The Unreasonable Effectiveness of Data”: The talk was derived from Halevy, Norvig, and Pereira, “The Unreasonable Effectiveness of Data.”

  “these arguments must be probable only”: An Enquiry Concerning Human Understanding, §IV, “Sceptical Doubts Concerning the Operations of the Understanding.”

  Bayes’s own history: Our brief biography draws on Dale, A History of Inverse Probability, and Bellhouse, “The Reverend Thomas Bayes.”

  in either 1746, ’47, ’48, or ’49: Bayes’s legendary paper, undated, had been filed between a pair of papers dated 1746 and 1749. See, e.g., McGrayne, The Theory That Would Not Die.

  defense of Newton’s newfangled “calculus”: An Introduction to the Doctrine of fluxions, and Defence of the Mathematicians against the Objections of the Author of the analyst, so far as they are assigned to affect their general methods of Reasoning.

  “deserves to be preserved”: Introduction to Bayes, “An Essay Towards Solving a Problem in the Doctrine of Chances.”

  “the proportion of Blanks to Prizes”: Appendix to ibid.

  we need to first reason forward: To be precise, Bayes was arguing that given hypotheses h and some observed data d, we should evaluate those hypotheses by calculating the likelihood p(d|h) for each h. (The notation p(d|h) means the “conditional probability” of d given h—that is, the probability of observing d if h is true.) To convert this back into a probability of each h being true, we then divide by the sum of these likelihoods.
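
  A minimal sketch of that forward-then-backward calculation (the hypotheses and likelihood values below are invented for illustration, and every hypothesis is treated as equally plausible a priori, as Bayes assumed):

# Reason forward: compute the likelihood p(d|h) of the observed data under
# each hypothesis, then divide by their sum to get back to probabilities.
# Example data d: a coin comes up heads twice in a row.
likelihoods = {
    "fair coin":  0.25,  # p(two heads | fair coin)
    "two-headed": 1.00,  # p(two heads | always heads)
    "two-tailed": 0.00,  # p(two heads | never heads)
}

total = sum(likelihoods.values())
posterior = {h: p / total for h, p in likelihoods.items()}
print(posterior)  # {'fair coin': 0.2, 'two-headed': 0.8, 'two-tailed': 0.0}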

  Laplace was born in Normandy: For more details on Laplace’s life and work, see Gillispie, Pierre-Simon Laplace.

  distilled down to a single estimate: Laplace’s Law is derived by working through the calculation suggested by Bayes—the tricky part is the sum over all hypotheses, which involves a fun application of integration by parts. You can see a full derivation of Laplace’s Law in Griffiths, Kemp, and Tenenbaum, “Bayesian Models of Cognition.” From the perspective of modern Bayesian statistics, Laplace’s Law is the posterior mean of the binomial rate using a uniform prior.
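
  A one-line sketch of the resulting rule, (w+1)⁄(n+2), and the estimates it gives for a couple of small records (the function name is ours):

def laplace_law(wins, attempts):
    """Laplace's Law: expected success rate after `wins` successes in
    `attempts` tries, assuming every underlying rate was equally likely."""
    return (wins + 1) / (attempts + 2)

print(laplace_law(1, 1))  # 0.666..., the 1-0 record discussed in the next note
print(laplace_law(0, 3))  # 0.2: three tries, no successes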

  If you try only once and it works out: You may recall that in our discussion of multi-armed bandits and the explore/exploit dilemma in chapter 2, we also touched on estimates of the success rate of a process—a slot machine—based on a set of experiences. The work of Bayes and Laplace undergirds many of the algorithms we discussed in that chapter, including the Gittins index. Like Laplace’s Law, the values of the Gittins index we presented there assumed that any probability of success is equally likely. This implicitly takes the expected overall win rate for a slot machine with a 1–0 record to be two-thirds.

  “no more consistent or conceivable than the rest”: An Enquiry Concerning Human Understanding, §IV, “Sceptical Doubts Concerning the Operations of the Understanding.”

  the real heavy lifting was done by Laplace: In fairness, an influential 1950 paper (Bailey, Credibility Procedures) referred to “Laplace’s Generalization of Bayes’s Rule,” but it didn’t quite stick. Discoveries being named after somebody other than their discoverer is a sufficiently common phenomenon that statistician and historian Stephen Stigler has asserted that it should be considered an empirical law—Stigler’s Law of Eponymy. Of course, Stigler wasn’t the first person to discover this; he assigns the credit to sociologist Robert K. Merton. See Stigler, “Stigler’s Law of Eponymy.”

  multiply their probabilities together: For the mathematically inclined, here’s the full version of Bayes’s Rule. We want to calculate how much probability to assign a hypothesis h given data d. We have prior beliefs about the probability of that hypothesis being true, expressed in a prior distribution p(h). What we want to compute is the “posterior” distribution, p(h|d), indicating how we should update our prior distribution in light of the evidence provided by d. This is given by

  p(h|d) = p(d|h)p(h) ⁄ ∑h′ p(d|h′)p(h′),

  where h′ ranges over the full set of hypotheses under consideration.
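
  A minimal sketch of the same computation with an explicit, non-uniform prior (the hypotheses and the numbers are invented for illustration):

# Bayes's Rule: posterior p(h|d) is p(d|h) * p(h), normalized over all h'.
prior = {"rain": 0.3, "no rain": 0.7}        # p(h)
likelihood = {"rain": 0.9, "no rain": 0.2}   # p(d|h), d = "the pavement is wet"

joint = {h: likelihood[h] * prior[h] for h in prior}  # numerators p(d|h)p(h)
evidence = sum(joint.values())                        # sum over all h'
posterior = {h: joint[h] / evidence for h in prior}
print(posterior)  # {'rain': 0.658..., 'no rain': 0.341...}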

  “especially about the future”: The uncertain origins of this saying are described in detail in Quote Investigator, “It’s Difficult to Make Predictions, Especially About the Future,” http://quoteinvestigator.com/2013/10/20/no-predict/.

  surprising if there were even a New York City: The New Yorker cover is Richard McGuire, “Time Warp,” November 24, 2014. For a fascinating and more detailed analysis of the probable life spans of cities and corporations, see the work of Geoffrey West and Luis Bettencourt—e.g., Bettencourt et al., “Growth, Innovation, Scaling, and the Pace of Life in Cities.”

  a flurry of critical correspondence: For example, see Garrett and Coles, “Bayesian Inductive Inference and the Anthropic Principles” and Buch, “Future Prospects Discussed.”

  a raffle where you come in knowing nothing: The statistician Harold Jeffreys would later suggest, instead of Laplace’s (w+1)⁄(n+2), using rather (w+0.5)⁄(n+1), which results from using an “uninformative” prior rather than the “uniform” prior (Jeffreys, Theory of Probability; Jeffreys, “An Invariant Form for the Prior Probability in Estimation Problems”). One method for defining more informative priors results in predictions of the form (w+w′+1)⁄(n+n′+2), where w′ and n′ are the number of wins and attempts for similar processes in your past experience (for details see Griffiths, Kemp, and Tenenbaum, “Bayesian Models of Cognition”). Using this rule, if you had previously seen 100 lottery drawings with only 10 winning tickets (w = 10, n = 100), your estimate after seeing a single winning draw for this new lottery would be a much more reasonable 12/103 (not far from 10%). Variants on Laplace’s Law are used extensively in computational linguistics, where they provide a way to estimate the probabilities of words that have never been seen before (Chen and Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling”).
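
  The three estimators side by side, reproducing the 12⁄103 figure from this note (the function names are ours):

def laplace(w, n):
    return (w + 1) / (n + 2)          # uniform prior

def jeffreys(w, n):
    return (w + 0.5) / (n + 1)        # Jeffreys's "uninformative" prior

def informed(w, n, w_prev, n_prev):
    return (w + w_prev + 1) / (n + n_prev + 2)  # informed by past experience

# One winning draw (w = 1, n = 1) from a brand-new lottery:
print(laplace(1, 1))                          # 0.666...
print(jeffreys(1, 1))                         # 0.75
print(informed(1, 1, w_prev=10, n_prev=100))  # 0.1165... = 12/103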

  or last for five millennia: For a quantity like a duration, which ranges from 0 to ∞, the uninformative prior on times t is the probability density p(t) ∝ 1/t. Changing the scale—defining a new quantity s that is a multiple of t—doesn’t change the form of this distribution: if s = ct, then p(s) ∝ p(t = s/c) ∝ 1/s. This means that it is scale-invariant. Lots more information on uninformative priors appears in Jeffreys, Theory of Probability, and Jeffreys, “An Invariant Form for the Prior Probability in Estimation Problems.”

  the Copernican Principle emerges: This was shown by Gott, “Future Prospects Discussed,” in responding to Buch, “Future Prospects Discussed.”

  determining the number of tramcars: Jeffreys, Theory of Probability, §4.8. Jeffreys credits mathematician Max Newman for bringing the problem to his attention.

  sought to estimate the number of tanks: This has come to be known as the “German Tank Problem,” and has been documented in a number of sources. See, e.g., Gavyn Davies, “How a Statistical Formula won the War,” the Guardian, July 19, 2006, http://www.theguardian.com/world/2006/jul/20/secondworldwar.tvandradio.

  fruits in an orchard: For instance, the 2002 New Zealand Avocado Growers Association Annual Research Report found that “by April, fruit size profiles were normally distributed and remained so for the remainder of the monitored period.”

  The average population of a town: This figure comes from Clauset, Shalizi, and Newman, “Power-Law Distributions in Empirical Data,” which in turn cites the 2000 US Census.

  can plausibly range over many scales: The general form of a power-law distribution on a quantity t is p(t) ∝ t^−γ, where the value of γ describes how quickly the probability of t decreases as t gets larger. As with the uninformative prior, the form of the distribution doesn’t change if we take s = ct, changing the scale.
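
  A quick numerical check of that scale invariance (the particular values of γ, c, and t below are arbitrary): the ratio of probabilities between one value and its double is the same whether we measure in the original units or in rescaled units s = ct.

# Power law p(t) proportional to t^(-gamma): rescaling the units leaves its shape intact.
gamma, c = 2.0, 7.0
p = lambda t: t ** (-gamma)

t = 3.0
print(p(2 * t) / p(t))          # 0.25, in the original units
print(p(2 * c * t) / p(c * t))  # 0.25, after rescaling s = c * t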

  a domain full of power laws: The observation that wealth is distributed according to a power-law function is credited to Pareto, Cours d’économie politique. Another good discussion of the power-law distributions of populations and incomes is Simon, “On a Class of Skew Distribution Functions.”

  The mean income in America: The mean individual adjusted gross income (AGI), derived from IRS filings, was estimated to be $55,688 for the 2009 tax year, the most recent year for which an estimate was available; see the 2011 working paper “Evaluating the Use of the New Current Population Survey’s Annual Social and Economic Supplement Questions in the Census Bureau Tax Model,” available at https://www.census.gov/content/dam/Census/library/working-papers/2011/demo/2011_SPM_Tax_Model.pdf, which in turn cites data from the US Census Bureau’s 2010 Current Population Survey Annual Social and Economic Supplement.

  two-thirds of the US population make less than the mean income: The cutoff for the top 40% of AGI in 2012 was $47,475, and the cutoff for the top 30% was $63,222, from which we can infer that an AGI of $55,688 lands at approximately the top 33%. See Adrian Dungan, “Individual Income Tax Shares, 2012,” IRS Statistics of Income Bulletin, Spring 2015, available at https://www.irs.gov/pub/irs-soi/soi-a-ints-id1506.pdf.

  the top 1% make almost ten times the mean: The cutoff for the top 1% was an AGI of $434,682 in 2012, and the cutoff for the top 0.01% was $12,104,014. Ibid.

  the process of “preferential attachment”: A good general-audience discussion of the idea of power-law distributions emerging from preferential attachment can be found in Barabási, Linked.

  “‘could go on forever’ in a good way?”: Lerner, The Lichtenberg Figures.

  appropriate prediction strategy is a Multiplicative Rule: All the prediction rules discussed in this section are derived in Griffiths and Tenenbaum, “Optimal Predictions in Everyday Cognition.”

  poems follow something closer to a power-law: Ibid.

  formalized the spread of intervals: Erlang first modeled the rate of phone calls appearing on a network using a Poisson distribution in “The Theory of Probabilities and Telephone Conversations,” and in turn developed the eponymous Erlang distribution for modeling the intervals between arriving calls in “Solution of Some Problems in the Theory of Probabilities of Significance in Automatic Telephone Exchanges.” For more details on Erlang’s life, see Heyde, “Agner Krarup Erlang.”

  odds of which are about 20 to 1: To be precise, exactly 128 of the 2,652 equally likely two-card deals in the eponymous game are a blackjack, so the odds against being dealt one are 2,524 to 128, or roughly 19.7 to 1. To see why this leads to an expectation of playing about 20.7 hands before getting one, we can define our expectation recursively: either we land blackjack for a result of 1, or we don’t (in which case we’re back where we started one hand later). If x is our expectation, x = 1 + (2524/2652)x, where 2524/2652 is our chance of not getting dealt blackjack. Solving for x gives about 20.7.
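
  Both the algebra and a quick simulation bear this out (a sketch; the simulation simply treats each hand as an independent 128-in-2,652 chance):

import random

p_blackjack = 128 / 2652

# Solving x = 1 + (2524/2652) * x gives x = 2652/128, i.e. 1 / p_blackjack.
print(1 / p_blackjack)  # about 20.7 hands on average

def hands_until_blackjack():
    hands = 1
    while random.random() >= p_blackjack:  # no blackjack this hand; deal again
        hands += 1
    return hands

trials = 100_000
print(sum(hands_until_blackjack() for _ in range(trials)) / trials)  # about 20.7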

  known to statisticians as “memoryless”: Technically, the time to the next blackjack follows a geometric distribution (similar to the exponential distribution for a continuous quantity), which is constantly decreasing, rather than the more wing-like Erlang distribution we describe in the main text. However, both can yield memoryless predictions under the right circumstances. If we encounter a particular phenomenon at some random point in its duration, as Gott assumed regarding the Berlin Wall, then the wing-like Erlang gives us memoryless Additive Rule predictions. And if we continuously observe a phenomenon that has a geometric distribution, as in playing a game of blackjack, the same kind of Additive Rule predictions result.
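
  A small sketch of what “memoryless” means here: for the geometric distribution, the expected number of additional hands is the same no matter how many fruitless hands you have already played (the truncation horizon below is just a computational convenience):

p = 128 / 2652  # chance of blackjack on any given hand

def expected_additional_hands(already_played, horizon=5000):
    """E[T - k | T > k] for T geometric with success probability p,
    summed out to a horizon long enough that the tail is negligible."""
    k = already_played
    num = sum((t - k) * (1 - p) ** (t - 1) * p for t in range(k + 1, horizon))
    den = sum((1 - p) ** (t - 1) * p for t in range(k + 1, horizon))
    return num / den

print(expected_additional_hands(0))   # about 20.7
print(expected_additional_hands(50))  # about 20.7: 50 losses bring you no closer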

 
