Big Data: A Revolution That Will Transform How We Live, Work, and Think
[>] Price Microsoft paid for Farecast—From media reports, notably “Secret Farecast Buyer Is Microsoft,” Seattlepi.com, April 17, 2008 (http://blog.seattlepi.com/venture/2008/04/17/secret-farecast-buyer-is-microsoft/?source=mypi).
[>] One way to think about big data—There is a loud and unproductive debate over the origin of the term “big data” and how to perfectly define it. The two words have occasionally appeared in unison for decades. A research report in 2001 by Doug Laney of Gartner set out the “three Vs” of big data (volume, velocity, and variety), which was useful for its time but imperfect.
[>] Astronomy and DNA sequencing—Cukier, “Data, Data Everywhere.”
Billions of shares traded—Rita Nazareth and Julia Leite, “Stock Trading in U.S. Falls to Lowest Level Since 2008,” Bloomberg, August 13, 2012 (http://www.bloomberg.com/news/2012-08-13/stock-trading-in-u-s-hits-lowest-level-since-2008-as-vix-falls.html).
[>] Google’s 24 petabytes per day—Thomas H. Davenport, Paul Barth, and Randy Bean, “How ‘Big Data’ Is Different,” Sloan Review, July 30, 2012, pp. 43–46 (http://sloanreview.mit.edu/themagazine/2012fall/54104/howbigdataisdifferent/).
Facebook stats—Facebook IPO prospectus, “Form S-1 Registration Statement,” U.S. Securities and Exchange Commission, February 1, 2012 (http://sec.gov/Archives/edgar/data/1326801/000119312512034517/d287954ds1.htm).
YouTube stats—Larry Page, “Update from the CEO,” Google, April 2012 (http://investor.google.com/corporate/2012/ceo-letter.html).
Number of tweets—Tomio Geron, “Twitter’s Dick Costolo: Twitter Mobile Ad Revenue Beats Desktop on Some Days,” Forbes, June 6, 2012 (http://www.forbes.com/sites/tomiogeron/2012/06/06/twitters-dick-costolo-mobile-ad-revenue-beats-desktop-on-some-days/).
Information on the amount of data—Martin Hilbert and Priscilla López, “The World’s Technological Capacity to Store, Communicate, and Compute Information,” Science, April 1, 2011, pp. 60–65; Martin Hilbert and Priscilla López, “How to Measure the World’s Technological Capacity to Communicate, Store and Compute Information?” International Journal of Communication, 2012, pp. 1042–55 (http://www.ijoc.org/ojs/index.php/ijoc/article/viewFile/1562/742).
[>] Estimate of the amount of stored information by 2013—Cukier interview with Hilbert, 2012.
[>] Printing press and eight million books; more produced since the founding of Constantinople—Elizabeth L. Eisenstein, The Printing Revolution in Early Modern Europe (Canto/Cambridge University Press, 1993), pp. 13–14.
Peter Norvig’s analogy—From Norvig’s talks based on the paper: A. Halevy, P. Norvig, and F. Pereira, “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009, pp. 8–12 (http://www.computer.org/portal/cms_docs_intelligent/intelligent/homepage/2009/x2exp.pdf). (Note that the title is a play on Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” in which he considers why physics can be nicely expressed in basic math but the social sciences resist such tidy formulas. See E. Wigner, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” Communications on Pure and Applied Mathematics 13, no. 1 (1960), pp. 1–14.) Among Norvig’s talks on the paper is “Peter Norvig—The Unreasonable Effectiveness of Data,” lecture at University of British Columbia, YouTube, September 23, 2010 (http://www.youtube.com/watch?v=yvDCzhbjYWs).
On physical size affecting operative physical law (although not entirely correct), the often-cited reference is J. B. S. Haldane, “On Being the Right Size,” Harper’s Magazine, March 1926 (http://harpers.org/archive/1926/03/on-being-the-right-size/).
Picasso on the Lascaux images—David Whitehouse, “UK Science Shows Cave Art Developed Early,” BBC News Online, October 3, 2001 (http://news.bbc.co.uk/1/hi/sci/tech/1577421.stm).
2. More
[>] Jeff Jonas quotation—Conversation with Jonas, December 2010, Paris.
[>] History of the U.S. census—U.S. Census Bureau, “The Hollerith Machine,” online history (http://www.census.gov/history/www/innovations/technology/the_hollerith_tabulator.html).
[>] Neyman’s contribution—William Kruskal and Frederick Mosteller, “Representative Sampling, IV: The History of the Concept in Statistics, 1895–1939,” International Statistical Review 48 (1980), pp. 169–195, especially pp. 187–188. Neyman’s famous paper is Jerzy Neyman, “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical Society 97, no. 4 (1934), pp. 558–625.
A sample of 1,100 observations is sufficient—Earl Babbie, Practice of Social Research (12th ed. 2010), pp. 204–207.
[>] The cellphone effect—“Estimating the Cellphone Effect,” September 20, 2008 (http://www.fivethirtyeight.com/2008/09/estimating-cellphone-effect-22-points.html); for more on polling biases and other statistical insights see Nate Silver, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t (Penguin, 2012).
[>] Steve Jobs’s gene sequencing—Walter Isaacson, Steve Jobs (Simon and Schuster, 2011), pp. 550–551.
[>] Google Flu Trends predicting to city level—Dugas et al., “Google Flu Trends.”
Etzioni on temporal data—Interview by Cukier, October 2011.
[>] John Kunze quotation—Jonathan Rosenthal, “Special Report: International Banking,” The Economist, May 19, 2012, pp. 7–8.
Sumo match fixing—Mark Duggan and Steven D. Levitt, “Winning Isn’t Everything: Corruption in Sumo Wrestling,” American Economic Review 92 (2002), pp. 1594–1605 (http://pricetheory.uchicago.edu/levitt/Papers/DugganLevitt2002.pdf).
[>] Lytro’s 11 million light rays—From Lytro’s corporate website (http://www.lytro.com).
[>] Replacing sampling in the social sciences—Mike Savage and Roger Burrows, “The Coming Crisis of Empirical Sociology,” Sociology 41 (2007), pp. 885–899.
On analyzing comprehensive data from a mobile phone operator—J. P. Onnela et al., “Structure and Tie Strengths in Mobile Communication Networks,” Proceedings of the National Academy of Sciences of the United States of America (PNAS) 104 (May 2007), pp. 7332–36 (http://nd.edu/~dddas/Papers/PNAS0610245104v1.pdf).
3. Messy
[>] Crosby—Alfred W. Crosby, The Measure of Reality: Quantification and Western Society, 1250–1600 (Cambridge University Press, 1997).
On Kelvin and Bacon quotations—These aphorisms are widely attributed to both men, though the actual expression in their written works is slightly different. In Kelvin, it’s part of a longer quotation on measurement, from his lecture “Electrical Units of Measurement” (1883). For Bacon, it’s considered to be a loose translation from Latin, in Meditationes Sacrae (1597).
[>] Many ways to refer to IBM—DJ Patil, “Data Jujitsu: The Art of Turning Data into Product,” O’Reilly Media, July 2012 (http://oreillynet.com/oreilly/data/radarreports/data-jujitsu.csp?cmp=tw-strata-books-data-products).
[>] 30,000 trades per second on NYSE—Colin Clark, “Improving Speed and Transparency of Market Data,” NYSE EURONEXT blog post, January 9, 2011 (http://exchanges.nyx.com/cclark/improving-speed-and-transparency-market-data).
Idea that “2+2=3.9”—Brian Hopkins and Boris Evelson, “Expand Your Digital Horizon with Big Data,” Forrester, September 30, 2011.
Improvements in algorithms—President’s Council of Advisors on Science and Technology, “Report to the President and Congress, Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology,” December 2010, p. 71 (http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf).
[>] Chess endgame tables—The most comprehensive endgame table publicly available, the Nalimov tableset (named after one of its creators), covers all games for six or fewer chess pieces. Its size exceeds seven terabytes, and compressing the information in it is a major challenge. See E. V. Nalimov, G. McC. Haworth, and E. A. Heinz, “Space-efficient Indexing of Chess Endgame Tables,” ICGA Journal 23, no. 3 (2000), pp. 148–162.
Microsoft and algorithm performance—Michele Banko and Eric Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation,” Microsoft Research, 2001, p. 3 (http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf).
IBM demo, words, and quotation—IBM, “701 Translator,” press release, IBM archives, January 8, 1954 (http://www-03.ibm.com/ibm/history/exhibits/701/701_translator.html). See also John Hutchins, “The First Public Demonstration of Machine Translation: The Georgetown-IBM System, 7th January 1954,” November 2005 (http://www.hutchinsweb.me.uk/GU-IBM-2005.pdf).
IBM Candide—Adam L. Berger et al., “The Candide System for Machine Translation,” Proceedings of the 1994 ARPA Workshop on Human Language Technology, 1994 (http://aclweb.org/anthology-new/H/H94/H94-1100.pdf).
History of machine translation—Yorick Wilks, Machine Translation: Its Scope and Limits (Springer, 2008), p. 107.
[>] Candide’s millions of texts versus Google’s billions of texts—Och interview with Cukier, December 2009.
Google’s corpus of 95 billion sentences—Alex Franz and Thorsten Brants, “All Our N-gram are Belong to You,” Google blog post, August 3, 2006 (http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html).
[>] Brown corpus and Google’s 1 trillion words—Halevy, Norvig, and Pereira, “The Unreasonable Effectiveness of Data.”
Quotation from paper Norvig co-authored—ibid.
[>] BP pipe corrosion and hostile wireless environment—Jaclyn Clarabut, “Operations Making Sense of Corrosion,” BP Magazine, issue 2 (2011) (http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/reports_and_publications/bp_magazine/STAGING/local_assets/pdf/BP_Magazine_2011_issue2_text.pdf). The difficulty of wireless data readings comes from Cukier, “Data, Data Everywhere.” The system is obviously not infallible: a fire at the BP Cherry Point refinery in February 2012 was blamed on a corroded pipe.
[>] Billion Prices Project—From Cukier interview with the project’s co-founders, October 2012. Also, James Surowiecki, “A Billion Prices Now,” The New Yorker, May 30, 2011; data and details can be found on the project’s website (http://bpp.mit.edu/); Annie Lowrey, “Economists’ Programs Are Beating U.S. at Tracking Inflation,” Washington Post, December 25, 2010 (http://www.washingtonpost.com/wp-dyn/content/article/2010/12/25/AR2010122502600.html).
[>] On PriceStats as a check on national statistics—“Official Statistics: Don’t Lie to Me, Argentina,” The Economist, February 25, 2012 (http://www.economist.com/node/21548242).
Number of photos on Flickr—From Flickr website (http://www.flickr.com).
On the challenge to categorize information—See David Weinberger, Everything Is Miscellaneous: The Power of the New Digital Disorder (Times, 2007).
[>] Pat Helland—Pat Helland, “If You Have Too Much Data Then ‘Good Enough’ Is Good Enough,” Communications of the ACM, June 2011, pp. 40, 41. There is a vigorous debate within the database community about the models and concepts best able to meet the needs of big data. Helland represents the camp arguing for a radical break with tools used in the past. Microsoft’s Michael Rys, in “Scalable SQL,” Communications of the ACM, June 2011, p. 48, argues that much-adapted versions of existing tools will work fine.
[>] Visa using Hadoop—Cukier, “Data, Data Everywhere.”
[>] Only 5 percent of information is structured data—Abhishek Mehta, “Big Data: Powering the Next Industrial Revolution,” Tableau Software White Paper, 2011 (http://www.tableausoftware.com/learn/whitepapers/big-data-revolution).
4. Correlation
[>] Linden story as well as “Amazon voice”—Linden interview with Cukier, March 2012.
WSJ on Amazon critics—As cited in James Marcus, Amazonia: Five Years at the Epicenter of the Dot.Com Juggernaut (New Press, 2004), p. 128.
[>] Marcus quotation—Marcus, Amazonia, p. 199.
[>] Recommendations one-third of Amazon’s income—This figure has never been officially confirmed by the company but has been published in numerous analyst reports and articles in the media, including “Building with Big Data: The Data Revolution Is Changing the Landscape of Business,” The Economist, May 26, 2011 (http://www.economist.com/node/18741392/). The figure was also referenced by two former Amazon executives in interviews with Cukier.
Netflix price information—Xavier Amatriain and Justin Basilico, “Netflix Recommendations: Beyond the 5 stars (Part 1),” Netflix blog, April 6, 2012.
[>] “Fooled by Randomness”—Nassim Nicholas Taleb, Fooled by Randomness (Random House, 2008); for more, see Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable (2nd ed., Random House, 2010).
[>] Walmart and Pop-Tarts—Constance L. Hays, “What Wal-Mart Knows About Customers’ Habits,” New York Times, November 14, 2004 (http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html).
[>] Examples of predictive models by FICO, Experian, and Equifax—Scott Thurm, “Next Frontier in Credit Scores: Predicting Personal Behavior,” Wall Street Journal, October 27, 2011 (http://online.wsj.com/article/SB10001424052970203687504576655182086300912.html).
[>] Aviva’s predictive models—Leslie Scism and Mark Maremont, “Insurers Test Data Profiles to Identify Risky Clients,” Wall Street Journal, November 19, 2010 (http://online.wsj.com/article/SB10001424052748704648604575620750998072986.html). See also Leslie Scism and Mark Maremont, “Inside Deloitte’s Life-Insurance Assessment Technology,” Wall Street Journal, November 19, 2010 (http://online.wsj.com/article/SB10001424052748704104104575622531084755588.html). See also Howard Mills, “Analytics: Turning Data into Dollars,” Forward Focus, December 2011 (http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/FSI/US_FSI_Forward%20Focus_Analytics_Turning%20data%20into%20dollars_120711.pdf).
Example of Target and pregnant teenager—Charles Duhigg, “How Companies Learn Your Secrets,” New York Times, February 16, 2012 (http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html). The article is adapted from Duhigg’s book The Power of Habit: Why We Do What We Do in Life and Business (Random House, 2012); Target has stated there are inaccuracies in media accounts of its activities but declines to say what those inaccuracies are. Asked about the matter for this book, a Target spokesperson replied: “The goal is to use guest data to enhance the guest relationship with Target. Our guests want to receive great value, relevant offers, and a superior experience. Like many companies, we use research tools that help us understand guest shopping trends and preferences so that we can give our guests offers and promotions that are relevant to them. We take our responsibility to protect our guests’ trust in us very seriously. One way we do this is by having a comprehensive privacy policy that we share openly on Target.com and by routinely educating our team members on how to secure our guests’ information.”
[>] UPS analytics work—Cukier interviews with Jack Levis, 2012.
[>] Preemies—Based on interviews with McGregor in 2010 and 2012. See also Carolyn McGregor, Christina Catley, Andrew James, and James Padbury, “Next Generation Neonatal Health Informatics with Artemis,” in European Federation for Medical Informatics, User Centred Networked Health Care, ed. A. Moen et al. (IOS Press, 2011), p. 117. Some material comes from Cukier, “Data, Data Everywhere.”
[>] On the correlation between happiness and income—R. Inglehart and H.-D. Klingemann, Genes, Culture and Happiness (MIT Press, 2000).
[>] On measles and health expenses, and on new non-linear tools for correlation analysis—David Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science 334 (2011), pp. 1518–24.
[>] Kahneman—Daniel Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011), pp. 74–75.
[>] Pasteur—For readers interested in Pasteur’s larger influence on how we perceive things, we suggest Bruno Latour, The Pasteurization of France (Harvard University Press, 1993).
Risk of catching rabies—Melanie Di Quinzio and Anne McCarthy, “Rabies Risk Among Travellers,” CMAJ 178, no. 5 (2008), p. 567.
[>] Causality can rarely be proven—The Turing Award–winning computer scientist Judea Pearl has developed a way to formally represent causal dynamics; while not a formal proof, this offers a pragmatic approach to analyzing possible causal connections; see Judea Pearl, Causality: Models, Reasoning and Inference (Cambridge University Press, 2009).
[>] Orange car example—Quentin Hardy, “Bizarre Insights from Big Data,” nytimes.com, March 28, 2012 (http://bits.blogs.nytimes.com/2012/03/28/bizarre-insights-from-big-data/); and Kaggle, “Momchil Georgiev Shares His Chromatic Insight from Don’t Get Kicked,” blog posting, February 2, 2012 (http://blog.kaggle.com/2012/02/02/momchil-georgiev-shares-his-chromatic-insight-from-dont-get-kicked/).
[>] Weight of manhole covers, number of explosions, and height of the blast—Rachel Ehrenberg, “Predicting the Next Deadly Manhole Explosion,” Wired, July 7, 2010 (http://www.wired.com/wiredscience/2010/07/manhole-explosions).
Con Edison working with Columbia University statisticians—This case is described for the lay audience in Cynthia Rudin et al., “21st-Century Data Miners Meet 19th-Century Electrical Cables,” Computer, June 2011, pp. 103–105. Technical descriptions of the work are available through Rudin’s and her collaborators’ academic articles on their websites, in particular Cynthia Rudin et al., “Machine Learning for the New York City Power Grid,” IEEE Transactions on Pattern Analysis and Machine Intelligence 34, no. 2 (2012), pp. 328–345 (http://hdl.handle.net/1721.1/68634).
[>] Messiness of the term “service box”—This list comes from Rudin et al., “21st-Century Data Miners Meet 19th-Century Electrical Cables.”
Rudin quotation—From interview with Cukier, March 2012.
[>] Anderson’s views—Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” Wired, June 2008 (http://www.wired.com/science/discoveries/magazine/16-07/pb_theory/).