Info We Trust

by R J Andrews


  Simple arithmetic and easy-to-draw pictures… looking at data to see what it seems to say.

  JOHN TUKEY, 1977

  Which general-purpose plot was most likely to allow unsuspected but interesting aspects of the data to show themselves?

  HOWARD WAINER, 1997

  John Tukey introduced the world to exploratory data analysis (EDA) in 1977 by likening it to quantitative detective work. When we explore data, we search for clues that reveal. How you explore for clues is up to you. The London Bobby and Texas Ranger each have localized skills. Methods are likewise not identical across data explorers. The health-care coder and financial analyst have different approaches to different data for different reasons. Still, there are strategies we can all consider, whatever our line of inquiry. Data exploration demands a basic ability to produce simple charts from data, and this requires some technological fluency. It does not, however, require that you abandon drawing by hand. A hybrid approach can lead to creative avenues of inquiry. Alternate between data-driven computer charts and hand-sketched ideas you would like to populate with on-screen data.

  The term “forensic” … derives from forensis: of or before the forum where Romans appeared publicly for the disposition of criminal accusations. Evidence, in other words, is information available to all without prejudice.

  ABBY SMITH RUMSEY, 2016

  A means of orchestrating a conversation with yourself. Putting thoughts down allows us to step outside ourselves and tap into our visual system and our ability to see in relation. We thus extend our thinking—distributing it between conception and perception—engaging both simultaneously. We draw not to transcribe ideas from our heads but to generate them in search of greater understanding.

  NICK SOUSANIS, 2015

  Data exploration can be intensely private or embarked upon as a team. In either case, we are not yet trying to communicate anything beyond the investigation. We are also not yet trying to establish any confidence. The detective hunts for clues (footprints, bloodstains, and cigarette ash) while trying to crack the case. We hunt by playing with different forms, data dimensions, and even data sources. Later, we will turn to what the evidence may prove.

  Your first step might be to ask the data some establishing questions. Tell me a little bit about yourself. Where did you come from? Look at the data to see what kind of shape it has arrived in. What is the terrain like? Not all questions stem directly from the dataset. It provides one stock of raw materials to work with; your knowledge of the world provides the other. Exploration, Tukey summarized, “is a creative act.” The sparks of creative exploration fire when you strike your data. Good data sketches deliver answers; great data sketches inspire more questions. Our goal is to find something worth further investigation and further sketching. What can we hope our data will reveal?

  Hypotheses are nets: only he who casts will catch.

  NOVALIS, 1802

  There is no logical path to these laws; only intuition, based upon something like an intellectual love of the objects of experience.

  ALBERT EINSTEIN, 1918

  Let's Compare

  We are comparison-making machines. The world assaults us with more information than we could ever pay attention to. A short walk to your kitchen for a glass of water brings few details of the journey to your attention. You are already habituated to everything you see. But, spot an unexpected silhouette with your peripheral vision and your focus will automatically rocket into high alert. The difference between your environment and your expectations is what counts.

  Every discovery contains an irrational element, or a creative intuition… Scientific discovery is impossible without faith in ideas which are of a purely speculative kind, and sometimes even quite hazy.

  KARL POPPER, 1935

  Deviations capture our attention. Every viewer arrives at a data story with expectations about the world of the data. Sometimes, this prior knowledge provides enough context for the viewer to make sense of a lonely number. But to rely on only this flavor of comparison risks too much. This is why the singular datum is an anachronism. On its own, just one data point has no meaning. Good data stories supply both sides necessary for a comparison. If you are shown two stacks of stuff, you will automatically focus on the void of negative space that separates them. How much more is needed to make both equal? These comparisons can make data stories interesting. So, what makes a superior comparison? Consider two of NASA's early human spaceflight programs: Apollo and Gemini.

  In my research, I have needed the data to test my hypotheses, but the hypotheses themselves often emerged from talking to, listening to, and observing people. … we should be highly skeptical about conclusions derived purely from number crunching.

  HANS ROSLING, 1948–2017

  Context is from Latin con (together) and texere (to weave).

  The Apollo program crew had one more astronaut than Project Gemini. Apollo's Saturn V rocket had about seventeen times more thrust than the Gemini-Titan II.

  The language used to describe these differences illustrates two types of comparisons. For the crew we say one more, or add. For the rocket thrust we say seventeen times more, or multiply.

  Add and subtract comparisons like the crew comparison (a few more, a few less) are easier for us to wrap our heads around. This makes them more useful comparisons. Presented visually, these additive comparisons are appreciated with little effort. You are in fact born with the ability to instantly do addition and subtraction with a small number of objects, years before you know anything about numbers, numerals, or arithmetic.

  Relative to additive comparisons, multiply and divide differences are more difficult. It is hard for us to make sense of them, even visually. Comparisons that are expressed as ratios—a few times more, a few times less—all land too quickly into the mental bin of “well, this is just a lot bigger than that other thing.” What does it mean to have seventeen times more thrust at liftoff? It is just a big number. Without very specialized knowledge, 17x means the same to us as 15x or 19x.

  Subitize means to instantly determine how many objects are in a very small collection, up to four. In studies, newborns have been seen to “count” and perform simple addition and subtraction with small groups of objects.

  The two types of comparisons, additive and multiplicative, hit us differently because multiplication is cognitively more abstract than addition. Addition is natural. Add or take objects away from a group and the size of the group increases or shrinks accordingly. You can see it. Multiplication is a conceptual extension of addition and that makes it harder to see. We interpret the operation of three times four with either “pooling” or “repeated addition.” If solving the problem with pooling, we see three small sets of four objects each, and mass them into one large group of 12. Alternatively, in repeated addition we imagine adding four objects into a box, three times. In either case, we no longer manipulate objects directly. Instead, we are adding a number of sets, which are themselves each represented by a number. Multiplication's number of numbers kicks us, conceptually, into a higher abstraction. Multiplication is one more level away from physical reality. That makes ratio comparisons difficult to comprehend. Seeing the NASA 17x thrust comparison (33 versus 1.9 meganewtons) emphasizes how the difference between the crews is more natural.
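
  The two kinds of comparison can be placed side by side in a few lines of Python. This is only a sketch: the thrust figures are the ones quoted in the text, while the crew sizes (three for Apollo, two for Gemini) and every function and variable name are assumptions added for illustration.

```python
# Crew sizes are assumed for illustration (the text only says "one more").
# Thrust values, in meganewtons, are the figures quoted in the text.
APOLLO_CREW, GEMINI_CREW = 3, 2
SATURN_V_THRUST_MN, TITAN_II_THRUST_MN = 33.0, 1.9


def additive_difference(a, b):
    """How many more: the add-or-subtract comparison."""
    return a - b


def multiplicative_ratio(a, b):
    """How many times more: the multiply-or-divide comparison."""
    return a / b


crew_gap = additive_difference(APOLLO_CREW, GEMINI_CREW)
thrust_ratio = multiplicative_ratio(SATURN_V_THRUST_MN, TITAN_II_THRUST_MN)

print(f"Crew: {crew_gap} more astronaut")
print(f"Thrust: about {thrust_ratio:.0f} times more")
```

  The crew result is a count you can picture directly. The thrust result is a number about numbers, which is the extra level of abstraction described above.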

  If it is not too much bother, could you please remember the number forty-one for me? You see, after the challenge of multiplicative ratios, we must consider one more common comparison struggle. Our experience of time is a strange phenomenon. Carrying numbers in our memory relegates all comparison to the imagination. Now, contrast that original number to twenty-four. You see? Carrying numbers forward in the mind slows thinking, introduces error, and makes comparisons more difficult. There is no reason to memorize; forget that number.

  One of the reasons we have pushed beyond text-riddled data tables and toward visual presentation is to reduce reliance on working memory. Do not shuttle numbers between moments if a comparison can be made in an instant.

  Jacques Bertin extolled the power of the “image” of the mind: a temporal unit of meaningful visual perception. We must learn how to create comparisons within single mental images. Even William Playfair wrote, back in 1801, about the “pains and labour” that memory requires. John Tukey said it best: “learn how to make one picture do.”

  William Cleveland distinguished two types of visual comparisons: juxtaposed and superposed. Juxtaposition places items next to one another. Like numbers in a table, juxtaposed graphics require you to shuttle information between moments. Juxtaposition's spatial gap creates a perceptual gap in time. Superposition closes the gap by layering graphics on top of one another. Together, the layers share the same reference scales, the same context. Inside their common world, from two given elements a third materializes: the negative space of the comparison between them.

  Patterns beyond Compare

  The word behavior suggests how people conduct themselves over time. More generally, behavior is concerned with the way a subject acts. What does it mean for data to behave? It implies a connected performance, often sequenced over time. But we could also consider how low a cantilevered beam bends under a heavy load as a type of behavior. Whether we appraise behavior across the length of a beam, or across a length of time, the anchor is the same: a dimension that provides connective tissue. The third meter of the beam is connected to the second meter, just as the third day is connected to the second day. Behavior has a natural structure. Tomorrow's activities may not have anything to do with what happened today, but we initially sequence days in the order they are experienced. Because of its association with time, we usually position the connective variable on the rightward, horizontal axis. It is the foundation of the image. The performant variable can then be raised vertically against this horizontal foundation, creating the familiar Cartesian plane. This Flatland presents an opportunity for a special kind of visual exploration: the search for familiar patterns.

  Trend: Something practically everybody is interested in showing or knowing or spotting or deploring or forecasting.

  DARRELL HUFF, 1954

  Patternicity is the tendency to perceive meaningful patterns, whether they are actually there or not. We are particularly fond of recognizing faces.

  We are able to distinguish objects because we sense patterns. The mind ingests many pixels of color from the real world. It groups similar nearby pixels into contours and regions, and then tries to match these with the shapes of known objects stored in our memory. This all happens automatically, of course, and our mental storehouse of patterns is refined across our lives. Patternicity allows us to pick a friend out of a crowd from the memorized gait of their walk and skim through pages of text in search of a target word. When we explore data for patterns, our goal is to recognize behavior that can clue us in to what is going on.

  Coup d'œil is a glance that takes in a comprehensive view, literally “stroke of eye” in French.

  Binding is combining different features that will come to be identified as parts of the same contour or region.

  The mark is a wonderfully useful commodity. At its egalitarian best, each data point receives its own mark, most often a tiny circle placed at a precise point. Many marks combine to show a large quantity of data in a very small area. Together, marks display distributions that may reveal outliers, clusters, and asymmetries.

  Scagnostics (scatter plot computer-guided diagnostics) can help categorize shapes by measuring: outlying, skewed, clumpy, sparse, striated, convex, skinny, stringy, monotonic.

  By plotting marks on the rectangular canvas we can highlight some familiar patterns.

  Your mind can already recognize many data patterns, like the happy upward trend. Learning additional patterns will help you better explore data, just as your ability to recognize objects from real life expands as you build out your library of memories. Some patterns are specific to your field. A physician must study many time-series of the electrical activity of the heart in order to rapidly interpret electrocardiogram records. Other patterns reappear across data worlds.

  See below how simple design decisions change the way a pattern appears to us. Presenting the same data with visually different slopes shows us that a spotted pattern may not be inherently interesting. It also reveals that an interesting pattern may be lurking in the data, but not apparent on a given canvas. Once any pattern is recognized, you must help determine whether it is relevant by providing context. Does it agree with or disrupt your prior beliefs? If a strong relationship is expected, then the absence of a pattern may be the most intriguing picture. Or, if a strong relationship is already known, then the outliers that do not conform may hold the real story. Deviation from the expected continues to fascinate.

  Theory and experimentation led to a general principle, that judging the ratio of two positive slopes is optimized by a mid-angle of 45°. … at best we can only get a rough visual estimate of rate of change from slope judgments, even in the best of circumstances.…

  CLEVELAND, McGILL, AND McGILL, 1988

  The aspect ratio, canvas height-to-width, can make the rise of the same data appear severe, or not.

  Similar effects can be produced with an increase or decrease of the empty space that surrounds the data.

  Whether the slope is above or below the canvas diagonal is seen (especially when it is 45°), but it is difficult to be more precise.
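
  The effect of aspect ratio and padding on a drawn slope can be worked out with a little trigonometry. The sketch below is an illustration only: the function name, the canvas sizes, and the example series (a rise of 20 units over 10 time steps) are all hypothetical, and the model assumes a plain linear mapping from data units to canvas units.

```python
import math


def apparent_angle(dx, dy, x_span, y_span, width, height):
    """Angle, in degrees, at which a data segment is drawn on the canvas.

    dx, dy: change in the data along each axis.
    x_span, y_span: full range covered by each axis (data plus any padding).
    width, height: canvas size in any consistent unit.
    """
    run = dx / x_span * width     # horizontal distance on the canvas
    rise = dy / y_span * height   # vertical distance on the canvas
    return math.degrees(math.atan2(rise, run))


# Hypothetical series: rises 20 units over 10 time steps.
dx, dy = 10.0, 20.0

# Same data, three canvas shapes, axes fitted tightly to the data.
wide = apparent_angle(dx, dy, x_span=10, y_span=20, width=8, height=2)
tall = apparent_angle(dx, dy, x_span=10, y_span=20, width=2, height=8)
square = apparent_angle(dx, dy, x_span=10, y_span=20, width=4, height=4)

# Same square canvas, but the vertical axis now spans four times the data range.
padded = apparent_angle(dx, dy, x_span=10, y_span=80, width=4, height=4)

for label, angle in [("wide", wide), ("tall", tall),
                     ("square", square), ("padded", padded)]:
    print(f"{label:>6}: {angle:.1f} degrees")
```

  With these made-up sizes the square, tightly fitted canvas draws the rise at 45°, the wide canvas flattens it, the tall canvas steepens it, and extra vertical padding flattens it just as the wide canvas does.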

  Throughout this book, the mean (average) is barely mentioned. It is a potentially dangerous summary that flies against the spirit of looking at all the patterns the data has to offer. The mean twists the impact of outliers, variation, aggregation, and sample size. It is guilty of aiding and abetting many numerical paradoxes and traps we must be on the lookout for. Summaries, abstractions, and simplifications threaten to bury the unexpected. See how the mean misrepresents the actual data in each of the examples below.

  Each of the four datasets yields the same standard output from a typical regression program, namely: number of observations, mean of x's, mean of y's, regression coefficient (b1) of y on x, equation of regression line, sum of squares, regression sum of squares, residual sum of squares of y, estimated standard error of b1, and multiple R².

  F.J. ANSCOMBE, 1973
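
  Anscombe's point can be checked numerically. The sketch below uses the four datasets as they are commonly reproduced from his 1973 paper; the values are transcribed here as an assumption and are worth checking against the original table. The helper name `slope` is hypothetical, and only three of the summaries he lists are computed.

```python
from statistics import mean

# Anscombe's quartet as commonly reproduced (assumed transcription).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                5.25, 12.50, 5.56, 7.91, 6.89]),
}


def slope(xs, ys):
    """Least-squares regression coefficient (b1) of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


summaries = {
    name: (mean(xs), mean(ys), slope(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, b1) in summaries.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, b1 = {b1:.2f}")
```

  The printed summaries agree to two decimal places across all four datasets, which is why only a plot of the individual points can tell them apart.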

  Summaries, graphic and otherwise, are not the enemy. Geographic regionalization and best-fit curves are essential to helping us probe data. But, because summaries conceal by design, we risk error if we rely entirely on them. Take a look at the most granular level of data available. Otherwise, you might miss a clue that sparks further creative exploration.

  The median is the middle term in a series ordered by magnitude. The outlier impacts the mean, but not the median, in the following two series:

  2, 4, 6, 8, 16, 17, 32

  2, 4, 6, 8, 16, 17, 92

  Instead of the mean, consider the median. It is called a resistant summary measure because it is not impacted by outliers as quickly as the mean can be. Also, you can point to the median in the dataset and say there it is. The median ushers us toward a much more honest and direct visual description of the data.
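
  The two series from the median note can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# The two series from the text; they differ only in the last value.
series_a = [2, 4, 6, 8, 16, 17, 32]
series_b = [2, 4, 6, 8, 16, 17, 92]

for label, series in [("a", series_a), ("b", series_b)]:
    print(f"series {label}: median = {median(series)}, "
          f"mean = {mean(series):.2f}")

# series a: median = 8, mean = 12.14
# series b: median = 8, mean = 20.71
```

  The outlier drags the mean from about 12 to about 21, while the median stays at 8, a value that actually appears in both series.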

  Drawing with data is an invaluable tool to discover what is unique about the numbers at hand. It helps to reveal new possible analyses to perform: Instead of being overwhelmed by the size of a dataset and by millions of numbers, we focus only on their nature, their organization, and doing so often opens new opportunities originating from this vantage point.

  GIORGIA LUPI, 2017

  It's been a few pages since we last left Michelangelo. Recall why the artist sketches, and why we explore data: to familiarize yourself with the subject, to experiment, and to find interesting things you never expected. We sketch to discover and develop a vision for what we are going to say. Exploring data visually takes us to new understandings economically. Simple data sketches, comparisons and patterns, are the first steps into a universe of exploration.

  The greatest value of a picture is when it forces us to notice what we never expected.

  JOHN TUKEY, 1977

  Remember how we originally differentiated information from data. Information puts data into forms that are readable to humans. Data often does not display interesting patterns out of the box. We cannot always pick out visual clues based on the first plot of marks. Perhaps too many dots overlap in the corner of a plot, the population of a city dwarfs its associated landmass, or the network is a tangled hairball. Data sketching pursues new, weird, and better forms that let us take a more meaningful look.

  The eye cannot look on similar forms without involuntarily as it were comparing their magnitudes. So that what in the usual mode was attended with some difficulty, becomes not only easy, but as it were unavoidable.

 
